Continuing from part 1 of this series (here), we keep working with StackOverflow's data. For any serious use, the resulting models are not very good: they poorly explain the high earners in the survey. I did see that providing a dataset with only earners of $200k or less resulted in a much cleaner model. However, I decided that there were too many high earners to ignore. Still, we can get some useful information from the models that we already have. Let's focus on what we can glean from these bozos!
Want to know more about each variable generally? Well, great! The standard metric you'll see is variable importance. Many, if not most, models come with their own built-in variable importance metrics. Random forests have a few choices, mainly mean decrease in accuracy (MDA: when you shuffle the values of one variable, how much accuracy is lost overall) and mean decrease in Gini impurity (how much "impurity" is lowered when a given variable is used for splits in the forest's underlying trees; remember, a random forest uses many CARTs). I suggest reading this post to get a better idea of these metrics. Permutation importance, however, is a model-agnostic choice. I use the iml package in R for machine learning interpretability problems. Much like the MDA metric above, permutation shuffles the values in each column and measures how much error is added by doing so. If a variable is very important, scrambling its values will strongly change the predictions. Let's compare these buggers on our XGBoost model.
The format is slightly different, as Gini uses particular levels of categorical variables. This can be quite informative, but also skips over how important the entire variable is. If you condense the Gini importance down, it appears that the same variables come out as the most important: Country, YearsCoding, FormalEducation. After that, it seems there's some mild disagreement.
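To make the shuffling idea concrete, here is a minimal Python sketch of permutation importance. It is not the iml code behind the comparison above: `toy_model`, the three anonymous columns, and the use of mean absolute error are all assumptions made for illustration, and importance is reported as the raw increase in error rather than whatever scale a given package uses.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Per column, the average increase in error after shuffling that column."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving every other column untouched.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            error = mean_absolute_error(y, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: column 0 matters a lot,
# column 1 a little, and column 2 not at all.
def toy_model(row):
    return 50.0 * row[0] + 2.0 * row[1]


data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, y)
print(importances)
```

Because the procedure only needs a `predict` function, the same loop works for any model, which is what makes it model agnostic.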
Another global diagnostic is partial dependence. Take a variable of interest, say FormalEducation, and force every observation to take a given unique value of FormalEducation. What you get is a distribution of predictions. For categorical variables like this one, you'll get a boxplot of predictions for each unique value of FormalEducation. This tells us how different values raise or lower predictions in the big picture. In our case,
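The forcing step can be sketched in a few lines of Python. This is an illustration only: `toy_model`, the education level labels, and the four rows are made up, and the real analysis would use the fitted model and the survey data instead.

```python
import statistics


def partial_dependence(predict, rows, column, values):
    """For each candidate value, force `column` to it in every row and collect predictions."""
    result = {}
    for value in values:
        forced_rows = [{**row, column: value} for row in rows]
        result[value] = [predict(r) for r in forced_rows]
    return result


# Hypothetical model and level labels, invented for this example.
EDUCATION_BONUS = {"None": 0.0, "Bachelor": 10.0, "Master": 15.0}


def toy_model(row):
    return 40.0 + 2.0 * row["YearsCoding"] + EDUCATION_BONUS[row["FormalEducation"]]


rows = [
    {"YearsCoding": 1, "FormalEducation": "None"},
    {"YearsCoding": 5, "FormalEducation": "Bachelor"},
    {"YearsCoding": 10, "FormalEducation": "Master"},
    {"YearsCoding": 20, "FormalEducation": "Bachelor"},
]

pd_education = partial_dependence(toy_model, rows, "FormalEducation", list(EDUCATION_BONUS))

# Each list of predictions is what one box in the boxplot would summarise.
summary = {value: statistics.median(preds) for value, preds in pd_education.items()}
print(summary)
```

Every level gets one prediction per respondent, so the spread within a level comes from everything else about those respondents.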
Another diagnostic is to use the H-statistic to measure how interactions between variables affect predictions. I find this very interesting, but I've not used it in practice and do not have much experience with it. Without passing a specific variable to check, you'll get an interaction signal for every variable: a general measure of how strongly each variable interacts with all of the others. You can also focus on the interactions of a given variable with each of the others individually. This is done by passing the name of a single variable. I won't do an example of this because the size of the dataset and the exponential nature of the procedure break my laptop.
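For a feel of what the statistic measures, here is a small Python sketch of the pairwise version of Friedman's H-statistic on toy data. It is a simplified illustration under my own assumptions, not how iml computes it: the two toy models and the random data are hypothetical, and partial dependence is evaluated at every data point by brute force.

```python
import math
import random


def mean_prediction(predict, rows, overrides):
    """Average prediction when the columns in `overrides` are forced to fixed values."""
    total = 0.0
    for row in rows:
        modified = list(row)
        for col, value in overrides.items():
            modified[col] = value
        total += predict(modified)
    return total / len(rows)


def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def h_statistic(predict, rows, j, k):
    """Share of the joint effect of columns j and k not explained by their separate effects."""
    pd_j = centered([mean_prediction(predict, rows, {j: r[j]}) for r in rows])
    pd_k = centered([mean_prediction(predict, rows, {k: r[k]}) for r in rows])
    pd_jk = centered([mean_prediction(predict, rows, {j: r[j], k: r[k]}) for r in rows])
    numerator = sum((jk - a - b) ** 2 for jk, a, b in zip(pd_jk, pd_j, pd_k))
    denominator = sum(jk ** 2 for jk in pd_jk)
    if denominator == 0.0:
        return 0.0  # the model ignores both columns entirely
    return math.sqrt(numerator / denominator)


# Hypothetical models: one purely additive, one with a pure interaction.
def additive_model(row):
    return row[0] + 2.0 * row[1]


def interacting_model(row):
    return row[0] * row[1]


data_rng = random.Random(0)
rows = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(60)]

h_additive = h_statistic(additive_model, rows, 0, 1)
h_interacting = h_statistic(interacting_model, rows, 0, 1)
print(h_additive, h_interacting)
```

Even this toy version needs one pass over the whole dataset for every data point and every pair of variables, which gives a sense of why the full procedure gets heavy on a large dataset with many columns.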
Simpler models like regression trees and linear regression have highly interpretable outputs. This is one of their greatest strengths, often bought by trading away some predictive power. Because of this, surrogate models are helpful simplifiers of behemoth "black box" models. A surrogate can simply be a simpler model compared against your black box. If there are very few differences in predictions, the simpler one can possibly straight-up replace the more complex one. To be a true surrogate, though, the simpler model is trained on top of the complex one, using the complex model's predictions as the surrogate's y-variable or outcome, in order to explain those predictions. I inherently start with simpler models and compare them with more complex ones along the way, so I don't find myself ever using surrogates outright. However, iml offers a very easy way to do this with the TreeSurrogate function.
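The "train on the black box's predictions" idea looks roughly like this in Python. To stay self-contained, the surrogate here is a one-split regression tree (a stump) written by hand, which is far cruder than what TreeSurrogate builds; `black_box` and the random data are hypothetical stand-ins for a fitted complex model and its inputs.

```python
import random


def fit_stump(rows, targets):
    """Find the single (column, threshold) split that minimises squared error."""
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({r[col] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[col] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[col] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = sum((t - left_mean) ** 2 for t in left)
            sse += sum((t - right_mean) ** 2 for t in right)
            if best is None or sse < best["sse"]:
                best = {"sse": sse, "column": col, "threshold": threshold,
                        "left_mean": left_mean, "right_mean": right_mean}
    return best


def stump_predict(stump, row):
    if row[stump["column"]] <= stump["threshold"]:
        return stump["left_mean"]
    return stump["right_mean"]


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical black box: mostly a jump on column 0, plus a small interaction.
def black_box(row):
    jump = 100.0 if row[0] > 0.5 else 40.0
    return jump + 3.0 * row[1] * row[2]


data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(150)]

# The surrogate's outcome is the black box's prediction, not the true label.
black_box_predictions = [black_box(r) for r in rows]
surrogate = fit_stump(rows, black_box_predictions)
fidelity = r_squared(black_box_predictions,
                     [stump_predict(surrogate, r) for r in rows])
print(surrogate["column"], surrogate["threshold"], fidelity)
```

The R-squared between the two sets of predictions tells you how much of the black box's behavior the simple model captures; when it is low, the surrogate's tidy explanation should not be trusted.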
Ever get asked why a given prediction is the way it is? Someone ever just walk right up to your face and just "wah"? Well, it can be quite important to know how to handle this. Machine learning methods run into big issues of ethics if you don't inspect your models carefully. Models can have racial biases due to imbalances in the dataset, make illegal choices based on gender, or make predictions with data that wouldn't exist at the time of prediction. So, whiners can be on the side of ethics. One great method for seeing what your model is doing is LIME. LIME fits a linear model around a single observation under the assumption that, even in a complex model, once you account for deep interactions you'll eventually reach a relationship that is locally linear. For instance, if you live in the US, know R, SQL, Python, and Tableau, have a Bachelor's, and are 28 years old, years of experience should be linearly related to salary (even if that relationship is flat). Here are the first three training points. The effects show how much the top variables influence a given prediction. In this case, different ages, experience, and ethnicities change the prediction. Being from the US added over $60,000 to the third respondent's prediction. One thing to keep in mind is that these don't give an intercept. You can't just add these effects up and get a meaningful prediction. They're missing a baseline.
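The core of the local linear idea can be sketched in plain Python: perturb one observation, weight the perturbed points by how close they stay to it, and fit a weighted linear regression to the black box's predictions. This is a stripped-down illustration, not the LIME algorithm as implemented in any package (real implementations add feature selection and handle categorical variables); `black_box`, the point being explained, and the kernel settings are all assumptions.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def local_linear_explanation(predict, point, n_samples=500, scale=0.5,
                             kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model to predictions around `point`."""
    rng = random.Random(seed)
    design, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [p + rng.gauss(0.0, scale) for p in point]
        dist_sq = sum((zi - pi) ** 2 for zi, pi in zip(z, point))
        weights.append(math.exp(-dist_sq / kernel_width ** 2))
        design.append([1.0] + z)  # leading 1.0 is the intercept column
        targets.append(predict(z))
    size = len(point) + 1
    # Weighted normal equations: (X' W X) beta = X' W y
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, design))
             for j in range(size)] for i in range(size)]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, design, targets))
            for i in range(size)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]


# Hypothetical black box with an interaction between columns 0 and 1.
def black_box(row):
    return row[0] * row[1] + 3.0 * row[2]


intercept, slopes = local_linear_explanation(black_box, [2.0, 1.0, 0.0])
print(intercept, slopes)
```

Near the point (2, 1, 0) the interaction flattens out into roughly one unit per step in column 0 and two units per step in column 1, which is the kind of local straight-line story LIME reports. The fitted intercept is the baseline that a table of effects alone leaves out.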
The way I like to use this approach is to aggregate across all observations. That way we can see how often an effect occurs, by how much, and in which direction. Getting a positive or negative direction is something that variable importance metrics don't give, so you can be misled about how something is important. Thus, if we aggregate across a subset of the data (for computation time), we can get a good sample of which variables are important and in what way. I look at the effects in two different ways: as an average and as total lift. The average effect just tells us the effect size when this variable matters to a respondent. Lift also takes into account how often this variable mattered and occurred. If something has a huge effect, but only happened once in the dataset, it might not be worth talking about. On the other hand, it can be useful to understand rare situations and how to use them.
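As a rough Python sketch of that aggregation, assume each respondent's explanation is a mapping from a variable to its local effect, listing only the variables that mattered for that respondent. "Total lift" is taken here to be the sum of a variable's effects across respondents, which is an assumption since the text does not spell out the formula; the feature names and numbers are invented.

```python
from collections import defaultdict


def summarise_effects(explanations):
    """Aggregate per-respondent local effects into count, average, and total lift."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for explanation in explanations:
        for feature, effect in explanation.items():
            totals[feature] += effect
            counts[feature] += 1
    return {
        feature: {
            "count": counts[feature],
            "average": totals[feature] / counts[feature],
            # Assumed definition: lift = sum of effects = average * count.
            "total_lift": totals[feature],
        }
        for feature in totals
    }


# Made-up local explanations for three respondents.
explanations = [
    {"common_feature": 6.0, "rare_feature": 12.0},
    {"common_feature": 4.0},
    {"common_feature": 5.0},
]

summary = summarise_effects(explanations)
print(summary)
```

In this toy case the rare feature has the larger average effect but the smaller total lift, which is exactly the situation where looking at both numbers pays off.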
Given everything we've seen, here's a summary. Being from the US gives huge bumps to salary. Of course, being a non-student seems to help, as it is more likely to indicate full-time employment. Folks in the prime of their careers at large companies also make bank comparatively. Coding for 5 or fewer years, especially professionally, is detrimental to salary. Lastly, we see that white people get a noticeable boost. Since they likely make up an extremely large proportion of all people in computer fields (remember, this is also a subset who uses StackOverflow), that will exacerbate the distribution further. This either picks up on a bias for Western countries to pay more for these folks or some level of systemic racism. Note that gender doesn't show up, probably because the data is so heavily imbalanced towards males: 86% of the training data is male. None of this is at all surprising. What is surprising is the lack of particular competencies. We see almost no languages used, no communication platforms, no databases, no frameworks, little notice of education, and so on.
My conclusion is that there are very large interactions between the different careers shoved into this data and the countries involved. There are also so many columns to work with that finding the right parameters to learn in this large, complex space would be difficult. I'd recommend cutting this down to just respondents in certain fields: databases, web development, data science. These all require different skills, but so many folks in these arenas carry partial skillsets that would be fawned over in another field. However, I'd have expected something related to skills and education to pop even with these complexities. I witness so many people without degrees lead extremely successful careers. Formal education seems less important as I get older. The only clue that experience matters is that we see general and professional coding years tie in with salary, but we get nothing more detailed than that. It seems to me that either our modeling procedures get muddled by the multiple fields conflicting in this high-dimensional data or that particular languages really don't have any effect on salary. Particular languages are useful to know in order to match better with a team's needs, but there are probably few to none that are so difficult and in such high demand that they will skyrocket your salary.
I hope that this analysis was helpful. I find that being able to explain your most elegant models is a highly necessary but poorly taught skill. Until recently, it felt like the field focused more on predictive accuracy than on why certain predictions were being made or how someone could use the model for strategic purposes. Kaggle is a big reason for this, as it has taught folks to spend all their efforts on making the most accurate predictive model at all costs. Please keep these procedures in mind, because without them, models exploit systemic biases: poor facial recognition when little of the data is Black, offering loans with poor terms simply because of someone's race, and stalking innocent people for potential crimes simply because of where they live.
Give this post a read for some more details on these procedures and how to do them. I regularly use only a few of these procedures, as my datasets have very similar outcomes and patterns.