As a student of data science, I recently learned how to model variable interactions using Ordinary Least Squares (OLS) linear regression. It struck me as strange that the common advice to avoid the Dummy Variable Trap when analyzing categorical variables is to simply drop the first column based on the alpha-numeric category labels.
My intuition was that it must matter to some degree which column we choose to drop. And if it does matter, dropping a column because its label comes first seems very arbitrary and not especially scientific.
I found that while there are plenty of web resources describing why we drop a column in this scenario, few attempted to address the question of which to choose. So, to satisfy my curiosity and strengthen my understanding of OLS linear regression, I created some experimental data and tested the results of dropping different columns. My goal was to determine:
- Whether it mattered which we dropped (my hypothesis was Yes); and
- if it does matter, which factors data scientists should consider when choosing an appropriate column.
For those of you who want to get right to the point, my analysis revealed:
While the column dropped does not appear to affect an OLS linear regression model’s performance, it can have a significant impact on the interpretability of the model’s coefficients.
Data scientists preparing an Ordinary Least Squares multiple linear regression model should carefully consider which columns they drop from each category if their goals for analysis include:
- Ranking standardized coefficients to infer relative magnitude or importance of predictors’ impacts on the target variable (e.g. “Does square footage or the number of bedrooms add more to a home’s sale price?”)
- An accurate understanding of how a unit change of a predictor variable impacts the target (e.g. “What is the average increase in home sale price for each square foot of living space added?”)
- Explaining either of the above results to non-technical stakeholders.
Arbitrarily dropping the first column without considering what it represents can make it difficult to interpret the model coefficients not only with accuracy, but also in a way that will be intuitive for non-technical stakeholders.
Instead of relying on the convention of dropping the first column, data scientists should consider dropping columns that represent logical baseline reference points, from which the model will assume all included predictor variables to deviate. In some cases, a minimum value may be appropriate, in other cases an average value may be appropriate, and in still others the appropriate reference category may depend entirely on the goal of the analysis.
Below, I first provide some background, then outline the steps I took to come to these conclusions. The full details of my approach and analysis can be found in my GitHub repo.
What is one-hot encoding?
When preparing data for a linear regression model, it is necessary to dummy, or one-hot encode (OHE), categorical variables into separate columns to represent the non-numerical values as numbers.
A categorical variable which represents gender, for instance, will be transformed from a single column into multiple one-hot encoded columns, so named since a value of 1 represents membership in that category with the rest of the category columns being filled with zeros.
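As a minimal sketch of what this looks like in practice, the snippet below one-hot encodes a toy gender column with pandas. The tiny DataFrame is made up for illustration; only the category names come from the example above.

```python
import pandas as pd

# Hypothetical toy data; the category names follow the gender example.
df = pd.DataFrame({"Gender": ["Female", "Male", "NonBinary", "Female"]})

# One column per category; dtype=int gives 0/1 values instead of booleans.
ohe = pd.get_dummies(df["Gender"], dtype=int)
print(ohe)
```

Each row has exactly one 1, in the column for that row's category, and 0 everywhere else.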
Why drop any columns at all?
When learning about linear regression, students of data science are taught to drop one column from each categorical variable group to act as the reference category and to avoid the “Dummy Variable Trap”: perfect multicollinearity between the predictors.
In our example, we could drop Female, Male, or NonBinary and leave the other two columns in our model.
As this article illustrates, it would not be possible to solve the OLS problem with the closed-form linear algebra solution if we didn’t drop a column, since the matrix we need to invert would be singular. Both the scikit-learn and statsmodels libraries use a different algorithm to solve OLS, so technically they can generate models if we don’t drop a column from each category, but it’s still considered best practice to do so. In my experiment, I included models with no columns dropped to illustrate the strange results.
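To make the singular-matrix point concrete, here is a small sketch using NumPy on a made-up design matrix (five rows, three categories). It is not taken from my experiment; it only shows that keeping every one-hot column alongside an intercept costs the matrix a rank.

```python
import numpy as np

# Hypothetical one-hot block for three categories (e.g. Female, Male, NonBinary).
ohe = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)
intercept = np.ones((ohe.shape[0], 1))

X_full = np.hstack([intercept, ohe])            # no column dropped: 4 columns
X_dropped = np.hstack([intercept, ohe[:, 1:]])  # first category dropped: 3 columns

# The one-hot columns add up to the intercept column, so X_full is rank deficient.
rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print(rank_full, X_full.shape[1])        # rank is lower than the column count
print(rank_dropped, X_dropped.shape[1])  # rank equals the column count
```

When the rank is lower than the number of columns, the matrix product that the closed-form solution needs to invert has no inverse, which is the Dummy Variable Trap in linear algebra terms.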
If the dataset contains multiple categorical variables, one column should be dropped from each group of resulting OHE columns. Throughout this article, I will discuss dropping ‘a column,’ which should be understood to mean one column from each categorical variable’s group.
Note that it’s necessary to drop one column from each category in an OLS linear regression model, but doing so may not be necessary (and may even be ill-advised) with other model types. The conclusions reached here are applicable to OLS linear regression models.
Arbitrarily dropping “first”
Python libraries such as pandas and scikit-learn have parameters built into their one-hot encoding methods which allow us to drop a column from each categorical group. A common approach is to drop first, meaning drop whichever column represents the category value name that comes first alpha-numerically in the set. In our gender example, the first column would be Female since F comes before M and N.
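For reference, this is roughly what the two options look like with pandas. The toy data is hypothetical, and picking Male as the alternative reference is only for illustration.

```python
import pandas as pd

# Hypothetical toy data using the gender example.
df = pd.DataFrame({"Gender": ["Male", "Female", "NonBinary", "Female"]})

# Convention: drop whichever label sorts first (Female here).
dropped_first = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)

# Alternative: encode every category, then drop a chosen reference column by name.
all_columns = pd.get_dummies(df, columns=["Gender"], dtype=int)
dropped_chosen = all_columns.drop(columns=["Gender_Male"])

print(list(dropped_first.columns))
print(list(dropped_chosen.columns))
```

The second pattern takes one extra line, but it makes the reference category an explicit decision rather than a side effect of label spelling.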
But is it really the best way to choose a variable to drop in all situations? In the following sections, I’ll walk through how I created data to test this, and concluded that dropping the first column is not always the best option.
Creating the test data set
I created a test data set of n=20,000 based on home characteristics and sales, where my OLS linear regression model would be trying to predict a home’s eventual sale price.
First, I generated the independent (predictor) variables:
- Square Feet of Living Space: A randomly generated continuous variable following a normal distribution, with a mean of ~2,800. To generate price, I applied a flat coefficient of $100 per square foot.
- Zip Code: A categorical variable with a large number of categories (70). I created multipliers to control how much each zip code affected Price using random numbers from a normal distribution, and randomly assigned them to fairly even groups of houses. The ‘first’ Zip Code category of 30000 was set up to detract most from average Price so that it represented either the minimum contribution, or the maximum detraction, depending on your perspective. The average Zip Code category (30066) contributed little to Price, and the highest added the most.
- Condition: A categorical variable with a small number of categories (5): Poor, Below Average, Average, Above Average, and Excellent. I engineered the multipliers so that Average Condition added nothing to Price; Poor and Below Average detracted; and Above Average and Excellent increased. Instead of splitting the houses into 5 fairly even groups and assigning Condition (as I did with Zip Codes), 70% of houses were assigned to the Average condition category, 10% of houses were each assigned to Below and Above average, and 5% of houses were each assigned to Poor and Excellent.
- Day of the Week (DOTW): A capricious variable that contributes to the price without any particular pattern (I just made up the coefficients for each day). I tested excluding this from my model to reduce the variability in the target that the model should be able to explain, to make it somewhat more realistic.
With my Sq Ft values randomly generated, and houses assigned to categories as described above, I applied the multipliers to generate Price. I used a starting baseline of $100,000 (which became the expected y-intercept), and applied categorical multipliers to a constant of $50,000.
I now had a target variable, Price, which had been generated based on known coefficients of each predictor variable.
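The generation process can be sketched along these lines. This is a simplified stand-in rather than my actual code: it leaves out Zip Code and Day of the Week, the standard deviation of 500 for square footage is an assumption, and the Condition multipliers are back-calculated from the coefficients reported later in this article, so treat them as approximations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Square feet: normal with a mean of 2,800; the spread of 500 is an assumption.
sqft = rng.normal(loc=2_800, scale=500, size=n)

# Condition shares as described above; multipliers are approximate (see lead-in).
conditions = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]
shares = [0.05, 0.10, 0.70, 0.10, 0.05]
multipliers = {"Poor": -1.32, "Below Average": -0.32, "Average": 0.0,
               "Above Average": 0.10, "Excellent": 0.30}
condition = rng.choice(conditions, size=n, p=shares)

BASELINE = 100_000      # starting price, the expected y-intercept
PER_SQFT = 100          # flat dollars per square foot
CATEGORY_UNIT = 50_000  # constant that the categorical multipliers scale

price = (BASELINE
         + PER_SQFT * sqft
         + CATEGORY_UNIT * pd.Series(condition).map(multipliers).to_numpy())

homes = pd.DataFrame({"SqFt": sqft, "Condition": condition, "Price": price})
print(homes.head())
```

Because Price is built from known numbers, any fitted coefficient can be checked against the value that produced it.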
I noted the expected coefficients for each predictor category in original units, as well as the coefficient in standardized units (standard deviations) for the continuous variable, Square Feet. This allowed me to compare the results from each model against expected coefficients.
My primary question was whether the category column dropped from the model (i.e. used as the model’s reference point) would affect the model’s results. I wanted to test dropping the first column, since it’s the common convention, and as the alternative I wanted to drop a column representing the average category: the category in which mean home Price most closely matched the population mean.
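One way to pick the average category is sketched below. The helper name `closest_to_mean_category` and the toy prices are hypothetical; the idea is simply to compare each category's mean target value with the overall mean.

```python
import pandas as pd

def closest_to_mean_category(df, category_col, target_col):
    """Return the category whose mean target is nearest the overall mean."""
    overall_mean = df[target_col].mean()
    group_means = df.groupby(category_col)[target_col].mean()
    return (group_means - overall_mean).abs().idxmin()

# Hypothetical toy data, not the experimental data set.
homes = pd.DataFrame({
    "Condition": ["Poor", "Average", "Average", "Excellent", "Average"],
    "Price": [300_000, 380_000, 400_000, 450_000, 390_000],
})

reference = closest_to_mean_category(homes, "Condition", "Price")
print(reference)
```

The same helper works with a median instead of a mean if the target is heavily skewed; that swap would be a judgment call, not something I tested here.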
There were three different types of results I was interested in:
- Performance — R-squared and RMSE for train and test
- Interpretability — Accurate ranking of standardized coefficients
- Interpretability — Accurate AND intuitive original unit coefficients
So I needed at minimum:
- Two different models to compare dropping the first column versus the average
- Another two models to compare standardized coefficients versus coefficients in original units
I decided to also run models where no categorical columns were dropped just to see what happened. And finally, I varied whether I included the DOTW variable in the model or not, where including it should allow the model to predict 100% of the target’s variability, and excluding it would introduce some errors and should be slightly more realistic.
Ultimately, I ended up iterating through 12 models with these different parameters, and then reviewed the results.
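The count of 12 follows from crossing the three choices described above. The sketch below shows that grid; the dictionary keys are names I made up for illustration, and the ordering (DOTW included first) is an assumption chosen to match the model numbering used later.

```python
from itertools import product

drop_options = ["first", "average", "none"]  # which category column is dropped
standardize_options = [True, False]          # standardized vs. original units
include_dotw_options = [True, False]         # with or without Day of the Week

configs = [
    {"drop": drop, "standardize": standardize, "include_dotw": include_dotw}
    for include_dotw, standardize, drop in product(
        include_dotw_options, standardize_options, drop_options
    )
]

print(len(configs))
for config in configs:
    print(config)
```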
Model Interpretation and Conclusions
To evaluate performance, I used a train-test split and generated the R-squared, Residual Sum of Squares (RSS), and Root Mean Squared Error (RMSE) for both train and test.
To evaluate interpretability, I compared the test model coefficients to expected, reviewed how test models differed from each other, and considered how accurate my natural conclusions would be.
Being able to predict the target given the independent variables is one of the key measures of success for a linear regression model.
In the first 6 models, I included the ‘day of the week’ variable, and in the last 6 I left it out, so it would act as an omitted variable whose effect the model could not explain.
Although other aspects of how the data was processed were varied, such as whether the data was standardized and which category was dropped, these factors do not have an appreciable effect on R-squared, Root Mean Squared Error, or Residual Sum of Squares (not shown above for brevity). In other words, the first 6 models look the same, and the last 6 models look the same.
The models which were run without the ‘day of the week’ variable were not perfect, which makes intuitive sense. But there doesn’t appear to be any difference between these models where I excluded DOTW; they have the same R-squared and RMSE.
I concluded that varying which category column is dropped from the model does NOT affect the model’s performance.
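This result can be reproduced on a small scale. The sketch below uses made-up data and fits OLS with `numpy.linalg.lstsq` rather than scikit-learn or statsmodels, purely to keep the example dependency-light. Dropping a different column changes the coefficients but not the fitted values, because both design matrices span the same column space.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: three condition levels, square footage, and some noise.
levels = ["Poor", "Average", "Excellent"]
effect = {"Poor": -60_000, "Average": 0, "Excellent": 15_000}
cond = rng.choice(levels, size=n)
sqft = rng.normal(2_800, 500, size=n)
noise = rng.normal(0, 10_000, size=n)
price = 100_000 + 100 * sqft + np.array([effect[c] for c in cond]) + noise

dummies = pd.get_dummies(pd.Series(cond), dtype=float)

def r_squared(dropped):
    """Fit OLS with one category column dropped and return R-squared."""
    kept = dummies.drop(columns=[dropped]).to_numpy()
    X = np.column_stack([np.ones(n), sqft, kept])
    beta, *_ = np.linalg.lstsq(X, price, rcond=None)
    residuals = price - X @ beta
    total = ((price - price.mean()) ** 2).sum()
    return 1 - (residuals @ residuals) / total

r2_drop_poor = r_squared("Poor")
r2_drop_average = r_squared("Average")
print(r2_drop_poor, r2_drop_average)
```

The two R-squared values agree up to floating-point error, which matches what I saw across my 12 models.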
Interpretability — Accurately Ranking Standardized Coefficients
Another way we might want to be able to use our model is to rank the standardized coefficients to compare the magnitude of their effects on the target, or their importance. The goal would be to determine which variables increase or decrease the target to a greater degree. It’s important to standardize continuous variables so their coefficients are in standard deviation units that can be compared.
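As a rough sketch of what standardizing does, assuming only the predictor is rescaled and the target stays in dollars: the values become z-scores, and a coefficient of $100 per square foot becomes $100 times the standard deviation of square footage. The sample below is hypothetical, including its spread of 500.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample of square footage; the spread of 500 is an assumption.
sqft = rng.normal(loc=2_800, scale=500, size=1_000)

# Z-score: subtract the mean, then divide by the standard deviation.
sqft_z = (sqft - sqft.mean()) / sqft.std()

# Dollars added per one standard deviation of square footage,
# given a unit coefficient of $100 per square foot.
dollars_per_std = 100 * sqft.std()

print(round(sqft_z.mean(), 6), round(sqft_z.std(), 6))
print(round(dollars_per_std))
```

One-hot columns are usually left as 0/1, so a categorical coefficient still reads as "dollars relative to the reference category", which is what makes it comparable to the per-standard-deviation figure.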
These insights, combined with domain knowledge, could be instrumental for business stakeholders to decide on an appropriate action plan. If their goal is to tweak predictors to affect the target, knowing which predictors have the largest impact is key.
To measure how accurate the ranked coefficients of our models would be, consider the heatmaps below. Although each model’s heatmap uses its own scale for the color gradient, we would expect the general gradient order of the test models to match the gradient order of the expected model.
In both models, Zip Codes look to follow the expected order pretty well. But in the Dropped First model, the Condition categories and Square Feet don’t look right; their colors pop out of the gradient because they have much lower standardized coefficients than we expected.
In the Dropped Average model, all of the predictor variables look pretty close.
Note that the y-intercepts of both test models are quite a bit different than expected. In fact, the Dropped Average model’s y-intercept is quite different. But it’s unlikely we would be concerned with ranking the y-intercept accurately; we would probably be most concerned with ranking the actual predictor variable coefficients.
Let’s remove the Zip Code categories to zoom in on the others.
If we take them out of the context of Zip Codes, it does look like the order of Condition categories matches expected for both models. In other words, Excellent Condition adds more than Above Average, which adds more than Average, etc.
However, Square Feet should add about $55k more to Price than Excellent Condition. We see that this difference is pretty accurate in the Dropped Average model, but in the Dropped First model, Excellent Condition is ranked above Square Feet.
If I were to use the standardized Dropped Average model to estimate which variables were more important to increasing price, I’d be pretty accurate.
But if I were to use the standardized Dropped First model to estimate this, I’d assume that Zip Codes were by and large the most important, followed by Condition Excellent, and then Square Feet. This would not be very accurate.
Accurately AND Intuitively Interpreting Original Unit Coefficients
We also might be interested in using the coefficients in original units, from the model where we did not standardize the data, to understand how the target variable changes per unit of each predictor variable.
For instance, we would want to be able to use the coefficient associated with Square Feet to say “For each sq ft added, the price goes up by an estimated $100.”
I’ve created similar heatmaps to review original unit coefficient accuracy. Note that in this visualization, the color scales of both test models were forced to match the expected model, so we can compare the values by comparing the shades directly. Here, we care about how accurate the coefficients are, as opposed to with standardized coefficients, where we cared more about the ranking order.
We can see that the Dropped Average model’s coefficients are very close to expected values. Its colors are very similar to expected, and in the right order.
However, in the Dropped First model, the y-intercept was very low, which caused most of the other coefficients to be too high to make up for it.
We can see that the Square Feet coefficient is pretty close, but Zip Code values are quite high, and it’s difficult to determine where Condition falls. Let’s remove Zip Codes to zoom in on the other variables.
We’ll take some time to interpret the differences we see here, as the fact that the Dropped First model’s unit coefficients do not exactly match expected does not necessarily mean they are inherently wrong or not useful.
First, let’s notice that both test models have very accurate unit coefficients for Square Feet, our only continuous variable.
We can see clearly that the Dropped Average model has unit coefficients for Condition variables that very closely match expected, while the Dropped First model coefficients do not. Practically, what does this mean?
It’s important to put these unit coefficients in the context of each model’s baseline, or reference point. The coefficients must be interpreted with respect to this baseline.
For our continuous variable, Square Feet, the reference point was 0, and this would be the case for any continuous variables we included.
But for our categorical variables, the reference point in the baseline became whichever category column we dropped from the model. So we can say that:
The differences in unit coefficients between the two test models are due to the models having different reference points for their categorical values.
Now it makes more sense why the coefficient for Square Feet is the same for both models, but the categorical coefficients differ: Square Feet assumed a baseline of 0 in both models.
The next logical question to ask is: Is one baseline inherently better than the other? Since we saw that model performance doesn’t change, it’s not immediately clear. To answer this question, let’s consider how we would interpret the Condition unit coefficients for the Dropped First model, assuming we didn’t know what was expected.
In the Dropped First model, we dropped the Poor Condition column. So our baseline assumes a reference point of having the worst possible condition. We would interpret the coefficients like:
- Having a Condition of Below Average adds $50k to a house’s baseline price
- Having a Condition of Average adds $66k to a house’s baseline price
- Having a Condition of Above Average adds $71k to a house’s baseline price
- Having a Condition of Excellent adds $81k to a house’s baseline price
In the Dropped First model, we’re interpreting each Condition level as adding to the baseline, even Below Average. Whereas with the Dropped Average model, the baseline assumes a condition of Average, so we would interpret the coefficients like:
- Having a Condition of Poor subtracts $66k from a house’s baseline price
- Having a Condition of Below Average subtracts $16k from a house’s baseline price
- Having a Condition of Above Average adds $5k to a house’s baseline price
- Having a Condition of Excellent adds $15k to a house’s baseline price
The steps between the coefficient values are the same in both models, but the positive/negative aspect is different.
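The relationship between the two lists is plain arithmetic: moving the reference point from Poor to Average subtracts the Average coefficient from every Condition coefficient. The sketch below uses the rounded figures quoted above, in thousands of dollars.

```python
# Dropped First coefficients from the list above, in $k.
# Poor is the dropped reference category, so its coefficient is 0.
dropped_first = {
    "Poor": 0,
    "Below Average": 50,
    "Average": 66,
    "Above Average": 71,
    "Excellent": 81,
}

# Re-express every coefficient relative to Average instead of Poor.
shift = dropped_first["Average"]
dropped_average = {level: coef - shift for level, coef in dropped_first.items()}

print(dropped_average)
```

In an exact refit, the y-intercept absorbs the same shift in the opposite direction, which is why predictions for any given house do not change.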
One is not objectively better than the other, but:
If the goal of analysis is to provide actual coefficient values to non-technical stakeholders, dropping the average category for Condition yields the most intuitive coefficients that are more likely to make sense to stakeholders.
Imagine the difficulty of explaining that although we have a positive number for the impact of Below Average Condition, this is within the context of a baseline house that actually has a negative price… it really doesn’t make much sense. It’s not inherently wrong, but without the right framing, people may come to misleading conclusions.
Although the category you choose to drop won’t affect the model’s performance, it can have a significant impact on the interpretability of the model.
If you plan to use the coefficients in your model to make accurate inferences about:
- impact on the target per predictor unit,
- or the magnitude of impact each predictor has in relation to others,
then you should carefully consider which categorical values you drop from your model.
A linear regression model’s coefficients are interpreted in the context of a baseline model. For continuous variables, the baseline uses a reference point of 0. But for categorical variables, whichever column is dropped becomes the reference point, which has a significant impact on how coefficients are interpreted.
It’s important for data scientists to consider which columns represent the most intuitive reference points for each category, and drop those. Simply dropping the first column is arbitrary: the first column will not necessarily represent the minimum, and the minimum does not always make a good reference point.
A thoughtful choice of which column to drop will yield coefficients that are much easier to interpret accurately.
How should you choose which column to drop?
Instead of assuming that the first or minimum category value is the most appropriate, consider which category represents the most intuitive reference point for your stakeholders, or to help answer the questions at the root of your analysis.
For example, if you were modeling predictors of salary you might have a categorical predictor for level of schooling with the following values:
- Less than High School
- High School or GED
- Associate’s degree
- Bachelor’s degree
- Master’s degree
- Doctorate degree
If you relied on the alpha-numeric first column, then Associate’s degree would be your reference point. This could make sense if your stakeholders would be interested in how much people with less than an Associate’s degree earn compared to people with higher than an Associate’s degree.
But it could also be reasonable to use High School or GED as the reference point, if you felt that was a more accurate representation of the population’s average education attainment. You could even figure out which of these categories seems to have a mean or median salary that most closely matches the mean or median of the population, to have a statistical reason to choose a particular category as the reference point.
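If you would rather keep using a drop-first style workflow, one option is to declare the category order yourself so that the reference level comes first. This is a sketch with hypothetical toy data; choosing High School or GED as the reference is just the example from above.

```python
import pandas as pd

levels = [
    "Less than High School", "High School or GED", "Associate's degree",
    "Bachelor's degree", "Master's degree", "Doctorate degree",
]
reference = "High School or GED"  # hypothetical choice of baseline

# Hypothetical toy data.
df = pd.DataFrame({"Education": ["Bachelor's degree", "High School or GED",
                                 "Associate's degree", "Doctorate degree"]})

# Put the reference level first so that drop_first removes it.
ordered = [reference] + [level for level in levels if level != reference]
df["Education"] = pd.Categorical(df["Education"], categories=ordered)

encoded = pd.get_dummies(df, columns=["Education"], drop_first=True, dtype=int)
print(list(encoded.columns))
```

Rows in the reference category end up with zeros in every Education column, so each remaining coefficient reads as a difference from High School or GED.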
In this experiment, I dropped the categories that represented the average. That made sense because of what my categories represented. To select the appropriate column for a given scenario, data scientists will need to consider which column gives their stakeholders an intuitive reference point, one against which positive and negative coefficients carry a relevant meaning.
And of course, if your primary goal is prediction, you don’t have to worry about which category value you drop!
I hope others find this useful and informative. As I’m still a student of data science, I’d welcome any thoughts on my testing approach or applicability of my conclusions.