Dummy Variable: Meaning, Uses, Interpretation, and Precautions

Are all variables quantitative in nature? Obviously not. Some variables are qualitative: they cannot be measured in numbers, but we can observe whether a given quality is present or absent. Such non-quantifiable variables enter a model through dummy variables. Dummy variables are also called qualitative or categorical variables because they represent categories, and binary variables because they take only two values, 0 and 1.

Dummy variables can be used in cross-sectional, time-series, and panel data. Here I will be demonstrating the use of dummy variables in cross-sectional data and time-series data.


In Cross-sectional data

A dummy variable is widely used in cross-sectional data analysis. Let's begin with the example of the Nepal Living Standard Survey. If we want to assess the relationship between Food expenses and remittances, then we will take several variables that represent individual characteristics, household characteristics, community characteristics, and so on.

For simplicity, we use only three variables: Food expenses as the dependent variable, and Gender of the respondent and Remit (remittance received, in rupees, on a log scale) as the independent variables.

Intercept dummy

Our model, let's call it Model 1, is

$\text{Food expenses}=\alpha+\beta_1 Gender+\beta_2 Remit+\mu \quad (1)$

Gender is a qualitative variable. Unlike a continuous variable, it has no natural numerical value; it describes a characteristic of an individual, here either male or female. So, we assign numbers to the categories of the dummy variable. In our example, 1 denotes Male and 0 denotes Female.
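As a minimal sketch of this coding step, the snippet below turns a list of gender labels into a 0/1 dummy. The five respondents are made up for illustration and are not taken from the Nepal Living Standard Survey.

```python
# Hypothetical respondents; the labels are made up for illustration.
genders = ["Male", "Female", "Female", "Male", "Female"]

# Encode the qualitative variable as a dummy: 1 denotes Male, 0 denotes Female.
gender_dummy = [1 if g == "Male" else 0 for g in genders]
print(gender_dummy)  # [1, 0, 0, 1, 0]

# The mean of a 0/1 dummy is the share of observations in the category coded 1.
share_male = sum(gender_dummy) / len(gender_dummy)
print(share_male)  # 0.4
```

Female is the base category here because it is the one coded 0.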

In Model 1, $\alpha$ is the intercept term, $\beta_1$ is the differential intercept term (it is not a slope), and $\beta_2$ is the slope.

Let's assume that $\hat\alpha = 1.9\ (0.07)$, $\hat\beta_1 = 0.1\ (0.00)$, and $\hat\beta_2=1.5\ (0.04)$. Inspecting the p-values in parentheses, we interpret $\hat\beta_1$ and $\hat\beta_2$ because their p-values are below 5 percent, but we do not interpret $\hat\alpha$ because its p-value is above 5 percent. The interpretation of $\hat\beta_2$ will be: a 1 percent increase in remittance inflow increases the household's food expenses by 1.5 percent. The interpretation of $\hat\beta_1$ will be: households headed by males spend about 10 percent more on food on average, holding remittances constant. (These percentage readings presume that food expenses, like Remit, are measured on a log scale.)

Let's sketch its graph for a clear understanding of the result of Model 1.

Figure 1: Intercept dummy

The average food expense for males is

$\text{Food expenses}=2+1.5\times Remit$

The average food expense for females is

$\text{Food expenses}=1.9+1.5\times Remit$

What is $\hat\beta_1$? It is the average difference in food expenses between households headed by males and females.
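The two fitted lines can be checked numerically. The sketch below plugs the assumed estimates from the text (illustrative values, not real survey results) into Model 1 and shows that the gap between the male and female lines equals $\hat\beta_1$ at every level of Remit. The function name `predicted_food_expenses` is a hypothetical helper.

```python
# Assumed Model 1 estimates from the text (illustrative, not real survey results).
ALPHA_HAT = 1.9   # intercept for the base category (Female, Gender = 0)
BETA1_HAT = 0.1   # differential intercept for Male (Gender = 1)
BETA2_HAT = 1.5   # common slope on Remit

def predicted_food_expenses(gender, remit):
    """Fitted value of Model 1 for a given Gender dummy and Remit."""
    return ALPHA_HAT + BETA1_HAT * gender + BETA2_HAT * remit

for remit in (0.0, 1.0, 2.0):
    male = predicted_food_expenses(1, remit)
    female = predicted_food_expenses(0, remit)
    # The gap stays at BETA1_HAT because both lines share the same slope.
    print(remit, round(male, 2), round(female, 2), round(male - female, 2))
```

Because the slope is common to both groups, the lines are parallel and differ only in the intercept.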

Interactive dummy

The example presented in Model 1 uses an intercept dummy. The use of a dummy is not limited to that. We can add an interactive term to examine whether the relationship between food expenses and remittances differs between households headed by males and females.

The coefficient on the interaction between the dummy variable and some variable X tells us the extent to which the dummy variable changes the regression function for that regressor (MIT Lecture Note).

We will continue with our Model 1 and add an interactive term to it.

$\text{Food expenses}=\alpha+\beta_1 Gender+\beta_2 Remit+\beta_3(Gender\times Remit)+\mu \quad (2)$

Our new model is Model 2.

Let's assume that $\hat\alpha = 1.5\ (0.06)$, $\hat\beta_1 = 0.12\ (0.00)$, $\hat\beta_2=1.2\ (0.04)$, and $\hat\beta_3=0.8$. Inspecting the p-values in parentheses, we interpret $\hat\beta_1$ and $\hat\beta_2$ because their p-values are below 5 percent (we take $\hat\beta_3$, for which no p-value is shown, to be significant as well), but we do not interpret $\hat\alpha$ because its p-value is above 5 percent. The interpretation of $\hat\beta_2$ and $\hat\beta_3$ together will be: a 1 percent increase in remittance inflow increases food expenses by 1.2 percent for households headed by females and by 2.0 percent ($1.2+0.8$) for households headed by males. The interpretation of $\hat\beta_1$ will be: households headed by males spend about 12 percent more on food on average when Remit is zero; at other levels of Remit the gap also depends on $\hat\beta_3$.

Figure 2: Intercept and Interactive dummy

The average food expense for males is

$\text{Food expenses}=1.62+2.0\times Remit$

The average food expense for females is

$\text{Food expenses}=1.5+1.2\times Remit$

What are $\hat\beta_1$ and $\hat\beta_3$? $\hat\beta_1$ is the differential intercept, that is, the average difference in food expenses between households headed by males and females when Remit is zero, and $\hat\beta_3$ is the differential slope.
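One way to see what the differential intercept and differential slope measure is to fit a separate line for each group. For a model with an intercept, a dummy, a regressor, and their interaction, the pooled OLS point estimates coincide with those from group-wise regressions. The sketch below uses simulated data generated from the assumed coefficients in the text; the sample size, noise level, and the helper `simple_ols` are assumptions made only for illustration.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

random.seed(0)
ALPHA, BETA1, BETA2, BETA3 = 1.5, 0.12, 1.2, 0.8  # assumed values from the text

# Simulate 400 households, alternating Female (0) and Male (1).
rows = []
for i in range(400):
    gender = i % 2
    remit = random.uniform(0.0, 5.0)
    noise = random.gauss(0.0, 0.02)
    food = ALPHA + BETA1 * gender + BETA2 * remit + BETA3 * gender * remit + noise
    rows.append((gender, remit, food))

female = [(r, f) for g, r, f in rows if g == 0]
male = [(r, f) for g, r, f in rows if g == 1]
a_f, b_f = simple_ols(*zip(*female))  # female line: alpha, beta2
a_m, b_m = simple_ols(*zip(*male))    # male line: alpha + beta1, beta2 + beta3

alpha_hat, beta2_hat = a_f, b_f
beta1_hat = a_m - a_f  # differential intercept
beta3_hat = b_m - b_f  # differential slope
print(round(alpha_hat, 2), round(beta1_hat, 2), round(beta2_hat, 2), round(beta3_hat, 2))
```

The standard errors from the pooled model and from the separate regressions generally differ, so this shortcut is useful for intuition rather than for inference.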

Many readers may wonder why the slope for male-headed households is 2.0. To see this, let's differentiate Model 2 with respect to Remit.

\begin{align}\frac{\partial\,\text{Food expenses}}{\partial Remit}=\frac{\partial}{\partial Remit}\left(\alpha+\beta_1 Gender+\beta_2 Remit+\beta_3(Gender\times Remit)+\mu\right)\end{align}

$\frac{\partial\,\text{Food expenses}}{\partial Remit}=\beta_2+\beta_3 Gender$

If Gender is Male, then Gender = 1

$\frac{\partial\,\text{Food expenses}}{\partial Remit}=\beta_2+\beta_3$

If Gender is Female, then Gender = 0

$\frac{\partial\,\text{Food expenses}}{\partial Remit}=\beta_2$
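The derivative can also be checked numerically. The sketch below evaluates the fitted Model 2 line at the assumed estimates from the text and approximates the slope with a central difference; the helper names and the step size are illustrative choices.

```python
# Assumed Model 2 estimates from the text (illustrative, not real survey results).
ALPHA_HAT, BETA1_HAT, BETA2_HAT, BETA3_HAT = 1.5, 0.12, 1.2, 0.8

def predicted_food_expenses(gender, remit):
    """Fitted value of Model 2 for a given Gender dummy and Remit."""
    return (ALPHA_HAT + BETA1_HAT * gender + BETA2_HAT * remit
            + BETA3_HAT * gender * remit)

def slope_wrt_remit(gender, remit=1.0, step=1e-6):
    """Central-difference approximation of d(Food expenses)/d(Remit)."""
    up = predicted_food_expenses(gender, remit + step)
    down = predicted_food_expenses(gender, remit - step)
    return (up - down) / (2 * step)

print(round(slope_wrt_remit(1), 4))  # male-headed households: beta2 + beta3
print(round(slope_wrt_remit(0), 4))  # female-headed households: beta2
```

The numerical slopes agree with $\hat\beta_2+\hat\beta_3=2.0$ for males and $\hat\beta_2=1.2$ for females.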

In Time-series data

Dummy variables are used just as often in time-series data, where they capture structural breaks or unique events that have a significant effect on macroeconomic variables or other variables of interest. Events such as the impact of COVID, the downturn during the Great Recession that began in 2007, and other such incidents cannot be captured through the data directly because they are qualitative in nature. So, we introduce dummy variables to capture them. But we must have a valid reason to include a dummy variable in our model.

Let's start with a model of the relationship between CO2 emissions and world economic growth. Let's write a simple model.

$\text{CO}_2=\alpha+\beta_1\text{Growth}+\mu \quad (3)$

We assume that CO2 emissions and Growth (world economic growth) are stationary at level, that is, $I(0)$.

Model 3 seems okay, but there may be breaks in GDP growth and CO2 emissions. We can detect these breaks by plotting the series or by performing formal tests such as the Chow test.
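To make the Chow test concrete, here is a minimal sketch for a single known break date in a simple regression. It compares the residual sum of squares of one pooled line with that of two separate lines using $F=\frac{(RSS_{pooled}-RSS_1-RSS_2)/k}{(RSS_1+RSS_2)/(n-2k)}$, where $k$ is the number of parameters per regression. The data are simulated, and the break position, noise level, and function names are assumptions made for illustration only.

```python
import random

def fit_rss(x, y):
    """Residual sum of squares from the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def chow_f(x, y, break_index, k=2):
    """Chow F statistic for a break at break_index; k = parameters per line."""
    rss_pooled = fit_rss(x, y)
    rss_split = (fit_rss(x[:break_index], y[:break_index])
                 + fit_rss(x[break_index:], y[break_index:]))
    dof = len(x) - 2 * k
    return ((rss_pooled - rss_split) / k) / (rss_split / dof)

random.seed(1)
n, break_index = 40, 20
growth = [random.uniform(1.0, 5.0) for _ in range(n)]
noise = [random.gauss(0.0, 0.2) for _ in range(n)]

# Series without a break: one line describes the whole sample.
co2_stable = [1.0 + 0.5 * g + e for g, e in zip(growth, noise)]
# Series with a break: the intercept shifts by 2 in the second half.
co2_break = [1.0 + 0.5 * g + e + (2.0 if t >= break_index else 0.0)
             for t, (g, e) in enumerate(zip(growth, noise))]

f_stable = chow_f(growth, co2_stable, break_index)
f_break = chow_f(growth, co2_break, break_index)
print(round(f_stable, 2), round(f_break, 2))
```

The statistic is compared with a critical value from the $F(k,\,n-2k)$ distribution. A large value, as in the series with the shifted intercept, suggests a structural break.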

Figure 3: CO2 Emission

The yellow line in Figure 3 marks periods of crisis in the world. In these periods, CO2 emissions decreased due to the global meltdown, and world GDP growth must have slowed down as well. Hence, we cannot reliably identify the relationship between CO2 emissions and world GDP growth in the presence of such external shocks or disturbances. We can neither remove these shocks from the data nor estimate them directly. So, we use a dummy variable named $\text{Crisis}$, where 1 denotes a crisis period and 0 denotes a normal period.
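Building such a dummy is a simple lookup against the dates we treat as crisis periods. In the sketch below, the sample range and the set of crisis years are illustrative assumptions, not the periods marked in Figure 3.

```python
# Hypothetical annual sample; the range is chosen only for illustration.
years = list(range(2005, 2023))

# Assumed crisis years (illustrative; not taken from the article's data).
crisis_years = {2008, 2009, 2020}

# Crisis dummy: 1 denotes a crisis period, 0 denotes a normal period.
crisis = [1 if year in crisis_years else 0 for year in years]

for year, flag in zip(years, crisis):
    print(year, flag)
```

Which periods count as a crisis should be justified by the data or by a formal test, since the choice drives the results.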

So, we need to respecify Model 3. Model 4 is obtained as

$\text{CO}_2=\alpha+\beta_1\text{Growth}+\beta_2\text{Crisis}+\mu \quad (4)$

The interpretation is similar to that of Model 1.

If we want to strengthen our analysis, we can also add an interactive term. Adding an interactive dummy to Model 4 gives Model 5.

$\text{CO}_2=\alpha+\beta_1\text{Growth}+\beta_2\text{Crisis}+\beta_3(\text{Growth}\times\text{Crisis})+\mu \quad (5)$

The interpretation is similar to that of Model 2.
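As a small sketch of how the interactive dummy is constructed, the snippet below multiplies Growth by Crisis and assembles one regressor row per period. The growth rates and crisis flags are made-up numbers for illustration.

```python
# Hypothetical growth rates (percent) and crisis flags; made up for illustration.
growth = [3.1, 2.8, -0.5, 3.4, -2.9]
crisis = [0, 0, 1, 0, 1]

# Interactive dummy: equals Growth in crisis periods and 0 in normal periods.
growth_x_crisis = [g * c for g, c in zip(growth, crisis)]

# One row per period: [intercept, Growth, Crisis, Growth x Crisis].
design_rows = [[1, g, c, gc] for g, c, gc in zip(growth, crisis, growth_x_crisis)]
for row in design_rows:
    print(row)
```

With these regressors, the slope on Growth is $\beta_1$ in normal periods and $\beta_1+\beta_3$ in crisis periods, mirroring the male and female slopes in the cross-sectional example.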

The discussion of dummy variables in the case of panel data is left to the reader for now; we will discuss it in a later article.

Precautions

We must be careful when using dummy variables. If we include a dummy for every category together with an intercept, we fall into the dummy variable trap and the model suffers from perfect multicollinearity. A simple way to avoid this problem is to use `k-1` dummies if there are `k` categories. The category for which no dummy variable is assigned is known as the base, benchmark, control, comparison, reference, or omitted category.
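The trap is easy to see in a small example. In the sketch below, a variable with three categories is coded once with all `k` dummies and once with `k-1` dummies. The category labels and the helper `make_dummies` are hypothetical and serve only as an illustration.

```python
# Hypothetical categorical variable with k = 3 categories.
categories = ["Hill", "Terai", "Mountain", "Hill", "Terai"]
levels = sorted(set(categories))  # ['Hill', 'Mountain', 'Terai']

def make_dummies(values, kept_levels):
    """One 0/1 column per kept level, returned row by row."""
    return [[1 if v == level else 0 for level in kept_levels] for v in values]

full = make_dummies(categories, levels)         # k dummies
reduced = make_dummies(categories, levels[1:])  # k-1 dummies; 'Hill' is the base

intercept = [1] * len(categories)

# With all k dummies, each row sums to 1, which reproduces the intercept column
# exactly. This exact linear dependence is the dummy variable trap.
in_trap = all(sum(row) == c for row, c in zip(full, intercept))
print(in_trap)  # True

# With k-1 dummies, rows for the base category sum to 0, so the dummies
# no longer add up to the intercept column.
still_sums_to_intercept = all(sum(row) == c for row, c in zip(reduced, intercept))
print(still_sums_to_intercept)  # False
```

Dropping the intercept instead of one dummy also breaks the dependence, but then the coefficients are category means rather than differences from a base category.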


Cite this article as: 
Byanjankar, R. (2021). Dummy Variable: Meaning, Uses, Interpretation, and Precautions. https://www.rohanbyanjankar.com.np/2022/02/dummy-variable-meaning-uses.html
