For this correlation and regression practice, I use a county-level voter turnout dataset. The dataset is available on request.
Load the dataset first.
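The file name is not given in this tutorial, so the lines below are only a sketch. county_turnout.dta is a hypothetical name; replace it with the path to the file you receive.

```stata
* Hypothetical file name: replace with the path to the dataset you received
use "county_turnout.dta", clear

* Confirm that the variables used in this tutorial are present
describe perc_turnout perc_mille perc_college
```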
The research question for this tutorial is: Are counties with more millennials likely to have lower turnout?
Let’s make a scatter plot.
scatter perc_turnout perc_mille
- Make it prettier using grstyle (a user-written package; install it with ssc install grstyle if needed).
- Add the best-fitting line to the scatter plot.
grstyle init
grstyle set plain
scatter perc_turnout perc_mille
twoway (scatter perc_turnout perc_mille) (lfit perc_turnout perc_mille)
Is there a linear relationship?
Let’s look at the correlation between these two variables.
pwcorr perc_turnout perc_mille
pwcorr perc_turnout perc_mille, sig
Interpret the result. The sig option prints the p-value beneath the correlation coefficient. Is the linear relationship between voter turnout and the percentage of millennials statistically significant?
Let’s start with a simple regression. Our dependent variable (DV) is voter turnout and our independent variable (IV) is the percentage of millennials.
reg perc_turnout perc_mille
What happens if we add a control variable, perc_college, to the regression model? Is the effect of the percentage of millennials still significant?
reg perc_turnout perc_mille perc_college
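To see how the coefficient on perc_mille changes once the control is added, it can help to put the two models side by side. The lines below are a minimal sketch using estimates store and estimates table; the names m1 and m2 are my own labels, not part of the dataset. The model with the control is run last, so it remains the active model for any commands that follow.

```stata
* Model 1: simple regression (m1 is an illustrative name)
quietly reg perc_turnout perc_mille
estimates store m1

* Model 2: add the control variable (m2 is an illustrative name)
quietly reg perc_turnout perc_mille perc_college
estimates store m2

* Coefficients with standard errors, plus N and R-squared for each model
estimates table m1 m2, b(%9.3f) se stats(N r2)
```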
Let’s visualize the predicted voter turnout. Look at the distribution of our IV (perc_mille) and choose a range for the plot. I use the range from 10 to 40 in steps of 5 percentage points. The atmeans option holds perc_college at its mean.
sum perc_mille, d
margins, at(perc_mille=(10(5)40)) atmeans
marginsplot, xtitle("Percentage of Millennials") ytitle("Voter Turnout") ///
    ti("Effect of Millennials on Voter Turnout")