Regressions
For this correlation and regression practice, I use a county-level voter turnout dataset, which is available on request.
Load the dataset first.
use "county_turnout.dta"
The research question for this tutorial is: Are counties with more millennials likely to have lower turnout?
Let’s make a scatter plot.
scatter perc_turnout perc_mille
Optional:
- make it prettier using grstyle.
- add the best-fitting line to the scatter plot.
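grstyle is a community-contributed package rather than part of official Stata, so it has to be installed once before the grstyle commands will run. A minimal sketch, assuming you have internet access to the SSC archive:
* one-time installation of grstyle from SSC
ssc install grstyle, replace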
grstyle init
grstyle set plain
scatter perc_turnout perc_mille
twoway (scatter perc_turnout perc_mille) (lfit perc_turnout perc_mille)
Is there a linear relationship?
Correlation
Let’s look at the correlation between these two variables.
pwcorr perc_turnout perc_mille
pwcorr perc_turnout perc_mille, sig
Interpret the result. Is the linear relationship between voter turnout and the percentage of millennials statistically significant?
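The sig option prints the p-value beneath each correlation coefficient. As an optional variation, the sketch below also reports the number of observations used and stars coefficients that are significant at the 5 percent level. The 0.05 threshold is my assumption for illustration, not part of the exercise.
* p-values, number of observations, and a star where p < 0.05
pwcorr perc_turnout perc_mille, sig obs star(0.05)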
Regressions
Let’s start with a simple regression. Our DV is voter turnout and our IV is the percentage of millennials.
reg perc_turnout perc_mille
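In a regression with a single IV, the R-squared equals the squared correlation coefficient, which ties this output back to the correlation section. A quick check, as a sketch that assumes both commands use the same observations (no extra missing values in either variable):
* squared Pearson correlation between the two variables
quietly correlate perc_turnout perc_mille
display r(rho)^2
* R-squared from the simple regression
quietly reg perc_turnout perc_mille
display e(r2)
The two displayed numbers should agree up to rounding.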
What happens if we add a control variable, perc_college, to the regression model? Is the effect of the percentage of millennials still significant?
reg perc_turnout perc_mille perc_college
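To see how the coefficient on perc_mille changes once the control is added, it can help to put the two models side by side. A minimal sketch using estimates store and estimates table; the names m1 and m2 are arbitrary labels I chose for illustration:
* simple regression, stored as m1
quietly reg perc_turnout perc_mille
estimates store m1
* regression with the control variable, stored as m2
quietly reg perc_turnout perc_mille perc_college
estimates store m2
* coefficients and standard errors, plus N and R-squared for each model
estimates table m1 m2, b(%9.3f) se(%9.3f) stats(N r2)
Because m2 is estimated last, it remains the active model for any postestimation command that follows.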
Visualization using margins
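Let’s visualize the predicted voter turnout from the model above. First look at the distribution of our IV (perc_mille) to choose a range for the plot. I choose the range from 10 to 40 in steps of 5 percentage points; the atmeans option holds the other covariates at their means.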
sum perc_mille, d
margins, at(perc_mille=(10(5)40)) atmeans
marginsplot, xtitle("Percentage of Millennials") ytitle("Voter Turnout") ///
title("Effect of Millennials on Voter Turnout")
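To see what margins is computing, one of the predictions can be reproduced by hand from the stored coefficients. The sketch below assumes the model with perc_college is the one being plotted, and it computes the predicted turnout at perc_mille = 10 with perc_college held at its estimation-sample mean:
* re-estimate the model so that _b[] and e(sample) refer to it
quietly reg perc_turnout perc_mille perc_college
* mean of the control variable over the estimation sample
quietly summarize perc_college if e(sample)
* predicted turnout at perc_mille = 10, perc_college at its mean
display _b[_cons] + _b[perc_mille]*10 + _b[perc_college]*r(mean)
The displayed value should match the first row of the margins output, which corresponds to perc_mille = 10.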