For this correlation and regression practice, I use a county-level voter turnout dataset. The dataset is available on request.

Load the dataset first.

use "county_turnout.dta"

The research question for this tutorial is: Are counties with more millennials likely to have lower turnout?

Let’s make a scatter plot.

scatter perc_turnout perc_mille


Let’s improve the plot in two ways:

  1. Make it prettier using grstyle, a user-written package.
  2. Add the line of best fit to the scatter plot.
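
Because grstyle does not ship with Stata, the grstyle commands will fail with an "unrecognized command" error if it has not been installed yet. A minimal sketch of a one-time installation, assuming your machine can reach the SSC archive (palettes and colrspace are companion packages that grstyle uses for its color settings; the plain scheme may work without them):

* One-time installation of grstyle and its companion packages from SSC
ssc install grstyle, replace
ssc install palettes, replace
ssc install colrspace, replace
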
grstyle init
grstyle set plain

scatter perc_turnout perc_mille

twoway (scatter perc_turnout perc_mille)(lfit perc_turnout perc_mille)

Is there a linear relationship?
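
One possible way to check, as a rough sketch: overlay a lowess smoother on the same plot. If the smoothed curve stays close to the fitted straight line, a linear relationship is a reasonable summary; if it bends away clearly, it may not be.

* Overlay the linear fit and a lowess smoother on the scatter plot
twoway (scatter perc_turnout perc_mille) (lfit perc_turnout perc_mille) (lowess perc_turnout perc_mille)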


Let’s look at the correlation between these two variables.

pwcorr perc_turnout perc_mille

pwcorr perc_turnout perc_mille, sig

Interpret the result. Is the linear relationship between voter turnout and the percentage of millennials statistically significant?
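
When reading the output, the sign of the coefficient gives the direction of the relationship, its absolute value gives the strength, and the number printed under it with the sig option is the p-value. As an optional sketch, the same command can also report the number of observations and star the coefficient when it passes a chosen threshold (0.05 here is my assumption, not a rule):

* Print the p-value and the number of observations, and star correlations with p < 0.05
pwcorr perc_turnout perc_mille, sig obs star(0.05)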


Let’s start with a simple regression. Our dependent variable (DV) is voter turnout and our independent variable (IV) is the percentage of millennials.

reg perc_turnout perc_mille
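
The coefficient on perc_mille is the predicted change in turnout for a one-point increase in the percentage of millennials. Two optional follow-ups, sketched here under the assumption that they are run right after the regression above: lincom rescales the coefficient to a 10-point increase, and the last lines show that in a simple regression the R-squared equals the squared correlation coefficient.

* Predicted change in turnout for a 10-point increase in the millennial share
lincom 10*perc_mille

* Compare the squared correlation with the R-squared of the simple regression
quietly correlate perc_turnout perc_mille
display "squared correlation = " r(rho)^2
display "R-squared           = " e(r2)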

What happens if we add a control variable, perc_college, to the regression model? Is the effect of the percentage of millennials still significant?

reg perc_turnout perc_mille perc_college
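
To compare the two models side by side, one option is to store each set of estimates and tabulate them. This is only a sketch: the names m1 and m2 are arbitrary labels I chose. The model with the control variable is estimated last, so it stays active for any later margins command.

* Re-estimate both models quietly and store the results
quietly reg perc_turnout perc_mille
estimates store m1
quietly reg perc_turnout perc_mille perc_college
estimates store m2

* Show coefficients and standard errors, plus N and R-squared, in one table
estimates table m1 m2, b(%9.3f) se stats(N r2)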

Visualization using margins

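Let’s visualize the predicted voter turnout. Look at the distribution of our IV (perc_mille) and choose a range for the plot. I use the range from 10 to 40 in steps of 5 percentage points. The atmeans option holds the control variable (perc_college) at its sample mean.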

sum perc_mille, d
margins, at(perc_mille=(10(5)40)) atmeans

marginsplot, xtitle("Percentage of Millennials") ytitle("Voter Turnout") ///
	ti("Effect of Millennials on Voter Turnout")
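
If you prefer a line with a shaded confidence band instead of points with error bars, a possible variation is sketched below. It assumes the regression with perc_college is still the active model, and the file name in the last line is a hypothetical example.

* Recompute the predictions, then draw them as a line with a shaded 95% confidence band
quietly margins, at(perc_mille=(10(5)40)) atmeans
marginsplot, recast(line) recastci(rarea) ///
	xtitle("Percentage of Millennials") ytitle("Voter Turnout")

* Save the current graph (hypothetical file name)
graph export "turnout_millennials.png", replace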