Summary Statistics

In this page, I’ll share Stata commands for summary statistics. The dataset I use here is Militarized Interstate Disputes (v4.3) dataset from the Correlates of War Project. You can download the dataset here.

Opening data with use command

use "MIDA 4.3.dta", clear

[,clear] is telling Stata to clear any other data that you have opened in Stata.

When you open an excel file (.csv or .xlsx), use ‘import delimited’ instead.

import delimited "MIDA 4.3.csv", clear 

Looking at variables

What variables do we have in the example dataset?

describe

When you type ‘describe’, You can see the names and labels of all variables in the dataset.

Measures of central tendancy: mean, median, mode

Let’s look at ‘outcome’ variable in the MIDA dataset. ‘outcome’ shows the outcome of dispute like below.

Value Description
1 Victory for side A
2 Victory for side B
3 Yield by side A
4 Yield by side B
5 Stalemate
6 Compromise
7 Released
8 Unclear
9 Joins ongoing war
-9 Missing

Since the value doesn’t have specific meaning (nominal), mean and median are not meaningful. Let’s find the mode of this variable. ‘tabulate’ or simply ‘tab’ is useful to figure out the mode.

tabulate outcome

tab outcome, missing

We can see that 5 (Stalemate) is the mode of this variable.

Let’s fidn the mean and median of ‘MaxDur’ variable. ‘MaxDur’ is the maximum duration of dispute. First, you can use ‘codebook’ command. It will show you the range, number of unique values, number of missing values, mean, standard deviation, and percentiles. You can see that there is no missing values in this variable!

codebook maxdur

summarize maxdur
sum maxdur, detail

Second, you can use summarize or simply sum to look at the number of observations, mean, standard deviation, minimum and maximum values. If you add ‘, detail’ or ‘, d’, then you can see more detailed information such as percentiles, variance, skewness, and kurtosis.

The mean of the maximum duration of disputes is 161.72 and the median is 41.

If you have missing values such as -9 in the previous example, you might want to exclude those values. Then, use ‘if’ option.

Visualization

Let’s make histograms and box plots! Histograms are helpful to look at the distribution of variables. The command is ‘histogram’ or ‘hist’. You can add more options to make a prettier histogram.

histogram maxdur

hist maxdur, percent bin(15) xtitle("Maximum Duration of Dispute")
hist maxdur, percent width(10) xtitle("Maximum Duration of Dispute")
hist maxdur, freq xtitle("Maximum Duration of Dispute")

graph box maxdur

You can change the number and width of bins (bin and width respectively). The default setting will show you the density. You can change that by adding ‘frequency’ or ‘percent’ as an option to the command.

You can draw a box plot using the command, ‘graph box’. Since this variable is right-skewed, we can see many outside values above the box (dots above the upper adjacent line).

Next