Class 4: Analyzing the Data (Part 1)
So far, we have learned how to go from a collection of raw databases to a database that is ready for analysis. In this class and the next, we will take an analysis-ready dataset and learn how to make publication-quality plots that clearly show the relations of interest, and how to run statistical tests and regressions to draw statistical inference from our data.
As you will see, even with data that is ready for analysis, we will need the tools we have learned so far to make minor modifications for specific analyses. Moreover, you will learn that your code can become significantly more efficient (in terms of useful output per line of code) by making use of global variables, local variables, temporary variables, temporary files, loops, and conditional execution.
Global and local variables will allow your do file to automatically modify its behavior depending on conditions that may change, for example when removing observations that lie in the tails of the distribution of a particular variable (that is, below its p-th percentile or above its (100-p)-th percentile; this operation is called trimming, and its close relative, winsorizing, replaces those observations with the cutoff values instead of dropping them). No matter how your sample changes, the code will consistently delete the observations that meet this condition without you having to specify the cutoffs manually.
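As a minimal sketch of this idea, the lines below compute the cutoffs from the data and store them in local variables, so nothing has to be typed by hand. The auto dataset that ships with Stata, the variable price, and the 5 percent tail are illustrative assumptions and are not part of the course files.

```stata
* Illustrative only: auto.dta, price and the 5% tail are arbitrary choices
sysuse auto, clear
local p = 5                               // size of each tail, in percent
local upper = 100 - `p'
_pctile price, percentiles(`p' `upper')   // stores the cutoffs in r(r1) and r(r2)
local lo = r(r1)
local hi = r(r2)
* Trimming: drop both tails (the missing() guard keeps missing values untouched)
drop if (price < `lo' | price > `hi') & !missing(price)
* Winsorizing would instead use: replace price = `lo' if price < `lo'
```

If the sample changes, rerunning the same lines recomputes `lo' and `hi' from the new data.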
Temporary files and temporary variables are very easy to use and keep your code clean. They are databases and variables that exist only while the code is being executed. Hence, you don't end up with a database full of auxiliary variables, or a folder full of .dta files used for intermediate steps.
Loops will allow you to apply the same procedure repeatedly to different variables or to different groups of observations as defined by the unique values of a variable. To continue with the example above, a loop will allow you to trim or winsorize your data by groups, that is, to calculate the relevant cutoffs by groups defined by the unique values of a variable (e.g., calculate the p-th and (100-p)-th percentiles for males and females in your sample, and treat males and females separately using their respective cutoffs).
Conditional execution allows you to execute different lines of code under different conditions. For example, you may need to change the way in which you calculate a variable depending on the year. In most cases, this can be accomplished with the if qualifier of a command, but in some cases you need more flexibility than that.
We are going to learn these skills today through a brief explanation of how global variables, local variables, temporary variables, temporary files, and loops work, followed by the basics of statistical tests. Plots and regressions are left for the next class.
During this class and the next, we will use data from the World Bank to analyze trends in CO2 emissions across countries. The goal is to disentangle the role of population and economic growth in countries' CO2 emissions growth, and to compare the relative performance of different groups of countries.
Macros: Global Variables, Local Variables, Temporary Variables, and Temporary Files
Macros are variables that are not attached to a database, where you can store information in Stata for later use (Stata's own documentation calls them global macros and local macros). They differ in the type of data they store and in the scope in which they are defined.
Global Variables
Open Stata and type the following command into the command line:
global fruit apple
Now type:
display "My favorite fruit is $fruit"
As you can see, Stata stored the value apple into a variable called fruit, and we can retrieve this value using the $ character before the name of our new global variable.
Now try:
display "I like $fruits"
This didn't work as expected, because Stata looked for an undeclared global variable named fruits and found nothing. This happens when the name of our global variable is immediately followed by characters that could themselves be part of a name (letters, digits, or underscores). To tell Stata where the name of our global variable ends, we have to use the following syntax:
display "I like ${fruit}s"
Local Variables
Now type:
local fruit "water melon"
and then:
display "My favorite fruit is `fruit'"
macro drop _all
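The last command, macro drop _all, clears every macro defined so far, so that we start from a clean slate. Now, open a do file and copy the following lines: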
global fruit "apple"
local fruit "water melon"
Run the do file and then type in the console:
display "I like both $fruit and `fruit'"
As you can see, Stata was able to retrieve the value of the global variable but not the value of the local variable, because a local variable exists only within the do file in which it was defined. Append the following line at the end of the do file:
display "I like both $fruit and `fruit'"
And run it again. Now you can see that within the scope of the do file, Stata can get the value of both variables.
To be more precise, the scope of a local variable is not the do file where it was defined, but the execution instance of that do file. Try running only the first two lines of our do file, and then running the third line. Stata fails to retrieve the value of the local variable because the variable was defined in a previous execution of the do file.
Why does this matter
Global and local variables are extremely useful tools. To see this, download the material for this class and open the do file downloading-wb-data.do. This do file downloads the data from the World Bank using a World Bank Stata API that you can install using
ssc install wbopendata
and makes some modifications to the data to have a dataset ready for the analysis. First, pay attention to lines 15, 16 and 48, reproduced here:
global ps /
cd "${ps}Users${ps}felipe${ps}projects${ps}stata${ps}classFolders${ps}class4"
save "data${ps}processed${ps}wb_data.dta", replace
In line 15, we declare a global variable called ps that is equal to a forward slash. ps stands for path separator, and I put a forward slash there because I use a UNIX machine, which uses forward slashes as separators. Note that when declaring the working directory in line 16, and when saving the data in line 48, we retrieve the value of this global variable to use as the path separator. So, if you are using a Windows machine, just replace the forward slash with a backslash in line 15 and the do files will work just fine on your machine!
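As a hedged alternative to editing line 15 by hand, the sketch below asks Stata for the separator through the system value c(dirsep). Using it here is a suggestion, not something the course do files do. Stata also understands forward slashes in paths on Windows, so leaving ps as a forward slash is generally safe as well.

```stata
* Sketch: let Stata supply the path separator instead of typing it
global ps "`c(dirsep)'"
display "Path separator on this machine: ${ps}"
* The relative path used in the course do file, built with the global
display "data${ps}processed${ps}wb_data.dta"
```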
Now, look at lines 43-45:
count if co2_pc!=. & gdp_pc !=. & pop!=.
local frac_non_missing = r(N) / _N
display "Fraction of non missing observations is `frac_non_missing'"
We want to display the fraction of observations that have no missing values in any of the variables of interest (CO2 per capita, GDP per capita, and population). To do that, we use two values that Stata provides for us. The first one is r(N), a stored result left behind by the count command run in line 43 (the end of a command's help page lists the results it stores). It holds the output of the count command, which in this case is the number of observations that meet the conditions that follow the if. The second one is _N, a built-in system variable that always contains the total number of observations in the database currently in memory, and is automatically updated when the database is modified. Using both, we create a new local variable called frac_non_missing that holds the fraction of non-missing observations in the database, obtained by simply dividing r(N) by _N. Then we print that variable to the console.
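The same pattern can be tried on any dataset. The sketch below repeats it on the auto dataset that ships with Stata (an assumption made only so the lines run on their own), and adds return list, which prints the results the previous command stored.

```stata
* Illustrative only: auto.dta and its variables stand in for the World Bank data
sysuse auto, clear
count if rep78 != . & price != .
local frac_non_missing = r(N) / _N     // r(N) comes from count, _N from the dataset
return list                            // lists the stored results, here r(N)
display "Fraction of non missing observations is `frac_non_missing'"
```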
Temporary Variables
Temporary variables are very similar to local variables, with the exception that they are linked to a database. That is, they store a value for every observation in a database, and they can be used in basically the same way you would use any other variable in a database. The difference with a regular variable is that they are never permanently incorporated into the database: Stata drops them automatically once the do file completes its execution. Hence, they are ideal for intermediate steps, where you are not interested in preserving them.
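A minimal sketch of the idea, assuming the auto dataset that ships with Stata rather than the course data: the group mean is only an intermediate step, so it is stored in a temporary variable, while the final variable price_dev is kept.

```stata
* Illustrative only: auto.dta, price and foreign are not part of the course files
sysuse auto, clear
tempvar mean_price                               // reserves a temporary name
egen `mean_price' = mean(price), by(foreign)     // intermediate step
gen price_dev = price - `mean_price'             // the variable we want to keep
* When the do file ends, `mean_price' is dropped and price_dev stays
```

Note that the temporary variable is referred to with the same `name' syntax used for local variables.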
Please open the tempvar-template.do file, pay attention to the example, and solve the exercise.
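Temporary Files
Temporary files allow you to save an auxiliary dataset for later use without having to create a permanent file in your folders: Stata writes it to a temporary location and deletes it automatically once the do file completes its execution. This is extremely useful, as it keeps your data folders clean.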
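A minimal sketch, again assuming the auto dataset that ships with Stata: group averages are computed with collapse, parked in a temporary file, and merged back into the original data. The names means and mean_price are illustrative.

```stata
* Illustrative only: auto.dta is not part of the course files
sysuse auto, clear
tempfile means                         // reserves a temporary file name
preserve
collapse (mean) mean_price = price, by(foreign)
save "`means'"                         // written to a temporary location
restore
merge m:1 foreign using "`means'", nogenerate
```

No .dta file is left behind in the working directory after the do file finishes.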
Please open the tempfile-template.do file, pay attention to the example, and solve the exercise.
Loops
In this class we are going to see two types of loops in Stata, the foreach and the forvalues loops.
Forvalues
Open a do file and copy the following lines:
forvalues x=1(2)20{
    local x_squared = `x'^2
    display "The value of `x' squared is `x_squared'"
}
Now run the do file. Stata creates a local variable called x that takes, in turn, every other value between 1 and 20, that is, 1, 3, ..., 19 (the number between parentheses is called the step). It then runs the code between the curly brackets for each one of these values. Inside the loop, we generate another local variable that is equal to the value squared, and then print the result of our calculation to the console. We could have told Stata to run the loop for every third number between 3 and 30 (3, 6, ..., 30) by using 3(3)30 instead of 1(2)20.
Let's now try something more useful. Open the file loops-exercises-template.do in the course materials and solve the first exercise, which asks you to run a loop that prints to the console the fraction of observations with non-missing values in all variables for each year between 1960 and 2014. Note that when the step is equal to 1, the notation simplifies from 1960(1)2014 to 1960/2014. Also, you can suppress the output of the commands you use inside the loop by putting qui (short for quietly) before the command (e.g., qui count). This is useful in loops, as you want to keep the results window clean.
Foreach
A foreach loop works in a very similar way, but instead of looping through numerical values, it will loop through all the elements of a list. Open a do file and copy the following lines:
foreach string in "Hello" "There" "" "This" "is" "a" "foreach" "loop"{
    display "`string'"
}
Let's now try something more useful. The second exercise of loops-exercises-template.do asks you to display in the console the fraction of non-missing observations for each country. Check out the command levelsof (there is a brief example in the template), which saves the unique levels of a variable to a local variable. Also, note that to use the local variable generated by the loop in a condition involving a string variable, you need to put the local variable between quotes. So, if in each iteration you want to restrict a command to the observations for which the countryname variable is equal to the current value of the loop's local variable (let's say it is called x), you will have to use if countryname=="`x'".
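The combination of levelsof, foreach, and a quoted local can be sketched on a different dataset, so the exercise itself is left for you. The auto dataset that ships with Stata and the string variable origin created below are assumptions made for illustration.

```stata
* Illustrative only: auto.dta and origin stand in for the course data and countryname
sysuse auto, clear
decode foreign, generate(origin)       // string copy of the labeled variable foreign
levelsof origin, local(origins)        // unique values saved in the local origins
foreach x of local origins {
    quietly count if origin == "`x'"   // quotes are needed because origin is a string
    display "`x': " r(N) " cars"
}
```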
Nested Loops
You can nest loops inside loops to loop through different values. Copy these lines of code into a do file and run it:
forvalues number=1/10{
    forvalues exponent=1/4{
        local result = `number'^`exponent'
        display " The value of `number' to the `exponent' is `result'"
    }
}
As you can see from the output, Stata goes through all the iterations of the inner loop in each iteration of the outer loop. It is not necessary to use indentation (giving an additional level of indentation to each inner loop), although it is recommended because it makes your code more readable.
To see nested loops in action with real data, solve the third exercise of loops-exercises-template.do, which asks you to display in the console the average of each variable in each year. Note that now we will use levelsof to get the unique values of year. For the variables, you can loop through them using the same structure as in the example given for the foreach loop. You should use the stored result r(mean) generated by summarize (see the template for a brief example).
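The structure of such a nested loop can be sketched on the auto dataset that ships with Stata, with rep78 standing in for year and price and mpg standing in for the variables of interest (all of these are illustrative assumptions, not the exercise's data).

```stata
* Illustrative only: rep78 plays the role of year in this sketch
sysuse auto, clear
levelsof rep78, local(groups)              // unique non-missing values of rep78
foreach g of local groups {
    foreach var in price mpg {
        quietly summarize `var' if rep78 == `g'
        display "rep78 = `g', mean of `var': " r(mean)
    }
}
```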
Using Matrices to Save the Results From Your Loops
In some cases, you want your loop to perform an operation such as dropping observations depending on the value of a variable, or creating a variable whose definition varies by group. In those cases, there are no results we would like to keep after the loop has run. But what if we want to run a test for each year in a sample and save the results to see how they have evolved over time? In that case, we can use Stata matrices to store the results and save them to a database you can export. Although there are a lot of things you can do with matrices, we will focus exclusively on this use here.
Let's start with a silly example. We want to make a multiplication table (yes, the one you learned in primary school). We are going to use a nested loop and a matrix to save our result to an Excel file. Run the multiplication-table.do file in the script folder of this class, and pay attention to the comments that explain what is going on. After running the do file, check out the result in the results folder. You will be asked to use this skill for a more useful purpose in the homework of this class.
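Independently of multiplication-table.do, whose contents are not reproduced here, the general pattern can be sketched as follows: create an empty matrix, fill one row per iteration with stored results, and turn the matrix into a dataset at the end. The auto dataset, the matrix name results, and the column names are assumptions made for illustration.

```stata
* Illustrative only: one row of results per value of foreign (0 and 1)
sysuse auto, clear
matrix results = J(2, 3, .)                  // 2 rows x 3 columns, filled with missing
matrix colnames results = foreign mean_price n_obs
forvalues f = 0/1 {
    quietly summarize price if foreign == `f'
    local row = `f' + 1
    matrix results[`row', 1] = `f'
    matrix results[`row', 2] = r(mean)
    matrix results[`row', 3] = r(N)
}
matrix list results
clear
svmat results, names(col)                    // one variable per matrix column
* To export (hypothetical file name):
* export excel using "results.xlsx", firstrow(variables) replace
```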
A Brief Comment About Loops
Loops are great, and can save you a lot of lines of code by running the same operations for different groups of observations or variables. But do not overuse them! Remember that Stata has a very powerful command to do operations on groups of observations: egen. This will typically be faster too, as the internal algorithms Stata uses under the hood are far more efficient than loops coded into do files. So always stop for a second before using a loop and ask yourself: Could I do this using egen or bysort?
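To make the comparison concrete, the sketch below computes a group mean twice, once with a loop and once with bysort and egen, and checks that both give the same answer. The auto dataset and the variable names are illustrative assumptions.

```stata
* Illustrative only: the same group mean, computed two ways
sysuse auto, clear
* Loop version: one pass per group
gen mean_price_loop = .
levelsof foreign, local(groups)
foreach g of local groups {
    quietly summarize price if foreign == `g'
    quietly replace mean_price_loop = r(mean) if foreign == `g'
}
* egen version: a single line
bysort foreign: egen mean_price_egen = mean(price)
* Both versions should agree up to rounding
assert reldif(mean_price_loop, mean_price_egen) < 1e-6
```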
Conditional Execution: if, else, else if
Conditional execution is straightforward: Stata will run some lines of code when a condition is met, and another group of lines when it is not met. This can be extended to more cases using else if. Copy the following lines into a do file and run it:
forvalues x=1/30{
    if (mod(`x',2)==0){
        display "`x' is a multiple of 2"
    }
    else if (mod(`x',3)==0) {
        display "`x' is a multiple of 3"
    }
    else if (mod(`x',5)==0) {
        display "`x' is a multiple of 5"
    }
    else {
        display "`x' is not a multiple of 2,3 or 5"
    }
}
This simple example shows how conditional execution works. For each value, Stata first checks whether the condition in the first logical test is true. If it is, Stata runs the lines of code within the curly brackets that follow the test and moves on to the next iteration of the loop; that is, it does not check the remaining conditions (that is why it prints that 6 is a multiple of 2 and does not print that 6 is a multiple of 3, although it is true). If the condition is not true, it checks whether the next condition is true and follows the same procedure. If none of the conditions is true, it executes the code inside the curly brackets that follow else. The else case is optional: if it is absent, Stata simply does nothing for numbers that are not multiples of 2, 3, or 5.
There are not many cases where you need conditional execution in Stata; most of the time the conditional part of a command (the if qualifier) is enough. But in some cases (as in the first homework of this class), conditional execution can be very useful.
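The difference between the two uses of if can be sketched as follows. The if command decides once whether a whole block of code runs, while the if qualifier restricts a single command to some observations. The auto dataset and the switch use_logs are hypothetical and serve only as illustration.

```stata
* Illustrative only: use_logs is a hypothetical switch set at the top of a do file
sysuse auto, clear
local use_logs = 1
* if command: chooses which block of code is executed
if `use_logs' == 1 {
    gen y = ln(price)
}
else {
    gen y = price
}
* if qualifier: the command runs, but only on the observations that qualify
replace y = . if foreign == 1
```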
Statistical Tests in Stata
There is a large variety of statistical tests in Stata. The goal of this class is not to make a comprehensive review of all available tests, but to show you the basic syntax used to perform a statistical test. As an example, we check in the file ttest.do whether countries that had an income below the median income in 1970 are different from countries with an income above the median in 1970 in terms of the increase in CO2 emissions per dollar of GDP from 1970 to 2014. Please open the do file and pay attention to the comments that explain the syntax.
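Without reproducing ttest.do, the basic pattern of a two-group t-test can be sketched on the auto dataset that ships with Stata: split the sample at the median of one variable, then compare the mean of another variable across the two groups. The variables weight and mpg and the indicator heavy are illustrative assumptions.

```stata
* Illustrative only: weight plays the role of income and mpg the role of the outcome
sysuse auto, clear
quietly summarize weight, detail
gen heavy = weight > r(p50) if !missing(weight)   // 1 if above the median, 0 otherwise
ttest mpg, by(heavy)
* ttest leaves its results behind as stored results
display "Mean, group 0: " r(mu_1)
display "Mean, group 1: " r(mu_2)
display "Standard error of the difference: " r(se)
display "Two-sided p-value: " r(p)
```

These stored results are what you would copy into a matrix if you wanted to keep the outcome of several tests.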
You can download the solutions to these exercises here (please try to solve the exercises by yourself before looking at the solutions).
Homework 1
In the solution to exercise 4 of class 3, lines 70-94 go like this:
gen max_education = education if year==1979
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1980
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1981
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1982
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1983
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1984
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1985
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1986
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1987
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1988
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1989
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1990
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1991
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1992
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1993
replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==1994
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==1996
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==1998
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==2000
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==2002
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==2004
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==2006
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==2008
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==2010
replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==2012
This is an excellent example where a loop makes your do file more readable, easier to write, and easier to modify if needed. So, go ahead and transform these long lines of code into a compact loop that does the job. Use conditional execution to take care of the difference for years 1996 to 2012. One possible solution is:
gen max_education = education if year==1979
forvalues year=1980/2012{
    if (`year'<1996){
        replace max_education = (education>=l1.max_education)*education + (education<l1.max_education)*l1.max_education if year==`year'
    }
    else {
        replace max_education = (education>=l2.max_education)*education + (education<l2.max_education)*l2.max_education if year==`year'
    }
}
Homework 2
You have just finished the first year of your program, and you are anxious to put into practice what you have learned. You passed through the first selection filter to be a summer intern at the United Nations Office for Sustainable Development, and you receive the following email asking you to perform a brief empirical analysis for the final selection:
Dear Colleague:
Congratulations on passing the first selection for being a summer intern at the United Nations Office for Sustainable Development!
As you know, we are interested in candidates with a strong background in statistical analysis and an advanced command of Stata. We need you to send us as soon as possible a do file that downloads the necessary data from the World Bank's Open Data and runs, for each available year, a t-test on the difference in CO2 emissions per dollar of GDP between high and low income countries, as defined by the median per capita GDP of each year. The results of the t-test should be saved into an Excel file that shows, with one year per row, the average value for each group, the difference in means, the standard deviation of the difference, and the p-value of the test. Something like this:
Year | Average CO2 per GDP Dollar, High Income Countries | Average CO2 per GDP Dollar, Low Income Countries | Difference in Means | Standard Deviation Difference | P-Value |
---|---|---|---|---|---|
1960 | | | | | |
⋮ | | | | | |
2014 | | | | | |
Good luck, I hope to see you here this summer!
You almost got the job! You just need to write that do file quickly and send it. You should try to show in this brief exercise how much Stata you know, making use of global variables, local variables, temporary variables, loops, and matrices.
Use the homework-template.do file for some guidance on how to tackle this challenge.
You can download the solution to homework 2 here (please try to solve it by yourself before looking at the solutions).