Class 3: Wrapping Up What We Have Learned so Far
In this class you will solve five short exercises that apply most of the commands we have learned so far. I do not expect you to finish all the exercises during this class, but try to advance as much as possible. The homework for this class is to finish the exercises you did not complete during the class.
Set up
You are starting your first day at the World Bank after completing your graduate studies. Your first assignment is to analyze poverty dynamics in the US since the late 1970s, for which you are given access to a project folder you can download here.
After being ushered to your office, you notice you already have an email from your boss in your inbox:
Hi Colleague:
Welcome on board! You have been commissioned to work on an analysis about poverty dynamics in the US since the late 1970's.
Tomorrow I will introduce you to the rest of the team in our weekly meeting. It would be great if you could have some early results to share with the team by then. Using the data from the NLSY79 survey available in the project's folder, you should calculate the poverty rate, the probability that a person who was poor in year t-1 moves out of poverty in year t (conditional probability of moving out of poverty), and the probability that a person who was not poor in year t-1 moves into poverty in year t (conditional probability of falling into poverty).
Besides general trends, we would also like to have the results by groups defined by gender and race (Hispanic males, Hispanic females, Black males, Black females, non-Hispanic non-Black males, non-Hispanic non-Black females), as well as by groups defined by educational achievement (high school dropouts, high school graduates, incomplete college, and college graduates).
Thank you, see you tomorrow!
Exercise 1: Importing and saving the data
In your project folder, all the data you need is in the data/original directory. This data has been downloaded from the NLSY79 webpage and comes in a Stata-friendly format. You should write a do file that imports each database and labels the variables by running the labeling do file found in the folder of each database. Take a look at one of these do files to see how to label variables and variables' values. Save each database in your data/processed directory. Inspect the databases and ask your instructor if you have any doubts.
Tips for exercise 1:
- When importing the data, make sure that the names of the variables are exactly the same as those in the do file that labels the variables. Remember that Stata is case sensitive! See the help page for the import delimited command to check how to deal with capital letters in variable names when importing the data.
- If you don't like having variable names with capital letters, you can use the tolower command before saving your data. You have to install this command with ssc install tolower.
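As a rough sketch of the workflow described above, a do file for one database might look like the following. The file and folder names (poverty.csv, poverty-labels.do) are hypothetical placeholders, not the actual names in the project folder, so replace them with what you find in data/original. The last step uses the built-in rename group syntax (Stata 12 or later) as an alternative to tolower.

```stata
* Sketch only: file names below are hypothetical placeholders.

* Keep the capitalization of the variable names as they appear in the file,
* so they match the names used in the labeling do file
* (by default, import delimited reads names as lowercase).
import delimited using "data/original/poverty/poverty.csv", case(preserve) clear

* Run the do file that labels the variables and their values
do "data/original/poverty/poverty-labels.do"

* Optional: convert all variable names to lowercase once labels are applied
rename *, lower

* Save the labeled database in the processed folder
save "data/processed/poverty.dta", replace
```

Running the labeling do file before lowercasing matters: the labels are attached by variable name, so the names must still match at that point.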
Exercise 2: Calculating the general trend
In this part you have to write a do file that calculates the poverty rate and both conditional probabilities (out of poverty and into poverty) for each year in the sample. The output of the do file should be a .dta file and an Excel sheet that show, for each year, the fraction of poor, the probability that a non-poor person falls into poverty, and the probability that a poor person gets out of poverty.
Tips for exercise 2:
- You will need to reshape the data to long and declare it as longitudinal to identify when a person transitioned to poverty or came out of poverty.
- Drop observations with no information about the poverty status (you can identify them because the poverty status variable is negative).
- You can use the collapse command to calculate the probabilities. Exploit the fact that collapse ignores missing values in its calculations. So, if you create a variable that is equal to 1 when a person was not poor in t-1 and is poor in t, 0 when a person was not poor in t-1 and is not poor in t, and missing when a person was poor in t-1, the mean of this variable calculated with collapse yields the conditional probability of falling into poverty given not being poor in t-1.
- Note that individuals are surveyed every year between 1979 and 1994, and only in even years from 1996 to 2012. Consider this when constructing your variables.
- Use the label command to give meaningful labels to your variables.
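The tips above can be tried out on a toy dataset before touching the real data. The sketch below is one possible approach, not the official solution: the variable names (id, poor1993, ...) and the made-up values are assumptions for illustration. It handles the uneven spacing of survey years by declaring the panel over a survey-wave counter instead of the calendar year, so that the lag operator refers to the previous survey.

```stata
* Toy data in wide form (made-up values): 1 = poor, 0 = not poor,
* negative = no information. Names are hypothetical.
clear
input id poor1993 poor1994 poor1996
1  0  1  1
2  1  0  0
3  0 -3  1
end

* One row per person and survey year
reshape long poor, i(id) j(year)

* Wave counter (1, 2, 3, ...) built before dropping anything,
* so every survey year keeps its own position
egen wave = group(year)

* Drop observations without poverty information
drop if poor < 0 | missing(poor)

* Declare the panel over waves: L.poor is the status in the previous
* survey, and is missing if that observation was dropped
xtset id wave

* Equal to 1/0 only for people who were not poor (or poor) in the previous
* survey, and missing otherwise, so collapse ignores everyone else
generate into_pov = (poor == 1) if L.poor == 0
generate out_pov  = (poor == 0) if L.poor == 1

collapse (mean) pov_rate = poor into_pov out_pov, by(year)

label variable pov_rate "Fraction poor"
label variable into_pov "P(poor in t | not poor in previous survey)"
label variable out_pov  "P(not poor in t | poor in previous survey)"

list
```

With this design, transitions after 1994 compare surveys two years apart, which is worth stating when you present the results. To produce the required output you could finish with save and with export excel using "yourfile.xlsx", firstrow(variables) replace, pointing both at paths of your choice.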
Exercise 3: Calculating trends for groups defined by race and gender
This exercise asks you to repeat what you did for exercise 2, but now the results should be calculated for each group defined by race and gender. The output of the do file should be a .dta file and an Excel sheet that show the results for each group in separate columns.
Tips for exercise 3:
- Use the by() option of collapse to calculate the probabilities for each group.
- You will need to reshape wide after the collapse to present the results for different groups in different columns as required.
- You can use the group() function of the egen command to create a variable with a unique value for each group defined by the cross-product of race and gender. This will be required by reshape wide, as it accepts only one j variable.
- Use the label command to give meaningful labels to your variables. To do so, check the value labels of the variable you created with egen before running reshape wide.
- Order the variables in the database in a way that facilitates comparison between groups, that is, the same variable for different groups should be in contiguous columns.
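The sequence of steps in these tips can be sketched on a small made-up long panel. All names (race, sex, into_pov, out_pov, grp) and values are assumptions for illustration; in the real data, race and gender carry value labels, which the label option of egen group() combines into the labels of the new variable.

```stata
* Toy long panel (made-up values); transition variables are missing
* when the person is not at risk of that transition
clear
input id year race sex poor into_pov out_pov
1 1993 1 1 0 . .
1 1994 1 1 1 1 .
2 1993 1 2 1 . .
2 1994 1 2 0 . 1
3 1993 2 1 0 . .
3 1994 2 1 0 0 .
end

* One identifier per race-gender combination, because reshape wide
* accepts a single j variable
egen grp = group(race sex), label

* Check which number corresponds to which group BEFORE reshaping
label list

collapse (mean) pov_rate = poor into_pov out_pov, by(year grp)

* One column per indicator and group: pov_rate1, pov_rate2, ...
reshape wide pov_rate into_pov out_pov, i(year) j(grp)

* Same indicator for different groups in contiguous columns
order year pov_rate* into_pov* out_pov*

* Label each column using the mapping shown by label list
label variable pov_rate1 "Fraction poor, group 1"

list
```

The label on pov_rate1 is a placeholder: replace "group 1" with the group name you read from label list, and repeat for the remaining columns.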
Exercise 4: Calculating trends for groups defined by educational achievement
This exercise is similar to the previous one, but the educational achievement variable requires some additional adjustments. The output of your do file is analogous to the one required in the last exercise.
Tips for exercise 4:
- You will have to rename the variables of educational achievement before reshaping the data to long, so they resemble the structure of the poverty status variables.
- When you need to write a very long command (as is the case with the required renaming), you can change the default delimiter Stata uses to tell when a command ends, so you can write your command in multiple lines (the default delimiter is a new line). Check out how to do that here.
- The educational achievement data has many missing values. It is reasonable (to a first approximation) to assume that if in a given year a person did not report her educational level, then she has the last educational level reported. To implement this, create a variable that is equal to the educational level if this variable is non-negative (negative values indicate that the data is not available) and equal to the last registered educational level otherwise. You can start by creating a variable equal to the educational level only for 1979. Then, for 1980, set this variable to the educational level of 1980 if that value is non-negative, and otherwise use the value of the created variable in the previous survey year (1979). Then proceed in the same fashion for the following years (remember that individuals are surveyed every other year after 1994).
- You need to go from the educational achievement data, which are integers representing years of schooling, to a categorical variable that represents educational groups. To do this, use the recode command. The map from years of schooling to educational groups is the following:
- 0-11: High School Dropouts
- 12: High School Graduates
- 13-15: Incomplete College
- 16 or more: College Graduates
- The educational achievement variable is missing for all observations in 2012. Drop that year from your analysis.
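A minimal sketch of the adjustments described in these tips, run on made-up data, is shown below. The raw variable names (hgc_a, hgc_b, hgc_c) and the new names (educ, educ_fill, educ_group) are hypothetical, and only three consecutive years are used to keep the example short. Note that #delimit only works inside a do file, not in the interactive command window.

```stata
* Toy wide data (made-up values): years of schooling, negative = not available
clear
input id hgc_a hgc_b hgc_c
1 10 -5 12
2 12 13 -4
3 -3 -3 16
end

* Long command written over several lines with ; as the delimiter
#delimit ;
rename (hgc_a hgc_b hgc_c)
       (educ1979 educ1980 educ1981);
#delimit cr

* Carry the last reported level forward, year by year.
* For the real data the numlist would be: 1980/1994 1996(2)2010
generate educ_fill1979 = educ1979 if educ1979 >= 0 & !missing(educ1979)
local prev 1979
foreach y of numlist 1980 1981 {
    generate educ_fill`y' = ///
        cond(educ`y' >= 0 & !missing(educ`y'), educ`y', educ_fill`prev')
    local prev `y'
}

* Long form, one row per person and year
keep id educ_fill*
reshape long educ_fill, i(id) j(year)

* Years of schooling to educational groups
recode educ_fill (0/11  = 1 "High School Dropouts")  ///
                 (12    = 2 "High School Graduates") ///
                 (13/15 = 3 "Incomplete College")    ///
                 (16/max = 4 "College Graduates"),   ///
                 generate(educ_group)

list
```

A person with no report in any earlier year keeps a missing value until the first valid report, which is the case for id 3 in the first two years of this toy data.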
Exercise 5: Comparing the general trend between the NLSY79 and the NLSY97 samples
Good job! You arrived at your meeting with all the information that was required and made an excellent first impression. During the meeting, a former classmate from your graduate program raised a smart concern about your analysis: do these numbers reflect a general trend in the US economy or the life cycle of the NLSY79 generation? To partially address this point, you said you would run the same analysis with the NLSY97 sample and compare the results for the years in which both samples overlap. If they are very similar, it is probably the case that your analysis is reflecting the general trend in the US economy. If they are very different, and they are actually similar to the figures that the NLSY79 generation had when they were young, it is likely that your analysis is reflecting the dynamics of poverty throughout the life cycle.
Write a do file that calculates the general trend for the NLSY97 survey (download the data here) and merges this result with the results you calculated for the NLSY79 survey.
Tips for exercise 5:
- The poverty variable in the NLSY97 survey is different from the one in the NLSY79 survey. Instead of a dichotomous poor/non-poor variable, this variable indicates the ratio between family income and the poverty line. A value over 100 is above poverty, while a value below 100 is below poverty. You will have to transform this variable to calculate the poverty rate and the conditional probabilities.
- Use the label command to give meaningful labels to your variables and distinguish which variables come from which survey.
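The transformation and the merge can be sketched with two toy datasets. Everything here is illustrative: the variable names (pov_ratio, pov_rate79, pov_rate97) and all numbers are made up, a ratio of exactly 100 is treated as not poor, and negative ratios are assumed to mean that no information is available. Check both assumptions against the NLSY97 documentation.

```stata
* Toy stand-in for the NLSY79 trend from exercise 2 (made-up numbers)
clear
input year pov_rate79
1998 0.10
2000 0.12
end
label variable pov_rate79 "Fraction poor (NLSY79)"
tempfile trend79
save `trend79'

* Toy NLSY97 long panel: ratio of family income to the poverty line, in percent
clear
input id year pov_ratio
1 1998  80
1 2000 150
2 1998 230
2 2000  -4
3 1998  95
3 2000  60
end

* Dichotomous poverty status; missing when the ratio is not available
generate poor = (pov_ratio < 100) if pov_ratio >= 0 & !missing(pov_ratio)

collapse (mean) pov_rate97 = poor, by(year)
label variable pov_rate97 "Fraction poor (NLSY97)"

* Put both surveys side by side, one row per year
merge 1:1 year using `trend79'

* Keep only the years covered by both surveys
keep if _merge == 3
drop _merge

list
```

The suffixes 79 and 97 in the variable names, together with the labels, keep the source of each column clear after the merge. The conditional probabilities can be added in the same way once poor is built.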
You can download the solutions to these exercises here (please try to solve the exercises by yourself before looking at the solutions).