Class 3: Wrapping Up What We Have Learned So Far
In this class, you will solve five short exercises that apply most of the commands we have learned so far. I do not expect you to finish all the exercises during the class, but try to advance as much as possible. The homework for this class is to finish any exercises you did not complete during class.
Set up
You are starting your first day at the World Bank after completing your graduate studies. Your first assignment is to analyze poverty dynamics in the US since the late 1970s, for which you are given access to a project folder you can download here.
After being ushered to your office, you notice you already have an email from your boss in your inbox:
Hi Colleague:
Welcome on board! You have been commissioned to work on an analysis of poverty dynamics in the US since the late 1970s.
Tomorrow I will introduce you to the rest of the team in our weekly meeting. It would be great if you could have some early results to share with them by then. Using the data from the NLSY79 survey available in the project folder, you should calculate the poverty rate, the probability that a poor person in year t-1 moves out of poverty in year t (conditional probability of moving out of poverty), and the probability that a non-poor person moves into poverty in year t (conditional probability of falling into poverty).
Besides general trends, we would also like to have the results by groups defined by gender and race (Hispanic males, Hispanic females, Black males, Black females, non-Hispanic non-Black males, non-Hispanic non-Black females), as well as by groups defined by educational achievement (high school dropouts, high school graduates, incomplete college, and college graduates).
Thank you, see you tomorrow!
Exercise 1: Importing and saving the data
In your project folder, all the data you need are in the data/original directory. These data were downloaded from the NLSY79 webpage and come in a Stata-friendly format. You should write a do-file that imports each dataset and labels the variables by running the do-file included in each folder. Take a look at one of these do-files to see how variables and value labels are assigned. Save each dataset in your data/processed directory. Inspect the files and ask your instructor if you have any questions.
Tips for Exercise 1:
- When importing the data, make sure that the variable names are exactly the same as those used in the labeling do file. Remember that Stata is case-sensitive! See the help page for the import delimited command to check how to manage capital letters in variable names.
- If you prefer variable names in lowercase, you can use the tolower command before saving your data. You need to install this command with ssc install tolower.
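As an illustration of the two tips above, here is a minimal sketch that runs on a toy CSV file rather than on the project data. The file name toy_nlsy.csv, the variable names R0000100 and R0000200, and the label text are made up for the example. In your own do-file you would point import delimited at the files in data/original and run the labeling do-file provided in each folder. The sketch uses Stata's built-in rename *, lower as one possible alternative to tolower.

```stata
* Toy example: the file, variable names and values are made up
clear all
tempname fh
file open `fh' using "toy_nlsy.csv", write text replace
file write `fh' "R0000100,R0000200" _n
file write `fh' "1,0" _n
file write `fh' "2,1" _n
file close `fh'

* import delimited converts names to lowercase by default;
* case(preserve) keeps them exactly as they appear in the file
import delimited using "toy_nlsy.csv", case(preserve) clear
describe

* Stand-in for the labeling do-file, which refers to uppercase names
label variable R0000100 "Respondent ID (toy)"

* Optional: switch to lowercase names before saving
rename *, lower
describe

* Remove the toy file
erase "toy_nlsy.csv"
```

If you lowercase the names, do it after running the labeling do-file, since that file expects the original capitalization.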
Exercise 2: Calculating the general trend
In this part, you have to write a do-file that calculates the poverty rate and both conditional probabilities (moving out of poverty and falling into poverty) for each year in the sample. The output of the do-file should be a .dta file and an Excel sheet that shows for each year the fraction of poor individuals, the probability that a non-poor person falls into poverty, and the probability that a poor person gets out of poverty.
Tips for Exercise 2:
- You will need to reshape the data to long format and declare it as longitudinal to identify when a person transitioned into or out of poverty.
- Drop observations with no information about poverty status (you can identify them when the poverty status variable is negative).
- You can use the collapse command to calculate the probabilities. Take advantage of the fact that collapse ignores missing values in its calculations. For example, if you create a variable that is equal to 1 when a person was not poor in t−1 and is poor in t, 0 when the person was not poor in both years, and missing when the person was poor in t−1, the mean of this variable will yield the conditional probability of falling into poverty given being non-poor in t−1.
- Note that individuals were surveyed every year between 1979 and 1994, and only in even years from 1996 to 2012. Take this into account when constructing your variables.
- Use the label command to give meaningful labels to your variables.
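The tips above can be sketched on a small toy dataset. The variable names (id, poor1993, and so on), the values, and the missing code -5 are assumptions made for the example, not the names in the NLSY79 files. Treating the previous interview as "t-1" (one year back through 1994, two years back from 1996 on) is one possible way to handle the change in survey frequency, not the only one.

```stata
* Toy data in wide format: one poverty variable per survey year (values made up)
clear
input id poor1993 poor1994 poor1996
1 0 1  1
2 1 0  0
3 0 0 -5
4 1 1  0
end

* Long format: one row per person-year
reshape long poor, i(id) j(year)

* Negative codes mean no information on poverty status
drop if poor < 0

* Declare the panel so that lag operators are available
xtset id year

* Previous interview: one year back up to 1994, two years back from 1996 on
gen poor_prev = L.poor if year <= 1994
replace poor_prev = L2.poor if year >= 1996

* 1/0 among those at risk, missing otherwise, so collapse skips them
gen fall_in  = (poor == 1) if poor_prev == 0
gen move_out = (poor == 0) if poor_prev == 1

collapse (mean) poverty_rate = poor fall_in move_out, by(year)

label variable poverty_rate "Fraction poor"
label variable fall_in      "P(poor in t | not poor at previous interview)"
label variable move_out     "P(not poor in t | poor at previous interview)"
list, noobs
```

In the first year of the panel both probabilities are missing because there is no earlier interview. To produce the requested output, you would follow the collapse with save and export excel, using file names of your choice.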
Exercise 3: Calculating trends for groups defined by race and gender
In this exercise, you are asked to repeat what you did for Exercise 2, but now the results should be calculated for each group defined by race and gender. The output of the do-file should be a .dta file and an Excel sheet that shows the results for each group in separate columns.
Tips for Exercise 3:
- Use the by() option of collapse to calculate the probabilities for each group.
- You will need to reshape wide after the collapse to present the results for different groups in separate columns.
- Use the group() function of the egen command to create a variable with a unique value for each group defined by the cross-product of race and gender. This will be required by reshape wide, as it only accepts one j variable.
- Use the label command to assign meaningful labels to your variables. Before the reshape wide, check the labels of the variable you created with egen.
- Order the variables in the dataset in a way that facilitates comparison between groups—that is, place the same variable for different groups in contiguous columns.
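Here is a minimal sketch of the collapse, reshape wide and order steps on toy data. The variable names race, female and poor and their codings are assumptions for the example. Only the poverty rate and a cell count are computed, to keep it short; the conditional probabilities would be added to the collapse and reshape in the same way.

```stata
* Toy person-year data (values and codes are made up)
clear
input year race female poor
1990 1 0 1
1990 1 0 0
1990 1 1 0
1990 2 0 1
1990 2 1 1
1990 2 1 0
1991 1 0 0
1991 1 1 1
1991 2 0 0
1991 2 1 0
end

* One identifier per race-by-gender cell; the label option records which is which
egen grp = group(race female), label

* Check which number corresponds to which group before reshaping
label list grp

collapse (mean) poverty_rate = poor (count) n_obs = poor, by(year grp)

* reshape wide accepts a single j variable, hence grp
reshape wide poverty_rate n_obs, i(year) j(grp)

* Put the same statistic for all groups in contiguous columns
order year poverty_rate* n_obs*

label variable poverty_rate1 "Poverty rate: race 1, male (toy coding)"
list, noobs
```

Without the order command, reshape wide interleaves the statistics (poverty_rate1, n_obs1, poverty_rate2, ...), which makes the groups harder to compare.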
Exercise 4: Calculating trends for groups defined by educational achievement
This exercise is similar to the previous one, but the educational achievement variable requires some additional adjustments. The output of your do-file should follow the same structure as in the last exercise.
Tips for Exercise 4:
- Rename the educational achievement variables before reshaping the data to long format so that they match the structure of the poverty status variables.
- When writing a very long command (as required for the renaming), you can change Stata's default delimiter so you can write the command across multiple lines. See how to do that here.
- The educational achievement data contains many missing values. A reasonable assumption (as a first approximation) is that if someone did not report their educational level in a given year, they still have the last reported level. To implement this, create a variable equal to the educational level if the value is non-negative (negative values indicate missing data; zero is a valid number of years of schooling), and equal to the last recorded educational level otherwise. Start by creating this variable using the 1979 data. Then, for 1980, update it with the 1980 value if available; otherwise, carry forward the 1979 value. Continue this logic for the following years (note that individuals are surveyed every other year after 1994).
- You need to convert the educational achievement data—represented as years of schooling—into a categorical variable representing education groups. Use the recode command to do this. The mapping is as follows:
- 0–11: High School Dropouts
- 12: High School Graduates
- 13–15: Incomplete College
- 16 or more: College Graduates
- The educational achievement variable is missing for all observations in 2012. Drop that year from your analysis.
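The renaming, carry-forward and recode steps can be sketched on toy data as follows. The names raw_a, raw_b and raw_c are hypothetical stand-ins for the original variable names, and only three years are included. With the real data the loop would run over all survey years (for example, a numlist such as 1980/1994 1996(2)2010). Because it uses #delimit, the sketch has to be run from a do-file rather than typed into the Command window.

```stata
* Toy data: raw_a, raw_b, raw_c are hypothetical stand-ins for the original
* variable names; values are years of schooling, negative = missing
clear
input id raw_a raw_b raw_c
1 11 -5 12
2 12 -4 -4
3 -5 14 16
end

* A long rename split across lines (#delimit only works inside a do-file)
#delimit ;
rename (raw_a raw_b raw_c)
       (educ1979 educ1980 educ1981) ;
#delimit cr

* 1979: keep the reported level if it is a valid (non-negative) value
gen educ_fill1979 = educ1979 if educ1979 >= 0

* Later years: use the reported level if valid, otherwise carry forward
local prev 1979
foreach y of numlist 1980 1981 {
    gen educ_fill`y' = cond(educ`y' >= 0 & !missing(educ`y'), educ`y', educ_fill`prev')
    local prev `y'
}

* Keep only the filled-in series and go to long format
drop educ1979 educ1980 educ1981
reshape long educ_fill, i(id) j(year)

* Years of schooling to education groups; missing values stay missing
recode educ_fill (0/11 = 1 "High school dropouts") ///
                 (12 = 2 "High school graduates") ///
                 (13/15 = 3 "Incomplete college") ///
                 (16/max = 4 "College graduates"), generate(educ_group)

list, sepby(id) noobs
```

Person 3 has no valid report in 1979, so the group is missing for that year; there is nothing to carry forward until the first valid report.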
Exercise 5: Comparing the general trend between the NLSY79 and the NLSY97 samples
Good job! You arrive at your meeting with all the information that was requested and make an excellent first impression. During the meeting, a former classmate raises an insightful concern: do these numbers reflect a general trend in the U.S. economy, or are they driven by the life cycle of the NLSY79 generation? To partially address this question, you propose running the same analysis using the NLSY97 sample and comparing the results for the years in which both samples overlap. If the numbers are similar, it is likely that they reflect broader economic trends. If they differ—and if the NLSY97 figures resemble what the NLSY79 generation experienced at similar ages—then your original results likely reflect life-cycle dynamics.
Write a do-file that calculates the general trend for the NLSY97 survey (download the data here) and merges the output with the results you calculated for the NLSY79 survey.
Tips for Exercise 5:
- The poverty variable in the NLSY97 survey differs from the one in the NLSY79 survey. Instead of a binary poor/non-poor indicator, it reports the ratio of family income to the poverty line. A value over 100 indicates the person is above the poverty line, and a value below 100 indicates poverty. You will need to transform this variable to calculate the poverty rate and conditional probabilities.
- Use the label command to assign meaningful labels to your variables and clearly indicate which survey they come from.
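A minimal sketch of the transformation and the merge, again on toy data. The variable name pov_ratio, all the numbers, and the use of negative codes for missing values in the NLSY97 file are assumptions for the example. The text above does not say how a ratio of exactly 100 should be classified; the sketch counts it as non-poor.

```stata
* Toy NLSY79 results by year (numbers are made up)
clear
input year poverty_rate79
1998 .12
2000 .11
end
label variable poverty_rate79 "Poverty rate (NLSY79)"
tempfile res79
save `res79'

* Toy NLSY97 person-year data: income as a percentage of the poverty line
clear
input id year pov_ratio
1 1998 85
1 1999 150
2 1998 230
2 1999 -4
end

* Poor if below 100; missing when the ratio is missing or a negative code
gen poor = (pov_ratio < 100) if pov_ratio >= 0 & !missing(pov_ratio)

collapse (mean) poverty_rate97 = poor, by(year)
label variable poverty_rate97 "Poverty rate (NLSY97)"

* Combine with the NLSY79 results and keep the years present in both
merge 1:1 year using `res79'
keep if _merge == 3
drop _merge
list, noobs
```

Giving the variables survey-specific names before the merge (here the suffixes 79 and 97) avoids name clashes and makes the source of each column explicit.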
You can download the solutions to these exercises here (please try to solve the exercises by yourself before looking at the solutions).