Class 2: Manipulating your Data
In this class, we will learn how to manipulate data. Although this is clearly the less glamorous part of an empirical analysis, it is just as important as the analysis itself — because there can be no analysis without data. In your professional life, you have probably encountered raw data that could have been very useful for informing decision-makers, if someone had gone through the process of standardizing the data and putting the different sources together to build a useful dataset. Feeling comfortable with that process will open the door to an almost infinite set of exciting data sources waiting to be analyzed, without needing an army of RAs crunching data in Excel through an irreproducible process. You are on the right track to master this process, as Stata has powerful tools to go through it without losing your mind along the way.
You will be introduced to the main tools for manipulating data through three interconnected short in-class exercises and a homework. All these exercises will build on top of each other to complete the processing of real data on bills paid by Chilean hospitals to their suppliers, from which we will obtain relevant information to assess the extent of the problem of late payment in this context.
Set up
Imagine you are a young professional at the Chilean Office of Budget Planning. You work in a recently funded task force devoted to promoting prompt payment throughout the public sector. You turn on your computer on a Monday morning after a refreshing weekend and receive the following email from your counterpart in the Ministry of Health:
Dear Colluegue:
The Minister of Health read your last report on the relative performance of different agencies on prompt payment to suppliers with great interest. As you know, public hospitals have had a hard time paying their suppliers on time, and he would like more information from your task force to understand the potential consequences of late payment and how to improve hospital performance on this dimension. Could you please send us a database that shows, for each hospital and available year, the average number of days they take to pay their suppliers and the fraction of bills paid after the limit established by law (60 days for EAR hospitals and 30 days for the rest)? Please also provide the results for small, medium, and large suppliers, as defined by the tertiles of the distribution of total sales to hospitals. Finally, please include a column showing, for each hospital-year, its relative rank compared to other hospitals in that year, using the fraction of bills paid late as the standard (the smaller, the better), and whether the hospital improved, maintained, or dropped its position compared to the previous year.
The table should look like this:
Hospital | Year | EAR | Rank | Change in Rank | Average Payment (Days) | Late Payments (Fraction) | Average Payment, small suppliers | Average Payment, medium suppliers | Average Payment, large suppliers | Late Payments, small suppliers | Late Payments, medium suppliers | Late Payments, large suppliers |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Hospital A | 2010 | |||||||||||
Hospital A | 2011 | |||||||||||
⋮ | ⋮ |
Thank you. We really appreciate your efforts to improve prompt payment throughout the government.
All the best,
P.S.: EAR stands for Establecimiento Autogestionado en Red, which translates to “in-network self-managed establishment.”
Road Map
Download the data for this class. You have all the data needed to achieve your goal in the data/original directory of the folder for this class, which contains data for a sample of hospitals. In broad terms, this is what we need to do to achieve our goal:
- Stack related databases:
- Vertically combine the information on bills from different hospitals.
- Reshape the CPI data to resemble the structure of the bill data.
- Extract month and year from dates in both databases.
- Horizontally combine bill data with:
- The CPI series constructed above.
- Information on hospitals to identify which are EAR.
- Build the final database:
- Create a new database that identifies small, medium, and large suppliers.
- Horizontally combine the bill database with the one above that identifies small, medium, and large suppliers.
- Identify bills that were paid late.
- Collapse and reshape the database to obtain the desired result.
- Rank hospitals in each year and note whether their position improved, remained the same, or declined.
Wow! This looks like a lot. Do not feel overwhelmed by this task — we will chop it into small pieces and see that, with Stata, this is not difficult at all!
In the scripts folder for this class, there are four do-files. The files exercise1-template.do, exercise2-template.do, and exercise3-template.do provide templates where you can code the answers to each of the exercises in this class, which correspond to the three main steps indicated above.
Exercise 1
Open the file exercise1-template.do and pay attention to your instructor.
Exercise 2
Open the file exercise2-template.do and pay attention to your instructor.
Exercise 3
Open the file exercise3-template.do and pay attention to your instructor.
Bonus
Run the three do-files in consecutive order using the file master-template.do. You will need to edit this file by inserting the paths to your do-files.
You can download the solutions to these exercises here (please try to solve the exercises by yourself before looking at the solutions).
Homework 1
You’ve had an intense afternoon of Stata coding and feel very satisfied with your work. Just a second before sending the email to your counterpart at the Ministry of Health, you receive a new email from him:
Dear Colluegue:
I hope you haven't started building the database yet! I forgot to attach the information from three hospitals we received yesterday (you can download it here). Could you please include these hospitals?
Also, I would like to try another ranking. Hospitals with poor performance may feel discouraged by the ranking I proposed, so I discussed another option with my team that we liked better. The idea is to calculate the change (in percentage points) in the average number of days to payment for each hospital-year. So, for example, if a hospital paid its average bill in 2011 and 2012 in 30 and 27 days respectively, its score in 2012 will be -0.1 ([27–30]/30). The lower the score, the better the ranking (lower numbers are better). Please add this ranking and indicate whether the hospital’s position improved, remained the same, or worsened—without dropping the previous ranking.
I hope this doesn’t add too much work to your evening.
All the best,
Other people might have their dinner plans ruined by this new email—but not you. As an experienced professional working with data, you have the entire procedure coded in do-files. Adding the additional hospitals just means repeating a few lines of code, and the new ranking will require only a few more. This highlights one of the main advantages of doing your work in do-files: if you need to run a similar analysis, you don’t need to start from scratch. Go ahead and send the database before your counterpart changes his mind and sends another email!
To add the new datasets, include the following lines after the database for the fifth hospital has been saved in the do-file for Exercise 1 (make sure to move the new data files into the data/original/bills folder):
* Sixth hospital
insheet using "data/original/bills/Hospital F.csv", clear
save "data/processed/F.dta", replace
* Seventh hospital
insheet using "data/original/bills/Hospital G.csv", clear
save "data/processed/G.dta", replace
* Eighth hospital
insheet using "data/original/bills/Hospital H.csv", clear
save "data/processed/H.dta", replace
Then add the following lines after the data for Hospital A has been appended in the same do-file:
append using "data/processed/E.dta"
append using "data/processed/G.dta"
append using "data/processed/F.dta"
To add the new ranking, include the following lines of code after declaring the dataset as longitudinal (xtset) in the do-file for Exercise 3:
gen score = (payment_delay - l1.payment_delay) / l1.payment_delay
egen rank2 = rank(score) if score != ., by(year)
gen rank2_improved = rank2 < l1.rank2 if l1.rank2 != .
replace rank2_improved = -1 if rank2 > l1.rank2 & l1.rank2 != .
Finally, create labels for these new variables and include them in the order command at the end.
Homework 2
A couple of weeks after finishing the database, you receive a call from the director of Cenabast, a Chilean public agency that acts as a retailer for drug purchases. They are worried that hospitals might be paying them particularly late because they do not have strong legal mechanisms to enforce prompt payment. You have been given this data, which shows which suppliers were selected in Cenabast's auctions between 2008 and 2015. You should combine this information with the data you already have to produce a table that shows, for each hospital-year:
- the average time hospitals took to pay Cenabast,
- the fraction of bills paid late to Cenabast,
- and the same two figures for the rest of the suppliers.
The group of suppliers considered for each year should be restricted to those that were selected in a Cenabast auction for that year. So, for example, if in 2011 Cenabast selected only suppliers X and Y, we want to include bills to suppliers X and Y when calculating both the Cenabast and non-Cenabast figures for 2011. This ensures we are comparing the same set of suppliers (and hopefully similar products).
The final table should look like this:
Hospital | Year | EAR | Average Payment to Cenabast (Days) | Late Payments to Cenabast (Fraction) | Average Payment to Other Suppliers (Days) | Late Payments to Other Suppliers (Fraction) |
---|---|---|---|---|---|---|
Hospital A | 2010 | |||||
Hospital A | 2011 | |||||
⋮ | ⋮ |
Keep in mind the following points to complete the homework:
- The supplier code for Cenabast is
61608700
. - In the Cenabast auction data, each row represents a supplier selected in an auction. Use the
auction_date
variable to determine the auction year. - The supplier ID in this file has a different name and structure. You will need to use the
split
command to standardize the IDs across datasets before merging, and userename
to align the variable names.
You can download the solution to this exercise here (please try to solve the exercise by yourself before looking at the solution).