In this tutorial we are going to review the five major steps in the process of analyzing data. Then we will look at some tips for handling data and data files. Finally, using the polling station list from Module 2, we will run through a practical exercise where we use what you learned in Module 1 to summarize the "Registered_Voters" variable.
Take some time to think about why you are looking at the dataset. Think about what you expect to see and what you do not expect to see in the data. Force yourself to be specific about your purpose and that will help you be much more efficient about the following steps in the process. For example, here are some of the questions we might want to answer by looking at the polling station list from Module 2:
Now that you have thought about the reason you are starting the data analysis process, it's time to get the data you need. Remember to save an untouched copy of the data somewhere before you begin to do anything with the data.
Experienced data analysts know that 80% of the entire data analysis process is actually spent converting, cleaning and/or reformatting data. First, you should look through the data and try to determine if data appears to be missing or if there appear to be any mistakes in the data. Then, you often need to 'wrangle' the data by making the structure more consistent. For example, some or all the variables for each row might be merged into the same column (e.g., the data for "city" and "province" might be included in the same column, so you would need to separate the city and province data into two separate columns). The data may not be in a format that is analyzable. Put another way, the data may not be "machine readable," so you may need to figure out how to convert it. Recall from Module 2 that the 2008 polling station data was originally stored in several PDF files. We converted the information in those files into a machine-readable format for you (e.g., CSV and XLS files). It could be that you need to merge or join the data from two different datasets into one before you can analyze it.
For each variable of interest, you want to be able to answer: what is typical for the data, and what is the spread of the data? In Module 1 we covered how to calculate what was typical (i.e., center or average) for the animal height variable and how the data was spread using the standard deviation.
The data analysis process is usually an iterative process. Once you finish your first pass at summarizing some variables, you may realize that you need to wrangle the data (e.g., merge it with another dataset) before you can do further analysis. Summarizing the data may lead you to a new purpose for analysis which, in turn, takes you back through the process.
In order to keep this module short and digestible, we are going to skip right to Step 4 in the data analysis process. We are going to assume that you know your specific questions and purpose, you have the data, and you have it in the right format (lucky you). Before we jump into the practical exercise, let's review some tips for handling data.
Now that you know how to practice good data hygiene, let's re-examine that polling station list from Module 2.
In the second module of Election Data Academy, we examined a polling station list (in XLS format). In theory, the file is supposed to contain the list of all polling stations in a country.[1] Open that file and take another look at the data. As you scroll through it, ask yourself these questions:
The first row of the file contains the names of the variables. There are six variables in this dataset: PS_Code, PS_Name, Region_Name, District_Name, Constituency_Name, and Registered_Voters.
Next, let's figure out how many observations there are in this dataset. To do this we can either scroll to the bottom and subtract 1 (for the row with the variable names):
Or, we can highlight a column (e.g., one of the variables, like registered voters) and look at the lower right of screen to see the number after "Numerical Count." The "Numerical Count" does just what you would guess: it only counts the numbers and ignores any text.
If, for some reason, "Numerical Count" is not listed on that bottom status bar, right click on the bar and choose "Numerical Count" as one of the statistics to display.
There are 20,928 observations for the variable called "Registered_Voters." Next, we're going to focus on that Registered Voters variable.
Column F contains the number of registered voters for each polling station (i.e., row). Before we summarize the Registered Voters variable, let's take a moment to think about what we expect to see. Ask yourself some questions, such as:
If you highlight the column with the "Registered_Voters" variable (Column F), you can have Excel automatically calculate and show certain values for the highlighted/selected cells in the lower right status bar. Again, you may need to right click on this bar and customize the statistics it shows by checking each one.
In this 2008 Polling Station List, you can see that the average number of registered voters for all of the polling stations is 547.8 voters. You can also see that the total number of cells filled in (i.e., the "Count") is 20,929. Excel can also show you that the minimum value for the highlighted cells is 1 and the maximum value is 55,515. Excel has an individual formula to calculate each of these values as well. You can create a new worksheet tab and put in the individual functions for average, minimum, maximum, count, and standard deviation. We did this for you with the heights data. Note that the standard deviation is not one of the options in that lower status bar, but you can enter the formula in another tab.
If you have the "Analysis Toolpak" installed[2], go to the Data tab where you should see "Data Analysis" listed on the far right.
From that menu, click on "Data Analysis," which will bring up a new window.
In the list under "Analysis Tools," choose "Descriptive Statistics" and then click "OK." It will bring up a new menu. In that new menu, click the box to the right of "Input Range:"
Now you will be able to select the range you want to describe. We are interested in the Registered Voters variable which is in column F. So, we select the top of column F to select the whole column.
Then, click back on that box to the right of input ().
It will bring you back to the "Descriptive Statistics" window. In that window, make sure to click the checkbox next to "Labels in First Row" and click the checkbox next to "Summary Statistics." Also, make sure "New Worksheet Ply" is selected so that it puts your statistics in a new sheet. Then click "OK."
Now it will show you the new tab with all of the summary statistics for Registered Voters.
Given what you now know about the average, standard deviation and range, how would you describe the Registered Voter variable? Do you think the distribution is spread out a lot or do you think it is tightly clustered?
Now that you have the summary statistics for the Registered Voters variable, how would you answer the below questions? How do the answers compare to your expectations from the beginning?