- Data from the Hubble Space Telescope.
- Dataset about the wages of economically active people in the USA.
Hubble data
Let’s start with the Hubble data. First we will talk about how to read a simple csv file and build the data: Let’s start with data from the Hubble Space Telescope, one of the most famous telescopes. The data is very simple. The file is called hubble_data.csv. You can even open it in Microsoft Excel or OpenOffice. Here is how it will look in these programs:Data in CSV format. This is a very popular format primarily because of its simplicity. You can open it in any text editor. Try it. You will see that CSV files are just different values separated by a comma (which is actually what the name implies – comma-separated values). This is the main reason for the popularity of the format. It does not require any special software, not even Excel. It also means that the data can be read in 10 years, when new versions of spreadsheets appear. Let’s get started. Open an instance of Ipython (Jupyter) and run the following code.
import pandas as pd
import matplotlib.pyplot as plt
%pylabinline
This imports pandas
, the main library in Python for analyzing data. It also imports matplotlib for plotting. %pylan inline
is an Ipython command that allows you to use graphs in your work.
data = pd.read_csv("hubble_data.csv")
data.head()
Pandas makes life much easier. You can read a csv file with one function: read_csv()
. Now you can call head()
to output the first five lines.
distance | distance_velocity | |
---|---|---|
0 | 0.032 | 170 |
1 | 0.034 | 290 |
2 | 0.214 | -130 |
3 | 0.263 | -70 |
4 | 0.275 | -185 |
Pandas is a pretty smart library. This is shown by the fact that it understands that the first line of the file is the header. This is what the first 3 lines of the CSV file look like:
distance,velocity
.032,170
.034,290
Now you can see that the header at the top is really there. It names two columns: distance
and recession_velocity
.
But what to do if there is no header? You can read the file by manually specifying the headers. There is one more file hubble_data_no_headers.csv without headers. It is not different from the previous one except for the missing headers. Here’s how to read such a file:
headers = ["dist","rec_vel"]
data_no_headers = pd.read_csv("hubble_data_no_headers.csv", names=headers)
data_no_headers.head()
Here you declare your own headers(headers
). They have different names(dist
and rec_vel
) to make it obvious that it is a different file. The data is read in the same way, but this time new variables names=headers
are passed. This tells Pandas to use them, because there are no headers in the file. Then the first five lines are printed.
dist | rec_vel | |
---|---|---|
0 | 0.032 | 170 |
1 | 0.034 | 290 |
2 | 0.214 | -130 |
3 | 0.263 | -70 |
4 | 0.275 | -185 |
Pandas only allows you to see one column:
data_no_headers["dist"]
0 0.032
1 0.034
2 0.214
30.263
4 0.275
Now that you have the data, you need to plot it. The easiest way to accomplish this is to get rid of indices. Pandas adds numbers by default (just like Excel). If you look at the data structure, you will see that the left line has values 0,1,2,3,4....
If you replace the numbers with distance
, then graphing becomes even easier. distance
becomes the x-axis, and velocity
becomes the y-axis. But how to replace the indices?
data.set_index("distance", inplace=True)
data.head()
distance | distance |
---|---|
0.032 | 170 |
0.034 | 290 |
0.214 | -130 |
0.263 | -70 |
0.275 | -185 |
If you compare with the last example, you can see that the numbers are gone. Moreover, the data are now arranged in an x – y relation. Creating a graph is now even easier:
data.plot()
plt.show()
Payroll data
Now payroll data. This example builds on the previous one and shows how to add custom headers, work with indented files, and extract columns from the data: This example is more complicated.
import pandas as pd
import matplotlib.pyplot as plt
%pylabinline
data = pd.read_csv("wages_hours.csv")
data.head()
Everything is as before. We just need to read the file, don’t we? But the result is as follows:
HRS RATE ERSP ERNO NEIN ASSET AGE DEP RACE SCHOOL | |
---|---|
0 | 2157t2.905t1121t291t380t7250t38.5t2.340… |
1 | 2174t2.970t1128t301t398t7744t39.3t2.335… |
2 | 2062t2.350t1214t326t185t3068t40.1t2.851… |
3 | 2111t2.511t1203t49t117t1632t22.4t1.159… |
4 | 2134t2.791t1013t594t730t12710t57.7t1.22… |
It looks incomprehensible. And it doesn’t look like the original file at all. So what happened? There are no commas in the CSV file Although the name implies “Comma Separated Values”, the data can be separated by anything. For example, it could be indented. t
in the text means indentations. Pandas can’t parse the file because the library counted on commas, not indents. You have to read the file again, this time passing a new variable sep='t'
. This will tell it that the separators are indents, not commas.
data = pd.read_csv("wages_hours.csv", sep="t")
data.head()
HRS | RATE | ERSP | ERNO | NEIN | ASSET | AGE | DEP | RACE | SCHOOL | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2157 | 2.905 | 1121 | 291 | 380 | 7250 | 38.5 | 2.340 | 32.1 | 10.5 |
1 | 2174 | 2.970 | 1128 | 301 | 398 | 7744 | 39.3 | 2.335 | 31.2 | 10.5 |
2 | 2062 | 2.350 | 1214 | 326 | 185 | 3068 | 40.1 | 2.851 | * | 8.9 |
3 | 2111 | 2.511 | 1203 | 49 | 117 | 1632 | 22.4 | 1.159 | 27.5 | 11.5 |
4 | 2134 | 2.791 | 1013 | 594 | 730 | 12710 | 57.7 | 1.229 | 32.5 | 8.8 |
It worked. But there’s quite a lot of data. Do I need all of it? Only AGE
(age) and RACE
(wage rate) are needed in this project. Only these two columns need to be extracted first.
data2 = data[["AGE", "RATE"]]
data2.head()
AGE | RATE | |
---|---|---|
0 | 38.5 | 2.905 |
1 | 39.3 | 2.970 |
2 | 40.1 | 2.350 |
3 | 22.4 | 2.511 |
4 | 57.7 | 2.791 |
To make a competent chart, you need to arrange the ages in ascending or descending order. Let’s do it in descending order (because this is the default behavior for sort()
).
data_sorted = data2.sort(["AGE"])
data_sorted.head()
Like last time, we need to remove the numbers and use age values instead, to simplify the process of plotting.
data_sorted.set_index("AGE", inplace=True)
data_sorted.head()
AGE | RATE |
---|---|
22.4 | 2.511 |
37.2 | 3.015 |
37.4 | 1.901 |
37.5 | 1.899 |
37.5 | 3.009 |
And the graph itself:
data_sorted.plot()
plt.show()
You can see that the rate goes up until age 35, and then it starts to change a lot. Of course, this is a general universal rate. Only individual conclusions can be drawn from this data set.