Pandas on practical examples

by Alex
This is a supplement to Pandas Basics. Instead of a theoretical introduction to a million features of Pandas, here are 2 examples:

  1. Data from the Hubble Space Telescope.
  2. Dataset about the wages of economically active people in the USA.

Hubble data

Let’s start with the Hubble data. First we will talk about how to read a simple csv file and build the data: Let’s start with data from the Hubble Space Telescope, one of the most famous telescopes. The data is very simple. The file is called hubble_data.csv. You can even open it in Microsoft Excel or OpenOffice. Here is how it will look in these programs:Pandas on practical examplesData in CSV format. This is a very popular format primarily because of its simplicity. You can open it in any text editor. Try it. You will see that CSV files are just different values separated by a comma (which is actually what the name implies – comma-separated values). This is the main reason for the popularity of the format. It does not require any special software, not even Excel. It also means that the data can be read in 10 years, when new versions of spreadsheets appear. Let’s get started. Open an instance of Ipython (Jupyter) and run the following code.

import pandas as pd
import matplotlib.pyplot as plt

This imports pandas, the main library in Python for analyzing data. It also imports matplotlib for plotting. %pylan inline is an Ipython command that allows you to use graphs in your work.

data = pd.read_csv("hubble_data.csv")


Pandas makes life much easier. You can read a csv file with one function: read_csv(). Now you can call head() to output the first five lines.

distance distance_velocity
0 0.032 170
1 0.034 290
2 0.214 -130
3 0.263 -70
4 0.275 -185

Pandas is a pretty smart library. This is shown by the fact that it understands that the first line of the file is the header. This is what the first 3 lines of the CSV file look like:


Now you can see that the header at the top is really there. It names two columns: distance and recession_velocity.

Pandas recognizes this correctly.

But what to do if there is no header? You can read the file by manually specifying the headers. There is one more file hubble_data_no_headers.csv without headers. It is not different from the previous one except for the missing headers. Here’s how to read such a file:

headers = ["dist","rec_vel"]

data_no_headers = pd.read_csv("hubble_data_no_headers.csv", names=headers)


Here you declare your own headers(headers). They have different names(dist and rec_vel) to make it obvious that it is a different file. The data is read in the same way, but this time new variables names=headers are passed. This tells Pandas to use them, because there are no headers in the file. Then the first five lines are printed.

dist rec_vel
0 0.032 170
1 0.034 290
2 0.214 -130
3 0.263 -70
4 0.275 -185

Pandas only allows you to see one column:


0 0.032
1 0.034
2 0.214
4 0.275

Now that you have the data, you need to plot it. The easiest way to accomplish this is to get rid of indices. Pandas adds numbers by default (just like Excel). If you look at the data structure, you will see that the left line has values 0,1,2,3,4.... If you replace the numbers with distance, then graphing becomes even easier. distance becomes the x-axis, and velocity becomes the y-axis. But how to replace the indices?

data.set_index("distance", inplace=True)
distance distance
0.032 170
0.034 290
0.214 -130
0.263 -70
0.275 -185

If you compare with the last example, you can see that the numbers are gone. Moreover, the data are now arranged in an x – y relation. Creating a graph is now even easier:


Pandas on practical examples

Payroll data

Now payroll data. This example builds on the previous one and shows how to add custom headers, work with indented files, and extract columns from the data: This example is more complicated.

Open the notebook. Let’s start, as before, by importing the necessary modules and reading the CSV file. In this case, we are talking about the file wages_hours.csv.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("wages_hours.csv")

Everything is as before. We just need to read the file, don’t we? But the result is as follows:

0 2157t2.905t1121t291t380t7250t38.5t2.340…
1 2174t2.970t1128t301t398t7744t39.3t2.335…
2 2062t2.350t1214t326t185t3068t40.1t2.851…
3 2111t2.511t1203t49t117t1632t22.4t1.159…
4 2134t2.791t1013t594t730t12710t57.7t1.22…

It looks incomprehensible. And it doesn’t look like the original file at all. So what happened? There are no commas in the CSV file Although the name implies “Comma Separated Values”, the data can be separated by anything. For example, it could be indented. t in the text means indentations. Pandas can’t parse the file because the library counted on commas, not indents. You have to read the file again, this time passing a new variable sep='t'. This will tell it that the separators are indents, not commas.

data = pd.read_csv("wages_hours.csv", sep="t")


0 2157 2.905 1121 291 380 7250 38.5 2.340 32.1 10.5
1 2174 2.970 1128 301 398 7744 39.3 2.335 31.2 10.5
2 2062 2.350 1214 326 185 3068 40.1 2.851 * 8.9
3 2111 2.511 1203 49 117 1632 22.4 1.159 27.5 11.5
4 2134 2.791 1013 594 730 12710 57.7 1.229 32.5 8.8

It worked. But there’s quite a lot of data. Do I need all of it? Only AGE (age) and RACE (wage rate) are needed in this project. Only these two columns need to be extracted first.

data2 = data[["AGE", "RATE"]]
0 38.5 2.905
1 39.3 2.970
2 40.1 2.350
3 22.4 2.511
4 57.7 2.791

To make a competent chart, you need to arrange the ages in ascending or descending order. Let’s do it in descending order (because this is the default behavior for sort()).

data_sorted = data2.sort(["AGE"])

Like last time, we need to remove the numbers and use age values instead, to simplify the process of plotting.

data_sorted.set_index("AGE", inplace=True)
22.4 2.511
37.2 3.015
37.4 1.901
37.5 1.899
37.5 3.009

And the graph itself:


Pandas on practical examplesYou can see that the rate goes up until age 35, and then it starts to change a lot. Of course, this is a general universal rate. Only individual conclusions can be drawn from this data set.

Related Posts