8 examples of using value_counts from Pandas

by Alex

Before you start working on a data project, you need to look at the data set. Exploratory Data Analysis (EDA) is a very important step, because data can be messy and a lot can go wrong in the process. The Pandas library has several functions for this, and value_counts is one of them. It returns a Series containing the counts of the unique values in a column, in sorted order. However, many people forget about this function's options and use only the default parameters. In this piece, let's see how to get the most out of value_counts by changing them.

What is value_counts()?

The value_counts() function returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

Syntax: df['your_column'].value_counts() – returns the count of each unique value in the given column.

It is important to note that value_counts works on a Series, not a DataFrame, so you need to use single square brackets, df['your_column'], rather than a pair, df[['your_column']].
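A minimal illustration on a toy DataFrame (the column name grade is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'A', 'C', 'A']})

# value_counts is called on a single column (a Series),
# and returns counts sorted in descending order
counts = df['grade'].value_counts()
print(counts)
# A    3
# B    1
# C    1
```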

Parameters

  • normalize (bool, False by default) – if True, the returned object contains the relative frequencies (proportions) of the unique values instead of counts.
  • sort (bool, True by default) – sort by frequency.
  • ascending (bool, False by default) – sort in ascending order.
  • bins (int) – groups the values into half-open intervals instead of counting each distinct value; works only with numeric data.
  • dropna (bool, True by default) – if True, counts of NaN are excluded.

Load data for demonstration

Let’s see how to use this method on real data, taking a dataset of Coursera courses from Kaggle as an example. First, we import the required library and load the data; then we explore it in a Jupyter notebook.

# library import
import pandas as pd

# Loading data
df = pd.read_csv('Downloads/coursea_data.csv', index_col=0)

# check csv data
df.head(10)

Next, check how many records are in the dataset and whether there are any gaps:

df.info()

The result shows that there are 891 records in the dataset and no NA values.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 134 to 163
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   course_title              891 non-null    object
 1   course_organization       891 non-null    object
 2   course_Certificate_type   891 non-null    object
 3   course_rating             891 non-null    float64
 4   course_difficulty         891 non-null    object
 5   course_students_enrolled  891 non-null    object
dtypes: float64(1), object(5)
memory usage: 48.7+ KB

1. value_counts with default parameters

Now we can start using the function value_counts. Let’s start with a basic use of the function.

Syntax: df['your_column'].value_counts(). Let’s get the count of each value in the “course_difficulty” column. value_counts returns the count of every unique value in the column, with no gaps. The result shows that most courses are “Beginner”, followed by “Intermediate” and “Mixed”, with “Advanced” in last place.

df['course_difficulty'].value_counts()
---------------------------------------------------
Beginner        487
Intermediate    198
Mixed           187
Advanced         19
Name: course_difficulty, dtype: int64

Now it’s time to work with the parameters.

2. Sorting in ascending order

By default value_counts() returns data in descending order. You can change the behavior by setting the ascending parameter to True.

Syntax: df['your_column'].value_counts(ascending=True).

df['course_difficulty'].value_counts(ascending=True)
---------------------------------------------------
Advanced         19
Mixed           187
Intermediate    198
Beginner        487
Name: course_difficulty, dtype: int64

3. Alphabetical sorting

In certain cases, there may be a need to sort records alphabetically. This is done by adding sort_index(ascending=True) after value_counts(). By default, the function sorts “course_difficulty” by the number of matches, but with sort_index it sorts by the index (the unique values themselves):

df['course_difficulty'].value_counts().sort_index(ascending=True)
---------------------------------------------------
Advanced         19
Beginner        487
Intermediate    198
Mixed           187
Name: course_difficulty, dtype: int64

If, however, you want to display value_counts() in reverse alphabetical order, you need to reverse the sort direction: .sort_index(ascending=False).
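For example, on a small made-up Series:

```python
import pandas as pd

s = pd.Series(['banana', 'apple', 'cherry', 'apple'])

# unique values in reverse alphabetical order
rev = s.value_counts().sort_index(ascending=False)
print(rev)
# cherry    1
# banana    1
# apple     2
```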

4. Sort by value and then alphabetically

We use a different dataset for this example.

df_fruit = pd.DataFrame({
   'fruit':
      ['persimmons']*5 + ['apples']*5 + ['bananas']*3 + 
      ['peaches']*3 + ['carrots']*3 + ['apricots'] + ['mango']*2
})

So, we need to get the output sorted first by the number of matching values, and then alphabetically. This can be done by combining value_counts() with sort_index(ascending=False) and sort_values(ascending=False).

(df_fruit['fruit'].value_counts()
                  .sort_index(ascending=False)
                  .sort_values(ascending=False))
-------------------------------------------------
persimmons    5
apples        5
bananas       3
carrots       3
peaches       3
mango         2
apricots      1
Name: fruit, dtype: int64
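One caveat: sort_values uses quicksort by default, which is not guaranteed to be stable, so ties are not guaranteed to keep the order produced by the preceding sort_index. If you want equal counts to appear alphabetically, you can request a stable algorithm explicitly; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['pear'] * 2 + ['fig'] * 2 + ['kiwi'] * 2)

# sort alphabetically first, then by count with a stable sort,
# so equal counts keep their alphabetical order
result = (s.value_counts()
           .sort_index()
           .sort_values(ascending=False, kind='mergesort'))
print(result)
# fig     2
# kiwi    2
# pear    2
```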

5. Relative frequency of unique values

Sometimes you need to get relative values, not just numbers. With the parameter normalize=True, the object will return the relative frequency of unique values. The default value of this parameter is False. Syntax: df['your_column'].value_counts(normalize=True).

df['course_difficulty'].value_counts(normalize=True)
-------------------------------------------------
Beginner        0.546577
Intermediate    0.222222
Mixed           0.209877
Advanced        0.021324
Name: course_difficulty, dtype: float64
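To present these shares as percentages instead, you can multiply the normalized result; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['yes'] * 3 + ['no'])

# relative frequencies converted to percentages
pct = s.value_counts(normalize=True).mul(100).round(1)
print(pct)
# yes    75.0
# no     25.0
```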

6. value_counts() to break the data into discrete intervals

Another trick that often gets overlooked: value_counts() can break the data up into discrete intervals using the bins parameter. This works only with numeric data, and the principle is similar to pd.cut. Let’s see how it works using the “course_rating” column as an example, grouping its values into 4 bins. The syntax: df['your_column'].value_counts(bins=number_of_groups).

df['course_rating'].value_counts(bins=4)
-------------------------------------------------
(4.575, 5.0]      745
(4.15, 4.575]     139
(3.725, 4.15]       5
(3.297, 3.725]      2
Name: course_rating, dtype: int64

Binning makes it easy to get insights: most courses are rated above 4.575, and only a few are rated below 4.15.
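Since bins is forwarded to pd.cut, you can also pass an explicit list of bin edges instead of a number of groups; a sketch with made-up ratings:

```python
import pandas as pd

ratings = pd.Series([3.5, 4.2, 4.6, 4.8, 4.9, 5.0])

# explicit half-open intervals; sort=False keeps the bin order
binned = ratings.value_counts(bins=[3.0, 4.0, 4.5, 5.0], sort=False)
print(binned)
```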

7. value_counts() with omissions

By default, the number of NaN values is not included in the result. You can change this behavior by setting the dropna parameter to False. Since there are no NaN values in this dataset, doing so changes nothing here, but the parameter itself is worth remembering. Syntax: df['your_column'].value_counts(dropna=False).
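A quick sketch with a made-up Series that does contain a missing value:

```python
import pandas as pd
import numpy as np

s = pd.Series(['red', 'blue', np.nan, 'red'])

with_drop = s.value_counts()                 # NaN excluded (default)
without_drop = s.value_counts(dropna=False)  # NaN counted too
print(with_drop)
print(without_drop)
```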

8. value_counts() as dataframe

As noted, value_counts() returns a Series, not a DataFrame. If you want the result in the latter form, you can call .to_frame() after value_counts(). Syntax: df['your_column'].value_counts().to_frame().
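For example, on a toy Series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'])

# the Series of counts becomes a one-column DataFrame
frame = s.value_counts().to_frame()
print(type(frame))  # <class 'pandas.core.frame.DataFrame'>
print(frame)
```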

If you want to name a column or rename an existing one, the conversion can be done in a different way.

value_counts = df['course_difficulty'].value_counts()

# conversion to df and giving new names to the columns
df_value_counts = pd.DataFrame(value_counts)
df_value_counts = df_value_counts.reset_index()
df_value_counts.columns = ['unique_values', 'counts for course_difficulty']
df_value_counts
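An alternative that does the renaming in one chain, shown on a toy Series (the column names unique_values and counts are just examples):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'])

# rename_axis names the index column, reset_index(name=...) names the counts
df_counts = (s.value_counts()
               .rename_axis('unique_values')
               .reset_index(name='counts'))
print(df_counts)
```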

Groupby and value_counts

Groupby is a very popular method in Pandas. It can be used to group rows by one column and count the values in another. Syntax: df.groupby('your_column_1')['your_column_2'].value_counts(). With groupby and value_counts you can count the number of each certificate type for every level of course difficulty.

df.groupby('course_difficulty')['course_Certificate_type'].value_counts()
-------------------------------------------------
course_difficulty  course_Certificate_type
Advanced           SPECIALIZATION              10
                   COURSE                       9
Beginner           COURSE                     282
                   SPECIALIZATION             196
                   PROFESSIONAL CERTIFICATE     9
Intermediate       COURSE                     104
                   SPECIALIZATION              91
                   PROFESSIONAL CERTIFICATE     3
Mixed              COURSE                     187
Name: course_Certificate_type, dtype: int64

The result has a multi-index, which allows multiple index levels in one object. Here, course difficulty is index level zero and certificate type is index level one.
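You can then select all rows for one outer index value with .loc; a sketch on a small made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'level': ['easy', 'easy', 'easy', 'hard'],
    'kind':  ['quiz', 'quiz', 'exam', 'quiz'],
})

counts = df_demo.groupby('level')['kind'].value_counts()
print(counts.loc['easy'])  # counts within a single outer index value
```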

Filtering values by minimum and maximum

Working with a dataset, you may need to keep only values whose counts fall within certain limits. Syntax: df['your_column'].value_counts().loc[lambda x : x > 1]. This code filters the counts and keeps only the values that occur more than once. For the example, let’s keep only the ratings that occur more than 4 times.

df['course_rating'].value_counts().loc[lambda x: x > 4]
-------------------------------------------------
4.8    256
4.7    251
4.6    168
4.5     80
4.9     68
4.4     34
4.3     15
4.2     10
Name: course_rating, dtype: int64
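To filter by both a minimum and a maximum, combine two conditions in the lambda; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(list('aaabbbbcdd'))

# keep values that occur at least 2 and at most 3 times
filtered = s.value_counts().loc[lambda x: (x >= 2) & (x <= 3)]
print(filtered)
# a    3
# d    2
```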

value_counts() is a handy tool that allows you to do convenient one-line analysis.
