Before you start working on a data project, you need to look at the data set. Exploratory Data Analysis (EDA) is an essential step, because raw data can be confusing and a lot can go wrong further down the line. The Pandas library has several functions for this, and value_counts is one of them. It returns the counts of the unique values in a Pandas Series, sorted so that the most frequent value comes first. However, many people never go beyond its default parameters. In this piece, let’s see how to get the most out of value_counts by changing those defaults.
What is value_counts()?
The value_counts() function is used to get a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
Syntax: df['your_column'].value_counts() – returns the count of each unique value in the given column.
It is important to note that value_counts is a Series method, so you need to select the column with single square brackets, df['your_column'], which returns a Series, rather than a pair of brackets, df[['your_column']], which returns a DataFrame.
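As a quick illustration on a toy column (the data and names here are made up for the example):
# single brackets give a Series, double brackets give a DataFrame
import pandas as pd

toy = pd.DataFrame({'fruit': ['apple', 'apple', 'pear']})
print(type(toy['fruit']))    # <class 'pandas.core.series.Series'>
print(type(toy[['fruit']]))  # <class 'pandas.core.frame.DataFrame'>

toy['fruit'].value_counts()
# apple    2
# pear     1
# Name: fruit, dtype: int64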
Parameters
- normalize (bool, default False) – if True, returns the relative frequencies of the unique values instead of their counts.
- sort (bool, default True) – sort by frequency.
- ascending (bool, default False) – sort in ascending order.
- bins (int) – groups the values into half-open intervals; this only works with numeric data.
- dropna (bool, default True) – do not include counts of NaN.
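Here is a minimal sketch exercising several of these parameters on a small made-up Series:
import pandas as pd
import numpy as np

s = pd.Series(['a', 'b', 'a', 'a', np.nan])
s.value_counts()                 # descending counts, NaN excluded: a 3, b 1
s.value_counts(normalize=True)   # relative frequencies: a 0.75, b 0.25
s.value_counts(ascending=True)   # least frequent first: b 1, a 3
s.value_counts(dropna=False)     # include the NaN count: a 3, b 1, NaN 1
s.value_counts(sort=False)       # skip the frequency sort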
Load data for demonstration
Let’s see how to use this method on real data, taking a dataset of Coursera courses from Kaggle as an example. First, we import the required libraries and load the data itself, as in any project. After that, we analyze the data in a Jupyter notebook.
# library import
import pandas as pd
# Loading data
df = pd.read_csv('Downloads/coursea_data.csv', index_col=0)
# check csv data
df.head(10)
Let’s check how many records are in the dataset and whether there are any gaps.
df.info()
The result shows that there are 891 records in the dataset, and no missing values.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 134 to 163
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   course_title              891 non-null    object
 1   course_organization       891 non-null    object
 2   course_Certificate_type   891 non-null    object
 3   course_rating             891 non-null    float64
 4   course_difficulty         891 non-null    object
 5   course_students_enrolled  891 non-null    object
dtypes: float64(1), object(5)
memory usage: 48.7+ KB
1. value_counts with default parameters
Let’s start with a basic use of the function. Syntax: df['your_column'].value_counts(). We get the count of each value in the “course_difficulty” column. The value_counts function returns the number of occurrences of every unique value in the column, with no gaps. It shows that most courses are at the “Beginner” difficulty level, followed by “Intermediate” and “Mixed”, with “Advanced” in last place.
df['course_difficulty'].value_counts()
---------------------------------------------------
Beginner 487
Intermediate 198
Mixed 187
Advanced 19
Name: course_difficulty, dtype: int64
Now it’s time to work with the parameters.
2. Sorting in ascending order
By default, value_counts() returns data in descending order. You can change this behavior by setting the ascending parameter to True. Syntax: df['your_column'].value_counts(ascending=True).
df['course_difficulty'].value_counts(ascending=True)
---------------------------------------------------
Advanced 19
Mixed 187
Intermediate 198
Beginner 487
Name: course_difficulty, dtype: int64
3. Alphabetical sorting
In certain cases, there may be a need to sort records alphabetically. This is done by adding sort_index(ascending=True) after value_counts(). By default, the function sorts “course_difficulty” by the number of occurrences; with sort_index it sorts by the index instead, i.e. by the unique values themselves:
df['course_difficulty'].value_counts().sort_index(ascending=True)
---------------------------------------------------
Advanced 19
Beginner 487
Intermediate 198
Mixed 187
Name: course_difficulty, dtype: int64
If, however, you want to display value_counts() in reverse alphabetical order, you need to reverse the sort direction: .sort_index(ascending=False).
4. Sort by value and then alphabetically
We use a different dataset for this example.
df_fruit = pd.DataFrame({
    'fruit': ['persimmons']*5 + ['apples']*5 + ['bananas']*3 +
             ['peaches']*3 + ['carrots']*3 + ['apricots'] + ['mango']*2
})
So, we need to get the output sorted first by the number of occurrences, and then alphabetically within equal counts. This can be done by combining value_counts() with sort_index() and sort_values(). The alphabetical sort is applied first; the frequency sort then has to be stable (kind='mergesort'), so that ties keep their alphabetical order.
df_fruit['fruit'].value_counts() \
    .sort_index() \
    .sort_values(ascending=False, kind='mergesort')
-------------------------------------------------
apples        5
persimmons    5
bananas       3
carrots       3
peaches       3
mango         2
apricots      1
Name: fruit, dtype: int64
5. Relative frequency of unique values
Sometimes you need relative values rather than raw counts. With the parameter normalize=True, value_counts returns the relative frequency of each unique value; the frequencies sum to 1. The default value of this parameter is False. Syntax: df['your_column'].value_counts(normalize=True).
df['course_difficulty'].value_counts(normalize=True)
-------------------------------------------------
Beginner 0.546577
Intermediate 0.222222
Mixed 0.209877
Advanced 0.021324
Name: course_difficulty, dtype: float64
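If percentages are easier to read, you can scale and round the normalized result (a small follow-up sketch, not a separate parameter):
# relative frequencies as percentages, rounded to two decimals
(df['course_difficulty'].value_counts(normalize=True) * 100).round(2)
# Beginner        54.66
# Intermediate    22.22
# Mixed           20.99
# Advanced         2.13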
6. value_counts() to break the data into discrete intervals
Another trick that often gets overlooked: value_counts() can be used to break the data up into discrete intervals using the bins parameter. This only works with numeric data. The principle is similar to pd.cut. Let’s see how it works using the “course_rating” column as an example, grouping the values into 4 bins. The syntax: df['your_column'].value_counts(bins=number_of_bins).
df['course_rating'].value_counts(bins=4)
-------------------------------------------------
(4.575, 5.0] 745
(4.15, 4.575] 139
(3.725, 4.15] 5
(3.297, 3.725] 2
Name: course_rating, dtype: int64
Binning makes it easy to get insights: most courses are rated between 4.575 and 5.0, and only a handful are rated below 4.15.
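For comparison, here is a rough pd.cut equivalent. value_counts(bins=4) computes equal-width intervals for you, while pd.cut also accepts explicit edges (the edges below are purely illustrative):
# equal-width bins, counted explicitly
pd.cut(df['course_rating'], bins=4).value_counts()

# custom interval edges
pd.cut(df['course_rating'], bins=[3.2, 3.7, 4.1, 4.6, 5.0]).value_counts()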
7. value_counts() with missing values
By default, the number of NaN values is not included in the result. You can change this behavior by setting the dropna parameter to False. Since there are no missing values in this dataset, the parameter does not change anything here, but it is worth remembering. Syntax: df['your_column'].value_counts(dropna=False).
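Since the course data has no missing values, here is a toy sketch of what the parameter changes:
import pandas as pd
import numpy as np

s = pd.Series(['red', 'blue', 'red', np.nan])
s.value_counts()              # NaN dropped: red 2, blue 1
s.value_counts(dropna=False)  # NaN counted too: red 2, blue 1, NaN 1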
8. value_counts() as dataframe
As noted, value_counts() returns a Series, not a DataFrame. If you want the result in the latter form, you can chain .to_frame() after value_counts(). Syntax: df['your_column'].value_counts().to_frame(). This returns a single-column DataFrame with the unique values as its index.
If you want to name a column or rename an existing one, this conversion can be done in a different way.
counts = df['course_difficulty'].value_counts()
# conversion to a dataframe and giving new names to the columns
df_value_counts = pd.DataFrame(counts)
df_value_counts = df_value_counts.reset_index()
df_value_counts.columns = ['unique_values', 'counts for course_difficulty']
df_value_counts
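The same result can be produced in one chain with rename_axis and reset_index (a compact alternative, assuming the same column names as above):
df['course_difficulty'].value_counts() \
    .rename_axis('unique_values') \
    .reset_index(name='counts for course_difficulty')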
Groupby and value_counts
Groupby is a very popular method in Pandas. It can be used to group rows by one column and count the values in another. Syntax: df.groupby('your_column_1')['your_column_2'].value_counts(). So, with groupby and value_counts you can count the number of each certificate type for every level of course difficulty.
df.groupby('course_difficulty')['course_Certificate_type'].value_counts()
-------------------------------------------------
course_difficulty  course_Certificate_type
Advanced           SPECIALIZATION               10
                   COURSE                        9
Beginner           COURSE                      282
                   SPECIALIZATION              196
                   PROFESSIONAL CERTIFICATE      9
Intermediate       COURSE                      104
                   SPECIALIZATION               91
                   PROFESSIONAL CERTIFICATE      3
Mixed              COURSE                      187
Name: course_Certificate_type, dtype: int64
The result has a MultiIndex, which allows multiple index levels in one object: here, course difficulty is index level zero and certificate type is index level one.
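If a flat table is easier to read, the inner index level can be pivoted into columns with unstack (a sketch; combinations that never occur become NaN):
df.groupby('course_difficulty')['course_Certificate_type'].value_counts().unstack()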
Filtering values by minimum and maximum
When working with a dataset, you may want to keep only the values that occur a certain number of times, which value_counts() supports via .loc. Syntax: df['your_column'].value_counts().loc[lambda x: x > 1]. This code filters the unique values and shows only those whose count is greater than one. For the example, let’s keep only the course ratings that occur more than 4 times.
df['course_rating'].value_counts().loc[lambda x: x > 4]
-------------------------------------------------
4.8 256
4.7 251
4.6 168
4.5 80
4.9 68
4.4 34
4.3 15
4.2 10
Name: course_rating, dtype: int64
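The same pattern handles both bounds at once. A sketch with illustrative thresholds, keeping only the ratings that occur more than 50 but fewer than 200 times:
counts = df['course_rating'].value_counts()
counts[(counts > 50) & (counts < 200)]
# 4.6    168
# 4.5     80
# 4.9     68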
value_counts() is a handy tool that lets you do a lot of analysis in a single line.