Seaborn is a library for making statistical plots in Python. It is built on top of matplotlib and tightly integrates with pandas data structures. Seaborn helps you explore and understand the data. Its graphing functions work with datasets and perform all the necessary transformations to create informative graphs. Its dataset-oriented syntax allows you to focus on the graphs rather than the details of their construction. The official documentation is in English: https://seaborn.pydata.org/index.html.
Installing seaborn
Official seaborn releases can be installed from PyPI:
pip install seaborn
The library is also part of the Anaconda distribution:
conda install seaborn
The library works with Python version 3.6+. If they are not already present, these dependencies are installed along with seaborn: numpy, scipy, pandas, matplotlib. Once you have installed seaborn, you can load one of the built-in datasets and build a test chart from it (load_dataset() fetches the data from an online repository, so it needs an internet connection):
import seaborn as sns
df = sns.load_dataset("penguins")
sns.pairplot(df, hue="species")
By running this code in Jupyter Notebook, you will see a graph like this. If you are not working with Jupyter, you may need to call matplotlib.pyplot.show() explicitly:
import matplotlib.pyplot as plt
plt.show()
Let’s take a closer look at building popular types of graphs. All further code is executed in Jupyter Notebook.
Drawing a Bar Plot in Seaborn
Bar plots show numeric values on one axis and categorical variables on the other. They let you compare the value of a parameter across categories. Bar plots can be used to visualize time series as well as purely categorical data.
Drawing a bar plot
To draw a bar plot in Seaborn, call the barplot() function and pass it the categorical and numeric variables you want to visualize, as in the example:
import matplotlib.pyplot as plt
import seaborn as sns
x = ['A', 'B', 'C']
y = [10, 50, 30]
sns.barplot(x=x, y=y);
In this case, we have several categorical values in one list (A, B and C) and numeric values in another list (10, 50 and 30). The relationship between the two is visualized as a bar plot by passing both lists to sns.barplot(). The result is a clear and simple bar plot.
More often than not, you will be working with datasets that contain much more data than the one in the example. Sometimes you need to sort these datasets, or count how many times a value is repeated. Real data may also contain errors and omissions, and a category usually has many observations rather than a single number. Seaborn handles this automatically: by default, barplot() aggregates the observations in each category by their mean and draws that mean as the height of the bar. Let’s import the classic Titanic dataset and draw a bar plot from it:
# Import data
titanic_dataset = sns.load_dataset("titanic")
# Plotting the bar graph
sns.barplot(x="sex", y="survived", data=titanic_dataset);
In this case, we have assigned the "sex" and "survived" columns to the X and Y axes instead of hard-coded lists. If we display the first rows of the dataset (titanic_dataset.head()), we see a table like this:
survived pclass sex age sibsp parch fare ...
0 0 3 male 22.0 1 0 7.2500 ...
1 1 1 female 38.0 1 0 71.2833 ...
2 1 3 female 26.0 0 0 7.9250 ...
3 1 1 female 35.0 1 0 53.1000 ...
4 0 3 male 35.0 0 0 8.0500 ...
Make sure that the column names match the ones you assigned to the x and y arguments. Finally, we pass the dataset itself as the data argument of the function. And we get this result:
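The height of each bar is the mean of the numeric column within that category. The following minimal sketch checks this by comparing the bar heights with a pandas groupby. It uses made-up data rather than the Titanic dataset; the toy values and the names toy, bar_heights and group_means are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the "sex" and "survived" columns
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "survived": [0, 0, 1, 1, 1],
})

ax = sns.barplot(x="sex", y="survived", data=toy)

# Heights of the rectangles that barplot() drew
bar_heights = [patch.get_height() for patch in ax.patches]

# The same numbers computed directly with pandas
group_means = toy.groupby("sex")["survived"].mean().tolist()

print(bar_heights)
print(group_means)
```

Both lists should contain the same two values (one third for "male" and one for "female"), possibly in a different order.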
Drawing a Horizontal Bar Plot
To draw a horizontal bar plot instead of a vertical one, just swap the variables passed to x and y. In this case, the categorical variable is displayed along the y-axis, which produces a horizontal graph:
x = ['A', 'B', 'C']
y = [10, 50, 30]
sns.barplot(x=y, y=x);
The graph will look like this:
How to change the color in barplot()
Changing the color of the bars is easy. Set the color parameter of the barplot() function, and all the bars will be drawn in the color you specify. Let’s change it to blue:
x = ['A', 'B', 'C']
y = [10, 50, 30]
sns.barplot(x=x, y=y, color='blue');
Then the graph will look like this:
Or, better yet, set the palette argument, which accepts the name of a color palette or a list of colors. A fairly common value for this parameter is hls:
sns.barplot(
x="embark_town",
y='survived',
palette='hls',
data=titanic_dataset
);
Which will lead to this result:
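To see what a palette name such as hls stands for, you can ask Seaborn for the colors directly. This is a small sketch using sns.color_palette(); the variable name hls_colors and the choice of three colors (one per embarkation town) are assumptions for illustration.

```python
import seaborn as sns

# Request three evenly spaced colors from the "hls" palette
hls_colors = sns.color_palette("hls", 3)

# Each color is an (R, G, B) tuple with channel values between 0 and 1
for rgb in hls_colors:
    print(tuple(round(channel, 2) for channel in rgb))
```

A list like this can also be passed as the palette argument instead of a palette name. Recent seaborn releases (0.13 and later, as far as we know) warn when palette is used without hue; assigning the same column to hue is the usual way around that.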
Bar Plot Grouping in Seaborn
Often you want to group the bars in a plot by an additional attribute. Let’s say you want to compare some general data, such as passenger survival rates, and group them according to given criteria. We may want to visualize the share of surviving passengers by class (first, second, and third), but also take into account the town in which they embarked. All this information can be easily displayed on a bar plot. To group the bars, we use the hue argument. This argument groups the corresponding data and tells Seaborn how to color the bars. Let’s look at the example we just discussed:
sns.barplot(x="class", y="survived", hue="embark_town", data=titanic_dataset);
We get a graph like this:
Configuring the order in which the groups of bars are displayed
You can change the default order of the bars. This is done with the order argument, which takes a list of the category values in the order in which they should be placed. For example, so far the classes have been ordered from first to third. What if we want the opposite?
sns.barplot(
x="class",
y="survived",
hue="embark_town",
order=["Third", "Second", "First"],
data=titanic_dataset
);
We would get a plot like this:
Change the confidence interval in barplot()
You can also experiment with the error bars by specifying the ci argument. For example, you can disable them by setting it to None, show the standard deviation instead of the confidence interval by setting it to "sd", or add caps of a given width to the ends of the error bars by setting capsize. Note that seaborn 0.12 and later deprecate ci in favor of the errorbar argument (for example, errorbar=None or errorbar="sd"). Let’s experiment a bit with the confidence interval attribute:
sns.barplot(
x="class",
y="survived",
hue="embark_town",
ci=None,
data=titanic_dataset
);
We get this result:
Or we can use the standard deviation:
sns.barplot(
x="class",
y="survived",
hue="who",
ci="sd",
capsize=0.1,
data=titanic_dataset
);
We have looked at several ways to build a bar plot in Seaborn using examples. Now let’s move on to heatmaps.
Building Heatmaps in Seaborn
Let’s see how we can work with the Seaborn library in Python to create a basic correlation heatmap. For our purposes, we will use the Ames housing dataset available at Kaggle.com. It contains over 30 metrics that can potentially affect property values. Since Seaborn is built on top of the Matplotlib data visualization library, the two are fairly easy to use together. Therefore, in addition to the standard modules, we are also going to import matplotlib.pyplot.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
The following code creates a correlation matrix of all the studied indicators, including our variable y (real estate value). Here dataframe is assumed to be the housing dataset already loaded with pandas.
dataframe.corr()
A correlation matrix with only 13 variables. That’s not to say it’s not readable at all. But why not make life easier with visualization?
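Since loading the housing data is not shown here, the following sketch illustrates what corr() produces on a small made-up DataFrame. The column names, the random values and the use of select_dtypes() to drop the text column are assumptions for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=200)

# Hypothetical housing-like data: three numeric features and one text column
toy = pd.DataFrame({
    "area": area,
    "rooms": np.round(area / 40 + rng.normal(0, 0.5, size=200)),
    "age": rng.uniform(0, 100, size=200),
    "neighborhood": rng.choice(["north", "south"], size=200),
})
toy["price"] = 1000 * toy["area"] - 300 * toy["age"] + rng.normal(0, 20000, size=200)

# Keep only the numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```

The result is a square, symmetric table with ones on the diagonal, because every column correlates perfectly with itself.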
A simple heat map in Seaborn
sns.heatmap(dataframe.corr());
About as beautiful as it is useless. Seaborn is easy to use, but rather difficult to navigate. The library comes with many built-in features and extensive documentation. It can be hard to figure out which arguments to use if you don’t need every possible trick. Let’s make a basic heatmap more useful with a minimum of effort. Take a look at the list of heatmap() arguments:
seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)
- vmin, vmax – set the range of values that anchors the colormap.
- cmap – defines the particular colormap we want to use (the seaborn documentation lists the available color palettes).
- center – takes a number at which to center the colormap; if cmap is not specified, setting center changes the default colormap.
- annot – if set to True, the numeric correlation values are displayed inside the cells.
- cbar – if set to False, the color bar (serving as a legend) disappears.
# Increase the size of the heatmap
plt.figure(figsize=(16, 6))
# Save the heatmap object in a variable so you can access it easily,
# when you want to include additional features (like a header display).
# Set the range of values to display on the color map to -1 to 1 and set the annotation (annot) to True,
# to display the numeric correlation values on the heatmap.
heatmap = sns.heatmap(dataframe.corr(), vmin=-1, vmax=1, annot=True)
# Give the heatmap a title. The pad parameter defines the distance of the title from the top of the heatmap.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
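If you do not have the housing dataset at hand, the same calls can be tried on synthetic data. The sketch below is only an illustration: the columns a, b and c and their values are made up, and fmt=".2f" is an optional extra taken from the argument list above to fix the number of decimals in the annotations.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)

# Hypothetical data: "c" is constructed to correlate strongly with "a"
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
toy["c"] = 0.8 * toy["a"] + 0.2 * toy["c"]

corr = toy.corr()

plt.figure(figsize=(6, 4))
ax = sns.heatmap(corr, vmin=-1, vmax=1, annot=True, fmt=".2f")
ax.set_title("Correlation Heatmap", fontdict={"fontsize": 12}, pad=12)

# With annot=True, one text label is drawn per cell
annotations = [text.get_text() for text in ax.texts]
print(annotations)
```

For a 3x3 correlation matrix this gives nine annotations, three of which (the diagonal) read 1.00.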
The best way to work with heatmap() is to use a diverging color palette. It has two very different dark (saturated) colors at the two ends of the value range, with a pale, almost colorless midpoint. Let’s illustrate this and deal with one more small detail: how to save the created heatmap to a png file with all the necessary x and y labels (xticklabels and yticklabels).
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(dataframe.corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
# Save the map as a png file
# The dpi parameter sets the resolution of the saved image in dots per inch
# bbox_inches, when set to 'tight', does not allow labels to be cropped
plt.savefig('heatmap.png', dpi=300, bbox_inches='tight')
The stronger correlation at both ends of the spectrum appears as dark (saturated) cells, the weaker one as light cells.
Triangle Correlation Heatmap
Take a look at any of the heatmaps above. If you discard one of its halves along the diagonal of ones, you won’t lose any information. So let’s shrink the heatmap, leaving only the lower triangle. The mask argument of heatmap() comes in handy for hiding part of the heatmap. It takes an array of boolean values or a DataFrame of the same shape as the data. If it is provided, heatmap cells for which the mask value is True are not displayed. Let’s use the np.triu() function of the numpy library to isolate the upper triangle of a matrix, turning all values below the diagonal into 0. np.tril() does the same, only keeping the lower triangle. In turn, np.ones_like() creates an array of the same shape as the correlation matrix, filled with 1s.
np.triu(np.ones_like(dataframe.corr()))
When we convert the data type to logical, all 1s will turn True and all 0s will turn False.
plt.figure(figsize=(16, 6))
# Define a mask to set the values in the top triangle to True
# np.bool was removed in NumPy 1.24, so the built-in bool is used as the dtype
mask = np.triu(np.ones_like(dataframe.corr(), dtype=bool))
heatmap = sns.heatmap(dataframe.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':18}, pad=16);
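The mask construction can be inspected on its own, without any plotting. The following sketch uses a hand-written 3x3 correlation matrix (the values and the labels a, b, c are made up) to show which cells the mask hides.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix for three variables
corr = pd.DataFrame(
    [[1.0, 0.5, -0.2],
     [0.5, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=list("abc"),
    columns=list("abc"),
)

ones = np.ones_like(corr, dtype=bool)  # 3x3 array filled with True
mask = np.triu(ones)                   # True on and above the diagonal, False below

# heatmap() hides the cells where the mask is True;
# where(~mask) shows the values that would remain visible
visible = corr.where(~mask)

print(mask)
print(visible)
```

Only the three values below the diagonal survive, which is exactly the lower triangle that the heatmap displays.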
Correlation of Independent Variables with Dependent Variables
Quite often we want to create a colored map that shows the strength of the relationship between each independent variable included in our model and the dependent variable. The following code returns the correlation of each feature with the sale price, the only dependent variable, in descending order.
dataframe.corr()[['Sale Price']].sort_values(by='Sale Price', ascending=False)
Let’s use the resulting list as data to display on the heat map.
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(dataframe.corr()[['Sale Price']].sort_values(by='Sale Price', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with Sales Price', fontdict={'fontsize':18}, pad=16);
These examples demonstrate the basic functionality of the heatmap in Seaborn. Now let’s move on to scatter plots.
Building a Scatter Plot in Seaborn
Let’s look at the process of creating a Scatter Plot in Seaborn. We will build simple and three-dimensional scatter plots, as well as group plots based on FacetGrid.
Importing Data
We will use a dataset based on world happiness. Comparing its index to other indicators will reflect the factors that influence the level of happiness in the world.
Drawing a scatter plot
Let’s plot the relationship between the happiness index and the country’s economy (GDP per capita):
dataframe = pd.read_csv('2016.csv')
sns.scatterplot(data=dataframe, x="Economy (GDP per Capita)", y="Happiness Score");
With Seaborn, it is very easy to make simple graphs like scatter plots. We don’t have to create a Figure object and Axes instances or customize anything. Here we passed dataframe as the data argument, and the names of the columns we want to visualize as x and y. By default, the axes are labeled with the column names, which correspond to the headers in the loaded file. Below we look at how to change this. After executing the code, we get the following:
The result shows a positive correlation between GDP per capita and the estimated level of happiness of the inhabitants of a particular country or region.
Constructing a group of scatterplots using FacetGrid
If you want to compare many variables with each other, for example, average life expectancy together with the happiness score and the level of the economy, there is no need to build a 3D graph. Although there are two-dimensional charts that visualize the relationship between more than two variables, not all of them are easy to use. With the FacetGrid object, Seaborn lets you split the data into subsets and build a grid of related graphs from it. Let’s take a look at the following example:
grid = sns.FacetGrid(dataframe, col="Region", hue="Region", col_wrap=5)
grid.map(sns.scatterplot, "Economy (GDP per Capita)", "Health (Life Expectancy)")
grid.add_legend();
In this example, we created an instance of the FacetGrid object and passed it dataframe as the data. When we pass "Region" to the col argument, the library splits the dataset by region and builds a scatter plot for each of them. The hue parameter gives each region its own color. Finally, the col_wrap argument limits the width of the Figure to 5 charts per row. Once this limit is reached, the next charts are placed on a new row. We use the map() method to draw on the grid before it is displayed. The plotting function, sns.scatterplot, is passed as the first argument, followed by the names of the columns to use for the x and y axes. The result is 10 plots, one for each region, with their corresponding axes. Finally, we call the method that adds a legend with the color labels.
Drawing a 3D scatter plot
Unfortunately, Seaborn lacks its own 3D engine. Being only an add-on to Matplotlib, it relies on the graphical capabilities of the main library. Nevertheless, we can still apply the Seaborn style to a 3D diagram. Let’s see how it would look with sampling by levels of happiness, economy, and health:
%matplotlib notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
# Apply a Seaborn style so that the Matplotlib 3D axes pick it up
sns.set_style("darkgrid")
df = pd.read_csv('Downloads/2016.csv')
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
x = df['Happiness Score']
y = df['Economy (GDP per Capita)']
z = df['Health (Life Expectancy)']
ax.set_xlabel('Happiness')
ax.set_ylabel('Economy')
ax.set_zlabel("Health")
ax.scatter(x, y, z)
plt.show()
Running the code will produce an interactive 3D visualization that can be rotated and scaled in three dimensions:
Customizing the Scatter Plot
With Seaborn you can easily customize the various elements of the created diagrams. For example, it is possible to change the color and size of each point on the chart. Let’s try to set some parameters and see how its appearance changes:
sns.scatterplot(
data=dataframe,
x="Economy (GDP per Capita)",
y="Happiness Score",
hue="Region",
size="Freedom"
);
Here we have mapped hue to the regions, which means that the data for each of them is colored differently. In addition, using the size argument we have made the size of the points depend on the level of freedom. The larger the value, the larger the dot on the diagram.
Or you can simply set the same color and size for all points:
sns.scatterplot(
data=dataframe,
x="Economy (GDP per Capita)",
y="Happiness Score",
color="red",
s=5  # marker size, forwarded to Matplotlib's scatter()
);
Great, you’ve learned a few ways to build a scatter plot in Seaborn. Let’s move on to another popular plot.
Drawing a Box Plot in Seaborn
A box plot is also called:
- a rectangle graph,
- a box chart,
- a spread chart,
- or a box-and-whisker plot, for its appearance.
Box plots are used to visualize summary statistics of a dataset. They display distribution attributes such as the range and the spread of the data within that range (box, “whiskers”, median).
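The summary statistics behind a box plot can be computed by hand. The sketch below assumes the common default of whiskers reaching at most 1.5 times the interquartile range beyond the box (Matplotlib's whis=1.5); the sample values are made up.

```python
import numpy as np

# Hypothetical sample with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box spans the first to the third quartile; the line inside it is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme data points that are still inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

For this sample the box runs from 3.25 to 7.75 with the median at 5.5, the whiskers reach 1 and 9, and 30 is drawn as an outlier.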
Data Import
You need continuous numeric data to create a box plot because such a plot displays summary statistics – median, range, and outliers. For example, let’s use forestfires.csv data set (information about forest litter moisture index, precipitation, temperature, wind, etc.). We import pandas to load and analyze the dataset, seaborn and the pyplot module from matplotlib for visualization:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
Let’s use pandas to read the CSV file into a dataframe and output the first 5 rows. We also check whether the dataset contains missing values (Null, NaN):
# specify your path to the file
dataframe = pd.read_csv("Downloads/forestfires.csv")
print(dataframe.isnull().values.any())
dataframe.head()
The code will return False and the top of the table:
   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4   oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6   mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6   mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0
print() output False, so there are no missing values. If there were, we would have to handle them first. After checking the data, we need to select the features that will be visualized. For convenience, we will store them in variables with the same names.
FFMC = dataframe["FFMC"]
DMC = dataframe["DMC"]
DC = dataframe["DC"]
RH = dataframe["RH"]
ISI = dataframe["ISI"]
temp = dataframe["temp"]
These are the columns that contain continuous numeric data.
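Coming back to the missing-value check: if it had returned True, one simple way to handle it would be to count the gaps per column and drop the incomplete rows. This is a sketch on made-up data; dropping rows is only one possible strategy and is an assumption here, not something the forest fires dataset requires.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing measurements
toy = pd.DataFrame({
    "temp": [8.2, 18.0, np.nan, 8.3],
    "wind": [6.7, 0.9, 1.3, np.nan],
    "day": ["fri", "tue", "sat", "fri"],
})

has_missing = toy.isnull().values.any()  # True if any cell is missing
missing_per_column = toy.isnull().sum()  # number of missing cells in each column
cleaned = toy.dropna()                   # keep only the complete rows

print(has_missing)
print(missing_per_column)
print(cleaned)
```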
The box plot
We will use the boxplot() function in Seaborn to create a diagram, passing the variables for the visualization as arguments:
sns.boxplot(x=DMC);
To visualize the distribution of only one attribute, we pass it as the x argument. In this case, Seaborn lays the box out along the x-axis and no y variable is needed, as seen in the following image. If you want the distribution segmented by category, you can pass a categorical variable to x and a continuous variable to y as arguments of the boxplot() function.
sns.boxplot(x=dataframe["day"], y=DMC);
Now we have a box plot for each day of the week. If you want to visualize several columns at the same time, the x and y arguments alone are not enough. For this purpose, use the data argument, which takes a dataset containing the required variables and their values. Create a new dataset containing only the columns we want to visualize, then apply the melt() function to it. The resulting dataset is passed to the data argument, and the default column names produced by melt() (variable and value) are passed to the x and y arguments:
df = pd.DataFrame(data=dataframe, columns=["FFMC", "DMC", "DC", "ISI"])
sns.boxplot(x="variable", y="value", data=pd.melt(df))
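To make the role of melt() concrete, here is what it does to a tiny two-column frame. The values are the first two rows of the FFMC and DMC columns shown earlier; the names wide and long are assumptions for illustration.

```python
import pandas as pd

# Wide format: one column per indicator
wide = pd.DataFrame({"FFMC": [86.2, 90.6], "DMC": [26.2, 35.4]})

# Long format: one row per (indicator, measurement) pair
long = pd.melt(wide)
print(long)
```

Every original column name ends up in the variable column and every measurement in the value column, which is why x="variable" and y="value" draw one box per indicator.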
Changing the color of the boxplot
Seaborn automatically assigns different colors to different variables so that you can easily distinguish them visually. You can change the colors by providing your own list. After defining a list of colors as HEX values or names of available Matplotlib colors, pass it to boxplot() as the palette argument:
colors = ['#78C850', '#F08030', '#6890F0', '#F8D030', '#F85888', '#705898', '#98D8D8']
sns.boxplot(x=DMC, y=dataframe["day"], palette=colors);
Customizing Axis Labels
With Seaborn, you can easily customize the X- and Y-axis labels. For example, you can change the font size or the text of the labels, or rotate them to make them easier to read.
df = pd.DataFrame(data=dataframe, columns=['FFMC', 'DMC', 'DC', 'ISI'])
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(df))
boxplot.axes.set_title("Forest fire indicator distribution", fontsize=16)
boxplot.set_xlabel("Indicators", fontsize=14)
boxplot.set_ylabel("Values", fontsize=14);
Changing the order in which boxes are displayed
To display the box plots in a certain order, use the order argument, which takes a list of column names in the order in which you want them to be arranged:
df = pd.DataFrame(data=dataframe, columns=["FFMC", "DMC", "DC", "ISI"])
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(df), order=["DC", "DMC", "FFMC", "ISI"])
boxplot.axes.set_title("Forest fire distribution", fontsize=16)
boxplot.set_xlabel("Indicators", fontsize=14)
boxplot.set_ylabel("Values", fontsize=14);
Creating subplots with Matplotlib
If you want to split a combined box plot into separate plots for individual features, you can do so. Define the figure (fig) and the desired number of axes (axes) using the subplots() function from Matplotlib. Each area of the axes object can be accessed by its index. The boxplot() function takes an ax argument that specifies which of these axes to draw on:
fig, axes = plt.subplots(1, 2)
sns.boxplot(x=dataframe["day"], y=DMC, orient='v', ax=axes[0])
sns.boxplot(x=dataframe["day"], y=DC, orient='v', ax=axes[1]);
Box Plot with Scatter Plot
For a clearer view of the distribution, you can overlay a strip plot (a scatter of the individual observations) on the box plot. To this end, we create the two diagrams one after the other. The diagram created by stripplot() is superimposed on the box plot, since both are drawn on the same axes:
df = pd.DataFrame(data=dataframe, columns=["DC", "DMC"])
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(df), order=["DC", "DMC"])
boxplot = sns.stripplot(x="variable", y="value", data=pd.melt(df), marker="o", alpha=0.3, color="black", order=["DC", "DMC"])
boxplot.axes.set_title("Forest fire distribution", fontsize=16)
boxplot.set_xlabel("Indicators", fontsize=14)
boxplot.set_ylabel("Values", fontsize=14);
We looked at several ways to build a box plot using Seaborn and Python. We also learned how to set colors, axis labels, and the order of the boxes, how to overlay a strip plot, and how to draw separate plots for individual features. The last type of chart worth mentioning is the Violin Plot.
Building a Violin Plot in Seaborn
Violin plots are used to visualize the distribution of data by displaying the data range, the median, and the shape of the distribution. Like box plots, they show summary statistics. In addition, they include a density plot, which determines the shape of the violin.
Importing Data
For our example, let’s use the Gapminder dataset, which contains information on population, life expectancy, and other data by country and year since 1952. We import pandas, seaborn, and the pyplot module from matplotlib:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Next, we load the dataset and see what it consists of.
dataframe = pd.read_csv(
"Downloads/gapminder_full.csv",
on_bad_lines="skip",  # replaces error_bad_lines=False, which pandas 2.0 removed
encoding="ISO-8859-1"
)
dataframe.head()
The result is:
       country  year  population continent  life_exp     gdp_cap
0  Afghanistan  1952     8425333      Asia    28.801  779.445314
1  Afghanistan  1957     9240934      Asia    30.332  820.853030
2  Afghanistan  1962    10267083      Asia    31.997  853.100710
3  Afghanistan  1967    11537966      Asia    34.020  836.197138
4  Afghanistan  1972    13079460      Asia    36.088  739.981106
Let’s define the features that we are going to visualize. For convenience, we store them in variables with the same names.
country = dataframe.country
continent = dataframe.continent
population = dataframe.population
life_exp = dataframe.life_exp
gdp_cap = dataframe.gdp_cap
Drawing a simple violin diagram
Now that we have loaded the data and chosen the values we want to visualize, we can create a violin diagram. Let’s use the violinplot() function, to which we pass the variable for visualization as the x argument. The y-axis values will be calculated automatically.
sns.violinplot(x=life_exp);
Note that you don’t have to select the column beforehand and store it in the life_exp variable. If you pass the dataset to the data argument and the column name "life_exp" to the x argument, you get exactly the same result.
sns.violinplot(x="life_exp", data=dataframe);
Note that in this image, Seaborn plots the distribution of life expectancy for all countries at once, because only the life_exp variable was used. In most cases, a variable like this is examined in relation to other variables, such as country or continent in our case.
Construction of the Violin Plot with the X and Y axes
To visualize the distribution of the data segmented by category, use a categorical variable for x and a continuous variable for y as function arguments. There are many countries in this dataset. If you plotted all of them, there would be too many to take in. You could, of course, select a subset of the dataset and plot, say, just 10 countries. Instead, let’s build a violin plot for the continents.
sns.violinplot(x=continent, y=life_exp, data=dataframe);
Changing the Chart Title and Axis Labels
Suppose we need to change the title and labels of our chart to make it easier to analyze. Although Seaborn labels the X and Y axes automatically, you can change the title and labels once you have the Axes object that violinplot() returns. Pass the title you want to give the chart to set_title(). To label the axes, use the set() function with the xlabel and ylabel arguments or the wrapper functions set_xlabel()/set_ylabel():
ax = sns.violinplot(x=continent, y=life_exp)
ax.set_title("Life expectancy by continent")
ax.set_ylabel("Life expectancy")
ax.set_xlabel("Continent");
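The set() variant mentioned above does the same in a single call. The sketch below uses a small made-up frame instead of the Gapminder file (the numbers are placeholders), so that it can be run on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the continent and life_exp columns
toy = pd.DataFrame({
    "continent": ["Asia"] * 5 + ["Europe"] * 5,
    "life_exp": [28.8, 30.3, 32.0, 34.0, 36.1, 55.2, 59.3, 64.8, 66.2, 67.7],
})

ax = sns.violinplot(x="continent", y="life_exp", data=toy)

# Title and both axis labels in one call
ax.set(
    title="Life expectancy by continent",
    xlabel="Continent",
    ylabel="Life expectancy",
)
```

Whether to use set() or the individual set_xlabel()/set_ylabel() calls is a matter of taste; the individual calls are handier when you also want to pass text properties such as fontsize.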
Changing the color of a chart
To change the color of the diagrams, you can create a list of pre-selected colors and pass it to violinplot() as the palette argument:
colors_list = [
'#78C850', '#F08030', '#6890F0',
'#A8B820', '#F8D030', '#E0C068',
'#C03028', '#F85888', '#98D8D8'
]
ax = sns.violinplot(x=continent, y=life_exp, palette=colors_list)
ax.set_title("Life expectancy by continent")
ax.set_ylabel("Life expectancy")
ax.set_xlabel("Continent");
Violin Plot with Scatter Diagram
A scatterplot can be overlaid on a violin diagram to see the placement of the points that make up that distribution. To do this, simply create one drawing area, and then create two diagrams in it in sequence.
colors_list = [
'#78C850', '#F08030', '#6890F0',
'#A8B820', '#F8D030', '#E0C068',
'#C03028', '#F85888', '#98D8D8'
]
plt.figure(figsize=(16,8))
sns.violinplot(x=continent, y=life_exp,palette=colors_list)
sns.swarmplot(x=continent, y=life_exp, color="k", alpha=0.8)
plt.title("Life expectancy by continent")
plt.ylabel("Life expectancy")
plt.xlabel("Continent");
Changing the style of the violin diagram
You can easily change the style and color of the diagram using set_style() and set_palette(), respectively. Seaborn supports several different options for the style and color palette of charts:
plt.figure(figsize=(16,8))
sns.set_palette("RdBu")
sns.set_style("darkgrid")
sns.violinplot(x=continent, y=life_exp)
sns.swarmplot(x=continent, y=life_exp, color="k", alpha=0.8)
plt.title("Life expectancy by continent")
plt.ylabel("Life expectancy")
plt.xlabel("Continent");
Constructing a Violin Plot for Different Features
If you want to draw each column of the dataset in its own plot, you can do so. Create a figure and a grid with a cell for each plot. The add_subplot() function, which is passed the position of a cell in the grid, creates the axes for that cell. Then create the diagram as usual; it is drawn on the most recently created axes. You can use y=variable or data=variable.
fig = plt.figure(figsize=(6, 4))
gs = fig.add_gridspec(1, 3)
ax = fig.add_subplot(gs[0, 0])
sns.violinplot(data=population)
ax.set_xlabel("Population")
ax = fig.add_subplot(gs[0, 1])
sns.violinplot(data=life_exp)
ax.set_xlabel("Life span")
ax = fig.add_subplot(gs[0, 2])
sns.violinplot(data=gdp_cap)
ax.set_xlabel("GDP volume")
fig.tight_layout()
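The example above relies on each violin being drawn on the axes created just before it. An alternative sketch, shown here on made-up data, passes the target axes explicitly through the ax argument; the random values and the labels dictionary are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)

# Hypothetical stand-in for three Gapminder columns
toy = pd.DataFrame({
    "population": rng.lognormal(16, 1, size=100),
    "life_exp": rng.normal(60, 10, size=100),
    "gdp_cap": rng.lognormal(8, 1, size=100),
})
labels = {
    "population": "Population",
    "life_exp": "Life expectancy",
    "gdp_cap": "GDP per capita",
}

fig, axes = plt.subplots(1, 3, figsize=(9, 4))
for ax, column in zip(axes, labels):
    # ax=ax states explicitly which panel each violin belongs to
    sns.violinplot(y=toy[column], ax=ax)
    ax.set_xlabel(labels[column])
    ax.set_ylabel("")
fig.tight_layout()
```

Passing ax explicitly makes the code independent of the order in which the axes were created, which helps once the layout grows.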
Grouping violinplots by categorical variable
A really useful feature of violinplot() is grouping by the values of a categorical variable. For example, if there is a categorical variable that has two values (usually True/False), you can group the charts by those values. Suppose there is a population employment dataset with an employment column whose values are employed and unemployed. Then you can group the graphs by type of employment. Since the Gapminder dataset has no column suitable for such grouping, we can create one by comparing life expectancy with the dataset average for a certain subset of countries, for example, European countries. We assign a Yes/No value to the new above_average_life_exp column for each record. If the life expectancy is higher than the dataset average, the value is Yes; otherwise it is No:
# Select the European countries; copy() gives an independent DataFrame to add a column to
europe = dataframe.loc[dataframe["continent"] == "Europe"].copy()
# Calculate the average value of the "life_exp" variable over the whole dataset
avg_life_exp = dataframe["life_exp"].mean()
# Add the new column, mapping the boolean comparison to Yes/No labels
europe["above_average_life_exp"] = (europe["life_exp"] > avg_life_exp).map(
    {True: "Yes", False: "No"}
)
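The same Yes/No flag can be reproduced on a few rows to see exactly what ends up in the new column. In this sketch the data is made up (the Austria values in particular are placeholders), and np.where() is used as one possible way to turn the comparison into labels.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Gapminder data
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Austria", "Austria"],
    "continent": ["Europe"] * 4,
    "life_exp": [55.23, 64.82, 66.80, 79.83],
})

avg_life_exp = toy["life_exp"].mean()  # 66.67 for these four rows

# Work on a copy so the original frame stays untouched
flagged = toy.copy()
flagged["above_average_life_exp"] = np.where(
    flagged["life_exp"] > avg_life_exp, "Yes", "No"
)
print(flagged)
```

The first two rows are below the average and get No, the last two are above it and get Yes.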
Now, if we output our dataset, we get the following:
    country  year  population continent  life_exp      gdp_cap above_average_life_exp
12  Albania  1952     1282697    Europe     55.23  1601.056136                     No
13  Albania  1957     1476505    Europe     59.28  1942.284244                     No
14  Albania  1962     1728137    Europe     64.82  2312.888958                    Yes
15  Albania  1967     1984060    Europe     66.22  2760.196931                    Yes
16  Albania  1972     2263554    Europe     67.69  3313.422188                    Yes
We can now plot the violin charts, grouped by the new column we inserted. Given that there are many European countries, let’s select the last 50 rows using europe.tail() for easy visualization:
europe = europe.tail(50)
ax = sns.violinplot(x=europe.country, y=europe.life_exp, hue=europe.above_average_life_exp)
ax.set_title("Life expectancy by country")
ax.set_ylabel("Life expectancy")
ax.set_xlabel("Countries");
In the resulting plot, records with below-average life expectancy differ in color from those with above-average life expectancy.
Separating violinplots by categorical variable
If you use the hue argument with a categorical variable that has two values, you can set the split argument of violinplot() to True to split each violin in half by the hue value. In our case, one side of the violin will represent records with above-average life expectancy, while the other side will show below-average life expectancy:
europe = europe.tail(50)
ax = sns.violinplot(
x=europe.country,
y=europe.life_exp,
hue=europe.above_average_life_exp,
split=True
)
ax.set_title("Life expectancy by country")
ax.set_ylabel("Life expectancy")
ax.set_xlabel("Countries");
We have looked at several ways to construct a Violin Plot in Seaborn. This is the last type of plot covered here. In this article, we looked at examples of plotting:
- Bar Plot
- Scatter Plot
- Box Plot
- Heatmap
- Violin Plot