Example of python data analysis: Coronavirus

by Alex
This is a short analysis to get a sense of the disruption caused by the coronavirus: a few charts and summary statistics to give a general picture.

Data – Novel Corona Virus 2019 Dataset

Importing the necessary libraries.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import random
def random_colours(number_of_colors):
    """
    A simple function for generating random colors.
    number_of_colors is an integer indicating
    the number of colors that will be generated.
    Returns colors in the following format: ['#E86DA4'].
    """
    colors = []
    for i in range(number_of_colors):
        colors.append('#' + ''.join([random.choice('0123456789ABCDEF') for j in range(6)]))
    return colors
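
If the charts need the same palette on every run, one option is a seeded variant of the helper. This is only a sketch: the name seeded_colours and the seed parameter are hypothetical and are not part of the original notebook.

```python
import random
import re


def seeded_colours(number_of_colors, seed=0):
    """Hypothetical variant of random_colours that accepts a seed."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    return ['#' + ''.join(rng.choice('0123456789ABCDEF') for _ in range(6))
            for _ in range(number_of_colors)]


palette = seeded_colours(3, seed=42)
print(palette)

# Every entry should look like '#E86DA4': a hash sign plus six hex digits.
all_valid = all(re.fullmatch(r'#[0-9A-F]{6}', c) for c in palette)
print(all_valid)
```

Calling seeded_colours again with the same seed returns the same list, which keeps colors stable between notebook runs.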

Statistical analysis

data = pd.read_csv('/novel-corona-virus-2019-dataset/2019_nCoV_data.csv')

Sno Province/State Country Last Update Confirmed Deaths Recovered
0 1 Anhui China 1/22/2020 12:00 1.0 0.0 0.0
1 2 Beijing China 1/22/2020 12:00 14.0 0.0 0.0
2 3 Chongqing China 1/22/2020 12:00 6.0 0.0 0.0
3 4 Fujian China 1/22/2020 12:00 1.0 0.0 0.0
4 5 Gansu China 1/22/2020 12:00 0.0 0.0 0.0
<class 'pandas.core.frame.DataFrame'>

RangeIndex: 497 entries, 0 to 496
Data columns (7 columns in total):
Sno 497 non-null int64
Province/State 393 non-null object
Country 497 non-null object
Last Update 497 non-null object
Confirmed 497 non-null float64
Deaths 497 non-null float64
Recovered 497 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 27.3 KB
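
The two outputs above look like the results of head() and info(), although the calls themselves are not shown. A minimal sketch, assuming that is how they were produced, using three rows copied from the table above in place of the full CSV:

```python
import pandas as pd

# Three rows copied from the table above; the notebook reads the full CSV.
sample = pd.DataFrame({
    'Sno': [1, 2, 3],
    'Province/State': ['Anhui', 'Beijing', 'Chongqing'],
    'Country': ['China', 'China', 'China'],
    'Last Update': ['1/22/2020 12:00'] * 3,
    'Confirmed': [1.0, 14.0, 6.0],
    'Deaths': [0.0, 0.0, 0.0],
    'Recovered': [0.0, 0.0, 0.0],
})

print(sample.head())  # first rows of the frame (five at most)
sample.info()         # dtypes and non-null counts per column
```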

Summary statistics for the numeric columns.

Sno Confirmed Deaths Recovered
count 434.000000 434.000000 434.000000 434.000000
mean 217.500000 80.762673 1.847926 1.525346
std 125.429263 424.706068 15.302792 9.038054
min 1.000000 0.000000 0.000000 0.000000
25% 109.250000 2.000000 0.000000 0.000000
50% 217.500000 7.000000 0.000000 0.000000
75% 325.750000 36.000000 0.000000 0.000000
max 434.000000 5806.000000 204.000000 116.000000

Summary statistics for the non-numeric columns.

Province/State Country Last Update
count 393 497 497
unique 45 31 13
top Ningxia Mainland China 1/31/2020 19:00
freq 10 274 63
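
Both summary tables have the shape of describe() output. A minimal sketch of the two calls on a tiny made-up frame (the rows are illustrative, not taken from the dataset); selecting the text columns with exclude=['number'] is one way to do it and is an assumption about the original notebook:

```python
import pandas as pd

# Made-up rows, only to show the two calls.
sample = pd.DataFrame({
    'Province/State': ['Anhui', 'Beijing', None],
    'Country': ['China', 'China', 'Japan'],
    'Confirmed': [1.0, 14.0, 2.0],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = sample.describe()

# Everything that is not numeric: count, unique, top, freq.
text_summary = sample.describe(exclude=['number'])

print(numeric_summary)
print(text_summary)
```

Missing values are left out of count, which is why Province/State shows fewer entries than Country in the table above.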

Convert Last Update data to datetime

data['Last Update'] = pd.to_datetime(data['Last Update'])

Add Day and Hour columns

data['Day'] = data['Last Update'].apply(lambda x:x.day)
data['Hour'] = data['Last Update'].apply(lambda x:x.hour)
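
The same two columns can also be built with the vectorized .dt accessor instead of apply with a lambda. A minimal sketch on two timestamps that appear in this article; the explicit format string is an assumption about how the whole file is formatted:

```python
import pandas as pd

# Two timestamps that appear in the tables of this article.
sample = pd.DataFrame({'Last Update': ['1/22/2020 12:00', '1/31/2020 19:00']})

# Assumption: every value follows month/day/year hour:minute.
sample['Last Update'] = pd.to_datetime(sample['Last Update'],
                                       format='%m/%d/%Y %H:%M')

# .dt exposes the datetime parts for the whole column at once.
sample['Day'] = sample['Last Update'].dt.day
sample['Hour'] = sample['Last Update'].dt.hour

print(sample)
```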

Data only for the 30th day of January.

data[data['Day'] == 30]
data[data['Day'] == 30].sum()
Sno 23895
Country Mainland ChinaMainland ChinaMainland ChinaMain...
Confirmed 9776
Deaths 213
Recovered 187
Day 1770
Hour 1239
dtype: object

We can see that the number of confirmed cases for Hubei Province in China is 5,806 as of the 30th. The totals for confirmed cases, deaths and recoveries match the official figures for January 30. This means that Confirmed is cumulative: it already includes people affected on previous dates. We create a dataset with data for January 30 only.

latest_data = data[data['Day'] == 30]
print('Confirmed cases (entire world): ', latest_data['Confirmed'].sum())
print('Deaths (entire world): ', latest_data['Deaths'].sum())
print('Recovered (entire world): ', latest_data['Recovered'].sum())
Confirmed cases (entire world): 9776.0
Deaths (entire world): 213.0
Recovered (entire world): 187.0

The dataset matches the official figures. Let’s see how the coronavirus spread over time.


Over time, there has been an exponential increase in the number of coronavirus victims.
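
The code behind this chart is not shown. One way to draw a trend like it is to sum Confirmed per day and plot the result. The sketch below uses made-up numbers and assumes one snapshot per province per day; with several updates on the same day, a plain sum would count cumulative totals more than once.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, only to show the shape of the code.
sample = pd.DataFrame({
    'Day': [22, 22, 23, 23, 24],
    'Confirmed': [1.0, 14.0, 9.0, 22.0, 60.0],
})

# Total confirmed cases per day across all rows.
daily = sample.groupby('Day')['Confirmed'].sum()

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(daily.index, daily.values, marker='o')
ax.set_xlabel('Day of January 2020')
ax.set_ylabel('Confirmed cases')
ax.set_title('Confirmed cases over time')
# In a notebook the figure is displayed automatically after this cell.
```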


Deep exploratory data analysis (EDA)

  • Mainland China has non-zero values for recoveries and deaths, which can be examined later by creating a separate data set

Provinces and regions with no reported cases.

  • Interestingly, there are parts of mainland China that have not yet been affected by the virus.
  • There are countries without confirmed cases of infection, and we will discard them.
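
The listing of places without reported cases is not shown. A minimal sketch of how such rows could be selected with a boolean mask, using three rows from the first table of this article:

```python
import pandas as pd

# Three rows taken from the first table of this article.
sample = pd.DataFrame({
    'Province/State': ['Gansu', 'Beijing', 'Anhui'],
    'Confirmed': [0.0, 14.0, 1.0],
})

# Boolean mask: True where no case has been confirmed.
no_cases_mask = sample['Confirmed'] == 0
no_cases = sample[no_cases_mask]

print(no_cases)
```

Inverting the mask with ~no_cases_mask selects the remaining rows, which is equivalent to comparing with != 0.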

Provinces and regions with at least 1 reported case.

data = data[data['Confirmed'] != 0]

Number of people infected in different countries.


  1. The graph confirms what we already know: the virus has affected mainland China the most, but cases reported in neighboring countries indicate that it is spreading.
  2. There are also confirmed cases in countries as far away as the US, Thailand, Japan, etc. I wonder how the virus got there. My guess is that someone was in Wuhan or nearby when the virus spread and carried it home with them. This outbreak is really dangerous.
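
The per-country chart discussed above needs one total per country. A minimal sketch of that aggregation with made-up numbers; how the original chart was actually built is not shown, so this is only one plausible route:

```python
import pandas as pd

# Made-up numbers for a single day's snapshot.
sample = pd.DataFrame({
    'Country': ['Mainland China', 'Mainland China', 'US', 'Japan'],
    'Confirmed': [10.0, 5.0, 2.0, 3.0],
})

# One total per country, largest first.
by_country = (sample.groupby('Country')['Confirmed']
                    .sum()
                    .sort_values(ascending=False))

print(by_country)
```

A Series like by_country can be passed straight to a bar chart, for example with its plot(kind='bar') method.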

The number of infected in different regions.

import plotly.express as px
fig = px.bar(data, x='Province/State', y='Confirmed')

Analysis of coronavirus growth in each country

pivoted = pd.pivot_table(data, values='Confirmed', columns='Country', index='Day')
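
One detail worth knowing here: pd.pivot_table aggregates with the mean by default, so a country with several rows on the same day (for example one row per province) is shown as the average of those rows, not their total. The sketch below uses made-up numbers and placeholder country names A and B to show the difference; whether sum or mean suits the real data depends on how many updates per day it contains.

```python
import pandas as pd

# Made-up numbers; 'A' has two rows on day 22, 'B' appears only on day 23.
sample = pd.DataFrame({
    'Day': [22, 22, 23, 23],
    'Country': ['A', 'A', 'A', 'B'],
    'Confirmed': [10.0, 30.0, 50.0, 4.0],
})

# Default aggregation is the mean of the rows in each (Day, Country) cell.
mean_table = pd.pivot_table(sample, values='Confirmed',
                            columns='Country', index='Day')

# Passing aggfunc='sum' adds the rows up instead.
sum_table = pd.pivot_table(sample, values='Confirmed',
                           columns='Country', index='Day', aggfunc='sum')

print(mean_table)
print(sum_table)
```

Cells with no data, such as country B on day 22, come out as NaN in both tables.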

Visualization of the outbreak in provinces/regions

pivoted = pd.pivot_table(data, values='Confirmed', columns='Province/State', index='Day')

  • Hubei is the most affected province.
  • Confirmed cases are also trending upward, and the situation seems to be getting worse.

Now let’s look at the countries that were affected at the start and the countries the coronavirus has reached since.

data[data['Day'] == 22]['Country'].unique()
array(['China', 'US', 'Japan', 'Thailand', 'South Korea'], dtype=object)

So, on the first day, January 22, infections were found in China, the United States, Japan, Thailand and South Korea.

temp = data[data['Day'] == 22]

Let’s look at the latest data.

data[data['Day'] == 30]['Country'].unique()
array(['Mainland China', 'Hong Kong', 'Taiwan', 'Macau', 'US', 'Japan',
       'Thailand', 'South Korea', 'Singapore', 'Vietnam', 'France',
       'Nepal', 'Malaysia', 'Canada', 'Cambodia', 'Sri Lanka',
       'Australia', 'Germany', 'Finland', 'United Arab Emirates',
       'Philippines', 'India', 'Italy'], dtype=object)

Here we see that the outbreak had spread to 23 countries and regions by January 30.
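
To see which places were added between the two dates, the two lists can be compared as sets. The sketch below copies the labels from the arrays above and assumes that 'China' on January 22 and 'Mainland China' on January 30 refer to the same place.

```python
# Country labels copied from the two arrays in this article.
first_day = {'China', 'US', 'Japan', 'Thailand', 'South Korea'}
latest_day = {
    'Mainland China', 'Hong Kong', 'Taiwan', 'Macau', 'US', 'Japan',
    'Thailand', 'South Korea', 'Singapore', 'Vietnam', 'France',
    'Nepal', 'Malaysia', 'Canada', 'Cambodia', 'Sri Lanka',
    'Australia', 'Germany', 'Finland', 'United Arab Emirates',
    'Philippines', 'India', 'Italy',
}

# Assumption: the older label 'China' maps to the newer 'Mainland China'.
first_day_normalized = {'Mainland China' if c == 'China' else c
                        for c in first_day}

# Set difference: entries on January 30 that were absent on January 22.
newly_affected = sorted(latest_day - first_day_normalized)

print(len(latest_day), len(newly_affected))
print(newly_affected)
```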

Consider only mainland China

data_main_china = latest_data[latest_data['Country']=='Mainland China']

Let’s calculate the percentage of deaths.

(data_main_china['Deaths'].sum() / data_main_china['Confirmed'].sum())*100

Now the percentage of recoveries.

(data_main_china['Recovered'].sum() / data_main_china['Confirmed'].sum())*100
  • We can see that the mortality rate from coronavirus is about 2%, so by this measure it is less deadly than some other viral outbreaks.
  • Because few recoveries have been reported so far, the recovery rate is only 1.87%, which is alarming. That figure could still rise considerably, because about 96% of confirmed cases currently fall into neither group.
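
The arithmetic behind these percentages can be checked by hand. The sketch below uses the worldwide totals for January 30 printed earlier in this article (9776 confirmed, 213 deaths, 187 recovered), not the mainland China subset, so the results differ slightly from the figures quoted in the bullets.

```python
# Worldwide totals for January 30, as printed earlier in this article.
confirmed = 9776.0
deaths = 213.0
recovered = 187.0

death_rate = deaths / confirmed * 100        # share of confirmed cases that died
recovery_rate = recovered / confirmed * 100  # share of confirmed cases that recovered

# Cases that are neither recovered nor dead yet.
unresolved_rate = 100 - death_rate - recovery_rate

print(round(death_rate, 2), round(recovery_rate, 2), round(unresolved_rate, 2))
```

The unresolved share is why both rates are provisional: most confirmed cases had no recorded outcome at that point.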

Where most of the deaths occurred

Province/State Deaths
12 Hubei 204.0
10 Heilongjiang 2.0
11 Henan 2.0
9 Hebei 1.0
1 Beijing 1.0
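
A ranking like the one above can be produced by grouping on province, summing the deaths and sorting. The code that built the table is not shown, so this is a sketch under that assumption, with made-up rows chosen so that Hubei adds up to the 204 deaths listed above.

```python
import pandas as pd

# Made-up rows; only the Hubei total (204) matches the table above.
sample = pd.DataFrame({
    'Province/State': ['Hubei', 'Henan', 'Hubei', 'Beijing'],
    'Deaths': [100.0, 2.0, 104.0, 1.0],
})

top_deaths = (sample.groupby('Province/State')['Deaths']
                    .sum()
                    .reset_index()
                    .sort_values(by='Deaths', ascending=False)
                    .head(5))

print(top_deaths)
```

Because reset_index runs before the sort, the index keeps the alphabetical position of each province, which would explain index values such as 12 and 10 in the table above.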

Number of deaths by day.


Graph of confirmed cases in mainland China.

pivoted = pd.pivot_table(data[data['Country']=='Mainland China'] , values='Confirmed', columns='Province/State', index='Day')

pivoted = pd.pivot_table(data, values='Deaths', columns='Province/State', index='Day')

What’s next

Download coronavirus.ipynb and the data from the link at the beginning of this article. Try building your own graphs and tables.
