This is a short analysis to get a sense of the disruption caused by the coronavirus, with a few charts and summary statistics for a general picture.
Importing the necessary libraries.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('paper')
import random
def random_colours(number_of_colors):
    '''
    A simple function for generating random colors.

    Input
        number_of_colors: integer, the number of colors to generate.
    Output
        List of color strings in hex format, e.g. ['#E86DA4'].
    '''
    colors = []
    for i in range(number_of_colors):
        # Build one '#RRGGBB' string from six random hex digits.
        colors.append('#' + ''.join(random.choice('0123456789ABCDEF') for j in range(6)))
    return colors
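Because the helper above draws from the global random state, the colors change on every run. If reproducible charts are wanted, a seeded variant is one option. This is a minimal sketch; the name `seeded_colours` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import random

def seeded_colours(number_of_colors, seed=0):
    """Return random hex colour strings that are identical for a given seed."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return ['#' + ''.join(rng.choice('0123456789ABCDEF') for _ in range(6))
            for _ in range(number_of_colors)]

palette = seeded_colours(3, seed=42)
```

Calling `seeded_colours(3, seed=42)` again returns the same three colors, which keeps plots comparable between runs.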
Statistical analysis
data = pd.read_csv('/novel-corona-virus-2019-dataset/2019_nCoV_data.csv')
data.head()
|   | Sno | Province/State | Country | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Anhui | China | 1/22/2020 12:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | Beijing | China | 1/22/2020 12:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | Chongqing | China | 1/22/2020 12:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | Fujian | China | 1/22/2020 12:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | Gansu | China | 1/22/2020 12:00 | 0.0 | 0.0 | 0.0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 7 columns):
Sno 497 non-null int64
Province/State 393 non-null object
Country 497 non-null object
Last Update 497 non-null object
Confirmed 497 non-null float64
Deaths 497 non-null float64
Recovered 497 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 27.3 KB
Summary statistics for the numeric columns.
data.describe()
|   | Sno | Confirmed | Deaths | Recovered |
|---|---|---|---|---|
| count | 434.000000 | 434.000000 | 434.000000 | 434.000000 |
| mean | 217.500000 | 80.762673 | 1.847926 | 1.525346 |
| std | 125.429263 | 424.706068 | 15.302792 | 9.038054 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 109.250000 | 2.000000 | 0.000000 | 0.000000 |
| 50% | 217.500000 | 7.000000 | 0.000000 | 0.000000 |
| 75% | 325.750000 | 36.000000 | 0.000000 | 0.000000 |
| max | 434.000000 | 5806.000000 | 204.000000 | 116.000000 |
Summary statistics for the non-numeric columns.
data.describe(include="O")
|   | Province/State | Country | Last Update |
|---|---|---|---|
| count | 393 | 497 | 497 |
| unique | 45 | 31 | 13 |
| top | Ningxia | Mainland China | 1/31/2020 19:00 |
| freq | 10 | 274 | 63 |
Convert the Last Update column to datetime.
data['Last Update'] = pd.to_datetime(data['Last Update'])
Add Day and Hour columns.
data['Day'] = data['Last Update'].dt.day    # day of the month
data['Hour'] = data['Last Update'].dt.hour  # hour of the day
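The Day column keeps only the day of the month, so dates from different months would share a value if the data ever ran past January. The sketch below shows one way to guard against that on a toy frame: parse with an explicit format and keep a full calendar date next to Day and Hour. The `toy` frame, the `Date` column and the February row are illustrative assumptions; the format string is inferred from the sample rows above and may not match every row of the real file.

```python
import pandas as pd

# Toy rows in the 'm/d/YYYY H:MM' layout seen in the table above.
# The February row is synthetic and only demonstrates the month rollover.
toy = pd.DataFrame({'Last Update': ['1/22/2020 12:00', '1/31/2020 19:00', '2/1/2020 10:00']})

# An explicit format avoids per-value guessing; errors='coerce' turns
# values that do not match into NaT instead of raising.
toy['Last Update'] = pd.to_datetime(toy['Last Update'],
                                    format='%m/%d/%Y %H:%M', errors='coerce')

toy['Day'] = toy['Last Update'].dt.day
toy['Hour'] = toy['Last Update'].dt.hour
# normalize() resets the time to midnight but keeps year and month,
# so 1 February does not collide with 1 January.
toy['Date'] = toy['Last Update'].dt.normalize()
```

Grouping by `Date` instead of `Day` would then stay correct across month boundaries.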
Data only for the 30th day of January.
data[data['Day'] == 30]
data[data['Day'] == 30].sum()
Sno 23895
Country Mainland ChinaMainland ChinaMainland ChinaMain...
Confirmed 9776
Deaths 213
Recovered 187
Day 1770
Hour 1239
dtype: object
We can see that Hubei Province in China has 5,806 confirmed cases as of the 30th. The totals for confirmed cases, deaths, and recoveries match the official figures for January 30. This means that Confirmed is cumulative: it already includes people affected on previous dates. We create a dataset with data for January 30 only.
latest_data = data[data['Day'] == 30]
latest_data.head()
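Since Confirmed is a running total, the number of new cases per day is not stored directly. It can be recovered by differencing consecutive totals. The sketch below uses a small synthetic series (the numbers are illustrative, not taken from the dataset) and the hypothetical names `cumulative` and `new_cases`.

```python
import pandas as pd

# Synthetic cumulative totals indexed by day of the month (illustrative only).
cumulative = pd.Series([10, 25, 60, 140], index=[27, 28, 29, 30], name='Confirmed')

# diff() subtracts the previous day's total. The first day has no predecessor,
# so its NaN is replaced with that day's own cumulative value.
new_cases = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)
```

The daily values add back up to the last cumulative total, which is a quick sanity check on any real series treated this way.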
print('Confirmed cases (entire world): ', latest_data['Confirmed'].sum())
print('Deaths (entire world): ', latest_data['Deaths'].sum())
print('Recovered (entire world): ', latest_data['Recovered'].sum())
Confirmed cases (entire world): 9776.0
Deaths (entire world): 213.0
Recovered (entire world): 187.0
The dataset matches the official figures. Let’s see how the coronavirus spread over time.
plt.figure(figsize=(16,6))
data.groupby('Day')['Confirmed'].sum().plot();  # select the column first so only numeric data is summed
The number of confirmed cases rises sharply over time, in a pattern that looks roughly exponential.
plt.figure(figsize=(16,6))
sns.barplot(x='Day',y='Confirmed',data=data);
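Whether growth is roughly exponential can be checked by fitting a straight line to the logarithm of the counts: exponential growth is linear in log space, and the slope gives the growth rate. This is a minimal sketch on a synthetic series that doubles every day. The helper name `fit_exponential` is hypothetical, and the numbers are not taken from the dataset.

```python
import numpy as np

def fit_exponential(days, counts):
    """Least-squares fit of counts ~ a * exp(r * day) via a line in log space.

    All counts must be strictly positive, because the logarithm is taken.
    """
    days = np.asarray(days, dtype=float)
    counts = np.asarray(counts, dtype=float)
    r, log_a = np.polyfit(days, np.log(counts), 1)  # slope, intercept
    return np.exp(log_a), r

# Synthetic series that doubles every day (illustrative only).
days = [0, 1, 2, 3, 4]
counts = [100, 200, 400, 800, 1600]
a, r = fit_exponential(days, counts)
doubling_time = np.log(2) / r  # days needed for the count to double
```

On real data the fit will not be exact, so plotting the fitted line over the observed points on a log-scaled axis shows how far the exponential description holds.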
Deep exploratory data analysis (EDA)
latest_data.groupby('Country')[['Confirmed', 'Deaths', 'Recovered']].sum()
- Mainland China has non-zero values for recoveries and deaths, which can be examined later by creating a separate dataset.
Provinces and regions with no reported cases.
data[data['Confirmed']==0]
- Interestingly, there are parts of mainland China that the virus has not yet reached.
- Some entries have no confirmed cases of infection, so we will discard them.
Provinces and regions with at least 1 reported case.
data = data[data['Confirmed'] != 0]
Number of people infected in different countries.
plt.figure(figsize=(18,8))
sns.barplot(x='Country',y='Confirmed',data=data)
plt.tight_layout()
- The graph shows what we all know: the virus has affected mainland China the most, but cases are also reported in neighboring countries, indicating that the virus is spreading.
- There are also confirmed cases in countries as far away as the US, Thailand, Japan, etc. I wonder how the virus got there. My guess is that someone was in or near Wuhan while the virus was spreading and carried it home with them. This outbreak is really dangerous.
The number of infected in different regions.
import plotly.express as px
fig = px.bar(data, x='Province/State', y='Confirmed')
fig.show()
Analysis of coronavirus growth in each country
pivoted = pd.pivot_table(data, values='Confirmed', columns='Country', index='Day')
pivoted.plot(figsize=(16,10));
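One detail to keep in mind when reading this chart: `pd.pivot_table` aggregates with the mean by default, so a country with several provinces is plotted as the average per province, not as its total. The sketch below contrasts the default with `aggfunc='sum'` on a synthetic frame; the countries "A" and "B" and their numbers are made up for illustration.

```python
import pandas as pd

# Synthetic rows: country A reports two provinces on the same day.
toy = pd.DataFrame({
    'Day': [30, 30, 30],
    'Country': ['A', 'A', 'B'],
    'Confirmed': [100, 20, 5],
})

# Default aggregation is the mean of the rows in each (Day, Country) cell.
mean_table = pd.pivot_table(toy, values='Confirmed', columns='Country', index='Day')

# aggfunc='sum' gives the country total instead.
sum_table = pd.pivot_table(toy, values='Confirmed', columns='Country', index='Day',
                           aggfunc='sum')
```

For country A the mean is 60 while the sum is 120, so which aggregation is used changes the scale of the plotted line.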
Visualization of the outbreak in provinces/regions
pivoted = pd.pivot_table(data, values='Confirmed', columns='Province/State', index='Day')
pivoted.plot(figsize=(20,15));
- Hubei is the most affected province.
- Confirmed cases are also trending upward, and the situation seems to be getting worse.
Now let’s look at the countries that were affected first and the countries the coronavirus has reached since.
data[data['Day'] == 22]['Country'].unique()
array(['China', 'US', 'Japan', 'Thailand', 'South Korea'], dtype=object)
So, on the first day, January 22, infections were found in China, the United States, Japan, Thailand, and South Korea.
temp = data[data['Day'] == 22]
temp.groupby('Country')['Confirmed'].sum().plot.bar()
Let’s look at the latest data.
data[data['Day'] == 30]['Country'].unique()
array(['Mainland China', 'Hong Kong', 'Taiwan', 'Macau', 'US', 'Japan',
       'Thailand', 'South Korea', 'Singapore', 'Vietnam', 'France',
       'Nepal', 'Malaysia', 'Canada', 'Cambodia', 'Sri Lanka',
       'Australia', 'Germany', 'Finland', 'United Arab Emirates',
       'Philippines', 'India', 'Italy'], dtype=object)
Here we see that the outbreak had spread to 23 countries and regions by January 30.
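To list which places were added between the two dates, the two label arrays can be compared as sets. The sketch below copies the labels from the outputs above. It assumes that 'China' in the January 22 rows and 'Mainland China' in the later rows refer to the same place; that mapping is an assumption made here, not something the dataset states.

```python
# Labels copied from the two arrays printed above.
first_day = ['China', 'US', 'Japan', 'Thailand', 'South Korea']
latest_day = ['Mainland China', 'Hong Kong', 'Taiwan', 'Macau', 'US', 'Japan',
              'Thailand', 'South Korea', 'Singapore', 'Vietnam', 'France',
              'Nepal', 'Malaysia', 'Canada', 'Cambodia', 'Sri Lanka',
              'Australia', 'Germany', 'Finland', 'United Arab Emirates',
              'Philippines', 'India', 'Italy']

# Assumption: the early label 'China' corresponds to 'Mainland China' later on.
rename = {'China': 'Mainland China'}
first_set = {rename.get(country, country) for country in first_day}

# Entries present on January 30 that were absent on January 22.
newly_affected = sorted(set(latest_day) - first_set)
```

Without the rename step, 'Mainland China' would be reported as newly affected only because its label changed.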
Consider only mainland China
data_main_china = latest_data[latest_data['Country']=='Mainland China']
Let’s calculate the percentage of deaths.
(data_main_china['Deaths'].sum() / data_main_china['Confirmed'].sum())*100
2.205425553944916
Now the percentage of recoveries.
(data_main_china['Recovered'].sum() / data_main_china['Confirmed'].sum())*100
1.8533857941602818
- The death rate among confirmed cases is about 2.2%, so at this stage the coronavirus appears less deadly than some other viral outbreaks.
- Few recoveries have been reported so far, so the recovery rate is only about 1.85%, which is scary. The figure could still rise a lot, because about 96% of cases do not yet fall into either group.
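The three shares discussed above (deaths, recoveries, and cases with no outcome yet) can be computed together so that the unresolved share stays visible next to the other two. This is a minimal sketch with a hypothetical helper, `outcome_shares`. The example call uses the worldwide January 30 totals printed earlier (9,776 confirmed, 213 deaths, 187 recovered), not the mainland China subset, so the percentages differ slightly from those above.

```python
def outcome_shares(confirmed, deaths, recovered):
    """Return (death %, recovery %, unresolved %) relative to confirmed cases."""
    death_pct = deaths / confirmed * 100
    recovered_pct = recovered / confirmed * 100
    # Everything that is neither a death nor a recovery is still unresolved.
    unresolved_pct = 100 - death_pct - recovered_pct
    return death_pct, recovered_pct, unresolved_pct

# Worldwide totals for January 30 as printed earlier in the article.
death_pct, recovered_pct, unresolved_pct = outcome_shares(9776, 213, 187)
```

While the unresolved share is this large, both rates are provisional and will shift as cases resolve.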
Where most of the deaths occurred
data_main_china.groupby('Province/State')['Deaths'].sum().reset_index(
).sort_values(by=['Deaths'],ascending=False).head()
|   | Province/State | Deaths |
|---|---|---|
| 12 | Hubei | 204.0 |
| 10 | Heilongjiang | 2.0 |
| 11 | Henan | 2.0 |
| 9 | Hebei | 1.0 |
| 1 | Beijing | 1.0 |
Number of deaths by day.
plt.figure(figsize=(16,6))
data.groupby('Day')['Deaths'].sum().plot();
Confirmed cases by province in mainland China, followed by deaths by province.
pivoted = pd.pivot_table(data[data['Country']=='Mainland China'] , values='Confirmed', columns='Province/State', index='Day')
pivoted.plot(figsize=(20,15))
pivoted = pd.pivot_table(data, values='Deaths', columns='Province/State', index='Day')
pivoted.plot(figsize=(20,15));
What’s next
Download coronavirus.ipynb and the data from the link at the beginning of this article. Try building your own graphs and tables.