2. EDA¶

In the previous chapter, we examined the structure of deep learning models and evaluation metrics for time series analysis. In this chapter, we will explore and visualize the datasets we will be using to perform time series forecasting in the next chapter.

We will use COVID-19 daily case datasets for the time series forecasting. The datasets we are going to use are COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University and Kaggle: Novel Corona Virus 2019 Dataset.

COVID-19 daily cases by country are updated in the repository of Johns Hopkins University every day, along with various types of datasets, which are provided by Johns Hopkins University on its website and Kaggle.

In chapter 2.1, we will download the COVID-19 daily case datasets to see what features are included in them. In 2.2, we will perform EDA (Exploratory Data Analysis) on the datasets, focusing on world countries, and then in 2.3, we will dive into the COVID-19 daily cases in South Korea.

2.1 Download Datasets¶

Firstly, we will download the COVID-19 daily case datasets. Using the data-loader function coded by Pseudo-Lab, we can easily download the datasets.

Download the datasets from the repository of Tutorial-Book-Utils on GitHub by using git clone to save them in the Colab enviroment.

!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils

Cloning into 'Tutorial-Book-Utils'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 24 (delta 6), reused 14 (delta 3), pack-reused 0
Unpacking objects: 100% (24/24), done.

dataset example

Figure 2-1. Downloaded Files in Colab

In Figure 2-1, you can see PL_data_loader.py in the directory, in addition to README.md and utils_ObjectDetection.py. PL_data_loader.py contains the functions to download Google Drive datasets. It is possible to download the COVID-19 daily case datasets with the input of COVIDTimeSeries if we use the parameter –data.

!python Tutorial-Book-Utils/PL_data_loader.py --data COVIDTimeSeries

COVIDTimeSeries.zip is done!

Next, let’s unarchive the downloaded file by using unzip, which is a Linux command. Unnecessary outputs are not printed if we use the command -q.

!unzip -q COVIDTimeSeries.zip

dataset example

Figure 2-2. Downloaded Datasets

Once the COVIDTimesSeries.zip is unarchived, you can see two downloaded files: covid_19_data.csv and time_series_covid19_confirmed_global.csv (Figure 2-2).

We will visualize the global COVID-19 daily cases with covid_19_data.csv based on time series, and we will visualize the daily cases of COVID-19 in South Korea with time_series_covid19_confirmed_global.csv. Let’s first view the saved values after loading each dataset.

import pandas as pd

all = pd.read_csv('covid_19_data.csv')
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')

all contains daily confirmed cases, daily deaths, and daily recovered data. The dataframe of the dataset consists of ObservationDate, Province/State, Country/Region, Confirmed, Deaths, and Recovered.

all

	SNo	ObservationDate	Province/State	Country/Region	Last Update	Confirmed	Deaths	Recovered
0	1	01/22/2020	Anhui	Mainland China	1/22/2020 17:00	1.0	0.0	0.0
1	2	01/22/2020	Beijing	Mainland China	1/22/2020 17:00	14.0	0.0	0.0
2	3	01/22/2020	Chongqing	Mainland China	1/22/2020 17:00	6.0	0.0	0.0
3	4	01/22/2020	Fujian	Mainland China	1/22/2020 17:00	1.0	0.0	0.0
4	5	01/22/2020	Gansu	Mainland China	1/22/2020 17:00	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...
172475	172476	12/06/2020	Zaporizhia Oblast	Ukraine	2020-12-07 05:26:14	36539.0	337.0	6556.0
172476	172477	12/06/2020	Zeeland	Netherlands	2020-12-07 05:26:14	6710.0	104.0	0.0
172477	172478	12/06/2020	Zhejiang	Mainland China	2020-12-07 05:26:14	1295.0	1.0	1288.0
172478	172479	12/06/2020	Zhytomyr Oblast	Ukraine	2020-12-07 05:26:14	31967.0	531.0	22263.0
172479	172480	12/06/2020	Zuid-Holland	Netherlands	2020-12-07 05:26:14	154813.0	2414.0	0.0

172480 rows × 8 columns

confirmed is sequential data showing daily cases by country. Country/Region and Province/State are location data. Long and Lat are longitude and latitude, respectively. MM/DD/YYYY column records daily confirmed cases. The dataframe looks like below.

confirmed

	Province/State	Country/Region	Lat	Long	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	1/28/20	1/29/20	1/30/20	1/31/20	2/1/20	2/2/20	2/3/20	2/4/20	2/5/20	2/6/20	2/7/20	2/8/20	2/9/20	2/10/20	2/11/20	2/12/20	2/13/20	2/14/20	2/15/20	2/16/20	2/17/20	2/18/20	2/19/20	2/20/20	2/21/20	2/22/20	2/23/20	2/24/20	2/25/20	2/26/20	...	11/9/20	11/10/20	11/11/20	11/12/20	11/13/20	11/14/20	11/15/20	11/16/20	11/17/20	11/18/20	11/19/20	11/20/20	11/21/20	11/22/20	11/23/20	11/24/20	11/25/20	11/26/20	11/27/20	11/28/20	11/29/20	11/30/20	12/1/20	12/2/20	12/3/20	12/4/20	12/5/20	12/6/20	12/7/20	12/8/20	12/9/20	12/10/20	12/11/20	12/12/20	12/13/20	12/14/20	12/15/20	12/16/20	12/17/20	12/18/20
0	NaN	Afghanistan	33.939110	67.709953	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	...	42297	42463	42609	42795	42969	43035	43240	43468	43681	43924	44177	44363	44503	44706	44988	45174	45384	45600	45723	45844	46116	46274	46516	46718	46837	46837	47072	47306	47516	47716	47851	48053	48116	48229	48527	48718	48952	49161	49378	49621
1	NaN	Albania	41.153300	20.168300	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	24731	25294	25801	26211	26701	27233	27830	28432	29126	29837	30623	31459	32196	32761	33556	34300	34944	35600	36245	36790	37625	38182	39014	39719	40501	41302	42148	42988	43683	44436	45188	46061	46863	47742	48530	49191	50000	50637	51424	52004
2	NaN	Algeria	28.033900	1.659600	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	...	62693	63446	64257	65108	65975	66819	67679	68589	69591	70629	71652	72755	73774	74862	75867	77000	78025	79110	80168	81212	82221	83199	84152	85084	85927	86730	87502	88252	88825	89416	90014	90579	91121	91638	92102	92597	93065	93507	93933	94371
3	NaN	Andorra	42.506300	1.521800	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	5437	5477	5567	5616	5725	5725	5872	5914	5951	6018	6066	6142	6207	6256	6304	6351	6428	6534	6610	6610	6712	6745	6790	6842	6904	6955	7005	7050	7084	7127	7162	7190	7236	7288	7338	7382	7382	7446	7466	7519
4	NaN	Angola	-11.202700	17.873900	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	12680	12816	12953	13053	13228	13374	13451	13615	13818	13922	14134	14267	14413	14493	14634	14742	14821	14920	15008	15087	15103	15139	15251	15319	15361	15493	15536	15591	15648	15729	15804	15925	16061	16161	16188	16277	16362	16407	16484	16562
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
266	NaN	Vietnam	14.058324	108.277199	0	2	2	2	2	2	2	2	2	2	6	6	8	8	8	10	10	13	13	14	15	15	16	16	16	16	16	16	16	16	16	16	16	16	16	16	...	1215	1226	1252	1253	1256	1265	1281	1283	1288	1300	1304	1305	1306	1307	1312	1316	1321	1331	1339	1341	1343	1347	1351	1358	1361	1361	1365	1366	1367	1377	1381	1385	1391	1395	1397	1402	1405	1405	1407	1410
267	NaN	West Bank and Gaza	31.952200	35.233200	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	58838	59422	60065	60784	61514	62167	63031	63867	64935	66186	67296	68768	70254	71644	73196	75007	76727	78493	80429	81890	83585	85647	88004	90192	92708	94676	96098	98038	99758	101109	102992	104879	106622	108099	109738	111102	113409	115606	117755	119612
268	NaN	Yemen	15.552727	48.516388	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	2071	2071	2071	2071	2072	2072	2072	2078	2081	2083	2086	2090	2093	2099	2107	2114	2124	2137	2148	2160	2177	2191	2197	2217	2239	2267	2304	2337	2383	2078	2079	2081	2082	2083	2083	2084	2085	2085	2087	2087
269	NaN	Zambia	-13.133897	27.849332	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	16971	16997	17036	17056	17093	17097	17123	17187	17243	17280	17350	17373	17394	17424	17454	17466	17535	17553	17569	17589	17608	17647	17665	17700	17730	17857	17898	17916	17931	17963	18062	18091	18161	18217	18274	18322	18428	18456	18504	18575
270	NaN	Zimbabwe	-19.015438	29.154857	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	8561	8610	8667	8696	8765	8786	8829	8897	8945	8981	9046	9120	9172	9220	9308	9398	9508	9623	9714	9822	9822	9950	10129	10129	10424	10547	10617	10718	10839	10912	11007	11081	11162	11219	11246	11358	11522	11749	11866	12047

271 rows × 336 columns

2.2 EDA of World Data¶

Let’s visualize the global COVID-19 daily cases using all. As mentioned above, there are two location data: Province/State and Country/Region. We will add the daily cases of every country based on Country/Region.

group = all.groupby(['ObservationDate', 'Country/Region'])['Confirmed'].sum()
group = group.reset_index()
group.head()

	ObservationDate	Country/Region	Confirmed
0	01/22/2020	Hong Kong	0.0
1	01/22/2020	Japan	2.0
2	01/22/2020	Macau	1.0
3	01/22/2020	Mainland China	547.0
4	01/22/2020	South Korea	1.0

You can now see daily confirmed cases by country. Next, we will visualize world daily cases on the world map with animation effects. We can create the visualization with the plotly.express package.

import plotly as py
import plotly.express as px

With px.choropleth, we create the map layer, and then update dates through .update_layout.

Each parameter is as follows:

location : name of the column which indicates location information in the dataframe
locationmode : the range of countries (‘ISO-3’, ‘USA-states’, and ‘country names’)
color : name of the column which has data to be expressed on the diagram
animation_frame : name of the column for animation effects

Press the play button to see the COVID-19 daily case graphs.

choro_map=px.choropleth(group, 
                    locations="Country/Region", 
                    locationmode = "country names",
                    color="Confirmed", 
                    animation_frame="ObservationDate"
                   )

choro_map.update_layout(
    title_text = 'Global Spread of Coronavirus',
    title_x = 0.5,
    geo=dict(
        showframe = False,
        showcoastlines = False,
    ))
    
choro_map.show()

      
    

You can see that COVID-19 begins in China and then gradually spreads throughout the world. According to the visualization, recently there are a lot of cumulative cases in North America, South America, and India, compared to other countries.

2.3 EDA of South Korean Dataset¶

This time, we will extract the South Korean data from confirmed. The data is cumulative, which means it shows the sum of all cases on a given date.

confirmed[confirmed['Country/Region']=='Korea, South']

	Province/State	Country/Region	Lat	Long	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	1/28/20	1/29/20	1/30/20	1/31/20	2/1/20	2/2/20	2/3/20	2/4/20	2/5/20	2/6/20	2/7/20	2/8/20	2/9/20	2/10/20	2/11/20	2/12/20	2/13/20	2/14/20	2/15/20	2/16/20	2/17/20	2/18/20	2/19/20	2/20/20	2/21/20	2/22/20	2/23/20	2/24/20	2/25/20	2/26/20	...	11/9/20	11/10/20	11/11/20	11/12/20	11/13/20	11/14/20	11/15/20	11/16/20	11/17/20	11/18/20	11/19/20	11/20/20	11/21/20	11/22/20	11/23/20	11/24/20	11/25/20	11/26/20	11/27/20	11/28/20	11/29/20	11/30/20	12/1/20	12/2/20	12/3/20	12/4/20	12/5/20	12/6/20	12/7/20	12/8/20	12/9/20	12/10/20	12/11/20	12/12/20	12/13/20	12/14/20	12/15/20	12/16/20	12/17/20	12/18/20
157	NaN	Korea, South	35.907757	127.766922	1	1	2	2	3	4	4	4	4	11	12	15	15	16	19	23	24	24	25	27	28	28	28	28	28	29	30	31	31	104	204	433	602	833	977	1261	...	27653	27799	27942	28133	28338	28546	28769	28998	29311	29654	30017	30403	30733	31004	31353	31735	32318	32887	33375	33824	34201	34652	35163	35703	36332	36915	37546	38161	38755	39432	40098	40786	41736	42766	43484	44364	45442	46453	47515	48570

1 rows × 336 columns

Excluding the Lat and Long columns, we only have the date and cases columns. For coding convenience, let’s flip the rows and columns and change the date format from str to datetime (to_datetime).

korea = confirmed[confirmed['Country/Region']=='Korea, South'].iloc[:,4:].T
korea.index = pd.to_datetime(korea.index)
korea

	157
2020-01-22	1
2020-01-23	1
2020-01-24	2
2020-01-25	2
2020-01-26	3
...	...
2020-12-14	44364
2020-12-15	45442
2020-12-16	46453
2020-12-17	47515
2020-12-18	48570

332 rows × 1 columns

Let’s visualize the data with matplotlib.pyplot and seaborn, which are representative visualization packages. With %matplotlib inline, the matplotlib graphs will be included in the notebook, along with the code. You can change the figure size through rcParams['figure.figsize'] from pylab and the style and font size through sns.set.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from pylab import rcParams
rcParams['figure.figsize'] = 12, 8
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

plt.plot(korea)
plt.show()

      
    

Now, let’s look at daily cases instead of cumulative cases. Using the diff function, it is possible to tell the difference between each row, which makes it easier to transform cumulative data into daily data. However, there is a missing value in the first row of the cumulative data, which we need to fill. Also, we will change the data format to an integer (int).

daily_cases = korea.diff().fillna(korea.iloc[0]).astype('int')
daily_cases

	157
2020-01-22	1
2020-01-23	0
2020-01-24	1
2020-01-25	0
2020-01-26	1
...	...
2020-12-14	880
2020-12-15	1078
2020-12-16	1011
2020-12-17	1062
2020-12-18	1055

332 rows × 1 columns

The following visualization is the output for daily cases. Since we already created a visualization, the visualization parameters will be based on the previous data unless you reset the parameters.

plt.plot(daily_cases)
plt.show()

There are spikes in the beginning of March and late August. At the end of the year, new daily cases skyrocket.

In this chapter, we explored COVID-19 confirmed case datasets in South Korea and across the world. In the next chapter, we will learn about data pre-processing of COVID-19 daily cases, using South Korea as our example, and use a deep learning model for time series forecasting.

PseudoLab Tutorial Book

2. EDA¶

2.1 Download Datasets¶

2.2 EDA of World Data¶

2.3 EDA of South Korean Dataset¶