2. EDA¶

Open In Colab

In the previous chapter, we examined the structure of deep learning models and evaluation metrics for time series analysis. In this chapter, we will explore and visualize the datasets we will be using to perform time series forecasting in the next chapter.

We will use COVID-19 daily case datasets for the time series forecasting. The datasets we are going to use are COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University and Kaggle: Novel Corona Virus 2019 Dataset.

COVID-19 daily cases by country are updated in the repository of Johns Hopkins University every day, along with various types of datasets, which are provided by Johns Hopkins University on its website and Kaggle.

In chapter 2.1, we will download the COVID-19 daily case datasets to see what features are included in them. In 2.2, we will perform EDA (Exploratory Data Analysis) on the datasets, focusing on world countries, and then in 2.3, we will dive into the COVID-19 daily cases in South Korea.

2.1 Download Datasets¶

Firstly, we will download the COVID-19 daily case datasets. Using the data-loader function coded by Pseudo-Lab, we can easily download the datasets.

Download the datasets from the repository of Tutorial-Book-Utils on GitHub by using git clone to save them in the Colab enviroment.

!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils
Cloning into 'Tutorial-Book-Utils'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 24 (delta 6), reused 14 (delta 3), pack-reused 0
Unpacking objects: 100% (24/24), done.

dataset example

  • Figure 2-1. Downloaded Files in Colab

In Figure 2-1, you can see PL_data_loader.py in the directory, in addition to README.md and utils_ObjectDetection.py. PL_data_loader.py contains the functions to download Google Drive datasets. It is possible to download the COVID-19 daily case datasets with the input of COVIDTimeSeries if we use the parameter –data.

!python Tutorial-Book-Utils/PL_data_loader.py --data COVIDTimeSeries
COVIDTimeSeries.zip is done!

Next, let’s unarchive the downloaded file by using unzip, which is a Linux command. Unnecessary outputs are not printed if we use the command -q.

!unzip -q COVIDTimeSeries.zip

dataset example

  • Figure 2-2. Downloaded Datasets

Once the COVIDTimesSeries.zip is unarchived, you can see two downloaded files: covid_19_data.csv and time_series_covid19_confirmed_global.csv (Figure 2-2).

We will visualize the global COVID-19 daily cases with covid_19_data.csv based on time series, and we will visualize the daily cases of COVID-19 in South Korea with time_series_covid19_confirmed_global.csv. Let’s first view the saved values after loading each dataset.

import pandas as pd

all = pd.read_csv('covid_19_data.csv')
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')

all contains daily confirmed cases, daily deaths, and daily recovered data. The dataframe of the dataset consists of ObservationDate, Province/State, Country/Region, Confirmed, Deaths, and Recovered.

all
SNo ObservationDate Province/State Country/Region Last Update Confirmed Deaths Recovered
0 1 01/22/2020 Anhui Mainland China 1/22/2020 17:00 1.0 0.0 0.0
1 2 01/22/2020 Beijing Mainland China 1/22/2020 17:00 14.0 0.0 0.0
2 3 01/22/2020 Chongqing Mainland China 1/22/2020 17:00 6.0 0.0 0.0
3 4 01/22/2020 Fujian Mainland China 1/22/2020 17:00 1.0 0.0 0.0
4 5 01/22/2020 Gansu Mainland China 1/22/2020 17:00 0.0 0.0 0.0
... ... ... ... ... ... ... ... ...
172475 172476 12/06/2020 Zaporizhia Oblast Ukraine 2020-12-07 05:26:14 36539.0 337.0 6556.0
172476 172477 12/06/2020 Zeeland Netherlands 2020-12-07 05:26:14 6710.0 104.0 0.0
172477 172478 12/06/2020 Zhejiang Mainland China 2020-12-07 05:26:14 1295.0 1.0 1288.0
172478 172479 12/06/2020 Zhytomyr Oblast Ukraine 2020-12-07 05:26:14 31967.0 531.0 22263.0
172479 172480 12/06/2020 Zuid-Holland Netherlands 2020-12-07 05:26:14 154813.0 2414.0 0.0

172480 rows Ă— 8 columns

confirmed is sequential data showing daily cases by country. Country/Region and Province/State are location data. Long and Lat are longitude and latitude, respectively. MM/DD/YYYY column records daily confirmed cases. The dataframe looks like below.

confirmed
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 ... 11/9/20 11/10/20 11/11/20 11/12/20 11/13/20 11/14/20 11/15/20 11/16/20 11/17/20 11/18/20 11/19/20 11/20/20 11/21/20 11/22/20 11/23/20 11/24/20 11/25/20 11/26/20 11/27/20 11/28/20 11/29/20 11/30/20 12/1/20 12/2/20 12/3/20 12/4/20 12/5/20 12/6/20 12/7/20 12/8/20 12/9/20 12/10/20 12/11/20 12/12/20 12/13/20 12/14/20 12/15/20 12/16/20 12/17/20 12/18/20
0 NaN Afghanistan 33.939110 67.709953 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 ... 42297 42463 42609 42795 42969 43035 43240 43468 43681 43924 44177 44363 44503 44706 44988 45174 45384 45600 45723 45844 46116 46274 46516 46718 46837 46837 47072 47306 47516 47716 47851 48053 48116 48229 48527 48718 48952 49161 49378 49621
1 NaN Albania 41.153300 20.168300 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 24731 25294 25801 26211 26701 27233 27830 28432 29126 29837 30623 31459 32196 32761 33556 34300 34944 35600 36245 36790 37625 38182 39014 39719 40501 41302 42148 42988 43683 44436 45188 46061 46863 47742 48530 49191 50000 50637 51424 52004
2 NaN Algeria 28.033900 1.659600 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 ... 62693 63446 64257 65108 65975 66819 67679 68589 69591 70629 71652 72755 73774 74862 75867 77000 78025 79110 80168 81212 82221 83199 84152 85084 85927 86730 87502 88252 88825 89416 90014 90579 91121 91638 92102 92597 93065 93507 93933 94371
3 NaN Andorra 42.506300 1.521800 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 5437 5477 5567 5616 5725 5725 5872 5914 5951 6018 6066 6142 6207 6256 6304 6351 6428 6534 6610 6610 6712 6745 6790 6842 6904 6955 7005 7050 7084 7127 7162 7190 7236 7288 7338 7382 7382 7446 7466 7519
4 NaN Angola -11.202700 17.873900 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 12680 12816 12953 13053 13228 13374 13451 13615 13818 13922 14134 14267 14413 14493 14634 14742 14821 14920 15008 15087 15103 15139 15251 15319 15361 15493 15536 15591 15648 15729 15804 15925 16061 16161 16188 16277 16362 16407 16484 16562
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
266 NaN Vietnam 14.058324 108.277199 0 2 2 2 2 2 2 2 2 2 6 6 8 8 8 10 10 13 13 14 15 15 16 16 16 16 16 16 16 16 16 16 16 16 16 16 ... 1215 1226 1252 1253 1256 1265 1281 1283 1288 1300 1304 1305 1306 1307 1312 1316 1321 1331 1339 1341 1343 1347 1351 1358 1361 1361 1365 1366 1367 1377 1381 1385 1391 1395 1397 1402 1405 1405 1407 1410
267 NaN West Bank and Gaza 31.952200 35.233200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 58838 59422 60065 60784 61514 62167 63031 63867 64935 66186 67296 68768 70254 71644 73196 75007 76727 78493 80429 81890 83585 85647 88004 90192 92708 94676 96098 98038 99758 101109 102992 104879 106622 108099 109738 111102 113409 115606 117755 119612
268 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 2071 2071 2071 2071 2072 2072 2072 2078 2081 2083 2086 2090 2093 2099 2107 2114 2124 2137 2148 2160 2177 2191 2197 2217 2239 2267 2304 2337 2383 2078 2079 2081 2082 2083 2083 2084 2085 2085 2087 2087
269 NaN Zambia -13.133897 27.849332 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 16971 16997 17036 17056 17093 17097 17123 17187 17243 17280 17350 17373 17394 17424 17454 17466 17535 17553 17569 17589 17608 17647 17665 17700 17730 17857 17898 17916 17931 17963 18062 18091 18161 18217 18274 18322 18428 18456 18504 18575
270 NaN Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 8561 8610 8667 8696 8765 8786 8829 8897 8945 8981 9046 9120 9172 9220 9308 9398 9508 9623 9714 9822 9822 9950 10129 10129 10424 10547 10617 10718 10839 10912 11007 11081 11162 11219 11246 11358 11522 11749 11866 12047

271 rows Ă— 336 columns

2.2 EDA of World Data¶

Let’s visualize the global COVID-19 daily cases using all. As mentioned above, there are two location data: Province/State and Country/Region. We will add the daily cases of every country based on Country/Region.

group = all.groupby(['ObservationDate', 'Country/Region'])['Confirmed'].sum()
group = group.reset_index()
group.head()
ObservationDate Country/Region Confirmed
0 01/22/2020 Hong Kong 0.0
1 01/22/2020 Japan 2.0
2 01/22/2020 Macau 1.0
3 01/22/2020 Mainland China 547.0
4 01/22/2020 South Korea 1.0

You can now see daily confirmed cases by country. Next, we will visualize world daily cases on the world map with animation effects. We can create the visualization with the plotly.express package.

import plotly as py
import plotly.express as px

With px.choropleth, we create the map layer, and then update dates through .update_layout.

Each parameter is as follows:

  • location : name of the column which indicates location information in the dataframe

  • locationmode : the range of countries (‘ISO-3’, ‘USA-states’, and ‘country names’)

  • color : name of the column which has data to be expressed on the diagram

  • animation_frame : name of the column for animation effects

Press the play button to see the COVID-19 daily case graphs.

choro_map=px.choropleth(group, 
                    locations="Country/Region", 
                    locationmode = "country names",
                    color="Confirmed", 
                    animation_frame="ObservationDate"
                   )

choro_map.update_layout(
    title_text = 'Global Spread of Coronavirus',
    title_x = 0.5,
    geo=dict(
        showframe = False,
        showcoastlines = False,
    ))
    
choro_map.show()

You can see that COVID-19 begins in China and then gradually spreads throughout the world. According to the visualization, recently there are a lot of cumulative cases in North America, South America, and India, compared to other countries.

2.3 EDA of South Korean Dataset¶

This time, we will extract the South Korean data from confirmed. The data is cumulative, which means it shows the sum of all cases on a given date.

confirmed[confirmed['Country/Region']=='Korea, South']
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 ... 11/9/20 11/10/20 11/11/20 11/12/20 11/13/20 11/14/20 11/15/20 11/16/20 11/17/20 11/18/20 11/19/20 11/20/20 11/21/20 11/22/20 11/23/20 11/24/20 11/25/20 11/26/20 11/27/20 11/28/20 11/29/20 11/30/20 12/1/20 12/2/20 12/3/20 12/4/20 12/5/20 12/6/20 12/7/20 12/8/20 12/9/20 12/10/20 12/11/20 12/12/20 12/13/20 12/14/20 12/15/20 12/16/20 12/17/20 12/18/20
157 NaN Korea, South 35.907757 127.766922 1 1 2 2 3 4 4 4 4 11 12 15 15 16 19 23 24 24 25 27 28 28 28 28 28 29 30 31 31 104 204 433 602 833 977 1261 ... 27653 27799 27942 28133 28338 28546 28769 28998 29311 29654 30017 30403 30733 31004 31353 31735 32318 32887 33375 33824 34201 34652 35163 35703 36332 36915 37546 38161 38755 39432 40098 40786 41736 42766 43484 44364 45442 46453 47515 48570

1 rows Ă— 336 columns

Excluding the Lat and Long columns, we only have the date and cases columns. For coding convenience, let’s flip the rows and columns and change the date format from str to datetime (to_datetime).

korea = confirmed[confirmed['Country/Region']=='Korea, South'].iloc[:,4:].T
korea.index = pd.to_datetime(korea.index)
korea
157
2020-01-22 1
2020-01-23 1
2020-01-24 2
2020-01-25 2
2020-01-26 3
... ...
2020-12-14 44364
2020-12-15 45442
2020-12-16 46453
2020-12-17 47515
2020-12-18 48570

332 rows Ă— 1 columns

Let’s visualize the data with matplotlib.pyplot and seaborn, which are representative visualization packages. With %matplotlib inline, the matplotlib graphs will be included in the notebook, along with the code. You can change the figure size through rcParams['figure.figsize'] from pylab and the style and font size through sns.set.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from pylab import rcParams
rcParams['figure.figsize'] = 12, 8
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

plt.plot(korea)
plt.show()
../../../_images/Ch2-EDA_34_0.png

Now, let’s look at daily cases instead of cumulative cases. Using the diff function, it is possible to tell the difference between each row, which makes it easier to transform cumulative data into daily data. However, there is a missing value in the first row of the cumulative data, which we need to fill. Also, we will change the data format to an integer (int).

daily_cases = korea.diff().fillna(korea.iloc[0]).astype('int')
daily_cases
157
2020-01-22 1
2020-01-23 0
2020-01-24 1
2020-01-25 0
2020-01-26 1
... ...
2020-12-14 880
2020-12-15 1078
2020-12-16 1011
2020-12-17 1062
2020-12-18 1055

332 rows Ă— 1 columns

The following visualization is the output for daily cases. Since we already created a visualization, the visualization parameters will be based on the previous data unless you reset the parameters.

plt.plot(daily_cases)
plt.show()
../../../_images/Ch2-EDA_38_0.png

There are spikes in the beginning of March and late August. At the end of the year, new daily cases skyrocket.

In this chapter, we explored COVID-19 confirmed case datasets in South Korea and across the world. In the next chapter, we will learn about data pre-processing of COVID-19 daily cases, using South Korea as our example, and use a deep learning model for time series forecasting.