{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "pMEyRqRZdbmr" }, "source": [ "# 3. Data Pre-Processing" ] }, { "cell_type": "markdown", "metadata": { "id": "_XmPU1nWHO5w" }, "source": [ "[](https://colab.research.google.com/github/Pseudo-Lab/Tutorial-Book-en/blob/master/book/chapters/en/time-series/Ch3-preprocessing.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "eWbSYUc4dg21" }, "source": [ "In the previous chapter, we explored dataset features through an EDA. In this chapter, we will learn how to pre-process data for a time series task.\n", "\n", "Pre-processing data into pairs of features and target variable is required in order to use a sequential dataset for supervised learning. In addition, it is necessary to unify data scales to stably train deep learning models. In chapter 3.1, we will transform the raw data of COVID-19 confirmed cases into data for supervised learning, and in chapter 3.2, we will examine how to perform data scaling. " ] }, { "cell_type": "markdown", "metadata": { "id": "LATMT-vOiBai" }, "source": [ "## 3.1 Preparing Data for Supervised Learning" ] }, { "cell_type": "markdown", "metadata": { "id": "emYeU_6yvDfc" }, "source": [ "We will load a dataset for data pre-processing, using code introduced in chapter 2.1." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5589, "status": "ok", "timestamp": 1608432119294, "user": { "displayName": "안성진", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64", "userId": "00266029492778998652" }, "user_tz": -540 }, "id": "jmgiP7hDihW_", "outputId": "b5c826cf-9fc9-486f-8f7b-944bfd03830d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cloning into 'Tutorial-Book-Utils'...\n", "remote: Enumerating objects: 24, done.\u001b[K\n", "remote: Counting objects: 100% (24/24), done.\u001b[K\n", "remote: Compressing objects: 100% (20/20), done.\u001b[K\n", "remote: Total 24 (delta 6), reused 14 (delta 3), pack-reused 0\u001b[K\n", "Unpacking objects: 100% (24/24), done.\n", "COVIDTimeSeries.zip is done!\n" ] } ], "source": [ "!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils\n", "!python Tutorial-Book-Utils/PL_data_loader.py --data COVIDTimeSeries\n", "!unzip -q COVIDTimeSeries.zip" ] }, { "cell_type": "markdown", "metadata": { "id": "ypeC4OifvaK3" }, "source": [ "Let's extract `daily_cases`, which shows the daily confirmed COVID-19 cases for South Korea, using code introduced in chapter 2.3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 407 }, "executionInfo": { "elapsed": 1478, "status": "ok", "timestamp": 1608432122003, "user": { "displayName": "안성진", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64", "userId": "00266029492778998652" }, "user_tz": -540 }, "id": "LDrsndsCihdp", "outputId": "7329ccdc-bbc0-4c47-ef73-661889b5a02c" }, "outputs": [ { "data": { "text/html": [ "
\n",
       "|  | 157 |\n",
       "|---|---|\n",
       "| 2020-01-22 | 1 |\n",
       "| 2020-01-23 | 0 |\n",
       "| 2020-01-24 | 1 |\n",
       "| 2020-01-25 | 0 |\n",
       "| 2020-01-26 | 1 |\n",
       "| ... | ... |\n",
       "| 2020-12-14 | 880 |\n",
       "| 2020-12-15 | 1078 |\n",
       "| 2020-12-16 | 1011 |\n",
       "| 2020-12-17 | 1062 |\n",
       "| 2020-12-18 | 1055 |\n",
       "\n",
       "332 rows × 1 columns
\n", "N - seq-length
size for supervised learning (See Figure 3-1).\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bY9hvmnaBbay"
   },
   "source": [
    "- Figure 3-1 Transforming Process of Time Series Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "04UXWETidbOD"
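   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def create_sequences(data, seq_length):\n",
    "    # Slide a window of length seq_length over the data:\n",
    "    # each window is one input sample, and the value right after it is the target\n",
    "    xs = []\n",
    "    ys = []\n",
    "    for i in range(len(data)-seq_length):\n",
    "        x = data.iloc[i:(i+seq_length)]  # seq_length consecutive observations\n",
    "        y = data.iloc[i+seq_length]      # the observation that follows the window\n",
    "        xs.append(x)\n",
    "        ys.append(y)\n",
    "    return np.array(xs), np.array(ys)\n",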
"\n",
"seq_length = 5\n",
"X, y = create_sequences(daily_cases, seq_length)"
]
},
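  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The windowing performed by `create_sequences` can be written out as follows. The symbols $N$ (number of observations), $L$ (`seq_length`), and $x_i$ (the $i$-th observation, counted from 0 as in the code) are notation introduced here for illustration.\n",
    "\n",
    "> $X_i = (x_i, x_{i+1}, \\dots, x_{i+L-1}), \\quad y_i = x_{i+L}, \\quad i = 0, 1, \\dots, N-L-1$\n",
    "\n",
    "The index $i$ takes $N - L$ values, so a series of length $N$ produces $N - L$ samples."
   ]
  },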
{
"cell_type": "markdown",
"metadata": {
"id": "p0-zPOcd8p6p"
   },
   "source": [
    "Applying the `create_sequences` function to `daily_cases` with `seq_length = 5` yields 327 samples for supervised learning: the 332 daily observations minus the sequence length of 5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 1784,
"status": "ok",
"timestamp": 1608432122365,
"user": {
"displayName": "안성진",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
"userId": "00266029492778998652"
},
"user_tz": -540
},
"id": "4vQ_PEk0OVTN",
"outputId": "da775f36-0206-4f07-b777-9d2a195fea24"
},
"outputs": [
{
"data": {
"text/plain": [
"((327, 5, 1), (327, 1))"
]
},
"execution_count": 4,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"X.shape, y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4erMbiGaBBep"
   },
   "source": [
    "We will split the transformed dataset into training, validation, and test sets at an 8:1:1 ratio, keeping the samples in chronological order. With 327 samples in total, this gives 261 samples for training, 33 for validation, and 33 for testing."
]
},
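  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The split sizes follow from the arithmetic below, a sketch that assumes the training size is truncated to an integer (as `int` does in Python) and that the remaining samples are divided equally between validation and testing.\n",
    "\n",
    "> $\\lfloor 327 \\times 0.8 \\rfloor = \\lfloor 261.6 \\rfloor = 261, \\quad 327 - 261 = 66, \\quad 66 / 2 = 33$"
   ]
  },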
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 1766,
"status": "ok",
"timestamp": 1608432122366,
"user": {
"displayName": "안성진",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
"userId": "00266029492778998652"
},
"user_tz": -540
},
"id": "JqJVumBC8409",
"outputId": "1c952458-d720-4253-a5eb-9bd3c8da7925"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"261\n"
]
}
   ],
   "source": [
    "train_size = int(len(X) * 0.8)  # 80% of the 327 samples, truncated to an integer\n",
"print(train_size)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_9HpHIR-Oh2T"
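   },
   "outputs": [],
   "source": [
    "# Split in chronological order: first 261 samples for training, next 33 for validation, last 33 for testing\n",
    "val_size = 33\n",
    "X_train, y_train = X[:train_size], y[:train_size]\n",
    "X_val, y_val = X[train_size:train_size+val_size], y[train_size:train_size+val_size]\n",
    "X_test, y_test = X[train_size+val_size:], y[train_size+val_size:]"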
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 1745,
"status": "ok",
"timestamp": 1608432122368,
"user": {
"displayName": "안성진",
"photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
"userId": "00266029492778998652"
},
"user_tz": -540
},
"id": "8EeX5aOoOa0l",
"outputId": "904a97c3-5256-4468-ad28-5a67ce708294"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(261, 5, 1) (33, 5, 1) (33, 5, 1)\n",
"(261, 1) (33, 1) (33, 1)\n"
]
}
],
"source": [
"print(X_train.shape, X_val.shape, X_test.shape)\n",
"print(y_train.shape, y_val.shape, y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MQ05Fk12jke3"
},
"source": [
"## 3.2 Data Scaling"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tdLKCYi3CI1i"
   },
   "source": [
    "In this section, we will perform data scaling. More specifically, we will apply MinMax scaling, which rescales the data to the range between 0 and 1. After calculating the minimum and maximum values of the data, MinMax scaling is applied using the following formula.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dYhCqFRbCuUd"
},
"source": [
"> $x_{scaled} = \\displaystyle\\frac{x - x_{min}}{x_{max} - x_{min}}$"
]
},
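  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked example of the formula, assume hypothetical values $x_{min} = 0$ and $x_{max} = 1000$ (chosen for illustration, not taken from the dataset). A value of $x = 250$ is then scaled as follows.\n",
    "\n",
    "> $x_{scaled} = \\displaystyle\\frac{250 - 0}{1000 - 0} = 0.25$\n",
    "\n",
    "A value above $x_{max}$, such as $x = 1200$, would be mapped to $1.2$, so values outside the range used to compute $x_{min}$ and $x_{max}$ fall outside the interval between 0 and 1."
   ]
  },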
{
"cell_type": "markdown",
"metadata": {
"id": "RGA3sPDvCv1A"
   },
   "source": [
    "The training, validation, and test datasets must all be scaled using the statistics of the training data. Information from the test dataset must not be used when fitting the model, so the minimum and maximum values are computed from the training data only.