{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pMEyRqRZdbmr"
   },
   "source": [
    "# 3. Data Pre-Processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_XmPU1nWHO5w"
   },
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pseudo-Lab/Tutorial-Book-en/blob/master/book/chapters/en/time-series/Ch3-preprocessing.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eWbSYUc4dg21"
   },
   "source": [
    "In the previous chapter, we explored the features of the dataset through exploratory data analysis (EDA). In this chapter, we will learn how to pre-process data for a time series task.\n",
    "\n",
    "To use a sequential dataset for supervised learning, we must first convert it into pairs of input features and a target variable. In addition, the data should be brought to a common scale so that deep learning models train stably. In chapter 3.1, we will transform the raw data of COVID-19 confirmed cases into data for supervised learning, and in chapter 3.2, we will examine how to perform data scaling."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LATMT-vOiBai"
   },
   "source": [
    "## 3.1 Preparing Data for Supervised Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "emYeU_6yvDfc"
   },
   "source": [
    "We will load a dataset for data pre-processing, using code introduced in chapter 2.1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 5589,
     "status": "ok",
     "timestamp": 1608432119294,
     "user": {
      "displayName": "안성진",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
      "userId": "00266029492778998652"
     },
     "user_tz": -540
    },
    "id": "jmgiP7hDihW_",
    "outputId": "b5c826cf-9fc9-486f-8f7b-944bfd03830d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'Tutorial-Book-Utils'...\n",
      "remote: Enumerating objects: 24, done.\n",
      "remote: Counting objects: 100% (24/24), done.\n",
      "remote: Compressing objects: 100% (20/20), done.\n",
      "remote: Total 24 (delta 6), reused 14 (delta 3), pack-reused 0\n",
      "Unpacking objects: 100% (24/24), done.\n",
      "COVIDTimeSeries.zip is done!\n"
     ]
    }
   ],
   "source": [
    "!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils\n",
    "!python Tutorial-Book-Utils/PL_data_loader.py --data COVIDTimeSeries\n",
    "!unzip -q COVIDTimeSeries.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ypeC4OifvaK3"
   },
   "source": [
    "Let's extract `daily_cases`, which shows the daily confirmed COVID-19 cases for South Korea, using code introduced in chapter 2.3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 407
    },
    "executionInfo": {
     "elapsed": 1478,
     "status": "ok",
     "timestamp": 1608432122003,
     "user": {
      "displayName": "안성진",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
      "userId": "00266029492778998652"
     },
     "user_tz": -540
    },
    "id": "LDrsndsCihdp",
    "outputId": "7329ccdc-bbc0-4c47-ef73-661889b5a02c"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>157</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2020-01-22</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-01-23</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-01-24</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-01-25</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-01-26</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-12-14</th>\n",
       "      <td>880</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-12-15</th>\n",
       "      <td>1078</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-12-16</th>\n",
       "      <td>1011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-12-17</th>\n",
       "      <td>1062</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-12-18</th>\n",
       "      <td>1055</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>332 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             157\n",
       "2020-01-22     1\n",
       "2020-01-23     0\n",
       "2020-01-24     1\n",
       "2020-01-25     0\n",
       "2020-01-26     1\n",
       "...          ...\n",
       "2020-12-14   880\n",
       "2020-12-15  1078\n",
       "2020-12-16  1011\n",
       "2020-12-17  1062\n",
       "2020-12-18  1055\n",
       "\n",
       "[332 rows x 1 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
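   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Cumulative confirmed cases per country/region, with one column per date\n",
    "confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')\n",
    "# Keep the South Korea row, drop the four leading metadata columns, and transpose so dates become the index\n",
    "korea = confirmed[confirmed['Country/Region']=='Korea, South'].iloc[:,4:].T\n",
    "korea.index = pd.to_datetime(korea.index)\n",
    "# Difference the cumulative counts to get daily cases; the first day keeps its original value\n",
    "daily_cases = korea.diff().fillna(korea.iloc[0]).astype('int')\n",
    "daily_cases"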
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JRzJXo3Cx4qQ"
   },
   "source": [
    "We need to convert the time series data shown above into pairs of input and output variables in order to use them for supervised learning. In a time series task, we call this kind of data sequential data. First, we need to define the sequence length, which is the number of past days used to predict future cases. For example, with a sequence length of 5, the data at $t-1$, $t-2$, $t-3$, $t-4$, and $t-5$ are used to predict the data at time $t$. A task like this, in which we predict the variable at time $t$ using data from $t-k$ to $t-1$, is called a one-step prediction task.\n",
    "\n",
    "The `create_sequences` function defined below transforms a time series of length `N` into `N - seq_length` samples for supervised learning (see Figure 3-1).\n",
    "\n",
    "![](https://github.com/Pseudo-Lab/Tutorial-Book/blob/master/book/pics/TS-ch3img01.png?raw=true)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bY9hvmnaBbay"
   },
   "source": [
    "- Figure 3-1 Transformation Process of Time Series Data"
   ]
  },
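  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the windowing in Figure 3-1, write the daily series as $x_0, x_1, \\dots, x_{N-1}$ and the sequence length as $L$ (this notation is introduced here for illustration only and is not used elsewhere in the chapter). The $i$-th input and target pair would then be:\n",
    "\n",
    "> $X_i = (x_i, x_{i+1}, \\dots, x_{i+L-1}), \\quad y_i = x_{i+L}, \\quad i = 0, 1, \\dots, N-L-1$\n",
    "\n",
    "The index $i$ takes $N - L$ values, so a series with $N = 332$ daily values and $L = 5$ should give $332 - 5 = 327$ pairs."
   ]
  },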
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "04UXWETidbOD"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def create_sequences(data, seq_length):\n",
    "    # Slide a window of seq_length days over the series:\n",
    "    # each window is an input x, and the day right after it is the target y.\n",
    "    xs = []\n",
    "    ys = []\n",
    "    for i in range(len(data)-seq_length):\n",
    "        x = data.iloc[i:(i+seq_length)]\n",
    "        y = data.iloc[i+seq_length]\n",
    "        xs.append(x)\n",
    "        ys.append(y)\n",
    "    return np.array(xs), np.array(ys)\n",
    "\n",
    "seq_length = 5\n",
    "X, y = create_sequences(daily_cases, seq_length)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "p0-zPOcd8p6p"
   },
   "source": [
    "Applying the `create_sequences` function to `daily_cases` with `seq_length = 5` yields 327 samples in total for supervised learning (332 days minus the sequence length of 5)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 1784,
     "status": "ok",
     "timestamp": 1608432122365,
     "user": {
      "displayName": "안성진",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
      "userId": "00266029492778998652"
     },
     "user_tz": -540
    },
    "id": "4vQ_PEk0OVTN",
    "outputId": "da775f36-0206-4f07-b777-9d2a195fea24"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((327, 5, 1), (327, 1))"
      ]
     },
     "execution_count": 4,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape, y.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4erMbiGaBBep"
   },
   "source": [
    "We will divide the transformed dataset into training, validation, and test datasets with an 8:1:1 ratio. There are 327 samples in total, so the split yields 261 samples for training, 33 for validation, and 33 for testing."
   ]
  },
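  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The split sizes can be checked by hand. Assuming the training share is rounded down and the remaining samples are divided evenly between validation and test, the arithmetic is:\n",
    "\n",
    "> $n_{train} = \\lfloor 327 \\times 0.8 \\rfloor = \\lfloor 261.6 \\rfloor = 261, \\quad n_{val} = n_{test} = \\displaystyle\\frac{327 - 261}{2} = 33$\n",
    "\n",
    "Because the samples are in time order, taking the three parts in this order keeps the validation and test periods after the training period."
   ]
  },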
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 1766,
     "status": "ok",
     "timestamp": 1608432122366,
     "user": {
      "displayName": "안성진",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
      "userId": "00266029492778998652"
     },
     "user_tz": -540
    },
    "id": "JqJVumBC8409",
    "outputId": "1c952458-d720-4253-a5eb-9bd3c8da7925"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "261\n"
     ]
    }
   ],
   "source": [
    "# Use the first 80% of the samples for training (327 samples in total)\n",
    "train_size = int(len(X) * 0.8)\n",
    "print(train_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_9HpHIR-Oh2T"
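   },
   "outputs": [],
   "source": [
    "# Split in time order (no shuffling) so validation and test data come after the training period\n",
    "val_size = 33  # half of the remaining 66 samples\n",
    "X_train, y_train = X[:train_size], y[:train_size]\n",
    "X_val, y_val = X[train_size:train_size+val_size], y[train_size:train_size+val_size]\n",
    "X_test, y_test = X[train_size+val_size:], y[train_size+val_size:]"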
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 1745,
     "status": "ok",
     "timestamp": 1608432122368,
     "user": {
      "displayName": "안성진",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
      "userId": "00266029492778998652"
     },
     "user_tz": -540
    },
    "id": "8EeX5aOoOa0l",
    "outputId": "904a97c3-5256-4468-ad28-5a67ce708294"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(261, 5, 1) (33, 5, 1) (33, 5, 1)\n",
      "(261, 1) (33, 1) (33, 1)\n"
     ]
    }
   ],
   "source": [
    "print(X_train.shape, X_val.shape, X_test.shape)\n",
    "print(y_train.shape, y_val.shape, y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MQ05Fk12jke3"
   },
   "source": [
    "## 3.2 Data Scaling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tdLKCYi3CI1i"
   },
   "source": [
    "In chapter 3.2, we will perform data scaling. More specifically, we will apply MinMax scaling, which maps the data into the range from 0 to 1. After computing the minimum and maximum values of the data, apply the following formula.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dYhCqFRbCuUd"
   },
   "source": [
    "> $x_{scaled} = \\displaystyle\\frac{x - x_{min}}{x_{max} - x_{min}}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RGA3sPDvCv1A"
   },
   "source": [
    "The training, validation, and test datasets must all be scaled using the statistics of the training data. Information from the test dataset must not be used during training, so the training data are scaled using only their own statistics.\n",
    "\n",
    "Since the model is trained on data scaled with the training statistics, the test data must be scaled with the same values in order to evaluate model performance later. Likewise, the validation data are scaled with the training statistics, since they need to go through the same pre-processing as the test data.\n",
    "\n",
    "We will get the minimum and maximum values from the `X_train` data in order to apply MinMax scaling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 1729,
     "status": "ok",
     "timestamp": 1608432122369,
     "user": {
      "displayName": "안성진",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiCjgkN_MvtrSUHRuFvstrWm6fhi5cf7CKd2UHYAw=s64",
      "userId": "00266029492778998652"
     },
     "user_tz": -540
    },
    "id": "Qmh0K8N4Glvb",
    "outputId": "421775e8-0a4d-4ff3-8357-527fe5c0c080"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 851\n"
     ]
    }
   ],
   "source": [
    "MIN = X_train.min()\n",
    "MAX = X_train.max()\n",
    "print(MIN, MAX)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qrD05cTgHgIS"
   },
   "source": [
    "The minimum and maximum values are 0 and 851, respectively. Next, we will define the MinMax scaling function. "
   ]
  },
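  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked example under these values, substituting $x_{min} = 0$ and $x_{max} = 851$ into the formula gives:\n",
    "\n",
    "> $x_{scaled} = \\displaystyle\\frac{x - 0}{851 - 0} = \\frac{x}{851}$\n",
    "\n",
    "A value of 851 maps to 1 and a value of 0 maps to 0. Values above the training maximum are not clipped: the 1078 cases recorded on 2020-12-15 map to $1078 / 851 \\approx 1.27$, so scaled validation and test values can exceed 1. Rearranging the formula gives the inverse transform, which converts scaled values back to case counts:\n",
    "\n",
    "> $x = x_{scaled} \\, (x_{max} - x_{min}) + x_{min}$"
   ]
  },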
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oGdcy31yHrla"
   },
   "outputs": [],
   "source": [
    "def MinMaxScale(array, min_val, max_val):\n",
    "    # Parameter names avoid shadowing the built-in min() and max()\n",
    "    return (array - min_val) / (max_val - min_val)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OzYEOKK7Hw_R"
   },
   "source": [
    "Let's perform scaling using the `MinMaxScale` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "FhJo_3FQH2HE"
   },
   "outputs": [],
   "source": [
    "X_train = MinMaxScale(X_train, MIN, MAX)\n",
    "y_train = MinMaxScale(y_train, MIN, MAX)\n",
    "X_val = MinMaxScale(X_val, MIN, MAX)\n",
    "y_val = MinMaxScale(y_val, MIN, MAX)\n",
    "X_test = MinMaxScale(X_test, MIN, MAX)\n",
    "y_test = MinMaxScale(y_test, MIN, MAX)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QhDGDK-cIJOd"
   },
   "source": [
    "Next, we will convert the data type from `np.ndarray` to `torch.Tensor` so that the data can be fed into a PyTorch model. First, we will define the function for converting the data type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "G1bRuXBDIIdZ"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "def make_Tensor(array):\n",
    "    return torch.from_numpy(array).float()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zMRYCpW6IeM_"
   },
   "source": [
    "We will perform the transformation through the `make_Tensor` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rxXjlAetInNg"
   },
   "outputs": [],
   "source": [
    "X_train = make_Tensor(X_train)\n",
    "y_train = make_Tensor(y_train)\n",
    "X_val = make_Tensor(X_val)\n",
    "y_val = make_Tensor(y_val)\n",
    "X_test = make_Tensor(X_test)\n",
    "y_test = make_Tensor(y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "07vuEYiQJ08-"
   },
   "source": [
    "In this chapter, we practiced converting time series data into a format suitable for supervised learning and scaling the data. In the next chapter, we will build a prediction model for COVID-19 cases with the data we prepared."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Ch3. Data Pre-Processing.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}