@worthyag
Created February 15, 2024 07:31
Can machines understand how you feel?
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "N8DVaLzZTKHk"
},
"source": [
"(CM3015) Machine Learning and Neural Networks - Following the universal workflow of DLWP 4.5 (1st Edition)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P6tbmyP8TyH5"
},
"source": [
"# Can machines understand how you feel? Using machine learning models to predict sentiments."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a6hutTq8VjWb"
},
"source": [
"*Aiming to predict the sentiments of unseen data.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8xhXc8-kH0Ti",
"tags": []
},
"source": [
"# 1 Defining the problem and assembling a dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GeQEdDNPWXCk",
"tags": []
},
"source": [
"## 1.1 Introduction and background"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sentiment analysis, defined as being \"*the process of analyzing digital text to determine if the emotional tone of the message is positive, negative, or neutral*\"[1], has been a subject of research for many decades. However in recent years, there have been many breakthroughs and significant advancements. What triggered this? Why are people becoming increasingly interested in this particular domain?\n",
"\n",
"The reason would be the sheer use cases it provides. For instance, it allows for business insights, as it enables companies to gain insights into customer opinions, preferences, and the like [2]. This information can then be used to improve products and develop marketing strategies. It is also used in the political field, allowing analysts to gauge public opinion on political candidates and issues [3]. In addition, it is used in the medical field to to assess patient sentiment in medical records or social media posts [4]. Now I can't sit here and list out all its potential use cases, however its uses above are just a small selection of its capabilities. This highlights why this field is of interest to many.\n",
"\n",
"I will be conducting sentiment analysis utilising the Amazon Dataset. The Amazon reviews dataset \"*consists of reviews from amazon [, the] data span[s] a period of 18 years, including ~35 million reviews up to March 2013*\"[5]. My input feature will be the review content, and my output will be the review polarity 1 or 2 (where 1 is negative and 2 is positive)- though I will be normalising this. Overall, I am trying to predict the sentiment of the amazon reviews, whether they are positive or negative. The problem I am facing is a text classification problem falling under supervised learning. More specifically it is a binary classification problem. The general purpose of this report and/or machine learning task is to explore whether unknown data that hasn't been classified can lead to good predictions."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XrI9NcWcYyRq",
"tags": []
},
"source": [
"## 1.2 Aim and Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Nmu4f3lxZ-Sp",
"tags": []
},
"source": [
"### 1.2.1 Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bliW__6Ucgpk"
},
"source": [
"- Conduct data processing.\n",
"- Write modular code to avoid repetition.\n",
"- Build a model that predicts the sentiments of unseen data.\n",
"- Evaluate the model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "53Vfft47aG2Y",
"tags": []
},
"source": [
"### 1.2.2 Aims"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yFUb7PDKciEM"
},
"source": [
"<ol type=\"1\">\n",
" <li>Find the data needed to explore the objectives.</li>\n",
" <li>Define the problem.</li>\n",
" <li>Choose a measure of success.</li>\n",
" <li>Pick an evaluation protocol.</li>\n",
" <li>\n",
" Prepare the data.\n",
" <ul>\n",
" <li>Convert the textual data into numerical data.</li>\n",
" <li>Convert the numerical data into tensors.</li>\n",
" </ul>\n",
" </li>\n",
" <li>Develop a model that does better than the baseline.</li>\n",
" <li>Develop a model that overfits.</li>\n",
" <li>Regularize the model and tune the hyperparameters.</li>\n",
" <li>Evaluate the model.</li>\n",
"</ol>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Lwg02d_IY_Dv",
"tags": []
},
"source": [
"## 1.3 Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PyO5pNXkykMR",
"tags": []
},
"source": [
"### 1.3.1 Limitations and dataset modifications"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y0J5CuPzjHwe"
},
"source": [
"I initially imported my data into a jupyter notebook that I was running on my own device. As previously mentioned, the *Amazon Reviews Polarity Dataset* [5] is very large, however I had no problem importing the data from a `csv` file and manipulating it. In the jupyter notebook, I created a class (more about it later in this section) that converted my `csv` files into dataframes that could be used to train and evaluate a model. With that said, it all went pear-shaped when I decided to instead use Google Colab to run the machine learning tasks more effectively.\n",
"\n",
"The first error I ran into was `ParserError: Error tokenizing data. C error: EOF inside string`. After a lot of documentation reading and StackOverflow reviewing, I thought that the problem was due to special characters in my dataset not being properly enclosed or handled. With this information in hand, I updated the parameters I passed to the Pandas `read_csv()` method in my `csv_to_df` method from\n",
"\n",
"\n",
"```\n",
"read_csv(file,\n",
" header=None,\n",
" sep=\",\",\n",
" encoding='utf-8)\n",
"```\n",
"to\n",
"\n",
"\n",
"```\n",
"read_csv(file,\n",
" header=None,\n",
" sep=\",\",\n",
" on_bad_lines=\"skip\",\n",
" engine=\"python\",\n",
" encoding='utf-8)\n",
"```\n",
"\n",
"\n",
"This seemingly solved my problem. I was able to load the data and work with it.\n",
"\n",
"Yet, upon reviewing the dataset, I realised I had a lot of missing data and my training dataset was shorter than my testing dataset. After many tedious hours of research, it dawned on me that, if the dataset loads into jupyter notebooks without any errors, then the problem most be with Google Colab. I noticed that although the training dataset was shorter than the testing dataset, the numbers were fairly similar (when I tried using different runtimes types the numbers were similar too). Therefore, I came to the conclusion that there is a limited size that can be read by Google Colab based on the runtime type when calling the `read_csv()` method.\n",
"\n",
"In order to mitigate this issue, I split the *Amazon Reviews Polarity Dataset* into smaller `csv` files (100000 rows each). I then converted them each into a dataframe, and combined the resulting dataframes- this solved the problem."
]
},
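{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged aside, an alternative to splitting the file on disk is to let pandas read it in pieces. The sketch below is not the approach used in this notebook: `read_csv_in_chunks` is a hypothetical helper name, and the only library feature it relies on is the `chunksize` parameter of `pd.read_csv`, which yields one dataframe per chunk. Whether this avoids the Colab behaviour described above is an assumption that is untested here.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def read_csv_in_chunks(path, cols, chunk_rows=100_000):\n",
"    # Hypothetical helper: read a headerless csv file piece by piece.\n",
"    chunks = pd.read_csv(path, header=None, names=cols,\n",
"                         encoding='utf-8', chunksize=chunk_rows)\n",
"    # Concatenate once at the end instead of growing a dataframe.\n",
"    return pd.concat(list(chunks), ignore_index=True)\n",
"```\n",
"\n",
"The chunk size of 100,000 rows mirrors the file size chosen above; any value that fits comfortably in memory should behave the same way."
]
},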
{
"cell_type": "markdown",
"metadata": {
"id": "VnW9Rw4sKSre"
},
"source": [
"Since my dataset was too large to load into Excel or any other software on my device, I decided to write a zsh / bash script that would split my dataset (training and testing) into smaller files (I also concluded that the resulting files would be less error prone since I wasn't doing it manually). Below is the script that I wrote:\n",
"\n",
"\n",
"\n",
"```\n",
"# I wrote all this code.\n",
"\n",
"# Splitting the given file into smaller files. Each new file will have\n",
"# 100000 rows.\n",
"split -l 100000 \"$1\".csv \"$1\"_\n",
"\n",
"# Giving the created files the .csv extension.\n",
"for file in \"$1\"_*\n",
"do\n",
" mv \"$file\" \"$file.csv\"\n",
"done\n",
"```\n",
"\n",
"With this complete, I was then able to work with my data."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tiXW4rzDyqBI",
"tags": []
},
"source": [
"### 1.3.2 Assembling the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hqmCoZv-0pLr"
},
"source": [
"Since I am now working with many `csv` files. I decided to create a class to avoid repeating the same code when converting the `csv` files into dataframes, and combining the dataframes.\n",
"\n",
"I will begin by importing the pandas library to make use of its DataFrame class, and the os module to generate the file paths."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"executionInfo": {
"elapsed": 550,
"status": "ok",
"timestamp": 1693228087386,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "vyewCR_yy-SQ",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Importing pandas.\n",
"import pandas as pd\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yKic-gKDViaf",
"tags": []
},
"source": [
"#### 1.3.2.1 `FileToDataFrame` class"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jOuGFmmixk-r"
},
"source": [
"Next I will create a class with some of the following methods:\n",
"- `csv_to_df(file, cols)`\n",
" - This method will convert the `csv` files into dataframes, like the name suggests.\n",
"- `combine_df(file_paths, cols)`\n",
" - This method will combine the dataframes created by the `csv_to_df()` method.\n",
"- `get_file_paths(file_dir)`\n",
" - Considering that the training data was 3,600,000 lines before it was split into files 100,000 lines long resulting in 36 training data files- I need to create a method that will generate the file paths/names. This will prevent repetition and lessen the chance of errors like forgetting a file."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"executionInfo": {
"elapsed": 554,
"status": "ok",
"timestamp": 1693228093277,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "goQLTqy5vU8s",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"class FileToDataFrame:\n",
" \"\"\"\n",
" A class that generates file paths and creates dataframes from csv\n",
" files.\n",
"\n",
" Methods\n",
" -------\n",
" print_dir()\n",
" get_cols()\n",
" get_file_paths(file_dir)\n",
" csv_to_df(file, cols)\n",
" combine_df(file_paths, cols)\n",
" generate_df_from_dir()\n",
" \"\"\"\n",
"\n",
" def __init__(self, file_dir: str = None, cols: str = None):\n",
" \"\"\"\n",
" Initialises the FileToDataFrame class and sets the file directory\n",
" and column names.\n",
"\n",
" Parameters\n",
" ----------\n",
" file_dir : str, default=None\n",
" The path to the folder.\n",
" cols : list, default=None\n",
" The list of column names used to create the dataframe.\n",
" \"\"\"\n",
" self.file_dir = file_dir\n",
" self.cols = cols\n",
"\n",
" def print_dir(self):\n",
" \"\"\"\n",
" Prints the file path to the directory.\n",
"\n",
" Notes\n",
" -----\n",
" Useful for checking the directory that was initialised.\n",
" \"\"\"\n",
" print(self.file_dir)\n",
"\n",
" def get_cols(self):\n",
" \"\"\"\n",
" Returns the column names providing during initialisation.\n",
"\n",
" Returns\n",
" -------\n",
" cols : list\n",
" Returns the list of column names used to create the dataframe.\n",
" \"\"\"\n",
" return self.cols\n",
"\n",
" def get_file_paths(self, file_dir: str = None):\n",
" \"\"\"\n",
" Generates a list of file paths for a given directory.\n",
"\n",
" Parameters\n",
" ----------\n",
" file_dir : str, default=None\n",
" The path to the folder.\n",
"\n",
" Returns\n",
" -------\n",
" files : list\n",
" The list containing the filepaths.\n",
" \"\"\"\n",
" if file_dir == None:\n",
" file_dir = self.file_dir\n",
"\n",
" files = []\n",
"\n",
" # Adding the files.\n",
" for file_path in os.listdir(file_dir):\n",
" file_path_str = file_dir + \"/\" + file_path\n",
" files.append(file_path_str)\n",
"\n",
" return files\n",
"\n",
" def csv_to_df(self, file: str, cols: list):\n",
" \"\"\"\n",
" Creates a dataframe from a given file.\n",
"\n",
" Parameters\n",
" ----------\n",
" file : str\n",
" The path to the file.\n",
" cols : list\n",
" A list of the column names.\n",
"\n",
" Returns\n",
" -------\n",
" df : Pandas dataframe\n",
" The dataframe created from the file data.\n",
" \"\"\"\n",
" # Importing the data. (Setting the engine to python to run on google colabs).\n",
" csv_file = pd.read_csv(file, header=None, sep=\",\", on_bad_lines=\"skip\",\n",
" engine=\"python\", encoding='utf-8')\n",
"\n",
" # Creating an empty dictionary.\n",
" data = dict()\n",
"\n",
" # Creating a list from the data column values and appending it to the\n",
" # dictionary.\n",
" for i in range(len(cols)):\n",
" data[cols[i]] = [val for val in csv_file[csv_file.columns[i]]]\n",
"\n",
" # Creating a dataframe from the created dictionary.\n",
" df = pd.DataFrame(data)\n",
"\n",
" return df\n",
"\n",
" def combine_df(self, file_paths: list, cols: list):\n",
" \"\"\"\"\n",
" Creates and combines dataframes.\n",
"\n",
" Parameters\n",
" ----------\n",
" file_paths : list\n",
" The path to the files to convert to dataframes.\n",
" cols : list\n",
" The list of column names used to create the dataframe.\n",
"\n",
" Returns\n",
" -------\n",
" df : Pandas dataframe\n",
" The dataframe created from the combined dataframes.\n",
" \"\"\"\n",
" # Creating an empty dataframe.\n",
" df = pd.DataFrame(columns=cols)\n",
"\n",
" # Creating a new datframe for each file and appending it to df dataframe.\n",
" for file in file_paths:\n",
" df_new = self.csv_to_df(file, cols)\n",
" df = pd.concat([df, df_new], ignore_index=True)\n",
"\n",
" return df\n",
"\n",
" def generate_df_from_dir(self):\n",
" \"\"\"\"\n",
" Generates the dataframe from a given directory.\n",
"\n",
" Returns\n",
" -------\n",
" df : Pandas dataframe\n",
" The dataframe created from the files in the given directory.\n",
" \"\"\"\n",
" files = self.get_file_paths()\n",
" df = self.combine_df(files, self.cols)\n",
" return df"
]
},
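{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loading steps implemented by the class can also be sketched more compactly. The function below is an illustrative alternative, not the code used in the rest of this notebook: `load_csv_dir` is a hypothetical name, and it assumes that every `.csv` file in the folder is headerless and has the same columns. It sorts the paths so that the row order is reproducible, and it concatenates the dataframes once.\n",
"\n",
"```python\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"\n",
"\n",
"def load_csv_dir(file_dir, cols):\n",
"    # Hypothetical helper: collect only the .csv files, in sorted order.\n",
"    paths = sorted(Path(file_dir).glob('*.csv'))\n",
"    # Read each headerless file using the given column names.\n",
"    frames = [pd.read_csv(path, header=None, names=cols, encoding='utf-8')\n",
"              for path in paths]\n",
"    # A single concat avoids repeatedly copying a growing dataframe.\n",
"    return pd.concat(frames, ignore_index=True)\n",
"```\n",
"\n",
"Filtering on the `.csv` extension also guards against hidden files in the folder being read as data, which `os.listdir` would otherwise include."
]
},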
{
"cell_type": "markdown",
"metadata": {
"id": "GtU0PQLrVue-",
"tags": []
},
"source": [
"#### 1.3.2.2 Creating the dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "R86XJFcJdyVq"
},
"source": [
"With the `FileToDataFrame` class created, I can now create the dataframes.\n",
"\n",
"I will begin with the training dataframe."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 208,
"status": "ok",
"timestamp": 1693228127444,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "bsn5O_kjbUcU",
"outputId": "50c9ff58-d003-496b-f90d-ee416a899240",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 36 training data files: True\n",
"./data/training\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Training data directory.\n",
"training_dir = \"./data/training\"\n",
"\n",
"# Initialising the FileToDataFrame class\n",
"amazon_train = FileToDataFrame(training_dir,\n",
" [\"polarity\", \"review_title\", \"review_content\"])\n",
"\n",
"# Checking that there are 36 files.\n",
"print(f\"There are 36 training data files: {len(amazon_train.get_file_paths()) == 36}\")\n",
"\n",
"# Checking the directory is correct.\n",
"amazon_train.print_dir()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JYnISQ8u4Crm"
},
"source": [
"Now that I have verified that there are 36 training data files I can generate the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"executionInfo": {
"elapsed": 40340,
"status": "ok",
"timestamp": 1693228171215,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "OLaiG3Rv3Qb5",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Generating the training dataframe.\n",
"train_data = amazon_train.generate_df_from_dir()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"executionInfo": {
"elapsed": 220,
"status": "ok",
"timestamp": 1693228175144,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "ZnwOPKz93tr-",
"outputId": "7aceaf3f-fea5-4527-e0d7-92c9b530a3a8",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>polarity</th>\n",
" <th>review_title</th>\n",
" <th>review_content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>More posturing than substance</td>\n",
" <td>The first thing anyone who reads this book nee...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>The Courage of Captain Plum</td>\n",
" <td>James Oliver Curwood wrote many wonderful stor...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>the Courage of Captain Plum</td>\n",
" <td>Kind of hoped for a hardback, but guess that s...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>Great movie</td>\n",
" <td>Great Jimmy Stewart movie. A very conservative...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>The FBI Story</td>\n",
" <td>We both really enjoy catching up watching many...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3599995</th>\n",
" <td>2</td>\n",
" <td>Not as good as expected</td>\n",
" <td>First off, let me say this is a great lens. It...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3599996</th>\n",
" <td>2</td>\n",
" <td>Quality lens</td>\n",
" <td>Very good quality lens, feels solid and the pi...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3599997</th>\n",
" <td>2</td>\n",
" <td>Haven't used it enough.</td>\n",
" <td>Haven't used it enough to make an informed dec...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3599998</th>\n",
" <td>2</td>\n",
" <td>Canon L Lens at its Best</td>\n",
" <td>Got this lens with my Canon 5D Mark II and it ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3599999</th>\n",
" <td>2</td>\n",
" <td>Great lens</td>\n",
" <td>I've had this lens for 2 months now, and love ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3600000 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" polarity review_title \\\n",
"0 1 More posturing than substance \n",
"1 1 The Courage of Captain Plum \n",
"2 2 the Courage of Captain Plum \n",
"3 2 Great movie \n",
"4 2 The FBI Story \n",
"... ... ... \n",
"3599995 2 Not as good as expected \n",
"3599996 2 Quality lens \n",
"3599997 2 Haven't used it enough. \n",
"3599998 2 Canon L Lens at its Best \n",
"3599999 2 Great lens \n",
"\n",
" review_content \n",
"0 The first thing anyone who reads this book nee... \n",
"1 James Oliver Curwood wrote many wonderful stor... \n",
"2 Kind of hoped for a hardback, but guess that s... \n",
"3 Great Jimmy Stewart movie. A very conservative... \n",
"4 We both really enjoy catching up watching many... \n",
"... ... \n",
"3599995 First off, let me say this is a great lens. It... \n",
"3599996 Very good quality lens, feels solid and the pi... \n",
"3599997 Haven't used it enough to make an informed dec... \n",
"3599998 Got this lens with my Canon 5D Mark II and it ... \n",
"3599999 I've had this lens for 2 months now, and love ... \n",
"\n",
"[3600000 rows x 3 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the created dataframe.\n",
"train_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Jv1-plDG5VnG"
},
"source": [
"Now I will create the testing dataframe."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 223,
"status": "ok",
"timestamp": 1693228180268,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "lu_CEVl6zhcR",
"outputId": "979727fa-b222-4886-c414-512e54b817c5",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 4 testing data files: True\n",
"./data/testing\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Testing data directory.\n",
"testing_dir = \"./data/testing\"\n",
"\n",
"# Initialising the FileToDataFrame class\n",
"amazon_test = FileToDataFrame(testing_dir,\n",
" [\"polarity\", \"review_title\", \"review_content\"])\n",
"\n",
"# Checking that there are 4 files.\n",
"print(f\"There are 4 testing data files: {len(amazon_test.get_file_paths()) == 4}\")\n",
"\n",
"# Checking the directory is correct.\n",
"amazon_test.print_dir()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "141f76QC4IVR"
},
"source": [
"Now that I have verified that there are 4 testing data files I can generate the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"executionInfo": {
"elapsed": 4035,
"status": "ok",
"timestamp": 1693228187327,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "zLdU8XAh3nyI",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Generating the testing dataframe.\n",
"test_data = amazon_test.generate_df_from_dir()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"executionInfo": {
"elapsed": 221,
"status": "ok",
"timestamp": 1693228189787,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "EehmFo_m36wn",
"outputId": "ebfef6bc-f24b-44c8-a41a-7edc86be7c40",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>polarity</th>\n",
" <th>review_title</th>\n",
" <th>review_content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>Useful for remodels</td>\n",
" <td>I recently remodeled my house and these came i...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>decent get what u pay for</td>\n",
" <td>I got this set for around the house and for th...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Rock solid</td>\n",
" <td>Perfect solution - stable and accessible. Perh...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>Fun, humorous, and touching!</td>\n",
" <td>I highly recommend this movie ... it's a beaut...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>Horror that isn't for the faint-hearted</td>\n",
" <td>Of the fourteen books of Shaun Hutson's that I...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>399995</th>\n",
" <td>1</td>\n",
" <td>Inferior product and not like the picture</td>\n",
" <td>I expected the 2-piece filter set shown in the...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>399996</th>\n",
" <td>1</td>\n",
" <td>Entire Set Was Not Recieved</td>\n",
" <td>I rec'd the filter, but not the plastic part t...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>399997</th>\n",
" <td>1</td>\n",
" <td>No Darn Good!</td>\n",
" <td>Now that we know this is an interview CD, I am...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>399998</th>\n",
" <td>2</td>\n",
" <td>Revenge</td>\n",
" <td>This is really just a typical revenge movie wi...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>399999</th>\n",
" <td>1</td>\n",
" <td>book was somewhat interesting but too negative</td>\n",
" <td>The author took a very negative approach to re...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>400000 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" polarity review_title \\\n",
"0 2 Useful for remodels \n",
"1 2 decent get what u pay for \n",
"2 2 Rock solid \n",
"3 2 Fun, humorous, and touching! \n",
"4 2 Horror that isn't for the faint-hearted \n",
"... ... ... \n",
"399995 1 Inferior product and not like the picture \n",
"399996 1 Entire Set Was Not Recieved \n",
"399997 1 No Darn Good! \n",
"399998 2 Revenge \n",
"399999 1 book was somewhat interesting but too negative \n",
"\n",
" review_content \n",
"0 I recently remodeled my house and these came i... \n",
"1 I got this set for around the house and for th... \n",
"2 Perfect solution - stable and accessible. Perh... \n",
"3 I highly recommend this movie ... it's a beaut... \n",
"4 Of the fourteen books of Shaun Hutson's that I... \n",
"... ... \n",
"399995 I expected the 2-piece filter set shown in the... \n",
"399996 I rec'd the filter, but not the plastic part t... \n",
"399997 Now that we know this is an interview CD, I am... \n",
"399998 This is really just a typical revenge movie wi... \n",
"399999 The author took a very negative approach to re... \n",
"\n",
"[400000 rows x 3 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the created dataframe.\n",
"test_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yKaWSMYW5o6U"
},
"source": [
"Now that I have created the training and testing dataframes I will check the size and shape of them, to verify that no errors occured."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 207,
"status": "ok",
"timestamp": 1693228193063,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "y2WvsLASEprT",
"outputId": "0c17a2c5-e99d-4cb7-ed92-2d4ed1ff3a80",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DATA SIZE\n",
"---------\n",
"Training Data: 3600000 \t(3600000, 3)\n",
"Testing Data : 400000 \t(400000, 3)\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Printing the size and shape of the data.\n",
"print(\"DATA SIZE\")\n",
"print(\"---------\")\n",
"print(f\"Training Data: {len(train_data)} \\t{train_data.shape}\")\n",
"print(f\"Testing Data : {len(test_data)} \\t{test_data.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 1.3.3 Reducing the size of the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I originally reached section 5 and was about to begin building models, but was forced to come back to this section and reduce the amount of data, since my notebook environment kept on crashing, disenabling me from going further in the DLWP process."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ratio of positive reviews to negative ones was equal before the split, so it was important to keep it that way. If it changed, I would be dealing with a completely different problem."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Negative Reviews : 1800000\n",
"Positive Reviews : 1800000\n",
"Reviews are equal: True\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Checking the ratio of 1 (negative) or 2 (positive) (to check whether it is\n",
"# balanced).\n",
"num_neg_reviews = len(train_data.loc[train_data[\"polarity\"] == 1])\n",
"num_pos_reviews = len(train_data.loc[train_data[\"polarity\"] == 2])\n",
"reviews_is_equal = (num_neg_reviews == num_pos_reviews)\n",
"\n",
"print(f\"Negative Reviews : {num_neg_reviews}\")\n",
"print(f\"Positive Reviews : {num_pos_reviews}\")\n",
"print(f\"Reviews are equal: {reviews_is_equal}\")"
]
},
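{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same balance check can be sketched with `value_counts()`. This is a minimal illustration on a small made-up dataframe (`toy`), not on the Amazon data; only the column name `polarity` follows the notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up labels for illustration: three negative (1) and three positive (2).\n",
"toy = pd.DataFrame({'polarity': [1, 2, 2, 1, 1, 2]})\n",
"\n",
"# Counting how many rows carry each label.\n",
"counts = toy['polarity'].value_counts()\n",
"\n",
"# The classes are balanced when every label has the same count.\n",
"is_balanced = counts.nunique() == 1\n",
"print(counts.to_dict(), is_balanced)\n",
"```\n",
"\n",
"This form extends to more than two labels without adding a comparison per class."
]
},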
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### 1.3.3.1 `split_dataset` function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function below returns a fraction of the given dataset, in this case I want half of the dataset. It takes into the account the ratio of positive to negative reviews."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"def split_dataset(df, frac: float = 0.5, col: str = \"polarity\"):\n",
" \"\"\"\n",
" A function that returns a fraction of the given dataset.\n",
"\n",
" Parameters\n",
" ----------\n",
" df : Pandas DataFrame\n",
" The dataset that the sample will be taken from.\n",
" frac: float, default=0.5\n",
" The fraction of the dataset to sample.\n",
" col: str, default=\"polarity\"\n",
" Label column of the dataset\n",
"\n",
" Returns\n",
" -------\n",
" _ : Pandas DataFrame\n",
" Returns a sample of the Pandas DataFrame.\n",
" \"\"\"\n",
" # Checking that the inputted DataFrame is not None or empty.\n",
" try:\n",
" if df is None or df.empty:\n",
" raise ValueError(\"The inputted DataFrame is None or empty.\")\n",
"\n",
" # Separating the data into positive (2) and negative (1) data.\n",
" pos_data = df[df[col] == 2]\n",
" neg_data = df[df[col] == 1]\n",
"\n",
" # Calculating the amount of data to take from each category.\n",
" num_pos_data = int(frac * len(pos_data))\n",
" num_neg_data = int(frac * len(neg_data))\n",
"\n",
" # Splitting / sampling the dataset.\n",
" train_pos = pos_data.sample(n=num_pos_data, random_state=1)\n",
" train_neg = neg_data.sample(n=num_neg_data, random_state=1)\n",
"\n",
" #  Combining the positive and negative datasets.\n",
" df = pd.concat([train_pos, train_neg], ignore_index=True)\n",
"\n",
" # Returning the combined dataset.\n",
" return df\n",
" except Exception as e:\n",
" print(f\"ERROR OCCURRED: {e}\")\n",
" return None"
]
},
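{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stratified sampling of this kind can also be expressed with a grouped sample. The sketch below is an alternative under stated assumptions, not the function used in this notebook: `stratified_sample` and its `seed` parameter are hypothetical names, and it relies on `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later. Unlike `split_dataset`, it also shuffles the result so that the two classes are interleaved.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def stratified_sample(df, frac=0.5, col='polarity', seed=1):\n",
"    # Sampling the same fraction from every label group.\n",
"    sampled = df.groupby(col, group_keys=False).sample(frac=frac,\n",
"                                                       random_state=seed)\n",
"    # Shuffling so that the rows are not ordered by label.\n",
"    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)\n",
"```\n",
"\n",
"Because the grouping is done on whatever values the label column holds, this version does not hard-code the labels 1 and 2."
]
},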
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### 1.3.3.2 Halving the training and testing datasets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will now reduce the size of the data with the created function."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell\n",
"# Reducing the size of data using the split_dataset function.\n",
"half_train_data = split_dataset(train_data)\n",
"half_test_data = split_dataset(test_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Viewing the new training dataset."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>polarity</th>\n",
" <th>review_title</th>\n",
" <th>review_content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>Nice Hat</td>\n",
" <td>You can't beat Henschel for well made and reas...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Jesus Reborn</td>\n",
" <td>I once walked the earth in sin. I was a non-be...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>World's Toughest Computer</td>\n",
" <td>I've been saying for a month that I should wri...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>An informative book about figure skating...</td>\n",
" <td>Yamaguchi's Figure Skating for Dummies is an e...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>Viva Las Vegas!!!!!!</td>\n",
" <td>I had never heard of Dread Zeppelin until two ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799995</th>\n",
" <td>1</td>\n",
" <td>not for kids</td>\n",
" <td>The poses are too advanced for younger childre...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799996</th>\n",
" <td>1</td>\n",
" <td>a note on longevity</td>\n",
" <td>prior versions were impregnated with iodine an...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799997</th>\n",
" <td>1</td>\n",
" <td>FYI Made in China</td>\n",
" <td>Pretty pricey considering it's MADE IN CHINA. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799998</th>\n",
" <td>1</td>\n",
" <td>Breaking bits and dead drills</td>\n",
" <td>What's the saying? Three strikes, you're out? ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799999</th>\n",
" <td>1</td>\n",
" <td>Great Sphinx puzzle</td>\n",
" <td>Too hard for our house- the colors are very si...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1800000 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" polarity review_title \\\n",
"0 2 Nice Hat \n",
"1 2 Jesus Reborn \n",
"2 2 World's Toughest Computer \n",
"3 2 An informative book about figure skating... \n",
"4 2 Viva Las Vegas!!!!!! \n",
"... ... ... \n",
"1799995 1 not for kids \n",
"1799996 1 a note on longevity \n",
"1799997 1 FYI Made in China \n",
"1799998 1 Breaking bits and dead drills \n",
"1799999 1 Great Sphinx puzzle \n",
"\n",
" review_content \n",
"0 You can't beat Henschel for well made and reas... \n",
"1 I once walked the earth in sin. I was a non-be... \n",
"2 I've been saying for a month that I should wri... \n",
"3 Yamaguchi's Figure Skating for Dummies is an e... \n",
"4 I had never heard of Dread Zeppelin until two ... \n",
"... ... \n",
"1799995 The poses are too advanced for younger childre... \n",
"1799996 prior versions were impregnated with iodine an... \n",
"1799997 Pretty pricey considering it's MADE IN CHINA. ... \n",
"1799998 What's the saying? Three strikes, you're out? ... \n",
"1799999 Too hard for our house- the colors are very si... \n",
"\n",
"[1800000 rows x 3 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the reduced dataframe.\n",
"half_train_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Viewing the new testing dataset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>polarity</th>\n",
" <th>review_title</th>\n",
" <th>review_content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>So cute</td>\n",
" <td>My son loves it. My complaint is that it laste...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Great movie</td>\n",
" <td>I am an absolute fan of the Crow series. I own...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Best Baby Einstein yet!</td>\n",
" <td>Baby Noah is the best Baby Einstein for any ag...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>Member of the pack</td>\n",
" <td>Fine story with well drawn characters and terr...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>Love this</td>\n",
" <td>My friend has this album for her daughter n I ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199995</th>\n",
" <td>1</td>\n",
" <td>Love VNV But Hate This DVD.</td>\n",
" <td>I practically like every track they have ever ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199996</th>\n",
" <td>1</td>\n",
" <td>don't buy</td>\n",
" <td>We needed one and this was what was available ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199997</th>\n",
" <td>1</td>\n",
" <td>did not work for long</td>\n",
" <td>We bought 2 of these to use with a remote cont...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199998</th>\n",
" <td>1</td>\n",
" <td>Not what I expected at all</td>\n",
" <td>I purchased this product under the assumption ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199999</th>\n",
" <td>1</td>\n",
" <td>Flexrake 100A</td>\n",
" <td>I have had a hula hoe for year. It has worked ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>200000 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" polarity review_title \\\n",
"0 2 So cute \n",
"1 2 Great movie \n",
"2 2 Best Baby Einstein yet! \n",
"3 2 Member of the pack \n",
"4 2 Love this \n",
"... ... ... \n",
"199995 1 Love VNV But Hate This DVD. \n",
"199996 1 don't buy \n",
"199997 1 did not work for long \n",
"199998 1 Not what I expected at all \n",
"199999 1 Flexrake 100A \n",
"\n",
" review_content \n",
"0 My son loves it. My complaint is that it laste... \n",
"1 I am an absolute fan of the Crow series. I own... \n",
"2 Baby Noah is the best Baby Einstein for any ag... \n",
"3 Fine story with well drawn characters and terr... \n",
"4 My friend has this album for her daughter n I ... \n",
"... ... \n",
"199995 I practically like every track they have ever ... \n",
"199996 We needed one and this was what was available ... \n",
"199997 We bought 2 of these to use with a remote cont... \n",
"199998 I purchased this product under the assumption ... \n",
"199999 I have had a hula hoe for year. It has worked ... \n",
"\n",
"[200000 rows x 3 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the reduced dataframe.\n",
"half_test_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I will print the new sizes of the datasets."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DATA SIZE\n",
"---------\n",
"Training Data: 1800000 \t(1800000, 3)\n",
"Testing Data : 200000 \t(200000, 3)\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Printing the size and shape of the data.\n",
"print(\"DATA SIZE\")\n",
"print(\"---------\")\n",
"print(f\"Training Data: {len(half_train_data)} \\t{half_train_data.shape}\")\n",
"print(f\"Testing Data : {len(half_test_data)} \\t{half_test_data.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LNWy8jBsZJ0H",
"tags": []
},
"source": [
"## 1.4 Constraints and Ethical Considerations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since I am using a dataset that consists of the reviews of many different people there are some things to consider. The first is the privacy concern, the dataset shouldn't contain any personal information. In this case, the dataset I am using has already been processed to eliminate all private information. I must also be transparent and clearly document how the data was collected. Again, this has already been done, and the link to the dataset I used is in my references. There are other ethical concerns but most of them have been already been mitigated, I just have to understand the implications of my report / ML task.\n",
"\n",
"One constraint to mention and that has decided to haunt me is the data size. Large datasets require substantial storage and processing power. The Amazon dataset is very large, and I have already started to see the drawbacks of using such a large dataset. My limited computing resources has caused some limitations, which will be seen later on in the report / ML task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NRCzbf3dIX7w",
"tags": []
},
"source": [
"# 2 Choosing a measure of success"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EDbgF7ry94ef"
},
"source": [
"Now that I have the datasets ready, I need to decide on a measure of success. In order to do so, I will take a closer look at the training data.\n",
"\n",
"First, I will begin by checking the polarity column, since it is holds the target / labels data.\n",
"\n",
"I am under the assumption that there are only two unique values, 1 and 2. 1 representing negative reviews and 2 representing positive reviews. However when I was reading up on how the dataset was constructed I found that initially 1 and 2 were negative, 4 and 5 were positive, and 3 was ignored. Therefore, I have to check whether that data that I acquired has only 1 and 2 as values. If it includes 4 and 5, I will then have to further process the dataframes."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 212,
"status": "ok",
"timestamp": 1693228204731,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "l_lwvVo7Kc9p",
"outputId": "2e21330a-a8f0-4c7e-89ba-86063b434e2e",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{1, 2}"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Checking that polarity can only be 1 (negative) or 2 (positive).\n",
"set(half_train_data[\"polarity\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P2DkaOrbCUyT"
},
"source": [
"As it turns out, there are only 1s and 2s, therefore I will only need to vectorize the labels later.\n",
"\n",
"Next I will check the ratio of 1s to 2s since this will influence the measure of success that I will select."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 1217,
"status": "ok",
"timestamp": 1693228207566,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "hHYzCGOv4M5u",
"outputId": "f654f21a-a1eb-4f94-a02b-2915ed5919e5",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Negative Reviews : 900000\n",
"Positive Reviews : 900000\n",
"Reviews are equal: True\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Checking the ratio of 1 (negative) or 2 (positive) (to check whether it is\n",
"# balanced).\n",
"num_neg_reviews = len(half_train_data.loc[half_train_data[\"polarity\"] == 1])\n",
"num_pos_reviews = len(half_train_data.loc[half_train_data[\"polarity\"] == 2])\n",
"reviews_is_equal = (num_neg_reviews == num_pos_reviews)\n",
"\n",
"print(f\"Negative Reviews : {num_neg_reviews}\")\n",
"print(f\"Positive Reviews : {num_pos_reviews}\")\n",
"print(f\"Reviews are equal: {reviews_is_equal}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QB6gInE1HLb2"
},
"source": [
"Since the ratio of negative reviews to positive reviews is equal (there is not an inbalance), my measure of success will be accuracy rather than precision or recall. This is because accuracy is best suited for when classes are balanced. In addition, accuracy is simple and will measure the percentage of correctly predicted polarities- which is what I am aiming for.\n",
"\n",
"On the other hand, precision and recall are more suitable for datasets with class imbalances, since precision measures the true positive predictions compared to all positive predictions, and recall measures the true positive predictions compared to all actual positives.\n",
"\n",
"I will not be using ROC AUC (area under the receiver operating characteristic curve) as it better suited for evaluating the model's ability to distinguish between the classes across different thresholds."
]
},
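{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the difference between these metrics concrete, here is a minimal sketch that is not part of the notebook's pipeline. It computes accuracy, precision, and recall from a small list of labels. The function name `binary_metrics` and the toy labels are hypothetical and chosen purely for illustration.\n",
"\n",
"```python\n",
"def binary_metrics(y_true, y_pred):\n",
"    # Counting the four confusion-matrix outcomes (1 = positive, 0 = negative).\n",
"    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)\n",
"    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)\n",
"    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)\n",
"    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)\n",
"\n",
"    accuracy = (tp + tn) / len(y_true)\n",
"    precision = tp / (tp + fp) if (tp + fp) else 0.0\n",
"    recall = tp / (tp + fn) if (tp + fn) else 0.0\n",
"    return accuracy, precision, recall\n",
"\n",
"\n",
"# Hypothetical, balanced toy labels (four positives, four negatives).\n",
"y_true = [1, 1, 1, 1, 0, 0, 0, 0]\n",
"y_pred = [1, 1, 1, 0, 0, 0, 1, 1]\n",
"accuracy, precision, recall = binary_metrics(y_true, y_pred)\n",
"print(accuracy, precision, recall)\n",
"```\n",
"\n",
"On balanced data like this, accuracy summarises both kinds of error in a single number, which is why it is a reasonable headline metric here."
]
},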
{
"cell_type": "markdown",
"metadata": {
"id": "4X1a_Tt1KekO",
"tags": []
},
"source": [
"# 3 Deciding on an evaluation protocol"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TbM0djY3K7Gs"
},
"source": [
"Since I have decided on a measure of success, it is now time to pick an evaluation protocol. As mentioned before, my dataset was pretty large 3.6 million lines (see *section 1.3.2.2*), 4 million if I included the test dataset. However, the size of the dataset was leading to numerous problems, therefore I decided to use half of the dataset. Now, my dataset is 1.8 million lines, 2 million if I include the reduced test dataset. This is still very large. Since the dataset is quite large I have the freedom to pick any (to certain degree) evaluation protocol that I would like to use.\n",
"\n",
"Due to this, I will be utilising the hold-out validation set approach since it requires less computational resources compared to k-fold cross validation and iterated k-fold validation, which in turn makes it faster. It works very well with larger datasets, and is great for model complexity tuning.\n",
"\n",
"I will not be using k-fold cross validation or iterated k-fold validation since they better suit smaller datasets. K-fold cross validation is great when you have too \"*few samples for hold out validation to be reliable*\"[6]. Iterated k-fold validation is great for performing \"*highly accurate model evaluation when little data is available*\"[6]."
]
},
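{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the cost difference (a sketch, not code used later in this notebook), the hypothetical helper `k_fold_indices` below partitions sample indices into k folds. Each fold takes a turn as the validation set, so k-fold validation trains k models where hold-out validation trains only one.\n",
"\n",
"```python\n",
"def k_fold_indices(num_samples, k):\n",
"    # Yielding (train_indices, val_indices) pairs, one pair per fold.\n",
"    fold_size = num_samples // k\n",
"    indices = list(range(num_samples))\n",
"    for fold in range(k):\n",
"        start = fold * fold_size\n",
"        # The last fold absorbs any remainder.\n",
"        end = num_samples if fold == k - 1 else start + fold_size\n",
"        yield indices[:start] + indices[end:], indices[start:end]\n",
"\n",
"\n",
"folds = list(k_fold_indices(num_samples=10, k=4))\n",
"training_runs_k_fold = len(folds)\n",
"training_runs_hold_out = 1\n",
"for train_idx, val_idx in folds:\n",
"    print(len(train_idx), len(val_idx))\n",
"```\n",
"\n",
"With 1.8 million reviews, repeating the training k times would multiply an already expensive run, which is the practical reason for preferring a single hold-out split here."
]
},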
{
"cell_type": "markdown",
"metadata": {
"id": "T-gceqrbLV3O",
"tags": []
},
"source": [
"# 4 Preparing the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that I know what I am training, what I am optimising for, and how to evaluate my approach, I need to format my data, so that it can be fed into a machine learning model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will begin by importing the relevant libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"executionInfo": {
"elapsed": 3014,
"status": "ok",
"timestamp": 1693228217535,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "oqcrgFAp6p5H",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Importing the relevant libraries.\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"from keras.preprocessing.text import Tokenizer\n",
"import numpy as np\n",
"from keras import models\n",
"from keras import layers\n",
"from keras import optimizers\n",
"from keras import losses\n",
"from keras import metrics\n",
"from keras.optimizers import RMSprop\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 4.1 The `VectorizeDataset` class"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I decided to create a class called `VectorizeDataset` to simplify the vectorisation process. The class contains the following methods:\n",
"1. `vectorize_labels(isTest)`\n",
"2. `vectorize_inputs(isTest)`\n",
"\n",
"Like the name suggests the`vectorize_labels()` method converts the given labels into their corresponding one-hot binary representations, whereas the `vectorize_inputs()` method converts the given inputs (data / features) into numerical representations that can be used as inputs to a neural network."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"executionInfo": {
"elapsed": 195,
"status": "ok",
"timestamp": 1693228219742,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "63zB2HrLMa9B",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"class VectorizeDataset:\n",
" \"\"\"\n",
" A class that vectorises the given dataset.\n",
"\n",
" Methods\n",
" -------\n",
" vectorize_labels(isTest)\n",
" vectorize_inputs(isTest)\n",
" \"\"\"\n",
"\n",
" def __init__(self, dataset_train, dataset_test, labels: str, inputs: str,\n",
" max_words: int = 10000):\n",
" \"\"\"\n",
" Initialises the VectorizeDataset class and sets the dataset,\n",
" labels, and inputs.\n",
"\n",
" Parameters\n",
" ----------\n",
" dataset_train : Pandas.DataFrame\n",
" The training dataset from which the labels and inputs will be extracted.\n",
" dataset_test : Pandas.DataFrame\n",
" The testing dataset from which the labels and inputs will be extracted.\n",
" labels : str\n",
" The column name that refers to the target / output data.\n",
" inputs : str\n",
" The column name that refers to the feature / input data.\n",
" max_words : int, default=10000\n",
" The maximum vocabulary size.\n",
" \"\"\"\n",
" self.dataset_train = dataset_train\n",
" self.dataset_test = dataset_test\n",
" self.labels = labels\n",
" self.inputs = inputs\n",
" self.max_words = max_words\n",
"\n",
" def vectorize_labels(self, isTest: bool = False):\n",
" \"\"\"\n",
" Converts the given labels into their corresponding one-hot binary\n",
" representations.\n",
"\n",
" Parameters\n",
" ----------\n",
" isTest : bool, default=False\n",
" Checks whether the data to vectorise is testing data or training data.\n",
"\n",
" Returns\n",
" -------\n",
" _ : tensor\n",
" Returns a tensor containing the one-hot binary representations.\n",
" \"\"\"\n",
" # Extracting the labels column data.\n",
" labels_col = self.dataset_train[self.labels].values if (\n",
" isTest == False) else self.dataset_test[self.labels].values\n",
" \n",
" # Normalising the labels by subtracting 1.\n",
" return labels_col - 1\n",
"\n",
" def vectorize_inputs(self, isTest: bool = False):\n",
" \"\"\"\n",
" Converts the given inputs (data / features) into numerical\n",
" representations that can be used as inputs to a neural network.\n",
"\n",
" Parameters\n",
" ----------\n",
" isTest : bool, default=False\n",
" Checks whether the data to vectorise is testing data or training data.\n",
"\n",
" Returns\n",
" -------\n",
" one_hot_data : tensor\n",
" Returns a tensor holding the numerical representation of the input\n",
" data.\n",
" \"\"\"\n",
" # Extracting the inputs column data.\n",
" inputs_col = self.dataset_train[self.inputs].values\n",
"\n",
" # Initialising the Tokenizer + setting max_words to limit vocabulary size.\n",
" tokenizer = Tokenizer(num_words=self.max_words)\n",
"\n",
" # Fitting the tokenizer to the given inputs\n",
" tokenizer.fit_on_texts(inputs_col)\n",
"\n",
" if (isTest == False):\n",
" # Converting the inputs into sequences of integers.\n",
" sequences = tokenizer.texts_to_sequences(inputs_col)\n",
"\n",
" # Performing one-hot encoding on the created sequences.\n",
" one_hot_data = tokenizer.sequences_to_matrix(\n",
" sequences, mode='binary')\n",
" else:\n",
" # Extracting the inputs test column data.\n",
" inputs_test_col = self.dataset_test[self.inputs].values\n",
"\n",
" # Converting the test inputs into sequences of integers.\n",
" sequences = tokenizer.texts_to_sequences(inputs_test_col)\n",
"\n",
" # Performing one-hot encoding on the created sequences.\n",
" one_hot_data = tokenizer.sequences_to_matrix(\n",
" sequences, mode='binary')\n",
"\n",
" return one_hot_data"
]
},
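{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate what the binary encoding produces, here is a simplified pure-Python sketch. It imitates the idea rather than reproducing the Keras implementation: the helper names `build_word_index` and `multi_hot` are hypothetical, tokenisation is a plain lowercase split, and index 0 is deliberately left unused as an assumption about how the vocabulary is numbered.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def build_word_index(texts, max_words):\n",
"    # Ranking words by frequency; index 0 is reserved, so ranks start at 1.\n",
"    counts = Counter(word for text in texts for word in text.lower().split())\n",
"    most_common = counts.most_common(max_words - 1)\n",
"    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}\n",
"\n",
"\n",
"def multi_hot(texts, word_index, max_words):\n",
"    # One row per text; a 1.0 marks that the word at that index is present.\n",
"    matrix = [[0.0] * max_words for _ in texts]\n",
"    for row, text in zip(matrix, texts):\n",
"        for word in text.lower().split():\n",
"            if word in word_index:\n",
"                row[word_index[word]] = 1.0\n",
"    return matrix\n",
"\n",
"\n",
"# Hypothetical toy reviews.\n",
"train_texts = ['great hat great price', 'bad hat']\n",
"word_index = build_word_index(train_texts, max_words=4)\n",
"encoded = multi_hot(['great bad shoes'], word_index, max_words=4)\n",
"print(word_index)\n",
"print(encoded)\n",
"```\n",
"\n",
"Words that fall outside the vocabulary (here `bad` and `shoes`) are simply dropped, which is the information that a smaller `max_words` trades away."
]
},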
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 4.2 Vectorising the data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ye0BexDFCiUZ"
},
"source": [
"I initially planned on setting `max_words` (which is the maximum vocabulary size) to 10000. However, my notebook environment kept on crashing, therefore I reduced `max_words` from 10000 to 1000. This will evidently reduce the accuracy and predictive power of my model since the vocabulary size is 10 times smaller, however since my dataset is so large, it may be able to mitigate this to a certain extent, since the model has access to a lot of training data."
]
},
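{
"cell_type": "markdown",
"metadata": {},
"source": [
"A back-of-the-envelope estimate suggests why the larger vocabulary was a problem. The sketch below assumes the encoded training inputs are held as one dense `float64` matrix (8 bytes per value), which matches the dtype reported in *section 4.3.2*. It ignores the tokenizer's own overhead and any temporary copies, so the real footprint is probably higher. The helper `dense_matrix_gb` is hypothetical.\n",
"\n",
"```python\n",
"BYTES_PER_FLOAT64 = 8\n",
"NUM_TRAIN_REVIEWS = 1_800_000  # Size of the halved training set.\n",
"\n",
"\n",
"def dense_matrix_gb(num_rows, num_cols, bytes_per_value=BYTES_PER_FLOAT64):\n",
"    # Size in gigabytes (10**9 bytes) of a dense num_rows x num_cols matrix.\n",
"    return num_rows * num_cols * bytes_per_value / 10**9\n",
"\n",
"\n",
"size_10000 = dense_matrix_gb(NUM_TRAIN_REVIEWS, 10_000)\n",
"size_1000 = dense_matrix_gb(NUM_TRAIN_REVIEWS, 1_000)\n",
"print(f'max_words=10000: {size_10000:.1f} GB')\n",
"print(f'max_words=1000 : {size_1000:.1f} GB')\n",
"```\n",
"\n",
"Under these assumptions the training matrix needs roughly 144 GB with 10000 words and roughly 14.4 GB with 1000 words, so the reduction is the difference between an impossible and a merely heavy workload."
]
},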
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will begin by initialising the `VectorizeDataset` class that I created, by feeding in the reduced dataset and the necessary parameters."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"executionInfo": {
"elapsed": 214,
"status": "ok",
"timestamp": 1693228223240,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "AfzlvVCB5CpX",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Initialising the VectorizeDataset class for the data.\n",
"vectorize_data = VectorizeDataset(half_train_data, half_test_data, \"polarity\",\n",
" \"review_content\", max_words=1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next I will generate and view the vectorised training labels."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"executionInfo": {
"elapsed": 389,
"status": "ok",
"timestamp": 1693228225839,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "uXHELACY5uWY",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Generating the vectorised training labels.\n",
"vectorised_train_labels = vectorize_data.vectorize_labels()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 199,
"status": "ok",
"timestamp": 1693228228091,
"user": {
"displayName": "Worthy",
"userId": "06377054689661600867"
},
"user_tz": -60
},
"id": "aNnf17L76A-a",
"outputId": "d20380b2-e9c5-458e-cdd4-6aa92a48e619",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(array([1, 1, 1, ..., 0, 0, 0], dtype=object), (1800000,))"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the vectorised labels and checking the shape.\n",
"vectorised_train_labels, vectorised_train_labels.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I will generate and view the vectorised training inputs."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "qHl1BeLo6OYj",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Generating the vectorised training inputs.\n",
"vectorised_train_inputs = vectorize_data.vectorize_inputs()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "ROzpq4Xz6beZ",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(array([[0., 0., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" ...,\n",
" [0., 1., 0., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.]]),\n",
" (1800000, 1000))"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the vectorised training inputs and checking the shape.\n",
"vectorised_train_inputs, vectorised_train_inputs.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next I will generate and view the vectorised test labels."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "DR2umDGiDGts",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Generating the vectorised testing labels.\n",
"vectorised_test_labels = vectorize_data.vectorize_labels(isTest=True)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"id": "fxIKTZfqDPsP",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(array([1, 1, 1, ..., 0, 0, 0], dtype=object), (200000,))"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the vectorised labels and checking the shape.\n",
"vectorised_test_labels, vectorised_test_labels.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I will generate and view the vectorised test inputs."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "KMGcruZM6kEC",
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Generating the vectorised testing inputs.\n",
"vectorised_test_inputs = vectorize_data.vectorize_inputs(isTest=True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"id": "VYAMf0CyAOea",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(array([[0., 1., 0., ..., 0., 0., 0.],\n",
" [0., 1., 0., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" ...,\n",
" [0., 0., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.]]),\n",
" (200000, 1000))"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing the vectorised testing inputs and checking the shape.\n",
"vectorised_test_inputs, vectorised_test_inputs.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 4.3 Splitting the data into training and validation to implement hold-out validation"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 4.3.1 The `BuildEvalModel` class"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once I finished vectorising my data, I decided to create a class called `BuildEvalModel` that allows me to compile, train, and evaluate models. The class has the following methods:\n",
"1. `train_val_split(train_inputs, train_label, train_ratio)`\n",
"2. `create_compile_model(units, activation, input_shape, num_of_layers)`\n",
"3. `fit_model(model, train_x, train_y, val_x, val_y, epochs, batch_size)`\n",
"\n",
"The `train_val_split()` method splits the given data into training and validation sets. This is needed for holdout validation. The `create_compile_model()` creates and compiles a Sequential model based on the given parameters. This will stop me from manually writing out the same code. The `fit_model()` fits the model- again needed to avoid unnecessary code repetition, leading to modular code."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"class BuildEvalModel:\n",
" \"\"\"\n",
" A class that compiles, trains, and evaluates a model.\n",
"\n",
" Methods\n",
" -------\n",
" train_val_split(train_inputs, train_label, train_ratio)\n",
" create_compile_model(units, activation, input_shape, num_of_layers)\n",
" fit_model(model, train_x, train_y, val_x, val_y, epochs, batch_size)\n",
"\n",
" \"\"\"\n",
"\n",
" def train_val_split(self, train_inputs: list, train_labels: list,\n",
" train_ratio: float = 0.8, random_seed=None):\n",
" \"\"\"\n",
" Splits the given data into training and validation sets.\n",
"\n",
" Parameters\n",
" ----------\n",
" train_inputs : list\n",
" The input training data.\n",
" train_labels : list\n",
" The labels for the training data.\n",
" train_ratio : float, default=0.8\n",
" The ratio used to calculating the proportion of training data\n",
" to validation data.\n",
" random_seed : int, default=None\n",
" Used to set the random seed for reproducibility. \n",
"\n",
" Returns\n",
" -------\n",
" _ : tuple\n",
" Returns a tuple containing the split data.\n",
"\n",
" Notes\n",
" -----\n",
" tuple[0] corresponds to the training data (inputs).\n",
" tuple[1] corresponds to the validation data (inputs).\n",
" tuple[2] corresponds to the training labels (targets).\n",
" tuple[3] corresponds to the validation labels (targets).\n",
" \"\"\"\n",
" # Checking whether the given data are numpy arrays.\n",
" if not isinstance(train_inputs, np.ndarray):\n",
" train_inputs = np.array(train_inputs)\n",
" if not isinstance(train_labels, np.ndarray):\n",
" train_labels = np.array(train_labels)\n",
"\n",
" # Setting the random seed for reproducibility.\n",
" if random_seed is not None:\n",
" np.random.seed(random_seed)\n",
"\n",
" # Checking that the length of the labels and inputs.\n",
" if (len(train_inputs) != len(train_labels)):\n",
" print(\"ERROR: The length of inputs doesn't equal the length of labels.\")\n",
" return\n",
"\n",
" # Calculating the number of data for the train and val data based on the train ratio.\n",
" total_data = len(train_inputs)\n",
" train_data = int(train_ratio * total_data)\n",
"\n",
" # Randomising / shuffling the data and labels.\n",
" indices = np.arange(total_data)\n",
" np.random.shuffle(indices)\n",
" train_inputs = train_inputs[indices]\n",
" train_labels = train_labels[indices]\n",
"\n",
" # Splitting the data into training and validation sets.\n",
" X_train = train_inputs[:train_data]\n",
" X_val = train_inputs[train_data:]\n",
" y_train = train_labels[:train_data]\n",
" y_val = train_labels[train_data:]\n",
"\n",
" return (X_train, X_val, y_train, y_val)\n",
"\n",
" def create_compile_model(self, units: list, activation: list,\n",
" input_shape=(1000,), num_of_layers: int = 3):\n",
" \"\"\"\n",
" Creates and compiles a Sequential model based on the given parameters.\n",
"\n",
" Parameters\n",
" ----------\n",
" units : list of integers (positive)\n",
" List containing the dimensionalities of the output space.\n",
" activation : list of str\n",
" Specifies the activation function to be used.\n",
" input_shape : tuple, default=(1000,)\n",
" The shape that corresponds to structure of the chosen data.\n",
" num_of_layers : int, default=3\n",
" The number of layers that the model will have.\n",
"\n",
" Returns\n",
" -------\n",
" model : keras Sequential object\n",
" Returns a model based on the given parameters.\n",
" \"\"\"\n",
" #  Creating the model.\n",
" model = keras.Sequential()\n",
"\n",
" # Adding models to the layers.\n",
" for i in range(num_of_layers):\n",
" if i == 0:\n",
" model.add(layers.Dense(units[i], activation=activation[i],\n",
" input_shape=input_shape))\n",
" else:\n",
" model.add(layers.Dense(\n",
" units[i],\n",
" activation=activation[i]\n",
" ))\n",
"\n",
" # Compiling the model.\n",
" model.compile(\n",
" optimizer='rmsprop',\n",
" loss='binary_crossentropy',\n",
" metrics=['accuracy']\n",
" )\n",
"\n",
" return model\n",
"\n",
" def fit_model(self, model, train_x, train_y, val_x, val_y, epochs: int = 20,\n",
" batch_size: int = 512):\n",
" \"\"\"\n",
" Fits the model.\n",
"\n",
" Parameters\n",
" ----------\n",
" model : keras.Sequential object\n",
" The model that will be built / fitted.\n",
" train_x : Array-like structure\n",
" The input data.\n",
" train_y : list / tensor / array\n",
" The target data.\n",
" val_x : list / tensor / array\n",
" The validation data.\n",
" val_y : list / array\n",
" The validation data.\n",
" epochs : int, default=20\n",
" The number of epochs to train the model.\n",
" batch_size : int, default=512\n",
" The number of samples per gradient update.\n",
"\n",
" Returns\n",
" -------\n",
" history : History object\n",
" Returns a History object where the attribute \"history\" holds a \n",
" record of training loss and other metrics.\n",
" \"\"\"\n",
" # Training the model.\n",
" history = model.fit(\n",
" train_x,\n",
" train_y,\n",
" epochs=epochs,\n",
" batch_size=batch_size,\n",
" validation_data=(val_x, val_y)\n",
" )\n",
"\n",
" return history"
]
},
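{
"cell_type": "markdown",
"metadata": {},
"source": [
"The key detail in `train_val_split()` is that the inputs and labels are reordered with the same shuffled indices, so every review stays paired with its label. The sketch below shows that idea on toy data using only the standard library. The function `hold_out_split` and the toy values are hypothetical, and the local `random.Random(seed)` generator is just one possible alternative to seeding a global generator.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"\n",
"def hold_out_split(inputs, labels, train_ratio=0.8, seed=None):\n",
"    if len(inputs) != len(labels):\n",
"        raise ValueError('inputs and labels must have the same length')\n",
"\n",
"    # Shuffling one list of indices and applying it to both sequences.\n",
"    indices = list(range(len(inputs)))\n",
"    random.Random(seed).shuffle(indices)\n",
"    cut = int(train_ratio * len(indices))\n",
"\n",
"    x_train = [inputs[i] for i in indices[:cut]]\n",
"    x_val = [inputs[i] for i in indices[cut:]]\n",
"    y_train = [labels[i] for i in indices[:cut]]\n",
"    y_val = [labels[i] for i in indices[cut:]]\n",
"    return x_train, x_val, y_train, y_val\n",
"\n",
"\n",
"# Hypothetical toy data in which each label equals its input modulo 2.\n",
"toy_inputs = list(range(10))\n",
"toy_labels = [value % 2 for value in toy_inputs]\n",
"x_train, x_val, y_train, y_val = hold_out_split(toy_inputs, toy_labels, seed=42)\n",
"print(len(x_train), len(x_val))\n",
"```\n",
"\n",
"Because the toy labels are derived from the inputs, it is easy to confirm after the split that no input has been separated from its label."
]
},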
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 4.3.2 Converting the `vectorised_train_labels` and `vectorised_test_labels`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before I split the data, I need to double check that my data is of the right type."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(dtype('float64'), dtype('O'))"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Checking the type of the training data.\n",
"vectorised_train_inputs.dtype, vectorised_train_labels.dtype"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(dtype('float64'), dtype('O'))"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Checking the type of the testing data.\n",
"vectorised_test_inputs.dtype, vectorised_test_labels.dtype"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Since the two label arrays have the wrong dtype (`object`), I will create a function to convert them to `float64`."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### 4.3.2.1 The `obj_to_float` function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function defined below will convert the given data of type `object` to type `float64`."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"def obj_to_float(data):\n",
"    \"\"\"\n",
"    Converts an array of dtype('O') into an array of dtype('float64').\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    data : np.ndarray\n",
"        The data to be converted.\n",
"\n",
"    Returns\n",
"    -------\n",
"    data : np.ndarray or None\n",
"        The converted array, or None if the conversion fails.\n",
"    \"\"\"\n",
"\n",
"    try:\n",
"        # Converting the data.\n",
"        data = data.astype('float64')\n",
"\n",
"        # Printing the results to check that the conversion was successful.\n",
"        print(f\"New type: {data.dtype}\")\n",
"\n",
"        return data\n",
"    except Exception as e:\n",
"        print(f\"Conversion error: {e}\")\n",
"        return None"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### 4.3.2.2 Updating the types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that I have written the function, it is time to utilise it. I will convert the type of the training and testing labels since they were the ones that had the `object` data type."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New type: float64\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Converting the training labels to float64.\n",
"vectorised_train_labels = obj_to_float(vectorised_train_labels)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New type: float64\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Converting the testing labels to float64.\n",
"vectorised_test_labels = obj_to_float(vectorised_test_labels)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(dtype('float64'), dtype('float64'))"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Double checking the types\n",
"vectorised_train_labels.dtype, vectorised_test_labels.dtype"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 4.3.3 Splitting the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the data is of the right type, I will create an instance of the `BuildEvalModel` class, and call the `train_val_split()` method in order to generate training and validation sets. I will then check the shape and type of the data produced in case of error."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating an instance of the BuildEvalModel class.\n",
"build_eval_model = BuildEvalModel()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating training and validation sets (to implement hold-out validation).\n",
"X_train, X_val, y_train, y_val = build_eval_model.train_val_split(vectorised_train_inputs, \n",
" vectorised_train_labels,\n",
" random_seed=1)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1440000, 1000),\n",
" dtype('float64'),\n",
" array([[0., 1., 0., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 0., ..., 0., 0., 0.],\n",
" ...,\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 0., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.]]))"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing and checking the shape and type of the training data.\n",
"X_train.shape, X_train.dtype, X_train"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"((1440000,), dtype('float64'), array([1., 0., 0., ..., 0., 1., 1.]))"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing and checking the shape and type of the training labels.\n",
"y_train.shape, y_train.dtype, y_train"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((360000, 1000),\n",
" dtype('float64'),\n",
" array([[0., 1., 0., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" ...,\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.],\n",
" [0., 1., 1., ..., 0., 0., 0.]]))"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing and checking the shape and type of the validation data.\n",
"X_val.shape, X_val.dtype, X_val"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((360000,), dtype('float64'), array([0., 1., 0., ..., 1., 1., 1.]))"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Viewing and checking the shape and type of the validation labels.\n",
"y_val.shape, y_val.dtype, y_val"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nT70q619NL6k",
"tags": []
},
"source": [
"# 5 Developing a model that does better than the baseline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the data is prepared, it is time to build the model. The goal is to build the smallest model that achieves statistical power, that is, one that beats a common-sense baseline. In my case, this means an accuracy greater than 0.5. This is because I am conducting a binary classification, and therefore have two class labels. In addition, the ratio of positive to negative reviews is equal (see *section 2*), so always predicting one class, or guessing at random, gives an accuracy of about 0.5. Beating this baseline matters because it shows that the model has learned something from the inputs rather than doing no better than guessing.\n",
"\n",
"Table 1 displays the recommended last-layer activation and loss functions based on the problem type."
]
},
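{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of where the 0.5 figure comes from, the sketch below computes the accuracy obtained by always predicting the most frequent class. The label list is a small made-up example rather than the notebook's data, and `majority_class_baseline` is a hypothetical helper name, not something defined elsewhere in this notebook.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def majority_class_baseline(labels):\n",
"    # Accuracy obtained by always predicting the most frequent class.\n",
"    counts = Counter(labels)\n",
"    return max(counts.values()) / len(labels)\n",
"\n",
"\n",
"# Hypothetical balanced labels: four negative (0.0) and four positive (1.0).\n",
"example_labels = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]\n",
"print(majority_class_baseline(example_labels))  # 0.5 for balanced classes\n",
"```\n",
"\n",
"The same helper could be applied to `y_train` to confirm the baseline on the real labels; with unbalanced classes it would return a value above 0.5, which would raise the bar the model has to clear."
]
},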
{
"cell_type": "markdown",
"metadata": {
"id": "Hx2peubZXXAB"
},
"source": [
"<table>\n",
" <caption><span style=\"font-weight: bold;\">Table 1</span> Choosing the right last-layer activation and loss function for your model [6].</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Problem Type</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Last-Layer Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Loss Function</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Binary classification</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>sigmoid</code></td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>binary_crossentropy</code></td>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Multiclass, single-label classification</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>softmax</code></td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>categorical_crossentropy</code></td>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Multiclass, multilabel classification</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>sigmoid</code></td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>binary_crossentropy</code></td>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Regression to arbitrary values</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">None</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>mse</code></td>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\">Regression to values between 0 and 1</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>sigmoid</code></td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: left; padding: 8px;\"><code>mse</code> or <code>binary_crossentropy</code></td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bPVMjKkD2gMJ"
},
"source": [
"My problem type is binary classification, so the last-layer activation I will go with is `sigmoid`, and the loss function I will use is `binary_crossentropy`. For the optimiser, I will use `rmsprop` with its default learning rate."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 5.1 The `compile_fit_model` function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To avoid repetition, I will create a function called `compile_fit_model()` that calls the `create_compile_model()` and `fit_model()` methods of `BuildEvalModel` and returns `history.history` (a dictionary that contains a record of the training loss and other metrics). This will be especially helpful later on, when it comes to model tuning."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"def compile_fit_model(units: list, activation: list, num_of_layers: int,\n",
"                      epochs: int = 10, batch_size: int = 512):\n",
"    \"\"\"\n",
"    A function that compiles and fits a model.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    units : list of integers (positive)\n",
"        List containing the dimensionalities of the output space.\n",
"    activation : list of strs\n",
"        Specifies the activation functions to be used.\n",
"    num_of_layers : int\n",
"        The number of layers that the model will have.\n",
"    epochs : int, default=10\n",
"        The number of epochs to train the model.\n",
"    batch_size : int, default=512\n",
"        The number of samples per gradient update.\n",
"\n",
"    Returns\n",
"    -------\n",
"    history : dict\n",
"        Returns a dict that holds a record of training loss and other\n",
"        metrics.\n",
"    \"\"\"\n",
"    # Creating an instance of the BuildEvalModel class.\n",
"    build_eval_model = BuildEvalModel()\n",
"\n",
"    # Creating and compiling the model.\n",
"    model = build_eval_model.create_compile_model(\n",
"        units=units,\n",
"        activation=activation,\n",
"        input_shape=(1000,),\n",
"        num_of_layers=num_of_layers\n",
"    )\n",
"\n",
"    # Building / fitting the model.\n",
"    history = build_eval_model.fit_model(\n",
"        model, X_train, y_train, X_val, y_val,\n",
"        epochs=epochs,\n",
"        batch_size=batch_size\n",
"    )\n",
"\n",
"    return history.history"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 5.2 The `TrainValPlot` class"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will also create a class called `TrainValPlot` that plots the loss or accuracy of the training and validation data, which will save time in the long run. The class contains the following methods:\n",
"1. `display_legend_plot()`\n",
"2. `set_labels(title, xlabel, ylabel)`\n",
"3. `loss_plot(history, colour)`\n",
"4. `accuracy_plot(history, colour)`\n",
"5. `plot_model_loss(hyperparameter_name, hyperparameter, loss, val_loss, colour)`\n",
"6. `plot_model_accuracy(hyperparameter_name, hyperparameter, accuracy, val_accuracy, colour)`\n",
"\n",
"The `display_legend_plot()` method is a helper that shows the legend, the grid, and the plot; its purpose is to reduce the amount of repeated code. The `set_labels()` method is also a helper; it sets the title and the axis labels. The `loss_plot()` method plots the training loss against the validation loss per epoch, which lets me visualise training and spot overfitting. The `accuracy_plot()` method plots the training accuracy against the validation accuracy per epoch, and is likewise useful for spotting the point at which overfitting starts. The `plot_model_loss()` and `plot_model_accuracy()` methods plot the training and validation loss (or accuracy) against the values of a given hyperparameter. These last two methods will be useful when I start tuning the hyperparameters (see *section 7*)."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote all the code in this cell.\n",
"class TrainValPlot:\n",
"    \"\"\"\n",
"    Class that can plot the loss or accuracy of training and validation data.\n",
"\n",
"    Methods\n",
"    -------\n",
"    display_legend_plot()\n",
"    set_labels(title, xlabel, ylabel)\n",
"    loss_plot(history, colour)\n",
"    accuracy_plot(history, colour)\n",
"    plot_model_loss(hyperparameter_name, hyperparameter, loss, val_loss, colour)\n",
"    plot_model_accuracy(hyperparameter_name, hyperparameter, accuracy, val_accuracy, colour)\n",
"    \"\"\"\n",
"\n",
"    def __init__(self):\n",
"        pass\n",
"\n",
"    def display_legend_plot(self):\n",
"        \"\"\"A helper function that shows the legend and the plot.\"\"\"\n",
"        plt.legend()\n",
"        plt.grid()\n",
"        plt.show()\n",
"\n",
"    def set_labels(self, title: str, xlabel: str, ylabel: str):\n",
"        \"\"\"\n",
"        A helper function that sets the title and labels.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        title : str\n",
"            The title of the graph.\n",
"        xlabel : str\n",
"            The x axis label.\n",
"        ylabel : str\n",
"            The y axis label.\n",
"        \"\"\"\n",
"        # Giving a title and labels\n",
"        plt.title(title)\n",
"        plt.xlabel(xlabel)\n",
"        plt.ylabel(ylabel)\n",
"\n",
"    def loss_plot(self, history: dict, colour: list = [\"DodgerBlue\", \"Violet\"]):\n",
"        \"\"\"\n",
"        A function that plots the training loss against the validation loss.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        history : dict\n",
"            The history dictionary that contains information about the loss.\n",
"        colour : list, default=[\"DodgerBlue\", \"Violet\"]\n",
"            A list of colours (string values).\n",
"        \"\"\"\n",
"\n",
"        # Accessing the loss values utilising the history dict.\n",
"        training_loss = history[\"loss\"]\n",
"        val_loss = history[\"val_loss\"]\n",
"\n",
"        # Calculating the number of epochs.\n",
"        epochs = range(1, len(training_loss) + 1)\n",
"\n",
"        # Plotting the training and validation loss.\n",
"        plt.figure(figsize=(10, 6))\n",
"        plt.plot(epochs, training_loss, colour[0], label=\"Training Loss\")\n",
"        plt.plot(epochs, val_loss, colour[1], label=\"Validation Loss\")\n",
"\n",
"        # Setting a title and labels, displaying the plot.\n",
"        self.set_labels(\"Training and Validation Loss\", \"Epochs\", \"Loss\")\n",
"        self.display_legend_plot()\n",
"\n",
"    def accuracy_plot(self, history: dict, colour: list = [\"DodgerBlue\", \"Violet\"]):\n",
"        \"\"\"\n",
"        A function that plots the training accuracy against the validation\n",
"        accuracy.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        history : dict\n",
"            The history dictionary that contains information about the\n",
"            accuracy.\n",
"        colour : list, default=[\"DodgerBlue\", \"Violet\"]\n",
"            A list of colours (string values).\n",
"        \"\"\"\n",
"\n",
"        # Accessing the accuracy values utilising the history dict.\n",
"        training_acc = history[\"accuracy\"]\n",
"        val_acc = history[\"val_accuracy\"]\n",
"\n",
"        # Calculating the number of epochs.\n",
"        epochs = range(1, len(training_acc) + 1)\n",
"\n",
"        # Plotting the training and validation accuracy.\n",
"        plt.figure(figsize=(10, 6))\n",
"        plt.plot(epochs, training_acc, colour[0], label=\"Training Accuracy\")\n",
"        plt.plot(epochs, val_acc, colour[1], label=\"Validation Accuracy\")\n",
"\n",
"        # Setting a title and labels, displaying the plot.\n",
"        self.set_labels(\"Training and Validation Accuracy\",\n",
"                        \"Epochs\", \"Accuracy\")\n",
"        self.display_legend_plot()\n",
"\n",
"    def plot_model_loss(self, hyperparameter_name: str, hyperparameter: list,\n",
"                        loss: list, val_loss: list,\n",
"                        colour: list = [\"DodgerBlue\", \"Violet\"]):\n",
"        \"\"\"\n",
"        A function that plots the training loss against the validation\n",
"        loss based on a given hyperparameter.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        hyperparameter_name : str\n",
"            The name of the hyperparameter (e.g. the number of units).\n",
"        hyperparameter : list\n",
"            The list containing the hyperparameter values.\n",
"        loss : list\n",
"            The loss values.\n",
"        val_loss : list\n",
"            The validation loss values.\n",
"        colour : list, default=[\"DodgerBlue\", \"Violet\"]\n",
"            A list of colours (string values).\n",
"\n",
"        Notes\n",
"        -----\n",
"        The inputted lists (hyperparameter, loss, and val_loss) must be the\n",
"        same length.\n",
"        \"\"\"\n",
"        # Checking if the lists are the same length.\n",
"        if not (len(hyperparameter) == len(loss) == len(val_loss)):\n",
"            raise ValueError(\"Input lists must be the same length.\")\n",
"\n",
"        # Plotting the training and validation loss.\n",
"        plt.figure(figsize=(10, 6))\n",
"        plt.plot(hyperparameter, loss, colour[0], label=\"Training Loss\")\n",
"        plt.plot(hyperparameter, val_loss, colour[1], label=\"Validation Loss\")\n",
"\n",
"        # Setting a title and labels, displaying the plot.\n",
"        self.set_labels(\n",
"            f\"{hyperparameter_name}: Training and Validation Loss\",\n",
"            hyperparameter_name, \"Loss\")\n",
"        self.display_legend_plot()\n",
"\n",
"    def plot_model_accuracy(self, hyperparameter_name: str, hyperparameter: list,\n",
"                            accuracy: list, val_accuracy: list,\n",
"                            colour: list = [\"DodgerBlue\", \"Violet\"]):\n",
"        \"\"\"\n",
"        A function that plots the training accuracy against the validation\n",
"        accuracy based on a given hyperparameter.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        hyperparameter_name : str\n",
"            The name of the hyperparameter (e.g. the number of units).\n",
"        hyperparameter : list\n",
"            The list containing the hyperparameter values.\n",
"        accuracy : list\n",
"            The accuracy values.\n",
"        val_accuracy : list\n",
"            The validation accuracy values.\n",
"        colour : list, default=[\"DodgerBlue\", \"Violet\"]\n",
"            A list of colours (string values).\n",
"\n",
"        Notes\n",
"        -----\n",
"        The inputted lists (hyperparameter, accuracy, and val_accuracy)\n",
"        must be the same length.\n",
"        \"\"\"\n",
"        # Checking if the lists are the same length.\n",
"        if not (len(hyperparameter) == len(accuracy) == len(val_accuracy)):\n",
"            raise ValueError(\"Input lists must be the same length.\")\n",
"\n",
"        # Plotting the training and validation accuracy.\n",
"        plt.figure(figsize=(10, 6))\n",
"        plt.plot(hyperparameter, accuracy,\n",
"                 colour[0], label=\"Training Accuracy\")\n",
"        plt.plot(hyperparameter, val_accuracy,\n",
"                 colour[1], label=\"Validation Accuracy\")\n",
"\n",
"        # Setting a title and labels, displaying the plot.\n",
"        self.set_labels(\n",
"            f\"{hyperparameter_name}: Training and Validation Accuracy\",\n",
"            hyperparameter_name, \"Accuracy\")\n",
"        self.display_legend_plot()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 5.3 The first model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the goal is to achieve statistical power with the simplest possible model, I will begin with a simple two-layer model. Table 2 displays the hyperparameters / parameters I will be using for the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table style=\"width: 700px\">\n",
" <caption><span style=\"font-weight: bold;\">Table 2</span> Model 1 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">2</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[16, 1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"relu\", \"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">5</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">512</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.3.1 Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will begin by using the `compile_fit_model()` function to create, compile, and fit the model."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/5\n",
"2813/2813 [==============================] - 29s 10ms/step - loss: 0.3687 - accuracy: 0.8382 - val_loss: 0.3488 - val_accuracy: 0.8460\n",
"Epoch 2/5\n",
"2813/2813 [==============================] - 23s 8ms/step - loss: 0.3457 - accuracy: 0.8480 - val_loss: 0.3420 - val_accuracy: 0.8496\n",
"Epoch 3/5\n",
"2813/2813 [==============================] - 27s 10ms/step - loss: 0.3412 - accuracy: 0.8502 - val_loss: 0.3394 - val_accuracy: 0.8510\n",
"Epoch 4/5\n",
"2813/2813 [==============================] - 24s 8ms/step - loss: 0.3389 - accuracy: 0.8515 - val_loss: 0.3388 - val_accuracy: 0.8509\n",
"Epoch 5/5\n",
"2813/2813 [==============================] - 26s 9ms/step - loss: 0.3369 - accuracy: 0.8522 - val_loss: 0.3353 - val_accuracy: 0.8524\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history1a = compile_fit_model(units=[16, 1], \n",
" activation=[\"relu\", \"sigmoid\"], \n",
" num_of_layers=2,\n",
" epochs=5, \n",
" batch_size=512)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next I will view the keys that are in the dictionary, since these are the keys I used in the `TrainValPlot` class, and I need to ensure that they are what I expect."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Checking the available keys.\n",
"history1a.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.3.2 Plotting the training and validation loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the model compiled and fitted, I can now plot the training and validation loss so that I can make comparisons. First I will create an instance of the `TrainValPlot` class (I will be using this instance in the subsequent sections too). Then I will call the `loss_plot()` method."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# I wrote the code in this cell.\n",
"# Instantiating the class.\n",
"train_val_plot = TrainValPlot()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation loss.\n",
"train_val_plot.loss_plot(history1a)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 1</span> Training and Validation loss for model 1.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 1 shows no sign of overfitting: the validation loss stays at or below the training loss and is still falling at the 5th epoch. Underfitting doesn't occur either, since both losses decrease at every epoch, meaning that the model is still learning the underlying patterns of the data."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.3.3 Plotting the training and validation accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I will call the `accuracy_plot()` method."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation accuracy.\n",
"train_val_plot.accuracy_plot(history1a, [\"SlateBlue\", \"LightGreen\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 2</span> Training and Validation accuracy for model 1.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Likewise, figure 2 demonstrates that underfitting doesn't occur, as the accuracy generally increases for both the training and validation data. There is a slight dip in validation accuracy at the 4th epoch (0.8510 to 0.8509) while the training accuracy keeps increasing, which could hint at the onset of overfitting. However, the dip is tiny and the validation accuracy rises again at the 5th epoch, so it is more likely noise. The accuracy reaches a high of about 85% for both the training and validation data, which means that the model achieves statistical power."
]
},
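{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check exactly where the validation accuracy falls, rather than reading it off the plot, is to scan the history dictionary for epoch-to-epoch decreases. The sketch below is illustrative only: `validation_dips` is a hypothetical helper, and the values are copied by hand from the model 1 training log above rather than taken from `history1a`.\n",
"\n",
"```python\n",
"def validation_dips(history, key='val_accuracy'):\n",
"    # Return (epoch, change) pairs for every epoch where the metric fell\n",
"    # relative to the previous epoch. Epochs are numbered from 1.\n",
"    values = history[key]\n",
"    dips = []\n",
"    for i in range(1, len(values)):\n",
"        change = values[i] - values[i - 1]\n",
"        if change < 0:\n",
"            dips.append((i + 1, round(change, 4)))\n",
"    return dips\n",
"\n",
"\n",
"# Validation accuracies copied from the model 1 training log.\n",
"history_example = {'val_accuracy': [0.8460, 0.8496, 0.8510, 0.8509, 0.8524]}\n",
"print(validation_dips(history_example))\n",
"```\n",
"\n",
"Passing `history1a` directly should give the same kind of result, although the unrounded values stored in the dictionary may differ slightly from the rounded figures printed in the log."
]
},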
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 5.4 The second model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the first model was simple, it still beat the baseline by a large margin. I am going to reduce the model to a single layer with 1 unit, to see if I am able to achieve statistical power with a smaller model. Table 3 displays the hyperparameters / parameters I will be using for this model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table style=\"width: 700px\">\n",
" <caption><span style=\"font-weight: bold;\">Table 3</span> Model 2 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">1</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">5</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">512</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.4.1 Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am using the `compile_fit_model()` function to create, compile, and fit the model."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/5\n",
"2813/2813 [==============================] - 32s 11ms/step - loss: 0.4042 - accuracy: 0.8273 - val_loss: 0.3665 - val_accuracy: 0.8409\n",
"Epoch 2/5\n",
"2813/2813 [==============================] - 24s 9ms/step - loss: 0.3657 - accuracy: 0.8415 - val_loss: 0.3644 - val_accuracy: 0.8418\n",
"Epoch 3/5\n",
"2813/2813 [==============================] - 24s 8ms/step - loss: 0.3651 - accuracy: 0.8417 - val_loss: 0.3644 - val_accuracy: 0.8420\n",
"Epoch 4/5\n",
"2813/2813 [==============================] - 23s 8ms/step - loss: 0.3651 - accuracy: 0.8417 - val_loss: 0.3644 - val_accuracy: 0.8420\n",
"Epoch 5/5\n",
"2813/2813 [==============================] - 25s 9ms/step - loss: 0.3652 - accuracy: 0.8418 - val_loss: 0.3644 - val_accuracy: 0.8420\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history2a = compile_fit_model(units=[1], \n",
" activation=[\"sigmoid\"], \n",
" num_of_layers=1,\n",
" epochs=5, \n",
" batch_size=512)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.4.2 Plotting the training and validation loss"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation loss.\n",
"train_val_plot.loss_plot(history2a)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 3</span> Training and Validation loss for model 2.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 3 shows that while overfitting doesn't occur, underfitting does. From the 2nd epoch onwards, both the training and validation loss remain almost constant at about 0.365, which means that the model has stopped improving. This is expected for a simple 1 layer, 1 unit model: a single sigmoid unit is equivalent to logistic regression, so its capacity is limited. The loss values are also higher than those of model 1 (about 0.337 at the 5th epoch)."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.4.3 Plotting the training and validation accuracy"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation accuracy.\n",
"train_val_plot.accuracy_plot(history2a, [\"SlateBlue\", \"LightGreen\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 4</span> Training and Validation accuracy for model 2.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 4 also displays underfitting. From the 2nd epoch the model's accuracy remains almost constant. That being said, the model reaches a high of around 84% accuracy. While this is lower than the last model, it still comfortably beats the 0.5 baseline. I suspect that the accuracy is high because of the large amount of training data; the number of epochs may also have an effect, although the accuracy barely changes after the 2nd epoch."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 5.5 The third model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yet again, the model performed pretty well. I can no longer decrease the number of layers or units, so I'll try to simplify it further by decreasing the number of epochs from 5 to 3. Table 4 displays the hyperparameters / parameters that I will be using for this model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table style=\"width: 700px\">\n",
" <caption><span style=\"font-weight: bold;\">Table 4</span> Model 3 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">1</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">3</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">128</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.5.1 Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am using the `compile_fit_model()` function to create, compile, and fit the model."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/3\n",
"11250/11250 [==============================] - 69s 6ms/step - loss: 0.3814 - accuracy: 0.8356 - val_loss: 0.3651 - val_accuracy: 0.8422\n",
"Epoch 2/3\n",
"11250/11250 [==============================] - 42s 4ms/step - loss: 0.3665 - accuracy: 0.8415 - val_loss: 0.3663 - val_accuracy: 0.8415\n",
"Epoch 3/3\n",
"11250/11250 [==============================] - 44s 4ms/step - loss: 0.3669 - accuracy: 0.8415 - val_loss: 0.3661 - val_accuracy: 0.8418\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history3a = compile_fit_model(units=[1], \n",
" activation=[\"sigmoid\"], \n",
" num_of_layers=1,\n",
" epochs=3, \n",
" batch_size=128)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.5.2 Plotting the training and validation loss"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation loss.\n",
"train_val_plot.loss_plot(history3a)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 5</span> Training and Validation loss for model 3.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 5 above suggests that overfitting is occurring, as the training loss and validation loss diverge: the training loss decreases overall, while the validation loss rises slightly after the 1st epoch. The differences are small (on the order of 0.001), but they hint that this model doesn't generalise well to unseen data."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.5.3 Plotting the training and validation accuracy"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation accuracy.\n",
"train_val_plot.accuracy_plot(history3a, [\"SlateBlue\", \"LightGreen\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 6</span> Training and Validation accuracy for model 3.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Likewise, Figure 6 displays overfitting for the same reasons as above. Rather than increasing, the validation accuracy decreases at the 2nd epoch, although it increases again at the 3rd. Despite this, the model still achieves statistical power."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 5.6 The fourth model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The third model still performed pretty well. I want to simplify the model one last time. This time I will run only 1 epoch (the batch size also changes from 128 to 512). Table 5 displays the hyperparameters / parameters that I will be using for this model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table style=\"width: 700px\">\n",
" <caption><span style=\"font-weight: bold;\">Table 5</span> Model 4 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">1</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">1</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">512</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.6.1 Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am using the `compile_fit_model()` function to create, compile, and fit the model."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2813/2813 [==============================] - 34s 12ms/step - loss: 0.4057 - accuracy: 0.8258 - val_loss: 0.3665 - val_accuracy: 0.8411\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history4a = compile_fit_model(units=[1], \n",
" activation=[\"sigmoid\"], \n",
" num_of_layers=1,\n",
" epochs=1, \n",
" batch_size=512)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 5.6.2 Summary of results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since only 1 epoch was run, there is no point in plotting a graph for these results. Instead, I will summarise the results in a table."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table style=\"width: 700px\">\n",
" <caption><span style=\"font-weight: bold;\">Table 6</span> Model 4 Summary.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Loss</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Validation Loss</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Accuracy</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Validation Accuracy</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">0.4057</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">0.3665</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">0.8258</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">0.8411</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Table 6 above shows that the model performed better on the validation data than on the training data, which makes overfitting unlikely. It also suggests that the model learnt something, since the validation results improved upon the training results. This model also achieved statistical power (reaching ~84% accuracy). It is not possible to simplify the model any further; therefore, this is the simplest model that achieves statistical power."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wXS3zyR5NvZC",
"tags": []
},
"source": [
"# 6 Scaling up - developing a model that overfits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that I have achieved statistical power, I will develop a model that overfits. To achieve this, I will increase my model's capacity (by increasing the number of layers and units). The reason I am aiming to overfit is that the \"*universal tension in machine learning is between optimization and generalization*\"[6], and therefore the optimal model is one \"*that stands right at the border between underfitting and overfitting*\"[6]. Yet in order to determine that border, the model must first cross it.\n",
"\n",
"Considering that my model predicts the sentiment of product reviews, anything better than common sense is enough (especially since a lot of bias is involved). If my problem were bank fraud or something similar, I would want a model that is at least 99.9% accurate; otherwise I would be alarming lots of customers for no reason. That being said, I am still aiming to scale up, make the model more powerful, and get the best accuracy that I possibly can, though for my problem domain 80% to 90% is sufficient."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 6.1 The first model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since I am aiming to create a model that overfits, I will create a larger model. In the previous section my models had either 1 or 2 layers (and yet they achieved statistical power), so I will begin with 3 layers. I will also increase the number of epochs to 10. Table 7 displays the hyperparameters / parameters I will be using for the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table>\n",
" <caption><span style=\"font-weight: bold;\">Table 7</span> Model 1 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">3</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[32, 16, 1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"relu\", \"relu\", \"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">10</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">512</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.1.1 Building the model"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/10\n",
"2813/2813 [==============================] - 57s 19ms/step - loss: 0.3544 - accuracy: 0.8426 - val_loss: 0.3289 - val_accuracy: 0.8556\n",
"Epoch 2/10\n",
"2813/2813 [==============================] - 46s 16ms/step - loss: 0.3234 - accuracy: 0.8581 - val_loss: 0.3192 - val_accuracy: 0.8608\n",
"Epoch 3/10\n",
"2813/2813 [==============================] - 44s 16ms/step - loss: 0.3139 - accuracy: 0.8634 - val_loss: 0.3165 - val_accuracy: 0.8621\n",
"Epoch 4/10\n",
"2813/2813 [==============================] - 46s 16ms/step - loss: 0.3082 - accuracy: 0.8665 - val_loss: 0.3127 - val_accuracy: 0.8640\n",
"Epoch 5/10\n",
"2813/2813 [==============================] - 46s 16ms/step - loss: 0.3041 - accuracy: 0.8687 - val_loss: 0.3107 - val_accuracy: 0.8655\n",
"Epoch 6/10\n",
"2813/2813 [==============================] - 36s 13ms/step - loss: 0.3011 - accuracy: 0.8704 - val_loss: 0.3112 - val_accuracy: 0.8648\n",
"Epoch 7/10\n",
"2813/2813 [==============================] - 30s 11ms/step - loss: 0.2990 - accuracy: 0.8713 - val_loss: 0.3113 - val_accuracy: 0.8651\n",
"Epoch 8/10\n",
"2813/2813 [==============================] - 42s 15ms/step - loss: 0.2971 - accuracy: 0.8725 - val_loss: 0.3100 - val_accuracy: 0.8655\n",
"Epoch 9/10\n",
"2813/2813 [==============================] - 39s 14ms/step - loss: 0.2956 - accuracy: 0.8731 - val_loss: 0.3092 - val_accuracy: 0.8658\n",
"Epoch 10/10\n",
"2813/2813 [==============================] - 39s 14ms/step - loss: 0.2944 - accuracy: 0.8738 - val_loss: 0.3092 - val_accuracy: 0.8659\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history1b = compile_fit_model(units=[32, 16, 1], \n",
" activation=[\"relu\", \"relu\", \"sigmoid\"], \n",
" num_of_layers=3,\n",
" epochs=10, \n",
" batch_size=512)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.1.2 Plotting the training and validation loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like in the previous section, I will plot the training and validation loss so that I can make comparisons. I will use the instance of the `TrainValPlot` class from before. Then I will call the `loss_plot()` method."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation loss.\n",
"train_val_plot.loss_plot(history1b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 7</span> Training and Validation loss for model 1.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on Figure 7, it seems that overfitting begins around the 4th epoch, since this is where the training loss and validation loss begin to diverge. From then on, the training loss continues to decrease whereas the validation loss begins to plateau."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.1.3 Plotting the training and validation accuracy"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation accuracy.\n",
"train_val_plot.accuracy_plot(history1b, [\"SlateBlue\", \"LightGreen\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 8</span> Training and Validation accuracy for model 1.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 8 behaves like Figure 7. At the 4th epoch overfitting begins to take place, as made obvious by the divergence of the green (validation) and purple (training) lines. The training accuracy reaches approximately 87.4% whereas the validation accuracy reaches approximately 86.6%. Although the graph displays overfitting, the difference between the two accuracies is small (under 1 percentage point). If I were to just look at the numbers, I might not have realised that overfitting had taken place."
]
},
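{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough numerical check of the divergence seen in Figures 7 and 8, the sketch below finds the first epoch at which the training loss drops below the validation loss, and the size of the gap at the final epoch. It is only an illustration: `first_crossover_epoch` is a hypothetical helper that is not part of my original code, and the loss values are copied by hand from the log in section 6.1.1 rather than read from `history1b`. A crossover is just one simple signal of divergence, so the plots remain the main evidence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not part of the original analysis.\n",
"# `first_crossover_epoch` is a hypothetical helper name.\n",
"def first_crossover_epoch(train_loss, val_loss):\n",
"    # Return the 1-based epoch at which the training loss first falls\n",
"    # below the validation loss, or None if that never happens.\n",
"    for epoch, (train, val) in enumerate(zip(train_loss, val_loss), start=1):\n",
"        if train < val:\n",
"            return epoch\n",
"    return None\n",
"\n",
"# Loss values copied from the training log in section 6.1.1.\n",
"train_loss = [0.3544, 0.3234, 0.3139, 0.3082, 0.3041, 0.3011, 0.2990, 0.2971, 0.2956, 0.2944]\n",
"val_loss = [0.3289, 0.3192, 0.3165, 0.3127, 0.3107, 0.3112, 0.3113, 0.3100, 0.3092, 0.3092]\n",
"\n",
"crossover = first_crossover_epoch(train_loss, val_loss)\n",
"final_gap = round(val_loss[-1] - train_loss[-1], 4)\n",
"print(crossover, final_gap)"
]
},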
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 6.2 The second model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the previous model overfitted and reached an accuracy of about 87%, I would like to build a bigger model and see whether the loss can decrease and the accuracy can increase. In addition, I will increase the number of epochs to 15. Table 8 displays the hyperparameters / parameters I will be using for the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table>\n",
" <caption><span style=\"font-weight: bold;\">Table 8</span> Model 2 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">4</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[64, 32, 16, 1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"relu\", \"relu\", \"relu\", \"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">15</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">512</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.2.1 Building the model"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/15\n",
"2813/2813 [==============================] - 45s 15ms/step - loss: 0.3449 - accuracy: 0.8473 - val_loss: 0.3189 - val_accuracy: 0.8610\n",
"Epoch 2/15\n",
"2813/2813 [==============================] - 48s 17ms/step - loss: 0.3109 - accuracy: 0.8651 - val_loss: 0.3095 - val_accuracy: 0.8662\n",
"Epoch 3/15\n",
"2813/2813 [==============================] - 41s 15ms/step - loss: 0.2995 - accuracy: 0.8711 - val_loss: 0.3069 - val_accuracy: 0.8671\n",
"Epoch 4/15\n",
"2813/2813 [==============================] - 37s 13ms/step - loss: 0.2921 - accuracy: 0.8753 - val_loss: 0.3038 - val_accuracy: 0.8691\n",
"Epoch 5/15\n",
"2813/2813 [==============================] - 36s 13ms/step - loss: 0.2871 - accuracy: 0.8777 - val_loss: 0.3035 - val_accuracy: 0.8692\n",
"Epoch 6/15\n",
"2813/2813 [==============================] - 42s 15ms/step - loss: 0.2831 - accuracy: 0.8798 - val_loss: 0.3040 - val_accuracy: 0.8690\n",
"Epoch 7/15\n",
"2813/2813 [==============================] - 44s 16ms/step - loss: 0.2799 - accuracy: 0.8813 - val_loss: 0.3039 - val_accuracy: 0.8693\n",
"Epoch 8/15\n",
"2813/2813 [==============================] - 47s 17ms/step - loss: 0.2775 - accuracy: 0.8827 - val_loss: 0.3039 - val_accuracy: 0.8694\n",
"Epoch 9/15\n",
"2813/2813 [==============================] - 46s 16ms/step - loss: 0.2751 - accuracy: 0.8838 - val_loss: 0.3049 - val_accuracy: 0.8691\n",
"Epoch 10/15\n",
"2813/2813 [==============================] - 47s 17ms/step - loss: 0.2733 - accuracy: 0.8848 - val_loss: 0.3068 - val_accuracy: 0.8681\n",
"Epoch 11/15\n",
"2813/2813 [==============================] - 42s 15ms/step - loss: 0.2717 - accuracy: 0.8856 - val_loss: 0.3068 - val_accuracy: 0.8684\n",
"Epoch 12/15\n",
"2813/2813 [==============================] - 30s 11ms/step - loss: 0.2701 - accuracy: 0.8866 - val_loss: 0.3070 - val_accuracy: 0.8683\n",
"Epoch 13/15\n",
"2813/2813 [==============================] - 29s 10ms/step - loss: 0.2688 - accuracy: 0.8872 - val_loss: 0.3092 - val_accuracy: 0.8678\n",
"Epoch 14/15\n",
"2813/2813 [==============================] - 43s 15ms/step - loss: 0.2677 - accuracy: 0.8879 - val_loss: 0.3086 - val_accuracy: 0.8679\n",
"Epoch 15/15\n",
"2813/2813 [==============================] - 46s 16ms/step - loss: 0.2665 - accuracy: 0.8885 - val_loss: 0.3094 - val_accuracy: 0.8673\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history2b = compile_fit_model(units=[64, 32, 16, 1], \n",
" activation=[\"relu\", \"relu\", \"relu\", \"sigmoid\"], \n",
" num_of_layers=4,\n",
" epochs=15, \n",
" batch_size=512)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.2.2 Plotting the training and validation loss"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation loss.\n",
"train_val_plot.loss_plot(history2b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 9</span> Training and Validation loss for model 2.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 9 shows that overfitting definitely occurs. Again, it seems to begin around the 4th epoch. From then on, the validation loss plateaus and then increases, whereas the training loss continues to drop. This makes it clear that this model doesn't generalise well to unseen data."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.2.3 Plotting the training and validation accuracy"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation accuracy.\n",
"train_val_plot.accuracy_plot(history2b, [\"SlateBlue\", \"LightGreen\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 10</span> Training and Validation accuracy for model 2.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same is true for Figure 10, which also displays overfitting (starting at the 4th epoch). The trajectory of the green line (validation accuracy) shows that the accuracy begins to decrease after it plateaus. That said, the training accuracy ends at approximately 88.8% and the validation accuracy at approximately 86.7%, so although overfitting occurs, the two are not too far apart. Furthermore, 88.8% is the highest training accuracy that a model has produced so far."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 6.3 The third model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last model overfitted and reached an accuracy of approximately 88.8%, which is quite good. However, I don't believe that my model is that complex. For this next model, I will increase the number of units, while keeping the number of layers the same. I will also increase the epochs to 20, and decrease the batch size to 128. Table 9 displays the hyperparameters / parameters I will be using for the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table>\n",
" <caption><span style=\"font-weight: bold;\">Table 9</span> Model 3 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">4</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[128, 64, 32, 1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"relu\", \"relu\", \"relu\", \"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">20</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">128</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.3.1 Building the model"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/20\n",
"11250/11250 [==============================] - 123s 11ms/step - loss: 0.3340 - accuracy: 0.8542 - val_loss: 0.3165 - val_accuracy: 0.8641\n",
"Epoch 2/20\n",
"11250/11250 [==============================] - 94s 8ms/step - loss: 0.3096 - accuracy: 0.8685 - val_loss: 0.3102 - val_accuracy: 0.8677\n",
"Epoch 3/20\n",
"11250/11250 [==============================] - 108s 10ms/step - loss: 0.2999 - accuracy: 0.8743 - val_loss: 0.3074 - val_accuracy: 0.8694\n",
"Epoch 4/20\n",
"11250/11250 [==============================] - 111s 10ms/step - loss: 0.2926 - accuracy: 0.8783 - val_loss: 0.3081 - val_accuracy: 0.8700\n",
"Epoch 5/20\n",
"11250/11250 [==============================] - 104s 9ms/step - loss: 0.2869 - accuracy: 0.8818 - val_loss: 0.3079 - val_accuracy: 0.8692\n",
"Epoch 6/20\n",
"11250/11250 [==============================] - 92s 8ms/step - loss: 0.2827 - accuracy: 0.8845 - val_loss: 0.3076 - val_accuracy: 0.8691\n",
"Epoch 7/20\n",
"11250/11250 [==============================] - 95s 8ms/step - loss: 0.2786 - accuracy: 0.8869 - val_loss: 0.3097 - val_accuracy: 0.8692\n",
"Epoch 8/20\n",
"11250/11250 [==============================] - 126s 11ms/step - loss: 0.2757 - accuracy: 0.8889 - val_loss: 0.3122 - val_accuracy: 0.8677\n",
"Epoch 9/20\n",
"11250/11250 [==============================] - 125s 11ms/step - loss: 0.2729 - accuracy: 0.8904 - val_loss: 0.3112 - val_accuracy: 0.8679\n",
"Epoch 10/20\n",
"11250/11250 [==============================] - 128s 11ms/step - loss: 0.2706 - accuracy: 0.8921 - val_loss: 0.3182 - val_accuracy: 0.8661\n",
"Epoch 11/20\n",
"11250/11250 [==============================] - 116s 10ms/step - loss: 0.2684 - accuracy: 0.8933 - val_loss: 0.3152 - val_accuracy: 0.8662\n",
"Epoch 12/20\n",
"11250/11250 [==============================] - 91s 8ms/step - loss: 0.2667 - accuracy: 0.8945 - val_loss: 0.3160 - val_accuracy: 0.8658\n",
"Epoch 13/20\n",
"11250/11250 [==============================] - 493s 44ms/step - loss: 0.2649 - accuracy: 0.8958 - val_loss: 0.3202 - val_accuracy: 0.8659\n",
"Epoch 14/20\n",
"11250/11250 [==============================] - 74s 7ms/step - loss: 0.2634 - accuracy: 0.8968 - val_loss: 0.3195 - val_accuracy: 0.8654\n",
"Epoch 15/20\n",
"11250/11250 [==============================] - 165s 15ms/step - loss: 0.2617 - accuracy: 0.8979 - val_loss: 0.3275 - val_accuracy: 0.8647\n",
"Epoch 16/20\n",
"11250/11250 [==============================] - 81s 7ms/step - loss: 0.2600 - accuracy: 0.8987 - val_loss: 0.3261 - val_accuracy: 0.8650\n",
"Epoch 17/20\n",
"11250/11250 [==============================] - 88s 8ms/step - loss: 0.2585 - accuracy: 0.8997 - val_loss: 0.3227 - val_accuracy: 0.8638\n",
"Epoch 18/20\n",
"11250/11250 [==============================] - 82s 7ms/step - loss: 0.2571 - accuracy: 0.9006 - val_loss: 0.3288 - val_accuracy: 0.8638\n",
"Epoch 19/20\n",
"11250/11250 [==============================] - 87s 8ms/step - loss: 0.2561 - accuracy: 0.9012 - val_loss: 0.3291 - val_accuracy: 0.8644\n",
"Epoch 20/20\n",
"11250/11250 [==============================] - 80s 7ms/step - loss: 0.2551 - accuracy: 0.9019 - val_loss: 0.3297 - val_accuracy: 0.8625\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history3b = compile_fit_model(units=[128, 64, 32, 1], \n",
" activation=[\"relu\", \"relu\", \"relu\", \"sigmoid\"], \n",
" num_of_layers=4,\n",
" epochs=20, \n",
" batch_size=128)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.3.2 Plotting the training and validation loss"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation loss.\n",
"train_val_plot.loss_plot(history3b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 11</span> Training and Validation loss for model 3.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 11 shows that this model also overfits, starting from around the 3rd epoch. It seems that the more complex the model, the earlier overfitting occurs, though it is too early to tell, and the problem type and dataset size will also have some influence. From the 3rd epoch, the validation loss becomes noisy and erratic, which is another clear indicator of overfitting."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.3.3 Plotting the training and validation accuracy"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation accuracy.\n",
"train_val_plot.accuracy_plot(history3b, [\"SlateBlue\", \"LightGreen\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 12</span> Training and Validation accuracy for model 3.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like Figure 11, Figure 12 displays overfitting from about the 3rd epoch. The validation accuracy reaches a high of approximately 87%, but ends around 86.3%. The training accuracy, on the other hand, ends around 90.2%, which is almost 4 percentage points greater. I believe that this is the largest gap that a model has displayed so far. This makes sense, since it is the most complex model so far."
]
},
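{
"cell_type": "markdown",
"metadata": {},
"source": [
"To back up the claim about the largest gap, the sketch below compares the final-epoch training and validation accuracies of the three models in this section. It is a minimal illustration under stated assumptions: `accuracy_gap` is a hypothetical helper that is not part of my original code, and the accuracies are copied by hand from the logs in sections 6.1.1, 6.2.1 and 6.3.1 rather than read from the history objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not part of the original analysis.\n",
"# `accuracy_gap` is a hypothetical helper name.\n",
"def accuracy_gap(train_acc, val_acc):\n",
"    # Return the train-minus-validation accuracy gap in percentage points.\n",
"    return round((train_acc - val_acc) * 100, 2)\n",
"\n",
"# Final-epoch (training, validation) accuracies copied from the logs\n",
"# in sections 6.1.1, 6.2.1 and 6.3.1.\n",
"final_accuracies = {\n",
"    \"model 1\": (0.8738, 0.8659),\n",
"    \"model 2\": (0.8885, 0.8673),\n",
"    \"model 3\": (0.9019, 0.8625),\n",
"}\n",
"\n",
"gaps = {name: accuracy_gap(train, val) for name, (train, val) in final_accuracies.items()}\n",
"largest = max(gaps, key=gaps.get)\n",
"print(gaps, largest)"
]
},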
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 6.4 The fourth model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far I have only been increasing the size of my model gradually, which is unnecessary since I am not yet at the point of experimentation or tuning. The goal is to build the most powerful model I can, and I am trying to push my model to the brink. Therefore, I will increase the number of layers to 7, while keeping the number of epochs at 20 and returning the batch size to 512. Table 10 displays the hyperparameters / parameters I will be using for the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table>\n",
" <caption><span style=\"font-weight: bold;\">Table 10</span> Model 4 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">7</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[256, 128, 128, 128, 64, 32, 1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"relu\", \"relu\", \"relu\", \"relu\", \"relu\",\n",
" \"relu\", \"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">20</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">512</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.4.1 Building the model"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/20\n",
"2813/2813 [==============================] - 118s 40ms/step - loss: 0.3375 - accuracy: 0.8505 - val_loss: 0.3220 - val_accuracy: 0.8592\n",
"Epoch 2/20\n",
"2813/2813 [==============================] - 92s 33ms/step - loss: 0.2954 - accuracy: 0.8734 - val_loss: 0.3021 - val_accuracy: 0.8699\n",
"Epoch 3/20\n",
"2813/2813 [==============================] - 80s 29ms/step - loss: 0.2767 - accuracy: 0.8831 - val_loss: 0.2980 - val_accuracy: 0.8723\n",
"Epoch 4/20\n",
"2813/2813 [==============================] - 92s 33ms/step - loss: 0.2628 - accuracy: 0.8902 - val_loss: 0.3070 - val_accuracy: 0.8714\n",
"Epoch 5/20\n",
"2813/2813 [==============================] - 77s 27ms/step - loss: 0.2510 - accuracy: 0.8962 - val_loss: 0.3028 - val_accuracy: 0.8698\n",
"Epoch 6/20\n",
"2813/2813 [==============================] - 79s 28ms/step - loss: 0.2402 - accuracy: 0.9021 - val_loss: 0.3092 - val_accuracy: 0.8691\n",
"Epoch 7/20\n",
"2813/2813 [==============================] - 74s 26ms/step - loss: 0.2308 - accuracy: 0.9067 - val_loss: 0.3180 - val_accuracy: 0.8675\n",
"Epoch 8/20\n",
"2813/2813 [==============================] - 73s 26ms/step - loss: 0.2219 - accuracy: 0.9112 - val_loss: 0.3174 - val_accuracy: 0.8661\n",
"Epoch 9/20\n",
"2813/2813 [==============================] - 74s 26ms/step - loss: 0.2141 - accuracy: 0.9154 - val_loss: 0.3375 - val_accuracy: 0.8631\n",
"Epoch 10/20\n",
"2813/2813 [==============================] - 73s 26ms/step - loss: 0.2067 - accuracy: 0.9191 - val_loss: 0.3313 - val_accuracy: 0.8629\n",
"Epoch 11/20\n",
"2813/2813 [==============================] - 72s 26ms/step - loss: 0.1998 - accuracy: 0.9224 - val_loss: 0.3453 - val_accuracy: 0.8618\n",
"Epoch 12/20\n",
"2813/2813 [==============================] - 73s 26ms/step - loss: 0.1938 - accuracy: 0.9255 - val_loss: 0.3579 - val_accuracy: 0.8605\n",
"Epoch 13/20\n",
"2813/2813 [==============================] - 78s 28ms/step - loss: 0.1882 - accuracy: 0.9283 - val_loss: 0.3613 - val_accuracy: 0.8591\n",
"Epoch 14/20\n",
"2813/2813 [==============================] - 74s 26ms/step - loss: 0.1827 - accuracy: 0.9310 - val_loss: 0.3615 - val_accuracy: 0.8581\n",
"Epoch 15/20\n",
"2813/2813 [==============================] - 84s 30ms/step - loss: 0.1777 - accuracy: 0.9335 - val_loss: 0.3794 - val_accuracy: 0.8565\n",
"Epoch 16/20\n",
"2813/2813 [==============================] - 87s 31ms/step - loss: 0.1734 - accuracy: 0.9355 - val_loss: 0.3925 - val_accuracy: 0.8563\n",
"Epoch 17/20\n",
"2813/2813 [==============================] - 114s 41ms/step - loss: 0.1686 - accuracy: 0.9377 - val_loss: 0.4021 - val_accuracy: 0.8558\n",
"Epoch 18/20\n",
"2813/2813 [==============================] - 115s 41ms/step - loss: 0.1645 - accuracy: 0.9399 - val_loss: 0.4007 - val_accuracy: 0.8542\n",
"Epoch 19/20\n",
"2813/2813 [==============================] - 104s 37ms/step - loss: 0.1607 - accuracy: 0.9416 - val_loss: 0.4004 - val_accuracy: 0.8534\n",
"Epoch 20/20\n",
"2813/2813 [==============================] - 104s 37ms/step - loss: 0.1569 - accuracy: 0.9435 - val_loss: 0.4215 - val_accuracy: 0.8526\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history4b = compile_fit_model(units=[256, 128, 128, 128, 64, 32, 1], \n",
" activation=[\"relu\", \"relu\", \"relu\", \"relu\", \n",
" \"relu\", \"relu\", \"sigmoid\"],\n",
" num_of_layers=7,\n",
" epochs=20, \n",
" batch_size=512)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.4.2 Plotting the training and validation loss"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation loss.\n",
"train_val_plot.loss_plot(history4b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 13</span> Training and Validation loss for model 4.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 13 shows that overfitting occurs almost instantly. According to the training log, the validation loss only decreases over the first three epochs (from 0.3220 to 0.2980); from there on it becomes noisy and erratic, though it generally increases. Figure 13 displays a large divergence between the training and validation loss, with the validation loss ending higher than it started (0.4215 versus 0.3220). This is unsurprising, since larger, more complex models are more susceptible to overfitting, especially when the number of epochs is high."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.4.3 Plotting the training and validation accuracy"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Plotting the training and validation accuracy.\n",
"train_val_plot.accuracy_plot(history4b, [\"SlateBlue\", \"LightGreen\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<caption><span style=\"font-weight: bold;\">Figure 14</span> Training and Validation accuracy for model 4.</caption>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
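"Figure 14 shows that the training accuracy reaches about 94.4%, whereas the validation accuracy peaks at about 87.2% in the 3rd epoch and ends at approximately 85.3%. This final gap of roughly 9 percentage points is even larger than the last model's. The validation accuracy declines from the 3rd epoch onwards, which indicates overfitting; however, this is the highest training accuracy reached so far."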
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 6.5 The fifth model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final model in this section will be a 10-layer model, 3 layers larger than the last, though I will decrease the number of epochs to 15. Table 11 displays the hyperparameters / parameters I will be using for the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table style=\"width: 850px;\">\n",
" <caption><span style=\"font-weight: bold;\">Table 11</span> Model 5 hyperparameters / parameters.</caption>\n",
" <tr style=\"background-color: #ECE5FC;\">\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Number of Layers</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Units</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Activation</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Epochs</th>\n",
" <th style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">Batch Size</th>\n",
" </tr>\n",
" <tr style=\"background-color: #FFFFFF;\">\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">10</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[512, 512, 256, 256, 128, 128, 128, 64, 32, 1]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">[\"relu\", \"relu\", \"relu\", \"relu\", \"relu\",\n",
" \"relu\", \"relu\", \"relu\", \"relu\", \"sigmoid\"]</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">15</td>\n",
" <td style=\"border: 1px solid #dddddd; text-align: center; padding: 8px;\">512</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.5.1 Building the model"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/15\n",
"2813/2813 [==============================] - 307s 108ms/step - loss: 0.3401 - accuracy: 0.8493 - val_loss: 0.3068 - val_accuracy: 0.8668\n",
"Epoch 2/15\n",
"2813/2813 [==============================] - 268s 95ms/step - loss: 0.2916 - accuracy: 0.8753 - val_loss: 0.2990 - val_accuracy: 0.8724\n",
"Epoch 3/15\n",
"2813/2813 [==============================] - 351s 125ms/step - loss: 0.2687 - accuracy: 0.8869 - val_loss: 0.2979 - val_accuracy: 0.8726\n",
"Epoch 4/15\n",
"2813/2813 [==============================] - 289s 103ms/step - loss: 0.2489 - accuracy: 0.8971 - val_loss: 0.3038 - val_accuracy: 0.8687\n",
"Epoch 5/15\n",
"2813/2813 [==============================] - 260s 92ms/step - loss: 0.2305 - accuracy: 0.9063 - val_loss: 0.3078 - val_accuracy: 0.8688\n",
"Epoch 6/15\n",
"2813/2813 [==============================] - 291s 104ms/step - loss: 0.2130 - accuracy: 0.9148 - val_loss: 0.3272 - val_accuracy: 0.8666\n",
"Epoch 7/15\n",
"2813/2813 [==============================] - 234s 83ms/step - loss: 0.1961 - accuracy: 0.9230 - val_loss: 0.3417 - val_accuracy: 0.8645\n",
"Epoch 8/15\n",
"2813/2813 [==============================] - 271s 97ms/step - loss: 0.1802 - accuracy: 0.9306 - val_loss: 0.3712 - val_accuracy: 0.8617\n",
"Epoch 9/15\n",
"2813/2813 [==============================] - 316s 112ms/step - loss: 0.1649 - accuracy: 0.9376 - val_loss: 0.4035 - val_accuracy: 0.8571\n",
"Epoch 10/15\n",
"2813/2813 [==============================] - 340s 121ms/step - loss: 0.1509 - accuracy: 0.9440 - val_loss: 0.3958 - val_accuracy: 0.8571\n",
"Epoch 11/15\n",
"2813/2813 [==============================] - 289s 103ms/step - loss: 0.1371 - accuracy: 0.9501 - val_loss: 0.4199 - val_accuracy: 0.8559\n",
"Epoch 12/15\n",
"2813/2813 [==============================] - 299s 106ms/step - loss: 0.1251 - accuracy: 0.9551 - val_loss: 0.4329 - val_accuracy: 0.8531\n",
"Epoch 13/15\n",
"2813/2813 [==============================] - 232s 83ms/step - loss: 0.1138 - accuracy: 0.9598 - val_loss: 0.4731 - val_accuracy: 0.8513\n",
"Epoch 14/15\n",
"2813/2813 [==============================] - 248s 88ms/step - loss: 0.1034 - accuracy: 0.9639 - val_loss: 0.5275 - val_accuracy: 0.8510\n",
"Epoch 15/15\n",
"2813/2813 [==============================] - 229s 81ms/step - loss: 0.0938 - accuracy: 0.9676 - val_loss: 0.5489 - val_accuracy: 0.8490\n"
]
}
],
"source": [
"# I wrote all the code in this cell.\n",
"# Creating, compiling, and fitting the model\n",
"history5b = compile_fit_model(units=[512, 512, 256, 256, 128, 128, 128, 64,\n",
" 32, 1], \n",
" activation=[\"relu\", \"relu\", \"relu\", \"relu\", \n",
" \"relu\", \"relu\", \"relu\", \"relu\", \n",
" \"relu\", \"sigmoid\"],\n",
" num_of_layers=10,\n",
" epochs=15, \n",
" batch_size=512)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### 6.5.2 Plotting the training and validation loss"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {