Last active
July 18, 2023 02:42
-
-
Save lsloan/5633133d4ae93d951f84672e95129adf to your computer and use it in GitHub Desktop.
Copy of Pandas 101 - Data Cleaning in Pandas.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/lsloan/5633133d4ae93d951f84672e95129adf/copy-of-pandas-101-data-cleaning-in-pandas.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c6038cd0", | |
"metadata": { | |
"id": "c6038cd0" | |
}, | |
"source": [ | |
"\n", | |
"# Data Cleaning in Pandas\n", | |
"\n", | |
"Inspired by \"Alex the Analyst\" and his data cleaning video for Pandas newbies, I made some corrections to his notebook and tried to make it more efficient.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Resources" | |
], | |
"metadata": { | |
"id": "UvmEWbkaKSIl" | |
}, | |
"id": "UvmEWbkaKSIl" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"* Video: https://www.youtube.com/watch?v=bDhvCp3_lYw\n", | |
"* Original notebook: https://github.com/AlexTheAnalyst/PandasYouTubeSeries/blob/331621f89f679892d18af059ea9061f91321a934/Pandas%20101%20-%20Data%20Cleaning%20in%20Pandas.ipynb\n", | |
"* Data: https://raw.githubusercontent.com/AlexTheAnalyst/PandasYouTubeSeries/331621f89f679892d18af059ea9061f91321a934/Customer%20Call%20List.xlsx" | |
], | |
"metadata": { | |
"id": "VmQLeYX5JqMW" | |
}, | |
"id": "VmQLeYX5JqMW" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Code" | |
], | |
"metadata": { | |
"id": "t2dzuR7GJ16X" | |
}, | |
"id": "t2dzuR7GJ16X" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "9d8c99e2", | |
"metadata": { | |
"id": "9d8c99e2", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 891 | |
}, | |
"outputId": "d3bc90ff-6627-4271-c699-447c5c26346e" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
" CustomerID First_Name Last_Name Phone_Number \\\n", | |
"0 1001 Frodo Baggins 123-545-5421 \n", | |
"1 1002 Abed Nadir 123/643/9775 \n", | |
"2 1003 Walter /White 7066950392 \n", | |
"3 1004 Dwight Schrute 123-543-2345 \n", | |
"4 1005 Jon Snow 876|678|3469 \n", | |
"5 1006 Ron Swanson 304-762-2467 \n", | |
"6 1007 Jeff Winger NaN \n", | |
"7 1008 Sherlock Holmes 876|678|3469 \n", | |
"8 1009 Gandalf NaN N/a \n", | |
"9 1010 Peter Parker 123-545-5421 \n", | |
"10 1011 Samwise Gamgee NaN \n", | |
"11 1012 Harry ...Potter 7066950392 \n", | |
"12 1013 Don Draper 123-543-2345 \n", | |
"13 1014 Leslie Knope 876|678|3469 \n", | |
"14 1015 Toby Flenderson_ 304-762-2467 \n", | |
"15 1016 Ron Weasley 123-545-5421 \n", | |
"16 1017 Michael Scott 123/643/9775 \n", | |
"17 1018 Clark Kent 7066950392 \n", | |
"18 1019 Creed Braton N/a \n", | |
"19 1020 Anakin Skywalker 876|678|3469 \n", | |
"20 1020 Anakin Skywalker 876|678|3469 \n", | |
"\n", | |
" Address Paying Customer Do_Not_Contact \\\n", | |
"0 123 Shire Lane, Shire Yes No \n", | |
"1 93 West Main Street No Yes \n", | |
"2 298 Drugs Driveway N NaN \n", | |
"3 980 Paper Avenue, Pennsylvania, 18503 Yes Y \n", | |
"4 123 Dragons Road Y No \n", | |
"5 768 City Parkway Yes Yes \n", | |
"6 1209 South Street No No \n", | |
"7 98 Clue Drive N No \n", | |
"8 123 Middle Earth Yes NaN \n", | |
"9 25th Main Street, New York Yes No \n", | |
"10 612 Shire Lane, Shire Yes No \n", | |
"11 2394 Hogwarts Avenue Y NaN \n", | |
"12 2039 Main Street Yes N \n", | |
"13 343 City Parkway Yes No \n", | |
"14 214 HR Avenue N No \n", | |
"15 2395 Hogwarts Avenue No N \n", | |
"16 121 Paper Avenue, Pennsylvania Yes No \n", | |
"17 3498 Super Lane Y NaN \n", | |
"18 N/a N/a Yes \n", | |
"19 910 Tatooine Road, Tatooine Yes N \n", | |
"20 910 Tatooine Road, Tatooine Yes N \n", | |
"\n", | |
" Not_Useful_Column \n", | |
"0 True \n", | |
"1 False \n", | |
"2 True \n", | |
"3 True \n", | |
"4 True \n", | |
"5 True \n", | |
"6 False \n", | |
"7 False \n", | |
"8 False \n", | |
"9 True \n", | |
"10 True \n", | |
"11 True \n", | |
"12 False \n", | |
"13 False \n", | |
"14 False \n", | |
"15 False \n", | |
"16 False \n", | |
"17 True \n", | |
"18 True \n", | |
"19 True \n", | |
"20 True " | |
], | |
"text/html": [ | |
"\n", | |
"\n", | |
" <div id=\"df-ca295370-c3bd-40e7-9ae8-67c1d3e59bef\">\n", | |
" <div class=\"colab-df-container\">\n", | |
" <div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>CustomerID</th>\n", | |
" <th>First_Name</th>\n", | |
" <th>Last_Name</th>\n", | |
" <th>Phone_Number</th>\n", | |
" <th>Address</th>\n", | |
" <th>Paying Customer</th>\n", | |
" <th>Do_Not_Contact</th>\n", | |
" <th>Not_Useful_Column</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1001</td>\n", | |
" <td>Frodo</td>\n", | |
" <td>Baggins</td>\n", | |
" <td>123-545-5421</td>\n", | |
" <td>123 Shire Lane, Shire</td>\n", | |
" <td>Yes</td>\n", | |
" <td>No</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1002</td>\n", | |
" <td>Abed</td>\n", | |
" <td>Nadir</td>\n", | |
" <td>123/643/9775</td>\n", | |
" <td>93 West Main Street</td>\n", | |
" <td>No</td>\n", | |
" <td>Yes</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1003</td>\n", | |
" <td>Walter</td>\n", | |
" <td>/White</td>\n", | |
" <td>7066950392</td>\n", | |
" <td>298 Drugs Driveway</td>\n", | |
" <td>N</td>\n", | |
" <td>NaN</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1004</td>\n", | |
" <td>Dwight</td>\n", | |
" <td>Schrute</td>\n", | |
" <td>123-543-2345</td>\n", | |
" <td>980 Paper Avenue, Pennsylvania, 18503</td>\n", | |
" <td>Yes</td>\n", | |
" <td>Y</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1005</td>\n", | |
" <td>Jon</td>\n", | |
" <td>Snow</td>\n", | |
" <td>876|678|3469</td>\n", | |
" <td>123 Dragons Road</td>\n", | |
" <td>Y</td>\n", | |
" <td>No</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>1006</td>\n", | |
" <td>Ron</td>\n", | |
" <td>Swanson</td>\n", | |
" <td>304-762-2467</td>\n", | |
" <td>768 City Parkway</td>\n", | |
" <td>Yes</td>\n", | |
" <td>Yes</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>1007</td>\n", | |
" <td>Jeff</td>\n", | |
" <td>Winger</td>\n", | |
" <td>NaN</td>\n", | |
" <td>1209 South Street</td>\n", | |
" <td>No</td>\n", | |
" <td>No</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>1008</td>\n", | |
" <td>Sherlock</td>\n", | |
" <td>Holmes</td>\n", | |
" <td>876|678|3469</td>\n", | |
" <td>98 Clue Drive</td>\n", | |
" <td>N</td>\n", | |
" <td>No</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>1009</td>\n", | |
" <td>Gandalf</td>\n", | |
" <td>NaN</td>\n", | |
" <td>N/a</td>\n", | |
" <td>123 Middle Earth</td>\n", | |
" <td>Yes</td>\n", | |
" <td>NaN</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>1010</td>\n", | |
" <td>Peter</td>\n", | |
" <td>Parker</td>\n", | |
" <td>123-545-5421</td>\n", | |
" <td>25th Main Street, New York</td>\n", | |
" <td>Yes</td>\n", | |
" <td>No</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>1011</td>\n", | |
" <td>Samwise</td>\n", | |
" <td>Gamgee</td>\n", | |
" <td>NaN</td>\n", | |
" <td>612 Shire Lane, Shire</td>\n", | |
" <td>Yes</td>\n", | |
" <td>No</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11</th>\n", | |
" <td>1012</td>\n", | |
" <td>Harry</td>\n", | |
" <td>...Potter</td>\n", | |
" <td>7066950392</td>\n", | |
" <td>2394 Hogwarts Avenue</td>\n", | |
" <td>Y</td>\n", | |
" <td>NaN</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12</th>\n", | |
" <td>1013</td>\n", | |
" <td>Don</td>\n", | |
" <td>Draper</td>\n", | |
" <td>123-543-2345</td>\n", | |
" <td>2039 Main Street</td>\n", | |
" <td>Yes</td>\n", | |
" <td>N</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13</th>\n", | |
" <td>1014</td>\n", | |
" <td>Leslie</td>\n", | |
" <td>Knope</td>\n", | |
" <td>876|678|3469</td>\n", | |
" <td>343 City Parkway</td>\n", | |
" <td>Yes</td>\n", | |
" <td>No</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>14</th>\n", | |
" <td>1015</td>\n", | |
" <td>Toby</td>\n", | |
" <td>Flenderson_</td>\n", | |
" <td>304-762-2467</td>\n", | |
" <td>214 HR Avenue</td>\n", | |
" <td>N</td>\n", | |
" <td>No</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>15</th>\n", | |
" <td>1016</td>\n", | |
" <td>Ron</td>\n", | |
" <td>Weasley</td>\n", | |
" <td>123-545-5421</td>\n", | |
" <td>2395 Hogwarts Avenue</td>\n", | |
" <td>No</td>\n", | |
" <td>N</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>1017</td>\n", | |
" <td>Michael</td>\n", | |
" <td>Scott</td>\n", | |
" <td>123/643/9775</td>\n", | |
" <td>121 Paper Avenue, Pennsylvania</td>\n", | |
" <td>Yes</td>\n", | |
" <td>No</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>17</th>\n", | |
" <td>1018</td>\n", | |
" <td>Clark</td>\n", | |
" <td>Kent</td>\n", | |
" <td>7066950392</td>\n", | |
" <td>3498 Super Lane</td>\n", | |
" <td>Y</td>\n", | |
" <td>NaN</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>18</th>\n", | |
" <td>1019</td>\n", | |
" <td>Creed</td>\n", | |
" <td>Braton</td>\n", | |
" <td>N/a</td>\n", | |
" <td>N/a</td>\n", | |
" <td>N/a</td>\n", | |
" <td>Yes</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>19</th>\n", | |
" <td>1020</td>\n", | |
" <td>Anakin</td>\n", | |
" <td>Skywalker</td>\n", | |
" <td>876|678|3469</td>\n", | |
" <td>910 Tatooine Road, Tatooine</td>\n", | |
" <td>Yes</td>\n", | |
" <td>N</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>20</th>\n", | |
" <td>1020</td>\n", | |
" <td>Anakin</td>\n", | |
" <td>Skywalker</td>\n", | |
" <td>876|678|3469</td>\n", | |
" <td>910 Tatooine Road, Tatooine</td>\n", | |
" <td>Yes</td>\n", | |
" <td>N</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>\n", | |
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-ca295370-c3bd-40e7-9ae8-67c1d3e59bef')\"\n", | |
" title=\"Convert this dataframe to an interactive table.\"\n", | |
" style=\"display:none;\">\n", | |
"\n", | |
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", | |
" width=\"24px\">\n", | |
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n", | |
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n", | |
" </svg>\n", | |
" </button>\n", | |
"\n", | |
"\n", | |
"\n", | |
" <div id=\"df-ebc7abdb-dd56-42a0-b44b-f6adda2b9978\">\n", | |
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-ebc7abdb-dd56-42a0-b44b-f6adda2b9978')\"\n", | |
" title=\"Suggest charts.\"\n", | |
" style=\"display:none;\">\n", | |
"\n", | |
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", | |
" width=\"24px\">\n", | |
" <g>\n", | |
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", | |
" </g>\n", | |
"</svg>\n", | |
" </button>\n", | |
" </div>\n", | |
"\n", | |
"<style>\n", | |
" .colab-df-quickchart {\n", | |
" background-color: #E8F0FE;\n", | |
" border: none;\n", | |
" border-radius: 50%;\n", | |
" cursor: pointer;\n", | |
" display: none;\n", | |
" fill: #1967D2;\n", | |
" height: 32px;\n", | |
" padding: 0 0 0 0;\n", | |
" width: 32px;\n", | |
" }\n", | |
"\n", | |
" .colab-df-quickchart:hover {\n", | |
" background-color: #E2EBFA;\n", | |
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", | |
" fill: #174EA6;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-quickchart {\n", | |
" background-color: #3B4455;\n", | |
" fill: #D2E3FC;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-quickchart:hover {\n", | |
" background-color: #434B5C;\n", | |
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", | |
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", | |
" fill: #FFFFFF;\n", | |
" }\n", | |
"</style>\n", | |
"\n", | |
" <script>\n", | |
" async function quickchart(key) {\n", | |
" const containerElement = document.querySelector('#' + key);\n", | |
" const charts = await google.colab.kernel.invokeFunction(\n", | |
" 'suggestCharts', [key], {});\n", | |
" }\n", | |
" </script>\n", | |
"\n", | |
" <script>\n", | |
"\n", | |
"function displayQuickchartButton(domScope) {\n", | |
" let quickchartButtonEl =\n", | |
" domScope.querySelector('#df-ebc7abdb-dd56-42a0-b44b-f6adda2b9978 button.colab-df-quickchart');\n", | |
" quickchartButtonEl.style.display =\n", | |
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n", | |
"}\n", | |
"\n", | |
" displayQuickchartButton(document);\n", | |
" </script>\n", | |
" <style>\n", | |
" .colab-df-container {\n", | |
" display:flex;\n", | |
" flex-wrap:wrap;\n", | |
" gap: 12px;\n", | |
" }\n", | |
"\n", | |
" .colab-df-convert {\n", | |
" background-color: #E8F0FE;\n", | |
" border: none;\n", | |
" border-radius: 50%;\n", | |
" cursor: pointer;\n", | |
" display: none;\n", | |
" fill: #1967D2;\n", | |
" height: 32px;\n", | |
" padding: 0 0 0 0;\n", | |
" width: 32px;\n", | |
" }\n", | |
"\n", | |
" .colab-df-convert:hover {\n", | |
" background-color: #E2EBFA;\n", | |
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", | |
" fill: #174EA6;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-convert {\n", | |
" background-color: #3B4455;\n", | |
" fill: #D2E3FC;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-convert:hover {\n", | |
" background-color: #434B5C;\n", | |
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", | |
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", | |
" fill: #FFFFFF;\n", | |
" }\n", | |
" </style>\n", | |
"\n", | |
" <script>\n", | |
" const buttonEl =\n", | |
" document.querySelector('#df-ca295370-c3bd-40e7-9ae8-67c1d3e59bef button.colab-df-convert');\n", | |
" buttonEl.style.display =\n", | |
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n", | |
"\n", | |
" async function convertToInteractive(key) {\n", | |
" const element = document.querySelector('#df-ca295370-c3bd-40e7-9ae8-67c1d3e59bef');\n", | |
" const dataTable =\n", | |
" await google.colab.kernel.invokeFunction('convertToInteractive',\n", | |
" [key], {});\n", | |
" if (!dataTable) return;\n", | |
"\n", | |
" const docLinkHtml = 'Like what you see? Visit the ' +\n", | |
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", | |
" + ' to learn more about interactive tables.';\n", | |
" element.innerHTML = '';\n", | |
" dataTable['output_type'] = 'display_data';\n", | |
" await google.colab.output.renderOutput(dataTable, element);\n", | |
" const docLink = document.createElement('div');\n", | |
" docLink.innerHTML = docLinkHtml;\n", | |
" element.appendChild(docLink);\n", | |
" }\n", | |
" </script>\n", | |
" </div>\n", | |
" </div>\n" | |
] | |
}, | |
"metadata": {}, | |
"execution_count": 1 | |
} | |
], | |
"source": [ | |
"import io\n", | |
"import pandas as pd\n", | |
"\n", | |
"# df = pd.read_excel(r\"C:\\Users\\alexf\\OneDrive\\Documents\\Pandas Tutorial\\Customer Call List.xlsx\")\n", | |
"dfOriginal = pd.read_excel('https://raw.githubusercontent.com/AlexTheAnalyst/PandasYouTubeSeries/331621f89f679892d18af059ea9061f91321a934/Customer%20Call%20List.xlsx')\n", | |
"\n", | |
"dfOriginal" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"An important thing to remember when working with Pandas, especially when data sets may be large, is to drop rows and columns of data as early as possible. By dropping unneeded data, the transformations following that will be faster because they have less data to process.\n", | |
"\n", | |
"In this example of making a customer call list, it makes sense to drop all duplicate rows and rows of customers that don't have a phone number. Rather than removing rows that don't have a \"do not contact\" response, we remove all rows whose values don't explicitly start with \"N\" in that column. Finally, we can remove the \"do not conatct\" column because it's no longer useful (along with the other non-useful one).\n", | |
"\n", | |
"After most of the drops are done, we can continue with transformations, like phone number and address formatting, remove illegal characters from names, and cleaning up \"Y\"/\"N\" answers.\n", | |
"\n", | |
"> ### ⛔️ Avoid `inplace=true`\n", | |
"> The Pandas maintainers have warned that the `inplace` argument will be removed from a future version of the module. It's best to avoid using it now. It's being removed because…\n", | |
"> 1. It breaks method chaining because it always causes functions to return `None`.\n", | |
"> 1. It doesn't save memory or reduce processing time." | |
], | |
"metadata": { | |
"id": "Ug7pi3DaKi3r" | |
}, | |
"id": "Ug7pi3DaKi3r" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "8624a784", | |
"metadata": { | |
"id": "8624a784", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 407 | |
}, | |
"outputId": "1903880c-3b43-47a2-8622-529302634c6b" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
" CustomerID First_Name Last_Name Phone_Number Paying Customer \\\n", | |
"0 1001 Frodo Baggins 123-545-5421 Y \n", | |
"1 1005 Jon Snow 876-678-3469 Y \n", | |
"2 1008 Sherlock Holmes 876-678-3469 N \n", | |
"3 1010 Peter Parker 123-545-5421 Y \n", | |
"4 1013 Don Draper 123-543-2345 Y \n", | |
"5 1014 Leslie Knope 876-678-3469 Y \n", | |
"6 1015 Toby Flenderson 304-762-2467 N \n", | |
"7 1016 Ron Weasley 123-545-5421 N \n", | |
"8 1017 Michael Scott 123-643-9775 Y \n", | |
"9 1020 Anakin Skywalker 876-678-3469 Y \n", | |
"\n", | |
" Street_Address State Zip_Code \n", | |
"0 123 Shire Lane Shire \n", | |
"1 123 Dragons Road \n", | |
"2 98 Clue Drive \n", | |
"3 25th Main Street New York \n", | |
"4 2039 Main Street \n", | |
"5 343 City Parkway \n", | |
"6 214 HR Avenue \n", | |
"7 2395 Hogwarts Avenue \n", | |
"8 121 Paper Avenue Pennsylvania \n", | |
"9 910 Tatooine Road Tatooine " | |
], | |
"text/html": [ | |
"\n", | |
"\n", | |
" <div id=\"df-41820296-de63-4f26-809c-ffde3acd76e6\">\n", | |
" <div class=\"colab-df-container\">\n", | |
" <div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>CustomerID</th>\n", | |
" <th>First_Name</th>\n", | |
" <th>Last_Name</th>\n", | |
" <th>Phone_Number</th>\n", | |
" <th>Paying Customer</th>\n", | |
" <th>Street_Address</th>\n", | |
" <th>State</th>\n", | |
" <th>Zip_Code</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1001</td>\n", | |
" <td>Frodo</td>\n", | |
" <td>Baggins</td>\n", | |
" <td>123-545-5421</td>\n", | |
" <td>Y</td>\n", | |
" <td>123 Shire Lane</td>\n", | |
" <td>Shire</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1005</td>\n", | |
" <td>Jon</td>\n", | |
" <td>Snow</td>\n", | |
" <td>876-678-3469</td>\n", | |
" <td>Y</td>\n", | |
" <td>123 Dragons Road</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1008</td>\n", | |
" <td>Sherlock</td>\n", | |
" <td>Holmes</td>\n", | |
" <td>876-678-3469</td>\n", | |
" <td>N</td>\n", | |
" <td>98 Clue Drive</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1010</td>\n", | |
" <td>Peter</td>\n", | |
" <td>Parker</td>\n", | |
" <td>123-545-5421</td>\n", | |
" <td>Y</td>\n", | |
" <td>25th Main Street</td>\n", | |
" <td>New York</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1013</td>\n", | |
" <td>Don</td>\n", | |
" <td>Draper</td>\n", | |
" <td>123-543-2345</td>\n", | |
" <td>Y</td>\n", | |
" <td>2039 Main Street</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>1014</td>\n", | |
" <td>Leslie</td>\n", | |
" <td>Knope</td>\n", | |
" <td>876-678-3469</td>\n", | |
" <td>Y</td>\n", | |
" <td>343 City Parkway</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>1015</td>\n", | |
" <td>Toby</td>\n", | |
" <td>Flenderson</td>\n", | |
" <td>304-762-2467</td>\n", | |
" <td>N</td>\n", | |
" <td>214 HR Avenue</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>1016</td>\n", | |
" <td>Ron</td>\n", | |
" <td>Weasley</td>\n", | |
" <td>123-545-5421</td>\n", | |
" <td>N</td>\n", | |
" <td>2395 Hogwarts Avenue</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>1017</td>\n", | |
" <td>Michael</td>\n", | |
" <td>Scott</td>\n", | |
" <td>123-643-9775</td>\n", | |
" <td>Y</td>\n", | |
" <td>121 Paper Avenue</td>\n", | |
" <td>Pennsylvania</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>1020</td>\n", | |
" <td>Anakin</td>\n", | |
" <td>Skywalker</td>\n", | |
" <td>876-678-3469</td>\n", | |
" <td>Y</td>\n", | |
" <td>910 Tatooine Road</td>\n", | |
" <td>Tatooine</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>\n", | |
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-41820296-de63-4f26-809c-ffde3acd76e6')\"\n", | |
" title=\"Convert this dataframe to an interactive table.\"\n", | |
" style=\"display:none;\">\n", | |
"\n", | |
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", | |
" width=\"24px\">\n", | |
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n", | |
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n", | |
" </svg>\n", | |
" </button>\n", | |
"\n", | |
"\n", | |
"\n", | |
" <div id=\"df-08d4a457-90ef-4fdd-a608-91500bf26cf1\">\n", | |
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-08d4a457-90ef-4fdd-a608-91500bf26cf1')\"\n", | |
" title=\"Suggest charts.\"\n", | |
" style=\"display:none;\">\n", | |
"\n", | |
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", | |
" width=\"24px\">\n", | |
" <g>\n", | |
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", | |
" </g>\n", | |
"</svg>\n", | |
" </button>\n", | |
" </div>\n", | |
"\n", | |
"<style>\n", | |
" .colab-df-quickchart {\n", | |
" background-color: #E8F0FE;\n", | |
" border: none;\n", | |
" border-radius: 50%;\n", | |
" cursor: pointer;\n", | |
" display: none;\n", | |
" fill: #1967D2;\n", | |
" height: 32px;\n", | |
" padding: 0 0 0 0;\n", | |
" width: 32px;\n", | |
" }\n", | |
"\n", | |
" .colab-df-quickchart:hover {\n", | |
" background-color: #E2EBFA;\n", | |
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", | |
" fill: #174EA6;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-quickchart {\n", | |
" background-color: #3B4455;\n", | |
" fill: #D2E3FC;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-quickchart:hover {\n", | |
" background-color: #434B5C;\n", | |
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", | |
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", | |
" fill: #FFFFFF;\n", | |
" }\n", | |
"</style>\n", | |
"\n", | |
" <script>\n", | |
" async function quickchart(key) {\n", | |
" const containerElement = document.querySelector('#' + key);\n", | |
" const charts = await google.colab.kernel.invokeFunction(\n", | |
" 'suggestCharts', [key], {});\n", | |
" }\n", | |
" </script>\n", | |
"\n", | |
" <script>\n", | |
"\n", | |
"function displayQuickchartButton(domScope) {\n", | |
" let quickchartButtonEl =\n", | |
" domScope.querySelector('#df-08d4a457-90ef-4fdd-a608-91500bf26cf1 button.colab-df-quickchart');\n", | |
" quickchartButtonEl.style.display =\n", | |
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n", | |
"}\n", | |
"\n", | |
" displayQuickchartButton(document);\n", | |
" </script>\n", | |
" <style>\n", | |
" .colab-df-container {\n", | |
" display:flex;\n", | |
" flex-wrap:wrap;\n", | |
" gap: 12px;\n", | |
" }\n", | |
"\n", | |
" .colab-df-convert {\n", | |
" background-color: #E8F0FE;\n", | |
" border: none;\n", | |
" border-radius: 50%;\n", | |
" cursor: pointer;\n", | |
" display: none;\n", | |
" fill: #1967D2;\n", | |
" height: 32px;\n", | |
" padding: 0 0 0 0;\n", | |
" width: 32px;\n", | |
" }\n", | |
"\n", | |
" .colab-df-convert:hover {\n", | |
" background-color: #E2EBFA;\n", | |
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", | |
" fill: #174EA6;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-convert {\n", | |
" background-color: #3B4455;\n", | |
" fill: #D2E3FC;\n", | |
" }\n", | |
"\n", | |
" [theme=dark] .colab-df-convert:hover {\n", | |
" background-color: #434B5C;\n", | |
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", | |
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", | |
" fill: #FFFFFF;\n", | |
" }\n", | |
" </style>\n", | |
"\n", | |
" <script>\n", | |
" const buttonEl =\n", | |
" document.querySelector('#df-41820296-de63-4f26-809c-ffde3acd76e6 button.colab-df-convert');\n", | |
" buttonEl.style.display =\n", | |
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n", | |
"\n", | |
" async function convertToInteractive(key) {\n", | |
" const element = document.querySelector('#df-41820296-de63-4f26-809c-ffde3acd76e6');\n", | |
" const dataTable =\n", | |
" await google.colab.kernel.invokeFunction('convertToInteractive',\n", | |
" [key], {});\n", | |
" if (!dataTable) return;\n", | |
"\n", | |
" const docLinkHtml = 'Like what you see? Visit the ' +\n", | |
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", | |
" + ' to learn more about interactive tables.';\n", | |
" element.innerHTML = '';\n", | |
" dataTable['output_type'] = 'display_data';\n", | |
" await google.colab.output.renderOutput(dataTable, element);\n", | |
" const docLink = document.createElement('div');\n", | |
" docLink.innerHTML = docLinkHtml;\n", | |
" element.appendChild(docLink);\n", | |
" }\n", | |
" </script>\n", | |
" </div>\n", | |
" </div>\n" | |
] | |
}, | |
"metadata": {}, | |
"execution_count": 2 | |
} | |
], | |
"source": [ | |
"df = dfOriginal\n", | |
"\n", | |
"### Drops\n", | |
"\n", | |
"df = df.drop_duplicates()\\\n", | |
" .drop(df[df['Phone_Number'].isna()].index)\\\n", | |
" .drop(df[\n", | |
" df['Do_Not_Contact'].str[0].str.upper() != 'N'\n", | |
" ].index)\\\n", | |
" .drop(columns = ['Do_Not_Contact', 'Not_Useful_Column'])\\\n", | |
" .reset_index(drop=True)\n", | |
"\n", | |
"### Transformations\n", | |
"\n", | |
"df['Phone_Number'] = df['Phone_Number'].astype(str)\\\n", | |
" .replace(r'\\D', '', regex=True)\\\n", | |
" .apply(lambda x: f'{x[:3]}-{x[3:6]}-{x[6:]}'.rstrip('-'))\n", | |
"\n", | |
"# It would be wise to decide which characters are allowed and apply the\n", | |
"# same rules to the first name, address, etc.\n", | |
"df['Last_Name'] = df['Last_Name'].str.replace(r'[\\W_]', '', regex=True)\n", | |
"\n", | |
"# The original notebook doesn't show it, but somewhere along the line, the\n", | |
"# value of \"Paying Customer\" became \"Y\" or \"N\"\n", | |
"df['Paying Customer'] = df['Paying Customer'].str.upper()\\\n", | |
" .apply(lambda c: 'Y' if c[0] == 'Y' else 'N')\n", | |
"\n", | |
"# Some original data rows had postal codes, but they're dropped by this point.\n", | |
"# So, add trailing commas to simulate empty postal codes.\n", | |
"df[['Street_Address', 'State', 'Zip_Code']] = \\\n", | |
" df['Address'].apply(lambda a: a + ',').str\\\n", | |
" .split(',', n=2, expand=True)\\\n", | |
" .fillna('')\n", | |
"\n", | |
"# Remove the original address column which has been replaced by new columns.\n", | |
"df = df.drop(columns=['Address'])\n", | |
"\n", | |
"df" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.12" | |
}, | |
"colab": { | |
"provenance": [], | |
"name": "Copy of Pandas 101 - Data Cleaning in Pandas.ipynb", | |
"include_colab_link": true | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment