diff --git a/_notebook/exercice_titanic.ipynb b/_notebook/exercice_titanic.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..a9fd54f4d1d236fe3062701703fa2ee0b8f55eb5 --- /dev/null +++ b/_notebook/exercice_titanic.ipynb @@ -0,0 +1,465 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Titanic notebook\n", + "\n", + "The purpose of this notebook is to study the titanic dataset and select relevant features in order to predict whether someone survived the shipwreck.\n", + "\n", + "This notebook will use functions from the _titanic module to preprocess the data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Module and data import" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Adding relative path for imports\n", + "import os\n", + "import sys\n", + "module_path = os.path.abspath(os.path.join('..'))\n", + "if module_path not in sys.path:\n", + " sys.path.append(module_path)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import _titanic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " ### First look at the data" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "train = pd.read_csv('../_data/titanic_train.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr 
style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>PassengerId</th>\n", + " <th>Survived</th>\n", + " <th>Pclass</th>\n", + " <th>Name</th>\n", + " <th>Sex</th>\n", + " <th>Age</th>\n", + " <th>SibSp</th>\n", + " <th>Parch</th>\n", + " <th>Ticket</th>\n", + " <th>Fare</th>\n", + " <th>Cabin</th>\n", + " <th>Embarked</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>Braund, Mr. Owen Harris</td>\n", + " <td>male</td>\n", + " <td>22.0</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>A/5 21171</td>\n", + " <td>7.2500</td>\n", + " <td>NaN</td>\n", + " <td>S</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n", + " <td>female</td>\n", + " <td>38.0</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>PC 17599</td>\n", + " <td>71.2833</td>\n", + " <td>C85</td>\n", + " <td>C</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>3</td>\n", + " <td>1</td>\n", + " <td>3</td>\n", + " <td>Heikkinen, Miss. Laina</td>\n", + " <td>female</td>\n", + " <td>26.0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>STON/O2. 3101282</td>\n", + " <td>7.9250</td>\n", + " <td>NaN</td>\n", + " <td>S</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", + " <td>female</td>\n", + " <td>35.0</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>113803</td>\n", + " <td>53.1000</td>\n", + " <td>C123</td>\n", + " <td>S</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>5</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>Allen, Mr. 
William Henry</td>\n", + " <td>male</td>\n", + " <td>35.0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>373450</td>\n", + " <td>8.0500</td>\n", + " <td>NaN</td>\n", + " <td>S</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "0 1 0 3 \n", + "1 2 1 1 \n", + "2 3 1 3 \n", + "3 4 1 1 \n", + "4 5 0 3 \n", + "\n", + " Name Sex Age SibSp \\\n", + "0 Braund, Mr. Owen Harris male 22.0 1 \n", + "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", + "2 Heikkinen, Miss. Laina female 26.0 0 \n", + "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", + "4 Allen, Mr. William Henry male 35.0 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "0 0 A/5 21171 7.2500 NaN S \n", + "1 0 PC 17599 71.2833 C85 C \n", + "2 0 STON/O2. 3101282 7.9250 NaN S \n", + "3 0 113803 53.1000 C123 S \n", + "4 0 373450 8.0500 NaN S " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see above, the dataset has 11 features and the target feature \"Survived\". 
The 11 features are as follows:\n", + " - `PassengerId`: an ID ranging from one to the number of passengers\n", + " - `Pclass`: the ticket class\n", + " - `Name`: the name of the passenger\n", + " - `Sex`: the sex of the passenger\n", + " - `Age`: the age in years\n", + " - `SibSp`: the number of siblings/spouses aboard the Titanic\n", + " - `Parch`: the number of parents/children aboard the Titanic\n", + " - `Ticket`: the ticket number\n", + " - `Fare`: the price of the ticket\n", + " - `Cabin`: the cabin number\n", + " - `Embarked`: the port of embarkation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating Models" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can make a first prediction using three of the features: `SibSp`, `Parch` and `Fare`. For the first two, we can assume that the more family members a passenger had on the ship, the more likely they were to survive. For the last, we can assume that the more expensive the ticket, the wealthier the passenger and the higher their probability of survival." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "model1_cols = ['SibSp', 'Parch', 'Fare']\n", + "X, y = _titanic.parse_model(train.copy(), name_Y='Survived', use_columns=model1_cols)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + " 0 0.65 0.94 0.77 157\n", + " 1 0.76 0.28 0.41 111\n", + "\n", + " micro avg 0.66 0.66 0.66 268\n", + " macro avg 0.70 0.61 0.59 268\n", + "weighted avg 0.69 0.66 0.62 268\n", + "\n", + "score : 0.664179104477612\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/lib/python3/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", + " FutureWarning)\n" + ] + } + ], + "source": [ + "_titanic.logmodel_prediction(X, y, 0.3, 42)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With a score of about 0.66 and a recall of only 0.28 for the survivors, this prediction is far from accurate, but it gives us a first model to which we can add features." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To choose other features, one method is to use a correlation matrix to see which features are correlated with survival." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<matplotlib.axes._subplots.AxesSubplot at 0x7f2f24454588>" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 432x288 with 2 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import seaborn as sn\n", + "sn.heatmap(train.corr(), annot=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Among the features not yet selected, the two with the highest absolute correlation values are `Pclass` and `Age`.\n", + "\n", + "Let us study these two features and the impact they have on our current model." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "dead = train[train['Survived'] == 0]\n", + "survived = train[train['Survived'] == 1]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "_titanic.plot_hist('Pclass', 'Dead', 'Survived', dead, survived)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As the previous plot shows, the `Pclass` feature is strongly associated with survival: passengers in first class were more likely to survive than passengers in third class.\n", + "\n", + "Thus, we add `Pclass` to our model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "model2 = train[model1_cols+['Survived', 'Pclass']]\n", + "model2_cols = model1_cols + ['Pclass']\n", + "X2, y2 = _titanic.parse_model(model2, name_Y='Survived', use_columns=model2_cols)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + " 0 0.66 0.85 0.75 154\n", + " 1 0.68 0.42 0.52 114\n", + "\n", + " micro avg 0.67 0.67 0.67 268\n", + " macro avg 0.67 0.64 0.63 268\n", + "weighted avg 0.67 0.67 0.65 268\n", + "\n", + "score : 0.667910447761194\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/lib/python3/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", + " FutureWarning)\n" + ] + } + ], + "source": [ + "_titanic.logmodel_prediction(X2, y2, 0.3, 101)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Adding the `Pclass` feature only slightly increases the score, from about 0.664 to 0.668, while the recall for survivors rises from 0.28 to 0.42. Note that this run passes 101 rather than 42 as the last argument of `logmodel_prediction` and that the class supports differ (154/114 versus 157/111), so the two models are not evaluated on the same test split." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
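Note on the two classification reports in this notebook: the figures can be sanity-checked with a few lines of plain Python. The sketch below is not part of `_titanic`; the helper names `f1` and `correct_predictions` are hypothetical, and it assumes that the printed `score` is the accuracy on the 268-row test split, which is consistent with the `micro avg` rows but is not confirmed by the notebook itself.

```python
# Figures copied from the two classification reports in the notebook.
TEST_SIZE = 268  # total support shown in both reports


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def correct_predictions(score: float, n_samples: int) -> int:
    """Number of correct predictions implied by an accuracy score.

    Assumes `score` is plain accuracy (an assumption, see the note above).
    """
    return round(score * n_samples)


# Model 1: SibSp, Parch, Fare (class 1 = survived)
f1_model1_survived = f1(0.76, 0.28)
correct_model1 = correct_predictions(0.664179104477612, TEST_SIZE)

# Model 2: SibSp, Parch, Fare, Pclass (class 1 = survived)
f1_model2_survived = f1(0.68, 0.42)
correct_model2 = correct_predictions(0.667910447761194, TEST_SIZE)

print(f"F1, survived class: {f1_model1_survived:.2f} -> {f1_model2_survived:.2f}")
print(f"Correct predictions: {correct_model1} -> {correct_model2} of {TEST_SIZE}")
```

Under that assumption, the two scores differ by a single correctly classified passenger out of 268, so the change in score alone says little; the per-class recall and F1 columns are the more informative comparison.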