Import Statements

In [1]:
import pandas as pd

from helper_functions import prepare_data, replace_strings

from pprint import pprint
from IPython.display import Image

Data Preparation

In [2]:
# load data
df_train = pd.read_csv("../../data/train.csv", index_col="PassengerId")
df_test = pd.read_csv("../../data/test.csv", index_col="PassengerId")
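# note: the "squeeze" parameter of read_csv was removed in pandas 2.0, so we squeeze the single column afterwards
test_labels = pd.read_csv("../../data/test_labels.csv", index_col="PassengerId").squeeze("columns")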

# prepare data
df_train = prepare_data(df_train)
df_test = prepare_data(df_test, train_set=False)

# handle missing values in training data
embarked_mode = df_train.Embarked.mode()[0]
df_train["Embarked"] = df_train["Embarked"].fillna(embarked_mode)

df_train.head()
Out[2]:
Sex Pclass Age_Group Embarked SibSp ParCh Survived
PassengerId
1 male 3 Adult S 1 0 0
2 female 1 Adult C 1 0 1
3 female 3 Adult S 0 0 1
4 female 1 Adult S 1 0 1
5 male 3 Adult S 0 0 0

Naive Bayes from Scratch

In [3]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[3]:

Step 1 of the Algorithm

In [4]:
example_table = {
    
    "Sex": {"female": [0.15, 0.68],
            "male": [0.85, 0.32]},
    
    "Pclass": {1: [0.15, 0.40],
               2: [0.18, 0.25],
               3: [0.68, 0.35]},
    
    "class_names": [0, 1],
    "class_counts": [549, 342]
}
In [5]:
def create_table(df, label_column):

            
    return table

In the previous post, we wrote our import statements, prepared the data and saw how the Naive Bayes algorithm works in general. Then, we started implementing the first step of the algorithm.

To do that, we first thought about how to represent the look-up table in code and we decided to use a nested dictionary (see "example_table" in cell 4). After that, we created the skeleton of the function that is going to implement the first step of the algorithm.

So now, let's build the logic for this function. To have something to work with, let's first create two variables called "df" and "label_column".

In [6]:
df = df_train
label_column = "Survived"

So, we would pass in our training data ("df_train") for the "df" parameter. And we would set the "label_column" equal to "Survived" since that is the label of the Titanic Data Set.

So now, let's actually start building the function. The first thing that we do is to initialize the look-up table that the function is eventually going to return. It starts out as an empty dictionary.

In [7]:
table = {}

We now want to populate this dictionary with the respective pieces of information. And we are going to start with the information about the label of the data set. So, let's create a comment for that.

In [8]:
table = {}

# determine values for the label

And, as seen in the previous post, what we want to know about the label are the names of the different classes and how often they appear. To get those, we obviously first have to access the label column.

In [9]:
table = {}

# determine values for the label
df[label_column]
Out[9]:
PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

And now, to get the respective pieces of information, we can make use of the "value_counts" method.

In [10]:
table = {}

# determine values for the label
df[label_column].value_counts()
Out[10]:
0    549
1    342
Name: Survived, dtype: int64

So, the class names are "0" and "1". And there are 549 passengers that didn't survive and 342 that did survive. And, as we have seen before in the image or in the "example_table", those are the values that we now want to store into our empty dictionary ("table") that we have created at the beginning of the cell.

However, before we actually do that, I want to make you aware of some behavior of the "value_counts" method. Namely, it orders the unique values of a pandas Series based on how often they occur. So, it lists the value that appears most often first.

In this case, since there are more non-survivors in the data, the "0" is listed first. However, if there were more survivors, then the "1" would be listed first. We can see that if we, for example, only look at female passengers.

In [11]:
table = {}

# determine values for the label
df[df.Sex == "female"][label_column].value_counts()
Out[11]:
1    233
0     81
Name: Survived, dtype: int64

Here, there are more survivors and therefore the "1" is listed first. And this is a problem in terms of the look-up table that we want to create.

In [12]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[12]:
In [13]:
pprint(example_table, width=40)
{'Pclass': {1: [0.15, 0.4],
            2: [0.18, 0.25],
            3: [0.68, 0.35]},
 'Sex': {'female': [0.15, 0.68],
         'male': [0.85, 0.32]},
 'class_counts': [549, 342],
 'class_names': [0, 1]}

Namely, as mentioned before, the order of the elements in all the lists of the look-up table is important. So, in our case, the first element of each list should refer to class "0". And the second element of each list should refer to class "1". For example, out of all the 549 non-survivors, 15% traveled in the 1st class. And out of all the 342 survivors, 40% traveled in the 1st class.
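To make the role of the element order concrete, here is a minimal sketch of how a probability could be read out of the "example_table". The helper "lookup_probability" is a hypothetical name that is not part of the original code. It simply uses the position of a class in "class_names" to pick the matching element from a feature's list.

```python
# Copy of the hand-written "example_table" from cell 4.
example_table = {
    "Sex": {"female": [0.15, 0.68], "male": [0.85, 0.32]},
    "Pclass": {1: [0.15, 0.40], 2: [0.18, 0.25], 3: [0.68, 0.35]},
    "class_names": [0, 1],
    "class_counts": [549, 342],
}


def lookup_probability(table, feature, value, class_name):
    """Hypothetical helper: return P(feature == value | class_name)."""
    # The position of the class in "class_names" decides which list element is read.
    position = list(table["class_names"]).index(class_name)
    return table[feature][value][position]


p_first_class_given_died = lookup_probability(example_table, "Pclass", 1, 0)
p_first_class_given_survived = lookup_probability(example_table, "Pclass", 1, 1)
print(p_first_class_given_died, p_first_class_given_survived)
```

If the two entries of "class_names" were swapped without also swapping the elements of every list, the same look-up would silently return the probability for the wrong class. That is why a fixed ordering matters.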

So, we need to make sure that the "class_names" are ordered in a particular way, instead of just listing them based on which class appears most often. And therefore, we are simply going to order them numerically (or alphabetically if the class names are strings).

In order to do that, we can make use of the "sort_index" method.

In [14]:
table = {}

# determine values for the label
df[df.Sex == "female"][label_column].value_counts().sort_index()
Out[14]:
0     81
1    233
Name: Survived, dtype: int64

Now, class "0" is listed first even though there are actually more passengers belonging to class "1".

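Side note: Setting the "sort" parameter of the "value_counts" method equal to "False" happens to produce the same output here.
In [15]:
table = {}

# determine values for the label
df[df.Sex == "female"][label_column].value_counts(sort=False)
Out[15]:
0     81
1    233
Name: Survived, dtype: int64
However, "sort=False" only switches off the sorting by frequency. It doesn't guarantee that the classes are ordered by name, and the resulting order can depend on the pandas version. So, the "sort_index" method is the safer choice and it also makes it clearer what the code actually does.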
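To compare the different orderings side by side, here is a small sketch on a made-up label column (not the Titanic data). Only the result of "sort_index" is tied to the class names. The order returned with "sort=False" is deliberately not asserted here, since it may differ between pandas versions.

```python
import pandas as pd

# Toy label column (an assumption for illustration): class 1 appears first and most often.
labels = pd.Series([1, 1, 0, 1, 0], name="Survived")

by_frequency = labels.value_counts()                # most frequent class first
unsorted = labels.value_counts(sort=False)          # no sorting by frequency, order not guaranteed
by_class_name = labels.value_counts().sort_index()  # ordered by class name

print(list(by_frequency.index))
print(list(unsorted.index))   # may differ between pandas versions
print(list(by_class_name.index))
```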

Okay, so that is the behavior of the "value_counts" method that I wanted to point out. So now, let's consider all the passengers again and not just the female passengers.

In [16]:
table = {}

# determine values for the label
df[label_column].value_counts().sort_index()
Out[16]:
0    549
1    342
Name: Survived, dtype: int64

And now, let's store the information about the names of the different classes and how often they appear in the "table", i.e. the dictionary that we have created at the beginning of the cell. To do that, let's first store the output of the "value_counts" method in a variable.

In [17]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
value_counts
Out[17]:
0    549
1    342
Name: Survived, dtype: int64

Now, we can access the names of the classes by using the "index" attribute.

In [18]:
value_counts.index
Out[18]:
Int64Index([0, 1], dtype='int64')

And we can access the actual counts by using the "values" attribute.

In [19]:
value_counts.values
Out[19]:
array([549, 342], dtype=int64)

So now, let's finally store the class names and counts into the "table".

In [20]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index
table["class_counts"] = value_counts.values

So, let's have a look at the "table".

In [21]:
table
Out[21]:
{'class_names': Int64Index([0, 1], dtype='int64'),
 'class_counts': array([549, 342], dtype=int64)}

And, as you can see, we have stored the respective pieces of information under the respective keys. But what you can also see is that we have stored a so-called "Int64Index" object under "class_names", which is a specific pandas object.

In [22]:
type(table["class_names"])
Out[22]:
pandas.core.indexes.numeric.Int64Index

And actually this is not really a problem. So, we could leave it like that. However, to make everything look more uniform, we are going to transform it into a NumPy array (just like the array under "class_counts").

In order to do that, we are going to use the "to_numpy" method after we have called the "index" attribute.

In [23]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

If we now look at the table, then we can see that we have now stored only NumPy arrays.

In [24]:
table
Out[24]:
{'class_names': array([0, 1], dtype=int64),
 'class_counts': array([549, 342], dtype=int64)}

And with that, we have now stored all the information that we need with regards to the label of the data set. So now, let's start working on the code that stores the necessary information about the features, namely the respective probabilities.
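As a recap of the label part, the steps so far could be bundled as in the following sketch. The function name "summarize_label" and the toy data frame are assumptions for illustration and not part of the original code. The sketch also uses "to_numpy" for the counts, which behaves like the "values" attribute here.

```python
import numpy as np
import pandas as pd


def summarize_label(df, label_column):
    """Hypothetical helper: class names and class counts, ordered by class name."""
    value_counts = df[label_column].value_counts().sort_index()
    return {
        "class_names": value_counts.index.to_numpy(),
        "class_counts": value_counts.to_numpy(),
    }


# Toy data (not the Titanic data): class 1 is more frequent than class 0.
toy_df = pd.DataFrame({
    "Sex": ["female", "male", "female", "female", "male"],
    "Survived": [1, 0, 1, 1, 0],
})

summary = summarize_label(toy_df, "Survived")
print(summary)
```

Even though class "1" is the more frequent class in the toy data, class "0" comes first because of the "sort_index" call.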

In [25]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[25]:
In [26]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features

So here, for each feature, we want to know how the values of that feature are distributed. And we don't just want to know that for the data set as a whole, but separately for each class.

So, the first thing that we need to do is to loop over all the features of the data set. And we can do that by using the "columns" attribute of our data frame "df".

In [27]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
df.columns
Out[27]:
Index(['Sex', 'Pclass', 'Age_Group', 'Embarked', 'SibSp', 'ParCh', 'Survived'], dtype='object')

This returns a list-like object that contains the names of all the columns in our data set. However, this obviously also includes the label ("Survived"). So, we need to drop that.

In [28]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
df.drop(label_column, axis=1).columns
Out[28]:
Index(['Sex', 'Pclass', 'Age_Group', 'Embarked', 'SibSp', 'ParCh'], dtype='object')

Now, we can loop over this "Index" object in order to access each feature of the data set.

In [29]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
for feature in df.drop(label_column, axis=1).columns:
    print(feature)
Sex
Pclass
Age_Group
Embarked
SibSp
ParCh
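As a side note, there are a few equivalent ways to get the feature names without the label. The following sketch compares them on a tiny made-up data frame (the data frame is an assumption for illustration). Which one to use is mostly a matter of taste.

```python
import pandas as pd

# Toy data frame (not the Titanic data) with two features and a label.
toy_df = pd.DataFrame({
    "Sex": ["male", "female"],
    "Pclass": [3, 1],
    "Survived": [0, 1],
})
label_column = "Survived"

# Variant used in the post: drop the label column, then read the column names.
features_via_drop = list(toy_df.drop(label_column, axis=1).columns)

# Alternative: drop the label directly from the Index object.
features_via_index = list(toy_df.columns.drop(label_column))

# Alternative: filter the column names with a list comprehension.
features_via_filter = [column for column in toy_df.columns if column != label_column]

print(features_via_drop)
```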

So now, let's write the code that is going to be executed within each iteration of the for-loop. To do that, let's actually comment out the for-loop statement for now. And instead of that, let's create a variable called "feature".

In [30]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"

So, we are going to pretend that we are in the iteration of the for-loop where we are currently looking at the feature "Sex". This way, we actually have something to work with and we don't have to use print-statements all the time within the for-loop.

Okay, so now let's have a look again at the "example_table" to see what we want to do for each feature.

In [31]:
pprint(example_table, width=40)
{'Pclass': {1: [0.15, 0.4],
            2: [0.18, 0.25],
            3: [0.68, 0.35]},
 'Sex': {'female': [0.15, 0.68],
         'male': [0.85, 0.32]},
 'class_counts': [549, 342],
 'class_names': [0, 1]}

Namely, for each feature we want to create a dictionary. And the keys of the dictionary are the different values of that feature. And the values of the dictionary are lists containing the respective probabilities.

So, the first thing that we need to do within the for-loop, is to create an empty dictionary which we then want to populate.

In [32]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

So, let's have a look at the "table".

In [33]:
table
Out[33]:
{'class_names': array([0, 1], dtype=int64),
 'class_counts': array([549, 342], dtype=int64),
 'Sex': {}}

And, as you can see, it now includes the feature "Sex" and there we have an empty dictionary. So now, let's store the respective probabilities into that dictionary.

To do that, we need to know how often the values "male" and "female" of the feature "Sex" appear. So, we need to access the "Sex" column of the data frame "df".

In [34]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df[feature]
Out[34]:
PassengerId
1        male
2      female
3      female
4      female
5        male
        ...  
887      male
888    female
889    female
890      male
891      male
Name: Sex, Length: 891, dtype: object

And now, in order to determine how often the respective values appear, we can again make use of the "value_counts" method.

In [35]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df[feature].value_counts()
Out[35]:
male      577
female    314
Name: Sex, dtype: int64

So, there are 577 male passengers and 314 female passengers in the data set. However, these are just the counts for the whole data set. What we need instead are the counts grouped by the different classes "0" and "1". To get those, we can use the "groupby" method on "df".

In [36]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df.groupby(label_column)[feature].value_counts()
Out[36]:
Survived  Sex   
0         male      468
          female     81
1         female    233
          male      109
Name: Sex, dtype: int64

Now, we have the male and female counts grouped by the different classes. But, as you know, for the Naive Bayes algorithm we actually need the respective probabilities. And we can get those by setting the "normalize" parameter of the "value_counts" method equal to "True".

In [37]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df.groupby(label_column)[feature].value_counts(normalize=True)
Out[37]:
Survived  Sex   
0         male      0.852459
          female    0.147541
1         female    0.681287
          male      0.318713
Name: Sex, dtype: float64

Those are the probabilities that we can also see in the image from before (and, in rounded form, in the "example_table").
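As a quick sanity check of what "normalize=True" does in combination with "groupby", the same numbers can be reproduced by hand: each count is divided by the total of its own class, not by the size of the whole data set. The sketch below only uses the counts shown in the outputs above. The variable names are chosen for illustration.

```python
# Counts taken from Out[36] and class totals taken from Out[16].
counts = {
    0: {"male": 468, "female": 81},
    1: {"female": 233, "male": 109},
}
class_counts = {0: 549, 1: 342}

# Divide each count by the number of passengers in its own class.
probabilities = {
    class_name: {
        value: count / class_counts[class_name]
        for value, count in value_counts.items()
    }
    for class_name, value_counts in counts.items()
}

for class_name, values in probabilities.items():
    for value, probability in values.items():
        print(class_name, value, round(probability, 6))
```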

In [38]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[38]:
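To connect this output to the structure of the "example_table", here is one possible way to reshape the grouped probabilities into a dictionary with one list per feature value. This is only a sketch on made-up toy data and not necessarily how the function will end up being implemented. The names "toy_df", "wide" and "feature_table" are assumptions for illustration.

```python
import pandas as pd

# Toy data (not the Titanic data) in which every value occurs in every class.
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})
label_column = "Survived"
feature = "Sex"

probabilities = toy_df.groupby(label_column)[feature].value_counts(normalize=True)

# Move the class level (level 0 of the index) into the columns, so that
# each row belongs to one feature value and each column to one class.
wide = probabilities.unstack(level=0).sort_index(axis=1)

# One list per feature value, ordered by class name.
# Caveat: a value that never occurs in a class would show up as NaN here.
feature_table = {value: row.tolist() for value, row in wide.iterrows()}
print(feature_table)
```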