Import Statements

In [1]:
import pandas as pd

from helper_functions import prepare_data, replace_strings

from pprint import pprint
from IPython.display import Image

Data Preparation

In [2]:
# load data
df_train = pd.read_csv("../../data/train.csv", index_col="PassengerId")
df_test = pd.read_csv("../../data/test.csv", index_col="PassengerId")
test_labels = pd.read_csv("../../data/test_labels.csv", index_col="PassengerId", squeeze=True)

# prepare data
df_train = prepare_data(df_train)
df_test = prepare_data(df_test, train_set=False)

# handle missing values in training data
embarked_mode = df_train.Embarked.mode()[0]
df_train["Embarked"].fillna(embarked_mode, inplace=True)

df_train.head()
Out[2]:
                Sex  Pclass Age_Group Embarked  SibSp  ParCh  Survived
PassengerId
1              male       3     Adult        S      1      0         0
2            female       1     Adult        C      1      0         1
3            female       3     Adult        S      0      0         1
4            female       1     Adult        S      1      0         1
5              male       3     Adult        S      0      0         0

Naive Bayes from Scratch

In [3]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[3]:

Step 1 of the Algorithm

In [4]:
example_table = {
    
    "Sex": {"female": [0.15, 0.68],
            "male": [0.85, 0.32]},
    
    "Pclass": {1: [0.15, 0.40],
               2: [0.18, 0.25],
               3: [0.68, 0.35]},
    
    "class_names": [0, 1],
    "class_counts": [549, 342]
}
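
To make the structure of this nested dictionary concrete, this is how we would read single values out of it (just an illustration, not a cell from the notebook):

example_table["Sex"]["female"][1]  # P(Sex="female" | Survived=1) -> 0.68
example_table["class_counts"][0]   # number of non-survivors -> 549

Note that the position within each probability list corresponds to the position of the respective class in "class_names".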
In [5]:
def create_table(df, label_column):
    table = {}

    # logic to populate the table goes here

    return table

In the previous post, we wrote our import statements, prepared the data and saw how the Naive Bayes algorithm generally works. And then, we started implementing the first step of the algorithm.

To that end, we first thought about how to represent the look-up table in code, and we decided to use a nested dictionary (see "example_table" in cell 4). After that, we created the skeleton of the function that is going to implement the first step of the algorithm.

So now, let's build the logic for this function. First, let's create two variables called "df" and "label_column" so that we have something to work with.

In [6]:
df = df_train
label_column = "Survived"

So, we would pass in our training data ("df_train") for the "df" parameter. And we would set the "label_column" equal to "Survived" since that is the label of the Titanic Data Set.

So now, let's actually start building the function. The first thing we do is initialize the look-up table that the function is eventually going to return. It starts out as an empty dictionary.

In [7]:
table = {}

We now want to populate this dictionary with the respective pieces of information. And we are going to start with the information about the label of the data set. So, let's create a comment for that.

In [8]:
table = {}

# determine values for the label

And, as seen in the previous post, what we want to know about the label are the names of the different classes and how often they appear. To get that, we obviously first have to access the label column.

In [9]:
table = {}

# determine values for the label
df[label_column]
Out[9]:
PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

And now, to get the respective pieces of information, we can make use of the "value_counts" method.

In [10]:
table = {}

# determine values for the label
df[label_column].value_counts()
Out[10]:
0    549
1    342
Name: Survived, dtype: int64

So, the class names are "0" and "1". And there are 549 passengers that didn't survive and 342 that did survive. And, as we have seen before in the image or in the "example_table", those are the values that we now want to store into our empty dictionary ("table") that we have created at the beginning of the cell.

However, before we actually do that, I want to make you aware of some behavior of the "value_counts" method. Namely, it orders the unique values of a pandas Series based on how often they occur. So, it lists the value that appears most often first.

In this case, since there are more non-survivors in the data, the "0" is listed first. However, if there were more survivors, the "1" would be listed first. We can see that if we, for example, only look at the female passengers.

In [11]:
table = {}

# determine values for the label
df[df.Sex == "female"][label_column].value_counts()
Out[11]:
1    233
0     81
Name: Survived, dtype: int64

Here, there are more survivors and therefore the "1" is listed first. And this is a problem in terms of the look-up table that we want to create.

In [12]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[12]:
In [13]:
pprint(example_table, width=40)
{'Pclass': {1: [0.15, 0.4],
            2: [0.18, 0.25],
            3: [0.68, 0.35]},
 'Sex': {'female': [0.15, 0.68],
         'male': [0.85, 0.32]},
 'class_counts': [549, 342],
 'class_names': [0, 1]}

Namely, as mentioned before, the order of the elements in all the lists of the look-up table is important. So, in our case, the first element of each list should refer to class "0". And the second element of each list should refer to class "1". For example, out of all the 549 non-survivors, 15% travel in the 1st class. And out of all the 342 survivors, 40% travel in the 1st class.

So, we need to make sure that the "class_names" are ordered in a particular way, instead of just listing them based on which class appears most often. And therefore, we are simply going to order them numerically (or alphabetically if the class names are strings).

In order to do that, we can make use of the "sort_index" method.

In [14]:
table = {}

# determine values for the label
df[df.Sex == "female"][label_column].value_counts().sort_index()
Out[14]:
0     81
1    233
Name: Survived, dtype: int64

Now, class "0" is listed first even though there are actually more passengers belonging to class "1".

Side note: We could also set the "sort" parameter of the "value_counts" method equal to "False" to achieve the same result.
In [15]:
table = {}

# determine values for the label
df[df.Sex == "female"][label_column].value_counts(sort=False)
Out[15]:
0     81
1    233
Name: Survived, dtype: int64
However, I think it is clearer what the code actually does if we use the "sort_index" method.

Okay, so that is the behavior of the "value_counts" method that I wanted to point out. So now, let's consider all the passengers again and not just the female passengers.

In [16]:
table = {}

# determine values for the label
df[label_column].value_counts().sort_index()
Out[16]:
0    549
1    342
Name: Survived, dtype: int64

And now, let's store the information about the names of the different classes and how often they appear into the "table", i.e. the dictionary that we have created at the beginning of the cell. To do that, let's first store the output of the "value_counts" method into a variable.

In [17]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
value_counts
Out[17]:
0    549
1    342
Name: Survived, dtype: int64

Now, we can access the names of the classes by using the "index" attribute.

In [18]:
value_counts.index
Out[18]:
Int64Index([0, 1], dtype='int64')

And we can access the actual counts by using the "values" attribute.

In [19]:
value_counts.values
Out[19]:
array([549, 342], dtype=int64)

So now, let's finally store the class names and counts into the "table".

In [20]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index
table["class_counts"] = value_counts.values

So, let's have a look at the "table".

In [21]:
table
Out[21]:
{'class_names': Int64Index([0, 1], dtype='int64'),
 'class_counts': array([549, 342], dtype=int64)}

And, as you can see, we have stored the respective pieces of information under the respective keys. But what you can also see is that we have stored a so-called "Int64Index" object under "class_names", which is a specific pandas object.

In [22]:
type(table["class_names"])
Out[22]:
pandas.core.indexes.numeric.Int64Index

And actually this is not really a problem. So, we could leave it like that. However, to make everything look more uniform, we are going to transform it into a NumPy array (just like the array under "class_counts").

In order to do that, we are going to use the "to_numpy" method after we have called the "index" attribute.

In [23]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

If we now look at the table, we can see that it only contains NumPy arrays.

In [24]:
table
Out[24]:
{'class_names': array([0, 1], dtype=int64),
 'class_counts': array([549, 342], dtype=int64)}

And with that, we have now stored all the information that we need with regards to the label of the data set. So now, let's start working on the code that stores the necessary information about the features, namely the respective probabilities.

In [25]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[25]:
In [26]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features

So here, what we want to do is the following: for each feature, we want to know how the values of that feature are distributed. And we don't just want to know that for the data set as a whole, but for each respective class.

So, the first thing that we need to do is loop over all the features of the data set. And we can do that by using the "columns" attribute of our data frame "df".

In [27]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
df.columns
Out[27]:
Index(['Sex', 'Pclass', 'Age_Group', 'Embarked', 'SibSp', 'ParCh', 'Survived'], dtype='object')

This returns a list-like object that contains the names of all the columns in our data set. However, this obviously also includes the label ("Survived"). So, we need to drop that.

In [28]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
df.drop(label_column, axis=1).columns
Out[28]:
Index(['Sex', 'Pclass', 'Age_Group', 'Embarked', 'SibSp', 'ParCh'], dtype='object')

Now, we can loop over this "Index" object in order to access each feature of the data set.

In [29]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
for feature in df.drop(label_column, axis=1).columns:
    print(feature)
Sex
Pclass
Age_Group
Embarked
SibSp
ParCh

So now, let's write the code that is going to be executed within each iteration of the for-loop. To do that, let's actually comment out the for-loop statement for now. And instead, let's create a variable called "feature".

In [30]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"

So, we are going to pretend that we are in the iteration of the for-loop where we are currently looking at the feature "Sex". This way, we actually have something to work with and we don't have to use print-statements all the time within the for-loop.

Okay, so now let's have a look again at the "example_table" to see what we want to do for each feature.

In [31]:
pprint(example_table, width=40)
{'Pclass': {1: [0.15, 0.4],
            2: [0.18, 0.25],
            3: [0.68, 0.35]},
 'Sex': {'female': [0.15, 0.68],
         'male': [0.85, 0.32]},
 'class_counts': [549, 342],
 'class_names': [0, 1]}

Namely, for each feature we want to create a dictionary. And the keys of the dictionary are the different values of that feature. And the values of the dictionary are lists containing the respective probabilities.

So, the first thing that we need to do within the for-loop is to create an empty dictionary, which we then want to populate.

In [32]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

So, let's have a look at the "table".

In [33]:
table
Out[33]:
{'class_names': array([0, 1], dtype=int64),
 'class_counts': array([549, 342], dtype=int64),
 'Sex': {}}

And, as you can see, it now includes the feature "Sex" and there we have an empty dictionary. So now, let's store the respective probabilities into that dictionary.

For that, we need to know how often the values "male" and "female" of the feature "Sex" appear. So, we need to access the "Sex" column of the data frame "df".

In [34]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df[feature]
Out[34]:
PassengerId
1        male
2      female
3      female
4      female
5        male
        ...  
887      male
888    female
889    female
890      male
891      male
Name: Sex, Length: 891, dtype: object

And now, in order to determine how often the respective values appear, we can again make use of the "value_counts" method.

In [35]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df[feature].value_counts()
Out[35]:
male      577
female    314
Name: Sex, dtype: int64

So, there are 577 male passengers and 314 female passengers in the data set. However, these are just the counts for the whole data set. What we need instead are the counts grouped by the different classes "0" and "1". To get those, we can use the "groupby" method on "df".

In [36]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df.groupby(label_column)[feature].value_counts()
Out[36]:
Survived  Sex   
0         male      468
          female     81
1         female    233
          male      109
Name: Sex, dtype: int64

Now, we have the male and female counts grouped by the different classes. But, as you know, for the Naive Bayes algorithm we actually need the respective probabilities. And we can get those by setting the "normalize" parameter of the "value_counts" method equal to "True".

In [37]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

df.groupby(label_column)[feature].value_counts(normalize=True)
Out[37]:
Survived  Sex   
0         male      0.852459
          female    0.147541
1         female    0.681287
          male      0.318713
Name: Sex, dtype: float64

Those are the percentages that we can also see in the image from before.

In [38]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[38]:

However, we are actually not going to make use of the "normalize" parameter (later on, we will also see why). Instead, we are going to calculate those probabilities manually by using the counts. So, let's store the counts into a variable called "counts".

In [39]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
counts
Out[39]:
Survived  Sex   
0         male      468
          female     81
1         female    233
          male      109
Name: Sex, dtype: int64

And now, to make the calculations easier, we are going to transform this pandas Series into a pandas DataFrame by using the "unstack" method.

In [40]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
counts.unstack(label_column)
Out[40]:
Survived    0    1
Sex
female     81  233
male      468  109

So, let's store this data frame into a variable called "df_counts".

In [41]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)
df_counts
Out[41]:
Survived    0    1
Sex
female     81  233
male      468  109

So now, we want to know how the values "female" and "male" are distributed for the classes "0" and "1". For that, we need to know how many "0"s and "1"s there are in total. And we can get that by simply using the "sum" method on "df_counts".

In [42]:
df_counts.sum()
Out[42]:
Survived
0    549
1    342
dtype: int64

So, there are 549 non-survivors and 342 survivors (which are the same numbers as in the image above). And now, in order to calculate the probabilities, we need to divide the numbers in the columns of "df_counts" by the respective total number.

So, for example, in order to calculate the probabilities for the non-survivors, we need to divide the 81 "female" non-survivors by 549 total non-survivors. And we need to divide the 468 "male" non-survivors by 549 total non-survivors.
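
As a quick sanity check, we can do this arithmetic in plain Python (just an illustration, not part of the function we are building):

81 / 549   # 0.1475... -> proportion of "female" among the non-survivors
468 / 549  # 0.8524... -> proportion of "male" among the non-survivors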

For the survivors we need to do the same thing. So, we can simply divide "df_counts" by "df_counts.sum()".

In [43]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)
df_counts / df_counts.sum()
Out[43]:
Survived         0         1
Sex
female    0.147541  0.681287
male      0.852459  0.318713

And those are now the exact same probabilities that we got before when we used the "normalize" parameter of the "value_counts" method. So, let's store them into a variable called "df_probabilities".

In [44]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
df_probabilities
Out[44]:
Survived         0         1
Sex
female    0.147541  0.681287
male      0.852459  0.318713

And now, we simply need to store those probabilities into the empty dictionary that we created for the feature "Sex". So, let's have a look again at the "example_table" to see how exactly we want to store them.

In [45]:
pprint(example_table, width=40)
{'Pclass': {1: [0.15, 0.4],
            2: [0.18, 0.25],
            3: [0.68, 0.35]},
 'Sex': {'female': [0.15, 0.68],
         'male': [0.85, 0.32]},
 'class_counts': [549, 342],
 'class_names': [0, 1]}

And, as you can see, for the value "female" we want to store the probabilities 0.15 and 0.68. So, looking at "df_probabilities", what we want to store into the empty dictionary are the rows of that data frame.

So, we need to loop over the rows of "df_probabilities". For that, we can make use of the "index" attribute.

In [46]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
df_probabilities.index
Out[46]:
Index(['female', 'male'], dtype='object', name='Sex')

This returns a list-like object containing the indices of "df_probabilities". So, let's loop over that.

In [47]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
for value in df_probabilities.index:
    print(value)
female
male

So now, let's write the code that is going to be executed within each iteration of the for-loop. Just as we did before, let's actually comment out the for-loop statement for now. And instead, let's create a variable called "value".

In [48]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
# for value in df_probabilities.index:
value = "female"

So, we are going to pretend that we are in the iteration of the for-loop where we are currently looking at the value "female".

Okay, so now we need to use "value" to access the respective row of "df_probabilities" so that we can then store the probabilities into our "table". To see how we can do that, let's print out "df_probabilities" again.

In [49]:
df_probabilities
Out[49]:
Survived         0         1
Sex
female    0.147541  0.681287
male      0.852459  0.318713

And now, in order to access the "female" row, we need to make use of the pandas "loc-indexer".

In [50]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
# for value in df_probabilities.index:
value = "female"
df_probabilities.loc[value]
Out[50]:
Survived
0    0.147541
1    0.681287
Name: female, dtype: float64

So, that's how we can access specific rows of "df_probabilities". And since those are the probabilities that we want to store into our "table", let's transform this pandas Series into a NumPy array by again using the "to_numpy" method.

In [51]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
# for value in df_probabilities.index:
value = "female"
df_probabilities.loc[value].to_numpy()
Out[51]:
array([0.14754098, 0.68128655])

So now, let's store this array into a variable called "probabilities".

In [52]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
# for value in df_probabilities.index:
value = "female"
probabilities = df_probabilities.loc[value].to_numpy()
probabilities
Out[52]:
array([0.14754098, 0.68128655])

And now, let's finally store these "probabilities" into the empty dictionary that we created for the feature "Sex".

In [53]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
# for feature in df.drop(label_column, axis=1).columns:
feature = "Sex"
table[feature] = {}

# determine counts
counts = df.groupby(label_column)[feature].value_counts()
df_counts = counts.unstack(label_column)

# calculate probabilities
df_probabilities = df_counts / df_counts.sum()
# for value in df_probabilities.index:
value = "female"
probabilities = df_probabilities.loc[value].to_numpy()
table[feature][value] = probabilities

So, let's have a look at the "table" to see if it worked.

In [54]:
table
Out[54]:
{'class_names': array([0, 1], dtype=int64),
 'class_counts': array([549, 342], dtype=int64),
 'Sex': {'female': array([0.14754098, 0.68128655])}}

And, as you can see, for the feature "Sex" we have now stored the respective probabilities for the value "female". So, our code is working. Therefore, let's now run it for all the features and all the values. So, we need to uncomment the two for-loop statements and delete the lines where we created the variables "feature" and "value".

In [55]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
for feature in df.drop(label_column, axis=1).columns:
    table[feature] = {}

    # determine counts
    counts = df.groupby(label_column)[feature].value_counts()
    df_counts = counts.unstack(label_column)

    # calculate probabilities
    df_probabilities = df_counts / df_counts.sum()
    for value in df_probabilities.index:
        probabilities = df_probabilities.loc[value].to_numpy()
        table[feature][value] = probabilities

So, let's again have a look at the "table".

In [56]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[56]:
In [57]:
pprint(table)
{'Age_Group': {'Adult': array([0.61748634, 0.61695906]),
               'Child': array([0.05282332, 0.11695906]),
               'Teenager': array([0.10200364, 0.11403509]),
               'Unknown': array([0.2276867 , 0.15204678])},
 'Embarked': {'C': array([0.13661202, 0.27192982]),
              'Q': array([0.0856102, 0.0877193]),
              'S': array([0.77777778, 0.64035088])},
 'ParCh': {0: array([0.81056466, 0.68128655]),
           1: array([0.09653916, 0.19005848]),
           2: array([0.07285974, 0.11695906]),
           3: array([0.00364299, 0.00877193]),
           4: array([0.00728597,        nan]),
           5: array([0.00728597, 0.00292398]),
           6: array([0.00182149,        nan])},
 'Pclass': {1: array([0.14571949, 0.39766082]),
            2: array([0.17668488, 0.25438596]),
            3: array([0.67759563, 0.34795322])},
 'Sex': {'female': array([0.14754098, 0.68128655]),
         'male': array([0.85245902, 0.31871345])},
 'SibSp': {0: array([0.72495446, 0.61403509]),
           1: array([0.17668488, 0.32748538]),
           2: array([0.0273224, 0.0380117]),
           3: array([0.02185792, 0.01169591]),
           4: array([0.0273224 , 0.00877193]),
           5: array([0.00910747,        nan]),
           8: array([0.01275046,        nan])},
 'class_counts': array([549, 342], dtype=int64),
 'class_names': array([0, 1], dtype=int64)}

And, as you can see, for each feature we now have a dictionary. And in those dictionaries we have a list of probabilities for each unique value of that feature. And if you look at the features "Sex" and "Pclass", then you can see that the probabilities are the same as the probabilities in the image.

So, we are basically done and we can copy this code into our "create_table" function that we created at the beginning of this post. Before we do that, however, we need to address one problem. Namely, if you look at the features "ParCh" and "SibSp", then you can see that there are "nan" values in the "table". This stands for "Not a Number", and we need to replace these values with actual probabilities.

So, let's first check out why this happens. And luckily, the last iteration of the "for feature in..." for-loop was actually for the "ParCh" feature.

In [58]:
feature
Out[58]:
'ParCh'

So, every variable created within the for-loop refers to the feature "ParCh". Therefore, let's first have a look at "counts".

In [59]:
counts
Out[59]:
Survived  ParCh
0         0        445
          1         53
          2         40
          4          4
          5          4
          3          2
          6          1
1         0        233
          1         65
          2         40
          3          3
          5          1
Name: ParCh, dtype: int64

And here, we can see what the problem is. Namely, for the non-survivors ("Survived=0"), we had passengers in our data set for each possible value of "ParCh". For the survivors ("Survived=1"), however, we only had passengers in our data set for the values 0, 1, 2, 3 and 5. So, there were no survivors in the data set that travelled with 4 or 6 parents/children.

So, when we then transform this pandas Series into a pandas DataFrame, pandas fills in these missing values with "NaN".

In [60]:
df_counts
Out[60]:
Survived      0      1
ParCh
0         445.0  233.0
1          53.0   65.0
2          40.0   40.0
3           2.0    3.0
4           4.0    NaN
5           4.0    1.0
6           1.0    NaN

So, this is where the "nan" in our "table" come from. So now, we need to write the code that is going to replace them. And since we only want to do that when there actually are "NaN" values in "df_counts", let's first write a condition that checks exactly that.

For that, we can make use of the "isna" method (side note: we could also use the "isnull" method).

In [61]:
df_counts.isna()
Out[61]:
Survived      0      1
ParCh
0         False  False
1         False  False
2         False  False
3         False  False
4         False   True
5         False  False
6         False   True

This returns a data frame that only consists of Booleans indicating which elements of the data frame are "NaN".

But we don't really want to know which exact elements are "NaN"; instead, we simply want to know whether there are any "NaN" in the data frame at all. For that, we can use the "any" method.

In [62]:
df_counts.isna().any()
Out[62]:
Survived
0    False
1     True
dtype: bool

This checks if there are any elements that are "True". And in this case, it does it for each respective column of "df_counts". That's why there is a "False" for column "0" and a "True" for column "1" since the "True" values are only in column "1".

Now, if we want to know if there are any "True" values for the whole data frame, then we need to set the "axis" parameter of the "any" method equal to "None".

In [63]:
df_counts.isna().any(axis=None)
Out[63]:
True

So, if this condition is "True", then we want to replace the "NaN" in "df_counts".

In [64]:
if df_counts.isna().any(axis=None):
    print("do something to replace 'NaN'")
do something to replace 'NaN'

So now, let's see how we do that. Namely, we first want to replace them with zeros. For that, we can use the "fillna" method.

In [65]:
df_counts.fillna(value=0)
Out[65]:
Survived      0      1
ParCh
0         445.0  233.0
1          53.0   65.0
2          40.0   40.0
3           2.0    3.0
4           4.0    0.0
5           4.0    1.0
6           1.0    0.0

So, this is how we can replace them. However, the actual "df_counts" data frame hasn't changed by running the above code.

In [66]:
df_counts
Out[66]:
Survived      0      1
ParCh
0         445.0  233.0
1          53.0   65.0
2          40.0   40.0
3           2.0    3.0
4           4.0    NaN
5           4.0    1.0
6           1.0    NaN

So, we need to run something like "df_counts = ..." or we can also simply set the "inplace" parameter of the "fillna" method to "True".

In [67]:
df_counts.fillna(value=0, inplace=True)

Now, the "df_counts" data frame has actually been modified and we don't have any "NaN" anymore.

In [68]:
df_counts
Out[68]:
Survived      0      1
ParCh
0         445.0  233.0
1          53.0   65.0
2          40.0   40.0
3           2.0    3.0
4           4.0    0.0
5           4.0    1.0
6           1.0    0.0

But we also can't just leave the zeros in the data frame because, if we do that, we would run into the "problem of rare values" (what exactly this problem is and how to solve it, I covered in the second post of my "Naive Bayes explained" series). So, what we are going to do is add one instance to each entry in this data frame.
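
To see why the zeros would be a problem, here is a small illustration (the probabilities are taken from the "table" above; the passenger is made up). In the second step of the algorithm, the probabilities for a class get multiplied together, so a single zero wipes out the whole product:

# hypothetical passenger: female, travelling with 4 parents/children
p_female = [0.14754098, 0.68128655]  # P(Sex="female" | class 0), P(... | class 1)
p_parch_4 = [0.00728597, 0.0]        # P(ParCh=4 | class 0), with "nan" read as 0

score_died = p_female[0] * p_parch_4[0]      # greater than 0
score_survived = p_female[1] * p_parch_4[1]  # exactly 0

So, no matter how strongly the other features point towards class "1", its score can never be larger than zero. Adding one count to each entry avoids exactly that.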

In [69]:
df_counts + 1
Out[69]:
Survived      0      1
ParCh
0         446.0  234.0
1          54.0   66.0
2          41.0   41.0
3           3.0    4.0
4           5.0    1.0
5           5.0    2.0
6           2.0    1.0

Now, there aren't any zeros anymore. But, as before with the "fillna" method, the code above hasn't actually altered the "df_counts" data frame.

In [70]:
df_counts
Out[70]:
Survived      0      1
ParCh
0         445.0  233.0
1          53.0   65.0
2          40.0   40.0
3           2.0    3.0
4           4.0    0.0
5           4.0    1.0
6           1.0    0.0

So, we need to run the code like this:

In [71]:
df_counts += 1

Now, "df_counts" has been changed.

In [72]:
df_counts
Out[72]:
Survived      0      1
ParCh
0         446.0  234.0
1          54.0   66.0
2          41.0   41.0
3           3.0    4.0
4           5.0    1.0
5           5.0    2.0
6           2.0    1.0

So, this is how we can replace the "nan" in our "table" and take care of the "problem of rare values". So now, let's add these code snippets to the code that is going to comprise the "create_table" function.

In [73]:
table = {}

# determine values for the label
value_counts = df[label_column].value_counts().sort_index()
table["class_names"] = value_counts.index.to_numpy()
table["class_counts"] = value_counts.values

# determine probabilities for the features
for feature in df.drop(label_column, axis=1).columns:
    table[feature] = {}

    # determine counts
    counts = df.groupby(label_column)[feature].value_counts()
    df_counts = counts.unstack(label_column)
    
    # add one count to avoid "problem of rare values"
    if df_counts.isna().any(axis=None):
        df_counts.fillna(value=0, inplace=True)
        df_counts += 1

    # calculate probabilities
    df_probabilities = df_counts / df_counts.sum()
    for value in df_probabilities.index:
        probabilities = df_probabilities.loc[value].to_numpy()
        table[feature][value] = probabilities

So now, let's have a look at the table again.

In [74]:
pprint(table)
{'Age_Group': {'Adult': array([0.61748634, 0.61695906]),
               'Child': array([0.05282332, 0.11695906]),
               'Teenager': array([0.10200364, 0.11403509]),
               'Unknown': array([0.2276867 , 0.15204678])},
 'Embarked': {'C': array([0.13661202, 0.27192982]),
              'Q': array([0.0856102, 0.0877193]),
              'S': array([0.77777778, 0.64035088])},
 'ParCh': {0: array([0.80215827, 0.67048711]),
           1: array([0.0971223 , 0.18911175]),
           2: array([0.07374101, 0.11747851]),
           3: array([0.00539568, 0.01146132]),
           4: array([0.00899281, 0.00286533]),
           5: array([0.00899281, 0.00573066]),
           6: array([0.00359712, 0.00286533])},
 'Pclass': {1: array([0.14571949, 0.39766082]),
            2: array([0.17668488, 0.25438596]),
            3: array([0.67759563, 0.34795322])},
 'Sex': {'female': array([0.14754098, 0.68128655]),
         'male': array([0.85245902, 0.31871345])},
 'SibSp': {0: array([0.7176259 , 0.60458453]),
           1: array([0.17625899, 0.32378223]),
           2: array([0.02877698, 0.04011461]),
           3: array([0.02338129, 0.01432665]),
           4: array([0.02877698, 0.01146132]),
           5: array([0.01079137, 0.00286533]),
           8: array([0.01438849, 0.00286533])},
 'class_counts': array([549, 342], dtype=int64),
 'class_names': array([0, 1], dtype=int64)}

And, as you can see, there are no "nan" values anymore and we also only have non-zero probabilities. So now, our code is working properly. So, let's put it into the skeleton of our "create_table" function.

In [75]:
def create_table(df, label_column):
    table = {}

    # determine values for the label
    value_counts = df[label_column].value_counts().sort_index()
    table["class_names"] = value_counts.index.to_numpy()
    table["class_counts"] = value_counts.values

    # determine probabilities for the features
    for feature in df.drop(label_column, axis=1).columns:
        table[feature] = {}

        # determine counts
        counts = df.groupby(label_column)[feature].value_counts()
        df_counts = counts.unstack(label_column)

        # add one count to avoid "problem of rare values"
        if df_counts.isna().any(axis=None):
            df_counts.fillna(value=0, inplace=True)
            df_counts += 1

        # calculate probabilities
        df_probabilities = df_counts / df_counts.sum()
        for value in df_probabilities.index:
            probabilities = df_probabilities.loc[value].to_numpy()
            table[feature][value] = probabilities
            
    return table

So, this is what the function looks like that is going to comprise the first step of the Naive Bayes algorithm. So, let's now use it to create a "lookup_table" that we can then use in the second step of the algorithm.

In [76]:
lookup_table = create_table(df_train, label_column="Survived")
pprint(lookup_table)
{'Age_Group': {'Adult': array([0.61748634, 0.61695906]),
               'Child': array([0.05282332, 0.11695906]),
               'Teenager': array([0.10200364, 0.11403509]),
               'Unknown': array([0.2276867 , 0.15204678])},
 'Embarked': {'C': array([0.13661202, 0.27192982]),
              'Q': array([0.0856102, 0.0877193]),
              'S': array([0.77777778, 0.64035088])},
 'ParCh': {0: array([0.80215827, 0.67048711]),
           1: array([0.0971223 , 0.18911175]),
           2: array([0.07374101, 0.11747851]),
           3: array([0.00539568, 0.01146132]),
           4: array([0.00899281, 0.00286533]),
           5: array([0.00899281, 0.00573066]),
           6: array([0.00359712, 0.00286533])},
 'Pclass': {1: array([0.14571949, 0.39766082]),
            2: array([0.17668488, 0.25438596]),
            3: array([0.67759563, 0.34795322])},
 'Sex': {'female': array([0.14754098, 0.68128655]),
         'male': array([0.85245902, 0.31871345])},
 'SibSp': {0: array([0.7176259 , 0.60458453]),
           1: array([0.17625899, 0.32378223]),
           2: array([0.02877698, 0.04011461]),
           3: array([0.02338129, 0.01432665]),
           4: array([0.02877698, 0.01146132]),
           5: array([0.01079137, 0.00286533]),
           8: array([0.01438849, 0.00286533])},
 'class_counts': array([549, 342], dtype=int64),
 'class_names': array([0, 1], dtype=int64)}

And with that, we have reached the end of this part. And in the next part, we are going to build the function that uses the "lookup_table" to make predictions about the examples of the test set.
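
As a small preview, here is a minimal sketch of what such a prediction function could look like, assuming the example comes in as a pandas Series without the label (the name "predict_example" and the handling of unseen feature values are my assumptions, not necessarily the final implementation):

import numpy as np

def predict_example(row, lookup_table):
    # start with the class counts (proportional to the class priors)
    class_estimates = lookup_table["class_counts"].astype(np.float64)

    # multiply in the probability of each observed feature value
    for feature in row.index:
        try:
            class_estimates = class_estimates * lookup_table[feature][row[feature]]
        except KeyError:
            # feature value not seen during training -> skip it
            continue

    # return the class with the highest estimate
    index_max_class = class_estimates.argmax()
    return lookup_table["class_names"][index_max_class]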
