In this tutorial, we are going to code a Naive Bayes classifier from scratch. And for that, one obviously needs to know how the algorithm actually works. So, if you don't know that yet, then you should probably first check out my Naive Bayes explained tutorial. However, I will also give a quick recap of how the algorithm works in this tutorial.

Okay, so having said that, let's start with the code. And first, we need to make our import statements.

Import Statements

In [1]:
import pandas as pd

from helper_functions import prepare_data, replace_strings

from pprint import pprint
from IPython.display import Image

So first, we import "pandas" as "pd", which is the library that we are going to use to build the algorithm.

Then, from "helper_functions" we import "prepare_data" and "replace_strings". And "helper_functions" is simply a Python file (in the same folder as this notebook) where I stored some functions that help us to prepare our data. So, we don't actually need them for the algorithm. I just created them to make sure that this notebook doesn't get too cluttered.

After that, we import "pprint" from "pprint", which we will need to print out a nested dictionary in a more readable way.

And lastly, we import "Image" from "IPython.display" which I use to display an image of the algorithm later on in the notebook.

So, those are the import statements. And now, let's have a look at the data that we are going to use.

Data Preparation

In [2]:
# load data
df_train = pd.read_csv("../../data/train.csv", index_col="PassengerId")
df_test = pd.read_csv("../../data/test.csv", index_col="PassengerId")
test_labels = pd.read_csv("../../data/test_labels.csv", index_col="PassengerId", squeeze=True)

# prepare data
df_train = prepare_data(df_train)
df_test = prepare_data(df_test, train_set=False)

# handle missing values in training data
embarked_mode = df_train.Embarked.mode()[0]
df_train["Embarked"].fillna(embarked_mode, inplace=True)

df_train.head()
Out[2]:
                Sex  Pclass Age_Group Embarked  SibSp  ParCh  Survived
PassengerId
1              male       3     Adult        S      1      0         0
2            female       1     Adult        C      1      0         1
3            female       3     Adult        S      0      0         1
4            female       1     Adult        S      1      0         1
5              male       3     Adult        S      0      0         0

Namely, we are going to use the Titanic Data Set. This data set contains information about the passengers of the Titanic. So, each row represents one specific passenger. And the available features are the following:

  • Sex: male or female
  • Pclass: whether the passenger travels in the 1st, 2nd or 3rd class
  • Age_Group:
    • Child (0-12 years)
    • Teenager (13-19 years)
    • Adult (above 19 years)
    • Unknown
  • Embarked (i.e. at which port the passenger got on board):
    • C (Cherbourg)
    • Q (Queenstown)
    • S (Southampton)
  • SibSp: number of siblings/spouses aboard the Titanic
  • ParCh: number of parents/children aboard the Titanic

The last column of the dataframe ("Survived") contains the label of this data set. Namely, it tells us whether the respective passenger did survive (1) or did not survive (0) the accident.
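By the way, we can quickly check how the two classes are distributed in the training data (these counts will show up again in the look-up table later on):

df_train["Survived"].value_counts()
# -> 0    549
#    1    342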

So, that is our training data. And now, let's have a look at the test data.

In [3]:
df_test.head()
Out[3]:
                Sex  Pclass Age_Group Embarked  SibSp  ParCh
PassengerId
892            male       3     Adult        Q      0      0
893          female       3     Adult        S      1      0
894            male       2     Adult        Q      0      0
895            male       3     Adult        S      0      0
896          female       3     Adult        S      1      1

Here, the "Survived" column is obviously missing since that is what we want to predict. For example, if we look at the first passenger (with Id=892), we want to be able to predict if that passenger survived (1) or if that passenger did not survive (0).

And the respective labels are stored in the variable "test_labels".

In [4]:
test_labels.head()
Out[4]:
PassengerId
892    0
893    1
894    0
895    0
896    1
Name: Survived, dtype: int64

So, as you can see, the passenger with Id=892 did not survive.

Okay, so that is what our data looks like. And now, let's see how we can code a Naive Bayes classifier from scratch.

Naive Bayes from Scratch

For that, let me first give you a quick recap of how the algorithm works.

In [5]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[5]:

The above image is a slide from my Naive Bayes explained tutorial. In this slide, we are only considering the two features "Sex" and "Pclass" and we are looking at test passenger 3. So, we want to predict if that passenger survived or not. To do that, we make use of the 2-step process of the Naive Bayes algorithm (indicated by the two arrows).

In the first step, we create a "look-up table" (top-right corner of the slide). Here, for each feature, we want to know how the values of that feature are distributed. And we don't just want to know that for the data set as a whole, but we want to know that for each respective class.

So, for example, looking at the feature "Sex", we want to know what percentage of the non-survivors are female (15%) and male (85%), and what percentage of the survivors are female (68%) and male (32%).
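By the way, if you want to see where those percentages come from, one way to compute them directly with pandas would be the following (this is not part of the implementation yet, just a quick sanity check):

# per-class distribution of the "Sex" feature
df_train.groupby("Survived")["Sex"].value_counts(normalize=True)

This should reproduce the split from the slide, i.e. roughly 15%/85% (female/male) for the non-survivors and 68%/32% for the survivors.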

In the second step of the algorithm, we then use that look-up table to make our prediction. Namely, what we do is, we estimate how many of the 549 non-survivors we would expect to have the same combination of values as test passenger 3. So, we estimate how many of the 549 non-survivors we would expect to be both female and travel in the 1st class.

And we do that by multiplying the respective probabilities with the total number of non-survivors. So, we multiply 15% (the probability of being female) by 15% (the probability of traveling in the 1st class) and then multiply that by 549. Thereby, we get a result of 12.4. So, out of the 549 non-survivors, we would expect that there are ("on average") 12.4 passengers who are female and travel in the 1st class.

Then, we do the same for the survivors. And with that, we then have two estimates, and whichever one is higher is going to be our prediction for the respective test passenger. So, since in this case it is much more likely that test passenger 3 survived (88%), our prediction is going to be that she survived. And, as you can see in the table, she actually did survive. Therefore, the prediction would be correct.
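To make that calculation concrete, here it is as a tiny code snippet, using the numbers from the look-up table in the slide:

# estimates for test passenger 3 (female, 1st class)
estimate_not_survived = 0.15 * 0.15 * 549  # ~12.4 expected non-survivors
estimate_survived = 0.68 * 0.40 * 342      # ~93.0 expected survivors

# whichever estimate is higher becomes the prediction
prediction = 1 if estimate_survived > estimate_not_survived else 0  # -> 1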

So, this is how the Naive Bayes algorithm works. And now, let's finally start to implement that in code.

1. Step of the Algorithm

In the first step of the algorithm, we want to create the look-up table. So, let's first see how we are actually going to represent that in code. Namely, I thought that we could use a nested dictionary.

In [6]:
example_table = {
    
    "Sex": {"female": [0.15, 0.68],
            "male": [0.85, 0.32]},
    
    "Pclass": {1: [0.15, 0.40],
               2: [0.18, 0.25],
               3: [0.68, 0.35]},
}

So, at the first level, the keys of the dictionary are the names of the features ("Sex" and "Pclass") and the values of the dictionary are other dictionaries. At the second level, the keys of the dictionary are the respective values of the features (e.g. "female" and "male") and the values of the dictionary are lists that contain the respective probabilities.

But, as you can see in the image above, we are not only going to store information about the features in our look-up table. We are also going to store some information about the label. Namely, we need to know the names (i.e. the values) of the different classes. And we need to know how often the different classes occur in our data.

In [7]:
example_table = {
    
    "Sex": {"female": [0.15, 0.68],
            "male": [0.85, 0.32]},
    
    "Pclass": {1: [0.15, 0.40],
               2: [0.18, 0.25],
               3: [0.68, 0.35]},
    
    "class_names": [0, 1],
    "class_counts": [549, 342]
}

One thing that I would like to point out here is that the first element of each list always refers to, in this case, class "0". And the second element of each list refers to class "1".
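So, for example, this is how we would look up the probability that a passenger is female, given the respective class:

example_table["Sex"]["female"][0]  # -> 0.15 (class 0, i.e. did not survive)
example_table["Sex"]["female"][1]  # -> 0.68 (class 1, i.e. survived)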

Okay, so this is what the look-up table should look like. So now, let's create a function that builds such a table from our data set. And let's call this function "create_table" and let's first create a skeleton for it.

In [8]:
def create_table(df, label_column):

    table = {}  # the logic that fills this table follows in the next post

    return table

So, we have two parameters for this function. The first one ("df") is our training data and it is going to be a pandas dataframe.

The second parameter is called "label_column" and it is going to be a string. With that, we can specify which column in the dataframe is the label of the data set. And we need to know that because we want to store different kinds of information for the features (probabilities) and the label (class names and counts).
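Just to illustrate why the "label_column" parameter is useful, here is a hypothetical snippet of how the class information could be extracted with it (this is not the final implementation, which follows in the next post):

# hypothetical illustration only
label_column = "Survived"
class_counts = df_train[label_column].value_counts().sort_index()

class_counts.index.tolist()  # -> [0, 1]      (class names)
class_counts.tolist()        # -> [549, 342]  (class counts)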

And then, the function simply returns our look-up table.

So, those are the inputs and the output of the function. What is still missing is the actual logic. And building that will be the topic of the next post.
