In [1]:
import pandas as pd

from helper_functions import prepare_data, replace_strings

from pprint import pprint
from IPython.display import Image

Data Preparation

In [2]:
# load data
df_train = pd.read_csv("../../data/train.csv", index_col="PassengerId")
df_test = pd.read_csv("../../data/test.csv", index_col="PassengerId")
test_labels = pd.read_csv("../../data/test_labels.csv", index_col="PassengerId", squeeze=True)

# prepare data
df_train = prepare_data(df_train)
df_test = prepare_data(df_test, train_set=False)

# handle missing values in training data
embarked_mode = df_train.Embarked.mode()[0]
df_train["Embarked"].fillna(embarked_mode, inplace=True)

df_train.head()
Out[2]:
                Sex  Pclass Age_Group Embarked  SibSp  ParCh  Survived
PassengerId
1              male       3     Adult        S      1      0         0
2            female       1     Adult        C      1      0         1
3            female       3     Adult        S      0      0         1
4            female       1     Adult        S      1      0         1
5              male       3     Adult        S      0      0         0

Naive Bayes from Scratch

In [3]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[3]:

Step 1 of the Algorithm

In [4]:
example_table = {
    
    "Sex": {"female": [0.15, 0.68],
            "male": [0.85, 0.32]},
    
    "Pclass": {1: [0.15, 0.40],
               2: [0.18, 0.25],
               3: [0.68, 0.35]},
    
    "class_names": [0, 1],
    "class_counts": [549, 342]
}
In [5]:
def create_table(df, label_column):
    table = {}

    # determine values for the label
    value_counts = df[label_column].value_counts().sort_index()
    table["class_names"] = value_counts.index.to_numpy()
    table["class_counts"] = value_counts.values

    # determine probabilities for the features
    for feature in df.drop(label_column, axis=1).columns:
        table[feature] = {}

        # determine counts
        counts = df.groupby(label_column)[feature].value_counts()
        df_counts = counts.unstack(label_column)

        # add one count to avoid "problem of rare values"
        if df_counts.isna().any(axis=None):
            df_counts.fillna(value=0, inplace=True)
            df_counts += 1

        # calculate probabilities
        df_probabilities = df_counts / df_counts.sum()
        for value in df_probabilities.index:
            probabilities = df_probabilities.loc[value].to_numpy()
            table[feature][value] = probabilities
            
    return table
In [6]:
lookup_table = create_table(df_train, label_column="Survived")
pprint(lookup_table)
{'Age_Group': {'Adult': array([0.61748634, 0.61695906]),
               'Child': array([0.05282332, 0.11695906]),
               'Teenager': array([0.10200364, 0.11403509]),
               'Unknown': array([0.2276867 , 0.15204678])},
 'Embarked': {'C': array([0.13661202, 0.27192982]),
              'Q': array([0.0856102, 0.0877193]),
              'S': array([0.77777778, 0.64035088])},
 'ParCh': {0: array([0.80215827, 0.67048711]),
           1: array([0.0971223 , 0.18911175]),
           2: array([0.07374101, 0.11747851]),
           3: array([0.00539568, 0.01146132]),
           4: array([0.00899281, 0.00286533]),
           5: array([0.00899281, 0.00573066]),
           6: array([0.00359712, 0.00286533])},
 'Pclass': {1: array([0.14571949, 0.39766082]),
            2: array([0.17668488, 0.25438596]),
            3: array([0.67759563, 0.34795322])},
 'Sex': {'female': array([0.14754098, 0.68128655]),
         'male': array([0.85245902, 0.31871345])},
 'SibSp': {0: array([0.7176259 , 0.60458453]),
           1: array([0.17625899, 0.32378223]),
           2: array([0.02877698, 0.04011461]),
           3: array([0.02338129, 0.01432665]),
           4: array([0.02877698, 0.01146132]),
           5: array([0.01079137, 0.00286533]),
           8: array([0.01438849, 0.00286533])},
 'class_counts': array([549, 342], dtype=int64),
 'class_names': array([0, 1], dtype=int64)}

In the previous post, we built the "create_table" function, which executes the first step of the algorithm. Now, in this post, we are going to build the function that executes the second step.

Step 2 of the Algorithm

To do that, let's first look again at the illustration of the Naive Bayes algorithm and quickly recap what we actually want to do in the second step.

In [7]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[7]:

Here, we are currently looking at test passenger 3, who is female and traveled in first class. To predict whether this passenger survived, we first estimate how many of the 549 non-survivors in the training data set would have the same combination of values as test passenger 3. To do that, we multiply 15% by 15% by 549. After that, we estimate how many of the 342 survivors would have the same combination of values as test passenger 3, which means multiplying 68% by 40% by 342. That gives us two estimates, and whichever one is higher determines what we predict for test passenger 3.
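
Just as a quick back-of-the-envelope check, here is that calculation with the rounded percentages from the slide (the exact probabilities come from the "lookup_table" we created in the first step):

0.15 * 0.15 * 549    # ≈ 12.35 expected non-survivors with this combination of values
0.68 * 0.40 * 342    # ≈ 93.02 expected survivors with this combination of values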

So now, we want to create a function that does exactly that. So, let's start again with just the skeleton of the function.

In [8]:
def predict_example(row, lookup_table):
    
    
    return prediction

So, we pass one row of the test set into the function, as well as the "lookup_table" that we created in the first step of the algorithm. And then, the function should return the prediction.

So now, let's start building the logic of the function. To do that, let's first create a variable called "row" so that we have something to work with.

In [9]:
row = df_test.loc[904]
row
Out[9]:
Sex          female
Pclass            1
Age_Group     Adult
Embarked          S
SibSp             1
ParCh             0
Name: 904, dtype: object

And now, let's start with creating the estimates. That is, how many of the non-survivors and how many of the survivors we would expect to have the same combination of values as "row".

And luckily, in the "lookup_table" that we created in the previous post, we stored everything as NumPy arrays. Because of that, we are able to make both of the estimations in just one step since we can make use of element-wise multiplication.
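
In case you are not familiar with element-wise multiplication in NumPy, here is a tiny standalone example with made-up numbers:

import numpy as np

np.array([0.1, 0.5]) * np.array([2.0, 4.0])    # array([0.2, 2. ]) -- multiplied element by element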

For example, let's look at the probabilities for "Sex=female".

In [10]:
lookup_table["Sex"]["female"]
Out[10]:
array([0.14754098, 0.68128655])

And let's also look at the probabilities for "Pclass=1".

In [11]:
lookup_table["Pclass"][1]
Out[11]:
array([0.14571949, 0.39766082])

And let's finally also look at the "class_counts".

In [12]:
lookup_table["class_counts"]
Out[12]:
array([549, 342], dtype=int64)

Those are the three arrays that we need to make the estimations depicted in the slide.

In [13]:
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Out[13]:

And all we have to do is simply multiply those.

In [14]:
lookup_table["Sex"]["female"] * \
lookup_table["Pclass"][1]     * \
lookup_table["class_counts"]
Out[14]:
array([11.80327869, 92.65497076])

This way, we get the same estimates as in the slide (the small differences in the slide are due to rounding errors).

So, for our "predict_example" function we are simply going to save the "class_counts" into a variable called "class_estimates".

In [15]:
class_estimates = lookup_table["class_counts"]
class_estimates
Out[15]:
array([549, 342], dtype=int64)

And then, we are going to iteratively multiply those "class_estimates" by the respective probabilities from the "lookup_table". In order to do that, we need to access each feature-value pair of the "row" so that we can then, in turn, access the respective probabilities in the "lookup_table".

We can access the features by using the "index" attribute of the row.

In [16]:
class_estimates = lookup_table["class_counts"]
row.index
Out[16]:
Index(['Sex', 'Pclass', 'Age_Group', 'Embarked', 'SibSp', 'ParCh'], dtype='object')

This returns a list-like object containing the features. So, let's simply iterate over that.

In [17]:
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    print(feature)
Sex
Pclass
Age_Group
Embarked
SibSp
ParCh

So, this is how we can access all the features. But for now, let's comment out the for-loop statement and instead create a variable called "feature" so that we, again, actually have something to work with.

In [18]:
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"

So, we are pretending that we are currently in the iteration of the for-loop where we are looking at the feature "Sex".

So now, we need to access the corresponding value of the feature "Sex" for our "row". And this we can simply do by using bracket notation.

In [19]:
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
row[feature]
Out[19]:
'female'

So, let's store this value in a variable called "value".

In [20]:
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]

And with that, we have now the "feature" and the "value" to access the corresponding probabilities in our "lookup_table".

In [21]:
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]
lookup_table[feature][value]
Out[21]:
array([0.14754098, 0.68128655])

And, as you can see in the slide, those are the correct probabilities when the "Sex" is "female". So, let's store them in a variable called "probabilities".

In [22]:
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]
probabilities = lookup_table[feature][value]

And with that, we now have the corresponding probabilities for a given feature-value pair from the row. Therefore, let's update the "class_estimates" by multiplying them by the "probabilities".

In [23]:
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]
probabilities = lookup_table[feature][value]
class_estimates = class_estimates * probabilities

class_estimates
Out[23]:
array([ 81., 233.])

And, as you can see, the estimate for the non-survivors changed from 549 to 81 and the estimate for the survivors changed from 342 to 233.

So now, we simply need to do the same thing when we consider all the features and not just the feature "Sex". So, let's uncomment the for-loop statement again and let's delete the line where we created the variable "feature".

In [24]:
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

class_estimates
Out[24]:
array([0.80148776, 7.9466947 ])

So, as you can see, we would estimate that, on average, 0.8 of the non-survivors would have the same combination of values as the "row". And we would estimate that, on average, 7.95 of the survivors would have the same combination of values as the "row". So, we would predict that this passenger did survive.

And this means that the function should return a "1". So, how do we get that?

Well, that is exactly why we also stored the "class_names" in our "lookup_table".

In [25]:
lookup_table["class_names"]
Out[25]:
array([0, 1], dtype=int64)

So now, we simply need to know the index of the largest value in "class_estimates". And that index, we can then use to index the "class_names" to get the actual prediction.

So, to get the index of the largest value in "class_estimates", we can use the "argmax" method.

In [26]:
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

class_estimates.argmax()
Out[26]:
1

This returns a "1" since the second element of "class_estimates" is the larges element of that array. So, let's store that into a variable called "index_max_class"

In [27]:
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

index_max_class = class_estimates.argmax()
index_max_class
Out[27]:
1

And now, let's use that index to access the corresponding element of the "class_names".

In [28]:
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

index_max_class = class_estimates.argmax()
lookup_table["class_names"][index_max_class]
Out[28]:
1

This also returns a "1" because the second element of the "class_names" array is a "1". And this is now our prediction. So, let's store that in a corresponding variable.

In [29]:
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

index_max_class = class_estimates.argmax()
prediction = lookup_table["class_names"][index_max_class]

And this is already how our "predict_example" function should work. So, let's copy this code into the skeleton of our function.

In [30]:
def predict_example(row, lookup_table):
    
    class_estimates = lookup_table["class_counts"]
    for feature in row.index:
        value = row[feature]
        probabilities = lookup_table[feature][value]
        class_estimates = class_estimates * probabilities

    index_max_class = class_estimates.argmax()
    prediction = lookup_table["class_names"][index_max_class]
    
    return prediction

So, this is our function. However, it only predicts a single row, so we now need to apply it to the whole test set. For that, we can make use of the "apply" method.

df_test.apply(predict_example, axis=1, args=(lookup_table,))

Here, we pass in the function that we want to apply to the data frame. Then, we set the "axis" parameter to "1" because we want to apply the function to every row and not every column. And lastly, since our "predict_example" function has two parameters, we need to pass the second argument ("lookup_table") via the "args" parameter of the "apply" method (side note: this has to be a tuple, which is why there is a trailing comma after "lookup_table". Otherwise, Python doesn't interpret it as a tuple but just as the "lookup_table" itself).
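
Just as a side note on the syntax, the following two lines are equivalent ways of passing "lookup_table" as the second argument (we are not executing them here, just showing the syntax):

df_test.apply(predict_example, axis=1, args=(lookup_table,))
df_test.apply(lambda row: predict_example(row, lookup_table), axis=1)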

So now, let's actually run this line.

In [31]:
df_test.apply(predict_example, axis=1, args=(lookup_table,))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-5e209382a07b> in <module>
----> 1 df_test.apply(predict_example, axis=1, args=(lookup_table,))

~\Miniconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   6876             kwds=kwds,
   6877         )
-> 6878         return op.get_result()
   6879 
   6880     def applymap(self, func) -> "DataFrame":

~\Miniconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

~\Miniconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    294             try:
    295                 result = libreduction.compute_reduction(
--> 296                     values, self.f, axis=self.axis, dummy=dummy, labels=labels
    297                 )
    298             except ValueError as err:

pandas\_libs\reduction.pyx in pandas._libs.reduction.compute_reduction()

pandas\_libs\reduction.pyx in pandas._libs.reduction.Reducer.get_result()

~\Miniconda3\lib\site-packages\pandas\core\apply.py in f(x)
    111 
    112             def f(x):
--> 113                 return func(x, *args, **kwds)
    114 
    115         else:

<ipython-input-30-a80bbf875f3a> in predict_example(row, lookup_table)
      4     for feature in row.index:
      5         value = row[feature]
----> 6         probabilities = lookup_table[feature][value]
      7         class_estimates = class_estimates * probabilities
      8 

KeyError: 9

When we do that, we run into an error. It says "KeyError: 9". This means that we are trying to access a dictionary with a key that doesn't exist, namely "9". So, let's investigate why this happens.

The error occurs on line 6 where we want to create the "probabilities" variable. In order to create that, we access our "lookup_table" (which is a dictionary). And the keys that we are using are the "feature" and the "value".

We know that the error can't be caused by the "feature" since there is no feature in the test set that is called "9".

In [32]:
df_test.columns
Out[32]:
Index(['Sex', 'Pclass', 'Age_Group', 'Embarked', 'SibSp', 'ParCh'], dtype='object')

So, the error must be caused by the "value". Therefore, let's check if there is a value of "9" in any of the columns.

In [33]:
(df_test == 9).any()
Out[33]:
Sex          False
Pclass       False
Age_Group    False
Embarked     False
SibSp        False
ParCh         True
dtype: bool

And, as we can see, the value of "9" occurs in the column "ParCh". So, let's filter the data frame to only include passengers that traveled with 9 parents/children.

In [34]:
df_test[df_test.ParCh == 9]
Out[34]:
               Sex  Pclass Age_Group Embarked  SibSp  ParCh
PassengerId
1234          male       3   Unknown        S      1      9
1257        female       3   Unknown        S      1      9

So, there are two passengers in the test set who traveled with 9 parents/children. And they are the reason why we are getting the "KeyError": the key "9" doesn't exist for the feature "ParCh" in our "lookup_table".

In [35]:
lookup_table["ParCh"]
Out[35]:
{0: array([0.80215827, 0.67048711]),
 1: array([0.0971223 , 0.18911175]),
 2: array([0.07374101, 0.11747851]),
 3: array([0.00539568, 0.01146132]),
 4: array([0.00899281, 0.00286533]),
 5: array([0.00899281, 0.00573066]),
 6: array([0.00359712, 0.00286533])}

This is another type of problem that can occur with rare values (as discussed in the second part of my Naive Bayes explained series). Namely, the value "ParCh=9" is so rare that we didn't have any examples of it in our training data. It just so happened that it occurred in the test set.

And because it is so rare, we are simply going to ignore the feature "ParCh" in this case (since it probably is not very predictive anyway). So, in the "predict_example" function, we are going to try to access the "probabilities" from the "lookup_table" (in order to update the "class_estimates"), and if that raises a "KeyError", we simply skip the feature.

In [36]:
def predict_example(row, lookup_table):
    
    class_estimates = lookup_table["class_counts"]
    for feature in row.index:

        try:
            value = row[feature]
            probabilities = lookup_table[feature][value]
            class_estimates = class_estimates * probabilities

        # skip in case "value" only occurs in test set but not in train set
        # (i.e. "value" is not in "lookup_table")
        except KeyError:
            continue

    index_max_class = class_estimates.argmax()
    prediction = lookup_table["class_names"][index_max_class]
    
    return prediction

So, this is what the function now looks like. Let's try to apply it again to the whole test set.

In [37]:
df_test.apply(predict_example, axis=1, args=(lookup_table,))
Out[37]:
PassengerId
892     0
893     1
894     0
895     0
896     1
       ..
1305    0
1306    1
1307    0
1308    0
1309    0
Length: 418, dtype: int64

And now, we are actually getting our predictions. So, let's store them into a variable.

In [38]:
predictions = df_test.apply(predict_example, axis=1, args=(lookup_table,))
predictions.head()
Out[38]:
PassengerId
892    0
893    1
894    0
895    0
896    1
dtype: int64

Okay, so now that we have our predictions, let's see how good our Naive Bayes algorithm actually is at making them. In other words, let's determine the accuracy of the algorithm.

Check Accuracy

To do that, we simply need to compare our "predictions" to the actual labels.

In [39]:
predictions == test_labels
Out[39]:
PassengerId
892      True
893      True
894      True
895      True
896      True
        ...  
1305     True
1306     True
1307     True
1308     True
1309    False
Length: 418, dtype: bool

This returns a list (actually, a pandas Series) containing Booleans: a "True" if the prediction and the label are the same, i.e. our prediction was correct, and a "False" if they are not, i.e. our prediction was incorrect.

So, let's store this list of Booleans into a variable called "predictions_correct".

In [40]:
predictions_correct = predictions == test_labels
predictions_correct.head()
Out[40]:
PassengerId
892    True
893    True
894    True
895    True
896    True
dtype: bool

And now, since "True" and "False" are interpreted by Python as 1 and 0 respectively, we can determine the accuracy of our algorithm by simply calculating the mean of our "predictions_correct" Series.
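
To illustrate this idea with made-up values:

import pandas as pd

pd.Series([True, True, False, True]).mean()    # 0.75, i.e. 3 out of 4 correct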

In [41]:
predictions_correct = predictions == test_labels
predictions_correct.mean()
Out[41]:
0.7655502392344498

And, as you can see, we get an accuracy of about 76.6%. So, our algorithm is able to predict roughly 77% of the examples in our test data correctly.

Therefore, our code seems to be working. But, just to be sure, let's compare the accuracy of our algorithm with the accuracy of the Naive Bayes algorithm implemented by Sklearn.

Comparison to Sklearn

And actually, there are several Naive Bayes implementations available in Sklearn. So, let's import some of them (side note: I actually don't really know what the exact differences between these implementations are).

In [42]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB

And now, we have to adjust our data a little bit since the Sklearn implementations can't, for example, handle strings.

In [43]:
# data preparation
df_train = replace_strings(df_train)
X_train = df_train.drop("Survived", axis=1)
y_train = df_train.Survived

X_test = replace_strings(df_test)
y_test = test_labels
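
Since "replace_strings" comes from the helper_functions module, here is just a rough sketch of what such a helper might look like, assuming it simply maps the string columns to integer codes (the actual implementation may differ):

def replace_strings_sketch(df):
    # hypothetical helper: map every string/object column to integer codes
    df = df.copy()
    for column in df.select_dtypes(include="object").columns:
        df[column] = df[column].astype("category").cat.codes
    return df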

So now, let's train the different Naive Bayes classifiers and see what accuracies we get.

In [44]:
# use different sklearn Naive Bayes models
clfs = [GaussianNB(), MultinomialNB(), ComplementNB(), BernoulliNB()]
clfs_names = ["GaussianNB", "MultinomialNB", "ComplementNB", "BernoulliNB"]

print("NB Model\tAccuracy")
print("--------\t--------")
for clf, clf_name in zip(clfs, clfs_names):
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    
    print(f"{clf_name}\t{acc:.3f}")
NB Model	Accuracy
--------	--------
GaussianNB	0.763
MultinomialNB	0.768
ComplementNB	0.761
BernoulliNB	0.766

And, as you can see, we also get accuracies of around 76%-77%. So, our algorithm achieves the same kind of accuracy on this data set as the Sklearn implementations. Therefore, I think it is pretty safe to say that our code is working correctly. And with that, we have reached the end of this tutorial.

Thanks for reading!

In [ ]: