import pandas as pd
from helper_functions import prepare_data, replace_strings
from pprint import pprint
from IPython.display import Image
# load data
df_train = pd.read_csv("../../data/train.csv", index_col="PassengerId")
df_test = pd.read_csv("../../data/test.csv", index_col="PassengerId")
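# the "squeeze" argument of read_csv was removed in pandas 2.0, so we squeeze the single column afterwards
test_labels = pd.read_csv("../../data/test_labels.csv", index_col="PassengerId").squeeze("columns")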
# prepare data
df_train = prepare_data(df_train)
df_test = prepare_data(df_test, train_set=False)
# handle missing values in training data
embarked_mode = df_train.Embarked.mode()[0]
df_train["Embarked"] = df_train["Embarked"].fillna(embarked_mode)
df_train.head()
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
example_table = {
    "Sex": {"female": [0.15, 0.68],
            "male": [0.85, 0.32]},
    "Pclass": {1: [0.15, 0.40],
               2: [0.18, 0.25],
               3: [0.68, 0.35]},
    "class_names": [0, 1],
    "class_counts": [549, 342]
}
def create_table(df, label_column):
    table = {}

    # determine values for the label
    value_counts = df[label_column].value_counts().sort_index()
    table["class_names"] = value_counts.index.to_numpy()
    table["class_counts"] = value_counts.values

    # determine probabilities for the features
    for feature in df.drop(label_column, axis=1).columns:
        table[feature] = {}

        # determine counts
        counts = df.groupby(label_column)[feature].value_counts()
        df_counts = counts.unstack(label_column)

        # add one count to avoid "problem of rare values"
        if df_counts.isna().any(axis=None):
            df_counts.fillna(value=0, inplace=True)
            df_counts += 1

        # calculate probabilities
        df_probabilities = df_counts / df_counts.sum()
        for value in df_probabilities.index:
            probabilities = df_probabilities.loc[value].to_numpy()
            table[feature][value] = probabilities

    return table
lookup_table = create_table(df_train, label_column="Survived")
pprint(lookup_table)
In the previous post, we built the "create_table" function, which executes the first step of the algorithm. In this post, we are going to build the function that executes the second step.
To do that, let's first look again at the illustration of the Naive Bayes algorithm and quickly recap what we actually want to do in the second step.
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
Here, we are currently looking at test passenger 3, who is female and travels in first class. To predict whether this passenger survived or not, we first estimate how many of the 549 non-survivors in the training data set would have the same combination of values as test passenger 3. For that, we multiply 15% by 15% by 549. After that, we estimate how many of the 342 survivors would have the same combination of values as test passenger 3. For that, we multiply 68% by 40% by 342. Then, we have two estimates, and whichever is higher is what we are going to predict for test passenger 3.
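Just to make the arithmetic concrete, here is a minimal sketch that redoes the two multiplications in plain Python. The numbers are the rounded percentages from the slide, and the variable names are only chosen for this illustration.

```python
# Rounded values read off the slide; index 0 = non-survivors, index 1 = survivors.
p_female = [0.15, 0.68]      # share of each class that is female
p_first_class = [0.15, 0.40] # share of each class that travels in first class
class_counts = [549, 342]    # number of training examples per class

# One estimate per class: probability * probability * class count.
estimates = [
    p_sex * p_class * count
    for p_sex, p_class, count in zip(p_female, p_first_class, class_counts)
]
print([round(estimate, 2) for estimate in estimates])  # roughly 12.35 and 93.02

# The prediction is the class with the larger estimate.
prediction = estimates.index(max(estimates))
print(prediction)  # 1, i.e. "survived"
```

So, with the rounded numbers we would expect about 12 non-survivors and about 93 survivors to look like test passenger 3, which is why the prediction is "survived".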
So now, we want to create a function that does exactly that. So, let's start again with just the skeleton of the function.
def predict_example(row, lookup_table):

    return prediction
So, we pass one row of the test set into the function, as well as the "lookup_table" that we created in the first step of the algorithm. And then, the function should return the prediction.
So now, let's start building the logic of the function. To do that, let's first create a variable called "row" so that we have something to work with.
row = df_test.loc[904]
row
And now, let's start with creating the estimates, i.e. how many of the non-survivors we would expect to have the same combination of values as "row" and how many of the survivors we would expect to have that same combination of values.
And luckily, in the "lookup_table" that we created in the previous post, we stored everything as NumPy arrays. Because of that, we are able to make both of the estimations in just one step since we can make use of element-wise multiplication.
For example, let's look at the probabilities for "Sex=female".
lookup_table["Sex"]["female"]
And let's also look at the probabilities for "Pclass=1".
lookup_table["Pclass"][1]
And let's finally also look at the "class_counts".
lookup_table["class_counts"]
Those are the three arrays that we need to make the estimations depicted in the slide.
Image(filename='../../images/Naive Bayes algorithm.png', width=1000)
And all we have to do is multiply those.
lookup_table["Sex"]["female"] * \
lookup_table["Pclass"][1] * \
lookup_table["class_counts"]
This way, we get the same estimates as in the slide (the small differences in the slide are due to rounding errors).
So, for our "predict_example" function we are simply going to save the "class_counts" into a variable called "class_estimates".
class_estimates = lookup_table["class_counts"]
class_estimates
And then, we are going to iteratively multiply those "class_estimates" with the respective probabilities from the "lookup_table". In order to do that, we need to access each "feature-value-pair" of the "row" so that we can then, in turn, access the respective probabilities in the "lookup_table".
The features we can access by using the "index" attribute of the row.
class_estimates = lookup_table["class_counts"]
row.index
This returns a list-like object containing the features. So, let's simply iterate over that.
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    print(feature)
So, this is how we can access all the features. But for now, let's comment out the for-loop statement and let's instead create a variable called "feature" so that we, again, actually have something to work with.
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
So, we are pretending that we are currently in the iteration of the for-loop where we are looking at the feature "Sex".
So now, we need to access the corresponding value of the feature "Sex" for our "row". And this we can simply do by using bracket notation.
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
row[feature]
So, let's store this value in a variable called "value".
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]
And with that, we have now the "feature" and the "value" to access the corresponding probabilities in our "lookup_table".
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]
lookup_table[feature][value]
And, as you can see in the slide, those are the correct probabilities when the "Sex" is "female". So, let's store them in a variable called "probabilities".
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]
probabilities = lookup_table[feature][value]
And with that, we have now the corresponding probabilities for a given "feature-value-pair" from the row. Therefore, let's update the "class_estimates" by multiplying them with the "probabilities".
class_estimates = lookup_table["class_counts"]
# for feature in row.index:
feature = "Sex"
value = row[feature]
probabilities = lookup_table[feature][value]
class_estimates = class_estimates * probabilities
class_estimates
And, as you can see, now the number of non-survivors changed from 549 to 81 and the number of survivors changed from 342 to 233.
So now, we simply need to do the same thing when we consider all the features and not just the feature "Sex". So, let's uncomment the for-loop statement again and let's delete the line where we created the variable "feature".
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

class_estimates
So, as you can see, we would estimate that, on average, 0.8 of the non-survivors would have the same combination of values as the "row". And we would estimate that, on average, 7.95 of the survivors would have the same combination of values as the "row". So, we would predict that this passenger did survive.
And this means that the function should return a "1". So, how do we get that?
Well, that is why we also stored the "class_names" in our "lookup_table".
lookup_table["class_names"]
So now, we simply need to know the index of the largest value in "class_estimates". And that index, we can then use to index the "class_names" to get the actual prediction.
So, to get the index of the largest value in "class_estimates", we can use the "argmax" method.
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

class_estimates.argmax()
This returns a "1" since the second element of "class_estimates" is the largest element of that array. So, let's store that into a variable called "index_max_class".
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

index_max_class = class_estimates.argmax()
index_max_class
And now, let's use that index to access the corresponding element of the "class_names".
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

index_max_class = class_estimates.argmax()
lookup_table["class_names"][index_max_class]
This also returns a "1" because the second element of the "class_names" array is a "1". And this is now our prediction. So, let's store that in a corresponding variable.
class_estimates = lookup_table["class_counts"]
for feature in row.index:
    value = row[feature]
    probabilities = lookup_table[feature][value]
    class_estimates = class_estimates * probabilities

index_max_class = class_estimates.argmax()
prediction = lookup_table["class_names"][index_max_class]
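As a small aside, the detour via "class_names" matters as soon as the labels are not simply 0 and 1. Here is a minimal sketch of the "argmax" step in isolation, where the string labels are made up for this illustration and the estimates are the rounded ones from the text.

```python
import numpy as np

# Hypothetical string labels (the Titanic data uses 0 and 1 instead).
class_names = np.array(["died", "survived"])
# Rounded estimates from the text above.
class_estimates = np.array([0.8, 7.95])

# argmax returns the position of the largest estimate ...
index_max_class = class_estimates.argmax()
# ... and that position is then used to look up the actual label.
prediction = class_names[index_max_class]

print(index_max_class, prediction)
```

If two estimates happen to be exactly equal, "argmax" returns the position of the first one.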
And this is already how our "predict_example" function should work. So, let's copy this code into the skeleton of our function.
def predict_example(row, lookup_table):
    class_estimates = lookup_table["class_counts"]

    for feature in row.index:
        value = row[feature]
        probabilities = lookup_table[feature][value]
        class_estimates = class_estimates * probabilities

    index_max_class = class_estimates.argmax()
    prediction = lookup_table["class_names"][index_max_class]

    return prediction
So, this is our function. However, this function only predicts one row. So now, we need to apply it to our whole test set. And for that, we can make use of the "apply" method.
df_test.apply(predict_example, axis=1, args=(lookup_table,))
Here, we pass in the function that we want to apply to the data frame. Then, we set the "axis" parameter equal to "1" because we want to apply the function to every row and not every column. And lastly, since our "predict_example" function has two parameters, we need to pass the second parameter ("lookup_table") to the "args" parameter of the "apply" method (side note: This has to be a tuple, which is why there is a trailing comma after "lookup_table". Otherwise, Python doesn't interpret it as a tuple but just as the "lookup_table" itself).
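To see the effect of the trailing comma and of the "args" parameter in isolation, here is a minimal sketch on a toy data frame. The function "add_offset" and the data are made up for this illustration and only stand in for "predict_example" and the test set.

```python
import pandas as pd

def add_offset(row, offset):
    # Toy stand-in for a row-wise function with one extra parameter.
    return row["a"] + row["b"] + offset

toy_df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Parentheses alone don't create a tuple, the trailing comma does.
not_a_tuple = (5)
a_tuple = (5,)
print(type(not_a_tuple).__name__, type(a_tuple).__name__)  # int tuple

# axis=1 passes each row to the function; args supplies the extra argument.
result = toy_df.apply(add_offset, axis=1, args=(5,))
print(result.tolist())  # [16, 27]
```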
So now, let's actually run the "apply" call.
df_test.apply(predict_example, axis=1, args=(lookup_table,))
When we do that, we run into an error. It says "KeyError: 9". This means that we are trying to access a dictionary with a key that doesn't exist, namely "9". So, let's investigate why this happens.
The error occurs on line 6 where we want to create the "probabilities" variable. In order to create that, we access our "lookup_table" (which is a dictionary). And the keys that we are using are the "feature" and the "value".
We know that the error can't be caused by the "feature" since there is no feature in the test set that is called "9".
df_test.columns
So, the error must be caused by the "value". Therefore, let's check if there is a value of "9" in any of the columns.
(df_test == 9).any()
And, as we can see, the value of "9" occurs in the column "ParCh". So, let's filter the data frame to only include passengers that traveled with 9 parents/children.
df_test[df_test.ParCh == 9]
So, there are two passengers in the test set that traveled with 9 parents/children. And they are the reason why we are getting the "KeyError" because the key "9" doesn't exist for feature "ParCh" in our "lookup_table".
lookup_table["ParCh"]
This is another type of problem that can occur with rare values (as we have discussed in the second part of my Naive Bayes explained series). Namely, the value "ParCh=9" is so rare that we didn't even have any examples in our training data. It just so happened that it occurred in the test set.
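The same situation can be reproduced on a tiny scale. The following sketch uses a made-up table (the probabilities are not the real ones) that only knows the values 0 and 1 for "ParCh" and then looks up the unseen value 9.

```python
import numpy as np

# Toy stand-in for lookup_table["ParCh"]; the numbers are invented.
seen_values = {
    0: np.array([0.8, 0.7]),
    1: np.array([0.2, 0.3]),
}

print(1 in seen_values)  # True: this value occurred in the (toy) training data
print(9 in seen_values)  # False: this value was never seen

try:
    seen_values[9]
except KeyError as error:
    print("KeyError:", error)  # KeyError: 9
```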
Because the value "ParCh=9" is so rare, we are simply going to ignore the feature "ParCh" in this case (since it probably is not very predictive anyway). So, in the "predict_example" function, we are going to try to access the "probabilities" from the "lookup_table" (in order to update the "class_estimates") and skip the feature if that raises a "KeyError".
def predict_example(row, lookup_table):
    class_estimates = lookup_table["class_counts"]

    for feature in row.index:

        try:
            value = row[feature]
            probabilities = lookup_table[feature][value]
            class_estimates = class_estimates * probabilities

        # skip in case "value" only occurs in test set but not in train set
        # (i.e. "value" is not in "lookup_table")
        except KeyError:
            continue

    index_max_class = class_estimates.argmax()
    prediction = lookup_table["class_names"][index_max_class]

    return prediction
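For a quick check that does not depend on the Titanic files, here is a self-contained sketch that runs the same logic on the rounded probabilities from the slide. The three toy passengers are made up for this illustration, and the third one has a "Pclass" value that is not in the table, so that feature is skipped for that row.

```python
import numpy as np
import pandas as pd

# Rounded probabilities from the slide; index 0 = died, index 1 = survived.
example_table = {
    "Sex": {"female": np.array([0.15, 0.68]),
            "male": np.array([0.85, 0.32])},
    "Pclass": {1: np.array([0.15, 0.40]),
               2: np.array([0.18, 0.25]),
               3: np.array([0.68, 0.35])},
    "class_names": np.array([0, 1]),
    "class_counts": np.array([549, 342]),
}

def predict_example(row, lookup_table):
    class_estimates = lookup_table["class_counts"]
    for feature in row.index:
        try:
            value = row[feature]
            probabilities = lookup_table[feature][value]
            class_estimates = class_estimates * probabilities
        except KeyError:
            # value was never seen in training, so ignore this feature
            continue
    index_max_class = class_estimates.argmax()
    return lookup_table["class_names"][index_max_class]

# Made-up passengers; Pclass=9 does not exist in the table.
toy_test = pd.DataFrame(
    {"Sex": ["female", "male", "male"], "Pclass": [1, 3, 9]},
    index=[1, 2, 3],
)
toy_predictions = toy_test.apply(predict_example, axis=1, args=(example_table,))
print(toy_predictions.tolist())  # [1, 0, 0]
```

For the third passenger, only the feature "Sex" contributes to the estimates, and the call no longer fails.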
So, with the "try-except" block in place, let's try to apply the function again to the whole test set.
df_test.apply(predict_example, axis=1, args=(lookup_table,))
And now, we are actually getting our predictions. So, let's store them into a variable.
predictions = df_test.apply(predict_example, axis=1, args=(lookup_table,))
predictions.head()
Okay, so now that we have our predictions, let's see if they are actually correct. So, let's see how good our Naive Bayes algorithm is at making predictions. Or in other words, let's determine the accuracy of the algorithm.
To do that, we simply need to compare our "predictions" to the actual labels.
predictions == test_labels
This returns a list (actually it's a pandas Series) containing Booleans. So, we have a "True" if the prediction and the label are the same, i.e. our prediction was correct. And we have a "False" if the prediction and the label are not the same, i.e. our prediction was not correct.
So, let's store this list of Booleans into a variable called "predictions_correct".
predictions_correct = predictions == test_labels
predictions_correct.head()
And now, since "True" or "False" values can be interpreted by Python as a 1 or 0 respectively, we can determine the accuracy of our algorithm by simply calculating the mean of our "predictions_correct" Series.
predictions_correct = predictions == test_labels
predictions_correct.mean()
And, as you can see, we get an accuracy of 77%. So, our algorithm is able to predict 77% of the examples in our test data correctly.
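Here is a minimal sketch of the same calculation on four made-up predictions, just to show that the mean of a Boolean Series is the share of "True" values. Note that the comparison relies on both Series sharing the same index (in our case the "PassengerId").

```python
import pandas as pd

# Made-up predictions and labels that share the same index.
toy_predictions = pd.Series([0, 1, 1, 0], index=[10, 11, 12, 13])
toy_labels = pd.Series([0, 1, 0, 0], index=[10, 11, 12, 13])

# Element-wise comparison: True where the prediction matches the label.
toy_correct = toy_predictions == toy_labels

# True counts as 1 and False as 0, so the mean is the accuracy.
accuracy = toy_correct.mean()
print(accuracy)  # 0.75
```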
Therefore, our code seems to be working. But, just to be sure, let's compare the accuracy of our algorithm with the accuracy of the Naive Bayes algorithm implemented by Sklearn.
And actually, there are several Naive Bayes implementations available in Sklearn. So, let's import some of them (side note: I actually don't really know what the exact differences of these implementations are).
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB
And now, we have to adjust our data a little bit since the Sklearn implementations, for example, can't handle strings.
# data preparation
df_train = replace_strings(df_train)
X_train = df_train.drop("Survived", axis=1)
y_train = df_train.Survived
X_test = replace_strings(df_test)
y_test = test_labels
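The "replace_strings" helper comes from "helper_functions" and is not shown in this post. As a rough idea of what such a step could look like, here is a minimal sketch that maps the strings of one column to integers. The mapping and the toy data are assumptions for this illustration and not the actual implementation of the helper.

```python
import pandas as pd

# Made-up data with one string column.
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 2]})

# Assumed encoding, chosen only for this example.
sex_mapping = {"male": 0, "female": 1}

# Work on a copy so that the original data frame stays untouched.
encoded = toy.copy()
encoded["Sex"] = encoded["Sex"].map(sex_mapping)

print(encoded["Sex"].tolist())  # [0, 1, 1]
```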
So now, let's train the different Naive Bayes classifiers and see what accuracies we get.
# use different sklearn Naive Bayes models
clfs = [GaussianNB(), MultinomialNB(), ComplementNB(), BernoulliNB()]
clfs_names = ["GaussianNB", "MultinomialNB", "ComplementNB", "BernoulliNB"]
print("NB Model\tAccuracy")
print("--------\t--------")
for clf, clf_name in zip(clfs, clfs_names):
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print(f"{clf_name}\t{acc:.3f}")
And, as you can see, we also get accuracies around 76%-77%. So, our algorithm gets the same kind of accuracy on this data set as the Sklearn implementations. Therefore, I think it is pretty safe to say that our code is working correctly. And with that, we have reached the end of this tutorial.
Thanks for reading!