Naive Bayes - Python Notebook

Naive Bayes is a simple yet effective algorithm for classification tasks in machine learning. It is based on Bayes’ theorem, which describes the probability of an event given prior knowledge of conditions related to it. The “naive” part is the assumption that all features are conditionally independent of each other given the class, which rarely holds exactly in real-world data.

Dataset¶

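Our dataset has 14 rows, four categorical features (Outlook, Temperature, Humidity, Windy), and a binary target, Play: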

Outlook	Temperature	Humidity	Windy	Play
sunny	hot	high	false	no
sunny	hot	high	true	no
overcast	hot	high	false	yes
rainy	mild	high	false	yes
rainy	cool	normal	false	yes
rainy	cool	normal	true	no
overcast	cool	normal	true	yes
sunny	mild	high	false	no
sunny	cool	normal	false	yes
rainy	mild	normal	false	yes
sunny	mild	normal	true	yes
overcast	mild	high	true	yes
overcast	hot	normal	false	yes
rainy	mild	high	true	no

Theory¶

Bayes’ Theorem¶

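Given the evidence $E$ (the feature values of a sample), we want the posterior probability $P(y | E)$ of each class $y$. According to Bayes’ theorem, this posterior is proportional to the prior $P(y)$ multiplied by the likelihood $P(E | y)$: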

$$P(y | E) \propto P(y) \cdot P(E | y) \tag{1}$$

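The “naive” assumption is that all features are conditionally independent given the class, so the likelihood factorizes into one term per feature. For the test sample $E = (\text{sunny}, \text{cool}, \text{high}, \text{true})$: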

$$P(E | y) = P(\text{sunny} | y) \cdot P(\text{cool} | y) \cdot P(\text{high} | y) \cdot P(\text{true} | y) \tag{2}$$

This gives us the final formula we need to compare:

$$Y_{\text{predicted}} = \arg\max_{y \in \{\text{yes}, \text{no}\}} \left[ P(y) \cdot P(\text{sunny} | y) \cdot P(\text{cool} | y) \cdot P(\text{high} | y) \cdot P(\text{true} | y) \right] \tag{3}$$
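As a rough check of the formula above, the two scores can be worked out by hand for the sample (sunny, cool, high, true). This sketch assumes each conditional is estimated as the plain relative frequency $\text{count}(x_i, y) / \text{count}(y)$ read off the table, with no smoothing:

```latex
\begin{aligned}
P(\text{no}) \prod_i P(x_i | \text{no}) &= \tfrac{5}{14} \cdot \tfrac{3}{5} \cdot \tfrac{1}{5} \cdot \tfrac{4}{5} \cdot \tfrac{3}{5} \approx 0.0206 \\
P(\text{yes}) \prod_i P(x_i | \text{yes}) &= \tfrac{9}{14} \cdot \tfrac{2}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \approx 0.0053
\end{aligned}
```

The score for no is larger, so the unsmoothed estimate already favors no for this sample.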

Likelihood Calculations (with Laplace Smoothing)¶

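Our code uses Laplace (add-1) smoothing to avoid zero probabilities: without it, a feature value never seen with a class (for example, overcast never occurs with no) would get a likelihood of 0 and wipe out the entire product. The smoothed estimate of each conditional probability is: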

$$P(x_i | y) = \frac{\text{count}(x_i, y) + 1}{\text{count}(y) + k} \tag{4}$$

Where:

$\text{count}(x_i, y)$ is the number of times the feature value $x_i$ appears with class $y$.
$\text{count}(y)$ is the total number of times class $y$ appears.
$k$ is the number of unique values that feature takes across the whole dataset, regardless of class (e.g., $k=3$ for Outlook, $k=2$ for Windy).
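The smoothing formula can be sketched as a small helper. The function name `smoothed_likelihood` is an illustrative assumption, not part of the notebook's code; the counts are read from the table (Outlook given class no).

```python
from fractions import Fraction


def smoothed_likelihood(count_xy, count_y, k):
    """Add-1 (Laplace) estimate of P(x_i | y).

    count_xy: times the feature value appears with class y
    count_y:  times class y appears
    k:        unique values of the feature in the whole dataset
    """
    return Fraction(count_xy + 1, count_y + k)


# Class 'no' appears 5 times; Outlook has k = 3 values.
p_sunny_no = smoothed_likelihood(3, 5, 3)     # sunny with 'no': 3 rows
p_rainy_no = smoothed_likelihood(2, 5, 3)     # rainy with 'no': 2 rows
p_overcast_no = smoothed_likelihood(0, 5, 3)  # overcast with 'no': 0 rows

print(p_sunny_no, p_rainy_no, p_overcast_no)  # 1/2 3/8 1/8
print(p_sunny_no + p_rainy_no + p_overcast_no)  # 1
```

Because $k$ counts all three Outlook values, the smoothed probabilities for a class still sum to 1, and the unseen value overcast gets a small nonzero probability instead of 0.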

Implementation¶

dataset = [
    ['sunny', 'hot', 'high', 'false', 'no'],
    ['sunny', 'hot', 'high', 'true', 'no'],
    ['overcast', 'hot', 'high', 'false', 'yes'],
    ['rainy', 'mild', 'high', 'false', 'yes'],
    ['rainy', 'cool', 'normal', 'false', 'yes'],
    ['rainy', 'cool', 'normal', 'true', 'no'],
    ['overcast', 'cool', 'normal', 'true', 'yes'],
    ['sunny', 'mild', 'high', 'false', 'no'],
    ['sunny', 'cool', 'normal', 'false', 'yes'],
    ['rainy', 'mild', 'normal', 'false', 'yes'],
    ['sunny', 'mild', 'normal', 'true', 'yes'],
    ['overcast', 'mild', 'high', 'true', 'yes'],
    ['overcast', 'hot', 'normal', 'false', 'yes'],
    ['rainy', 'mild', 'high', 'true', 'no']
]

def train_naive_bayes(data):
    """Count class labels and, per class, how often each feature value occurs."""
    label_counts = {}
    feature_counts = {}

    for row in data:
        outlook, temp, humidity, windy, label = row

        label_counts[label] = label_counts.get(label, 0) + 1

        if label not in feature_counts:
            feature_counts[label] = {"Outlook": {}, "Temp": {}, "Humidity": {}, "Windy": {}}

        feature_counts[label]["Outlook"][outlook] = feature_counts[label]["Outlook"].get(outlook, 0) + 1
        feature_counts[label]["Temp"][temp] = feature_counts[label]["Temp"].get(temp, 0) + 1
        feature_counts[label]["Humidity"][humidity] = feature_counts[label]["Humidity"].get(humidity, 0) + 1
        feature_counts[label]["Windy"][windy] = feature_counts[label]["Windy"].get(windy, 0) + 1

    return label_counts, feature_counts

label_counts, feature_counts = train_naive_bayes(dataset)
print("Label Counts:", label_counts)
print("Feature Counts:", feature_counts)

Label Counts: {'no': 5, 'yes': 9}
Feature Counts: {'no': {'Outlook': {'sunny': 3, 'rainy': 2}, 'Temp': {'hot': 2, 'cool': 1, 'mild': 2}, 'Humidity': {'high': 4, 'normal': 1}, 'Windy': {'false': 2, 'true': 3}}, 'yes': {'Outlook': {'overcast': 4, 'rainy': 3, 'sunny': 2}, 'Temp': {'hot': 2, 'mild': 4, 'cool': 3}, 'Humidity': {'high': 3, 'normal': 6}, 'Windy': {'false': 6, 'true': 3}}}

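def predict_naive_bayes(x, label_counts, feature_counts):
    """Return the class with the highest Laplace-smoothed Naive Bayes score for sample x."""
    total = sum(label_counts.values())
    probs = {}

    feature_names = ["Outlook", "Temp", "Humidity", "Windy"]

    for label in label_counts:
        # Start from the prior P(y)
        probs[label] = label_counts[label] / total

        for i, feature in enumerate(feature_names):
            value = x[i]  # 'sunny', then 'cool', etc.

            # count(x_i, y); 0 if this value never appeared with this label
            count = feature_counts[label][feature].get(value, 0)

            # k: unique values of this feature across all classes, not only those
            # seen with this label (e.g. 'overcast' never occurs with 'no')
            num_options = len({v for counts in feature_counts.values() for v in counts[feature]})

            # Multiply in the Laplace-smoothed P(x_i | y)
            probs[label] *= (count + 1) / (label_counts[label] + num_options)

    return max(probs, key=probs.get)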


test_sample = ['sunny', 'cool', 'high', 'true']
prediction = predict_naive_bayes(test_sample, label_counts, feature_counts)

print("Test Sample:", test_sample)
print("Predicted Class:", prediction)

Test Sample: ['sunny', 'cool', 'high', 'true']
Predicted Class: no
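To see how decisive this prediction is, the smoothed scores can be recomputed by hand and normalized into posteriors. This is a minimal sketch under stated assumptions: the counts are copied from the training output above, `k_values` counts each feature's unique values over the whole dataset, and the names `sample_counts`, `k_values`, and `scores` are illustrative only.

```python
from fractions import Fraction
from math import prod

label_counts = {"no": 5, "yes": 9}

# count(x_i, y) for the sample values sunny, cool, high, true
sample_counts = {
    "no": [3, 1, 4, 3],
    "yes": [2, 3, 3, 3],
}

# Unique values of Outlook, Temp, Humidity, Windy in the whole dataset
k_values = [3, 3, 2, 2]

total = sum(label_counts.values())
scores = {}
for label, n_label in label_counts.items():
    likelihoods = [
        Fraction(count + 1, n_label + k)
        for count, k in zip(sample_counts[label], k_values)
    ]
    # Prior times the product of smoothed likelihoods
    scores[label] = Fraction(n_label, total) * prod(likelihoods)

# Dividing by the sum of the scores turns them into posteriors
evidence = sum(scores.values())
for label, score in scores.items():
    print(label, round(float(score), 4), round(float(score / evidence), 2))
```

The scores come out to roughly 0.0182 for no and 0.0071 for yes, a posterior of about 0.72 for no, which agrees with the predicted class.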

scikit-learn¶

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix

df = pd.DataFrame(dataset, columns=['outlook', 'temperature', 'humidity', 'windy', 'play'])

# Encode every categorical column (features and target) as integer codes
le = LabelEncoder()
for col in df.columns:
    df[col] = le.fit_transform(df[col])

x = df.drop('play', axis=1)
y = df['play']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

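# GaussianNB models each feature as a continuous Gaussian, so the integer codes
# are treated as numeric values rather than as unordered categories
clf = GaussianNB()
clf.fit(x_train, y_train)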

y_pred = clf.predict(x_test)

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.67      0.67      0.67         3

    accuracy                           0.60         5
   macro avg       0.58      0.58      0.58         5
weighted avg       0.60      0.60      0.60         5

Confusion Matrix:
[[1 1]
 [1 2]]
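With only 5 held-out rows, these scores say little about the model. Separately, `GaussianNB` assumes continuous features, while every feature here is categorical. A closer match to the hand-written classifier is scikit-learn's `CategoricalNB`, sketched below as an alternative rather than as the notebook's method. It assumes the model is fitted on all 14 rows (no train/test split) with `alpha=1.0` for add-1 smoothing; the names `raw`, `encoder`, and `model` are illustrative.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Same 14 rows as `dataset`, repeated so this block runs on its own
raw = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
rows = [line.split() for line in raw.strip().splitlines()]
features = [row[:-1] for row in rows]
labels = [row[-1] for row in rows]

# Map each feature's categories to integer codes 0..k-1
encoder = OrdinalEncoder()
X = encoder.fit_transform(features).astype(int)

# alpha=1.0 corresponds to add-1 (Laplace) smoothing
model = CategoricalNB(alpha=1.0)
model.fit(X, labels)

sample = encoder.transform([["sunny", "cool", "high", "true"]]).astype(int)
prediction = model.predict(sample)[0]
posterior = dict(zip(model.classes_, model.predict_proba(sample)[0]))

print("Predicted Class:", prediction)
print("Posterior:", posterior)
```

For the sample (sunny, cool, high, true) this should predict no, in line with the from-scratch implementation, since both apply add-1 smoothing over each feature's full set of categories.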