Naive Bayes is a simple yet powerful algorithm used for classification tasks in machine learning. It is based on Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. The “naive” aspect of the algorithm comes from the assumption that all features are independent of each other, which is often not the case in real-world data.
Dataset¶
Our dataset is the classic weather dataset: 14 days of observations (Outlook, Temperature, Humidity, Windy) and whether a game was played (Play).
| Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|
| sunny | hot | high | false | no |
| sunny | hot | high | true | no |
| overcast | hot | high | false | yes |
| rainy | mild | high | false | yes |
| rainy | cool | normal | false | yes |
| rainy | cool | normal | true | no |
| overcast | cool | normal | true | yes |
| sunny | mild | high | false | no |
| sunny | cool | normal | false | yes |
| rainy | mild | normal | false | yes |
| sunny | mild | normal | true | yes |
| overcast | mild | high | true | yes |
| overcast | hot | normal | false | yes |
| rainy | mild | high | true | no |
Theory¶
Bayes’ Theorem¶
We want the posterior probability of a class $y$ (Play = yes or no) given the observed features $x_1, \dots, x_n$ (Outlook, Temperature, Humidity, Windy). According to Bayes’ theorem, this is proportional to the prior multiplied by the likelihood:

$$P(y \mid x_1, \dots, x_n) \propto P(y)\, P(x_1, \dots, x_n \mid y)$$
The “naive” assumption is that all features are conditionally independent given the class, so we can break the likelihood down into a product of per-feature terms:

$$P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$
This gives us the final formula we need to compare across classes:

$$\hat{y} = \arg\max_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
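For our dataset the priors can be read straight off the Play column: 9 of the 14 rows are yes and 5 are no, so

$$P(\text{yes}) = \tfrac{9}{14} \approx 0.64, \qquad P(\text{no}) = \tfrac{5}{14} \approx 0.36.$$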
Likelihood Calculations (with Laplace Smoothing)¶
Our code uses Laplace (or Add-1) Smoothing to prevent zero-probability problems. The formula for each conditional probability is:

$$P(x_i \mid y) = \frac{\mathrm{count}(x_i, y) + 1}{\mathrm{count}(y) + K}$$

Where:

- $\mathrm{count}(x_i, y)$ is the number of times the feature value $x_i$ appears with class $y$.
- $\mathrm{count}(y)$ is the total number of times class $y$ appears.
- $K$ is the total number of unique values for that feature (e.g., $K = 3$ for Outlook, $K = 2$ for Windy).
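For example, sunny appears 3 times among the 5 no rows, and Outlook has 3 possible values, so the smoothed likelihood is

$$P(\text{Outlook} = \text{sunny} \mid \text{no}) = \frac{3 + 1}{5 + 3} = 0.5.$$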
Implementation¶
dataset = [
['sunny', 'hot', 'high', 'false', 'no'],
['sunny', 'hot', 'high', 'true', 'no'],
['overcast', 'hot', 'high', 'false', 'yes'],
['rainy', 'mild', 'high', 'false', 'yes'],
['rainy', 'cool', 'normal', 'false', 'yes'],
['rainy', 'cool', 'normal', 'true', 'no'],
['overcast', 'cool', 'normal', 'true', 'yes'],
['sunny', 'mild', 'high', 'false', 'no'],
['sunny', 'cool', 'normal', 'false', 'yes'],
['rainy', 'mild', 'normal', 'false', 'yes'],
['sunny', 'mild', 'normal', 'true', 'yes'],
['overcast', 'mild', 'high', 'true', 'yes'],
['overcast', 'hot', 'normal', 'false', 'yes'],
['rainy', 'mild', 'high', 'true', 'no']
]

def train_naive_bayes(data):
    # Count how often each class label and each (feature value, label) pair occurs.
    label_counts = {}
    feature_counts = {}
    for row in data:
        outlook, temp, humidity, windy, label = row
        label_counts[label] = label_counts.get(label, 0) + 1
        if label not in feature_counts:
            feature_counts[label] = {"Outlook": {}, "Temp": {}, "Humidity": {}, "Windy": {}}
        feature_counts[label]["Outlook"][outlook] = feature_counts[label]["Outlook"].get(outlook, 0) + 1
        feature_counts[label]["Temp"][temp] = feature_counts[label]["Temp"].get(temp, 0) + 1
        feature_counts[label]["Humidity"][humidity] = feature_counts[label]["Humidity"].get(humidity, 0) + 1
        feature_counts[label]["Windy"][windy] = feature_counts[label]["Windy"].get(windy, 0) + 1
    return label_counts, feature_counts
label_counts, feature_counts = train_naive_bayes(dataset)
print("Label Counts:", label_counts)
print("Feature Counts:", feature_counts)Label Counts: {'no': 5, 'yes': 9}
Feature Counts: {'no': {'Outlook': {'sunny': 3, 'rainy': 2}, 'Temp': {'hot': 2, 'cool': 1, 'mild': 2}, 'Humidity': {'high': 4, 'normal': 1}, 'Windy': {'false': 2, 'true': 3}}, 'yes': {'Outlook': {'overcast': 4, 'rainy': 3, 'sunny': 2}, 'Temp': {'hot': 2, 'mild': 4, 'cool': 3}, 'Humidity': {'high': 3, 'normal': 6}, 'Windy': {'false': 6, 'true': 3}}}
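As a quick sanity check, the smoothed likelihood worked out in the Theory section can be recomputed directly from these counts (3 sunny rows out of 5 no rows, with $K = 3$ for Outlook):

# P(Outlook = sunny | no) with add-1 smoothing: (3 + 1) / (5 + 3) = 0.5
p_sunny_given_no = (feature_counts['no']['Outlook'].get('sunny', 0) + 1) / (label_counts['no'] + 3)
print(p_sunny_given_no)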
def predict_naive_bayes(x, label_counts, feature_counts):
    total = sum(label_counts.values())
    probs = {}
    feature_names = ["Outlook", "Temp", "Humidity", "Windy"]
    for label in label_counts:
        probs[label] = label_counts[label] / total  # prior P(y)
        for i, feature in enumerate(feature_names):
            value = x[i]  # Get 'sunny', then 'cool', etc.
            count = feature_counts[label][feature].get(value, 0)
            # K = number of unique values seen for this feature across all classes
            num_options = len({v for counts in feature_counts.values() for v in counts[feature]})
            # Laplace (add-1) smoothed likelihood P(x_i | y)
            probs[label] *= (count + 1) / (label_counts[label] + num_options)
    return max(probs, key=probs.get)
test_sample = ['sunny', 'cool', 'high', 'true']
prediction = predict_naive_bayes(test_sample, label_counts, feature_counts)
print("Test Sample:", test_sample)
print("Predicted Class:", prediction)Test Sample: ['sunny', 'cool', 'high', 'true']
Predicted Class: no
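With only four features the raw product of probabilities is fine, but with many features it can underflow to zero. A minimal sketch of the same predictor working in log space (not part of the original code, and assuming the same label_counts/feature_counts structure) looks like this:

import math

def predict_naive_bayes_log(x, label_counts, feature_counts):
    # Same model as above, but sums log-probabilities instead of multiplying raw probabilities.
    total = sum(label_counts.values())
    feature_names = ["Outlook", "Temp", "Humidity", "Windy"]
    log_probs = {}
    for label in label_counts:
        log_probs[label] = math.log(label_counts[label] / total)  # log prior
        for i, feature in enumerate(feature_names):
            count = feature_counts[label][feature].get(x[i], 0)
            num_options = len({v for counts in feature_counts.values() for v in counts[feature]})
            log_probs[label] += math.log((count + 1) / (label_counts[label] + num_options))
    return max(log_probs, key=log_probs.get)

print(predict_naive_bayes_log(test_sample, label_counts, feature_counts))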
Scikit-learn¶
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
df = pd.DataFrame(dataset, columns=['outlook', 'temperature', 'humidity', 'windy', 'play'])
# Encode every column (including the target) as integers
le = LabelEncoder()
for col in df.columns:
    df[col] = le.fit_transform(df[col])
x = df.drop('play', axis=1)
y = df['play']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
clf = GaussianNB()  # treats the encoded integers as continuous, normally distributed features
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
precision recall f1-score support
0 0.50 0.50 0.50 2
1 0.67 0.67 0.67 3
accuracy 0.60 5
macro avg 0.58 0.58 0.58 5
weighted avg 0.60 0.60 0.60 5
Confusion Matrix:
[[1 1]
[1 2]]
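GaussianNB models each label-encoded column as a continuous, normally distributed feature, which is a rough fit for purely categorical data. scikit-learn also provides CategoricalNB, which treats each feature as categorical and, with alpha=1.0, applies the same add-1 smoothing as the manual implementation above. A minimal sketch on the same encoded split (min_categories=3 is only there to guard against feature values that happen not to appear in the small training split):

from sklearn.naive_bayes import CategoricalNB

# alpha=1.0 is add-1 (Laplace) smoothing; min_categories=3 covers all encoded values 0-2
cat_clf = CategoricalNB(alpha=1.0, min_categories=3)
cat_clf.fit(x_train, y_train)
print(classification_report(y_test, cat_clf.predict(x_test)))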