# Naive Bayes Classifier: A Supervised Learning Algorithm

Naive Bayes is a classification algorithm based on Bayes' theorem. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Combined with collaborative filtering, a Naive Bayes classifier can power a recommendation system that filters out useful information and provides good recommendations to the user. It is widely used in spam filters and, more generally, in text classification, where its independence assumption gives it a high success rate on multinomial problems. Naive Bayes is also very fast, so it can be used to solve problems in real time.

Let’s imagine a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, all of these properties independently contribute to the probability that the fruit is an apple, and that is why the algorithm is known as ‘naive’.

Let’s dive into the formula and the mathematics behind it, shall we?

Naive Bayes is basically Bayes’ rule from probability with a naive assumption placed on top to make life simple. Bayes’ rule is a concept most of us have seen at one place or another, very similar to high school mathematics. (I hope we paid attention then, haha)

Bayes Rule:

P(Y|X) = P(Y) * P(X|Y) / P(X)
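As a quick sanity check, here is a minimal sketch of Bayes' rule in code, using made-up spam-filter numbers (all probabilities below are hypothetical, purely for illustration):

```python
def bayes_rule(p_y, p_x_given_y, p_x):
    """P(Y|X) = P(Y) * P(X|Y) / P(X)"""
    return p_y * p_x_given_y / p_x

# Hypothetical numbers: suppose 20% of emails are spam, the word
# "offer" appears in 40% of spam emails, and in 10% of all emails.
p_spam_given_offer = bayes_rule(p_y=0.2, p_x_given_y=0.4, p_x=0.1)
print(p_spam_given_offer)  # 0.8
```

So under these (invented) numbers, seeing the word "offer" raises the probability of spam from 20% to 80%.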

Now, let’s discuss how we use Naive Bayes in classification problems and what this “naive assumption” is that we mentioned!

So let’s assume we have a dataset with n features, and we want to predict the value of Y.

X can be represented as <X1, X2, …, Xn>, and Y is a Boolean variable that can take only two values.

The naive assumption we make is: all Xi are conditionally independent given Y. This means that, given the value of Y, Xi doesn’t care what some other Xj is (for i != j). In the fruit example above, the features are color, shape, and diameter, and when we apply this algorithm we assume them to be independent of each other once we know the fruit is an apple.

So, the term P(X | Y) on the right-hand side of the formula simply becomes a product of n terms, P(Xi | Y), where i varies from 1 to n.
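The factorization is just a product of per-feature conditionals. A tiny sketch, with hypothetical probabilities for the apple example:

```python
import math

def likelihood(per_feature_probs):
    """Naive assumption: P(X1, ..., Xn | Y) = product of P(Xi | Y)."""
    return math.prod(per_feature_probs)

# Hypothetical conditionals for the apple example:
# P(red | apple) = 0.7, P(round | apple) = 0.9, P(~3 in | apple) = 0.6
print(likelihood([0.7, 0.9, 0.6]))  # 0.378
```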

To understand why this helps in classification problems, one must understand what is required when predicting probabilities for a target variable.

We want to predict the value of Y given a bunch of features, X.

We want P(Y | X); to find this directly we would need the joint probability distribution of X and Y. This is the main problem, as estimating the joint distribution is a difficult task with limited data.

For n Boolean features we would need to estimate 2 ^ n probabilities/parameters. By making the conditional-independence assumption, we bring the number of parameters down to linear in n.

2 * n – 1 to be exact.
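The gap between exponential and linear growth is easy to see numerically. A small sketch (using the 2 * n – 1 count quoted above; conventions vary slightly depending on how the prior is counted):

```python
# Compare the parameter counts: estimating P(Y|X) directly is
# exponential in n, while the naive assumption keeps it linear.
for n in (5, 10, 20, 30):
    print(n, 2 ** n, 2 * n - 1)
```

Already at n = 30 features, the direct approach would require over a billion parameters, versus 59 under the naive assumption.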

How do we estimate them? We can use either Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) estimation.
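For Boolean features, the MLE is just frequency counting. A minimal sketch on a tiny made-up dataset (the data and helper below are purely illustrative):

```python
# Tiny hypothetical Boolean dataset: ((x1, x2), y) rows.
data = [
    ((1, 0), 1),
    ((1, 1), 1),
    ((0, 0), 0),
    ((1, 0), 0),
]

def mle_conditional(data, feature_index, y_value):
    """MLE of P(X_i = 1 | Y = y): count(X_i = 1 and Y = y) / count(Y = y)."""
    rows = [x for x, y in data if y == y_value]
    return sum(x[feature_index] for x in rows) / len(rows)

print(mle_conditional(data, 0, 1))  # 1.0: X1 = 1 in both Y = 1 rows
```

MAP estimation works the same way but adds pseudo-counts from a prior, which in practice looks like Laplace (add-one) smoothing.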

## Real-Life Example of the Naive Bayes Algorithm

A real-life example of Naive Bayes is filtering spam emails. Naive Bayes classifiers are often used in text classification because they perform well on multi-class problems under the independence assumption.

A detailed article on how Naive Bayes Algorithm filters spam messages is : https://towardsdatascience.com/naïve-bayes-spam-filter-from-scratch-12970ad3dae7

Advantages of Naive Bayes:

- Easy and quick to implement for predicting the class of a test data set.
- Performs well in multi-class prediction too.
- If the independence assumption holds, a Naive Bayes classifier performs better than other models like logistic regression, and you need less training data.
- Performs great in the case of categorical input variables.
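Putting the pieces together, here is a self-contained sketch of a multinomial Naive Bayes spam filter trained on a toy, made-up corpus, with Laplace smoothing (this is an illustration, not a production implementation):

```python
import math
from collections import Counter, defaultdict

# Toy, hypothetical training corpus: (text, label) pairs.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]

# Count word frequencies per class, class frequencies, and the vocabulary.
word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def predict(text):
    """Pick the class maximizing log P(Y) + sum of log P(word | Y),
    with Laplace (add-one) smoothing to avoid zero probabilities."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("free money"))  # spam
```

Working in log space avoids numerical underflow when multiplying many small conditionals, which matters once real documents have hundreds of words.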