The challenge of text classification is to attach labels to bodies of text (e.g., tax document, medical form) based on the text itself. For example, think of your spam folder in your email. How does your email provider know that a particular message is spam or “ham” (not spam)? We’ll take a look at one natural language processing technique for text classification called **Naive Bayes**. Download the full code here.

#### Bayes Theorem

(credit: https://en.wikipedia.org/wiki/Bayes%27_theorem#/media/File:Bayes%27_Theorem_MMB_01.jpg)

Before we derive the algorithm, we need to discuss the fundamental rule that Naive Bayes uses: Bayes Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where $A$ and $B$ are events and $P(B) \neq 0$.

Let’s take a second to break this down. On the left, we have $P(A \mid B)$: the probability of an event $A$ occurring given that event $B$ has occurred, called the **posterior**. On the right, $P(B \mid A)$ is the probability of $B$ given $A$ (the **likelihood**), $P(A)$ is the **prior** probability of $A$, and $P(B)$ is the overall probability of $B$ (the **evidence**).

The classic example used to illustrate Bayes Theorem involves medical testing. Let’s suppose that we were getting tested for the flu. When we get a medical test, there are really 4 cases to consider when we get the results back:

- **True Positive**: The test says we have the flu and we actually have the flu
- **True Negative**: The test says we don’t have the flu and we actually don’t have the flu
- **False Positive**: The test says we have the flu and we actually *don’t* have the flu
- **False Negative**: The test says we *don’t* have the flu and we actually *do* have the flu

Suppose we also know some information about the flu and our testing methodology: we know our test can correctly detect whether a person has the flu 99.5% of the time (i.e., the **true positive rate** and **true negative rate** are both 99.5%). We also know that this specific type of flu is rare and only affects 1% of people. Given this information, we can compute the probability that a randomly selected person who tests positive actually has this specific type of the flu. Specifically, we want to compute the probability that the person has the flu, *given* that the person tested positive for it, i.e., $P(\text{Flu} \mid +)$.

Let’s just substitute the problem specifics into Bayes Theorem:

$$P(\text{Flu} \mid +) = \frac{P(+ \mid \text{Flu})\,P(\text{Flu})}{P(+)}$$

Now let’s try to figure out specific values for the quantities on the right-hand side. The first quantity, $P(+ \mid \text{Flu})$, is the probability of testing positive given that the person has the flu; that’s our true positive rate of 99.5%, or 0.995. The second, $P(\text{Flu})$, is the **prior probability**. In other words, it is the probability that any random person has the flu. We know from our problem that this number is 1%, or 0.01. Let’s substitute in those values in the numerator:

$$P(\text{Flu} \mid +) = \frac{0.995 \times 0.01}{P(+)}$$

Now we have to deal with the denominator: $P(+)$, the overall probability that the test returns positive.

When can our test return positive? There are two cases: either our test returns positive and the person actually has the flu (true positive), or our test returns positive and the person does not have the flu (false positive). We can’t simply sum both of these cases to get the denominator; we have to weight them by their respective probabilities, i.e., the probability that any person has the flu overall and the probability that any person *does not* have the flu overall. Let’s expand the denominator:

$$P(+) = P(+ \mid \text{Flu})\,P(\text{Flu}) + P(+ \mid \text{No Flu})\,P(\text{No Flu})$$

Now let’s reason about these values. $P(+ \mid \text{Flu})\,P(\text{Flu})$ is exactly the numerator we already have: $0.995 \times 0.01$. The false positive rate $P(+ \mid \text{No Flu})$ is $1 - 0.995 = 0.005$, and $P(\text{No Flu}) = 1 - 0.01 = 0.99$. Plugging everything in:

$$P(\text{Flu} \mid +) = \frac{0.995 \times 0.01}{0.995 \times 0.01 + 0.005 \times 0.99} = \frac{0.00995}{0.0149} \approx 0.67$$

This result is a little surprising! This is saying, despite our test’s accuracy, knowing someone tested positive means that there’s only a 67% chance that they actually have the flu! The intuition is that the flu is so rare that the small false positive rate, applied to the large healthy population, produces almost as many false positives as there are true positives. Hopefully, this example illustrated how to use Bayes Theorem.
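The arithmetic above is easy to sanity-check in a few lines of Python; the numbers are just the ones from this example:

```python
# Flu-test example: compute P(flu | positive test) with Bayes Theorem.
p_flu = 0.01        # prior: 1% of people have this flu
tpr = 0.995         # true positive rate, P(+ | flu)
fpr = 1 - 0.995     # false positive rate, P(+ | no flu)

# Denominator: P(+) = P(+|flu)P(flu) + P(+|no flu)P(no flu)
p_positive = tpr * p_flu + fpr * (1 - p_flu)

# Bayes Theorem
p_flu_given_positive = (tpr * p_flu) / p_positive
print(round(p_flu_given_positive, 4))  # 0.6678
```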

#### Deriving Naive Bayes

Now let’s convert the Bayes Theorem notation into something slightly more machine learning-oriented.

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where $H$ is the **hypothesis** and $E$ is the **evidence**.

Let’s break this down again like we did for the original Bayes Theorem, except we’ll use the context of the text classification problem we’re trying to solve: spam detection. Our hypothesis $H$ is something like “this message is spam,” and the evidence $E$ is the text of the message itself.

There’s something a bit off with this formulation though: the evidence needs to be represented as multiple pieces of evidence: the words $w_1, \dots, w_n$ of the message. So we rewrite the formulation with the evidence split into individual words:

$$P(\text{Spam} \mid w_1, \dots, w_n) = \frac{P(w_1, \dots, w_n \mid \text{Spam})\,P(\text{Spam})}{P(w_1, \dots, w_n)}$$

Excellent! We can use the chain rule of conditional probability to expand out the numerator:

$$P(w_1, \dots, w_n \mid \text{Spam})\,P(\text{Spam}) = P(w_1 \mid w_2, \dots, w_n, \text{Spam}) \cdots P(w_{n-1} \mid w_n, \text{Spam})\,P(w_n \mid \text{Spam})\,P(\text{Spam})$$

Not only does this look messy, it’s also quite messy to compute! Let’s think about the first term: $P(w_1 \mid w_2, \dots, w_n, \text{Spam})$. To estimate it, we would have to count how often $w_1$ appears alongside every possible combination of the other words in spam messages; with any realistic dataset, most of those combinations never appear even once.

#### Naive Bayes Assumption

To help us with that equation, we can make an assumption called the **Naive Bayes assumption** to help us with the math, and eventually the code. **The assumption is that each word is independent of all other words.** *In reality, this is not true!* Knowing which words come before/after a word *does* influence it! However, making this assumption greatly simplifies the math and, in practice, works well! This assumption is why this technique is called *Naive* Bayes. So after making that assumption, we can break down the numerator into the following:

$$P(w_1, \dots, w_n \mid \text{Spam})\,P(\text{Spam}) = P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})$$

This looks better! Now we can interpret a term $P(w_i \mid \text{Spam})$ as the probability of seeing the word $w_i$ in a spam message, which is something we can estimate directly from our training data by counting.

This is the Naive Bayes formulation! This returns the probability that an email message is spam given the words in that email. For text classification, however, we need an actual label, not a probability, so we simply say that an email is spam if its spam score beats its ham score:

$$\text{Spam}(w_1, \dots, w_n) = P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \;>\; P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})$$

(where we call each side a **score** since, after dropping the shared denominator $P(w_1, \dots, w_n)$, this is not a true probability anymore).
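To make the score comparison concrete, here is a toy sketch; the priors, word probabilities, and the two-word message are all made up purely for illustration:

```python
# Hypothetical, hand-picked values for illustration only.
prior = {'spam': 0.4, 'ham': 0.6}
p_word = {
    'spam': {'free': 0.20, 'meeting': 0.01},
    'ham':  {'free': 0.02, 'meeting': 0.10},
}

message = ['free', 'meeting']

# Score each class: P(class) * product of P(word | class)
scores = {}
for c in ('spam', 'ham'):
    score = prior[c]
    for w in message:
        score *= p_word[c][w]
    scores[c] = score

# spam: 0.4 * 0.20 * 0.01 = 0.0008
# ham:  0.6 * 0.02 * 0.10 = 0.0012  -> ham wins
prediction = max(scores, key=scores.get)
print(prediction)  # ham
```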

#### Numerical Stability

There’s one extra thing we’re going to do to help us with **numerical stability**. If we look at the numerator, we see we’re multiplying many probabilities together. If we do that, we could end up with *really* small numbers, and our computer might round down to zero! To prevent this, we’re going to look at the **log probability** by taking the log of each side. Using some properties of logarithms, namely $\log(ab) = \log a + \log b$, we can manipulate our Naive Bayes formulation:

$$\log \text{Spam}(w_1, \dots, w_n) = \log P(\text{Spam}) + \sum_{i=1}^{n} \log P(w_i \mid \text{Spam})$$

Now we’re dealing with *additions* of log probabilities instead of *multiplying* many probabilities together! Since log has really nice properties (monotonicity being the key one), we can still take the highest score to be our prediction, i.e., we don’t have to “undo” the log!
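Here is a quick demonstration of the underflow problem the log trick avoids; the per-word probability of 0.001 and the 200-word message length are arbitrary stand-ins:

```python
import math

p = 0.001   # a small per-word probability (arbitrary)
n = 200     # number of words in a hypothetical message (arbitrary)

# Multiplying many small probabilities underflows to exactly 0.0.
product = 1.0
for _ in range(n):
    product *= p
print(product)   # 0.0 -- the true value, 1e-600, underflows a 64-bit float

# Summing log probabilities stays comfortably representable.
log_sum = sum(math.log(p) for _ in range(n))
print(log_sum)   # about -1381.55
```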

#### Dataset

We’ll be using the Enron email dataset for our training data. This is *real* email data from the Enron Corporation after the company collapsed. Before starting, download all of the numbered folders, i.e., enron1, enron2, etc., here and put all of them in a top-level folder simply called enron. Put this enron folder in the same directory as your source code so we can find the dataset!

**A WORD OF WARNING!: Since this dataset is a real dataset of emails, it contains real spam messages. Your anti-virus may prune some of these emails because they are spam. Let your anti-virus prune as many as it wants. This will not affect our code as long as there are some spam and ham messages still there!**

#### Naive Bayes Code

Here is the dataset-loading code:

```python
import os
import re
import string
import math

DATA_DIR = 'enron'
target_names = ['ham', 'spam']

def get_data(DATA_DIR):
    subfolders = ['enron%d' % i for i in range(1, 7)]

    data = []
    target = []
    for subfolder in subfolders:
        # spam
        spam_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'spam'))
        for spam_file in spam_files:
            with open(os.path.join(DATA_DIR, subfolder, 'spam', spam_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(1)

        # ham
        ham_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'ham'))
        for ham_file in ham_files:
            with open(os.path.join(DATA_DIR, subfolder, 'ham', ham_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(0)

    return data, target
```

This will produce two lists: the data list, where each element is the text of an email, and the target list, which is simply binary (1 meaning spam and 0 meaning ham). Now let’s create a class and add some helper functions for string manipulation.

```python
class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""

    def clean(self, s):
        translator = str.maketrans("", "", string.punctuation)
        return s.translate(translator)

    def tokenize(self, text):
        text = self.clean(text).lower()
        return re.split(r"\W+", text)

    def get_word_counts(self, words):
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts
```

We have a function to clean up our string by removing punctuation, one to tokenize our string into words, and another to count up how many of each word appears in a list of words.
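As a quick check of what these helpers produce, here is a standalone sketch that mirrors the methods above (rewritten as plain functions so it runs on its own; the sample strings are made up):

```python
import re
import string

def clean(s):
    # strip punctuation, as in SpamDetector.clean
    return s.translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # lowercase and split on non-word characters, as in SpamDetector.tokenize
    return re.split(r"\W+", clean(text).lower())

def get_word_counts(words):
    # frequency of each word, as in SpamDetector.get_word_counts
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0.0) + 1.0
    return word_counts

print(tokenize("Get FREE money now!!!"))
# ['get', 'free', 'money', 'now']
print(get_word_counts(tokenize("free money, FREE money")))
# {'free': 2.0, 'money': 2.0}
```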

Before we write the actual code, let’s first understand the algorithm. For training, we need three things: the (log) class priors, i.e., the probability that any given message is spam/ham; a vocabulary of words; and word frequencies for spam and ham separately, i.e., the number of times a given word appears in spam and in ham messages. Given a list of input documents, we can write this algorithm.

- Compute log class priors by counting how many messages are spam/ham, dividing by the total number of messages, and taking the log.
- For each (document, label) pair, tokenize the document into words.
- For each word, add it to the vocabulary for that document’s class (spam or ham) if it isn’t already there, and update its count. Also add that word to the global vocabulary.

```python
    def fit(self, X, Y):
        self.num_messages = {}
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = set()

        n = len(X)
        self.num_messages['spam'] = sum(1 for label in Y if label == 1)
        self.num_messages['ham'] = sum(1 for label in Y if label == 0)
        self.log_class_priors['spam'] = math.log(self.num_messages['spam'] / n)
        self.log_class_priors['ham'] = math.log(self.num_messages['ham'] / n)
        self.word_counts['spam'] = {}
        self.word_counts['ham'] = {}

        for x, y in zip(X, Y):
            c = 'spam' if y == 1 else 'ham'
            counts = self.get_word_counts(self.tokenize(x))
            for word, count in counts.items():
                if word not in self.vocab:
                    self.vocab.add(word)
                if word not in self.word_counts[c]:
                    self.word_counts[c][word] = 0.0
                self.word_counts[c][word] += count
```

First, we can compute the log class priors by counting up how many spam/ham messages are in our dataset and dividing by the total number. Finally, we take the log.

Then we can iterate through our dataset. For each input, we get the word counts and iterate through each (word, frequency) pair. If the word isn’t in our global vocabulary, we add it. If it isn’t in the vocabulary for that particular class label, we also add it along with the frequency.

For example, suppose we had a “spam” message. We count up how many times each unique word appears in that spam message and add that count to the “spam” vocabulary. Suppose the word “free” appears 4 times. Then we add the word “free” to our global vocabulary and add it to the “spam” vocabulary with a count of 4.

We’re keeping track of the frequency of each word as it appears in either a spam or ham message. For example, we expect the word “free” to appear in both spam and ham messages, but we expect it to be more frequent in the “spam” vocabulary than the “ham” vocabulary.

Now that we’ve extracted all of the data we need from the training data, we can write another function to actually output the class label for new data. To do this classification, we apply Naive Bayes directly. Given a document, we iterate through each of its words, compute the log probability of each word under both classes, and sum them up per class. Then we add the log class priors and check to see which score is bigger for that document. Whichever is larger, that is the predicted label!

To compute $P(w_i \mid \text{Spam})$, we take the number of times $w_i$ appears in spam messages and divide by the total number of word tokens across all spam messages (and likewise for ham).

One additional note: remember that the log of 0 is undefined! What if we encounter a word that is in the “spam” vocabulary, but not the “ham” vocabulary? Then $P(w_i \mid \text{Ham})$ would be 0 and its log undefined. To handle this, we use **Laplace Smoothing**. We simply add 1 to the numerator, but we also have to add the size of the vocabulary to the denominator to balance it:

$$P(w_i \mid \text{Spam}) = \frac{\text{count}(w_i, \text{Spam}) + 1}{\sum_{w \in V} \text{count}(w, \text{Spam}) + |V|}$$
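For instance, with hypothetical numbers (a word never seen in ham during training, 1,000 total ham word tokens, and a 5,000-word vocabulary), the smoothed estimate looks like this:

```python
count_w = 0        # times the word appeared in ham messages (hypothetical)
total_ham = 1000   # total word tokens across ham messages (hypothetical)
vocab_size = 5000  # size of the global vocabulary (hypothetical)

# Without smoothing this would be 0/1000 = 0, and log(0) is undefined.
# Laplace smoothing: add 1 on top and |vocab| on the bottom.
p = (count_w + 1) / (total_ham + vocab_size)
print(p)  # 1/6000, about 0.000167
```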

```python
    def predict(self, X):
        result = []
        # total number of word tokens seen in each class during training
        n_spam = sum(self.word_counts['spam'].values())
        n_ham = sum(self.word_counts['ham'].values())
        for x in X:
            counts = self.get_word_counts(self.tokenize(x))
            spam_score = 0
            ham_score = 0
            for word, _ in counts.items():
                if word not in self.vocab:
                    continue

                # add Laplace smoothing: +1 in the numerator,
                # +|vocab| in the denominator
                log_w_given_spam = math.log(
                    (self.word_counts['spam'].get(word, 0.0) + 1) /
                    (n_spam + len(self.vocab)))
                log_w_given_ham = math.log(
                    (self.word_counts['ham'].get(word, 0.0) + 1) /
                    (n_ham + len(self.vocab)))

                spam_score += log_w_given_spam
                ham_score += log_w_given_ham

            spam_score += self.log_class_priors['spam']
            ham_score += self.log_class_priors['ham']

            if spam_score > ham_score:
                result.append(1)
            else:
                result.append(0)
        return result
```

In our case, the input can be a list of document texts; we return a list of predictions. Finally, we can use the class like this.

```python
if __name__ == '__main__':
    X, y = get_data(DATA_DIR)
    MNB = SpamDetector()
    MNB.fit(X[100:], y[100:])

    pred = MNB.predict(X[:100])
    true = y[:100]

    accuracy = sum(1 for i in range(len(pred)) if pred[i] == true[i]) / float(len(pred))
    print("{0:.4f}".format(accuracy))
```

We reserve the first 100 messages as the testing set, “train” our Naive Bayes classifier on the rest, then compute the accuracy on the test set.

To recap, we reviewed Bayes Theorem and demonstrated how to use it with an example. Then we re-worked it using hypotheses and evidence instead of just events $A$ and $B$, made the Naive Bayes assumption that each word is independent of the others, switched to log probabilities for numerical stability, and implemented a spam detector trained on the Enron email dataset.

Naive Bayes is a simple text classification algorithm that uses basic probability laws and works quite well in practice!