Web Class: Build a Spam Detector using Text Classification

 

Transcript Part 1

Hello everybody, my name is Mohit Deshpande, and in this course we’ll be building an AI that can determine whether an input email is spam or not.

You can see some of the scores that we’re gonna be achieving as we build this AI: 98 to 99% accuracy at determining whether an email is spam or not. And just to mess with it, I tried giving the AI an email that I fabricated myself. Here’s the text of the email, and you can see it correctly determined that this email is spam. We’re gonna be building this AI, so we’re gonna be learning a lot of things.

In particular, we’re gonna look at text classification and discuss some of its challenges, and we’re gonna discuss one group of algorithms called Naive Bayes that is gonna be helpful for our problem. Then we’re also going to discuss term frequency and inverse document frequency, because those give us improvements that help bring up the accuracy of our AI. And we’re gonna be looking at one dataset in particular, called the Enron dataset, that I’m going to go over a little bit in the course. It’s a publicly available dataset, and I’ll provide it in the source code so you can download it. We’re gonna be using that to tell our AI what is a spam email and what is not a spam email.

So we’ve been making video courses since 2012, and we’re super excited to have you on board. Online courses are a fantastic way to learn a new skill, and I take a lot of them myself. ZENVA courses consist mainly of video lessons that you can watch and re-watch at your own pace, as many times as you want. We also have downloadable source code and project files that contain everything that we build in a lesson. It’s highly recommended that you code along with me; in my experience it’s the best way to learn something. And finally, we’ve seen that the students who make the most out of these online courses are the same students who create a plan and stick with it, depending on your own availability and learning style, of course. Remember that you can watch and re-watch these videos as many times as you want, so that really gives you more flexibility to adapt to how you learn. At ZENVA we’ve taught programming and game development to over 200,000 students (now 350,000+) across more than 50 courses since 2012.

Some of these students have used the skills that they’ve learned in these courses to advance their own careers, start a company or publish their own apps and games. Thanks again for joining, and I look forward to seeing all the cool stuff that you’ll be building. Now without further ado, let’s get started.

Transcript Part 2

Hello everybody. In this video, I just wanna introduce you guys to the problem of text classification. I wanna give a brief overview of what it is, go through an example, and then we’ll move on to how we can actually perform this kind of classification.

The problem of text classification is sometimes also called document classification. The challenge with document classification is this: given an input document with some text in it, we want to be able to put it into one or, in some cases, more than one of several bins. For example, if I were given some kind of document, maybe I wanna know whether it’s an invoice. That’s one possible bin it could go into. Or maybe it’s a receipt, so that’s another possible bin, and so on. Just by looking at the text of a document, I want to be able to tell what kind of document it is. As you can probably guess, this requires supervised learning, but it’s a bit more challenging than what we’ve seen before, because instead of just having those X’s and O’s on that nice two-dimensional plane, we’re dealing with text data. Computers are really great at handling numerical data, but with text data they’re not as good.

Humans, on the other hand, can look at text data and it’s fine. Looking at large amounts of numerical data can be tedious and error-prone for humans, but computers love that stuff. So we’re gonna discuss a bit later how we can take textual data and convert it into some kind of numerical representation, because learning algorithms operate on numerical data. We have to find some way to convert the words into a numeric representation so that we can work with them a little bit better.
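As a rough illustration of what a numerical representation of text can look like, here is a minimal Python sketch of a simple word-count (bag-of-words) vector. This is only one possible representation and not necessarily the one used later in the course; the function names and the two toy documents are made up for illustration.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase word to a column index (sorted for a stable order)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Turn one document into a list of word counts, one slot per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical toy documents, invented purely for this sketch.
documents = ["free money free prizes", "research report attached"]
vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(doc, "->", vector)
```

Each document becomes a list of numbers with the same length, which is the kind of input a learning algorithm can work with.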

Now in particular, a good example that’s used for text classification is this: suppose I have an email and I wanna categorize it as spam or not spam, also called ham. A lot of email services already have a built-in way to do this. For example, Gmail or Outlook, Microsoft’s stuff, all these companies already have ways that, given a new input email, detect whether it’s spam or ham. This is a good example to use because it’s simple: there are only two categories. This is really just a binary problem, whether it’s spam or not. And so it’s a bit easier to understand than the earlier example, where you have a document and wanna put it into different classes, because there it might be the case that it belongs in more than one of these classes. I could get a receipt that’s also an invoice, or I can mix and merge these things, but with something like email classification, like spam filtering, it’s either spam or not spam. There’s no combination.

Just to recap, we want to look at the text of the email, the words in this email, because some words are more indicative of an email being spam than others. So we want to look at the text of this email, often including other information, and we want to construct an AI that, given a new email message, can determine if it’s spam or not and then route it to either your inbox or your spam folder. Like I mentioned before, supervised learning is a great way to accomplish this task. I’m gonna give our AI lots of labeled examples of messages that are spam and messages that are not spam, and then it should be able to learn what kind of words or other characteristics spam messages have and then apply that to a new input document.

Like I mentioned, in order to do this, we have to have a numerical representation, but let’s assume we have this right now. Assume it’s already there; we’ll deal with that a bit later. I also wanna mention that text classification isn’t just used for spam filtering. There’s a ton of different applications, and spam filtering is just one of them. There’s also sentiment analysis: given the text of a document, or, very popularly, a tweet or a Facebook post or something like that, I want to be able to determine the sentiment of the creator. I wanna be able to tell whether the sender is angry or upset or other things like that. There are certain words and phrasings that are used in that sense, and we can try to predict the sentiment of the sender from them. This is especially popular in social media, where you can mine lots of social media data and then try to run sentiment analysis on it. There’s been a lot of work there.

Another thing it’s also used for is book classification. What I mean by that is, given some information about a book, like its title, the author or something like that, we wanna be able to tell what genre it is. This is super useful because then you don’t need people making these decisions manually, which might get tedious, or maybe they have more important things to do. This is something that we can delegate to a machine learning algorithm that learns from previous examples. Give it a book, and it can tell what genre it is, for example, or even categorize it further.

Another popular use of this that I wanna mention is called readability. With readability, given a passage of text, I wanna determine what level of reading comprehension you need to understand this passage or, more accurately, given some of the words in this passage, what the expected reading level is. You might notice that at an elementary school or primary school reading level, the words might be just one or two syllables and the sentence structure is fairly simple, but as you go up to more scientific writing or graduate school writing, for example, there are longer, more complicated words, and the sentence structure varies and is more complicated. In readability assessment, we can look at the words and the sentence structure to assess a passage and determine what level of reading comprehension would be assigned to that passage of text. So that’s where I’m gonna stop.

So just to recap, text classification or document classification is the problem of, given some text input, assigning some label to it. I can assign labels like invoice, receipt, medical record and so on. Or the example that’s nice to look at is email filtering: given an email, I just want to say whether it’s spam or not spam, and I can use the words inside the email to help me make this determination. I mentioned that there are plenty of different applications of text classification: sentiment analysis in social media, book classification to tag books with genres, and readability to tag a passage with a level, from primary or elementary school up to scientific, college or grad school. And that is text classification.

Transcript Part 3

Hello everybody, my name is Mohit Deshpande and in this video, I just want to introduce this algorithm that we can use called Naive Bayes, and we’re going to kind of be discussing it over the next few videos.

In this particular video I’m gonna write down and explain the actual equation that we use, and then we’ll look at a concrete example to see how Naive Bayes works there. And then I have some other wrap-up stuff to discuss about Naive Bayes. Like we were mentioning, with textual data we have to have some kind of numeric representation for it. We’re going to be using spam detection as the example for this sequence of videos because it’s a nice example: the equation turns out to be fairly simple and easy to understand. It’s pretty intuitive.

So we’re gonna stick with this notion of spam filtering, or spam detection. With Naive Bayes and spam filtering, it’s logical to assume that spam messages tend to have a different word distribution than messages that are not spam. What I mean by this is, for example, take the word “free”. The word “free” is probably going to be more common in spam messages that are advertising “free money” or “free something or other”. So the word “free” is more associated with spam messages than it is with ham, or not spam, messages. We can work with these probabilities, so that, given an input text, we can determine whether it’s spam or not based on the words that are in that email. So I wanna write down the equation.

Naive Bayes centers around a result from probability, used in many fields, called Bayes’ theorem. I’m gonna write it down first and then I’m going to discuss it intuitively. It might seem a little scary at first, but you’ll see that each part makes sense intuitively. So, the probability that a message is spam, knowing that we have some word, W, in it, is equal to the probability of finding that same word, W, in a spam message times the probability that we actually have a spam message, all divided by the following sum: the same thing as the top, that is, the probability of finding that word in a spam message times the probability that the message is spam to begin with, plus the probability of finding that word in something that is not spam (and this symbol, by the way, just means “not”) times the probability that a message is not spam. This seems big and tedious, but we’ll go through it part by part. I should mention that these Ps mean “probability”, and you can intuitively think of probability as chances or odds. If I roll a six-sided die, what is the probability that it lands on a five? Well, it’s just one out of six, because I have one outcome that I want, which is landing with the five side up, and there are six possible outcomes.

So, intuitively, that’s also what’s going on in these probabilities: here’s the outcome we want, and then here are all the possible outcomes. That’s all the Ps mean, probability. You might see a capital P, or Pr with a capital P and a lowercase r. I’m just gonna write it like this. And this bar means “given”. What I mean by given is that this is a conditional probability, because it depends on the word that we know is there. So, anyway, I just wanna go through this intuitively. It seems a little scary at first, but let’s get through this. This first term here is what we want to find. We want to know the probability that this message that we received is spam, given that it has some word, W, in it. Maybe I should make this clear: S is spam, this one is ham, or not spam, and W is just a word, any word. So this is the probability that the input message is spam given that it has some word in it. And this is equal to the probability that we find that word in a spam message times this probability of S, which means: what’s the probability that this message is even spam to begin with?
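For reference, the equation being described out loud can be written as follows. The notation is reconstructed from the spoken description: S stands for "the message is spam", W for "the message contains the word", the bar means "given", and the ¬ symbol means "not".

```latex
P(S \mid W) = \frac{P(W \mid S)\, P(S)}{P(W \mid S)\, P(S) + P(W \mid \neg S)\, P(\neg S)}
```

The numerator appears again as the first term of the denominator, which is why only four distinct values are needed to evaluate it.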

We can have general statistics as to what percent of the email that you get is spam versus ham. So this is the probability that any email that you get is spam to begin with, and this other one is the probability that it’s not spam. What’s probably gonna happen, and we’re also gonna see this in practice, is that the spam probability is actually pretty high. You probably get more spam than you do ham messages, so expect this to be close to one. And this other term is the probability that we find the word in a spam message. This is like what I mentioned with the word “free”: it’s more likely to find that word in a spam message than in a ham message. So what this top quantity looks at is the odds of finding this word, W, in a spam message. Then we divide by this sum, which covers all possible outcomes. One possible outcome is that this word is in a message that is spam.

But then there’s also maybe a smaller likelihood: you also wanna know the odds of finding this word in a message that’s not spam. So you look at a word and see whether it’s more likely to fall into a spam message or into a not spam message, and we can make a determination based on which of those two outcomes has the higher probability. If this output probability says it’s more likely to be a spam message, then we’re gonna assign it to be a spam message. That’s how we can make a determination, and we’re going to go through a more concrete example of this. But intuitively, this is like saying that the probability of a spam message given some word is the likelihood, first of all, that we have a spam message to begin with, times the probability of that word being in a spam message, divided by the probability that we actually encounter this word in the first place. That’s what this is trying to say.

So the bottom is the probability of the word given that it’s in spam, times the probability of spam, plus the probability of the word given that it’s in not spam, times the probability of not spam. There are really only two outcomes, right? Either the word is in a spam message or it’s in a message that’s not spam, and this is what the sum is trying to account for. Then we find the probability that the message is spam. Say this is high. What does that mean? It means that, given this word is in it, this email is probably spam. I’m gonna stop right here because I don’t want to complicate this further. We’re actually going to go through an example of this in the next video. But first let me just explain this one more time.

Naive Bayes centers around this principle of Bayes’ theorem, and intuitively, what we wanna find is the probability that any input email is spam given that it has some word in it. This is a logical thing to look at because we look at the content of an email to determine whether it’s spam or ham. So we want the probability that this is a spam message given that it has a particular word, W, in it. That’s equal to the probability that this word appears in spam messages times the overall probability that we even get spam messages to begin with. And then the bottom, like I said, separates this into two parts: this is W in spam messages, and this is W in ham, or not spam, messages.

So these are the two outcomes that we can get for W. The bottom is also commonly called our evidence. And we can find all of these numbers based on our data. We can look at each word in all the emails in our training data and see what the likelihood is that it’s in a spam message: oh hey, I found this word in all these spam messages this many times, and then I can figure out these probabilities. So all these things I can calculate from my dataset. Then, given a new input text, I can compute this probability, and if it’s super high then I know that this message is spam.
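To make "calculate these from my dataset" concrete, here is a minimal Python sketch that estimates the four quantities by simple counting. It assumes the data is a list of (text, is_spam) pairs with at least one spam and one ham message; the function name and the tiny training set are hypothetical and made up for illustration, not taken from the Enron dataset.

```python
def estimate_probabilities(messages, word):
    """Estimate P(S), P(not S), P(W|S) and P(W|not S) by counting labeled messages.

    `messages` is a list of (text, is_spam) pairs; both classes must be non-empty.
    """
    spam = [text for text, is_spam in messages if is_spam]
    ham = [text for text, is_spam in messages if not is_spam]

    def fraction_containing(texts):
        # Share of messages in `texts` that contain the word at least once.
        return sum(word in text.lower().split() for text in texts) / len(texts)

    return {
        "p_spam": len(spam) / len(messages),
        "p_ham": len(ham) / len(messages),
        "p_word_given_spam": fraction_containing(spam),
        "p_word_given_ham": fraction_containing(ham),
    }

# Hypothetical, made-up labeled examples (True means spam).
training = [
    ("free money now", True),
    ("claim your free prize", True),
    ("cheap pills online", True),
    ("meeting moved to friday", False),
    ("feel free to call me", False),
]
estimates = estimate_probabilities(training, "free")
print(estimates)
```

With a real dataset the same counting idea applies, just over many more messages and for every word of interest.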

So I’m gonna stop right here and in the next video I’m actually gonna go through a more concrete example so that we can understand this a bit better.

Transcript Part 4

Hello everybody. My name is Mohit Deshpande and in this video, I want to go through a more concrete example of applying this Naive Bayes approach to seeing whether a message is Spam or not.

So I wanna substitute concrete values for these so we can see how it works. I’m gonna pick the values so that it’s obvious that this is a good approach and that it works. So let’s use an example, and I’m gonna give you some of these numbers. First, I’ll give you the probability that a message is Spam. Again, this is something that you can compute from your dataset, but I’m just gonna use an overall statistic.

There have been studies that try to find this value, and it’s been shown that about 86% of messages that you get are Spam. Logically, if 86% of the messages you get are Spam, then the other 14% must be messages that are not Spam. So I can write that the probability of not Spam is 0.14, and this follows logically because a message is either Spam or not Spam. Now suppose that the word we’re looking at is the word “free”. We’re still missing some values. We have values here, and here, and here, so we’ve got three values down and we still need three more. Really, we just need two more, because these two are the same. We’re looking at the word “free” because Spam messages, just like I mentioned, tend to have the word “free” in them, and the regular emails that you get, Ham emails, generally don’t have the word, but they could. Let’s make this really obvious. Let’s suppose that the probability that we’d find the word “free” in a Spam message is something really high, like 0.96.

What this is saying is that the odds of finding the word “free” in a Spam message are 96%. That is, if we know that a message is Spam, then the word “free” appears in it 96% of the time. What we wanna find is whether this new message that we received is Spam, given that it has the word “free” in it. Based on just looking at this number, you’re probably already thinking that this is probably gonna be a Spam message because, in our dataset, this is such a high probability. But let’s suppose that the probability of finding “free” in a message that’s not Spam is something like 0.02. So in very few cases, 2% of the time, we find the word “free” in something that is not Spam. And it’s important to note that these two values are not related, so you can’t do the same thing you did with Spam and not Spam. You can’t say, well, if 0.96 is “free” in a Spam message, then why don’t I have 0.04 here? Because 0.04 is actually the probability of not finding “free” in a Spam message, so these two things are not related at all. So now we actually have enough values to plug in and find our answer.

Just off the bat, and I tried to make this obvious, you can probably tell that this message is gonna be Spam, because the probability that you find the word “free” in a Spam message is 96%, so finding the word “free” is a pretty good indicator that your message is Spam. So let’s actually go through the computation and find this probability, and we expect it to be really high. Let’s plug in values. I wanna find the probability that the message is Spam given that it has the word “free” in it. That’s equal to the probability of finding the word “free” in Spam messages times the overall probability that any given input is Spam to begin with, divided by the same thing again plus the probability of finding the word “free” in a message that is not Spam times the probability that a message is not Spam.

And so I can plug in values here. The probability of finding the word “free” in a Spam message is 0.96, times the probability that I actually have a Spam message, 0.86, divided by the same quantity again, because this is one possible outcome and the bottom must cover all outcomes, plus the probability of finding the word “free” in a Ham message, which was only 0.02, times the probability of getting a message that is not Spam, 0.14. When I compute all of this out, I get about 0.9966. So with this value, from our dataset or wherever you get these values from, we’re almost positive that this input email is a Spam message, and we did that by knowing just these four values. And these values are actually things that we’d learn from our dataset, given that it has labeled examples of Spam and Ham messages.
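The arithmetic can be checked with a few lines of Python. This is just a sketch of the single-word calculation using the numbers from the example (0.96, 0.86 and 0.02); the function and parameter names are made up for illustration.

```python
def posterior_spam(p_word_given_spam, p_spam, p_word_given_ham):
    """Probability that a message is spam given that it contains the word (Bayes' theorem)."""
    p_ham = 1.0 - p_spam                            # a message is either spam or ham
    numerator = p_word_given_spam * p_spam          # word appears AND message is spam
    evidence = numerator + p_word_given_ham * p_ham  # word appears at all
    return numerator / evidence

# Values from the worked example: P(free|Spam)=0.96, P(Spam)=0.86, P(free|not Spam)=0.02
result = posterior_spam(0.96, 0.86, 0.02)
print(round(result, 4))  # 0.9966
```

The numerator works out to 0.8256 and the denominator to 0.8256 + 0.0028 = 0.8284, which gives the 0.9966 above.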

So we can compute this probability: we can know the probability of receiving a Spam message based on our dataset. Likewise, we also know this one. And then we can find these probabilities for our word or, in practice, for our words. There’s actually a problem with this approach that I’m gonna address in the next video, which is that we don’t really look at just a single word; we look at all the words in our email. I’m also gonna talk about why it’s called Naive Bayes in the next video, because of an assumption that we make.

But yeah, we can see that this is actually a pretty good way to determine whether an email is Spam or not: look at the likelihood that I find this word, or any word, in a Spam message, look at the likelihood that I find it in a message that isn’t Spam, and then compute that probability. Now suppose that, instead of “free”, I had something like “report” or “research”. What’s probably gonna happen is that the probability of finding the word “research” in a Spam message is gonna be pretty low, or at least lower than the probability of finding the word “research” in a Ham message, like the ones in my university account, for example. Based on that, the probability that this message is Spam, given that it has the word “research” in it, would be pretty low, and if it’s low then I can conclude that this message is not Spam.

And the probability that a message is Spam given that it has the word “free” in it and the probability that it’s not Spam given that it has the word “free” in it are indeed related, by the way. If the probability that a message is Spam given it has the word “free” is 0.9966, then the probability that it’s not Spam, that it’s Ham, is gonna be 0.0034, and that’s a very low probability. I wanna pick whichever one of these gets me the highest probability. In this case, it’s far more likely that this input email is Spam instead of Ham, and so I can route it to my Spam folder. I should mention that the algorithms and techniques that companies use are proprietary, and they’re probably far more complicated than this, but this approach is actually not that bad. It’s certainly easier to understand than some of the more advanced techniques, so this is what we’ll be looking at. I’m gonna stop right here and just do a quick recap. We actually ran through a computation using the Naive Bayes technique to determine if a message was Spam, given that it has the word “free” in it.
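The "pick whichever is more likely" step can be sketched in Python as well. This is a minimal single-word illustration, not how a production spam filter works; the function name and return format are assumptions made for this sketch.

```python
def classify(p_word_given_spam, p_spam, p_word_given_ham):
    """Return the more likely label plus both posteriors for a message containing the word."""
    p_ham = 1.0 - p_spam
    spam_score = p_word_given_spam * p_spam   # numerator for the Spam posterior
    ham_score = p_word_given_ham * p_ham      # numerator for the Ham posterior
    evidence = spam_score + ham_score         # shared denominator
    p_spam_given_word = spam_score / evidence
    p_ham_given_word = ham_score / evidence
    label = "spam" if p_spam_given_word > p_ham_given_word else "ham"
    return label, p_spam_given_word, p_ham_given_word

# Same example values as before: 0.96, 0.86 and 0.02.
label, p_s, p_h = classify(0.96, 0.86, 0.02)
print(label, p_s, p_h, p_s + p_h)
```

Because both posteriors share the same denominator, they always add up to one, so comparing the two numerators is enough to choose the label.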

I gave you some numbers here. The probability of getting any message that’s Spam is pretty high and, consequently, the probability that you’re getting a message that is Ham is pretty low. Then the algorithm looks at the dataset and says, well, based on my experience, have I seen the word “free” in Spam messages or in Ham messages? Based on that, I can tell whether this is Spam or Ham, because I just look at how many times I’ve seen this word in Spam messages versus Ham messages and compute this probability. We found that this message is almost 100% likely to be Spam. And so that’s how we can do a computation with Naive Bayes.

All these values are things that we can learn, or that algorithms can automatically compute for us. The problem with this is that we’re only looking at a single word, but emails are sequences of words, so we wanna be able to consider a sequence of words. I’m gonna address how we can do that in the next video.
