**Transcript Part 1**

– Hello everybody. My name is Mohit Deshpande. And before we get into our main topic of neural networks, I first wanna talk a little bit about where they come from.

And in particular, in this video I just wanna very briefly go over kind of the inspiration for neurons and this topic of neural networks. ‘Cause this is not a new topic; neural networks have been around for decades. And even some of the more advanced techniques, like convolutional neural networks, have been around for decades. But advancements in research for having bigger and deeper neural networks have started to bring these back into focus. That, along with GPUs being great for these computationally intensive tasks, and a bunch of other factors, has brought them up to the forefront. And you read all about these advancements in AI, and everyone’s rushing to use AI now.

So I just wanna talk a little bit about where the inspiration for neural networks comes from, ’cause it’s actually fairly natural. In fact, it comes from our own brains. And so I just wanna talk a little bit about how biological neurons work in your brain, and then you can see that we can take that biological model and actually formalize it into a mathematical model. And when we have that mathematical model, then we can start saying, well, okay, if I have one neuron, what happens when I have four of them? And what happens if I layer them, right? And that kind of helps build up from just a single neuron over to a full, deep neural network.

And so that’s kind of where I’m gonna be starting, just from a single neuron. We’ll talk a little about that, then we’ll maybe use two neurons and have a little tiny neural network, and then we’ll see what happens when we expand on that. Anyway, biological neurons. I’m gonna try to draw (that came out kinda weird), but suppose this is your brain. Right? Not a very good picture, but I guess it serves the purpose. So suppose this is your brain here, right? And what we know is that there are different regions of your brain that correspond to different tasks. For example, I happen to know that the occipital lobe is kinda towards the back of the brain, and the occipital lobe deals with things like vision-based tasks, dealing with visual cues and stuff like that. So a lot of the vision stuff, what you see and how you make sense of what you see, happens towards the back of your brain. And there are all sorts of different regions, and they can change, and there’s all this neuroscience behind this, but I’m not gonna talk about that. But let’s kind of zoom in here.

So suppose I just zoom into one particular portion, and let’s make a bigger picture. You can see that the brain is actually composed of these things called neurons. And they have kind of a nice structure in the sense that they’re just a cell in your body, like any other cell. And just like any other cell, they have a nucleus. So here is the nucleus of your neuron. And the nucleus of the cell does all sorts of different functions. It pretty much is like the little mini brain of the cell. It just holds a lot of the information: the purpose of the cell, how it’s supposed to carry out tasks, DNA, and all sorts of other stuff. But anyway, that’s not particularly what we want.

Particularly what we’re interested in is, let’s first get the definition of a neuron, and then see if we can model it mathematically. So there are kinda two big portions of these neurons: they’re called the dendrites and the axon. And so let’s suppose that my dendrites are over on this side, and then my axon is over on this side. So the dendrites are like little connections here; let me see if I can draw a few of them. And they actually go to other neurons. So these dendrites are actually incoming connections from other neurons. Let me see if I can draw a few here, and I’m just gonna stop right there. Okay, so these are the dendrites. Let me actually do this in a different color. These are the dendrites, and they’re kinda like incoming connections from other neurons, because all these neurons are interconnected. So if there was another neuron up here, it would be connected to this one through the dendrites.

And so what this neuron does is it takes the information that it gets from these dendrites, does some kind of task, and then produces an output. And that output travels along what’s called the axon. So I can kinda continue this drawing here: you produce a single output here, but the output may go into different neurons. So here is the axon, and the axon is just the single output of the biological neuron. So the dendrites are connections coming in from other neurons, and the axon’s output goes out to other neurons. And each one of these biological neurons is connected to the others in complicated ways that we don’t fully understand yet. And you can see that this is kinda like the singular building block of our brain.

So we can take these biological neurons and kinda stack them together. Maybe I make a copy of this and put it over here, and then these connections map up to these dendrites. And I can organize them in all sorts of different ways, and I get a brain, or a portion of the brain. And then I connect those portions of the brain, using other neurons, to other portions of the brain, and you keep doing that until you get a fully functioning brain. So these are kind of the building blocks. There’s one question that we have to address: when the neuron takes the information from the dendrites, what does it actually do with that information? And that’s where we get into what’s called the firing rate of the neuron, also called the activation of the neuron. It takes these inputs, performs some operation on them, and it only fires, meaning it produces a particular output, based on these inputs. It performs some operation on each single input that it gets, and then it must look at all of the inputs it gets, kind of as a whole, and then it produces some output.

And still, there’s lots of stuff going on there, so don’t think that neural networks are actually a full representation of this, ’cause the brain is far more complicated than any of our existing models of neural networks. But it’s mathematically fairly reasonable. And especially when you discuss things like convolutional networks, you can actually look at the superposition of the activation layers of a convolutional network and the activations of the occipital lobe, and when you look at them, they actually look pretty similar. So that kinda means that we’re on the right track. But anyway, this is really all we need, and this is kind of the singular building block of neural networks. We’re gonna see that this is enough that we can model it mathematically into something that we call the perceptron. And you’re gonna see that when we stack these perceptrons together, we get what’s called a multilayer perceptron, or MLP. So these are just kinda the singular building blocks of our brain.

And we’re gonna formalize them into a mathematical definition so that we can use them as the singular building block for our neural networks. Okay, so I’m going to stop right here and just do a quick recap. In this video, we discussed the biological inspiration for neurons and neural networks. Our brain is split up into different portions, and it’s composed of these building blocks that we call neurons. And they have three big parts. There are the dendrites, which are the quote, unquote “inputs” to the neuron. There’s the nucleus, which actually does some of the processing: it takes the inputs from the dendrites, does some kind of processing task, and then produces an output. And then the output travels along the axon, and the axon then goes as the input to other neurons, and you get this kind of cycle over and over again. So these are the fundamental building blocks of our brain, and we’re gonna model them mathematically so that we can use them as the fundamental building blocks for our neural networks.

**Transcript Part 2**

– Hello everybody, my name is Mohit Deshpande, and in this video, I want to take our model of a biological neuron and then model it more formally using mathematics into what we call a perceptron.

If you remember this picture, I’ll go back here. With this picture, we had these biological neurons, and like I said, there are kind of three big portions. There are the dendrites, which were the different inputs from other neurons, and we could have any number of inputs from other neurons. Then we had the nucleus, which does some kind of processing task, and it has to consider all of the input neurons. And then it produces some output. We can model that mathematically into something called a perceptron. I’m gonna draw a big circle here, and that’s gonna represent our cell body. This perceptron kind of represents this, and there are kind of two portions to it.

First, let’s go from left to right, naturally. If you remember, with the biological neurons, we have inputs from other neurons through these dendrites. So suppose we have some inputs here; let’s say this neuron is connected with four other neurons. Let’s just start small. I’m gonna call my inputs X: X1, X2, X3, and X4. These are the four other neurons that this one happens to be connected to. Then in here, there’s actually something called a synapse, and it’s kind of like a transfer of neurotransmitters, but again, I’m not gonna talk much about that.

Basically, what the synapse gives us is, well, here’s my synapse, which I’m gonna denote with a little circle. The synapse gives us little weights; we call them weights. Here’s weight one, weight two, weight three, and weight four; let me do these in color. Now that I have these weights, we model this with a multiplicative interaction. What I mean by that is that what actually goes into this neuron is X1, X2, X3, X4, each of these times its respective weight, so it’s gonna be X1 W1, X2 W2, X3 times W3, and then X4 times W4. These are the actual inputs that go into our neuron, or into our perceptron. This models this side here with these dendrites. Now we actually have to do something with the nucleus. Actually, just to make this point clear, I’m gonna put these in a different color again: this will then be X1 W1, X2 W2, X3 W3, and then X4 W4. When I get these different inputs, kind of a natural thing to do would be to take the sum of all of these inputs.

What’s really input into this neuron is gonna be X1 W1 plus X2 W2 plus X3 W3 plus X4 W4. One extra little term that’s added is plus B, and this B is called the bias. It’s just another term that we add into our model, and it turns out that it works really well. Remember, these weights are things that we’re actually learning. The Xs are just the inputs; the weights are what we wanna learn. Because each weight is multiplied by an input, we wanna have one other parameter that’s independent of all of the inputs, so to speak. And that’s the bias, because we don’t multiply it by any of the inputs. It kind of serves as a global term across all of these weighted inputs. If this produces a value that’s off by five or something, I can just change my bias to five or minus five and adjust the whole thing. It’s just another adjustment parameter. Actually, I can represent this more succinctly using summation notation: the sum over i of X sub i times W sub i, and then plus the bias. The bias isn’t actually in the sum, it’s just an extra term. This is just a more concise way of representing this sum.

I didn’t wanna have to write this out all the time; I can just write this. That’s kind of one of the big portions of our network. When I take all these inputs, I take the sum of them, and what I end up with is the sum over i of Xi Wi, plus the bias, B. This is what happens when I take my inputs, where each input is weighted by some weight, W. Then, inside, we take the sum of all of these. Notice that when we take the sum, we basically get a single value back (or, depending on how you think about this and what the inputs are, you can get a vector back). So when you get these inputs, you take this thing called a weighted sum, and then you add a bias. There’s one extra thing that we have to do, remember? I mentioned that there’s some other processing we have to do: we have to decide whether or not this perceptron, or this neuron, actually fires, quote unquote.

We also have to decide what value we actually fire, because the output of this is gonna be the input into some other perceptron if we have this layered, for example, which we will. So the question is, how do we take this input and what do we do with it? Usually what we do is we have some function that I’m gonna call G, and G actually produces the output. Remember, the perceptron, or any neuron really, takes in any number of inputs, but it outputs a single value; the output is always just one thing. But then that one thing can go into any number of different perceptrons, and it’s really the same value that’s the input to each of them. If I had more perceptrons over here, the output of this perceptron would go into all of them. So G is just some function that we apply that decides what output we get, and G is interchangeable; I’m gonna talk a little bit about some of the options. We take this sum, then we apply G to it: we apply G to the weighted sum, Xi Wi plus B, and G outputs some value, and that’s the value we give to any other perceptrons that happen to be one layer deeper than this. G is what we call an activation function.
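Just to make that computation concrete, here’s kind of a sketch of it in Python. This is a minimal illustration, not from the video: the step function I’m using for G here is just one possible choice of activation, and the input and weight values are made up.

```python
# A minimal perceptron sketch: weighted sum of inputs plus a bias,
# passed through an activation function g.

def g(z):
    # A simple step activation: fire (output 1) only if the
    # weighted sum is positive.
    return 1 if z > 0 else 0

def perceptron(xs, ws, b):
    # Weighted sum: x1*w1 + x2*w2 + ... + xn*wn + b
    z = sum(x * w for x, w in zip(xs, ws)) + b
    # The activation function decides what the neuron outputs.
    return g(z)

# Example: four inputs, four weights, and a bias.
output = perceptron([1, 0, 1, 1], [0.5, -0.2, 0.3, 0.1], b=-0.5)
print(output)  # 1, since 0.5 + 0.3 + 0.1 - 0.5 = 0.4 > 0
```

The same `perceptron` function works for any number of inputs, since it just zips the inputs with their weights.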

So, G is an activation function. It’s also called a non-linearity, because G is usually a non-linear function. The reason that G is not a linear function is because we might wanna model things that are not linear, and we can’t model non-linear things by taking a linear combination of a ton of linear things. You know, if you do that, then the output is still linear, so you need to add some non-linearities so that when you apply them, you can actually produce something that’s non-linear. I’m not gonna talk too much about that, but there are all sorts of different options for this. And actually, people are still inventing activation functions; there are some recent ones that people have come up with. There are all sorts of different kinds.

There’s the sigmoid, which was one of the first ones, but we kind of moved away from it ’cause it had some issues. There’s tanh, which is the hyperbolic tangent. Probably one of the better ones is ReLU, or rectified linear unit. There’s a variant of that called LeakyReLU, which tries to address some of the issues with the original ReLU. Probably one of the simplest is just a step function, for example, and there are so many more different kinds of activation functions. It’s really just kind of picking one and seeing if it works well; if it doesn’t, then maybe try something else. I will say that ReLUs tend to work really well, so I would recommend that you look at those. Sigmoid is probably something to avoid, unless you’re using it for an output or you’re doing something very specific with it, but internally, for perceptrons, you kind of want to avoid sigmoid. ReLU or LeakyReLU works well.
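As a sketch, here’s roughly what these activation functions look like in Python. The 0.01 slope for the LeakyReLU is a common default, not something from this video.

```python
import math

def sigmoid(z):
    # Squashes any input into (0, 1); one of the earliest activations.
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Hyperbolic tangent: squashes any input into (-1, 1).
    return math.tanh(z)

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but lets a small signal through for negative inputs.
    return z if z > 0 else slope * z

def step(z):
    # The simple step function: fires (1) only for positive input.
    return 1 if z > 0 else 0

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(leaky_relu(-2.0))        # -0.02
print(step(0.5), step(-0.5))   # 1 0
```

Each of these takes the weighted sum as input and decides what the neuron outputs.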

There are others, of course, that work well, and people are inventing these all the time. Anyway, that is really our model of a perceptron, just a single perceptron. What we’re going to be doing, after a quick recap, is going through an example of how we would use these perceptrons. And we’ll see much later on what happens when you’re not just dealing with one perceptron. What happens when you stack them together? What happens when you have them in layers? How many layers should I have? How many perceptrons should I have in each layer? Well, we will answer all these questions. I’ll just do a quick recap. The perceptron is supposed to be one of the simplest models of a biological neuron, and so, basically, our inputs are X1, X2, X3, X4.

We can have any number of inputs, but each input is multiplied by a corresponding weight, W1, W2, and so on. Then the actual input to this is X1 W1, X2 W2, X3 W3, and so on. What we do with that is take the sum of all those, so you get X1 W1 plus X2 W2 plus X3 W3 and so on, and then we add in this additional term called a bias. The bias is added because it tends to work out better, and it’s good because it’s not tied to a particular input, which is also nice; it’s gonna help out. After we have this weighted sum, we have to figure out what output we wanna produce, and to figure that out we use G, which is an activation function. We apply G to this weighted sum, and we produce an output. I guess I should use some shorthand here: the weighted sum is usually called Z, so this is actually just gonna be G of Z, and then we get an output, A, which is the activation.

That’s kind of the shorthand notation that I’ll be using going forward. This is the perceptron model, and G is just an activation function, or a non-linearity, as it’s also called. There are lots of different options you can pick for that; probably the simplest is the step function, and we’re gonna be looking at that in the future. So I’m gonna stop right here, but in case you don’t fully understand this, ’cause this is all a little abstract, we’ll actually take this idea of a perceptron and use it in a more concrete example.

**Transcript Part 3**

– Hello everybody. My name is Mohit Deshpande. And in this video, I want to kind of go through an example of using this perceptron model.

And so, you know, this might seem kind of complicated, because I’m using all this summation notation and these activation functions. It might seem kind of complicated, but let’s actually do an example of this. Let’s start with a small example, and then we can see how we can scale it up. So in particular, what I want to model with this perceptron is called an AND gate. Let’s just model an AND gate: a perceptron example. So if you remember what AND does, let me actually make our grid here. What I mean by AND gate is just something similar to an AND in programming. If you remember, AND returns one only if both of the inputs are also one. And so I can kind of draw this graphically.

So suppose that this is X1, which is one of my inputs, and this is X2, which is my other input. Then I should be able to fill this out. And so here is my zero. And what I want you guys to do is take a second and see if you can fill out this graph. Let me get you started, and let’s keep this simple: an unfilled circle means zero, and a filled-in circle means one. So if X1 is zero and X2 is zero, zero AND zero gets me zero. So this coordinate here at the origin is going to be an unfilled circle. And there are three other points you have to fill out: here’s a point, here’s a point, and here’s a point. What if both of these inputs are one?

And so why don’t you guys take a second, and for each one of these three other points, figure out whether it should return zero or one, assuming that X1 and X2 are the inputs. Take a second to fill out this graph, and we’ll be right back with the answer. Okay, so: if one of these is one and the other is zero, then this should produce zero at this point here. If X2 is one but X1 is zero, then this should also produce a zero right here. But if X1 and X2 are both one, then I produce a one right here, so I’m going to fill this in. And so here is my graph. And I want to build a perceptron that models this exact thing. You can see that what I ultimately want is kind of a line here, so that this point here is one, and all the points over here are zero. There’s only really one point on the one side, because we’re considering only binary values.

But yeah, that’s the kind of line that I want to create, and let’s see if we can build a perceptron model that does it. So let’s do our little perceptron model. Remember we have our g function here, which produces some output. How many inputs do we have? Well, we have two: X1 and X2. Actually, let me move this around so we can look at this a bit better. Here’s my g, and then I produce some output. Then I have two inputs, X1 and X2, and both of these go into the neuron. Each of them has a weight: weight one and weight two. And then, by the way, I also have a bias. And so what I want to do is find values for weight one, weight two, and the bias b, so that when these two inputs are one, the output should also be one.

But then, you know, the question you might be asking is, “Well, what’s this activation function?” Well, here I can define it really quickly. Let me use a different character so it’s not as confusing: let’s say g of a is equal to one if a is greater than zero, and equal to zero if a is less than or equal to zero. So this defines our function g. And so what we want to do is take the weighted sum, X1 W1 plus X2 W2 plus the bias, apply g to it, and produce some value. So let’s try this. We have to initialize these weights and the bias to something, and I suggest we just initialize everything to something simple, like one or zero.
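Just as a sketch, that setup, the g function defined above plus the weighted sum over the two inputs, looks like this in Python:

```python
def g(a):
    # g(a) = 1 if a > 0, else 0
    return 1 if a > 0 else 0

def output(x1, x2, w1, w2, b):
    # Weighted sum x1*w1 + x2*w2 + b, then apply the activation g.
    return g(x1 * w1 + x2 * w2 + b)

# Initialize everything to one, as in the walkthrough:
print(output(0, 0, w1=1, w2=1, b=1))  # 1, but AND of 0 and 0 should be 0
```

We’ll be adjusting the weights and bias passed to `output` as we work through the cases.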

Let’s initialize everything to one. So let my bias be one, let weight one be one, and let weight two be one. Actually, let me use a different color so it’s more obvious. So now let’s try this with different inputs. Suppose that X1 equals zero and X2 equals zero. Then what’s my output? Let’s actually compute it. Both of these inputs are zero, so we expect the output to also be zero. Let’s see what happens. Substituting values, this is going to be zero times one plus zero times one plus one. So when I do this sum, I get zero plus zero plus one, which equals one. And remember, I have to take g. So what’s g of one? One is greater than zero, so the output is one. So right off the bat, we’re not off to a good start. We produce a weighted sum z, we get a value of one, and when we take g of one, we get one. But that’s not the output that we want. We want that output to be zero for this case. Right?

So we know that our initial assignment of weights is not correct. In fact, it’s too high; the value is too big. And so what we want to do is find some way to fix that. Intuitively, we know that we should be decreasing some of these values, because the sum is already too high. And to make this go along a little quicker, I say let’s decrease the bias, because when we do this weighted sum, it’s really this bias term that we can use to bring the weighted sum down. So let’s set the bias equal to zero, for example. Now let’s try this again. With my bias at zero, I get zero times one plus zero times one plus zero, and that gets me a value of zero. So when I apply g of zero, I get, hey, I get zero! So this seems to work for this case. But that’s not the only case we should be checking; we should be checking all these other cases. So now let’s try one, one. I can do one times one plus one times one plus zero, and what does that get me? Well, that gets me a value of two.

And so when I apply g of two, well, two is greater than zero, so I get one. Okay? So, almost done. Now let’s try the cases where one of these is zero. So zero times one plus one times one plus zero, and this should be equal to zero, right? But we get a value of one, and, wait a minute, g of one is equal to one. So this isn’t quite right again. The value is too high again, so we need to decrease our bias again. It makes intuitive sense to decrease the bias and not these weights, because we’re considering the bias a global parameter. So anyway, a bias of zero isn’t quite right, so let’s decrease it again. Let me change my bias, and I’m just going to jump right to it: let’s make our bias minus one point five. Let’s try recomputing this. Let me redo these computations with all of our values. So zero times one plus zero times one minus one point five is minus one point five. And that’s below zero, so when we apply g to it, we get zero. And that’s exactly what we expect, because this value’s less than zero.

Now let’s try this: one times one is one, plus one times one is one, that’s two, minus one point five is zero point five. When I apply g to zero point five, I get a value of one. And that’s exactly what we expect. So now let’s try the other two cases. One minus one point five is minus zero point five, and that’s below zero, so I output zero. And the same for this one; this is also minus zero point five, so I output a zero. So it turns out that this configuration of parameters is the right answer: weight one equals one, weight two equals one, and the bias is minus one point five. So it seems that this configuration of parameters works well for this AND gate. And remember that this works for the AND gate specifically; it won’t work for other gates. We’d have to change the parameters if we want this working with other gates. But this example shows you how this works. So really quickly, I’ll stop right here, and I’m going to give you some motivation that moves on to how we scale these up.

So, in this video, we built a working AND gate using a single perceptron. By setting these weights and this bias, we can get a working AND gate. And the goal is to find out what values of the weights and bias produce the correct output. We found, through trial and error basically, that weight one should be one, weight two should be one, and the bias should be minus one point five. So we have the single perceptron.
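And just to check that, here’s kind of a sketch of the final AND-gate perceptron in Python, using the step activation g defined earlier, run over all four binary input cases:

```python
def g(a):
    # Step activation: fire only when the weighted sum is positive.
    return 1 if a > 0 else 0

def and_gate(x1, x2, w1=1, w2=1, b=-1.5):
    # The parameters we found by trial and error in this video.
    return g(x1 * w1 + x2 * w2 + b)

# Check all four binary input combinations:
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", and_gate(x1, x2))
# 0 0 -> 0
# 0 1 -> 0
# 1 0 -> 0
# 1 1 -> 1
```

Only the (1, 1) case pushes the weighted sum (two minus one point five) above zero, which is exactly the AND behavior we wanted.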

Another question is, well, what happens when we scale this up? Let’s suppose, you know, we use more neurons, and let’s have them oriented in layers, so let’s have more layers. So the question is what happens when we take this single perceptron and expand it out, add more perceptrons, and construct our first neural network, called a multilayer perceptron.

**Transcript Part 4**

– Hello, everybody, my name is Mohit Deshpande, and in this video, I want to scale up our single perceptron into this multilayer perceptron, or this neural network. And so I’m gonna kinda generalize this notion of what happens when we have multiple neurons and what happens when we have more layers, for example.

So, let’s see how we can model this neural network. We can model it like our perceptron, but with more than just a single perceptron. In effect, we have three different kinds of layers in a neural network, if we’re generalizing this. We have an input layer, and so here is my input layer; this dot, dot, dot just means we can have any number, so we’ll have X1, X2, up to Xn. So I have this as my input layer here. And between my input and my output, I can have just an output layer, and that’s kinda what we had when we were building our AND gate: we had two neurons here in the input layer, so the input layer was just X1 and X2, and then the output layer was just a single neuron, Y. These are actually represented as neurons as well; the inputs are also neurons, they just have a single value.

And so, here is my input layer, and then I can have an output layer here. Suppose I have an output layer here: Y1, Y2, up to Ym, let’s say. Suppose there are n neurons in the input layer and m neurons in the output layer. The way that the neural network works is that, between any two consecutive layers, every neuron is connected to every other neuron. So, for example, X1 would be connected to Y1, X2 would be connected to Y1, all of these would be connected to Y1, and Xn would be connected to Y1. And then for any of the others, like Y2: X1 is connected to Y2, X2 is connected to Y2, and so on, all the way to Ym, so X1 is connected to Ym, X2 is connected to Ym, and so on. Then every neuron in this layer here, so let me actually label this as well, this is the output.

Every neuron in this input layer is connected to every neuron in the output layer, and that’s what we call fully connected. This is why these are also called fully connected layers sometimes, or fc layers, as they’re very commonly abbreviated, because every neuron is connected to every other neuron between these two layers. And so this is kinda how we have an input and an output. So this is a very simple one-layer network. There are no what we call hidden layers, and we can expand this by adding what we call hidden layers, and the hidden layers go in between the input and the output. The hidden layer, again, can have any number of neurons, so maybe this is H1, for hidden layer, down to H sub p, so that there are n inputs, p neurons in the hidden layer, and then m neurons in the output layer. But remember that they’re only fully connected between two consecutive layers.

So, what I mean by this is, X1 is now not directly connected to Y1, but X1 is connected to H1 and X2 is connected to H1, and so on, and then again with H2, X1 is connected to H2, X2 is connected to H2, and so on, until we get to this very last neuron here, this H sub p. All of these are connected internally here, and then H1 is connected to Y1, Y2, and so on, and the same for this neuron here, and then we just get this sort of nice pattern here. So, this is what we have: in between two consecutive layers, all the neurons in those two layers are all connected to each other. That’s what we call fully connected, and this middle layer is called the hidden layer because it’s neither the input nor the output.

And actually, we can have more than one hidden layer. We can have two hidden layers, and then all the neurons in the first hidden layer are connected to all the neurons in the second hidden layer, which are connected to all the neurons in the output, so we can kinda stack these as deep as we want. Let’s actually consider a more concrete problem, and that is this kind of dataset called the MNIST handwritten digits. This is kinda like the Hello, World of neural networks and deep learning. We have these handwritten digits; they’re images that we can flatten out into just a list of numbers, and the goal is, given a digit, to be able to tell whether it’s zero, one, two, three, four, five, six, seven, eight, or nine.

Since we’re using MNIST, we actually have 10 output classes, and those classes are zero, one, two, all the way to nine, so that’s all 10 digits. And so, the goal with the MNIST dataset is to pick which one of these digits is the correct one, given some input digit. The goal of building this neural network is to train it on lots of examples of MNIST digits. That way, when it sees a new digit that it’s never seen before, it can correctly predict what kind of digit it is, whether it’s zero, four, seven, or nine. So you might be saying, how would we actually do this computation? What happens is, we take our image and we flatten it out into one giant list of numbers, called a vector, and that works as the input.
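As a quick sketch of that flattening step, assuming the standard MNIST image size of 28 by 28 pixels:

```python
import numpy as np

# A stand-in for one MNIST digit: a 28x28 grid of grayscale pixel values.
image = np.zeros((28, 28))

# Flatten the grid into "one giant list of numbers" -- a vector.
vector = image.flatten()
print(vector.shape)  # (784,) -- this vector is what the input layer sees
```

So the input layer for MNIST has 784 neurons, one per pixel.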

Again, we apply the same perceptron idea where these are all weights. So, all these connections that I draw in here are actually all different weights. And there’s also a bias here, and then there’s another bias, and these weights and these biases are all parameters that we want our neural network to learn. These weights and biases are where the actual learning happens. Initially, they’re just set to random values, so when we run an example through our neural network, initially, we’re gonna get some pretty bad outputs. As it sees more and more examples and as we learn, the output is gonna get better and better, and we can see that visually.
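Here’s a sketch of what those randomly initialized parameters might look like for an MNIST-sized network. The hidden layer size of 30 is an arbitrary choice for illustration, not anything fixed by the problem.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical architecture: 784 inputs, 30 hidden neurons, 10 outputs.
sizes = [784, 30, 10]

# One weight matrix and one bias vector per pair of consecutive layers,
# initialized randomly -- these are the parameters the network learns.
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

print([W.shape for W in weights])  # [(30, 784), (10, 30)]
print([b.shape for b in biases])   # [(30,), (10,)]
```

Until training adjusts them, these random values produce essentially random outputs, which is why the network starts out so inaccurate.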

We can visualize something like our accuracy, and it should be going up. We should be getting more accurate as we see more examples. And so, how this computation actually works is, we take our input and, once we flatten it out into a vector, we apply that perceptron rule: we take this particular value here, multiply it by the weight here, and it goes into the input of H1. Then we take the same X1, multiply it by this other weight, and it goes into H2, and so on, and so on. The hidden layer computes its own values and then passes them to the output layer, and then the output layer computes a value, and at the end of this output layer, we get probabilities, basically.

We get probabilities for each class, and so, what that means is, when I input an image, I get a list of 10 probabilities, if I’m using MNIST, and each of these probabilities is the likelihood that my input falls in each of these classes. Suppose I run a particular example through my neural network and I get an output distribution where one of the probabilities, say for the digit five, is 95%. Then I know that I should be selecting five for this digit, because it has the highest probability. It’s very likely that this input is a five, and so that’s what the output layer does; the output layer can decide that. So, basically, what we’re doing here is this forward pass thing. It’s called a forward pass: you take this input and you feed it through, and you just kind of keep passing it on from one layer to the other.

So, you start with the input, then you pass it on to this hidden layer, and the hidden layer passes it on to the output, and the output finally computes this. So, you just kinda keep passing on the activations from one layer to the other, and then eventually, at the end, you get a probability distribution and you pick the most likely one, if you’re using this softmax distribution. So that’s kinda how that works. I’m a little out of time, so I’m gonna stop right here and just do a quick recap. With these neural networks, in this video, we saw what happens when we scale these up: instead of just having a single perceptron and two inputs, what happens when we have more inputs? What happens with N inputs, specifically, and what happens when we have more layers? And so, this kind of helps answer that question. We have an input layer, any number of hidden layers, and then an output layer. And then, we have this forward pass, also called feedforward, where you take the input and send it off to the first hidden layer, which computes those activations using the same perceptron rule, and then it passes those activations on to the next hidden layer, and then the next hidden layer computes its activations, and you just kinda keep passing it on until you get to the end, at the output layer. The output layer’s job is to take those activations from the last hidden layer and build a probability distribution over all the possible classes.
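The whole forward pass described above can be sketched like this. The sigmoid activation for the hidden layer and the 30-neuron hidden layer are assumptions for illustration; softmax on the output layer is what turns the final activations into a probability distribution, and argmax picks the most likely digit.

```python
import numpy as np

def sigmoid(z):
    """Squash activations into (0, 1) -- one common perceptron-style choice."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn the output layer's activations into a probability distribution."""
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, weights, biases):
    """Feed the input through each layer in turn (the forward pass)."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(W @ a + b)                    # hidden layers
    return softmax(weights[-1] @ a + biases[-1])  # output layer

# Hypothetical 784 -> 30 -> 10 network with random (untrained) parameters.
rng = np.random.default_rng(0)
sizes = [784, 30, 10]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.random(784)            # a stand-in for one flattened digit
probs = forward(x, weights, biases)
print(probs.sum())             # ~1.0: a valid probability distribution
print(int(probs.argmax()))     # the predicted digit, 0 through 9
```

With random weights the prediction is meaningless; training is what makes argmax land on the right digit.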

So, in this case, for MNIST, for example, like I mentioned, with handwritten digits, there are 10 digits and so what happens is, it will produce a distribution for all of these 10 digits. And so, it’ll tell you something like, there’s a 97% chance that this input that you gave me is an eight, and so then I just pick the one with the highest probability, the highest likelihood, then. So, then that’s what the output layer does, and then eventually, I get a single value back. Say I give an image and the output I get is, I am this percent confident that this is an eight, for example, and so that’s kinda how we do this, feedforward neural network.

The real question is, where does the learning happen? What are we actually learning? In between all these layers, in these layers’ connections here, we have these weights and biases, exactly like what we had for our single perceptron. Instead of having two, we have however many we need to go between these two different layers. We have this weight matrix and this bias vector, and those are the things that we wanna learn. And so, how do we actually learn these things?

I’m gonna give you an intuitive way of how we do that with gradient descent very soon. We can look at these weights and these biases, and basically, the idea is that we wanna minimize our error and increase our accuracy, and so, we can make little changes to these weights and these biases so that we increase our accuracy or minimize our error.
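As a tiny preview of that gradient descent idea, here’s a toy example on a single weight, minimizing the made-up error (w − 3)²; the learning rate of 0.1 is an arbitrary choice.

```python
# A minimal sketch of gradient descent: nudge the weight a little in the
# direction that lowers the error, and repeat.
def step(w, grad, lr=0.1):
    return w - lr * grad

# Toy error: error(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for _ in range(100):
    w = step(w, 2 * (w - 3))
print(round(w, 3))  # 3.0 -- the weight that minimizes the error
```

Real networks do the same thing to every weight and bias at once, using gradients computed from the training examples.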