In this project, you’ll get to build a neural network from scratch to carry out a prediction problem on a real dataset! By building a neural network from the ground up, you’ll have a much better understanding of gradient descent, backpropagation, and other concepts that are important to know before we move to higher level tools such as PyTorch. You’ll also get to see how to apply these networks to solve real prediction problems!

The data comes from the UCI Machine Learning Database.

The project can be found at http://14.232.166.121:8880/lab? > deeplearning > project-bikesharing

So by now we've learned how to build a deep neural network and how to train it to fit our data. Sometimes, however, we go out there and train on our own and find out that nothing works as planned. Why? Because there are many things that can fail. Our architecture can be poorly chosen, our data can be noisy, or our model could be taking years to run when we need it to run faster. We need to learn ways to optimize the training of our models, and this is what we'll do next.

So let's look at the following data, formed by blue and red points, and the following two classification models which separate the blue points from the red points. The question is: which of these two models is better?

Well, it seems like the one on the left is simpler since it's a line, and the one on the right is more complicated since it's a complex curve. Now, the one on the right makes no mistakes: it correctly separates all the points. The one on the left, on the other hand, does make some mistakes. So we're inclined to think that the one on the right is better. In order to really find out which one is better, we introduce the concept of training and testing sets. We'll denote them as follows: the solid color points are the training set, and the points with the white inside are the testing set. And what we'll do is train our models on the training set without looking at the testing set, and then evaluate the results on that testing set to see how we did.

So according to this, we trained the linear model and the complex model on the training set to obtain these two boundaries. Now we reintroduce the testing set, and we can see that the model on the left made one mistake while the model on the right made two mistakes. So in the end, the simple model was better. Does that match our intuition? Well, it does, because in machine learning that's what we're going to do: whenever we can choose between a simple model that does the job and a complicated model that may do the job a little bit better, we always try to go for the simpler model.

So, let’s talk about life. In life, there are two mistakes one can make. One is to try to kill Godzilla using a flyswatter. The other one is to try to kill a fly using a bazooka.

What's the problem with trying to kill Godzilla with a flyswatter? That we're oversimplifying the problem. We're trying a solution that is too simple and won't do the job. In machine learning, this is called underfitting. And what's the problem with trying to kill a fly with a bazooka? It's overly complicated and it will lead to bad solutions and extra complexity when we could use a much simpler solution instead. In machine learning, this is called overfitting. Let's look at how overfitting and underfitting can occur in a classification problem. Let's say we have the following data, and we need to classify it.

So what is the rule that will do the job here? Seems like an easy problem, right? The ones on the right are dogs, while the ones on the left are anything but dogs. Now what if we use the following rule? We say that the ones on the right are animals and the ones on the left are anything but animals. Well, that solution is not too good, right? What is the problem? It's too simple. It doesn't even get the whole data set right. See? It misclassifies this cat over here, since the cat is an animal. This is underfitting.

Now, what about the following rule? We'll say that the ones on the right are dogs that are yellow, orange, or grey, and the ones on the left are anything but dogs that are yellow, orange, or grey. Well, technically this is correct, as it classifies the data correctly. But there is a feeling that we went too specific, since just saying dogs and not dogs would have done the job. This problem is more conceptual, right? How can we see the problem here? Well, one way to see it is by introducing a testing set. If our testing set is this dog over here, then we'd imagine that a good classifier would put it on the right with the other dogs. But this classifier will put it on the left, since the dog is not yellow, orange, or grey. So the problem, as we said, is that the classifier is too specific. It will fit the data well, but it will fail to generalize. This is overfitting.

But now, let's see how this would look in neural networks. So let's say we have this data, where, again, the blue points are labeled positive and the red points are labeled negative. And here we have the three little bears. In the middle, we have a good model which fits the data well. On the left, we have a model that underfits since it's too simple: it tries to fit the data with a line, but the data is more complicated than that. And on the right, we have a model that overfits, since it tries to fit the data with an overly complicated curve. Notice that the model on the right fits the data really well, since it makes no mistakes, whereas the one in the middle makes this mistake over here. But we can see that the model in the middle will probably generalize better. The model in the middle treats this point as noise, while the one on the right gets confused by it and tries to fit it too well. Now, the model in the middle will probably be a neural network with a slightly complex architecture like this one.

So, let's start from where we left off: we have a complicated network architecture, which would be more complicated than we need, but we have to live with it. So, let's look at the process of training. We start with random weights in our first epoch and we get a model like this one, which makes lots of mistakes. Now, as we train, say for 20 epochs, we get a pretty good model. But then, say we keep going for 100 epochs: we'll get something that fits the data much better, but we can see that it's starting to overfit. If we go for even more, say 600 epochs, then the model heavily overfits. We can see that the blue region is pretty much a bunch of circles around the blue points. This fits the training data really well, but it will generalize horribly. Imagine a new blue point in the blue area. This point will most likely be classified as red, unless it's super close to a blue point.

So, let's try to evaluate these models by adding a testing set, such as these points. Let's make a plot of the error in the training set and the testing set with respect to each epoch. For the first epoch, since the model is completely random, it badly misclassifies both the training and the testing sets. So both the training error and the testing error are large, and we can plot them over here. For the 20th epoch, we have a much better model, which fits the training data pretty well and also does well on the testing set. So both errors are relatively small, and we'll plot them over here. For the 100th epoch, we see that we're starting to overfit. The model fits the training data very well, but it starts making mistakes on the testing data. We see that the training error keeps decreasing, but the testing error starts increasing, so we plot them over here. Now, for the 600th epoch, we're badly overfitting.

Now, we draw the curves that connect the training and testing errors. In this plot, it is quite clear when we stop underfitting and start overfitting. The training curve is always decreasing, since as we train the model, we keep fitting the training data better and better. The testing error is large when we're underfitting, because the model is not exact. Then it decreases as the model generalizes well, until it gets to a minimum point, the Goldilocks spot. And finally, once we pass that spot, the model starts overfitting again, since it stops generalizing and just starts memorizing the training data. This plot is called the model complexity graph.

Well, the first observation is that both equations give us the same line: the line with equation x1 + x2 = 0. And the reason for this is that solution two is really just a scalar multiple of solution one. So let's see. Recall that the prediction is the sigmoid of the linear function. So in the first case, for the point (1, 1), it would be sigmoid of 1 + 1, which is sigmoid of 2, which is 0.88. This is not bad, since the point is blue, so it has a label of one. For the point (-1, -1), the prediction is sigmoid of -1 + -1, which is sigmoid of -2, which is 0.12. This is also not bad, since the point has a label of zero, as it's red. Now let's see what happens with the second model. The point (1, 1) has prediction sigmoid of 10 times 1 plus 10 times 1, which is sigmoid of 20. This is 0.9999999979, which is really close to 1, so it's a great prediction. And the point (-1, -1) has prediction sigmoid of 10 times negative one plus 10 times negative one, which is sigmoid of minus 20, and that is 0.0000000021. That's really, really close to zero, so it's a great prediction. So the answer to the quiz is the second model.

The second model is super accurate. This means it's better, right? Well, after the last section you may be a bit reluctant, since this hints a bit towards overfitting. And your hunch is correct: the problem is overfitting, but in a subtle way. Here's what's happening, and here's why the first model is better even if it gives a larger error.
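The numbers above are easy to check numerically. Here is a quick sketch; the `sigmoid` helper is just the standard logistic function:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic sigmoid: squashes any real number into (0, 1).
    return 1 / (1 + np.exp(-z))

# Model 1: w1 = w2 = 1.  Model 2: w1 = w2 = 10 (a scalar multiple).
for w in (1, 10):
    pred_blue = sigmoid(w * 1 + w * 1)    # point (1, 1), label 1
    pred_red = sigmoid(w * -1 + w * -1)   # point (-1, -1), label 0
    print(w, pred_blue, pred_red)
```

Running this reproduces the values in the text: roughly 0.88 and 0.12 for the first model, and predictions vanishingly close to 1 and 0 for the second.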

When we apply the sigmoid to small values such as x1 + x2, we get the function on the left, which has a nice slope for gradient descent. When we multiply the linear function by 10 and take the sigmoid of 10x1 + 10x2, our predictions are much better, since they're closer to zero and one. But the function becomes much steeper, and it's much harder to do gradient descent here, since the derivatives are mostly close to zero and then very large when we get to the middle of the curve. Therefore, in order to do gradient descent properly, we want a model like the one on the left more than a model like the one on the right. In a conceptual way, the model on the right is too certain, and it gives little room for applying gradient descent. Also, as we can imagine, the points that are classified incorrectly by the model on the right will generate large errors, and it will be hard to tune the model to correct them.

Now the question is: how do we prevent this type of overfitting from happening? This seems not to be easy, since the bad model gives smaller errors. Well, all we have to do is tweak the error function a bit. Basically, we want to punish high coefficients. So what we do is take the old error function and add a term which is big when the weights are big. There are two ways to do this. One way is to add the sum of the absolute values of the weights times a constant lambda. The other one is to add the sum of the squares of the weights times that same constant. As you can see, these two are large if the weights are large. The lambda parameter tells us how much we want to penalize the coefficients: if lambda is large, we penalize them a lot, and if lambda is small, we don't penalize them much. And finally, if we decide to go for the absolute values, we're doing L1 regularization, and if we decide to go for the squares, then we're doing L2 regularization.
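As a sketch, with hypothetical weights and a made-up base error (`regularized_error` is an illustrative name, not from the lesson), the penalty term looks like this:

```python
import numpy as np

def regularized_error(error, weights, lam, kind="l2"):
    # Add a penalty that grows with the size of the weights.
    w = np.asarray(weights, dtype=float)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))  # L1: sum of absolute values
    else:
        penalty = lam * np.sum(w ** 2)     # L2: sum of squares
    return error + penalty

# The big-coefficient model (w = 10) is punished far more than w = 1:
# L2 penalty is 0.01 * (100 + 100) = 2.0 versus 0.01 * (1 + 1) = 0.02.
print(regularized_error(0.1, [10, 10], lam=0.01))
print(regularized_error(0.1, [1, 1], lam=0.01))
```

So even though the big-coefficient model has a smaller raw error, its regularized error is much larger, which pushes training toward the smaller weights.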

Here are some general guidelines for deciding between L1 and L2 regularization. When we apply L1, we tend to end up with sparse vectors. That means small weights will tend to go to zero. So if we want to reduce the number of weights and end up with a small set, we can use L1. This is also good for feature selection: sometimes we have a problem with hundreds of features, and L1 regularization will help us select which ones are important, turning the rest into zeroes. L2, on the other hand, tends not to favor sparse vectors, since it tries to keep all the weights homogeneously small.

This is something that happens a lot when we train neural networks. Sometimes one part of the network has very large weights and it ends up dominating all the training, while another part of the network doesn’t really play much of a role so it doesn’t get trained. So, what we’ll do to solve this is sometimes during training, we’ll turn this part off and let the rest of the network train.

More thoroughly, what we do is as we go through the epochs, we randomly turn off some of the nodes and say, you shall not pass through here. In that case, the other nodes have to pick up the slack and take more part in the training.
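A minimal sketch of this idea, assuming the common "inverted dropout" convention, where the surviving activations are rescaled so their expected total stays the same (the function name and keep probability are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, keep_prob=0.8):
    # Randomly turn off each node with probability 1 - keep_prob, and
    # scale the survivors by 1 / keep_prob so the expected sum is unchanged.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones(10)  # pretend activations from one hidden layer
print(dropout(h))  # some entries zeroed out, the rest scaled up
```

During training a fresh random mask is drawn every pass, so different nodes get turned off each time and no single part of the network can dominate.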

Here's another problem that can occur. Let's take a look at the sigmoid function. The curve gets pretty flat on the sides. So if we calculate the derivative at a point way out to the right or way out to the left, this derivative is almost zero. This is not good, because the derivative is what tells us in what direction to move. This gets even worse in multilayer perceptrons. Check this out. Recall that the derivative of the error function with respect to a weight is the product of all the derivatives calculated at the nodes in the corresponding path to the output. All these derivatives are derivatives of a sigmoid function, so they're small, and the product of a bunch of small numbers is tiny. This makes the training difficult.
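We can see how small these products get with a quick sketch:

```python
import numpy as np

def sigmoid_prime(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

print(sigmoid_prime(0.0))  # 0.25, the maximum possible value
print(sigmoid_prime(5.0))  # already tiny, out on the flat part of the curve
# One small derivative per node in the path: even at the maximum,
# a path through ten sigmoid nodes shrinks the gradient by 0.25 ** 10.
print(0.25 ** 10)
```

Even in the best case the factor per node is 0.25, so a product over many nodes collapses toward zero very quickly.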

The question of what learning rate to use is pretty much a research question itself, but here's a general rule. If your learning rate is too big, then you're taking huge steps, which could be fast at the beginning, but you may miss the minimum and keep going, which will make your model pretty chaotic. If you have a small learning rate, you will take steady steps and have a better chance of arriving at your local minimum. This may make your model very slow, but in general a good rule of thumb is: if your model's not working, decrease the learning rate. The best learning rates are those which decrease as the model gets closer to a solution.

We want to find the weights for our neural networks. Let's start by thinking about the goal. The network needs to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are, the **error**. A common metric is the sum of the squared errors (SSE):

$$E = \frac{1}{2}\sum_\mu \sum_j \left(y_j^\mu - \hat{y}_j^\mu\right)^2$$

where *ŷ* is the prediction and *y* is the true value, and you take the sum over all output units *j* and another sum over all data points *μ*. This might seem like a really complicated equation at first, but it's fairly simple once you understand the symbols and can say what's going on in words.

First, the inside sum over *j*. This variable *j* represents the output units of the network. So this inside sum is saying: for each output unit, find the difference between the true value *y* and the predicted value from the network *ŷ*, then square the difference, then sum up all those squares.

Then the other sum over *μ* is a sum over all the data points. So, for each data point you calculate the inner sum of the squared differences for each output unit. Then you sum up those squared differences for each data point. That gives you the overall error for all the output predictions for all the data points.

The SSE is a good choice for a few reasons. The square ensures the error is always positive and larger errors are penalized more than smaller errors. Also, it makes the math nice, always a plus.
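The double sum described above can be sketched directly in NumPy, with made-up targets and predictions (the 1/2 in front is included here because, as discussed later, it cleans up the derivative):

```python
import numpy as np

def sse(y, y_hat):
    # Square the differences, sum over output units (inner axis) and
    # over data points (outer axis), then halve for convenience.
    return 0.5 * np.sum((y - y_hat) ** 2)

# Hypothetical values: 3 data points, 2 output units each.
y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_hat = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.9]])
print(sse(y, y_hat))
```

A perfect set of predictions gives an SSE of exactly zero, and any mistake makes it positive, which is what we want from an error measure.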

Remember that the output of a neural network, the prediction, depends on the weights

and accordingly the error depends on the weights

We want the network's prediction error to be as small as possible, and the weights are the knobs we can use to make that happen. Our goal is to find weights *wij* that minimize the squared error *E*. To do this with a neural network, typically you'd use **gradient descent**.

With gradient descent, we take multiple small steps towards our goal. In this case, we want to change the weights in steps that reduce the error. Continuing the analogy, the error is our mountain and we want to get to the bottom. Since the fastest way down a mountain is in the steepest direction, the steps taken should be in the direction that minimizes the error the most. We can find this direction by calculating the *gradient* of the squared error.

*Gradient* is another term for rate of change or slope. If you need to brush up on this concept, check out Khan Academy’s great lectures on the topic.

To calculate a rate of change, we turn to calculus, specifically derivatives. A derivative of a function f(x) gives you another function f'(x) that returns the slope of f(x) at point *x*. For example, consider f(x) = x^2. The derivative of x^2 is f'(x) = 2x. So, at x = 2, the slope is f'(2) = 4. Plotting this out, it looks like:

The gradient is just a derivative generalized to functions with more than one variable. We can use calculus to find the gradient at any point in our error function, which depends on the input weights. You'll see how the gradient descent step is derived on the next page.

Below I’ve plotted an example of the error of a neural network with two inputs, and accordingly, two weights. You can read this like a topographical map where points on a contour line have the same error and darker contour lines correspond to larger errors.

At each step, you calculate the error and the gradient, then use those to determine how much to change each weight. Repeating this process will eventually find weights that are close to the minimum of the error function, the black dot in the middle.
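That repeated step can be sketched in one dimension, using a toy error shaped like x^2 (as in the derivative example earlier); the learning rate and starting point below are arbitrary choices:

```python
# Gradient descent on f(x) = x**2, whose derivative is f'(x) = 2*x.
def f_prime(x):
    return 2 * x

x = 2.0  # start somewhere away from the minimum; the slope here is 4
learning_rate = 0.1
for _ in range(50):
    x -= learning_rate * f_prime(x)  # step against the gradient
print(x)  # after many steps, x is very close to 0, the minimum of f
```

Each step moves opposite to the slope, and the steps naturally shrink as the slope flattens near the minimum.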

Since the weights will just go wherever the gradient takes them, they can end up where the error is low, but not the lowest. These spots are called local minima. If the weights are initialized with the wrong values, gradient descent could lead the weights into a local minimum, illustrated below.

There are methods to avoid this, such as using momentum.

Now we know how to get an output from a simple neural network like the one shown here.

We'd like to use the output to make predictions, but how do we build this network to make predictions without knowing the correct weights beforehand? What we can do is present it with data that we know to be true, then set the model parameters, the weights, to match that data. First, we need some measure of how bad our predictions are. The obvious choice is to use the difference between the true target value, y, and the network output, y hat.

However, if the prediction is too high, this error will be negative, and if the prediction is too low by the same amount, the error will be positive. We'd rather treat these errors the same, so to make both cases positive we'll just square the error. You might be wondering why not just take the absolute value. One benefit of using the square is that it penalizes outliers more than small errors.

Also squaring the error makes the math nice later. This is the error for just one prediction though. We’d rather like to know the error for the entire dataset. So we’ll just sum up the errors for each data record denoted by the sum over mu. Now we have the total error for the network over the entire dataset. And finally, we’ll add a one half in front because it cleans up the math later.

This formulation is typically called the sum of the squared errors (SSE). Remember that y hat is the linear combination of the weights and inputs passed through that activation function. We can substitute it in here, then we see that the error depends on the weights, wi, and the input values, xi.

You can think of the data as two tables or arrays, or matrices, whatever works for you. One contains the input data, x, and the other contains the targets, y. Each record is one row here, so mu equals 1 is the first row. Then, to calculate the total error, you’re just scanning through the rows of these arrays and calculating the SSE. Then summing up all of those results. The SSE is a measure of our network’s performance.

If it's high, the network is making bad predictions. If it's low, the network is making good predictions. So we want to make it as small as possible. Going forward, let's consider a simple example with only one data record, to make it easier to understand how we'll minimize the error. For the simple network, the SSE is the true target minus the prediction, y minus y hat, squared, and divided by 2. Substituting for the prediction, you see the error is a function of the weights. Then our goal is to find weights that minimize the error.

Our goal is to find the weight at the bottom of this bowl. Starting at some random weight, we want to make a step in the direction of the minimum. This direction is opposite to the gradient, the slope. If we take many steps, always descending down the gradient, eventually the weight will find the minimum of the error function, and this is gradient descent. We want to update the weight, so a new weight, wi, is the old weight, wi, plus this weight step, delta wi. The Greek letter delta typically signifies change.

Writing out the gradient, you get the partial derivative with respect to the weights of the squared error. The network output, y hat, is a function of the weights. So what we have here is a function of another function that depends on the weights.

This requires using the chain rule to calculate the derivative. Here is a quick refresher on the chain rule. Say you want to take the derivative of a function p with respect to z, where p depends on another function q that depends on z. The chain rule says you first take the derivative of p with respect to q, then multiply it by the derivative of q with respect to z.

This relates to our problem because we can set q to the error, y minus y hat, and set p to the squared error, and then take the derivative with respect to wi. The derivative of p with respect to q returns the error itself: the 2 in the exponent drops down and cancels out the 1/2.

Then we're left with the derivative of the error with respect to wi. The target value y doesn't depend on the weights, but y hat does. Using the chain rule again, the minus sign comes out in front and we're left with the partial derivative of y hat. If you remember, y hat is equal to the activation function at h, where h is the linear combination of the weights and input values.

Taking the derivative of y hat, and again using the chain rule, you get the derivative of the activation function at h, times the partial derivative of the linear combination. In the sum, there is only one term that depends on each weight. Writing this out for weight one, you see that only the first term, with x1, depends on weight one. Then the partial derivative of the sum with respect to weight one is just x1. All the other terms are zero, so in general the partial derivative of this sum with respect to wi is just xi.

Finally, putting all this together, the gradient of the squared error with respect to wi is the negative of the error times the derivative of the activation function at h times the input value xi.

Then the weight step is the learning rate eta, times the error, times the activation derivative, times the input value. For convenience, and to make things easy later, we can define an error term, lowercase delta, as the error times the activation function derivative at h. Then we can write our weight update as wi equals wi plus the learning rate times the error term times xi, where xi is the value of input i.
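Putting the whole update together as a runnable sketch, with hypothetical inputs, weights, target, and learning rate:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One gradient descent step for a single sigmoid unit (hypothetical values).
x = np.array([1.0, 2.0])   # input values
w = np.array([0.5, -0.5])  # current weights
y = 1.0                    # target
learnrate = 0.5

h = np.dot(x, w)           # linear combination of weights and inputs
y_hat = sigmoid(h)         # the network's prediction
error = y - y_hat
error_term = error * y_hat * (1 - y_hat)  # delta = error * f'(h)
del_w = learnrate * error_term * x        # one weight step per input
w_new = w + del_w
print(w_new)
```

Note that `y_hat * (1 - y_hat)` is the sigmoid derivative evaluated at h, so no separate derivative function is needed.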

You might be working with multiple output units. You can think of this as just stacking the architecture from the single-output network, but connecting the input units to the new output units as well. Now the total error includes the error of each output, summed together. The gradient descent step can be extended to a network with multiple outputs by calculating an error term for each output unit, denoted with the subscript j.
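A sketch of the multiple-output case, with hypothetical values; note that the weight step becomes an outer product of the inputs with the per-output error terms:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical network with 3 inputs and 2 output units (no hidden layer).
x = np.array([0.5, 0.1, -0.2])
w = np.array([[0.2, -0.1],
              [0.4, 0.3],
              [-0.3, 0.2]])  # 3 x 2: one column of weights per output unit
y = np.array([1.0, 0.0])
learnrate = 0.5

h = np.dot(x, w)                                  # input to each output unit
y_hat = sigmoid(h)
error_term = (y - y_hat) * y_hat * (1 - y_hat)    # one delta_j per output
del_w = learnrate * x[:, None] * error_term       # outer product: 3 x 2 steps
print(del_w.shape)
```

Each column of `del_w` is exactly the single-output update from before, computed for one output unit at a time.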

hands-on lab –> http://14.232.166.121:8880/lab? > deeplearning>implement_gradient_descent> gradient_basic.ipynb & gradient.ipynb

Multilayer Perceptrons

Below, we are going to walk through the math of neural networks in a multilayer perceptron. With multiple perceptrons, we are going to move to using vectors and matrices. To brush up, be sure to view the following:

- Khan Academy’s introduction to vectors.
- Khan Academy’s introduction to matrices.

Before, we were dealing with only one output node, which made the code straightforward. However, now that we have multiple input units and multiple hidden units, the weights between them will require two indices: *wij*, where *i* denotes the input unit and *j* the hidden unit.

For example, the following image shows our network, with its input units labeled *x*1, *x*2, and *x*3, and its hidden nodes labeled *h*1 and *h*2:

The lines indicating the weights leading to *h*1 have been colored differently from those leading to *h*2, just to make it easier to read.

Now, to index the weights, we take the input unit number for the *i* and the hidden unit number for the *j*. That gives us *w*11 for the weight leading from *x*1 to *h*1, and *w*12 for the weight leading from *x*1 to *h*2.

The following image includes all of the weights between the input layer and the hidden layer, labeled with their appropriate *wij* indices:

Before, we were able to write the weights as an array, indexed as *wi*.

But now, the weights need to be stored in a **matrix**, indexed as *wij*. Each **row** in the matrix will correspond to the weights **leading out** of a **single input unit**, and each **column** will correspond to the weights **leading in** to a **single hidden unit**. For our three input units and two hidden units, the weights matrix looks like this:

Be sure to compare the matrix above with the diagram shown before it so you can see where the different weights in the network end up in the matrix.

To initialize these weights in NumPy, we have to provide the shape of the matrix. If `features` is a 2D array containing the input data:
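One common way to do this is with small normal random values, scaled by the number of inputs (the scaled-normal choice is an assumption here; only the variable names come from the text):

```python
import numpy as np

# features: hypothetical 2D input array, one row per data record.
features = np.array([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6]])

n_inputs = features.shape[1]  # number of input units (columns of features)
n_hidden = 2                  # number of hidden units

# Random initial weights: one row per input unit, one column per hidden unit.
weights_input_to_hidden = np.random.normal(
    0.0, n_inputs ** -0.5, size=(n_inputs, n_hidden))
print(weights_input_to_hidden.shape)
```
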

This creates a 2D array (i.e. a matrix) named `weights_input_to_hidden` with dimensions `n_inputs` by `n_hidden`. Remember how the input to a hidden unit is the sum of all the inputs multiplied by the hidden unit's weights. So for each hidden layer unit, *hj*, we need to calculate the following:

$$h_j = \sum_i w_{ij}\,x_i$$

To do that, we now need to use matrix multiplication. If your linear algebra is rusty, I suggest taking a look at the suggested resources in the prerequisites section. For this part though, you’ll only need to know how to multiply a matrix with a vector.

In this case, we’re multiplying the inputs (a row vector here) by the weights. To do this, you take the dot (inner) product of the inputs with each column in the weights matrix. For example, to calculate the input to the first hidden unit, j = 1, you’d take the dot product of the inputs with the first column of the weights matrix, like so:

Calculating the input to the first hidden unit with the first column of the weights matrix

And for the second hidden layer input, you calculate the dot product of the inputs with the second column. And so on and so forth.

In NumPy, you can do this for all the inputs and all the outputs at once using `np.dot`
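For example (the input and weight values here are made up):

```python
import numpy as np

inputs = np.array([0.5, -0.2, 0.1])  # one data record, as a row vector

# 3 x 2 weights matrix: rows are input units, columns are hidden units.
weights_input_to_hidden = np.array([[0.1, 0.4],
                                    [0.2, 0.5],
                                    [0.3, 0.6]])

# np.dot takes the dot product of the inputs with each column at once,
# giving the input to every hidden unit in one call.
hidden_inputs = np.dot(inputs, weights_input_to_hidden)
print(hidden_inputs)
```

The first element is the dot product of the inputs with the first column, the second with the second column, matching the column-by-column description above.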

You could also define your weights matrix such that it has dimensions `n_hidden` by `n_inputs`, then multiply like so, where the inputs form a *column vector*:

**Note:** The weight indices have changed in the above image and no longer match up with the labels used in the earlier diagrams. That’s because, in matrix notation, the row index always precedes the column index, so it would be misleading to label them the way we did in the neural net diagram. Just keep in mind that this is the same weight matrix as before, but rotated so the first column is now the first row, and the second column is now the second row. If we *were* to use the labels from the earlier diagram, the weights would fit into the matrix in the following locations:

Remember, the above is **not** a correct view of the **indices**, but it uses the labels from the earlier neural net diagrams to show you where each weight ends up in the matrix.

The important thing with matrix multiplication is that *the dimensions match*. For matrix multiplication to work, there has to be the same number of elements in the dot products. In the first example, there are three columns in the input vector, and three rows in the weights matrix. In the second example, there are three columns in the weights matrix and three rows in the input vector. If the dimensions don’t match, you’ll get this:

The dot product can’t be computed for a 3×2 matrix and 3-element array. That’s because the 2 columns in the matrix don’t match the number of elements in the array. Some of the dimensions that could work would be the following:

The rule is that if you’re multiplying an array from the left, the array must have the same number of elements as there are rows in the matrix. And if you’re multiplying the *matrix* from the left, the number of columns in the matrix must equal the number of elements in the array on the right.

You see above that sometimes you'll want a column vector, even though by default NumPy arrays work like row vectors. It's possible to get the transpose of an array with `arr.T`, but for a 1D array, the transpose will still return a row vector. Instead, use `arr[:,None]` to create a column vector:
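For example:

```python
import numpy as np

arr = np.array([1, 2, 3])
print(arr.T.shape)         # (3,): transposing a 1D array changes nothing
print(arr[:, None].shape)  # (3, 1): a proper column vector
```
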

Alternatively, you can create arrays with two dimensions. Then, you can use `arr.T` to get the column vector.

hands-on lab –> http://14.232.166.121:8880/lab? > deeplearning>implement_gradient_descent> multilayer.ipynb

Now we've come to the problem of how to make a multilayer neural network *learn*. Before, we saw how to update weights with gradient descent. The backpropagation algorithm is just an extension of that, using the chain rule to find the error with respect to the weights connecting the input layer to the hidden layer (for a two-layer network).

To update the weights to hidden layers using gradient descent, you need to know how much error each of the hidden units contributed to the final output. Since the output of a layer is determined by the weights between layers, the error resulting from the hidden units is scaled by the weights going forward through the network. Since we know the error at the output, we can use the weights to work backwards to hidden layers.

For example, in the output layer, you have errors *δko* attributed to each output unit *k*. Then, the error attributed to hidden unit *j* is the output errors, scaled by the weights between the output and hidden layers (and the gradient):

$$\delta_j^h = \sum_k w_{jk}\,\delta_k^o\,f'(h_j)$$

Then, the gradient descent step is the same as before, just with the new errors:

$$\Delta w_{ij} = \eta\,\delta_j^h\,x_i$$

where *wij* are the weights between the inputs and hidden layer and *xi* are the input unit values. This form holds for however many layers there are. The weight steps are equal to the step size times the output error of the layer, times the values of the inputs to that layer.

Here, you get the output error, *δoutput*, by propagating the errors backwards from higher layers. And the input values, *Vin*, are the inputs to the layer: the hidden layer activations for the output unit, for example.

Let's walk through the steps of calculating the weight updates for a simple two-layer network. Suppose there are two input values, one hidden unit, and one output unit, with sigmoid activations on the hidden and output units. The following image depicts this network. (**Note:** the input values are shown as nodes at the bottom of the image, while the network's output value is shown as *ŷ* at the top. The inputs themselves do not count as a layer, which is why this is considered a two-layer network.)

Assume we're trying to fit some binary data and the target is *y* = 1. We'll start with the forward pass, first calculating the input to the hidden unit

and the output of the hidden unit

Using this as the input to the output unit, the output of the network is

With the network output, we can start the backwards pass to calculate the weight updates for both layers. The error term for the output unit is

Now that we have the errors, we can calculate the gradient descent steps. The hidden to output weight step is the learning rate, times the output unit error, times the hidden unit activation value.

Then, for the input to hidden weights *w_i*, it’s the learning rate times the hidden unit error, times the input values.
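The whole forward and backward pass above can be sketched in code. Note that the input values, weights, learning rate, and target below are assumed example numbers, not the figure’s actual values:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Assumed example values (the original figure's numbers are not in the text)
x = np.array([0.1, 0.3])           # two input values
w_hidden = np.array([0.4, -0.2])   # input -> hidden weights
w_output = 0.1                     # hidden -> output weight
target = 1.0                       # y = 1
learnrate = 0.5

# Forward pass
h_in = np.dot(x, w_hidden)         # input to the hidden unit
h_out = sigmoid(h_in)              # output of the hidden unit
y_hat = sigmoid(w_output * h_out)  # output of the network

# Backward pass
error = target - y_hat
# output error term: error times the sigmoid gradient f'(h) = f(h)(1 - f(h))
delta_out = error * y_hat * (1 - y_hat)
# hidden error term: output error scaled back through the weight, times the gradient
delta_hidden = delta_out * w_output * h_out * (1 - h_out)

# weight steps: learning rate * error term * inputs to that layer
dw_output = learnrate * delta_out * h_out
dw_hidden = learnrate * delta_hidden * x
```

Notice how small `dw_hidden` comes out relative to `dw_output`: the hidden-layer step has been squeezed through two sigmoid gradients.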

From this example, you can see one of the effects of using the sigmoid function for the activations. The maximum derivative of the sigmoid function is 0.25, so the errors in the output layer get reduced by at least 75%, and errors in the hidden layer are scaled down by at least 93.75%! You can see that if you have a lot of layers, using a sigmoid activation function will quickly reduce the weight steps to tiny values in layers near the input. This is known as the **vanishing gradient** problem. Later in the course you’ll learn about other activation functions that perform better in this regard and are more commonly used in modern network architectures.
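The two scaling claims above are easy to verify numerically; this is just a quick sanity-check sketch, not part of the original notebook:

```python
import numpy as np

def sigmoid_prime(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# The sigmoid derivative peaks at 0.25 (at x = 0)
xs = np.linspace(-10, 10, 10001)
max_grad = sigmoid_prime(xs).max()

# Backpropagating through n sigmoid layers scales errors by at most 0.25**n:
# 1 layer -> reduced by at least 75%, 2 layers -> by at least 93.75%
one_layer_reduction = 1 - 0.25       # 0.75
two_layer_reduction = 1 - 0.25 ** 2  # 0.9375
```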

hands-on lab –> http://14.232.166.121:8880/lab? > deeplearning>implement_gradient_descent> backprop_basic.ipynb & backprop.ipynb

Backpropagation is fundamental to deep learning. TensorFlow and other libraries will perform the backprop for you, but you should really *really* understand the algorithm. We’ll be going over backprop again, but here are some extra resources for you:

- From Andrej Karpathy: Yes, you should understand backprop
- Also from Andrej Karpathy, a lecture from Stanford’s CS231n course

So let’s start with two questions: what is deep learning, and what is it used for? The answer to the second question is pretty much everywhere. Recent applications include things such as beating humans at games such as Go or even Jeopardy, detecting spam in emails, forecasting stock prices, recognizing images in a picture, and even diagnosing illnesses, sometimes with more precision than doctors. And what is at the heart of deep learning? This wonderful object called neural networks. Neural networks vaguely mimic the process of how the brain operates, with neurons that fire bits of information. As a matter of fact, the first time I heard of a neural network, this is the image that came into my head: some scary robot with an artificial brain.

But then, I got to learn a bit more about neural networks and I realized that they are actually a lot scarier than that. This is how a neural network looks.

But after looking at neural networks for a while, I realized that they’re actually a lot simpler than that. When I think of a neural network, this is actually the image that comes to my mind. There is a child playing in the sand, with some red and blue shells, and we are the child. Can you draw a line that separates the red and the blue shells? And the child draws this line. That’s it. That’s what a neural network does.

Given some data in the form of blue or red points, the neural network will look for the best line that separates them. And if the data is a bit more complicated like this one over here, then we’ll need a more complicated algorithm. Here, a deep neural network will do the job and find a more complex boundary that separates the points.

So, let’s start with one classification example. Let’s say we are the admissions office at a university and our job is to accept or reject students. So, in order to evaluate students, we have two pieces of information, the results of a test and their grades in school. So, let’s take a look at some sample students. We’ll start with Student 1 who got 9 out of 10 in the test and 8 out of 10 in the grades. That student did quite well and got accepted. Then we have Student 2 who got 3 out of 10 in the test and 4 out of 10 in the grades, and that student got rejected. And now, we have a new Student 3 who got 7 out of 10 in the test and 6 out of 10 in the grades, and we’re wondering if the student gets accepted or not.

So, our first way to find this out is to plot students in a graph with the horizontal axis corresponding to the score on the test and the vertical axis corresponding to the grades, and the students would fit here.

These are all the previous students who got accepted or rejected. The blue points correspond to students that got accepted, and the red points to students that got rejected. So we can see in this diagram that the students who did well in the test and grades are more likely to get accepted, and the students who did poorly in both are more likely to get rejected. Does Student 3 get accepted or rejected? You can figure it out by yourself.

It seems that this data can be nicely separated by a line, which is this line over here, and it seems that most students over the line get accepted and most students under the line get rejected. So this line is going to be our model. And now a question arises: how do we find this line?

So, first let’s add some math. We’re going to label the horizontal axis corresponding to the test by the variable x1, and the vertical axis corresponding to the grades by the variable x2. So this boundary line that separates the blue and the red points is going to have a linear equation. The one drawn has equation 2x1 + x2 − 18 = 0. What does this mean? This means that our method for accepting or rejecting students simply says the following: take this equation as our score, where the score is 2·test + grades − 18. Now when a student comes in, we check their score. If their score is a positive number, then we accept the student, and if the score is a negative number, then we reject the student. This is called a prediction. And that’s it. That linear equation is our model.
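The scoring rule can be sketched in a few lines of Python (the function name `predict` is just for illustration):

```python
def predict(test, grades):
    """Accept if the score 2*test + grades - 18 is positive, reject if negative."""
    score = 2 * test + grades - 18
    return "accept" if score >= 0 else "reject"

# The three sample students from the text:
# Student 1: test 9, grades 8 -> score 8  -> accept
# Student 2: test 3, grades 4 -> score -8 -> reject
# Student 3: test 7, grades 6 -> score 2  -> accept
```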

In the more general case, our boundary will be an equation of the form w1x1 + w2x2 + b = 0. We’ll abbreviate this equation in vector notation as Wx + b = 0, where W is the vector (w1, w2) and x is the vector (x1, x2), and we simply take the dot product of the two vectors. We’ll refer to x as the input, to W as the weights, and to b as the bias. Now, for a student with coordinates (x1, x2), we’ll denote the label as y, and the label is what we’re trying to predict. If the student gets accepted, namely the point is blue, then the label is 1. And if the student gets rejected, namely the point is red, then the label is 0.

Now we’ll introduce the notion of a perceptron, which is the building block of neural networks, and it’s just an encoding of our equation into a small graph. The way we’ve built it is the following: here we have our data and our boundary line, and we fit it inside a node. And now we add small nodes for the inputs, which in this case are the test and the grades. Here we can see an example where test equals 7 and grades equals 6. What the perceptron does is plot the point (7, 6) and check if the point is in the positive or negative area. If the point is in the positive area, then it returns a yes. And if it is in the negative area, it returns a no.

So we had a question we’re trying to answer and the question is, how do we find this line that separates the blue points from the red points in the best possible way? Let’s answer this question by first looking at a small example with three blue points and three red points. And we’re going to describe an algorithm that will find the line that splits these points properly. So the computer doesn’t know where to start. It might as well start at a random place by picking a random linear equation. This equation will define a line and a positive and negative area given in blue and red respectively.

What we’re going to do is to look at how badly this line is doing and then move it around to try to get better and better. Now the question is, how do we find how badly this line is doing? So let’s ask all the points. Here we have four points that are correctly classified. They are these two blue points in the blue area and these two red points in the red area. And these points are correctly classified, so they say, “I’m good.”

And then we have these two points that are incorrectly classified. That’s this red point in the blue area and this blue point in the red area. We want to get as much information from them as possible, so we want them to tell us something that lets us improve this line. So what is it that they can tell us? Well, consider this: if you’re in the wrong area, you would like the line to move over you in order to end up in the right area. Thus, the misclassified points say, “Come closer!” so the line can move towards them and eventually classify them correctly.

Now, we finally have all the tools for describing the perceptron algorithm. We start with the random equation, which will determine some line, and two regions, the positive and the negative region. Now, we’ll move this line around to get a better and better fit. So, we ask all the points how they’re doing. The four correctly classified points say, “I’m good.” And the two incorrectly classified points say, “Come closer.”

Let’s actually write the pseudocode for this perceptron algorithm.
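One possible sketch of that pseudocode in Python, using the perceptron trick of nudging the line toward each misclassified point (the function name `perceptron_step` and the 0.01 learning rate are illustrative choices, not from the lesson):

```python
import numpy as np

def perceptron_step(X, y, w, b, learn_rate=0.01):
    """One pass of the perceptron trick over all points.

    X: (n_points, 2) inputs, y: 0/1 labels, w: weight vector, b: bias.
    """
    for xi, yi in zip(X, y):
        pred = 1 if np.dot(w, xi) + b >= 0 else 0
        if yi == 1 and pred == 0:    # blue point in the red area: come closer
            w = w + learn_rate * xi
            b = b + learn_rate
        elif yi == 0 and pred == 1:  # red point in the blue area: come closer
            w = w - learn_rate * xi
            b = b - learn_rate
    return w, b
```

Repeating `perceptron_step` many times moves the line until (for separable data) every point says “I’m good.”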

Here is an intuitive picture of minimizing the error function. We’re standing on top of a mountain, Mount “Errorest,” and we want to descend, but it’s not that easy because it’s cloudy and the mountain is very big, so we can’t really see the big picture. What we’ll do to go down is look around us and consider all the possible directions in which we can walk. Then we pick the direction that makes us descend the most. Let’s say it’s this one over here. So we take a step in that direction; thus, we’ve decreased the height. Once we take the step, we start the process again and again, always decreasing the height, until we go all the way down the mountain, minimizing the height. In this case the key metric that we use to solve the problem is the height.

There’s a small problem with this approach. In our algorithms we’ll be taking very small steps, and the reason for that is calculus, because our tiny steps will be calculated by derivatives. So what happens if we take very small steps here? We start with two errors, then move a tiny amount, and we’re still at two errors. Then we move a tiny amount again, and we’re still at two errors. Another tiny amount, and we’re still at two, again and again. So there’s not much we can do here. This is equivalent to using gradient descent to try to descend from an Aztec pyramid with flat steps. If we’re standing on the second step, with two errors, and we look around ourselves, we’ll always see two errors, and we’ll get confused and not know what to do. On the other hand, on Errorest we can detect very small variations in height and figure out in what direction it decreases the most. In math terms this means that in order for us to do gradient descent, our error function cannot be discrete; it should be continuous.

Let’s switch to a different example for a moment. Let’s say we have a model that will predict if you receive a gift or not. The model makes predictions in the following way: it says the probability that you get a gift is 0.8, which automatically implies that the probability that you don’t receive a gift is 0.2. And what does the model do? It takes some inputs, for example: is it your birthday, or have you been good all year? Based on those inputs, it calculates a linear model, which gives a score. Then, the probability that you get the gift is simply the sigmoid function applied to that score.

Now, what if you had more options than just getting a gift or not? Let’s say we have a model that tells us what animal we just saw, and the options are a duck, a beaver, and a walrus. We want a model that gives an answer along the lines of: the probability of a duck is 0.67, the probability of a beaver is 0.24, and the probability of a walrus is 0.09. Notice that the probabilities need to add to one.

Let’s say we have a linear model based on some inputs. The inputs could be: does it have a beak or not? Number of teeth. Number of feathers. Hair or no hair. Does it live in the water? Does it fly? Etc. We calculate a linear function based on those inputs, and let’s say we get some scores: the duck gets a score of two, the beaver gets a score of one, and the walrus gets a score of zero. Now the question is, how do we turn these scores into probabilities? We take the exponential of each score and divide it by the sum of all the exponentials of the scores. This is the softmax function.
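The recipe just described is the softmax function; a minimal sketch with the scores from the example:

```python
import numpy as np

def softmax(scores):
    """Exponentiate each score and divide by the sum of the exponentials."""
    exp = np.exp(scores)
    return exp / exp.sum()

# duck = 2, beaver = 1, walrus = 0, as in the example above
probs = softmax([2.0, 1.0, 0.0])
```

With these scores the probabilities come out to roughly 0.67, 0.24, and 0.09, matching the duck/beaver/walrus example, and they always sum to one.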

Now let’s say our classes are Duck, Beaver, and Walrus. What variable do we input into the algorithm? Maybe we could input a 0, a 1, and a 2, but that would not work because it would assume dependencies between the classes that we can’t have. So this is what we do: we come up with one variable for each of the classes, so our table becomes like this. This is called one-hot encoding.
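A sketch of this one-variable-per-class encoding (the helper name `one_hot` is illustrative):

```python
def one_hot(label, classes):
    """One variable per class: 1 for the class that is present, 0 otherwise."""
    return [1 if c == label else 0 for c in classes]

classes = ["duck", "beaver", "walrus"]
```

So a beaver becomes `[0, 1, 0]`: each class gets its own independent variable instead of sharing one ordered number.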

So we’re still in our quest for an algorithm that will help us pick the best model that separates our data. Well, since we’re dealing with probabilities then let’s use them in our favor. Let’s say I’m a student and I have two models. One that tells me that my probability of getting accepted is 80% and one that tells me the probability is 55%. Which model looks more accurate? Well, if I got accepted then I’d say the better model is probably the one that says 80%. What if I didn’t get accepted? Then the more accurate model is more likely the one that says 55 percent. So let me be more specific. Let’s look at the following four points: two blue and two red and two models that classify them, the one on the left and the one on the right. Quick. Which model looks better?

The model on the right is much better, since it classifies all four points correctly, whereas the model on the left gets two points correct and two points incorrect. But let’s see why the model on the right is better from the probability perspective. By that, we’ll show that the arrangement on the right is much more likely to happen than the one on the left.

Let’s recall that our prediction is ŷ = σ(Wx+b), and that is precisely the probability of a point being labeled positive, which means blue. So for the points in the figure, let’s say the model tells us that the probabilities of being blue are 0.9, 0.6, 0.3, and 0.2. Notice that the points in the blue region are much more likely to be blue, and the points in the red region are much less likely to be blue. Now, if we assume that the colors of the points are independent events, then the probability for the whole arrangement is the product of the probabilities of the four points. As we saw, the model on the left tells us that the probability of these points being those colors is 0.0084.

If we do the same thing for the model on the right, let’s say we get that the probabilities of the two points on the right being blue are 0.7 and 0.9, and of the two points on the left being red are 0.8 and 0.6. When we multiply these we get 0.3024, which is around 30%. This is much higher than 0.0084. Thus, we confirm that the model on the right is better, because it makes the arrangement of the point colors much more likely.

So what do we do now? We start from the bad model, calculate the probability that the points are those colors, and multiply them to obtain the total probability of 0.0084. If we just had a way to maximize this probability, we could increase it all the way to 0.3024. Thus, our new goal becomes precisely that: to maximize this probability. This method, as we stated before, is called maximum likelihood.

Well, we’re getting somewhere now. We’ve concluded that the probability is important, and that the better model will give us a better probability. Now the question is, how do we maximize the probability? Also, if we remember correctly, we were talking about an error function and how minimizing this error function will take us to the best possible solution. Could these two things be connected? Could we obtain an error function from the probability? Could it be that maximizing the probability is equivalent to minimizing the error function? Maybe. So, a quick recap: we have two models, the bad one on the left and the good one on the right.

And the way to tell they’re bad or good is to calculate the probability of each point being the color it is according to the model, multiply these probabilities in order to obtain the probability of the whole arrangement, and then check that the model on the right gives us a much higher probability than the model on the left. Now all we need to do is maximize this probability. But the probability is a product of numbers, and products are hard. Maybe this product of four numbers doesn’t look so scary, but what if we had thousands of data points? That would correspond to a product of thousands of numbers, all of them between zero and one. This product would be very tiny, something like 0.0000-something, and we definitely want to stay away from those numbers. Also, if I have a product of thousands of numbers and I change one of them, the product can change drastically. In summary, we really want to stay away from products. And what’s better than products? Sums!

The logarithm has this very nice identity that says that the logarithm of the product of A and B is the sum of the logarithms of A and B. So this is what we do: we take our product and apply the logarithm, so now we get a sum of the logarithms of the factors. So ln(0.6 × 0.2 × 0.1 × 0.7) is equal to ln(0.6) + ln(0.2) + ln(0.1) + ln(0.7).

We can calculate those values and get −0.51, −1.61, −2.30, and −0.36. Notice that they are all negative numbers, and that actually makes sense: the logarithm of a number between 0 and 1 is always a negative number, since the logarithm of one is zero. So it actually makes sense to take the negative of the logarithm of the probabilities, which gives us positive numbers. That’s what we’ll do. This sum of the negatives of the logarithms of the probabilities is what we’ll call the cross-entropy.
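A quick check of these numbers, showing that summing negative logs carries the same information as multiplying the probabilities:

```python
import math

# Each point's probability of being the color it actually is (left model)
probs = [0.6, 0.2, 0.1, 0.7]

product = math.prod(probs)                        # the arrangement probability, 0.0084
cross_entropy = -sum(math.log(p) for p in probs)  # sum of negative logs
```

Because ln turns products into sums, `cross_entropy` is exactly `-math.log(product)`: maximizing the product is the same as minimizing the cross-entropy.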

Let’s look a bit closer into cross-entropy by switching to a different example. Let’s say we have three doors. (And no, this is not the Monty Hall problem.) We have the green door, the red door, and the blue door, and behind each door we could have a gift or not. The probabilities of there being a gift behind each door are 0.8 for the first one, 0.7 for the second one, and 0.1 for the third one. So, for example, behind the green door there is an 80 percent probability of there being a gift, and a 20 percent probability of there not being a gift.

We can put this information in a table where the probabilities of there being a gift are given in the top row, and the probabilities of there not being a gift are given in the bottom row. Now let’s say we want to make a bet on the outcomes, so we want to figure out the most likely scenario, and for that we’ll assume the doors are independent events. In this case, the most likely scenario is obtained by picking the largest probability in each column. The first door is more likely to have a gift than not, so we’ll say there’s a gift behind the first door. For the second door, it’s also more likely that there’s a gift, so we’ll say there’s a gift behind the second door. And for the third door it’s much more likely that there’s no gift, so we’ll say there’s no gift behind the third door. Since the events are independent, the probability for this whole arrangement is the product of the three probabilities, 0.8 times 0.7 times 0.9, which ends up being 0.504, roughly 50 percent.

We learned that the sum of the negatives of the logarithms of the probabilities is the cross-entropy. So let’s go ahead and calculate the cross-entropy for each scenario. Notice that the scenarios with high probability have low cross-entropy, and the scenarios with low probability have high cross-entropy. For example, the second row, which has probability 0.504, gives a small cross-entropy of 0.69, and the second-to-last row, which is very unlikely with a probability of 0.006, gives a cross-entropy of 5.12.

So when we calculate the cross-entropy, we get the negative of the logarithm of the product, which is a sum of the negatives of the logarithms of the factors. Now that was when we had two classes namely receiving a gift or not receiving a gift. What happens if we have more classes? Let’s take a look. So we have a similar problem. We still have three doors. And this problem is still not the Monty Hall problem. Behind each door there can be an animal, and the animal can be of three types. It can be a duck, it can be a beaver, or it can be a walrus. So let’s look at this table of probabilities.

So let’s look at a sample scenario. Let’s say we have our three doors, and behind the first door there’s a duck, behind the second door there’s a walrus, and behind the third door there’s also a walrus. Recall that the probabilities are given by the table: a duck behind the first door is 0.7 likely, a walrus behind the second door is 0.3 likely, and a walrus behind the third door is 0.4 likely. The probability of obtaining these three animals is the product of the probabilities of the three events, since they are independent, which in this case is 0.084. And as we learned, the cross-entropy here is given by the sum of the negatives of the logarithms of the probabilities, which is 2.48. But we want a formula, so let’s put some variables here.

So P11 is the probability of finding a duck behind door one, P12 is the probability of finding a duck behind door two, etc. And let’s have the indicator variables: Y1j is 1 if there’s a duck behind door j, Y2j is 1 if there’s a beaver behind door j, and Y3j is 1 if there’s a walrus behind door j; these variables are zero otherwise. The formula for the cross-entropy is then CE = −Σij Yij ln(Pij), as shown in the previous picture.
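A sketch of this multiclass cross-entropy formula applied to the animal scenario. Only the three probabilities 0.7, 0.3, and 0.4 come from the example above; the rest of the table is assumed, filled in just so each door’s column sums to 1:

```python
import numpy as np

def cross_entropy(Y, P):
    """CE = -sum over i, j of Y[i][j] * ln(P[i][j]), with one-hot indicators Y."""
    Y, P = np.asarray(Y, dtype=float), np.asarray(P, dtype=float)
    return float(-np.sum(Y * np.log(P)))

# Assumed probability table: rows = duck, beaver, walrus; columns = doors 1..3
P = [[0.7, 0.3, 0.1],
     [0.2, 0.4, 0.5],
     [0.1, 0.3, 0.4]]

# Scenario from the text: duck behind door 1, walrus behind doors 2 and 3
Y = [[1, 0, 0],
     [0, 0, 0],
     [0, 1, 1]]
```

Only the entries where Yij = 1 contribute, so this picks out −ln(0.7) − ln(0.3) − ln(0.4) ≈ 2.48, matching the example.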

So this is a good time for a quick recap of the last couple of lessons. Here we have two models: the bad model on the left and the good model on the right. For each one of those we calculate the cross-entropy, which is the sum of the negatives of the logarithms of the probabilities of the points being their colors. We conclude that the one on the right is better because its cross-entropy is much smaller.

So let’s actually calculate the formula for the error function. Let’s split into two cases, the first case being when y = 1. When the point is blue to begin with, the model tells us that the probability of being blue is the prediction ŷ. So for these two points the probabilities are 0.6 and 0.2. As we can see, the point in the blue area has a higher probability of being blue than the point in the red area. Our error is simply the negative logarithm of this probability, so it’s precisely −ln(ŷ). In the figure it’s −ln(0.6) and −ln(0.2).

Now if y = 0, so when the point is red, we need to calculate the probability of the point being red. The probability of the point being red is one minus the probability of it being blue, which is precisely 1 − ŷ. So the error is the negative logarithm of this probability, which is −ln(1 − ŷ). We can summarize these two formulas into one: Error = −(1 − y) ln(1 − ŷ) − y ln(ŷ).
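The summarized formula can be written directly as a function; a minimal sketch:

```python
import math

def error(y, y_hat):
    """Per-point error: -(1 - y) * ln(1 - y_hat) - y * ln(y_hat)."""
    return -(1 - y) * math.log(1 - y_hat) - y * math.log(y_hat)
```

For y = 1 the first term vanishes and the error is −ln(ŷ); for y = 0 the second term vanishes and the error is −ln(1 − ŷ), exactly the two cases above.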

So now our goal is to minimize the error function, and we’ll do it as follows. We start with some random weights, which give us the prediction σ(Wx+b). As we saw, that also gives us an error function given by this formula. Remember that the summands are also error functions for each point; each point gives us a larger error if it’s misclassified and a smaller one if it’s correctly classified. The way we’re going to minimize this function is to use gradient descent. So here’s Errorest, and this is us, and we’re going to try to jiggle the line around to see how we can decrease the error function. The error function is the height, which is E(W, b), where W and b are the weights and bias. What we’ll do is use gradient descent to get to the bottom of the mountain at a much smaller height, which gives us a smaller error function E(W′, b′).

In the last few videos, we learned that in order to minimize the error function, we need to take some derivatives. So let’s get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely, σ′(x) = σ(x)(1 − σ(x)).

The reason for this is the following; we can calculate it using the quotient formula: σ′(x) = d/dx [1/(1 + e^(−x))] = e^(−x)/(1 + e^(−x))² = σ(x)(1 − σ(x)).

The error formula is E = −(1/m) Σᵢ [yᵢ ln(ŷᵢ) + (1 − yᵢ) ln(1 − ŷᵢ)],

where the prediction is given by ŷ = σ(Wx + b) for each point.

Our goal is to calculate the gradient of E at a point x = (x1, …, xn), given by the partial derivatives ∇E = (∂E/∂w1, …, ∂E/∂wn, ∂E/∂b).

To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply, E = −y ln(ŷ) − (1 − y) ln(1 − ŷ).

In order to calculate the derivative of this error with respect to the weights, we’ll first calculate ∂ŷ/∂wj. Using the sigmoid derivative above, ∂ŷ/∂wj = ŷ(1 − ŷ) · xj.

Now, we can go ahead and calculate the derivative of the error E at a point x: ∂E/∂wj = −(y − ŷ) xj.

A similar calculation will show us that ∂E/∂b = −(y − ŷ).

Therefore, since the gradient descent step simply consists in subtracting a multiple of the gradient of the error function at every point, this updates the weights in the following way: wi′ ← wi − α[−(y − ŷ) xi],

which is equivalent to wi′ ← wi + α(y − ŷ) xi.

Similarly, it updates the bias in the following way: b′ ← b + α(y − ŷ).
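Putting the last two updates together, one gradient descent step for a single point can be sketched as follows (the 0.1 learning rate is an arbitrary illustrative choice):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gradient_descent_step(x, y, w, b, learnrate=0.1):
    """Update w_i <- w_i + alpha * (y - y_hat) * x_i and b <- b + alpha * (y - y_hat)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    w = w + learnrate * (y - y_hat) * x
    b = b + learnrate * (y - y_hat)
    return w, b
```

After a step, the prediction for that point moves toward its label, which is exactly what lowers the per-point error.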

Now, we’ve been dealing a lot with data sets that can be separated by a line, but as you can imagine, the real world is much more complex than that. This is where neural networks can show their full potential. So, let’s go back to this example where we saw some data that is not linearly separable.

So a line cannot divide these red and blue points, and we looked at some solutions; if you remember, the one we considered most seriously was this curve over here. What I’ll teach you now is how to find this curve, and it’s very similar to before. We’ll still use gradient descent. In a nutshell, for this data which is not separable with a line, we’re going to create a probability function where the points in the blue region are more likely to be blue and the points in the red region are more likely to be red. The curve that separates them is the set of points which are equally likely to be blue or red. Everything will be the same as before, except this equation won’t be linear, and that’s where neural networks come into play.

Now I’m going to show you how to create these nonlinear models. What we’re going to do is a very simple trick: we’re going to combine two linear models into a nonlinear model. Visually it looks like this: the two models superimposed create the model on the right. It’s almost like we’re doing arithmetic on models. It’s like saying, “This line plus this line equals that curve.”

Let me show you how to do this mathematically. A linear model, as we know, is a whole probability space. This means that for every point it gives us the probability of the point being blue. So, for example, this point over here is in the blue region, so its probability of being blue is 0.7. The same point, in the second probability space, is also in the blue region, so its probability of being blue is 0.8. Now the question is, how do we combine these two? Well, the simplest way to combine two numbers is to add them, right? So 0.8 plus 0.7 is 1.5. But now this doesn’t look like a probability anymore, since it’s bigger than one, and probabilities need to be between 0 and 1. So what can we do? How do we turn this number that is larger than 1 into something between 0 and 1? Well, we’ve been in this situation before, and we have a pretty good tool that turns every number into something between 0 and 1: the sigmoid function. So that’s what we’re going to do. We apply the sigmoid function to 1.5 to get the value 0.82, and that’s the probability of this point being blue in the resulting probability space.
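A sketch of this combination trick with the numbers from the example:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Probabilities of the same point being blue under two linear models
p1, p2 = 0.7, 0.8

# Add them, then squash the result back into (0, 1) with the sigmoid
combined = sigmoid(p1 + p2)  # sigmoid(1.5)
```

The sum 1.5 is not a valid probability, but `sigmoid(1.5)` ≈ 0.82 is, which is the combined model’s probability for that point.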

Neural networks have a certain special architecture with layers. The first layer is called the input layer, which contains the inputs, in this case, x1 and x2. The next layer is called the hidden layer, which is a set of linear models created with this first input layer. And then the final layer is called the output layer, where the linear models get combined to obtain a nonlinear model.

You can have different architectures. For example, here’s one with a larger hidden layer. Now we’re combining three linear models to obtain the triangular boundary in the output layer. Now what happens if the input layer has more nodes? For example, this neural network has three nodes in its input layer. Well, that just means we’re not living in two-dimensional space anymore. We’re living in three-dimensional space, and now our hidden layer, the one with the linear models, just gives us a bunch of planes in three space, and the output layer bounds a nonlinear region in three space.

So if our model is telling us if an image is a cat or dog or a bird, then we simply have each node in the output layer output a score for each one of the classes: one for the cat, one for the dog, and one for the bird. And finally, and here’s where things get pretty cool, what if we have more layers? Then we have what’s called a deep neural network. Now what happens here is our linear models combine to create nonlinear models and then these combine to create even more nonlinear models. In general, we can do this many times and obtain highly complex models with lots of hidden layers. This is where the magic of neural networks happens.

So now that we have defined what neural networks are, we need to learn how to train them. Training them really means finding what parameters they should have on the edges in order to model our data well. So in order to learn how to train them, we need to look carefully at how they process the input to obtain an output. The perceptron is defined by a linear equation, say w1x1 + w2x2 + b, where w1 and w2 are the weights on the edges and b is the bias in the node. Here, w1 is bigger than w2, so we’ll denote that by drawing the edge labelled w1 much thicker than the edge labelled w2. Now, what the perceptron does is plot the point (x1, x2) and output the probability that the point is blue. Here the point is in the red area, so the output is a small number, since the point is not very likely to be blue. This process is known as feedforward.

So, our goal is to train our neural network. In order to do this, we have to define the error function. So let’s look again at what the error function was for perceptrons. Here’s our perceptron. On the left, we have our input vector with entries x_1 up to x_n, and a 1 for the bias unit. Then the edges with weights W_1 up to W_n, and b for the bias unit. Finally, we can see that this perceptron uses a sigmoid function, and the prediction is defined as ŷ = σ(Wx + b). As we saw, the error function gives us a measure of how badly each point is being classified.

So, what are we going to do to define the error function in a multilayer perceptron? Well, as we saw, our prediction is simply a combination of matrix multiplications and sigmoid functions. But the error function can be the exact same thing, right? It can be the exact same formula.

On your left, you have a single perceptron with the input vector, the weights and the bias, and the sigmoid function inside the node. On the right, we have a formula for the prediction, which is the sigmoid function of the linear function of the input. And below, we have a formula for the error, which is the average over all points of the blue term for the blue points and the red term for the red points. In order to descend from Errorest, we calculate the gradient, and the gradient is simply the vector formed by all the partial derivatives of the error function with respect to the weights w1 up to wn and the bias b.

And what do we do in a multilayer perceptron? Well, this time it’s a little more complicated, but it’s pretty much the same thing.

If we want to write this more formally, we recall that the prediction is a composition of sigmoids and matrix multiplications, where these are the matrices, and the gradient is just going to be formed by all these partial derivatives. Here, it looks like a matrix, but in reality it’s just a long vector. And gradient descent is going to do the following:

So before we start calculating derivatives, let’s do a refresher on the chain rule, which is the main technique we’ll use to calculate them. The chain rule says, if you have a variable x and a function f that you apply to x to get f of x, which we’re gonna call A, and then another function g, which you apply to f of x to get g of f of x, which we’re gonna call B, then if you want to find the partial derivative of B with respect to x, that’s just the partial derivative of B with respect to A times the partial derivative of A with respect to x.
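To see the chain rule in action, here is a small numeric check with hypothetical functions f(x) = x² and g(a) = sin(a), comparing the chain-rule derivative against a finite-difference approximation:

```python
import math

# f(x) = x**2, g(a) = sin(a); A = f(x), B = g(A).
x = 1.5
A = x ** 2
B = math.sin(A)

# Chain rule: dB/dx = dB/dA * dA/dx = cos(A) * 2x
dB_dA = math.cos(A)
dA_dx = 2 * x
analytic = dB_dA * dA_dx

# Check against a finite-difference approximation of dB/dx.
eps = 1e-6
numeric = (math.sin((x + eps) ** 2) - math.sin((x - eps) ** 2)) / (2 * eps)
```

The two values agree to several decimal places, which is exactly what the chain rule guarantees.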

So, let us go back to our neural network with our weights and our input. And recall that the weights with superscript 1 belong to the first layer, and the weights with superscript 2 belong to the second layer, so that we can have everything in matrix notation.

And now what happens with the input? So, let us do the feedforward process. In the first layer, we take the input and multiply it by the weights and that gives us h1, which is a linear function of the input and the weights. Same thing with h2, given by this formula over here. Now, in the second layer, we would take this h1 and h2 and the new bias, apply the sigmoid function, and then apply a linear function to them by multiplying them by the weights and adding them to get a value of h. And finally, in the third layer, we just take a sigmoid function of h to get our prediction or probability between 0 and 1, which is ŷ.
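The feedforward pass just described can be sketched as follows; the weight values here are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights: W1 is the first layer, W2 the second layer.
W1 = np.array([[0.5, -0.2],   # weights from x1 into h1 and h2
               [0.3,  0.8]])  # weights from x2 into h1 and h2
b1 = np.array([0.1, -0.1])
W2 = np.array([0.4, -0.6])    # weights from sigmoid(h1), sigmoid(h2) into h
b2 = 0.05

x = np.array([1.0, 2.0])

h = x @ W1 + b1               # first layer: linear function of the input -> (h1, h2)
a = sigmoid(h)                # apply the sigmoid to each hidden unit
y_hat = sigmoid(a @ W2 + b2)  # second layer: linear function, then a final sigmoid
```

`y_hat` is the prediction: a probability between 0 and 1.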

Now, we are going to develop backpropagation, which is precisely the reverse of feedforward. So, we are going to calculate the derivative of this error function with respect to each of the weights in the layers by using the chain rule. So, let us recall that our error function is this formula over here, which is a function of the prediction ŷ. But, since the prediction is a function of all the weights wij, then the error function can be seen as a function of all the wij. Therefore, the gradient is simply the vector formed by all the partial derivatives of the error function E with respect to each of the weights. So, let us calculate one of these derivatives.
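As a sketch of this idea, here is one such derivative computed via the chain rule for the second-layer weights, verified with a finite-difference check; the tiny network and its weights are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Tiny two-layer network with made-up weights.
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([0.4, -0.6])
b2 = 0.05
x = np.array([1.0, 2.0])
y = 1.0  # true label

def forward(W2):
    a = sigmoid(x @ W1 + b1)
    y_hat = sigmoid(a @ W2 + b2)
    return a, y_hat

a, y_hat = forward(W2)
# Cross-entropy error E = -(y log y_hat + (1-y) log(1-y_hat)).
# By the chain rule, dE/dW2 simplifies to (y_hat - y) * a for a sigmoid output.
grad_W2 = (y_hat - y) * a

# Finite-difference check on the first component of W2.
def error(W2):
    _, p = forward(W2)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

eps = 1e-6
numeric = (error(W2 + np.array([eps, 0.0])) - error(W2 - np.array([eps, 0.0]))) / (2 * eps)
```

The analytic gradient component `grad_W2[0]` matches the numeric estimate, which is the sanity check one would do when deriving backpropagation by hand.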

We have a hands-on lab and assignment; take a look at our Jupyter server at

http://14.232.166.121:8880/lab? > deeplearning/intro-neural-networks

The term Deep Learning refers to training Neural Networks, sometimes very large Neural Networks. So what exactly is a Neural Network? In this video, let’s try to give you some of the basic intuitions. Let’s start with the Housing Price Prediction example. Let’s say you have a dataset with six houses, so you know the size of the houses in square feet or square meters and you know the price of the house, and you want to fit a function to predict the price of a house as a function of its size. So if you are familiar with linear regression you might say, well, let’s fit a straight line to these data, and we get a straight line like that.

So you can think of this function that you’ve just fit to the housing prices as a very simple neural network. We have as the input to the neural network the size of a house, which we call x. It goes into this node, this little circle, and then it outputs the price, which we call y. So this little circle, which is a single neuron in a neural network, implements this function that we drew on the left.

And all the neuron does is it inputs the size, computes this linear function, takes a max with zero, and then outputs the estimated price. Let’s say that instead of predicting the price of a house just from the size, you now have other features. You know other things about the house, such as the number of bedrooms, and you might think that one of the things that really affects the price of a house is family size, right? So can this house fit your family of three, or family of four, or family of five? And it’s really the size in square feet or square meters, and the number of bedrooms, that determine whether or not a house can fit your family’s size. And then maybe you know the zip code, in different countries it’s called a postal code, of a house. So based on the size and number of bedrooms, you can estimate the family size; based on the zip code, the walkability; and based on zip code and wealth, you can estimate the school quality. And then finally you might think that the way people decide how much they’re willing to pay for a house is they look at the things that really matter to them, in this case family size, walkability, and school quality, and that helps you predict the price.
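That single neuron, a linear function followed by a max with zero (a ReLU), can be sketched like this; the coefficients are made up for illustration:

```python
def relu_neuron(size, w, b):
    # Linear function of the size, then max with zero: a price can't go negative.
    z = w * size + b
    return max(z, 0.0)

# Hypothetical fit: price rises with size, clipped at zero for tiny houses.
price = relu_neuron(size=2000, w=0.1, b=-50)  # a positive estimated price
small = relu_neuron(size=100, w=0.1, b=-50)   # linear part is negative, so output is 0.0
```

The flat-then-rising shape of this function is exactly the "straight line taken with a max of zero" drawn in the lecture.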

And so by stacking together a few of the single neurons or the simple predictors we have from the previous slide, we now have a slightly larger neural network. Part of the magic of a neural network is that when you implement it, you need to give it just the input x and the output y for a number of examples in your training set, and all these things in the middle, it will figure out by itself.

So, that’s a basic neural network. It turns out that as you build out your own neural networks, you’ll probably find them to be most useful, most powerful, in supervised learning settings, meaning that you’re trying to take an input x and map it to some output y, like we just saw in the housing price prediction example.

There’s been a lot of hype about neural networks. And perhaps some of that hype is justified, given how well they’re working. But it turns out that so far, almost all the economic value created by neural networks has been through one type of machine learning, called supervised learning. Let’s see what that means, and let’s go over some examples. In supervised learning, you have some input x, and you want to learn a function mapping to some output y. So for example, just now we saw the housing price prediction application where you input some features of a home and try to output or estimate the price y

Computer vision has also made huge strides in the last several years, mostly due to deep learning. So you might input an image and want to output an index, say from 1 to 1,000, trying to tell you which of, say, 1,000 different categories this picture belongs to. So, you might use that for photo tagging. I think the recent progress in speech recognition has also been very exciting, where you can now input an audio clip to a neural network and have it output a text transcript. For image applications we often use convolutional neural networks, often abbreviated CNN. And for sequence data: for example, audio has a temporal component, right? Audio is played out over time, so audio is most naturally represented as a one-dimensional time series or a one-dimensional temporal sequence. And so for sequence data, you often use an RNN, a recurrent neural network. In language, English and Chinese, the alphabets or the words come one at a time, so language is also most naturally represented as sequence data. And so more complex versions of RNNs are often used for these applications. And then, for more complex applications like autonomous driving, where you have an image, that might suggest more of a CNN, convolutional neural network, structure, plus radar info, which is something quite different.

A large body of research focuses on how to build better representations of words as vectors, while the manner of comparing them still receives little attention. In most cases, cosine similarity is the default choice. The core idea is to consider a word or a sentence embedding as a sample of *N* observations of some scalar random variable, where *N* is the embedding size. Then, some classical statistical correlation measures can be applied to pairs of vectors.

In a nutshell, correlation describes how one set of numbers relates to another; if they show some relationship, we can use this insight to explore and test causation and even forecast future data. To some extent, correlation can be a good measure of similarity instead of the traditional cosine. This paper proposes some very fundamental correlation metrics such as **Pearson, Spearman and Kendall**. As their empirical analysis has shown, cosine similarity is equivalent to Pearson’s (linear) correlation coefficient for commonly used word embeddings. This comes from the fact that the values observed in practice are distributed around a zero mean.
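The claimed equivalence is easy to check numerically: Pearson correlation is just the cosine similarity of mean-centered vectors, so for vectors whose values are already distributed around zero the two nearly coincide. A sketch with synthetic zero-mean "embeddings":

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def pearson(u, v):
    # Pearson correlation is cosine similarity of the mean-centered vectors.
    return cosine(u - u.mean(), v - v.mean())

rng = np.random.default_rng(0)
# Embeddings distributed around a zero mean, as observed for common word vectors.
u = rng.normal(0, 1, 300)
v = rng.normal(0, 1, 300)

gap = abs(cosine(u, v) - pearson(u, v))  # small when the vectors have near-zero mean
```

For vectors with a strongly non-zero mean (as the paper observes for some GloVe vectors), `gap` would no longer be negligible, which is exactly why the choice of measure matters.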

In the scenario of word similarity, a violation of the normality assumption makes cosine similarity especially inappropriate for GloVe vectors. For FastText and word2vec, the results of the Pearson coefficient and rank correlation coefficients (Spearman, Kendall) are comparable. However, the choice of cosine similarity is suboptimal for sentence vectors as centroids of word vectors (a widely used baseline for sentence representation), even for FastText. It is caused by stop word vectors behaving as outliers. The rank correlation measures are empirically preferable in this case

This paper showed that for commonly used word vectors, cosine similarity is equivalent to the Pearson correlation coefficient. However, on datasets where the word vectors are not “normal” and the variance is huge, whether cosine similarity is a reasonable choice for measuring semantic similarity should be decided experimentally.

A hyperparameter is a variable that we need to set before applying a learning algorithm to a dataset. The challenge with hyperparameters is that there are no magic numbers that work everywhere. The best numbers depend on each task and each dataset. Generally speaking, we can break hyperparameters down into two categories. The first category is optimizer hyperparameters.

These are the variables related more to the optimization and training process than to the model itself. These include the learning rate, the minibatch size, and the number of training iterations or epochs.

The second category is model hyperparameters. These are the variables that are more involved in the structure of the model. These include the number of layers and hidden units, and model-specific hyperparameters for architectures like RNNs.

The learning rate is the most important hyperparameter. Even if you apply models that other people built to your own dataset, you’ll find that you’ll probably have to try a number of different values for the learning rate to get the model to train properly. If you took care to normalize the inputs to your model, then a good starting point is usually 0.01. And these are the usual suspects of learning rates. If you try one and your model doesn’t train, you can try the others. Which of the others should you try? That depends on the behavior of the training error. To better understand this, we’ll need to look at the intuition of the learning rate. We saw that when we use gradient descent to train a neural network model, the training task boils down to decreasing the error value calculated by a loss function as much as we can. During a learning step, we calculate the loss, then find the gradient.
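The learning step described here, on a toy one-weight model with an invented quadratic loss, looks like this:

```python
# Toy one-weight model: loss(w) = (w - 3)**2, minimized at w = 3.
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.01  # the usual starting point for normalized inputs
for step in range(1000):
    # Nudge the weight in the direction opposite the gradient.
    w -= learning_rate * gradient(w)
```

With this learning rate the weight converges smoothly toward 3; a rate that is far too large would instead overshoot and oscillate.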

Let’s assume the simplest case, in which our model has only one weight. The gradient will tell us which way to nudge the current weight so that our predictions become more accurate. To visualize the dynamics of the learning rate, let’s take a look at some situations depicted in the following:

This here is a simple example with only one parameter and an ideal convex error curve. Things are more complicated in the real world: your models are likely to have hundreds or thousands of parameters, each with its own error curve that changes as the values of the other weights change. And the learning rate has to shepherd all of them to the best values that produce the least error. To make matters even more difficult for us, we don’t actually have any guarantees that the error curves will be clean U-shapes. They might, in fact, be more complex shapes with local minima that the learning algorithm can mistake for the best values and converge on.

Now that we’ve looked at the intuition of the learning rate, and the indications that the training error gives us to help tune it, let’s look at one specific case we can often face when tuning the learning rate. Think of the case where we chose a reasonable learning rate. It manages to decrease the error, but only up to a point, after which it’s unable to descend, even though it hasn’t reached the bottom yet. It would be stuck oscillating between values that still have a better error value than when we started training, but are not the best values possible for the model. This scenario is where it’s useful to have our training algorithm decrease the learning rate throughout the training process. This is a technique called learning rate decay. Some optimizers with adaptive learning rates are AdamOptimizer and AdagradOptimizer.
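One simple form of learning rate decay is exponential decay; the schedule below is a sketch with made-up constants, not the exact formula of any particular optimizer:

```python
def decayed_lr(initial_lr, step, decay_rate=0.5, decay_steps=100):
    # Exponential decay: multiply the rate by decay_rate every decay_steps steps.
    return initial_lr * decay_rate ** (step / decay_steps)

lr_start = decayed_lr(0.1, 0)    # full initial rate at step 0
lr_mid = decayed_lr(0.1, 100)    # halved after 100 steps
lr_late = decayed_lr(0.1, 200)   # quartered after 200 steps
```

Early in training the rate is large enough to make quick progress; later it shrinks so the weights can settle instead of oscillating.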

Minibatch size is another hyperparameter that no doubt you’ve run into a number of times already. It has an effect on the resource requirements of the training process, but it also impacts training speed and the number of iterations in a way that might not be as trivial as you may think. It’s important to review a little bit of terminology here first. Historically there has been debate on whether it’s better to do online, also called stochastic, training, where you fit a single example of the dataset to the model during a training step: using only one example, you do a forward pass, calculate the error, then backpropagate and set adjusted values for all your parameters, and then do this again for each example in the dataset. Or whether it was better to feed the entire dataset to the training step and calculate the gradient using the error generated by looking at all the examples in the dataset. This is called batch training. The abstraction commonly used today is to set a minibatch size. So online training is when the minibatch size is one, and batch training is when the minibatch size is the same as the number of examples in the training set. And we can set the minibatch size to any value between these two. The recommended starting values for your experimentation are between one and a few hundred, with 32 often being a good candidate. A larger minibatch size allows computational boosts that utilize matrix multiplication in the training calculations, but that comes at the expense of needing more memory. In practice, small minibatch sizes have more noise in their error calculations, and this noise is often helpful in preventing the training process from stopping at local minima on the error curve rather than the global minimum that creates the best model.
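A minimal sketch of splitting a (synthetic) dataset into minibatches: with `batch_size=1` this reduces to online/stochastic training, and with `batch_size=len(X)` to full batch training:

```python
import numpy as np

def minibatches(X, y, batch_size, seed=0):
    # Shuffle once per epoch, then yield consecutive slices of batch_size.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Synthetic data: 100 examples with one feature each.
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100, dtype=float)

batches = list(minibatches(X, y, batch_size=32))
sizes = [len(xb) for xb, _ in batches]  # the last batch holds the remainder
```

Each training step would then do a forward pass, error calculation, and backpropagation on one `(xb, yb)` pair instead of a single example or the whole dataset.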

This is an experimental result for the effect of batch size on convolutional neural nets.

It shows that using the same learning rate, the accuracy of the model decreases the larger the minibatch size becomes.

To choose the right number of iterations or number of epochs for our training step, the metric we should have our eyes on is the validation error. The intuitive manual way is to have the model train for as many epochs or iterations that it takes, as long as the validation error keeps decreasing. Luckily, however, we can use a technique called early stopping to determine when to stop training a model. Early stopping roughly works by monitoring the validation error, and stopping the training when it stops decreasing.
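A rough sketch of early stopping as described: stop once the validation error has failed to improve for a set number of epochs. The patience value and the error sequence below are illustrative:

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch at which training should stop: the point where the
    validation error has not improved for `patience` consecutive epochs."""
    best = float("inf")
    bad_epochs = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best = err
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_errors) - 1  # never triggered: train to the end

# Validation error decreases, then plateaus and rises: stop at epoch 6,
# three epochs after the last improvement at epoch 3.
errors = [1.0, 0.8, 0.6, 0.5, 0.5, 0.55, 0.6, 0.7, 0.8]
stop = early_stopping(errors, patience=3)
```

In practice one would also keep the weights from the best epoch (epoch 3 here), not the stopping epoch.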

Let’s now talk about the hyperparameters that relate to the model itself rather than the training or optimization process. The number of hidden units, in particular, is the hyperparameter I felt was the most mysterious when I started learning about machine learning. The main requirement here is to set a number of hidden units that is “large enough”. For a neural network to learn to approximate a function or a prediction task, it needs to have enough “capacity” to learn the function. The more complex the function, the more learning capacity the model will need. The number and architecture of the hidden units is the main measure of a model’s learning capacity. If we provide the model with too much capacity, however, it might tend to overfit and just try to memorize the training set. If you find your model overfitting your data, meaning that the training accuracy is much better than the validation accuracy, you might want to try to decrease the number of hidden units. You could also utilize regularization techniques like dropout or L2 regularization. So, as far as the number of hidden units is concerned, the more, the better; a little larger than the ideal number is not a problem, but a much larger value can often lead to the model overfitting. So, if your model is not training, add more hidden units and track the validation error. Keep adding hidden units until the validation error starts getting worse. Another heuristic involving the first hidden layer is that setting it to a number larger than the number of inputs has been observed to be beneficial in a number of tests. What about the number of layers? Andrej Karpathy tells us that in practice, it’s often the case that a three-layer neural net will outperform a two-layer net, but going even deeper rarely helps much more. The exception to this is convolutional neural networks, where the deeper they are, the better they perform.

“These results clearly indicate the advantages of the gating units over the more traditional recurrent units. Convergence is often faster, and the final solutions tend to be better. However, our results are not conclusive in comparing the LSTM and the GRU, which suggests that the choice of the type of gated recurrent unit may depend heavily on the dataset and corresponding task.”

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

“The GRU outperformed the LSTM on all tasks with the exception of language modelling”

An Empirical Exploration of Recurrent Network Architectures by Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever

“Our consistent finding is that depth of at least two is beneficial. However, between two and three layers our results are mixed. Additionally, the results are mixed between the LSTM and the GRU, but both significantly outperform the RNN.”

Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin Johnson, Li Fei-Fei

“Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.”

Understanding LSTM Networks by Chris Olah

“In our [Neural Machine Translation] experiments, LSTM cells consistently outperformed GRU cells. Since the computational bottleneck in our architecture is the softmax operation we did not observe large difference in training speed between LSTM and GRU cells. Somewhat to our surprise, we found that the vanilla decoder is unable to learn nearly as well as the gated variant.”

Massive Exploration of Neural Machine Translation Architectures by Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le

Resource and Reference

If you want to learn more about hyperparameters, these are some great resources on the topic:

- Practical recommendations for gradient-based training of deep architectures by Yoshua Bengio
- Deep Learning book – chapter 11.4: Selecting Hyperparameters by Ian Goodfellow, Yoshua Bengio, Aaron Courville
- Neural Networks and Deep Learning book – Chapter 3: How to choose a neural network’s hyper-parameters? by Michael Nielsen
- Efficient BackProp (pdf) by Yann LeCun

More specialized sources:

- How to Generate a Good Word Embedding? by Siwei Lai, Kang Liu, Liheng Xu, Jun Zhao
- Systematic evaluation of CNN advances on the ImageNet by Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas
- Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin Johnson, Li Fei-Fei

Distributional semantics is a subfield of natural language processing predicated on the idea that word meaning is derived from its usage. The distributional hypothesis states that words used in similar contexts have similar meanings. That is, if two words often occur with the same set of words, then they are semantically similar in meaning. A broader notion is the statistical semantic hypothesis, which states that meaning can be derived from statistical patterns of word usage. Distributional semantics serve as the fundamental basis for many recent computational linguistic advances. In this survey, we introduce the notion of word embeddings that serve as core representations of text in deep learning approaches. We start with the distributional hypothesis and explain how it can be leveraged to form semantic representations of words. We discuss the common distributional semantic models including word2vec and GloVe and their variants. We address the shortcomings of embedding models and their extension to document and concept representation. Finally, we discuss several applications to natural language processing tasks

**Vector Space Model**

Vector space models (VSMs) represent a collection of documents as points in a hyperspace, or equivalently, as vectors in a vector space. They are based on the key property that the proximity of points in the hyperspace is a measure of the semantic similarity of the documents. In other words, documents with similar vector representations imply that they are semantically similar. VSMs have found widespread adoption in information retrieval applications, where a search query is answered by returning a set of nearby documents sorted by distance.

**Curse of Dimensionality**

VSMs can suffer from a major drawback if they are based on high-dimensional sparse representations. Here, sparse means that a vector has many dimensions with zero values. This is termed the curse of dimensionality. As such, these VSMs require large memory resources and are computationally expensive to implement and use. For instance, a term-frequency based VSM would theoretically require as many dimensions as the number of words in the dictionary of the entire corpus of documents. In practice, it is common to set an upper bound on the number of words and hence the dimensionality of the VSM. Words that are not within the VSM are termed out-of-vocabulary (OOV). This is a meaningful gap with most VSMs in that they are unable to attribute semantic meaning to new words that they haven’t seen before and are OOV.

**Word Representations**

One of the earliest uses of word representations dates back to 1986. Word vectors explicitly encode linguistic regularities and patterns. Distributional semantic models can be divided into two classes: co-occurrence based and predictive models. Co-occurrence based models must be trained over the entire corpus and capture global dependencies and context, while predictive models capture local dependencies within a (small) context window. The most well-known of these models, word2vec and GloVe, are known as word models since they model word dependencies across a corpus. Both learn high-quality, dense word representations from large amounts of unstructured text data. These word vectors are able to encode linguistic regularities and semantic patterns, which lead to some interesting algebraic properties.

**Co-occurrence**

The distributional hypothesis tells us that co-occurrence of words can reveal much about their semantic proximity and meaning. Computational linguistics leverages this fact and uses the frequency of two words occurring alongside each other within a corpus to identify word relationships. Pointwise Mutual Information (PMI) is a commonly used information-theoretic measure of co-occurrence between two words w1 and w2:
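In symbols, the standard definition, with p(w) and p(w1, w2) as described below, is:

```latex
\mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}
```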

where p(w) is the probability of the word occurring, and p(w1, w2) is the joint probability of the two words co-occurring. High values of PMI indicate collocation and coincidence (and therefore strong association) between the words. It is common to estimate the single and joint probabilities based on word frequency and co-occurrence within the corpus. PMI is a useful measure for word clustering and many other tasks.

**LSA**

Latent semantic analysis (LSA) is a technique that effectively leverages word co-occurrence to identify topics within a set of documents. Specifically, LSA analyzes word associations within a set of documents by forming a document-term matrix (see Fig. 5.2), where each cell can be the frequency of occurrence or TF-IDF of a term within a document. As this matrix can be very large (with as many rows as words in the vocabulary of the corpus), a dimensionality reduction technique such as singular-value decomposition is applied to find a low-rank approximation. This low-rank space can be used to identify key terms and cluster documents or for information retrieval.

**Neural Language Models**

Recall that language models seek to learn the joint probability function of sequences of words. As stated above, this is difficult due to the curse of dimensionality—the sheer size of the vocabulary used in the English language implies that there could be an impossibly huge number of sequences over which we seek to learn. A language model estimates the conditional probability of the next word wT given all previous words wt:
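In symbols, the quantity being estimated is:

```latex
P(w_T \mid w_1, w_2, \ldots, w_{T-1})
```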

Many methods exist for estimating continuous representations of words, including latent semantic analysis (LSA) and latent Dirichlet allocation (LDA). The former fails to preserve linear linguistic regularities while the latter requires huge computational expense for anything beyond small datasets. In recent years, different neural network approaches have been proposed to overcome these issues. The representations learned by these neural network models are termed neural embeddings. In 2003 [Bengio 2003] presented a neural probabilistic model for learning a distributed representation of words. Instead of sparse, high-dimensional representations, the Bengio model proposed representing words and documents in lower-dimensional continuous vector spaces by using a multilayer neural network to predict the next word given the previous ones. This network is iteratively trained to maximize the conditional log-likelihood J over the training corpus using back-propagation:
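A standard form of this objective, written in the terms defined just below (the feature vectors v(wt), the network mapping f, and the penalty R(θ)), is the average log-likelihood of the next word given its n−1 predecessors, plus regularization:

```latex
J = \frac{1}{T} \sum_{t=1}^{T} \log f\big(v(w_t), v(w_{t-1}), \ldots, v(w_{t-n+1})\big) + R(\theta)
```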

where v(wt) is the feature vector for word wt, f is the mapping function representing the neural network, and R(θ) is the regularization penalty applied to the weights θ of the network. In doing so, the model concurrently associates each word with a distributed word feature vector as well as learning the joint probability function of word sequences in terms of the feature vectors of the words in the sequence. For instance, with a corpus vocabulary size of 100,000 and a one-hot encoded 100,000-dimensional vector representation, the Bengio model can learn a much smaller 300-dimensional continuous vector space representation.

[Collobert 2008] applied word vectors to several NLP tasks and showed that word vectors could be trained in an unsupervised manner on a corpus and used to significantly enhance NLP tasks. They used a multilayer neural network trained in an end-to-end fashion. In the process, the first layer in the network learned distributed word representations that are shared across tasks. The output of this word representation layer was passed to downstream architectures that were able to output part-of-speech tags, chunks, named entities, semantic roles, and sentence likelihood. The model is an example of multitask learning enabled through the adoption of dense layer representations.

**word2vec**

In 2013, [Mikolov 2013] proposed a set of neural architectures that could compute continuous representations of words over large datasets. Unlike other neural network architectures for learning word vectors, these architectures were highly computationally efficient, able to handle even billion-word vocabularies, since they do not involve dense matrix multiplications. Furthermore, the high-quality representations learned by these models possessed useful translational properties that provided semantic and syntactic meaning. The proposed architectures consisted of the continuous bag-of-words (CBOW) model and the skip-gram model. They termed the group of models word2vec. They also proposed two methods to train the models, based on a hierarchical softmax approach or a negative-sampling approach. The translational properties of the vectors learned through word2vec models can provide highly useful linguistic and relational similarities. In particular, Mikolov et al. revealed that vector arithmetic can yield high-quality word similarities and analogies. They showed that the vector representation of the word queen can be recovered from representations of king, man, and woman by searching for the nearest vector based on cosine distance to the vector sum:
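The vector-arithmetic claim can be illustrated with toy embeddings; the 3-dimensional vectors below are invented for illustration, not real word2vec vectors:

```python
import numpy as np

# Toy "embeddings": one axis roughly for royalty, one for gender, one unrelated.
vocab = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.3]),
    "woman": np.array([0.1, -0.8, 0.3]),
    "apple": np.array([0.0,  0.1, 0.9]),
}

def nearest(target, exclude):
    # Nearest vocabulary word by cosine similarity, skipping the query words.
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], target))

# king - man + woman should land closest to queen.
target = vocab["king"] - vocab["man"] + vocab["woman"]
answer = nearest(target, exclude={"king", "man", "woman"})
```

With real word2vec vectors the arithmetic is only approximate, but the nearest-neighbor search still recovers "queen".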

Global co-occurrence based models can be an alternative to predictive, local context window methods like word2vec. Co-occurrence methods are usually very high dimensional and require much storage. When dimensionality reduction methods are used, as in LSA, the resulting representations typically perform poorly in capturing semantic word regularities. Furthermore, frequent co-occurrence terms tend to dominate. Predictive methods like word2vec are local-context based and generally perform poorly in capturing the statistics of the corpus. [Pennington 2014] proposed a log-bilinear model that combines both global co-occurrence and shallow window methods. They termed this the GloVe model, which is a play on the words Global and Vector. The GloVe model is trained via least squares using the cost function:
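A standard form of this cost, in the terms defined just below (the full GloVe model also adds per-word bias terms inside the square), is:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( u_i^{\top} v_j - \log X_{ij} \right)^2
```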

where V is the size of the vocabulary, Xij is the count of times that words i and j co-occur in the corpus, f is a weighting function that acts to reduce the impact of frequent counts, and ui and vj are word vectors. It is well known that the distributional hypothesis holds for most human languages. This implies that we can train word embedding models in many languages [Coulmance 2016], and companies such as Facebook and Google have released pre-trained word2vec and GloVe vectors for up to 157 languages [Grave 2018]. These embedding models are monolingual: they are learned on a single language. Several languages exist with multiple written forms. For instance, Japanese possesses three distinct writing systems (Hiragana, Katakana, Kanji). Monolingual embedding models cannot associate the meaning of a word across different written forms. The term word alignment describes the NLP process by which words are related together across two written forms or languages.

Embedding models suffer from a number of well-known limitations. These include out-of-vocabulary words, antonymy, polysemy, and bias. We explore these in detail in the next sections.

**Out of Vocabulary**

The Zipfian distributional nature of the English language is such that there exists a huge number of infrequent words. Learning representations for these rare words would require huge amounts of (possibly unavailable) data, as well as potentially excessive training time or memory resources. Due to practical considerations, a word embedding model will contain only a limited set of the words in the English language. Even a large vocabulary will still have many out-of-vocabulary (OOV) words. Unfortunately, many important domain-specific terms tend to occur infrequently and can contribute to the number of OOV words. This is especially true with domain shifts. As a result, OOV words can play a crucial role in the performance of NLP tasks. With models such as word2vec, the common approach is to use a “UNK” representation for words deemed too infrequent to include in the vocabulary. This maps many rare words to an identical vector (zero or random vectors) in the belief that their rarity implies they do not contribute significantly to semantic meaning. Thus, OOV words all provide an identical context during training. Similarly, OOV words at test time are mapped to this representation. This assumption can break down for many reasons, and a number of methods have been proposed to address this shortfall. Ideally, we would like to be able to somehow predict a vector representation that is semantically similar to words that are either outside our training corpus or occurred too infrequently in it. Character-based or subword (char-n-gram) embedding models are compositional approaches that attempt to derive meaning from parts of a word (e.g., roots, suffixes). Subword approaches are especially useful for languages that are rich in morphology, such as Arabic or Icelandic.

**Antonymy**

Another significant limitation is an offshoot of the fundamental principle of distributional similarity from which word models are derived—that words used in similar contexts are similar in meaning. Unfortunately, two words that are antonyms of each other often co-occur with the same sets of word contexts:

I really hate spaghetti on Wednesdays.

I really love spaghetti on Wednesdays.

While word embedding models can capture synonyms and semantic relationships, they notably fail to distinguish antonyms and the overall polarity of words. In other words, without intervention, word embedding models cannot differentiate between synonyms and antonyms, and it is common to find antonyms closely co-located within a vector space model.
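This failure mode follows directly from distributional similarity. A toy sketch (with made-up co-occurrence counts, purely for illustration) shows how antonyms end up with near-identical vectors:

```python
import math

# Toy co-occurrence counts over contexts like "I really _ spaghetti on Wednesdays".
# The counts are invented for illustration; the point is that antonyms share
# almost identical context distributions.
CONTEXT_WORDS = ["I", "really", "spaghetti", "on", "Wednesdays", "puppies"]
VECTORS = {
    "love": [10, 8, 5, 5, 5, 7],
    "hate": [10, 8, 5, 5, 5, 1],   # same contexts as "love", minus "puppies"
    "table": [1, 0, 0, 2, 0, 0],   # unrelated word with different contexts
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Distributional similarity places the antonyms far closer together
# than either is to an unrelated word.
sim_antonyms = cosine(VECTORS["love"], VECTORS["hate"])
sim_unrelated = cosine(VECTORS["love"], VECTORS["table"])
```

Any model built purely from such co-occurrence statistics will place *love* and *hate* near each other, which is exactly the problem thesauri-based adaptations try to correct.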

An adaptation to word2vec can be made to learn word embeddings that disambiguate polarity by incorporating thesauri information [Ono 2015].

**Polysemy**

In the English language, words can have several meanings; this is known as polysemy. Sometimes these meanings are very different or even complete opposites of each other. Look up the word bad and you might find up to 46 distinct meanings. Because models such as word2vec or GloVe associate each word with a single vector representation, they are unable to deal with homonyms and polysemy. Word sense disambiguation is possible but requires more complex models. Humans do remarkably well at distinguishing the meaning of a word based on context: given two sentences using the word play, for example, we can easily tell its senses apart from the part-of-speech or the surrounding words. This gives rise to multi-representation embedding models that leverage surrounding context (cluster-weighted context embeddings) or part-of-speech (sense2vec). Sense2vec is a simple method for word-sense disambiguation that leverages supervised labels such as part-of-speech tags [Trask 2015].

Methods such as word2vec or GloVe ignore the internal structure of words and associate each word (or word sense) with a separate vector representation. For morphologically rich languages, there may be so many rare word forms that either a very large vocabulary must be maintained or a significant number of words must be treated as out-of-vocabulary (OOV). As previously stated, OOV words can significantly impact performance due to the loss of context from rare words. An approach that helps deal with this limitation is the use of subword embeddings [Bojanowski 2016], where vector representations are associated with character n-grams g and each word w_i is represented by the sum of its n-gram vectors.

While word embedding models capture semantic relationships between words, they lose this ability at the sentence level. Sentence representations are usually expressed as the sum of the word vectors of the sentence. This bag-of-words approach has a major flaw: different sentences can have identical representations as long as the same words are used. To incorporate word order information, bag-of-n-grams approaches have been tried, which capture short-range order; however, at the sentence level they are limited by data sparsity and suffer from poor generalization due to high dimensionality. [Le 2014] proposed an unsupervised algorithm, inspired by word2vec and commonly known as doc2vec, to learn representations of sentences that capture word order information. It generates fixed-length feature representations from variable-length pieces of text, making it applicable to sentences, paragraphs, sections, or entire documents.

In the past year, a number of new methods leveraging contextualized embeddings have been proposed. These are based on the notion that embeddings for words should depend on the contexts in which they are used. This context can be the position and presence of surrounding words in the sentence, paragraph, or document. By generatively pre-training contextualized embeddings and language models on massive amounts of data, it became possible to discriminatively fine-tune models on a variety of tasks and achieve state-of-the-art results. This has been commonly referred to as **"NLP's ImageNet moment"**. One of the notable methods is the **Transformer** model, an attention-based stacked encoder-decoder architecture that is pre-trained at scale [Vaswani 2017]. Another important method is **ELMo**, short for Embeddings from Language Models, which generates a set of contextualized word representations that effectively capture syntax, semantics, and polysemy. These representations are the internal states of a bidirectional, character-based LSTM language model pre-trained on a large external corpus.

Building on the power of Transformers, a method called BERT, short for Bidirectional Encoder Representations from Transformers, has recently been proposed. BERT is a transformer-based, masked language model that is trained bidirectionally to generate deep contextualized word embeddings capturing both left-to-right and right-to-left contexts. These embeddings require very little fine-tuning to excel at complex downstream tasks such as entailment or question answering. BERT has broken multiple performance records and represents one of the brightest breakthroughs in language representation today [Devlin 2018].

Word embeddings have been found to be very useful for many NLP tasks. In this survey we have presented an extensive overview of semantically-grounded models for constructing distributed representations of meaning. Word embeddings have been shown to provide interesting semantic properties that can be applied to most language applications.

Yoshua Bengio et al. “A neural probabilistic language model”. In: JMLR (2003), pp. 1137–1155.

Ronan Collobert and Jason Weston. “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning”. In: Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 160–167

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for Word Representation”. In: Empirical Methods in Natural Language Processing (EMNLP). 2014, pp.1532–1543.

Jocelyn Coulmance et al. “Trans-gram, Fast Cross-lingual Word embeddings”. In: CoRR abs/1601.02502 (2016).

Edouard Grave et al. “Learning Word Vectors for 157 Languages”. In: CoRR abs/1802.06893 (2018).

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. “Word Embedding based Antonym Detection using Thesauri and Distributional Information.” In: HLT-NAACL. 2015, pp.984–989.

Andrew Trask, Phil Michalak, and John Liu. “sense2vec – A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings.” In: CoRR abs/1511.06388 (2015).

Piotr Bojanowski et al. “Enriching Word Vectors with Subword Information”. In: CoRR abs/1607.04606 (2016).

Quoc V. Le and Tomas Mikolov. “Distributed Representations of Sentences and Documents”. In: CoRR abs/1405.4053 (2014).

Ashish Vaswani et al. “Attention is all you need”. In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.

Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In: CoRR abs/1810.04805 (2018).

The last few decades have witnessed substantial breakthroughs in several areas of speech and language understanding research, specifically in building human-to-machine conversational dialog systems. Dialog systems, also known as interactive conversational agents, virtual agents, or sometimes chatbots, are useful in a wide range of applications ranging from technical support services to language-learning tools and entertainment. Recent success in deep neural networks has spurred research in building data-driven dialog models. In this article, we give an overview of these recent advances in **non-task-oriented dialogue systems** from various perspectives and discuss some possible research directions.

Unlike task-oriented dialogue systems, which aim to complete specific tasks for the user, non-task-oriented dialogue systems (also known as chatbots) focus on conversing with humans in open domains [Ritter 2011]. In general, chatbots are implemented with either generative or retrieval-based methods. Generative models are able to produce novel responses that may never have appeared in the corpus, while retrieval-based models enjoy the advantage of informative and fluent responses [Ji 2014], because they select a proper response for the current conversation from a repository using response-selection algorithms. In the following sections, we focus on neural generative models, one of the most popular research topics in recent years, and discuss their drawbacks and possible improvements.
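As a toy illustration of the retrieval-based approach (not any specific system from the literature), a repository of post-response pairs can be searched with a simple word-overlap score; the repository and the Jaccard scoring are illustrative stand-ins for real response-selection algorithms:

```python
# Minimal retrieval-based chatbot: pick the stored response whose post
# has the highest word overlap (Jaccard similarity) with the message.
REPOSITORY = [
    ("how is the weather today", "it is sunny and warm"),
    ("what is your favorite food", "i love spaghetti"),
    ("do you like sports", "i enjoy watching football"),
]

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def retrieve_response(message):
    # Select the (post, response) pair whose post best matches the message.
    best_post, best_response = max(REPOSITORY, key=lambda pr: jaccard(message, pr[0]))
    return best_response

reply = retrieve_response("what is your favorite food")
```

Real systems replace the word-overlap score with learned matching models, but the select-from-repository structure is the same, which is why retrieved responses are always fluent: a human wrote them.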

Nowadays, a large amount of conversational exchange is available on social media websites such as Twitter and Reddit, which raises the prospect of building data-driven models. [Ritter 2011] proposed a generative probabilistic model, based on phrase-based statistical machine translation, to model conversations on micro-blogs. It viewed response generation as a translation problem in which a post needs to be translated into a response. However, generating responses was found to be considerably more difficult than translating between languages, likely due to the wide range of plausible responses and the lack of phrase alignment between post and response. The success of applying deep learning to machine translation, namely Neural Machine Translation, has spurred enthusiasm for neural generative dialogue systems. In the following sections, we first introduce sequence-to-sequence models, the foundation of neural generative models. We then discuss hot research topics in this direction, including incorporating dialogue context, improving response diversity, modeling topics and personalities, leveraging outside knowledge bases, interactive learning, and evaluation.

Given a source sequence (message) X = (x_1, …, x_T) consisting of T words and a target sequence (response) Y = (y_1, …, y_{T'}) of length T', the model maximizes the generation probability of Y conditioned on X. Specifically, a sequence-to-sequence model (or Seq2Seq) has an encoder-decoder structure. The encoder reads X word by word and represents it as a context vector c through a recurrent neural network (RNN), and then the decoder estimates the generation probability of Y with c as its input. The encoder RNN calculates the context vector c by

h_t = f(x_t, h_{t-1}),　c = h_T,

where h_t is the hidden state at time step t, f is a non-linear function such as a long short-term memory unit (LSTM) or a gated recurrent unit (GRU), and c is the hidden state h_T corresponding to the last word. The decoder is a standard RNN language model with an additional conditional context vector c. The probability distribution p_t over candidate words at each time step t is calculated as

s_t = f(y_{t-1}, s_{t-1}, c),　p_t = softmax(s_t, y_{t-1}),

where s_t is the hidden state of the decoder RNN at time step t and y_{t-1} is the word at time step t-1 in the response sequence. [Bahdanau 2014] and [Luong 2015] improved performance with the attention mechanism, in which each word in Y is conditioned on a different context vector, based on the observation that each word in Y may relate to different parts of X. In general, these models use neural networks to represent dialogue histories and to generate appropriate responses. Such models can leverage large amounts of data to learn meaningful natural language representations and generation strategies, while requiring a minimal amount of domain knowledge and handcrafting.
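The encoder-decoder computation can be sketched with a plain vanilla-RNN stand-in for the non-linear function f (a toy with arbitrary random weights and sizes, not a trained or production model):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 8, 4                        # toy vocabulary size and hidden size
Wxh = rng.normal(0, 0.1, (H, V))   # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden weights
Who = rng.normal(0, 0.1, (V, H))   # hidden-to-output weights

def one_hot(i):
    x = np.zeros(V)
    x[i] = 1.0
    return x

def encode(xs):
    """Encoder: h_t = f(x_t, h_{t-1}); the context c is the last state h_T."""
    h = np.zeros(H)
    for i in xs:
        h = np.tanh(Wxh @ one_hot(i) + Whh @ h)
    return h  # c = h_T

def decode_step(y_prev, s_prev, c):
    """One decoder step: s_t depends on y_{t-1}, s_{t-1} and the context c;
    p_t is a softmax distribution over the vocabulary."""
    s = np.tanh(Wxh @ one_hot(y_prev) + Whh @ s_prev + c)
    logits = Who @ s
    p = np.exp(logits - logits.max())
    return p / p.sum(), s

c = encode([1, 2, 3])                      # encode source message X
p1, s1 = decode_step(0, np.zeros(H), c)    # first decoding step
```

An LSTM or GRU replaces the single tanh update with gated updates, but the information flow (context vector from the encoder, word-by-word softmax from the decoder) is exactly the one in the equations above.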

The ability to take into account previous utterances is key to building dialog systems that can keep conversations active and engaging. [Serban 2016] used hierarchical models, first capturing the meaning of individual utterances and then integrating them as discourses. [Xing 2017] extended the hierarchical structure with an attention mechanism that attends to important parts within and among utterances, using word-level and utterance-level attention respectively.

**Response Diversity**

A challenging problem in current sequence-to-sequence dialogue systems is that they tend to generate trivial or noncommittal, universally relevant responses with little meaning, often involving high-frequency phrases along the lines of *I don't know* or *I'm OK*. This behavior can be ascribed to the relatively high frequency of generic responses like *I don't know* in conversational datasets, in contrast with the relative sparsity of more informative alternatives. One promising way to alleviate this challenge is to find a better objective function. [Li 2016] pointed out that neural models assign high probability to "safe" responses when optimizing the likelihood of outputs given inputs. They used Maximum Mutual Information (MMI), first introduced in speech recognition, as the optimization objective: it measures the mutual dependence between inputs and outputs, taking into consideration the inverse dependency of responses on messages. [Serban 2017] presented a latent-variable hierarchical recurrent encoder-decoder (VHRED) model that also aims to generate less bland and more specific responses. It extends the HRED model by adding a high-dimensional stochastic latent variable to the target. This additional latent variable is meant to address the shallow generation process, which is problematic from an inference standpoint because the generation model is forced to produce a high-level structure, i.e., an entire response, on a word-by-word basis. Generation is made easier in the VHRED model, as it exploits a high-dimensional latent variable that determines high-level aspects of the response (topic, names, verbs, etc.), so that the other parts of the model can focus on lower-level aspects of generation, e.g., ensuring fluency. The VHRED model incidentally helps reduce blandness: since the content of the response is conditioned on the latent variable, the generated response is only bland and devoid of semantic content if the latent variable determines that it should be. More recently, [Zhang 2018] presented a model that introduces an additional variable (modeled using a Gaussian kernel layer) to control the level of specificity of the response, going from bland to very specific.

**Speaker Consistency**

It has been shown that the popular seq2seq approach often produces incoherent conversations [Li 2016], where the system may, for instance, contradict what it had just said in the previous turn (or sometimes even in the same turn). While some of this effect can be attributed to the limitations of the learning algorithms, [Li 2016] suggested that the main cause of this inconsistency is probably the training data itself. This sets the response generation task apart from more traditional NLP tasks: while models for tasks such as machine translation are trained on data that is mostly one-to-one semantically, conversational data is often one-to-many or many-to-many. As one-to-many training instances are akin to noise for any learning algorithm, one needs more expressive models that exploit a richer input to better account for such diverse responses. [Li 2016] did so with a persona-based response generation system, an extension of the LSTM model that uses speaker embeddings in addition to word embeddings. Intuitively, these two types of embeddings work similarly: while word embeddings form a latent space in which spatial proximity (i.e., low Euclidean distance) means two words are semantically or functionally close, speaker embeddings constitute a latent space in which two nearby speakers tend to converse in the same way, e.g., having similar speaking styles (e.g., British English) or often talking about the same topic (e.g., sports). More recently, [Luan 2017] presented an extension of the speaker embedding model of [Li 2016] that combines a seq2seq model trained on conversational data with an autoencoder trained on non-conversational data, in a multitask learning setup. Tying the decoder parameters of the seq2seq model and the autoencoder enables [Luan 2017] to train a response generation system for a given persona without requiring any conversational data for that persona. This is an advantage of their approach, as conversational data for a given user or persona might not always be available.

**Word Repetitions**

Word or content repetition is a common problem in neural generation tasks other than machine translation, as has been noted in response generation, image captioning, visual story generation, and general language modeling. While machine translation is a relatively one-to-one task where each piece of information in the source (e.g., a name) is usually conveyed exactly once in the target, tasks such as dialogue or story generation are much less constrained, and a given word or phrase in the source can map to zero or multiple words or phrases in the target. This makes response generation much more challenging, as generating a given word or phrase does not preclude the need to generate the same word or phrase again. In light of these limitations, [Shao 2017] proposed a new model that adds self-attention to the decoder, aiming to improve the generation of longer, coherent responses while incidentally mitigating the word repetition problem. Target-side attention helps the model keep track of what information has been generated so far, so that it can more easily discriminate against unwanted word or phrase repetitions.
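One simple decoding-time illustration of discouraging repetition (a heuristic sketch, not the self-attention decoder of [Shao 2017]) is to block any candidate word that would recreate an already-generated bigram:

```python
# Bigram-blocking heuristic: at each decoding step, skip the candidate
# continuation that would repeat a bigram already present in the output.
def blocked_bigrams(tokens):
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def pick_next(generated, candidates):
    """candidates: list of (token, score); return the best-scoring token
    that does not recreate a bigram already present in `generated`."""
    seen = blocked_bigrams(generated)
    for token, _score in sorted(candidates, key=lambda ts: -ts[1]):
        if not generated or (generated[-1], token) not in seen:
            return token
    return None  # every candidate would repeat a bigram

generated = ["i", "like", "tea", "i", "like"]
# "tea" scores highest, but the bigram "like tea" was already generated,
# so the decoder falls back to "coffee".
nxt = pick_next(generated, [("tea", 0.9), ("coffee", 0.5)])
```

Such hard constraints are crude compared to letting the decoder attend over its own output, but they make the failure mode, and why target-side information helps, concrete.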

Deep learning has become a basic technique in dialogue systems. Researchers have investigated applying neural networks to the components of a traditional task-oriented dialogue system, including natural language understanding, natural language generation, and dialogue state tracking. In recent years, end-to-end frameworks have become popular not only in non-task-oriented chit-chat dialogue systems but also in task-oriented ones. Deep learning is capable of leveraging large amounts of data and is a promising way to build a unified intelligent dialogue system. It is blurring the boundaries between task-oriented and non-task-oriented systems. In particular, chit-chat dialogues are modeled directly by the sequence-to-sequence model, and task-completion models are also moving toward an end-to-end trainable style, with reinforcement learning representing the state-action space and combining the whole pipeline. It is worth noting that current end-to-end models are still far from perfect. Despite the achievements above, the problems remain challenging. Next, we discuss some possible research directions.

**Swift Warm-Up:** Although end-to-end models have drawn most of the recent research attention, we still need to rely on traditional pipelines in practical dialogue engineering, especially in the warm-up stage for a new domain. Daily conversation data is quite "big"; however, dialogue data for a specific domain is quite limited. In particular, collecting domain-specific dialogue data and constructing dialogue systems are laborious. Neural-network-based models are better at leveraging large amounts of data, so we need new ways to bridge the warm-up stage. It is promising for a dialogue agent to be able to learn by itself from interactions with humans.

**Deep Understanding:** Current neural-network-based dialogue systems rely heavily on huge amounts of annotated data of different types, structured knowledge bases, and conversation data. They learn to speak by imitating responses again and again, just like an infant, and the responses still lack diversity and are sometimes not meaningful. Hence, the dialogue agent should be able to learn more effectively, with a deep understanding of language and of the real world. Specifically, much potential remains if a dialogue agent can learn from human instruction rather than repeated training. Since a great quantity of knowledge is available on the Internet, a dialogue agent can be smarter if it is capable of using such unstructured knowledge resources for comprehension. Last but not least, a dialogue agent should be able to make reasonable inferences, find something new, and share its knowledge across domains, instead of repeating words like a parrot.

A. Ritter, C. Cherry, and W. B. Dolan. Data-driven response generation in social media. In Conference on Empirical Methods in Natural Language Processing, pages 583–593, 2011.

Z. Ji, Z. Lu, and H. Li. An information retrieval approach to short text conversation. arXiv preprint arXiv:1408.6988, 2014.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

T. Luong, I. Sutskever, Q. Le, O. Vinyals, andW. Zaremba. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China, July 2015. Association for Computational Linguistics.

I. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models, 2016.

C. Xing, W. Wu, Y. Wu, M. Zhou, Y. Huang, and W. Y. Ma. Hierarchical recurrent attention network for response generation. 2017.

I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, 2017.

J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A persona-based neural conversation model. In ACL, 2016.


Luan, Y., Brockett, C., Dolan, B., Gao, J., and Galley, M. (2017). Multi-task learning for speaker role adaptation in neural conversation models. In IJCNLP.

Shao, Y., Gouws, S., Britz, D., Goldie, A., Strope, B., and Kurzweil, R. (2017). Generating highquality and informative conversation responses with sequence-to-sequence models. In EMNLP.

Automatic text summarization is the process of shortening a text while keeping the main content and ideas of the documents. With the large amount of text published online, such as product reviews, quickly capturing the content of these documents is critical for decision making; manual text summarization is therefore no longer feasible.

The dominant approach to text summarization up to the blooming of deep learning was shallow, unsupervised information-retrieval models. Neural summarization started in 2014 with the work of [Kågebäck et al., 2014], which showed that neural continuous vector space models are promising for text summarization and demonstrated superior performance in comparison to classical shallow machine learning methods.

The aim of this literature review is to survey recent work on neural automatic text summarization models. The survey starts with the fundamentals of document summarization, followed by details of recent significant neural text summarization models. It also discusses promising directions for future research and offers a general conclusion.

**Summarization Factors**

There are three main factors in text summarization: **input**, **output**, and **purpose**; we briefly discuss each of them.

**Input factors**

Single-document or multi-document: this factor concerns the number of input documents that the summarization system takes [Jones et al., 1999].

Monolingual, multilingual, or cross-lingual: this factor relates to the number of languages the system can handle. A monolingual system has input and output in the same language; multilingual systems handle input-output pairs in the same language across several different languages. In contrast, a cross-lingual system handles input-output pairs that are not in the same language.

**Purpose factors**

Informative or indicative: an indicative summary conveys the relevant contents of the original documents so the reader can select the documents that align with their interests for further reading. Meanwhile, an informative summary is concerned with replacing the original documents with their important contents.

**Output factors**

Extractive or abstractive: an extractive summarizer selects text snippets (words, phrases, sentences) from the source documents, while an abstractive summarizer generates new text snippets to convey the main ideas of the source documents.

**Evaluation of Summarization Systems**

The most popular and cheap method for evaluating a text summarizer is Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [Lin, 2004], along with human ratings. The commonly used ROUGE variants are ROUGE-N, ROUGE-L, and ROUGE-SU.

ROUGE-N computes the percentage of n-gram overlap between the system and reference summaries.
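ROUGE-N recall can be sketched in a few lines (a bare-bones version without stemming, stopword options, count clipping, or multi-reference handling, which the official ROUGE toolkit provides):

```python
# Bare-bones ROUGE-N recall: the fraction of reference n-grams that also
# appear in the system summary.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(system, reference, n):
    ref = ngrams(reference.split(), n)
    sys_grams = set(ngrams(system.split(), n))
    if not ref:
        return 0.0
    return sum(1 for g in ref if g in sys_grams) / len(ref)

# 5 of the 6 reference unigrams ("the cat lay on the mat") appear in the
# system output, so ROUGE-1 recall is 5/6.
score = rouge_n("the cat sat on the mat", "the cat lay on the mat", 1)
```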

ROUGE-L computes the sum of the longest in-sequence matches of each reference sentence to the system summary.

ROUGE-SU measures the percentage of overlapping skip-bigrams and unigrams.

**Summarization Techniques**

Most early works on single-document **extractive** summarization employ statistical techniques. Such algorithms rank each sentence based on its relation to the other sentences using predefined formulas. Later works on text summarization address the problem by creating sentence representations of the documents and utilizing machine learning algorithms. These models manually select appropriate features and train supervised models to classify whether to include a sentence in the summary [Wong 2008]. The core of abstractive summarization techniques is to identify the main ideas in the documents and encode them into feature representations, which are then passed to natural language generation (NLG) systems. Most of the early work on abstractive summarization uses a semi-manual process for identifying the main ideas of the document(s); prior knowledge such as scripts and templates is usually used to produce summaries.

With the blooming of deep learning, neural summarizers have attracted considerable attention for automatic summarization. Neural models often achieve better performance than traditional models when a large amount of data is available.

Most neural-based summarizers use the following pipeline: 1) words are transformed to continuous vectors, called word embeddings, by a look-up table; 2) sentences/documents are encoded as continuous vectors using the word embeddings; 3) sentence/document representations (sometimes also word embeddings) are then fed to a model for selection (extractive summarization) or generation (abstractive summarization).

Neural networks can be used in any of the above three steps. In step 1, we can use neural networks to obtain pre-learned look-up tables (such as Word2Vec, CW vectors, and GloVe). In step 2, neural networks such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs) can be used as encoders for extracting sentence/document features. In step 3, neural network models can be used as regressors for ranking/selection (extraction) or as decoders for generation (abstraction).

**Extractive Models**

Extractive summarizers, which are selection-based methods, need to solve two critical challenges: 1) how to represent sentences; 2) how to select the most appropriate sentences, taking into account both coverage and redundancy.

[Kågebäck et al., 2014] propose representing sentences as continuous vectors obtained either by adding the word embeddings or by using an unfolding recursive auto-encoder (RAE) on the word embeddings. The RAE is trained in an unsupervised manner by backpropagation with a self-reconstruction error. Pre-computed word embeddings from Collobert and Weston's model (CW vectors) or Mikolov et al.'s model (W2V vectors) are used directly without fine-tuning. The selection task chooses the summary S by solving an optimization problem that maximizes a linear combination of the diversity of the sentences R and the coverage of the input text L.
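The selection step, trading off coverage against redundancy, can be approximated greedily. The sketch below is not the exact optimization of [Kågebäck et al., 2014]; the sentence vectors and the trade-off weight are made up for illustration:

```python
import numpy as np

def greedy_select(sentence_vecs, doc_vec, k, lam=0.5):
    """Greedily pick k sentences, trading off coverage of the document
    (similarity to doc_vec) against redundancy with already-picked sentences."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    selected, remaining = [], list(range(len(sentence_vecs)))
    while remaining and len(selected) < k:
        def gain(i):
            coverage = cos(sentence_vecs[i], doc_vec)
            redundancy = max((cos(sentence_vecs[i], sentence_vecs[j])
                              for j in selected), default=0.0)
            return coverage - lam * redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Sentences 0 and 1 are near-duplicates; sentence 2 says something different.
vecs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
doc = np.array([1.0, 0.6])
chosen = greedy_select(vecs, doc, k=2)
```

Even though sentence 0 covers the document better than sentence 2, the redundancy penalty against the already-chosen sentence 1 pushes the second pick toward the more diverse sentence 2.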

[Cao et al., 2015] propose PriorSum, which uses CNN-learned features concatenated with document-independent features as the sentence representation. Three document-independent features are used: 1) sentence position; 2) averaged term frequency of the words in the sentence based on the document; 3) averaged term frequency of the words in the sentence based on the cluster (for multi-document summarization). The CNN used in PriorSum has multiple layers with alternating convolution and pooling operations. The filters in the convolution layers have different window sizes, and two-stage max-over-time pooling operations are performed in the pooling layers. The parameters of this CNN are updated by applying the diagonal variant of AdaGrad with mini-batches. PriorSum is a supervised model that requires gold-standard summaries during training. It follows the traditional supervised extractive framework: it first ranks each sentence and then selects the top k ranked non-redundant sentences as the final summary. During training, each sentence in the document is associated with its ROUGE-2 score (stopwords removed) with respect to the gold-standard summary, and a linear regression model is trained to estimate these ROUGE-2 scores by updating the regression weights.

[Nallapati et al., 2017] propose SummaRuNNer, which employs a two-layer bi-directional RNN for sentence and document representations. The first layer of the RNN is a bi-directional GRU that runs at the word level: it takes the word embeddings of a sentence as inputs and produces a set of hidden states, which are averaged into a vector used as the sentence representation. The second layer is also a bi-directional GRU, which runs at the sentence level by taking the sentence representations obtained by the first layer as inputs. The hidden states of the second layer are then combined into a vector d (the document representation) through a non-linear transformation. The authors frame sentence selection as a sequential sentence-labeling problem that uses the hidden states (h_1, …, h_m) from the second layer of the encoder RNN directly for the binary decision (modeled by a sigmoid function).

**Abstractive Models**

Abstractive summarizers focus on capturing a meaning representation of the whole document and then generating an abstractive summary based on this representation. Therefore, neural-based abstractive summarizers, which are generation-based methods, need to make the following two decisions: 1) how to represent the whole document with an encoder; 2) how to generate the word sequence with a decoder.

[Rush et al., 2015] propose **three encoder structures** to capture the meaning representation of a document. **Bag-of-Words Encoder**: this encoder simply computes the summation of the word embeddings appearing in the sequence. **Convolutional Encoder**: this encoder utilizes a CNN model with multiple alternating convolution and 2-element max-pooling layers. **Attention-Based Encoder**: this encoder produces a document representation at each time step based on the previous C words (the context) generated by the decoder. The decoder uses a feed-forward neural network language model (NNLM) to estimate the probability distribution that generates the word at each time step t.
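The order-insensitivity of a bag-of-words encoder is easy to demonstrate: two sentences with opposite meanings but the same words receive identical representations. The 3-d word vectors below are hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical 3-d word embeddings (arbitrary values, for illustration only).
EMB = {
    "dog": np.array([1.0, 0.2, 0.0]),
    "bites": np.array([0.1, 1.0, 0.3]),
    "man": np.array([0.5, 0.1, 1.0]),
}

def sentence_vector(tokens):
    """Bag-of-words encoding: the sum of the sentence's word vectors."""
    return sum(EMB[t] for t in tokens)

v1 = sentence_vector(["dog", "bites", "man"])
v2 = sentence_vector(["man", "bites", "dog"])
# Opposite meanings, identical representations.
identical = bool(np.allclose(v1, v2))
```

This is precisely why the convolutional and attention-based encoders, which are sensitive to position, tend to be preferred over the plain summation.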

[Nallapati et al., 2016] propose a feature-rich hierarchical attentive encoder based on a bidirectional GRU to represent the document. The encoder takes as input a vector obtained by concatenating the word embedding with additional linguistic features: parts-of-speech (POS) tags, named-entity (NER) tags, and the term frequency (TF) and inverse document frequency (IDF) of the word. The continuous features (TF and IDF) are first discretized into a fixed number of bins and then encoded into one-hot vectors, like the other discrete features. All the one-hot vectors are then mapped to continuous vectors by embedding matrices, and these continuous vectors are concatenated into a single long vector that is fed into the encoder. Hierarchical attention: the hierarchical encoder has two RNNs, one running at the word level and one at the sentence level. The hierarchical attention proposed by the authors re-weighs the word attentions by the corresponding sentence-level attention. The document representation dt is then obtained as the weighted sum of the feature-rich input vectors. The decoder is based on a uni-directional GRU.
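The discretization step for the continuous features (TF and IDF) is easy to make concrete. Below is a minimal sketch, assuming equal-width bins over a fixed range; the actual binning scheme and bin count in the paper may differ.

```python
import numpy as np

def discretize_one_hot(values, num_bins, lo, hi):
    """Bucket continuous feature values (e.g. TF or IDF) into equal-width bins
    over [lo, hi], then encode each bin index as a one-hot vector."""
    edges = np.linspace(lo, hi, num_bins + 1)[1:-1]   # interior bin edges
    idx = np.digitize(values, edges)                  # bin index per value
    one_hot = np.zeros((len(values), num_bins))
    one_hot[np.arange(len(values)), idx] = 1.0
    return one_hot

tf = np.array([0.02, 0.31, 0.97])                     # toy term-frequency values
oh = discretize_one_hot(tf, num_bins=4, lo=0.0, hi=1.0)
```

Each resulting one-hot row can then be passed through an embedding matrix, exactly as the discrete POS and NER features are.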

[See et al., 2017] propose a neural network architecture called the Pointer-Generator Network. Its encoder is a single-layer bidirectional LSTM, and the document representation dt is computed from the attention weights and the encoder's hidden states. The basic building block of the decoder is a single-layer uni-directional LSTM. The authors also propose a coverage mechanism that penalizes repeated attention on already-attended words.
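The coverage penalty can be sketched directly: a running coverage vector accumulates the attention distributions of past decoder steps, and each step is penalized by its overlap, min(a_t, c_t), with that accumulated coverage. This is a minimal NumPy sketch of the loss term only, not of the full network.

```python
import numpy as np

def coverage_loss(attn_history):
    """attn_history: (T, src_len) attention distributions, one row per decoder
    step. Returns the total coverage penalty sum_t sum_i min(a_t[i], c_t[i])."""
    coverage = np.zeros(attn_history.shape[1])
    loss = 0.0
    for a in attn_history:
        loss += np.minimum(a, coverage).sum()   # overlap with past attention
        coverage += a                           # update running coverage
    return loss

repeat = np.array([[1.0, 0.0], [1.0, 0.0]])     # attends the same word twice
spread = np.array([[1.0, 0.0], [0.0, 1.0]])     # attends different words
```

Re-attending the same source word is penalized (`coverage_loss(repeat)` is positive), while attending fresh words costs nothing, which is what discourages repetition in the generated summary.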

In summarization, one critical issue is representing the semantic meanings of sentences and documents. Neural-based models display superior performance at automatically extracting these feature representations. However, deep neural network models are neither transparent nor easy to integrate with prior knowledge; further analysis and understanding of them are needed before they can be exploited fully. In addition, current neural-based models have the following limitations: 1) they cannot handle sequences longer than a few thousand words, due to their large memory requirements; 2) they do not work well on small-scale datasets, due to their large number of parameters; 3) they are very slow to train, due to their complexity. There are many interesting and promising directions for future research on text summarization; we propose three in this review: 1) using pre-training techniques such as ELMo, ULMFiT, or BERT to obtain better summarization results and tackle the limitation of data; 2) using reinforcement learning approaches, such as the actor-critic algorithm, to train the neural-based models; 3) exploiting techniques from text simplification to transform documents into simpler ones for summarizers to process.

This survey presented the potential of neural-based techniques in automatic text summarization, based on an examination of state-of-the-art extractive and abstractive summarizers.

Neural-based models are promising for text summarization when large-scale datasets are available for training. However, many challenges with neural-based models remain unsolved. Future research directions, such as adding reinforcement learning algorithms and text simplification methods to current neural-based models, are provided to researchers.

[Kågebäck et al., 2014] Kågebäck, M., Mogren, O., Tahmasebi, N., and Dubhashi, D. (2014). Extractive summarization using continuous vector space models. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)@EACL, pages 31–39.

[Jones et al., 1999] Jones, K. S. et al. (1999). Automatic summarizing: factors and directions. Advances in automatic text summarization, pages 1–12.

[Wong et al., 2008] Wong, K.-F., Wu, M., and Li, W. (2008). Extractive summarization using supervised and semi-supervised learning. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 985–992. Association for Computational Linguistics.

[Cao et al., 2015] Cao, Z., Wei, F., Li, S., Li, W., Zhou, M., and Wang, H. (2015). Learning summary prior representation for extractive summarization. In ACL.

[Nallapati et al., 2017] Nallapati, R., Zhai, F., and Zhou, B. (2017). SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 3075–3081.

[Rush et al., 2015] Rush, A. M., Chopra, S., and Weston, J. (2015). A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 379–389.

[Nallapati et al., 2016] Nallapati, R., Zhou, B., dos Santos, C. N., Gülçehre, Ç., and Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pages 280–290.

[See et al., 2017] See, A., Liu, P. J., and Manning, C. D. (2017). Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, pages 1073–1083.