Consider you have a classification task at hand. You pre-processed your data, fed it to a fancy Scikit-learn algorithm and boom – your model is ready. But you were unsatisfied. Your curiosity coaxed you to look under the hood and get hold of what’s happening on the inside. So, today, let’s do something to feed that urge.

Follow along!

We generally predict outcomes from a model by multiplying the input ‘Xs’ with the given weights ‘Ws’. The score function, thus formed, is calculated as:

\(f(x) = Wx + b\), where \(b\) stands for the bias term.


As seen in the above figure, every row in W gets multiplied by the input vector and every element in the output is a sum of an element in the ‘Weight*input’ vector and its corresponding bias term.
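To make this concrete, here is a minimal NumPy sketch of the score function. The numbers (3 classes, 4 features) are made up purely for illustration:

```python
import numpy as np

# Hypothetical example: 3 classes, 4 input features.
W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])  # one row of weights per class
b = np.array([1.1, 3.2, -1.2])          # one bias per class
x = np.array([1.0, 2.0, 0.5, -1.0])     # a single input vector

scores = W @ x + b                      # f(x) = Wx + b, one score per class
```

Each row of W produces one class score, and each element of `b` shifts its score independently.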

But to classify things, the score function must give out maximum score for the correct class. This does not happen on its own. We need to improve the score function and find the optimal parameters that work for every item in our input space. It’s here that Loss Functions and Updates come into the scene.

Starting with today’s topics, let’s jump into Loss Functions and understand the hell out of them.

So, you might first want to know, what the heck is a Loss Function?

Yeah, you are there, calculating the ‘Loss’ when you sell an ML model to somebody. No, just kidding.

Loss functions are, in layman’s terms, ways to express your dissatisfaction with a model.

Recall that for the i-th example we are given the input \(X^i\) and the label \(y_i\) that specifies the index of the correct class. The score function takes the data in (it might be pixels from images or even videos, though videos are a bit more complex) and computes the vector \(f(X^i, W)\) of class scores, which we will abbreviate as S (short for scores). For example, the score for the j-th class is the j-th element of the score vector: \(S_j^i = f_j(X^{i}, W)\).

If the model classifies correctly, that is, the correct class’s score \(S_{y_i}^i\) is the maximum, then we are happy with the model. The problem starts when this is not the case and \(S_{y_i}^i\) is not the maximum score out of the model. The loss will be high if we’re doing a poor job of classifying the training data (low score for the correct class), and it will be low if we’re doing well (high score for the correct class).

There are several loss functions, so, let’s discuss the two most commonly used in calculating the loss.

**Multi-Class Support Vector Machine (SVM) Loss**

(FYI: SVM is one of the best off-the-shelf supervised learning algorithms, as termed by Andrew Ng himself.)

The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin Δ (to understand the concept of margin, read this lecture note). The multi-class SVM loss for the i-th example is then formally written as:

\(L_i = \sum_{j \neq y_i} \max(0,\; S_j^i - S_{y_i}^i + \Delta)\)

Here, \(L_i\) is the loss for input \(i\). The loss is a summation, over all the incorrect classes \(j \neq y_i\), of the maximum of 0 and the difference between the score for class \(j\) and the score of the correct class, plus the margin. In terms of the weights, the score for class \(j\) is \(w_j^T X^i\), where \(w_j\), the j-th row of W, is reshaped as a column to facilitate matrix multiplication. However, this will not necessarily be the case once we start to consider more complex forms of the score function f, but for now this will do. The threshold function \(\max(0, -)\) is called the Hinge loss. In some cases the squared version \(\max(0, -)^2\), called the squared Hinge loss, is used, but the version without the square is the standard.

Total loss is the mean of the values of \(L_i\) over all N training examples:

\(L = \frac{1}{N}\sum_{i=1}^{N} L_i\)

Now, let’s take an example to see how this works. Consider that we have three classes and, for a particular input, we received the following scores: S = [-5, 15, 7]. Let the first class be the true class. Calculating the loss will then look something like this:

L = max(0,15-(-5)+8)+max(0,7-(-5)+8) = 48.

So, for this example, the SVM loss is 48, which is high because both incorrect scores were greater than the correct class’s score. Here, we took Δ to be 8. In summary, the SVM loss function wants the score for the correct class to be larger than the scores of the incorrect classes by a margin of at least Δ (delta).
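The calculation above can be sketched in NumPy. This is a minimal single-example version, not a fully vectorized implementation:

```python
import numpy as np

def svm_loss(scores, y, delta):
    """Multiclass SVM (hinge) loss for a single example.
    scores: vector of class scores, y: index of the correct class."""
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0          # the correct class never contributes to the loss
    return margins.sum()

S = np.array([-5.0, 15.0, 7.0])
loss = svm_loss(S, y=0, delta=8)   # max(0, 28) + max(0, 20) = 48
```

If the correct class already beats every other score by at least Δ, every margin clamps to 0 and the loss vanishes.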

One important partner of Loss Functions is the Regularizer. Consider you have a set of weights that correctly classifies all the inputs. Well, that is awesome, as it means we don’t need to worry about losses; phew! But no, this is where the problem begins. If we have such weights, we can scale them by any factor and get countless such sets of weights, but scaling also affects the difference between two class scores. For example, if you increase the weights from W to 2W, and the initial difference between the correct class and the closest incorrect class was 8, doubling the weights will increase it to 16.

To fix this, we introduce the Regularization Penalty R(W) that extends the loss function into a demonic equation:

\(L = \frac{1}{N}\sum_{i} L_i + \lambda R(W)\)

and ultimately this:

\(L = \frac{1}{N}\sum_{i} \sum_{j \neq y_i} \max\left(0,\; f_j(X^i, W) - f_{y_i}(X^i, W) + \Delta\right) + \lambda \sum_k \sum_l W_{k,l}^2\)

Here, the term that sums over all the elements of W squared, \(R(W) = \sum_k \sum_l W_{k,l}^2\), is the regularizer. This is also called L2 regularization. The term λ (lambda) is a hyperparameter that controls the strength of the penalty. The reason for choosing the L2 regularizer is coherent with the max-margin property of the SVM. This generalizes the classifier and makes it less prone to overfitting.

**Exit SVM. Enter mighty Softmax.**


Another important loss function is the Cross-Entropy Loss, which has its roots in the Softmax classifier, a multi-class variant of the Logistic Regression classifier (a Logistic classifier is one that maps inputs to values between 0 and 1, i.e., y \(\in\) [0,1]). Here is what a Logistic/Sigmoid function looks like:

You can see that as the magnitude of the input increases, the outputs are clamped to zero or one. The Cross-Entropy loss is expressed as follows:

\(L_i = -\log\left(\dfrac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)\), where \(f_j\) is the \(j^{th}\) element of the vector of class scores f.

Here too, the total loss is the mean of all the per-example losses, regularized with a penalty R.

Due to the internal properties of the Softmax, we can interpret the loss as the negative log of the probability assigned to the correct class: the exponentiated score for the correct class of input i, normalized by the sum of the exponentiated scores of all the classes. This probability is expressed as follows:

\(P(y_i \mid X^i; W) = \dfrac{e^{f_{y_i}}}{\sum_j e^{f_j}}\)

During calculation, the values \(e^{f_{y_i}}\) and \(\sum_j e^{f_j}\) can get unreasonably large. To fix this, we can multiply the numerator and denominator by an arbitrary constant V without changing the result. A common choice is to set \(\log V = -\max_j f_j\): while coding it up, we take this into account by subtracting the maximum value of f from every element of the score vector, so that the largest value in the score vector is zero.
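A minimal NumPy sketch of this stability trick; the shift by the maximum changes nothing mathematically, but it keeps the exponentials from overflowing:

```python
import numpy as np

def softmax_cross_entropy(f, y):
    """Numerically stable cross-entropy loss for a single example."""
    f = f - np.max(f)                    # largest score becomes 0
    p = np.exp(f) / np.sum(np.exp(f))    # softmax probabilities
    return -np.log(p[y])

f = np.array([1000.0, 1001.0, 1002.0])   # naive exp(f) would overflow
loss = softmax_cross_entropy(f, y=2)
```

Without the shift, `np.exp(1000.0)` overflows to infinity and the loss becomes `nan`; with it, the computation stays finite.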

To summarize, both the SVM loss and the Cross-Entropy loss are useful. Where the SVM loss works directly on score differences and encourages the correct class score to beat the others by a margin, the Cross-Entropy loss (sometimes also referred to as the Softmax loss) computes the log probability of the correct class and stresses maximizing that probability. From the plot of Loss vs. Epoch below, it is evident that the Logistic/Cross-entropy loss converges quickly and works better than the others.


__Updates__


Having found the scores and calculated the gradients [remember how we calculated gradients for the Cost Function in the previous article], we now need to update the parameters in such a way that the model accurately maps all the Xs (inputs) to their corresponding Ys (labels). Thus, we need to apply some tricks to make the values converge to their respective minima.

So, what does the Update do? Download something… No.

The update takes small, gradient-sized steps toward the minimum of the Cost Function. You might argue: what if we take a step in the wrong direction? Well, the gradient is the derivative of the Cost Function, which points in the direction of increase; updating in the negative direction will therefore only take us down the hill, minimizing the ‘Loss’.

Too technical? Consider this: You are an adventure cyclist and you want to follow a crooked path down the hill.

The update function is your map that guides you to make the right turns, ultimately get you downhill.

There are numerous ways to do this. Also, note that this area of optimization is an active area of research, with countless intelligent scientists working day and night to give us new and awesome update functions.

#### First, the good ol’ **Stochastic Gradient Descent** a.k.a. SGD.

It is the simplest form of update: we simply change the parameters along the negative gradient direction, scaled by a hyperparameter called the learning rate. A learning rate that is too large makes the updates overshoot, while a lower learning rate lets the model learn more reliably (at the cost of speed).
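As a sketch, here is the SGD step on a toy 1-D quadratic, \(f(x) = x^2\) with gradient \(2x\) (the function and hyperparameters are made up for illustration):

```python
def sgd_update(x, dx, learning_rate):
    """Vanilla SGD: step along the negative gradient."""
    return x - learning_rate * dx

# Minimize f(x) = x^2, whose gradient is 2x; the minimum is at 0.
x = 5.0
for _ in range(1000):
    x = sgd_update(x, dx=2 * x, learning_rate=0.1)  # x shrinks by 20% per step
```

With a learning rate above 1.0 this same loop would diverge, which is the overshooting problem mentioned above.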

#### Second comes the **Momentum** update.

SGD struggles in navigating areas where the surface curves much more steeply in one dimension than in another. Such conditions are common around local optima. This results in the function converging very slowly.

This is how SGD progresses when the loss function is shallow horizontally but steep vertically: it makes huge jitters in the vertical direction while making minimal progress per iteration in the horizontal direction.

To tackle this problem, we use something similar in interpretation to the idea of velocity in physics. Mathematically, the momentum update is written as:

\(v = \mu v - \eta \nabla f(x)\) ; \(x = x + v\)

where v stands for velocity, \(\eta\) is the learning rate, and \(\mu\) is ‘mu’, a misnomer for momentum (in physics, ‘mu’ stands for the frictional coefficient). The value of \(\mu\) is generally set to 0.9, but varied between 0.5 and 0.99 during cross-validation.

When using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. \(\mu < 1\)). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation. This can be visualized as in the image below:

We can see, in the picture, that the function converges in fewer iterations.
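As a sketch, here is the momentum update on a toy 1-D quadratic (\(f(x) = x^2\), gradient \(2x\)); the hyperparameters are the commonly used defaults:

```python
def momentum_update(x, dx, v, mu=0.9, learning_rate=0.1):
    """Momentum: v is a decaying running sum of gradients ('velocity')."""
    v = mu * v - learning_rate * dx   # accumulate velocity
    x = x + v                         # move along the velocity
    return x, v

# Minimize f(x) = x^2, whose gradient is 2x.
x, v = 5.0, 0.0
for _ in range(1000):
    x, v = momentum_update(x, 2 * x, v)
```

The velocity term is what smooths out the vertical jitter: gradients that keep flipping sign cancel inside `v` instead of jerking `x` around.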

#### Following this is **Nesterov Momentum**.

Seriously, no one likes a blind follower. And the Momentum update does just that: it blindly follows the slope. So, Nesterov works by giving momentum some ‘lookahead’ properties. Consider the momentum term \(\mu v\). This alone increases the position x by \(\mu v\), ignoring the gradient term for now. So, it seems intuitive to evaluate the gradient at this lookahead position \(x + \mu v\) instead of at x.

In the figure, the green arrow represents momentum and the red arrow the actual step. Evaluating the gradient at the base of both these arrows (this represents x), we know that momentum will take us to the tip of the green arrow. So, we look ahead and evaluate the gradient at this ‘lookahead’ position.

To better understand Nesterov and its origin, it’s highly recommended that you take a look at section 7.2 of Ilya Sutskever’s thesis.
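The lookahead idea can be sketched like this, again on a toy 1-D quadratic (\(f(x) = x^2\)); `grad_fn` is an assumed helper returning the gradient at a given point:

```python
def nesterov_update(x, grad_fn, v, mu=0.9, learning_rate=0.1):
    """Nesterov momentum: evaluate the gradient at the lookahead point x + mu*v."""
    lookahead = x + mu * v
    v = mu * v - learning_rate * grad_fn(lookahead)
    return x + v, v

# Minimize f(x) = x^2, whose gradient is 2x.
x, v = 5.0, 0.0
for _ in range(1000):
    x, v = nesterov_update(x, lambda p: 2 * p, v)
```

The only change from plain momentum is *where* the gradient is evaluated; the update rule itself is identical.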

#### Third is **AdaGrad.**

It is an adaptive learning rate method, introduced in this paper by Duchi et al. It works by adapting the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. Thus, it is well suited for dealing with sparse data. Adagrad eliminates the need to manually tune the learning rate; most implementations use a default value of 0.01 and just forget about it. Quite convenient, right?

But no good in this world comes without cons, and Adagrad too has its fair share. A downside of Adagrad is that, in the case of Deep Learning, its monotonically decreasing learning rate usually proves too aggressive and stops learning too early.
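A sketch of the Adagrad step; the ever-growing `cache` of squared gradients is exactly what makes the effective learning rate shrink monotonically:

```python
import numpy as np

def adagrad_update(x, dx, cache, learning_rate=0.01, eps=1e-8):
    """Adagrad: per-parameter rate scaled by accumulated squared gradients."""
    cache = cache + dx ** 2                              # only ever grows
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache

# Watch the effective step size shrink on f(x) = x^2 (gradient 2x).
x, cache, steps = 5.0, 0.0, []
for _ in range(3):
    x_new, cache = adagrad_update(x, 2 * x, cache)
    steps.append(abs(x_new - x))
    x = x_new
```

Even with a nearly constant gradient, each successive step is smaller than the last, which is the early-stopping-of-learning problem described above.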

#### Fourth, **RMSprop**

Well, believe it or not, this is an unpublished method that was actually introduced in Geoff Hinton’s Coursera class (Lecture 6, Slide 29). It takes the idea of Adagrad and improves on its notorious, monotonically decreasing learning rate, i.e., it does not let the updates get monotonically smaller. This is done by using a running average of squared gradients instead of a sum, controlled by a hyperparameter called the decay rate, whose value is generally around 0.9.
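A sketch of the RMSprop step; swapping Adagrad’s ever-growing sum for a leaky running average is the whole trick (the toy quadratic and hyperparameters are illustrative):

```python
import numpy as np

def rmsprop_update(x, dx, cache, decay_rate=0.9, learning_rate=0.01, eps=1e-8):
    """RMSprop: a leaky running average of squared gradients, so the
    effective learning rate no longer decays monotonically to zero."""
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache

# Minimize f(x) = x^2, whose gradient is 2x.
x, cache = 5.0, 0.0
for _ in range(1000):
    x, cache = rmsprop_update(x, 2 * x, cache)
```

Because `cache` can shrink again when gradients get small, the update keeps making progress long after Adagrad would have stalled.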

Now, you might think: we have fixed Adagrad’s decaying rate with RMSprop, so why not speed things up as well and reduce training time? That’s exactly what the **Adaptive Moment Estimation, also known as Adam**, proposed at ICLR 2015, does. Adam is similar to RMSprop, but a smoothed version of the gradient is used rather than the raw gradient dx. Practically, Adam is considered the go-to algorithm for updates today, but we must not abstain from trying the others.
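A sketch of the Adam step under the standard hyperparameters (β₁ = 0.9, β₂ = 0.999); `m` is the smoothed gradient and `v` the RMSprop-style squared-gradient average, and the toy quadratic is again only an illustration:

```python
import numpy as np

def adam_update(x, dx, m, v, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-smoothed gradient plus per-parameter scaling,
    with a bias correction for the first few steps (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * dx        # smoothed gradient
    v = beta2 * v + (1 - beta2) * dx ** 2   # running squared gradient
    m_hat = m / (1 - beta1 ** t)            # bias correction: m and v start
    v_hat = v / (1 - beta2 ** t)            # at 0 and would be underestimates
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Minimize f(x) = x^2, whose gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_update(x, 2 * x, m, v, t, learning_rate=0.1)
```

The only structural difference from RMSprop is that the numerator is the smoothed `m_hat` rather than the raw gradient `dx`.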

A picture is worth a thousand words. So, to summarize whatever we saw above, here is a picture from Alec Radford which I nabbed from CS231n’s slides.

There are a few other updates, such as AdaDelta, Nadam, Adamax, Downpour SGD, and Elastic Averaging SGD, that are refinements of their corresponding predecessors. However, there is one unique method, named after the mighty Newton, that might prove very helpful if you have small datasets. It’s simply called Newton’s method, or the Newton-Raphson method, and is represented as:

\(\theta := \theta - H^{-1}\nabla_{\theta} f(\theta)\)

where θ is the vector of parameters, \(\nabla_{\theta} f(\theta)\) is, as usual, the vector of partial derivatives of f with respect to the \(\theta_i\), and H is an n-by-n matrix (actually n + 1-by-n + 1, assuming that we include the intercept term) called the Hessian, whose elements are characterized by:

\(H_{ij} = \dfrac{\partial^2 f(\theta)}{\partial \theta_i \, \partial \theta_j}\)

This method has a faster convergence rate than SGD, but its per-iteration cost is expensive compared to an SGD update, as it involves computing a matrix of second-order partial derivatives and then inverting it. This makes the Hessian impractical while working with images: models over images have huge numbers of parameters, and inverting such a large matrix can be very painful.
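As a toy illustration of why the convergence is so fast: on a 1-D quadratic the Hessian is just the constant second derivative, and a single Newton step lands exactly on the minimum:

```python
# Minimize f(θ) = θ², with gradient 2θ and Hessian (second derivative) 2.
theta = 3.0
grad = 2 * theta
hess = 2.0
theta = theta - grad / hess     # one Newton step lands exactly at the minimum
```

SGD would have needed many small steps to get there; the price, in higher dimensions, is forming and inverting the full Hessian at every iteration.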

With this we come to an end of our discussion on Updates. Now just the last topic to discuss – Huh!

__Activations__

Termed by Google’s Vincent Vanhoucke (one of the minds behind the Inception v4 model and TensorFlow) as tools used by all the lazy engineers, activation functions play an important part in making Neural Networks useful. But, moral of the story: all engineers are lazy.


Continuing our study, **why do we need Activation functions**?

Activations are used to introduce non-linearity into our Neural Networks. Why? Because real-world problems are by no means linear, so we need a non-linear model to map them. Until you have a non-linear model, it does not matter how deep your network is; it will be no better than a simple two- or three-layer net. To make the mapping non-linear, we apply a non-linear function, called an activation function, to the data. Here too, we have multiple activation functions to choose from, so let’s get started.

To start the party, we have:

**Sigmoid function **

that clamps the inputs to [0,1], where 0 represents that a feature is absent while 1 represents that it is present. Although the sigmoid has been around for a while, it is quite detrimental to use in a Neural Network. During backpropagation (a concept used in neural networks to calculate gradients; if you want to know more about it, check out this awesome blog post by Google’s Chris Olah), we multiply upstream gradients by the local gradient of the activation function. So, if most of those local gradients are close to zero, the gradients flowing back will be killed and the corresponding features will not be taken into account. Moreover, the outputs from this activation are not zero-centered (they do not have zero mean).

But, Sigmoid says, ‘I’ll be back!’ (Spoiler: LSTMs)
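A sketch of the saturation problem: the sigmoid’s local gradient is σ(x)(1 − σ(x)), which collapses toward zero for inputs of large magnitude:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)      # the local gradient multiplied in during backprop

# At x = 0 the gradient peaks at 0.25; at x = 10 it has all but vanished.
g_center, g_saturated = sigmoid_grad(0.0), sigmoid_grad(10.0)
```

Note that even the best case is 0.25, so every sigmoid layer shrinks the backpropagated gradient by at least a factor of four.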

__Tanh Activation__


Represented as:

\(\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\)

and plotted as:


tanh is a small improvement over the sigmoid that lets more of the signal through a network by clamping inputs to the range [-1,1]. It still slays gradients at its saturated ends, but it is a little more lenient than the sigmoid and, also, its outputs are zero-centered. Using it, neural networks enjoy a small improvement over the sigmoid. But we need more.

**The Rectified Linear Unit or ReLU**

Popular in the field of Neural Networks for a few years now, this function simply thresholds inputs at zero: \(f(x) = \max(0, x)\).

This function is extremely fast; as reported in Krizhevsky et al., it speeds up convergence by a factor of 6 compared to sigmoid/tanh, and it is very simple to compute compared to its counterparts.

But this too has its downsides. Consider what happens if a neuron’s input is always negative: that neuron will never activate, and, from that point on, the part of the network depending on it is dead. There are proposals to fix this, such as the Leaky ReLU, which functions as:

\(f(x) = \max(0.1x,\; x)\)

The 0.1 slope is not fixed, but the idea that the network allows a small amount of negative values through is the important part.
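Both variants are one-liners in NumPy; the 0.1 slope in the leaky version is the assumed (and tunable) leak:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)               # thresholds at zero

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)  # lets a small negative signal through

x = np.array([-2.0, -0.5, 0.0, 3.0])
r, lr = relu(x), leaky_relu(x)
```

For the negative inputs, ReLU outputs exactly zero (and so does its gradient), while the leaky version keeps a small, non-zero signal flowing.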


There is one more technique, called Maxout, proposed by Ian Goodfellow, the founder of Generative Adversarial Nets. In this paper, Goodfellow et al. change the function of the form \(f(w^T x + b)\) at the non-linearity to a maxout neuron of the form:

\(\max(w_1^T x + b_1,\; w_2^T x + b_2)\)

On a closer look, this turns out to be a general form of both the ReLU and the Leaky ReLU. Since it doesn’t clamp to 0, the problem of ending up with dead neurons is gone, but we end up with a more expensive function with double the parameters for every neuron.
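A sketch of a single maxout unit (the weights and input are made-up illustrations); setting the second set of weights to zero recovers a ReLU on \(w_1^T x + b_1\), which is what makes maxout a generalization:

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    """Maxout unit: max(w1.x + b1, w2.x + b2) — twice the parameters."""
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

x = np.array([1.0, -2.0])
w1, b1 = np.array([0.5, 1.0]), 0.2
w2, b2 = np.zeros(2), 0.0       # zeroing one half recovers ReLU behaviour
out = maxout(x, w1, b1, w2, b2) # max(-1.3, 0) = 0.0, exactly a ReLU output
```

With both halves learned freely, the unit can represent any convex piecewise-linear function of two pieces, at the cost of doubling the parameters.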

[Above picture from Ian Goodfellow’s paper shows how maxout behaves for a 1D input]

With this, we conclude this long (I hope not tedious) post on Loss functions, methods to minimize them (Updates), and Activation functions (a concept related to Neural Networks).

To end this, below is an uber cool video from Siraj Raval (this guy rocks, really) and some enlightening papers you might want to check out.

**Code: **

Here is an implementation for everything discussed in this post.

**Further Reading**:

- Advances in Optimizing Recurrent Networks by Yoshua Bengio
- L2SVM: A diversion from SVM loss.
- Diving Deep into Rectifiers

So, until next time, keep training!

[Note: In this post I’ve used some pictures that I found through google and also from Stanford’s CS231n lecture notes.]
