# Tuning Your Keras SGD Neural Network Optimizer

Author’s Note: This article does not walk through the math behind each optimizer. Its focus is a conceptual walkthrough of each optimizer and how it performs. If you don’t know what gradient descent is, check out this link before continuing with the article.

SGD is the default optimizer for the Python Keras library as of this writing. SGD differs from regular gradient descent in how it calculates the gradient. Instead of using the entire training set to compute the gradient at each step, it uses a single randomly selected instance from the training data to estimate it. This generally leads to faster convergence, but the steps are noisier because each one is based on an estimate of the true gradient.
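To make the idea concrete, here is a minimal pure-Python sketch of that single-instance update on a hypothetical toy problem (fitting `y = w*x` with squared loss; the data, learning rate, and step count are illustrative assumptions, not from the article):

```python
import random

# Hypothetical toy data: y = 2x, so the ideal weight is w = 2.0.
random.seed(0)
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

w = 0.0    # initial weight
lr = 0.05  # learning rate (illustrative value)

for step in range(100):
    x, y = random.choice(data)   # one randomly selected training instance
    grad = 2 * (w * x - y) * x   # d/dw of the squared error (w*x - y)**2
    w -= lr * grad               # noisy step toward the minimum

print(w)  # ends near 2.0, despite each step using only one sample
```

Each step's gradient is a noisy estimate, yet the weight still converges; that trade of noise for per-step cheapness is the core of SGD.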

## Momentum Optimization

Momentum optimization is an improvement on regular gradient descent. The idea behind it is essentially that of a ball rolling down a hill: it starts slow but picks up speed over time. In momentum optimization, the gradient is used as an acceleration rather than as a velocity, as it is in regular gradient descent. This simple idea is very effective. Think of what happens when regular gradient descent gets close to a minimum: the gradient is small, so the steps become tiny and convergence can take a long time. With momentum optimization, convergence happens faster because each step takes the previous gradients into account rather than just the current one.

Gradient descent goes down a steep slope quite fast, but then takes a very long time to move along the valley floor. In contrast, momentum optimization will roll down the valley faster and faster until it reaches the bottom (the optimum).
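The difference in update rules can be sketched in a few lines. This is a hypothetical 1-D toy problem (minimizing `f(w) = 0.5 * w**2`, whose gradient is simply `w`); the learning rate, momentum value, and step count are illustrative assumptions:

```python
lr, beta, steps = 0.01, 0.9, 200

# Plain gradient descent: the step IS the (scaled) gradient,
# so steps shrink as the gradient shrinks near the minimum.
w_gd = 10.0
for _ in range(steps):
    w_gd -= lr * w_gd

# Momentum: the gradient accelerates a velocity that accumulates
# across steps, so progress keeps building along the valley.
w_mom, v = 10.0, 0.0
for _ in range(steps):
    v = beta * v - lr * w_mom  # gradient acts as acceleration
    w_mom += v                 # velocity acts as the step

print(abs(w_gd), abs(w_mom))  # momentum ends much closer to the optimum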

Implementing momentum optimization in Keras is quite simple. You use the SGD optimizer and change a few parameters, as shown below.

`optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)`

The momentum hyperparameter essentially controls an induced friction (0 = high friction, 1 = no friction). This “friction” keeps the momentum from growing too large. A value of 0.9 is typical.

Keras also supports Nesterov accelerated gradient, a variant of momentum optimization that measures the gradient slightly ahead in the direction of the momentum rather than at the current position, which usually speeds up convergence. To use it, set `nesterov=True`:

`optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)`
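The Nesterov tweak is small enough to sketch directly. Reusing the same hypothetical toy problem as before (`f(w) = 0.5 * w**2`, gradient `w`; learning rate and momentum values are illustrative assumptions), the only change from plain momentum is where the gradient is evaluated:

```python
def grad(w):
    return w  # gradient of the toy objective 0.5 * w**2

lr, beta = 0.01, 0.9
w, v = 10.0, 0.0
for _ in range(200):
    # Evaluate the gradient at the look-ahead point w + beta*v,
    # i.e. slightly ahead in the direction the momentum is carrying us,
    # instead of at the current position w.
    v = beta * v - lr * grad(w + beta * v)
    w += v

print(abs(w))  # converges close to the optimum at 0
```

Because the look-ahead gradient points a bit more accurately toward the optimum, this variant tends to oscillate less in narrow valleys than plain momentum.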