Nesterov Accelerated Gradient (NAG) is an optimization algorithm used in deep learning that extends SGD with momentum. Instead of evaluating the gradient at the current parameters, NAG evaluates it at the point the momentum is about to carry the parameters to. This "look ahead" anticipates the effect of the momentum, so the update can correct course earlier, which speeds up convergence and helps avoid overshooting the optimum. The result is momentum-based optimization that typically converges faster and yields better-performing networks.
Formula:
$$v_{t} = \beta \cdot v_{t-1} + (1 - \beta) \cdot \nabla J(\theta - \beta \cdot v_{t-1})$$
$$\theta = \theta - \alpha \cdot v_{t}$$
where $\theta$ are the parameters, $\alpha$ is the learning rate, $\beta$ is the momentum coefficient, $v_t$ is the velocity, and $\nabla J(\theta - \beta \cdot v_{t-1})$ is the gradient of the loss evaluated at the look-ahead point.
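To make the update rule concrete, here is a minimal NumPy sketch of the two equations above applied to a toy quadratic loss. The names `loss_grad` and `nag_step`, and the toy objective, are illustrative only and not part of any library.

```python
import numpy as np

def loss_grad(theta):
    # Gradient of a toy quadratic loss J(theta) = 0.5 * ||theta||^2 (illustrative only)
    return theta

def nag_step(theta, v, alpha=0.01, beta=0.9):
    # Evaluate the gradient at the look-ahead point theta - beta * v
    grad = loss_grad(theta - beta * v)
    # Update the velocity as an exponential moving average of look-ahead gradients
    v = beta * v + (1 - beta) * grad
    # Move the parameters against the velocity
    theta = theta - alpha * v
    return theta, v

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(1000):
    theta, v = nag_step(theta, v)
print(theta)  # moves toward the minimum at [0, 0]
```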
Keras code example:
SGD:
tf.keras.optimizers.SGD(learning_rate=0.01)
SGD with momentum:
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
NAG:
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
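For context, a minimal end-to-end sketch of plugging the NAG optimizer into a Keras model follows; the model architecture and the random data are placeholders chosen for illustration, not taken from the original example.

```python
import numpy as np
import tensorflow as tf

# Toy data, used only to show the optimizer inside a full training call
x = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# SGD with momentum and nesterov=True applies the NAG update
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32)
```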