10. Deep Learning
Deep learning is a subset of machine learning (which itself is part of artificial intelligence) that focuses on training artificial neural networks with multiple layers to learn and make predictions from complex data. These networks are inspired by the human brain's structure, where "neurons" process information and pass it along.
Unlike traditional machine learning algorithms (e.g., linear regression or decision trees), which often require manual feature engineering (hand-picking important data characteristics), deep learning models automatically extract features from raw data through layers of processing. This makes them powerful for tasks like image recognition, natural language processing, speech synthesis, and more.
Deep learning excels with large datasets and high computational power (e.g., GPUs), but it can be "black-box" in nature—meaning it's sometimes hard to interpret why a model makes a specific decision.
The core building block is the artificial neural network (ANN), which consists of interconnected nodes (neurons) organized into layers. Data flows from the input layer, through hidden layers (where the "deep" part comes in, with many layers stacked), to the output layer. Training involves adjusting weights (connections between neurons) using algorithms like backpropagation to minimize errors.
Key Components
A typical neural network has three main parts:
- Input Layer: The entry point where raw data (e.g., pixel values from an image) is fed into the network. It doesn't perform computations; it just passes data forward.
- Hidden Layers: The "depth" of deep learning. These are where the magic happens—multiple stacked layers that transform the data through mathematical operations. Each layer learns increasingly abstract representations (e.g., from edges in an image to full objects).
- Output Layer: The final layer that produces the prediction or classification (e.g., "cat" or "dog" in an image classifier).
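To make these three parts concrete, here is a minimal NumPy sketch of a forward pass through a tiny input → hidden → output network. The layer sizes, random weights, and input values are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

# A tiny network: 3 inputs -> 4 hidden units (ReLU) -> 2 outputs (softmax).
# All weights and the input below are made-up illustrative values.
rng = np.random.default_rng(0)
x = np.array([0.2, -0.1, 0.7])                  # Input layer: raw features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # Hidden layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # Output layer parameters

h = np.maximum(0, W1 @ x + b1)                  # Hidden layer: linear + ReLU
logits = W2 @ h + b2                            # Output layer: linear scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # Softmax -> class probabilities
print("Class probabilities:", probs)
```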
Different Types of Layers
Deep learning models use various specialized layers depending on the task and architecture. Here's an overview of common layer types, grouped by their typical use. The following table summarizes their characteristics:
Layer Type | Description | Common Use Cases | How It Works |
---|---|---|---|
Dense (Fully Connected) | Every neuron in this layer is connected to every neuron in the previous layer. It's the most basic type. | General-purpose networks, like simple classifiers or regressors. Often used in the final stages of more complex models. | Applies a linear transformation (weights * inputs + bias) followed by an activation function (e.g., ReLU) to introduce non-linearity. |
Convolutional | Uses filters (kernels) to scan input data, detecting local patterns like edges or textures. Key to "convolutional neural networks" (CNNs). | Image and video processing, computer vision (e.g., object detection in photos). | Slides filters over the input, computing dot products to create feature maps. Reduces spatial dimensions while preserving important features. |
Pooling | Downsamples the output from convolutional layers, reducing computational load and preventing overfitting. Types include max pooling (takes the maximum value) and average pooling. | Follows convolutional layers in CNNs to summarize features. | Aggregates values in small regions (e.g., 2x2 grid) into a single value, making the model more robust to variations like translations. |
Recurrent (e.g., RNN, LSTM, GRU) | Handles sequential data by maintaining a "memory" of previous inputs via loops. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are advanced variants that address vanishing gradient issues. | Time-series forecasting, natural language processing (e.g., machine translation), speech recognition. | Processes inputs one step at a time, using hidden states to carry information forward. Good for sequences but can struggle with long dependencies. |
Embedding | Converts categorical data (e.g., words) into dense vectors of fixed size, capturing semantic relationships. | NLP tasks like word embeddings (e.g., Word2Vec). Often the first layer in text-based models. | Maps high-dimensional sparse data (e.g., one-hot encoded words) to lower-dimensional continuous space. |
Attention (used in Transformers) | Allows the model to focus on relevant parts of the input dynamically, weighing their importance. Self-attention computes relationships between all elements. | Modern NLP (e.g., GPT models), machine translation, and even vision tasks. | Uses queries, keys, and values to compute attention scores, enabling parallel processing of sequences (unlike RNNs). |
Normalization (e.g., Batch Normalization, Layer Normalization) | Stabilizes training by normalizing activations within a layer, reducing internal covariate shift. | Almost all deep networks to speed up training and improve performance. | Adjusts and scales activations (e.g., mean to 0, variance to 1) across mini-batches or individual layers. |
Dropout | Randomly "drops out" (ignores) a fraction of neurons during training to prevent overfitting. | Regularization in any network, especially dense or convolutional ones. | Temporarily removes connections, forcing the network to learn redundant representations. Inactive during inference. |
Flatten | Converts multi-dimensional data (e.g., from convolutional layers) into a 1D vector for dense layers. | Transitioning from feature extraction (CNN) to classification. | Reshapes tensors without changing values, e.g., turning a 2D feature map into a flat array. |
Activation | Applies a non-linear function to the output of other layers (though often built into them). Common ones: ReLU (Rectified Linear Unit), Sigmoid, Tanh, Softmax. | Everywhere, to add non-linearity and control output ranges (e.g., Softmax for probabilities). | Transforms linear outputs; e.g., ReLU sets negative values to 0 for faster training. |
Common Deep Learning Architectures
These layers are combined into architectures tailored to specific problems:
- Feedforward Neural Networks (FNN): Basic stack of dense layers for simple tasks.
- Convolutional Neural Networks (CNN): Convolutional + pooling layers for spatial data like images (e.g., ResNet, VGG).
- Recurrent Neural Networks (RNN): Recurrent layers for sequences (e.g., LSTM for text generation).
- Transformers: Attention layers for handling long-range dependencies (e.g., BERT for NLP, Vision Transformers for images).
- Autoencoders: Encoder (convolutional/dense) + decoder layers for unsupervised learning like denoising.
- Generative Adversarial Networks (GANs): Combines generator and discriminator networks (often convolutional) for generating realistic data.
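As an illustration of how such layers are assembled, here is a minimal PyTorch sketch of a small CNN classifier; the layer sizes, the 28×28 single-channel input, and the 10-class output are illustrative assumptions rather than a reference architecture.

```python
import torch
import torch.nn as nn

# A minimal CNN: convolution + pooling for feature extraction,
# then flatten + dropout + dense for classification.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # Convolutional layer
    nn.ReLU(),                                  # Activation
    nn.MaxPool2d(2),                            # Pooling: 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 14x14 -> 7x7
    nn.Flatten(),                               # Flatten for the dense layer
    nn.Dropout(p=0.25),                         # Regularization
    nn.Linear(16 * 7 * 7, 10),                  # Dense output layer (10 classes)
)

x = torch.randn(4, 1, 28, 28)   # A batch of 4 fake grayscale images
print(model(x).shape)           # torch.Size([4, 10])
```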
Forward and Backward Pass for Each Layer
The forward pass computes the output of each layer given the input, while the backward pass computes gradients for learning.
Backpropagation computes the gradient of the loss with respect to the layer's inputs and parameters (e.g., weights, biases) to update them via optimizers like gradient descent. Assume a scalar loss \( L \), and upstream gradient \( \displaystyle \frac{\partial L}{\partial y} \) (where \( y \) is the layer's output) is provided from the next layer.
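Hand-written backward passes like the ones below are usually verified with numerical gradient checking: perturb each parameter slightly, measure the change in the loss, and compare against the analytic gradient. A minimal sketch follows, assuming a toy quadratic loss chosen only for illustration.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of dL/dx for a scalar-valued f."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp.flat[i] += eps
        xm.flat[i] -= eps
        grad.flat[i] = (f(xp) - f(xm)) / (2 * eps)
    return grad

# Toy loss L(w) = sum((w * x - t)^2); its analytic gradient is 2 * (w*x - t) * x.
x = np.array([1.0, 2.0, 3.0])
t = np.array([0.5, 1.0, 1.5])
w = np.array([0.2, -0.3, 0.8])
loss = lambda w: np.sum((w * x - t) ** 2)

analytic = 2 * (w * x - t) * x
numeric = numerical_gradient(loss, w)
print("max abs difference:", np.max(np.abs(analytic - numeric)))  # ~1e-8 or smaller
```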
A. Dense (Fully Connected)
Every neuron in a dense layer is connected to every neuron in the previous layer, making it the most basic layer type. Dense layers serve general-purpose networks such as simple classifiers and regressors, and often appear in the final stages of more complex models.
Figure: A small fully-connected layer with four input and eight output neurons. Source: Linear/Fully-Connected Layers User's Guide.
Parameters:
- \( x \): input vector; \( W \): weight matrix; \( b \): bias vector.
- Example values: \( x = [2, 3] \), \( W = \begin{bmatrix} 1 & 2 \\ 0 & -1 \end{bmatrix} \), \( b = [1, -1] \).
Forward Pass:
- Formula: \( y = Wx + b \), then apply an activation (e.g., ReLU: \( y = \max(0, y) \)).
- Example: \( y = Wx + b = [9, -4] \); after ReLU: \( [9, 0] \).
Backward Pass:
- Formulas (write \( y_{\text{linear}} = Wx + b \) for the pre-activation output and \( y' \) for the post-activation output; for ReLU, \( \frac{\partial L}{\partial y_{\text{linear}}} = \frac{\partial L}{\partial y'} \) times 1 where \( y_{\text{linear}} > 0 \), else 0):
  - Gradient w.r.t. input: \( \displaystyle \frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial y_{\text{linear}}} \).
  - Gradient w.r.t. weights: \( \displaystyle \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y_{\text{linear}}} \cdot x^T \).
  - Gradient w.r.t. bias: \( \displaystyle \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y_{\text{linear}}} \) (summed over the batch when inputs are batched).
- Example:
  - Assume the upstream gradient \( \frac{\partial L}{\partial y'} = [0.5, -0.2] \) (post-ReLU). The ReLU mask is \( [1, 0] \), so \( \frac{\partial L}{\partial y_{\text{linear}}} = [0.5, 0] \).
  - \( \displaystyle \frac{\partial L}{\partial x} = W^T \cdot [0.5, 0]^T = \begin{bmatrix} 1 & 0 \\ 2 & -1 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0 \end{bmatrix} = [0.5, 1.0] \).
  - \( \displaystyle \frac{\partial L}{\partial W} = [0.5, 0]^T \cdot [2, 3] = \begin{bmatrix} 1 & 1.5 \\ 0 & 0 \end{bmatrix} \).
  - \( \displaystyle \frac{\partial L}{\partial b} = [0.5, 0] \).
Implementation:
```python
import numpy as np

def dense_forward(x, W, b):
    y_linear = np.dot(W, x) + b
    y = np.maximum(0, y_linear)     # ReLU activation
    return y, y_linear              # Cache the pre-activation output for backprop

def dense_backward(dy_post_act, x, W, y_linear):
    # dy_post_act: dL/dy' (gradient w.r.t. the post-ReLU output)
    dy_linear = dy_post_act * (y_linear > 0)  # ReLU derivative
    dx = np.dot(W.T, dy_linear)
    dW = np.outer(dy_linear, x)
    db = dy_linear
    return dx, dW, db

# Example
x = np.array([2, 3])
W = np.array([[1, 2], [0, -1]])
b = np.array([1, -1])
y, y_linear = dense_forward(x, W, b)
dy_post_act = np.array([0.5, -0.2])
dx, dW, db = dense_backward(dy_post_act, x, W, y_linear)
print("Forward y:", y)   # [9, 0]
print("dx:", dx)         # [0.5, 1.0]
print("dW:", dW)         # [[1, 1.5], [0, 0]]
print("db:", db)         # [0.5, 0]
```
B. Convolutional
A convolutional layer uses filters (kernels) to scan the input, detecting local patterns such as edges or textures; it is the key building block of convolutional neural networks (CNNs) used in image and video processing and computer vision (e.g., object detection in photos). Sliding each filter over the input and computing dot products produces feature maps, reducing spatial dimensions while preserving important features.
Figure: Convolution of an image with an edge-detector kernel. Source: Deep Learning in a Nutshell: Core Concepts.
Figure: Calculating convolution by sliding image patches over the entire image. One image patch (yellow) of the original image (green) is multiplied by the kernel (red numbers in the yellow patch), and its sum is written to one feature-map pixel (red cell in the convolved feature). Source: Deep Learning in a Nutshell: Core Concepts.
Parameters:
- \( X \): input matrix (e.g., an image); \( K \): convolution kernel (filter); \( b \): bias term.
- Example (2D, stride = 1, no padding): \( X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \), \( K = \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix} \), \( b = 1 \).
Forward Pass:
- Formula (cross-correlation, as implemented by most frameworks): \( \displaystyle Y[i,j] = \sum_{m,n} X[i+m, j+n] \cdot K[m,n] + b \).
- Example:
  - \( Y[0,0] = 1 \times 1 + 2 \times 0 + 4 \times (-1) + 5 \times 1 + 1 = 3 \)
  - \( Y[0,1] = 2 \times 1 + 3 \times 0 + 5 \times (-1) + 6 \times 1 + 1 = 4 \)
  - \( Y[1,0] = 4 \times 1 + 5 \times 0 + 7 \times (-1) + 8 \times 1 + 1 = 6 \)
  - \( Y[1,1] = 5 \times 1 + 6 \times 0 + 8 \times (-1) + 9 \times 1 + 1 = 7 \)
  - \( Y = \begin{bmatrix} 3 & 4 \\ 6 & 7 \end{bmatrix} \)
Backward Pass:
- Formulas:
  - Gradient w.r.t. input: "full" convolution of the upstream gradient \( \displaystyle \frac{\partial L}{\partial Y} \) with the 180°-rotated kernel (padded so the result has the shape of \( X \)).
  - Gradient w.r.t. kernel: cross-correlate the input \( X \) with \( \displaystyle \frac{\partial L}{\partial Y} \).
  - Gradient w.r.t. bias: sum of \( \displaystyle \frac{\partial L}{\partial Y} \).
- Example:
  - \( \displaystyle \frac{\partial L}{\partial Y} = \begin{bmatrix} 0.5 & -0.5 \\ 1 & 0 \end{bmatrix} \).
  - \( \displaystyle \frac{\partial L}{\partial X} \): the full convolution with the rotated kernel \( \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \) gives \( \begin{bmatrix} 0.5 & -0.5 & 0 \\ 0.5 & 1.0 & -0.5 \\ -1 & 1 & 0 \end{bmatrix} \).
  - \( \displaystyle \frac{\partial L}{\partial K} \): cross-correlating \( X \) with \( \frac{\partial L}{\partial Y} \), e.g., \( \frac{\partial L}{\partial K}[0,0] = 0.5 \times 1 + (-0.5) \times 2 + 1 \times 4 + 0 \times 5 = 3.5 \); in full, \( \begin{bmatrix} 3.5 & 4.5 \\ 6.5 & 7.5 \end{bmatrix} \).
  - \( \displaystyle \frac{\partial L}{\partial b} = 0.5 - 0.5 + 1 + 0 = 1 \).
Implementation:
```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def conv_forward(X, K, b):
    # Cross-correlation (what deep learning frameworks call "convolution")
    Y = correlate2d(X, K, mode='valid') + b
    return Y, X  # Cache X for backprop

def conv_backward(dY, X, K):
    # Gradient w.r.t. input: full convolution of dY with K
    # (convolve2d flips the kernel internally, so no explicit rotation is needed)
    dX = convolve2d(dY, K, mode='full')
    # Gradient w.r.t. kernel: cross-correlate the input with dY
    dK = correlate2d(X, dY, mode='valid')
    # Gradient w.r.t. bias: sum of the upstream gradient
    db = np.sum(dY)
    return dX, dK, db

# Example
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
K = np.array([[1, 0], [-1, 1]])
b = 1
Y, _ = conv_forward(X, K, b)
dY = np.array([[0.5, -0.5], [1, 0]])
dX, dK, db = conv_backward(dY, X, K)
print("Forward Y:\n", Y)   # [[3, 4], [6, 7]]
print("dX:\n", dX)         # [[0.5, -0.5, 0], [0.5, 1.0, -0.5], [-1, 1, 0]]
print("dK:\n", dK)         # [[3.5, 4.5], [6.5, 7.5]]
print("db:", db)           # 1.0
```
C. Pooling (Max Pooling)
A pooling layer downsamples the output of convolutional layers, reducing computational load and helping prevent overfitting; common types are max pooling (takes the maximum value) and average pooling. It typically follows convolutional layers in CNNs to summarize features, aggregating the values in small regions (e.g., a 2x2 grid) into a single value and making the model more robust to variations such as translations.
Forward Pass:
- Formula: \( Y[i,j] = \max(X[i:i+k,\; j:j+k]) \) for pool size \( k \).
- Example: \( X = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix} \), pool = 2, stride = 2, so \( Y = \begin{bmatrix} 6 & 8 \\ 14 & 16 \end{bmatrix} \).
- Max positions (0-based): 6 at \( X[1,1] \), 8 at \( X[1,3] \), 14 at \( X[3,1] \), 16 at \( X[3,3] \).
Backward Pass:
- Formula: route the upstream gradient \( \displaystyle \frac{\partial L}{\partial Y} \) to the max position in each window; all other positions receive 0.
- Example: \( \displaystyle \frac{\partial L}{\partial Y} = \begin{bmatrix} 0.5 & -0.5 \\ 1 & 0 \end{bmatrix} \), so \( \frac{\partial L}{\partial X} \) gets 0.5 at the position of 6 (1,1), -0.5 at the position of 8 (1,3), 1 at the position of 14 (3,1), 0 at the position of 16 (3,3), and 0 everywhere else.
Implementation:
```python
import numpy as np

def max_pool_forward(X, pool_size=2, stride=2):
    H, W = X.shape
    out_H, out_W = H // stride, W // stride
    Y = np.zeros((out_H, out_W))
    max_idx = np.zeros_like(X, dtype=bool)  # Mask of max positions for backprop
    for i in range(0, H, stride):
        for j in range(0, W, stride):
            window = X[i:i+pool_size, j:j+pool_size]
            max_val = np.max(window)
            Y[i // stride, j // stride] = max_val
            max_idx[i:i+pool_size, j:j+pool_size] = (window == max_val)
    return Y, max_idx

def max_pool_backward(dY, max_idx, pool_size=2, stride=2):
    dX = np.zeros_like(max_idx, dtype=float)
    for i in range(dY.shape[0]):
        for j in range(dY.shape[1]):
            # Route the upstream gradient to the max position of each window
            dX[i*stride:i*stride+pool_size, j*stride:j*stride+pool_size] = (
                dY[i, j] * max_idx[i*stride:i*stride+pool_size, j*stride:j*stride+pool_size]
            )
    return dX

# Example
X = np.arange(1, 17).reshape(4, 4)
Y, max_idx = max_pool_forward(X)
dY = np.array([[0.5, -0.5], [1, 0]])
dX = max_pool_backward(dY, max_idx)
print("Forward Y:\n", Y)
print("dX:\n", dX)
```
D. Recurrent (LSTM)
Recurrent Neural Networks (RNNs) are powerful for sequence data. Long Short-Term Memory (LSTM) networks are a type of RNN designed to capture long-term dependencies and mitigate issues like vanishing gradients.
Parameters:
(Simplified to hidden size = 1 for clarity.)
- Inputs: \( x_t = [0.5] \), \( h_{t-1} = [0.1] \), \( C_{t-1} = [0.2] \)
- Weights: \( W_f = [[0.5, 0.5]] \), \( W_i = [[0.4, 0.4]] \), \( W_C = [[0.3, 0.3]] \), \( W_o = [[0.2, 0.2]] \)
- Biases: \( b_f = b_i = b_C = b_o = [0.0] \)
Forward Pass:
- Formulas:
  - Concatenate: \( \text{concat} = [h_{t-1}, x_t] \)
  - Forget gate: \( f_t = \sigma(W_f \cdot \text{concat} + b_f) \)
  - Input gate: \( i_t = \sigma(W_i \cdot \text{concat} + b_i) \)
  - Cell candidate: \( \tilde{C}_t = \tanh(W_C \cdot \text{concat} + b_C) \)
  - Cell state: \( C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \)
  - Output gate: \( o_t = \sigma(W_o \cdot \text{concat} + b_o) \)
  - Hidden state: \( h_t = o_t \cdot \tanh(C_t) \)
- Example:
  - \( \text{concat} = [0.1, 0.5] \)
  - \( f_t = \sigma(0.3) \approx 0.5744 \)
  - \( i_t = \sigma(0.24) \approx 0.5597 \)
  - \( \tilde{C}_t = \tanh(0.18) \approx 0.1781 \)
  - \( C_t \approx 0.5744 \cdot 0.2 + 0.5597 \cdot 0.1781 \approx 0.2146 \)
  - \( o_t = \sigma(0.12) \approx 0.5300 \)
  - \( h_t \approx 0.5300 \cdot \tanh(0.2146) \approx 0.1120 \)
Backward Pass:
- Formulas (gradients via the chain rule; \( \sigma' \) is evaluated at the gate output, i.e., \( \sigma'(g) = g(1-g) \)):
  - \( dC_t = dh_t \cdot o_t \cdot (1 - \tanh^2(C_t)) + dC_{\text{next}} \) (where \( dC_{\text{next}} \) comes from the future timestep)
  - \( do_t = dh_t \cdot \tanh(C_t) \cdot \sigma'(o_t) \)
  - \( d\tilde{C}_t = dC_t \cdot i_t \cdot (1 - \tilde{C}_t^2) \)
  - \( di_t = dC_t \cdot \tilde{C}_t \cdot \sigma'(i_t) \)
  - \( df_t = dC_t \cdot C_{t-1} \cdot \sigma'(f_t) \)
  - \( dC_{\text{prev}} = dC_t \cdot f_t \)
  - Backpropagate to the concatenated input: \( d\text{concat} = W_o^T \cdot do_t + W_C^T \cdot d\tilde{C}_t + W_i^T \cdot di_t + W_f^T \cdot df_t \)
  - Split \( d\text{concat} \) into \( dh_{\text{prev}} \) and \( dx_t \)
  - Parameter gradients: \( dW_f = df_t \cdot \text{concat}^T \), \( db_f = df_t \), and similarly for the other gates.
- Example (assume upstream gradients \( dh_t = [0.1] \) and \( dC_{\text{next}} = [0.05] \) from the next timestep):
  - \( dC_t \approx 0.1 \cdot 0.5300 \cdot (1 - \tanh^2(0.2146)) + 0.05 \approx 0.0506 + 0.05 = 0.1006 \)
  - The remaining gradients match the executed values in the code below (e.g., \( dx_t \approx [0.0216] \)).
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(y):   # derivative of sigmoid given its output y
    return y * (1 - y)

def dtanh(y):      # derivative of tanh given its output y
    return 1 - y**2

def lstm_forward(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    concat = np.concatenate((h_prev, x_t), axis=0)
    f_t = sigmoid(np.dot(W_f, concat) + b_f)
    i_t = sigmoid(np.dot(W_i, concat) + b_i)
    C_tilde = np.tanh(np.dot(W_C, concat) + b_C)
    C_t = f_t * C_prev + i_t * C_tilde
    o_t = sigmoid(np.dot(W_o, concat) + b_o)
    h_t = o_t * np.tanh(C_t)
    cache = (concat, f_t, i_t, C_tilde, o_t, C_t, C_prev)
    return h_t, C_t, cache

def lstm_backward(dh_next, dC_next, cache, W_f, W_i, W_C, W_o):
    concat, f_t, i_t, C_tilde, o_t, C_t, C_prev = cache
    # Gradients through the output gate and cell state
    dC_t = dh_next * o_t * dtanh(np.tanh(C_t)) + dC_next
    do_t = dh_next * np.tanh(C_t) * dsigmoid(o_t)
    dC_tilde = dC_t * i_t * dtanh(C_tilde)
    di_t = dC_t * C_tilde * dsigmoid(i_t)
    df_t = dC_t * C_prev * dsigmoid(f_t)
    dC_prev = dC_t * f_t
    # Backpropagate through the gates to the concatenated input
    dconcat = (np.dot(W_f.T, df_t) + np.dot(W_i.T, di_t)
               + np.dot(W_C.T, dC_tilde) + np.dot(W_o.T, do_t))
    hidden_size = W_f.shape[0]
    dh_prev = dconcat[:hidden_size]
    dx_t = dconcat[hidden_size:]
    # Parameter gradients
    dW_f, db_f = np.outer(df_t, concat), df_t
    dW_i, db_i = np.outer(di_t, concat), di_t
    dW_C, db_C = np.outer(dC_tilde, concat), dC_tilde
    dW_o, db_o = np.outer(do_t, concat), do_t
    return dx_t, dh_prev, dC_prev, dW_f, db_f, dW_i, db_i, dW_C, db_C, dW_o, db_o

# Numerical example (hidden size = 1)
x_t = np.array([0.5]); h_prev = np.array([0.1]); C_prev = np.array([0.2])
W_f = np.array([[0.5, 0.5]]); W_i = np.array([[0.4, 0.4]])
W_C = np.array([[0.3, 0.3]]); W_o = np.array([[0.2, 0.2]])
b_f = b_i = b_C = b_o = np.array([0.0])

# Forward
h_t, C_t, cache = lstm_forward(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
print("Forward h_t:", h_t)   # [0.11199714]
print("Forward C_t:", C_t)   # [0.2145628]

# Backward: assume dh_next = [0.1], dC_next = [0.05]
dh_next = np.array([0.1]); dC_next = np.array([0.05])
(dx_t, dh_prev, dC_prev, dW_f, db_f, dW_i, db_i,
 dW_C, db_C, dW_o, db_o) = lstm_backward(dh_next, dC_next, cache, W_f, W_i, W_C, W_o)
print("Backward dx_t:", dx_t)        # [0.02164056]
print("Backward dh_prev:", dh_prev)  # [0.02164056]
print("Backward dC_prev:", dC_prev)  # [0.05780591]
print("Backward dW_f:", dW_f)        # [[0.00049199 0.00245997]]
print("Backward db_f:", db_f)        # [0.00491995]
print("Backward dW_i:", dW_i)        # [[0.00044162 0.00220808]]
print("Backward db_i:", db_i)        # [0.00441615]
print("Backward dW_C:", dW_C)        # [[0.00545376 0.02726878]]
print("Backward db_C:", db_C)        # [0.05453756]
print("Backward dW_o:", dW_o)        # [[0.00052643 0.00263213]]
print("Backward db_o:", db_o)        # [0.00526427]
```
Notes
- This is a single-timestep LSTM with hidden size 1 for simplicity. In practice, LSTMs process sequences (multiple timesteps) and have larger hidden sizes; backpropagation through time (BPTT) unrolls the network over timesteps.
- The code uses NumPy; for real models, use PyTorch or TensorFlow for automatic differentiation and batching (a short autograd sketch follows these notes).
- Outputs are approximate due to floating-point precision but match the manual calculations.
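As noted above, frameworks such as PyTorch compute these gradients automatically. The following minimal autograd sketch is a cross-check added for illustration (not part of the original example set); it reuses the numbers from the dense-layer example in section A and should reproduce the gradients computed there by hand.

```python
import torch

# The dense-layer example from section A, differentiated automatically.
x = torch.tensor([2., 3.], requires_grad=True)
W = torch.tensor([[1., 2.], [0., -1.]], requires_grad=True)
b = torch.tensor([1., -1.], requires_grad=True)

y = torch.relu(W @ x + b)              # Forward pass: y = ReLU(Wx + b) = [9, 0]
upstream = torch.tensor([0.5, -0.2])   # Same upstream gradient as the manual example
y.backward(upstream)                   # Backward pass via autograd

print("dx:", x.grad)   # tensor([0.5000, 1.0000])
print("dW:", W.grad)   # tensor([[1.0000, 1.5000], [0.0000, 0.0000]])
print("db:", b.grad)   # tensor([0.5000, 0.0000])
```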
E. Embedding
Forward Pass:
- Formula: \( y = E[i] \), where \( E \) is the embedding matrix and \( i \) is the input index.
- Example: index 1, \( E = \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \end{bmatrix} \), so \( y = [0.4, 0.5, 0.6] \).
Backward Pass:
- Formula: \( \displaystyle \frac{\partial L}{\partial E[i]} \mathrel{+}= \frac{\partial L}{\partial y} \); all other rows receive 0 (a sparse update).
- Example: \( \displaystyle \frac{\partial L}{\partial y} = [0.1, -0.1, 0.2] \), added to row \( E[1] \).
Implementation:
```python
import numpy as np

def embedding_forward(index, E):
    return E[index]          # Look up one row of the embedding matrix

def embedding_backward(dy, index, E_shape):
    dE = np.zeros(E_shape)
    dE[index] = dy           # Sparse update: only the selected row gets a gradient
    return dE

# Example
E = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
index = 1
y = embedding_forward(index, E)
dy = np.array([0.1, -0.1, 0.2])
dE = embedding_backward(dy, index, E.shape)
print("dE:\n", dE)
```
F. Attention (Scaled Dot-Product)
Forward Pass: \( \text{Attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \).
Backward Pass: gradients for \( Q \), \( K \), and \( V \) follow from the chain rule through the softmax and the matrix multiplications.
Numerical Example: the code below uses \( Q = K = V = I_2 \) (the 2x2 identity); the backward pass is worked out through the softmax and matrix products.
Implementation:
```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def attention_forward(Q, K, V):
    d = Q.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d)   # Scaled dot-product scores
    weights = softmax(scores)              # Attention weights
    attn = np.dot(weights, V)
    return attn, (weights, Q, K, V)        # Cache for backprop

def attention_backward(dattn, cache):
    weights, Q, K, V = cache
    d = K.shape[-1]
    dV = np.dot(weights.T, dattn)
    dweights = np.dot(dattn, V.T)
    # Softmax backward (row-wise Jacobian-vector product)
    dscores = weights * (dweights - np.sum(weights * dweights, axis=-1, keepdims=True))
    dQ = np.dot(dscores, K) / np.sqrt(d)
    dK = np.dot(dscores.T, Q) / np.sqrt(d)
    return dQ, dK, dV

# Example
Q = K = V = np.array([[1., 0.], [0., 1.]])
attn, cache = attention_forward(Q, K, V)
dattn = np.array([[0.1, 0.2], [-0.1, 0.3]])
dQ, dK, dV = attention_backward(dattn, cache)
print("dQ:\n", dQ)
```
G. Normalization (Batch Normalization)
Forward Pass: normalize the inputs to zero mean and unit variance over the batch, then scale and shift: \( \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \), \( y = \gamma \hat{x} + \beta \).
Backward Pass: gradients for the input, \( \gamma \), and \( \beta \) via the chain rule through the mean and variance.
Numerical Example: the code below uses \( x = [1, 2, 3, 4] \), \( \gamma = 1 \), \( \beta = 0 \) and computes \( dx \), \( d\gamma \), \( d\beta \).
Implementation:
```python
import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-4):
    mu = np.mean(x)
    var = np.var(x)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    y = gamma * x_hat + beta
    return y, (x, x_hat, mu, var, epsilon)   # Cache everything needed for backprop

def batch_norm_backward(dy, cache, gamma):
    x, x_hat, mu, var, epsilon = cache
    N = dy.shape[0]
    dx_hat = dy * gamma
    # Chain rule through the variance and the mean
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + epsilon) ** (-1.5))
    dmu = np.sum(dx_hat * -1 / np.sqrt(var + epsilon)) + dvar * np.mean(-2 * (x - mu))
    dx = dx_hat / np.sqrt(var + epsilon) + dvar * 2 * (x - mu) / N + dmu / N
    dgamma = np.sum(dy * x_hat)
    dbeta = np.sum(dy)
    return dx, dgamma, dbeta

# Example
x = np.array([1, 2, 3, 4.])
gamma, beta = 1, 0
y, cache = batch_norm_forward(x, gamma, beta)
dy = np.array([0.1, 0.2, -0.1, 0.3])
dx, dgamma, dbeta = batch_norm_backward(dy, cache, gamma)
print("dx:", dx)
print("dgamma:", dgamma)
print("dbeta:", dbeta)
```
H. Dropout
Figure: Dropout neural net model. Left: a standard neural net with 2 hidden layers. Right: an example of a thinned net produced by applying dropout to the network on the left; crossed units have been dropped. Source: Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Forward Pass: during training, multiply the input by a random binary mask and scale by \( 1/(1-p) \) (inverted dropout); at inference, pass the input through unchanged.
Backward Pass: apply the same mask (including the \( 1/(1-p) \) scaling) to the upstream gradient.
Numerical Example: the code below drops each element with probability \( p = 0.5 \); the backward pass routes \( dy \) through the same mask.
Implementation:
```python
import numpy as np

def dropout_forward(x, p, training=True):
    if training:
        # Inverted dropout: scale the kept units by 1/(1-p) at training time
        mask = np.random.binomial(1, 1 - p, size=x.shape) / (1 - p)
        y = x * mask
        return y, mask
    return x, None

def dropout_backward(dy, mask):
    if mask is None:   # Inference mode: gradient passes through unchanged
        return dy
    return dy * mask

# Example
np.random.seed(0)
x = np.array([1, 2, 3, 4.])
p = 0.5
y, mask = dropout_forward(x, p)
dy = np.array([0.1, 0.2, 0.3, 0.4])
dx = dropout_backward(dy, mask)
print("dx:", dx)
```
I. Flatten
Forward Pass: Reshape to 1D.
Backward Pass: Reshape upstream gradient back to original shape.
Numerical Example: forward reshapes the 2x2 input to [1, 2, 3, 4]; backward reshapes dy = [0.1, 0.2, 0.3, 0.4] back to 2x2.
Implementation:
```python
import numpy as np

def flatten_forward(x):
    return x.flatten(), x.shape       # Cache the original shape for backprop

def flatten_backward(dy, orig_shape):
    return dy.reshape(orig_shape)     # Reshape the gradient back

# Example
x = np.array([[1, 2], [3, 4]])
y, shape = flatten_forward(x)
dy = np.array([0.1, 0.2, 0.3, 0.4])
dx = flatten_backward(dy, shape)
print("dx:\n", dx)
```
J. Activation (ReLU)
Forward Pass:
- \( y = \max(0, x) \).
- Example: \( x = [-1, 0, 2, -3] \), so \( y = [0, 0, 2, 0] \).
Backward Pass:
- \( \displaystyle \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot (x > 0) \).
- Example: \( dy = [0.5, -0.5, 1, 0] \), so \( dx = [0, 0, 1, 0] \) (masked).
Implementation:
```python
import numpy as np

def relu_forward(x):
    return np.maximum(0, x), x        # Cache the input for backprop

def relu_backward(dy, x_cache):
    return dy * (x_cache > 0)         # Gradient passes only where the input was positive

# Example
x = np.array([-1, 0, 2, -3])
y, x_cache = relu_forward(x)
dy = np.array([0.5, -0.5, 1, 0])
dx = relu_backward(dy, x_cache)
print("dx:", dx)
```