Assignment 2 Part 1

Preface

This post provides reference solutions for Assignment 2 of the 2024 Stanford cs231n course. The starter code can be downloaded from the course website. Assignment 2 consists of five problems:

  1. Multi-Layer Fully Connected Neural Networks
  2. Batch Normalization
  3. Dropout
  4. Convolutional Neural Networks
  5. PyTorch on CIFAR-10

Note: this post covers the first three problems; the solutions to the last two are given in cs231n assignment2 part2.

Q1: Multi-Layer Fully Connected Neural Networks

In this exercise we will implement a fully connected network with an arbitrary number of hidden layers. Read the FullyConnectedNet class in the file cs231n/classifiers/fc_net.py and implement network initialization, the forward pass, and the backward pass.

Throughout the assignment we implement layers in cs231n/layers.py, and we can reuse the assignment 1 implementations of affine_forward, affine_backward, relu_forward, relu_backward, and softmax_loss. Don't worry about implementing dropout or batch/layer normalization yet; those features will be added later.

Below we first list the code for the affine layer, the ReLU activation, and softmax (analyzed in detail in assignment 1, so not repeated here), and then complete the network initialization, forward pass, and backward pass.

affine layer

The code for affine_forward is as follows:

def affine_forward(x, w, b):
    """Computes the forward pass for an affine (fully connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    ###########################################################################
    # TODO: Copy over your solution from Assignment 1.                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    out = np.dot(np.reshape(x, (x.shape[0], -1)), w) + b

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, w, b)
    return out, cache

The code for affine_backward is as follows:


def affine_backward(dout, cache):
    """Computes the backward pass for an affine (fully connected) layer.

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """
    x, w, b = cache
    dx, dw, db = None, None, None
    ###########################################################################
    # TODO: Copy over your solution from Assignment 1.                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    dx = np.reshape(np.dot(dout, w.T), x.shape)
    dw = np.dot(np.reshape(x, (x.shape[0], -1)).T, dout)
    db = np.sum(dout, axis=0)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dw, db

ReLU activation

The code for relu_forward is as follows:

def relu_forward(x):
    """Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    out = None
    ###########################################################################
    # TODO: Copy over your solution from Assignment 1.                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    out = np.maximum(0, x)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = x
    return out, cache

The code for relu_backward is as follows:

def relu_backward(dout, cache):
    """Computes the backward pass for a layer of rectified linear units (ReLUs).

    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout

    Returns:
    - dx: Gradient with respect to x
    """
    dx, x = None, cache
    ###########################################################################
    # TODO: Copy over your solution from Assignment 1.                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    dx = dout * (x > 0)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx

softmax

The code for softmax_loss is as follows:

def softmax_loss(x, y):
    """Computes the loss and gradient for softmax classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    loss, dx = None, None

    ###########################################################################
    # TODO: Copy over your solution from Assignment 1.                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, C = x.shape
    x = x - np.max(x, axis=1, keepdims=True)
    p = np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
    loss = np.sum(-np.log(p[range(N), y])) / N

    p[range(N), y] -= 1
    dx = p / N

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return loss, dx

network initialization

Network initialization mainly consists of initializing the parameters. The architecture of FullyConnectedNet is:

{affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

The affine layers have weight parameters W and bias parameters b, and the batch/layer normalization layers have parameters gamma and beta. Parameters are named by parameter name plus layer index; for example, the weights of the first layer are named W1. As required, weights are initialized with np.random.randn() * weight_scale and biases with np.zeros(). When batch normalization or layer normalization is used, gamma and beta are initialized as well: gamma to ones and beta to zeros.

def __init__(
    self,
    hidden_dims,
    input_dim=3 * 32 * 32,
    num_classes=10,
    dropout_keep_ratio=1,
    normalization=None,
    reg=0.0,
    weight_scale=1e-2,
    dtype=np.float32,
    seed=None,
):
    """Initialize a new FullyConnectedNet.

    Inputs:
    - hidden_dims: A list of integers giving the size of each hidden layer.
    - input_dim: An integer giving the size of the input.
    - num_classes: An integer giving the number of classes to classify.
    - dropout_keep_ratio: Scalar between 0 and 1 giving dropout strength.
        If dropout_keep_ratio=1 then the network should not use dropout at all.
    - normalization: What type of normalization the network should use. Valid values
        are "batchnorm", "layernorm", or None for no normalization (the default).
    - reg: Scalar giving L2 regularization strength.
    - weight_scale: Scalar giving the standard deviation for random
        initialization of the weights.
    - dtype: A numpy datatype object; all computations will be performed using
        this datatype. float32 is faster but less accurate, so you should use
        float64 for numeric gradient checking.
    - seed: If not None, then pass this random seed to the dropout layers.
        This will make the dropout layers deteriminstic so we can gradient check the model.
    """
    self.normalization = normalization
    self.use_dropout = dropout_keep_ratio != 1
    self.reg = reg
    self.num_layers = 1 + len(hidden_dims)
    self.dtype = dtype
    self.params = {}

    ############################################################################
    # TODO: Initialize the parameters of the network, storing all values in    #
    # the self.params dictionary. Store weights and biases for the first layer #
    # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
    # initialized from a normal distribution centered at 0 with standard       #
    # deviation equal to weight_scale. Biases should be initialized to zero.   #
    #                                                                          #
    # When using batch normalization, store scale and shift parameters for the #
    # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
    # beta2, etc. Scale parameters should be initialized to ones and shift     #
    # parameters should be initialized to zeros.                               #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Dimensions of every layer
    layer_dims = [input_dim, *hidden_dims, num_classes]
    # Initialize weights and biases (and gamma/beta when normalization is used)
    for i in range(self.num_layers):
        self.params['W' + str(i + 1)] = np.random.randn(layer_dims[i], layer_dims[i + 1]) * weight_scale
        self.params['b' + str(i + 1)] = np.zeros(layer_dims[i + 1])
        if self.normalization in ['batchnorm', 'layernorm'] and i < self.num_layers - 1:
            self.params['gamma' + str(i + 1)] = np.ones(layer_dims[i + 1])
            self.params['beta' + str(i + 1)] = np.zeros(layer_dims[i + 1])

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # When using dropout we need to pass a dropout_param dictionary to each
    # dropout layer so that the layer knows the dropout probability and the mode
    # (train / test). You can pass the same dropout_param to each dropout layer.
    self.dropout_param = {}
    if self.use_dropout:
        self.dropout_param = {"mode": "train", "p": dropout_keep_ratio}
        if seed is not None:
            self.dropout_param["seed"] = seed

    # With batch normalization we need to keep track of running means and
    # variances, so we need to pass a special bn_param object to each batch
    # normalization layer. You should pass self.bn_params[0] to the forward pass
    # of the first batch normalization layer, self.bn_params[1] to the forward
    # pass of the second batch normalization layer, etc.
    self.bn_params = []
    if self.normalization == "batchnorm":
        self.bn_params = [{"mode": "train"} for i in range(self.num_layers - 1)]
    if self.normalization == "layernorm":
        self.bn_params = [{} for i in range(self.num_layers - 1)]

    # Cast all parameters to the correct datatype.
    for k, v in self.params.items():
        self.params[k] = v.astype(dtype)

loss

Next we complete the forward and backward passes. Note that batch normalization / layer normalization and dropout are not implemented yet, so these layers do not need to be handled here.

Forward pass: starting from the first layer and proceeding to the last, repeat the following steps:

  1. Fetch the layer's parameters W and b.
  2. Call affine_relu_forward (equivalent to the forward pass of affine layer - ReLU activation; a sketch of this helper is given after the notes below) to obtain the output and the cache, and store the cache in a dictionary for use in the backward pass.

[Note] The last layer has no ReLU nonlinearity, so only affine_forward is needed there.

The forward pass produces the score matrix scores.

Backward pass: first call softmax_loss to obtain loss and dscores, then add the L2 regularization loss. Starting from the last layer and proceeding back to the first, repeat the following steps:

  1. Fetch the layer's cache object.
  2. Call affine_relu_backward (equivalent to the backward pass of affine layer - ReLU activation) to obtain the parameter gradients, add the gradient of the regularization term, and store them in the grads dictionary under the same keys as the parameters.

[Note] The last layer has no ReLU nonlinearity, so only affine_backward is needed in the backward pass.
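
For reference, the affine_relu_forward and affine_relu_backward helpers used above live in cs231n/layer_utils.py; a minimal sketch, assuming the affine and ReLU layer functions listed earlier, looks like this:

def affine_relu_forward(x, w, b):
    """Convenience layer: an affine transform followed by a ReLU."""
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache


def affine_relu_backward(dout, cache):
    """Backward pass for the affine-relu convenience layer."""
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db

With these helpers in place, the loss method of FullyConnectedNet is implemented as follows.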

def loss(self, X, y=None):
    """Compute loss and gradient for the fully connected net.

    Inputs:
    - X: Array of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

    Returns:
    If y is None, then run a test-time forward pass of the model and return:
    - scores: Array of shape (N, C) giving classification scores, where
        scores[i, c] is the classification score for X[i] and class c.

    If y is not None, then run a training-time forward and backward pass and
    return a tuple of:
    - loss: Scalar value giving the loss
    - grads: Dictionary with the same keys as self.params, mapping parameter
        names to gradients of the loss with respect to those parameters.
    """
    X = X.astype(self.dtype)
    mode = "test" if y is None else "train"

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.use_dropout:
        self.dropout_param["mode"] = mode
    if self.normalization == "batchnorm":
        for bn_param in self.bn_params:
            bn_param["mode"] = mode
    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the fully connected net, computing  #
    # the class scores for X and storing them in the scores variable.          #
    #                                                                          #
    # When using dropout, you'll need to pass self.dropout_param to each       #
    # dropout forward pass.                                                    #
    #                                                                          #
    # When using batch normalization, you'll need to pass self.bn_params[0] to #
    # the forward pass for the first batch normalization layer, pass           #
    # self.bn_params[1] to the forward pass for the second batch normalization #
    # layer, etc.                                                              #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Network architecture:
    # {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    a = X
    # Save each layer's cache object; the backward pass needs it
    caches = {}
    for i in range(self.num_layers - 1):
        W, b = self.params['W' + str(i + 1)], self.params['b' + str(i + 1)]
        a, caches['layer' + str(i + 1)] = affine_relu_forward(a, W, b)

    # 最后一层的操作
    W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
    scores, caches['layer' + str(self.num_layers)] = affine_forward(a, W, b)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If test mode return early.
    if mode == "test":
        return scores

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the backward pass for the fully connected net. Store the #
    # loss in the loss variable and gradients in the grads dictionary. Compute #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    #                                                                          #
    # When using batch/layer normalization, you don't need to regularize the   #
    # scale and shift parameters.                                              #
    #                                                                          #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    loss, dscores = softmax_loss(scores, y)

    # Add the L2 regularization term to the loss
    for i in range(self.num_layers):
        W = self.params['W' + str(i + 1)]
        loss += 0.5 * self.reg * np.sum(np.square(W))

    # Backward pass
    W = self.params['W' + str(self.num_layers)]
    # Last layer: affine only
    fc_cache = caches['layer' + str(self.num_layers)]
    da, dW, db = affine_backward(dscores, fc_cache)
    grads['W' + str(self.num_layers)] = dW + self.reg * W
    grads['b' + str(self.num_layers)] = db
    # Remaining layers, from back to front
    for i in range(self.num_layers - 1, 0, -1):
        W = self.params['W' + str(i)]
        cache = caches['layer' + str(i)]
        da, dW, db = affine_relu_backward(da, cache)

        grads['W' + str(i)] = dW + self.reg * W
        grads['b' + str(i)] = db
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads

Run the gradient checks:
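
A minimal sketch of the check, assuming the eval_numerical_gradient helper from cs231n.gradient_check (the shapes and hyperparameters here are illustrative):

import numpy as np
from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.gradient_check import eval_numerical_gradient

def rel_error(x, y):
    # Relative error, as defined in the assignment notebooks
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

for reg in [0, 3.14]:
    print("Running check with reg = ", reg)
    model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                              reg=reg, weight_scale=5e-2, dtype=np.float64)
    loss, grads = model.loss(X, y)
    print("Initial loss: ", loss)
    for name in sorted(grads):
        # Numerically estimate the gradient of the loss w.r.t. each parameter
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
        print("%s relative error: %.2e" % (name, rel_error(grad_num, grads[name])))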

The results are as follows:

Running check with reg =  0
Initial loss:  2.300479089768492
W1 relative error: 1.0252674471656573e-07
W2 relative error: 2.2120479295080622e-05
W3 relative error: 4.5623278736665505e-07
b1 relative error: 4.6600944653202505e-09
b2 relative error: 2.085654276112763e-09
b3 relative error: 1.689724888469736e-10
Running check with reg =  3.14
Initial loss:  7.052114776533016
W1 relative error: 3.904541941902138e-09
W2 relative error: 6.86942277940646e-08
W3 relative error: 3.483989247437803e-08
b1 relative error: 1.4752427965311745e-08
b2 relative error: 1.4615869332918208e-09
b3 relative error: 1.3200479211447775e-10

overfit test

Next we overfit a small dataset of 50 training images with a three-layer fully connected network by tuning the hyperparameters learning_rate and weight_scale. Overfitting occurs with learning_rate = 4e-3 and weight_scale = 3e-2; a sketch of the training setup is given below.
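
A minimal sketch of the setup, assuming the Solver class from cs231n.solver and a small_data dict holding the 50-example training subset plus the validation split (the hidden sizes are illustrative):

import numpy as np
from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.solver import Solver

# Hyperparameters found by hand-tuning
weight_scale = 3e-2
learning_rate = 4e-3

model = FullyConnectedNet([100, 100], weight_scale=weight_scale, dtype=np.float64)
solver = Solver(
    model,
    small_data,                  # dict with X_train, y_train, X_val, y_val
    print_every=10,
    num_epochs=20,
    batch_size=25,               # 50 examples / 25 per batch = 2 iterations per epoch
    update_rule="sgd",
    optim_config={"learning_rate": learning_rate},
)
solver.train()

With batch_size=25 and num_epochs=20 this produces the 40 iterations seen in the log below.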

The test output is shown below: after roughly 20 iterations the training accuracy reaches 100% while the validation accuracy stays very low, i.e. the network has overfit.

(Iteration 1 / 40) loss: 2.727998
(Epoch 0 / 20) train acc: 0.380000; val_acc: 0.122000
(Epoch 1 / 20) train acc: 0.480000; val_acc: 0.101000
(Epoch 2 / 20) train acc: 0.600000; val_acc: 0.148000
(Epoch 3 / 20) train acc: 0.780000; val_acc: 0.149000
(Epoch 4 / 20) train acc: 0.900000; val_acc: 0.156000
(Epoch 5 / 20) train acc: 0.880000; val_acc: 0.160000
(Iteration 11 / 40) loss: 0.451161
(Epoch 6 / 20) train acc: 0.960000; val_acc: 0.156000
(Epoch 7 / 20) train acc: 0.940000; val_acc: 0.173000
(Epoch 8 / 20) train acc: 0.960000; val_acc: 0.158000
(Epoch 9 / 20) train acc: 0.980000; val_acc: 0.170000
(Epoch 10 / 20) train acc: 1.000000; val_acc: 0.153000
(Iteration 21 / 40) loss: 0.134665
(Epoch 11 / 20) train acc: 1.000000; val_acc: 0.161000
(Epoch 12 / 20) train acc: 1.000000; val_acc: 0.152000
(Epoch 13 / 20) train acc: 1.000000; val_acc: 0.161000
(Epoch 14 / 20) train acc: 1.000000; val_acc: 0.166000
(Epoch 15 / 20) train acc: 1.000000; val_acc: 0.171000
(Iteration 31 / 40) loss: 0.047912
(Epoch 16 / 20) train acc: 1.000000; val_acc: 0.171000
(Epoch 17 / 20) train acc: 1.000000; val_acc: 0.172000
(Epoch 18 / 20) train acc: 1.000000; val_acc: 0.171000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.170000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.168000

Next we overfit the same 50 training images with a five-layer fully connected network, again tuning the hyperparameters learning_rate and weight_scale. Overfitting occurs with learning_rate = 1e-2 and weight_scale = 4e-2.

The test output is shown below: near the end of training the training accuracy reaches 100% while the validation accuracy stays very low, i.e. the network has overfit.

(Iteration 1 / 40) loss: 2.318479
(Epoch 0 / 20) train acc: 0.240000; val_acc: 0.077000
(Epoch 1 / 20) train acc: 0.220000; val_acc: 0.079000
(Epoch 2 / 20) train acc: 0.260000; val_acc: 0.082000
(Epoch 3 / 20) train acc: 0.420000; val_acc: 0.134000
(Epoch 4 / 20) train acc: 0.480000; val_acc: 0.137000
(Epoch 5 / 20) train acc: 0.500000; val_acc: 0.124000
(Iteration 11 / 40) loss: 1.320911
(Epoch 6 / 20) train acc: 0.620000; val_acc: 0.131000
(Epoch 7 / 20) train acc: 0.280000; val_acc: 0.103000
(Epoch 8 / 20) train acc: 0.380000; val_acc: 0.150000
(Epoch 9 / 20) train acc: 0.500000; val_acc: 0.149000
(Epoch 10 / 20) train acc: 0.760000; val_acc: 0.161000
(Iteration 21 / 40) loss: 1.042859
(Epoch 11 / 20) train acc: 0.840000; val_acc: 0.174000
(Epoch 12 / 20) train acc: 0.800000; val_acc: 0.172000
(Epoch 13 / 20) train acc: 0.920000; val_acc: 0.168000
(Epoch 14 / 20) train acc: 0.900000; val_acc: 0.174000
(Epoch 15 / 20) train acc: 0.960000; val_acc: 0.183000
(Iteration 31 / 40) loss: 0.186756
(Epoch 16 / 20) train acc: 1.000000; val_acc: 0.187000
(Epoch 17 / 20) train acc: 0.940000; val_acc: 0.178000
(Epoch 18 / 20) train acc: 1.000000; val_acc: 0.183000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.171000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.181000

Inline Question 1

Did you notice anything about the comparative difficulty of training the three-layer network versus the five-layer network? In particular, based on your experience, which network seemed more sensitive to the initialization scale (weight scale), and why do you think that is the case?

[Answer] The five-layer network is harder to train and overfits more easily than the three-layer network. During training, the five-layer network is noticeably more sensitive to the weight scale: with more layers, the scale of the activations and gradients is multiplied through more layers, so an initialization scale that is slightly too small or too large makes the signal vanish or explode more severely in the deeper network.

Update rules

So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD.

SGD+Momentum

Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla SGD. See the Momentum update section of the cs231n course notes for more information. The core of the momentum update is:

# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position

Here mu is a hyperparameter, the coefficient of friction, with typical values 0.5, 0.90, 0.95, and 0.99. With the core update above, the code for sgd_momentum is as follows:

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("momentum", 0.9)
    v = config.get("velocity", np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the momentum update formula. Store the updated value in #
    # the next_w variable. You should also use and update the velocity v.     #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    # v = config['momentum'] * v + dw
    # next_w = w - config['learning_rate'] * v

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    config["velocity"] = v

    return next_w, config

Once this is done, run the following to train a six-layer network with both SGD and SGD+momentum; the SGD+momentum update rule converges faster. A sketch of the comparison cell is given below.
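
A minimal sketch of that comparison, assuming FullyConnectedNet, Solver, and a data dict loaded by the notebook's CIFAR-10 cell (the hyperparameters are illustrative):

import numpy as np
from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.solver import Solver

num_train = 4000
small_data = {                      # 'data' is the CIFAR-10 dict loaded earlier in the notebook
    "X_train": data["X_train"][:num_train],
    "y_train": data["y_train"][:num_train],
    "X_val": data["X_val"],
    "y_val": data["y_val"],
}

solvers = {}
for update_rule in ["sgd", "sgd_momentum"]:
    model = FullyConnectedNet([100] * 5, weight_scale=5e-2)
    solvers[update_rule] = Solver(
        model,
        small_data,
        num_epochs=5,
        batch_size=100,
        update_rule=update_rule,                  # name of an update function in cs231n/optim.py
        optim_config={"learning_rate": 5e-3},
        verbose=False,
    )
    solvers[update_rule].train()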

The training curves show clearly that SGD+Momentum converges faster.

RMSProp

The core of the RMSProp update rule is:

cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Here decay_rate is a hyperparameter with typical values 0.9, 0.99, and 0.999. The code for rmsprop is as follows:

def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("decay_rate", 0.99)
    config.setdefault("epsilon", 1e-8)
    config.setdefault("cache", np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of w #
    # in the next_w variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    cache = config['cache']
    decay_rate = config['decay_rate']
    learning_rate = config['learning_rate']
    epsilon = config['epsilon']

    cache = decay_rate * cache + (1 - decay_rate) * dw**2
    next_w = w - learning_rate * dw / (np.sqrt(cache) + epsilon)

    config['cache'] = cache
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config

The test results are as follows:

Adam

The core of the Adam update is:

# t is your iteration counter going from 1 to infinity
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)

The code for adam is as follows:

def adam(w, dw, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-3)
    config.setdefault("beta1", 0.9)
    config.setdefault("beta2", 0.999)
    config.setdefault("epsilon", 1e-8)
    config.setdefault("m", np.zeros_like(w))
    config.setdefault("v", np.zeros_like(w))
    config.setdefault("t", 0)

    next_w = None
    ###########################################################################
    # TODO: Implement the Adam update formula, storing the next value of w in #
    # the next_w variable. Don't forget to update the m, v, and t variables   #
    # stored in config.                                                       #
    #                                                                         #
    # NOTE: In order to match the reference output, please modify t _before_  #
    # using it in any calculations.                                           #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    m = config['m']
    v = config['v']
    t = config['t'] + 1  # increment the iteration counter before using it
    learning_rate = config['learning_rate']
    beta1 = config['beta1']
    beta2 = config['beta2']
    epsilon = config['epsilon']

    m = beta1 * m + (1 - beta1) * dw
    mt = m / (1 - beta1**t)
    v = beta2 * v + (1 - beta2) * dw**2
    vt = v / (1 - beta2**t)
    next_w = w - learning_rate * mt / (np.sqrt(vt) + epsilon)

    config['m'] = m
    config['v'] = v
    config['t'] = t
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config

After debugging your RMSProp and Adam implementations, run the following to train a pair of deep networks with these new update rules:

The test results are as follows:

Inline Question 2

AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:

cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + eps)

John notices that when he was training a network with AdaGrad the updates became very small, and his network was learning slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?

[Answer] AdaGrad's cache accumulates the squared gradient at every step and never decays, so np.sqrt(cache) grows monotonically and the effective learning rate decays toward zero as the number of iterations grows; the updates therefore become very small. Adam does not have this problem, because it uses exponentially decaying moving averages of the gradient and squared gradient (with bias correction), so the denominator does not grow without bound. A small numerical illustration follows.
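
As a toy illustration (not part of the assignment), applying the AdaGrad rule above and the Adam rule to a constant gradient shows the AdaGrad step shrinking like 1/sqrt(t) while the Adam step stays roughly constant:

import numpy as np

learning_rate, eps = 1e-2, 1e-8
beta1, beta2 = 0.9, 0.999
dw = 1.0                              # constant gradient, purely for illustration

w_ada, cache = 0.0, 0.0               # AdaGrad state
w_adam, m, v, t = 0.0, 0.0, 0.0, 0    # Adam state

for step in range(1, 1001):
    # AdaGrad: cache grows without bound, so the step shrinks like 1/sqrt(step)
    cache += dw**2
    ada_step = learning_rate * dw / (np.sqrt(cache) + eps)
    w_ada -= ada_step

    # Adam: decaying averages keep the effective step roughly constant
    t += 1
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw**2
    mt, vt = m / (1 - beta1**t), v / (1 - beta2**t)
    adam_step = learning_rate * mt / (np.sqrt(vt) + eps)
    w_adam -= adam_step

    if step in (1, 10, 100, 1000):
        print("step %4d   adagrad step: %.2e   adam step: %.2e" % (step, ada_step, adam_step))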

Q2: Batch Normalization

The Batch Normalization (BN) section of the cs231n course notes gives a brief introduction to BN. Batch Normalization is usually inserted after a fully connected or convolutional layer and before the nonlinearity. It can be understood as preprocessing at every layer of the network, and it speeds up neural network training.

batchnorm_forward

The Batch Normalization paper gives the batch normalization transform, i.e. the forward-pass algorithm.

For every feature dimension there is a pair of learnable parameters $\gamma$ (scale) and $\beta$ (shift). The forward pass of Batch Normalization is therefore implemented as follows:

  1. Compute the mean of every feature dimension with np.mean(x, axis=0).
  2. Compute the variance of every feature dimension with np.var(x, axis=0).
  3. Normalize with (x - x_mean) / np.sqrt(x_var + eps).
  4. Scale and shift with gamma * x_norm + beta to compute the output.

[Note] The steps above are the training-time forward pass of Batch Normalization. During training we also keep track of running_mean and running_var; at test time the forward pass normalizes directly with running_mean and running_var instead of computing the per-feature mean and variance of the incoming batch.

The code for batchnorm_forward is as follows:

def batchnorm_forward(x, gamma, beta, bn_param):
    """Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift paremeter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        #######################################################################
        # TODO: Implement the training-time forward pass for batch norm.      #
        # Use minibatch statistics to compute the mean and variance, use      #
        # these statistics to normalize the incoming data, and scale and      #
        # shift the normalized data using gamma and beta.                     #
        #                                                                     #
        # You should store the output in the variable out. Any intermediates  #
        # that you need for the backward pass should be stored in the cache   #
        # variable.                                                           #
        #                                                                     #
        # You should also use your computed sample mean and variance together #
        # with the momentum variable to update the running mean and running   #
        # variance, storing your result in the running_mean and running_var   #
        # variables.                                                          #
        #                                                                     #
        # Note that though you should be keeping track of the running         #
        # variance, you should normalize the data based on the standard       #
        # deviation (square root of variance) instead!                        #
        # Referencing the original paper (https://arxiv.org/abs/1502.03167)   #
        # might prove to be helpful.                                          #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Per-feature mean and variance of the minibatch
        x_mean = np.mean(x, axis=0)
        x_var = np.var(x, axis=0)
        # Normalize
        x_norm = (x - x_mean) / np.sqrt(x_var + eps)
        # Scale and shift to compute the output
        out = gamma * x_norm + beta
        cache = (x, x_mean, x_var, x_norm, gamma, beta, eps)

        running_mean = momentum * running_mean + (1 - momentum) * x_mean
        running_var = momentum * running_var + (1 - momentum) * x_var

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == "test":
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x_norm = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_norm + beta

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var

    return out, cache

Run the training-time sanity check:
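
A minimal sketch of this check, which simulates the inputs to a batch-norm layer with a small two-layer forward pass and compares per-feature means and standard deviations before and after normalization (the shapes and the gamma/beta values are chosen to match the printed output below):

import numpy as np
from cs231n.layers import batchnorm_forward

np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)   # features going into the BN layer

print("Before batch normalization:")
print("  means:", a.mean(axis=0))
print("  stds: ", a.std(axis=0))

# With gamma=1, beta=0 the output should have mean ~0 and std ~1 per feature
gamma, beta = np.ones(D3), np.zeros(D3)
a_norm, _ = batchnorm_forward(a, gamma, beta, {"mode": "train"})
print("After batch normalization (gamma=1, beta=0)")
print("  means:", a_norm.mean(axis=0))
print("  stds: ", a_norm.std(axis=0))

# With nontrivial gamma/beta the output should have std ~gamma and mean ~beta
gamma, beta = np.asarray([1.0, 2.0, 3.0]), np.asarray([11.0, 12.0, 13.0])
a_norm, _ = batchnorm_forward(a, gamma, beta, {"mode": "train"})
print("After batch normalization (gamma=", gamma, ", beta=", beta, ")")
print("  means:", a_norm.mean(axis=0))
print("  stds: ", a_norm.std(axis=0))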

The results are as follows:

Before batch normalization:
  means: [ -2.3814598  -13.18038246   1.91780462]
  stds:  [27.18502186 34.21455511 37.68611762]

After batch normalization (gamma=1, beta=0)
  means: [5.32907052e-17 7.04991621e-17 1.85962357e-17]
  stds:  [0.99999999 1.         1.        ]

After batch normalization (gamma= [1. 2. 3.] , beta= [11. 12. 13.] )
  means: [11. 12. 13.]
  stds:  [0.99999999 1.99999999 2.99999999]

Run the test-time sanity check:

The results are as follows:

After batch normalization (test-time):
  means: [-0.03927354 -0.04349152 -0.10452688]
  stds:  [1.01531428 1.01238373 0.97819988]

batchnorm_backward

The Batch Normalization paper gives the gradient formulas for the backward pass.

[Note] A detailed derivation is given in the Gradient derivation section after the batchnorm_backward code.

Based on the forward-pass algorithm of Batch Normalization, the assignment provides the corresponding computation graph.

[Figure: computation graph for batch normalization]

Using the computation graph and the chain rule, we can run backpropagation and compute $\frac{\partial L}{\partial X}$. The computation proceeds as follows:

  1. First compute the gradients of the loss L with respect to the Batch Normalization parameters $\gamma$ and $\beta$:
    • compute dgamma with np.sum(dout * x_norm, axis=0)
    • compute dbeta with np.sum(dout, axis=0)
  2. Compute the gradient of the loss L with respect to $\sigma_j^2$ with np.sum(-0.5 * dx_norm * x_norm / (x_var + eps), axis=0).
  3. Compute the gradient of the loss L with respect to $\mu_j$ with np.sum(-dx_norm / np.sqrt(x_var + eps), axis=0).
  4. Compute the gradient of the loss L with respect to x with dx_norm / np.sqrt(x_var + eps) + 2 / N * dx_var * (x - x_mean) + dx_mean / N.

The code for batchnorm_backward is as follows:

def batchnorm_backward(dout, cache):
    """Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    # Referencing the original paper (https://arxiv.org/abs/1502.03167)       #
    # might prove to be helpful.                                              #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, x_mean, x_var, x_norm, gamma, beta, eps = cache
    N, D = x.shape

    dx_norm = gamma * dout
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    dx_var = np.sum(-0.5 * dx_norm * x_norm / (x_var + eps), axis=0)
    dx_mean = np.sum(-dx_norm / np.sqrt(x_var + eps), axis=0)
    dx = dx_norm / np.sqrt(x_var + eps) + 2 / N * dx_var * (x - x_mean) + dx_mean / N
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta

Gradient derivation

First, define some notation:

$$
X=\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1D}\\ x_{21} & x_{22} & \cdots & x_{2D}\\ \vdots & \vdots & & \vdots\\ x_{N1} & x_{N2} & \cdots & x_{ND} \end{pmatrix},\quad
\hat{X}=\begin{pmatrix} \hat{x}_{11} & \hat{x}_{12} & \cdots & \hat{x}_{1D}\\ \hat{x}_{21} & \hat{x}_{22} & \cdots & \hat{x}_{2D}\\ \vdots & \vdots & & \vdots\\ \hat{x}_{N1} & \hat{x}_{N2} & \cdots & \hat{x}_{ND} \end{pmatrix},\quad
Y=\begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1D}\\ y_{21} & y_{22} & \cdots & y_{2D}\\ \vdots & \vdots & & \vdots\\ y_{N1} & y_{N2} & \cdots & y_{ND} \end{pmatrix}
$$

Here $X$ is the input to BN, $\hat{X}$ is the normalized result, and $Y$ is the final output. The parameters $\gamma$ and $\beta$ are defined as:

$$
\gamma=\begin{pmatrix}\gamma_1 & \gamma_2 & \cdots & \gamma_D\end{pmatrix},\quad
\beta=\begin{pmatrix}\beta_1 & \beta_2 & \cdots & \beta_D\end{pmatrix}
$$

With this notation, the mean $\mu_j$ and variance $\sigma^2_j$ of the j-th feature dimension are computed as (m denotes the minibatch size N):

$$
\mu_j=\frac{1}{m}\sum_{i=1}^m x_{ij}\,,\qquad \sigma^2_j=\frac{1}{m}\sum_{i=1}^m(x_{ij}-\mu_j)^2
$$

The normalization step is:

$$
\hat{x}_{ij}=\frac{x_{ij}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}
$$

The scale-and-shift step is:

$$
y_{ij}=\gamma_j\hat{x}_{ij}+\beta_j
$$
Now we derive the gradients.

Starting from the scale-and-shift formula, we get:

$$
\frac{\partial y_{ij}}{\partial \gamma_j}=\hat{x}_{ij},\qquad
\frac{\partial y_{ij}}{\partial \beta_j}=1,\qquad
\frac{\partial y_{ij}}{\partial \hat{x}_{ij}}=\gamma_j
$$

From the normalization formula:

$$
\begin{aligned}
\frac{\partial \hat{x}_{ij}}{\partial \sigma_j^2}
&=-\frac{1}{2}(x_{ij}-\mu_j)(\sigma_j^2+\epsilon)^{-\frac{3}{2}}
=-\frac{1}{2}\cdot\frac{x_{ij}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}\cdot\frac{1}{\sigma_j^2+\epsilon}
=-\frac{\hat{x}_{ij}}{2(\sigma_j^2+\epsilon)}\\
\frac{\partial \hat{x}_{ij}}{\partial \mu_j}&=-\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\\
\frac{\partial \hat{x}_{ij}}{\partial x_{ij}}&=\frac{1}{\sqrt{\sigma_j^2+\epsilon}}
\end{aligned}
$$

From the variance formula:

$$
\frac{\partial \sigma^2_j}{\partial x_{ij}}=\frac{2}{m}(x_{ij}-\mu_j),\qquad
\frac{\partial \sigma^2_j}{\partial \mu_j}=-\frac{2}{m}\sum_{i=1}^m(x_{ij}-\mu_j)
$$

From the mean formula:

$$
\frac{\partial\mu_j}{\partial x_{ij}}=\frac{1}{m}
$$
Applying the chain rule:

  1. Gradients of the loss L with respect to $\gamma_j$, $\beta_j$, and $\hat{x}_{ij}$ (writing $dy_{ij}=\frac{\partial L}{\partial y_{ij}}$):
    $$
    \frac{\partial L}{\partial \gamma_j}=\sum_{i=1}^m\frac{\partial L}{\partial y_{ij}}\frac{\partial y_{ij}}{\partial \gamma_j}=\sum_{i=1}^m \hat{x}_{ij}\,dy_{ij},\qquad
    \frac{\partial L}{\partial \beta_j}=\sum_{i=1}^m\frac{\partial L}{\partial y_{ij}}\frac{\partial y_{ij}}{\partial \beta_j}=\sum_{i=1}^m dy_{ij},\qquad
    \frac{\partial L}{\partial \hat{x}_{ij}}=\frac{\partial L}{\partial y_{ij}}\frac{\partial y_{ij}}{\partial \hat{x}_{ij}}=\gamma_j\,dy_{ij}
    $$

  2. Gradient of the loss L with respect to $\sigma_j^2$:
    $$
    \frac{\partial L}{\partial \sigma_j^2}
    =\sum_{i=1}^m\frac{\partial L}{\partial \hat{x}_{ij}}\frac{\partial \hat{x}_{ij}}{\partial \sigma_j^2}
    =\sum_{i=1}^m-\frac{1}{2}(x_{ij}-\mu_j)(\sigma_j^2+\epsilon)^{-\frac{3}{2}}\frac{\partial L}{\partial \hat{x}_{ij}}
    =-\frac{1}{2(\sigma_j^2+\epsilon)}\sum_{i=1}^m\hat{x}_{ij}\frac{\partial L}{\partial \hat{x}_{ij}}
    $$

  3. Gradient of the loss L with respect to $\mu_j$:
    $$
    \frac{\partial L}{\partial \mu_j}
    =\sum_{i=1}^m\frac{\partial L}{\partial \hat{x}_{ij}}\frac{\partial \hat{x}_{ij}}{\partial \mu_j}+\frac{\partial L}{\partial \sigma^2_j}\frac{\partial \sigma^2_j}{\partial \mu_j}
    =\sum_{i=1}^m-\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\frac{\partial L}{\partial \hat{x}_{ij}}-\frac{2}{m}\frac{\partial L}{\partial \sigma_j^2}\sum_{i=1}^m(x_{ij}-\mu_j)
    =\sum_{i=1}^m-\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\frac{\partial L}{\partial \hat{x}_{ij}}
    $$
    where the key simplification uses
    $$
    \sum_{i=1}^m(x_{ij}-\mu_j)=0
    $$

  4. Gradient of the loss L with respect to $X$:
    $$
    \frac{\partial L}{\partial x_{ij}}
    =\frac{\partial L}{\partial \hat{x}_{ij}}\frac{\partial \hat{x}_{ij}}{\partial x_{ij}}+\frac{\partial L}{\partial \mu_j}\frac{\partial\mu_j}{\partial x_{ij}}+\frac{\partial L}{\partial \sigma^2_j}\frac{\partial \sigma_j^2}{\partial x_{ij}}
    =\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\frac{\partial L}{\partial \hat{x}_{ij}}+\frac{1}{m}\frac{\partial L}{\partial \mu_j}+\frac{2}{m}(x_{ij}-\mu_j)\frac{\partial L}{\partial \sigma^2_j}
    $$

batchnorm_backward_alt

Since the gradient of the loss L with respect to $\sigma_j^2$ is

$$
\frac{\partial L}{\partial \sigma_j^2}=-\frac{1}{2(\sigma_j^2+\epsilon)}\sum_{i=1}^m\hat{x}_{ij}\frac{\partial L}{\partial \hat{x}_{ij}}
$$

and the gradient of the loss L with respect to $\mu_j$ is

$$
\frac{\partial L}{\partial \mu_j}=\sum_{i=1}^m-\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\frac{\partial L}{\partial \hat{x}_{ij}}
$$

substituting both into $\frac{\partial L}{\partial x_{ij}}$ and simplifying gives:

$$
\begin{aligned}
\frac{\partial L}{\partial x_{ij}}
&=\frac{\partial L}{\partial \hat{x}_{ij}}\frac{\partial \hat{x}_{ij}}{\partial x_{ij}}+\frac{\partial L}{\partial \mu_j}\frac{\partial\mu_j}{\partial x_{ij}}+\frac{\partial L}{\partial \sigma^2_j}\frac{\partial \sigma_j^2}{\partial x_{ij}}\\
&=\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\frac{\partial L}{\partial \hat{x}_{ij}}+\frac{1}{m}\frac{\partial L}{\partial \mu_j}+\frac{2}{m}(x_{ij}-\mu_j)\frac{\partial L}{\partial \sigma^2_j}\\
&=\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\frac{\partial L}{\partial \hat{x}_{ij}}-\frac{1}{m\sqrt{\sigma_j^2+\epsilon}}\sum_{i=1}^m\frac{\partial L}{\partial \hat{x}_{ij}}-\frac{1}{m\sqrt{\sigma_j^2+\epsilon}}\hat{x}_{ij}\sum_{i=1}^m\hat{x}_{ij}\frac{\partial L}{\partial \hat{x}_{ij}}\\
&=\frac{1}{\sqrt{\sigma_j^2+\epsilon}}\left(\frac{\partial L}{\partial \hat{x}_{ij}}-\frac{1}{m}\sum_{i=1}^m\frac{\partial L}{\partial \hat{x}_{ij}}-\frac{1}{m}\hat{x}_{ij}\sum_{i=1}^m\hat{x}_{ij}\frac{\partial L}{\partial \hat{x}_{ij}}\right)
\end{aligned}
$$

Therefore dx can be computed with the expression (dx_norm - np.mean(dx_norm, axis=0) - x_norm * np.mean(x_norm * dx_norm, axis=0)) / np.sqrt(x_var + eps).

def batchnorm_backward_alt(dout, cache):
    """Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalizaton backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass.
    See the jupyter notebook for more hints.

    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    #                                                                         #
    # After computing the gradient with respect to the centered inputs, you   #
    # should be able to compute gradients with respect to the inputs in a     #
    # single statement; our implementation fits on a single 80-character line.#
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, x_mean, x_var, x_norm, gamma, beta, eps = cache
    N, D = x.shape

    dx_norm = gamma * dout
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    dx = (dx_norm - np.mean(dx_norm, axis=0) - x_norm * np.mean(x_norm * dx_norm, axis=0)) / np.sqrt(x_var + eps)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta

The check shows that the gradients match those from batchnorm_backward, and the alternative implementation runs about 1.57x faster, a modest speedup. A sketch of how such a comparison can be run is given below.
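
A minimal sketch of such a comparison (not necessarily the notebook's exact cell), assuming both backward implementations are importable from cs231n.layers:

import time
import numpy as np
from cs231n.layers import batchnorm_forward, batchnorm_backward, batchnorm_backward_alt

np.random.seed(231)
N, D = 100, 500
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

out, cache = batchnorm_forward(x, gamma, beta, {"mode": "train"})

t0 = time.time()
dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)
t1 = time.time()
dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)
t2 = time.time()

print("dx difference: ", np.linalg.norm(dx1 - dx2))   # should be ~0
print("speedup: %.2fx" % ((t1 - t0) / (t2 - t1)))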

Fully Connected Networks with Batch Normalization

Now that we have a working implementation of batch normalization, go back to FullyConnectedNet in cs231n/classifiers/fc_net.py and add batch normalization. Specifically, when the normalization flag is set to "batchnorm" in the constructor, a batch normalization layer should be inserted before every ReLU nonlinearity. The output of the network's last layer should not be normalized. Hint: it may help to define additional helper layers similar to those in cs231n/layer_utils.py.

First, add helper layers to layer_utils.py that implement the affine - batchnorm - ReLU structure.

def affine_batchnorm_relu_forward(x, w, b, gamma, beta, bn_params):
    """
    Convenience layer that performs an affine transform followed by a Batch Normalization and ReLU.

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer
    - gamma, beta, bn_params: parameters in batch normalization layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = affine_forward(x, w, b)
    a, bn_cache = batchnorm_forward(a, gamma, beta, bn_params)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, bn_cache, relu_cache)

    return out, cache


def affine_batchnorm_relu_backward(dout, cache):
    """
    Backward pass for the affine-batchnorm-relu convenience layer.
    """
    fc_cache, bn_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward(da, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)

    return dx, dw, db, dgamma, dbeta

Then add Batch Normalization support in fc_net.py; the modified loss method is as follows:

def loss(self, X, y=None):
    """Compute loss and gradient for the fully connected net.

    Inputs:
    - X: Array of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

    Returns:
    If y is None, then run a test-time forward pass of the model and return:
    - scores: Array of shape (N, C) giving classification scores, where
        scores[i, c] is the classification score for X[i] and class c.

    If y is not None, then run a training-time forward and backward pass and
    return a tuple of:
    - loss: Scalar value giving the loss
    - grads: Dictionary with the same keys as self.params, mapping parameter
        names to gradients of the loss with respect to those parameters.
    """
    X = X.astype(self.dtype)
    mode = "test" if y is None else "train"

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.use_dropout:
        self.dropout_param["mode"] = mode
    if self.normalization == "batchnorm":
        for bn_param in self.bn_params:
            bn_param["mode"] = mode
    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the fully connected net, computing  #
    # the class scores for X and storing them in the scores variable.          #
    #                                                                          #
    # When using dropout, you'll need to pass self.dropout_param to each       #
    # dropout forward pass.                                                    #
    #                                                                          #
    # When using batch normalization, you'll need to pass self.bn_params[0] to #
    # the forward pass for the first batch normalization layer, pass           #
    # self.bn_params[1] to the forward pass for the second batch normalization #
    # layer, etc.                                                              #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
    a = X
    # Save each layer's cache object; the backward pass needs it
    caches = {}
    for i in range(self.num_layers - 1):
        W, b = self.params['W' + str(i + 1)], self.params['b' + str(i + 1)]
        if self.normalization == 'batchnorm':
            gamma, beta = self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)]
            a, caches['layer' + str(i + 1)] = affine_batchnorm_relu_forward(a, W, b, gamma, beta, self.bn_params[i])
        else:
            a, caches['layer' + str(i + 1)] = affine_relu_forward(a, W, b)
    # Last layer: affine only
    W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
    scores, caches['layer' + str(self.num_layers)] = affine_forward(a, W, b)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If test mode return early.
    if mode == "test":
        return scores

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the backward pass for the fully connected net. Store the #
    # loss in the loss variable and gradients in the grads dictionary. Compute #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    #                                                                          #
    # When using batch/layer normalization, you don't need to regularize the   #
    # scale and shift parameters.                                              #
    #                                                                          #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    loss, dscores = softmax_loss(scores, y)

    # Add the L2 regularization term to the loss
    for i in range(self.num_layers):
        W = self.params['W' + str(i + 1)]
        loss += 0.5 * self.reg * np.sum(np.square(W))

    # Backward pass
    W = self.params['W' + str(self.num_layers)]
    fc_cache = caches['layer' + str(self.num_layers)]
    da, dW, db = affine_backward(dscores, fc_cache)
    grads['W' + str(self.num_layers)] = dW + self.reg * W
    grads['b' + str(self.num_layers)] = db
    for i in range(self.num_layers - 1, 0, -1):
        W = self.params['W' + str(i)]
        cache = caches['layer' + str(i)]
        if self.normalization == 'batchnorm':
            da, dW, db, dgamma, dbeta = affine_batchnorm_relu_backward(da, cache)
            grads['gamma' + str(i)] = dgamma
            grads['beta' + str(i)] = dbeta
        else:
            da, dW, db = affine_relu_backward(da, cache)

        grads['W' + str(i)] = dW + self.reg * W
        grads['b' + str(i)] = db
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads

Run the following to train a six-layer network on a subset of 1000 training examples, both with and without batch normalization; a sketch of the setup is given below.
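
A minimal sketch of that cell, assuming FullyConnectedNet, Solver, and the notebook's CIFAR-10 data dict (hidden sizes and hyperparameters are illustrative):

import numpy as np
from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.solver import Solver

np.random.seed(231)
num_train = 1000
small_data = {                      # 'data' is the CIFAR-10 dict loaded earlier in the notebook
    "X_train": data["X_train"][:num_train],
    "y_train": data["y_train"][:num_train],
    "X_val": data["X_val"],
    "y_val": data["y_val"],
}

hidden_dims = [100, 100, 100, 100, 100]
for label, normalization in [("with batch norm", "batchnorm"), ("without batch norm", None)]:
    print("Solver %s:" % label)
    model = FullyConnectedNet(hidden_dims, weight_scale=2e-2, normalization=normalization)
    solver = Solver(
        model,
        small_data,
        num_epochs=10,
        batch_size=50,              # 1000 / 50 = 20 iterations per epoch -> 200 total
        update_rule="adam",
        optim_config={"learning_rate": 1e-3},
        print_every=20,
        verbose=True,
    )
    solver.train()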

The results are as follows:

Solver with batch norm:
(Iteration 1 / 200) loss: 2.340974
(Epoch 0 / 10) train acc: 0.107000; val_acc: 0.115000
(Epoch 1 / 10) train acc: 0.313000; val_acc: 0.266000
(Iteration 21 / 200) loss: 2.039365
(Epoch 2 / 10) train acc: 0.385000; val_acc: 0.278000
(Iteration 41 / 200) loss: 2.041103
(Epoch 3 / 10) train acc: 0.494000; val_acc: 0.308000
(Iteration 61 / 200) loss: 1.753903
(Epoch 4 / 10) train acc: 0.532000; val_acc: 0.308000
(Iteration 81 / 200) loss: 1.246584
(Epoch 5 / 10) train acc: 0.574000; val_acc: 0.314000
(Iteration 101 / 200) loss: 1.320591
(Epoch 6 / 10) train acc: 0.636000; val_acc: 0.339000
(Iteration 121 / 200) loss: 1.157329
(Epoch 7 / 10) train acc: 0.685000; val_acc: 0.327000
(Iteration 141 / 200) loss: 1.141054
(Epoch 8 / 10) train acc: 0.774000; val_acc: 0.336000
(Iteration 161 / 200) loss: 0.700217
(Epoch 9 / 10) train acc: 0.808000; val_acc: 0.328000
(Iteration 181 / 200) loss: 0.889906
(Epoch 10 / 10) train acc: 0.789000; val_acc: 0.326000

Solver without batch norm:
(Iteration 1 / 200) loss: 2.302332
(Epoch 0 / 10) train acc: 0.129000; val_acc: 0.131000
(Epoch 1 / 10) train acc: 0.283000; val_acc: 0.250000
(Iteration 21 / 200) loss: 2.041970
(Epoch 2 / 10) train acc: 0.316000; val_acc: 0.277000
(Iteration 41 / 200) loss: 1.900473
(Epoch 3 / 10) train acc: 0.373000; val_acc: 0.282000
(Iteration 61 / 200) loss: 1.713156
(Epoch 4 / 10) train acc: 0.390000; val_acc: 0.310000
(Iteration 81 / 200) loss: 1.662209
(Epoch 5 / 10) train acc: 0.434000; val_acc: 0.300000
(Iteration 101 / 200) loss: 1.696059
(Epoch 6 / 10) train acc: 0.535000; val_acc: 0.345000
(Iteration 121 / 200) loss: 1.557987
(Epoch 7 / 10) train acc: 0.530000; val_acc: 0.304000
(Iteration 141 / 200) loss: 1.432189
(Epoch 8 / 10) train acc: 0.628000; val_acc: 0.339000
(Iteration 161 / 200) loss: 1.034116
(Epoch 9 / 10) train acc: 0.654000; val_acc: 0.342000
(Iteration 181 / 200) loss: 0.905794
(Epoch 10 / 10) train acc: 0.712000; val_acc: 0.328000

The training process is visualized in the figure below; the network with Batch Normalization clearly trains faster.

Batch Normalization and Initialization

We will now run a small experiment to study the interaction between Batch Normalization and weight initialization.

The first cell trains eight-layer networks both with and without Batch Normalization, using different scales for weight initialization. The second cell plots training accuracy, validation-set accuracy, and training loss as a function of the weight initialization scale.
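
Reusing the imports and small_data from the sketch above, the sweep is essentially a loop over weight scales; the grid and hidden-layer sizes below are assumptions (the log output shows 20 scales being run).

weight_scales = np.logspace(-4, 0, num=20)   # assumed grid covering 1e-4 ... 1
bn_solvers, baseline_solvers = {}, {}
for i, ws in enumerate(weight_scales):
    print(f"Running weight scale {i + 1} / {len(weight_scales)}")
    for store, normalization in [(bn_solvers, "batchnorm"), (baseline_solvers, None)]:
        model = FullyConnectedNet([50] * 7, weight_scale=ws, normalization=normalization)
        solver = Solver(model, small_data, num_epochs=10, batch_size=50,
                        update_rule="adam", optim_config={"learning_rate": 1e-3},
                        verbose=False)
        solver.train()
        store[ws] = solver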

The output is as follows:

Running weight scale 1 / 20
Running weight scale 2 / 20
Running weight scale 3 / 20
Running weight scale 4 / 20
Running weight scale 5 / 20
Running weight scale 6 / 20
Running weight scale 7 / 20
Running weight scale 8 / 20
Running weight scale 9 / 20
Running weight scale 10 / 20
Running weight scale 11 / 20
Running weight scale 12 / 20
Running weight scale 13 / 20
Running weight scale 14 / 20
Running weight scale 15 / 20
Running weight scale 16 / 20
/content/drive/My Drive/cs231n/assignments/assignment2/cs231n/layers.py:145: RuntimeWarning: divide by zero encountered in log
  loss = np.sum(-np.log(p[range(N), y])) / N
Running weight scale 17 / 20
Running weight scale 18 / 20
Running weight scale 19 / 20
Running weight scale 20 / 20

The results are shown in the figure below:

Inline Question 1

Describe the results of this experiment. How does the weight initialization scale affect models with and without Batch Normalization differently, and why?

Experimental results:

  • The weight initialization scale has a large effect on the model without Batch Normalization. From the results above, its training and validation accuracy peak when weight_scale is around $10^{-1}$, while for other values of weight_scale both accuracies are much lower.

  • The weight initialization scale has little effect on the model with Batch Normalization. In most cases, for the same weight_scale, the model with Batch Normalization achieves higher training and validation accuracy than the model without it.

Batch Normalization and Batch Size

We will now run a small experiment to study the interaction between Batch Normalization and batch size. The first cell trains six-layer networks with and without batch normalization using different batch sizes. The second cell plots training accuracy and validation-set accuracy over time.

The output is as follows:

No normalization: batch size =  5
Normalization: batch size =  5
Normalization: batch size =  10
Normalization: batch size =  50

Inline Question 2

Describe the results of this experiment. What does this imply about the relationship between Batch Normalization and batch size? Why is this relationship observed?

Experimental results: the larger the batch size, the faster the model with Batch Normalization trains and converges. This is because the mean and variance estimated from a larger mini-batch are closer to the statistics of the full dataset; with very small batches these estimates are noisy, so Batch Normalization helps much less.
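
A standalone numerical illustration of that last point (not part of the assignment code): the per-batch mean that batch norm relies on is a much noisier estimate of the true mean when the batch is small.

import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(loc=2.0, scale=3.0, size=100_000)  # one feature, true mean 2.0

for batch_size in [5, 10, 50, 500]:
    usable = (len(feature) // batch_size) * batch_size
    batch_means = feature[:usable].reshape(-1, batch_size).mean(axis=1)
    print(f"batch_size={batch_size:4d}  std of per-batch mean estimate: {batch_means.std():.3f}")

# The spread shrinks roughly like 1/sqrt(batch_size), so the normalization
# statistics are far more stable with larger batches.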

Layer Normalization

Batch Normalization has proven effective at making networks easier to train, but its dependence on batch size makes it less useful in complex networks, where hardware constraints put an upper limit on the input batch size. Several alternatives to Batch Normalization have been proposed to mitigate this problem; one of them is Layer Normalization. Instead of normalizing over the batch, we normalize over the features: with layer normalization, each feature vector corresponding to a single data point is normalized based on all of the entries within that feature vector.
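
A small standalone sketch (not assignment code) of the difference in normalization axes: batch norm normalizes each feature across the batch, while layer norm normalizes each data point across its features.

import numpy as np

x = np.random.randn(4, 6) * 5 + 3   # N = 4 data points, D = 6 features
eps = 1e-5

# Batch-norm style: statistics per feature (column), over the batch axis.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer-norm style: statistics per data point (row), over the feature axis.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0), bn.std(axis=0))  # ~0 and ~1 for every feature
print(ln.mean(axis=1), ln.std(axis=1))  # ~0 and ~1 for every data point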

Inline Question 3

Which of these data preprocessing steps is analogous to Batch Normalization, and which is analogous to Layer Normalization?

  1. Scaling each image in the dataset so that the RGB channels of each row of pixels within the image sum to 1.
  2. Scaling each image in the dataset so that the RGB channels of all pixels within the image sum to 1.
  3. Subtracting the mean image of the dataset from each image in the dataset.
  4. Setting all RGB values to either 0 or 1 depending on a given threshold.

Answer: 3 is analogous to Batch Normalization, since the statistic (the mean image) is computed across the whole dataset, i.e., across data points, for each pixel position; 2 is analogous to Layer Normalization, since the statistic is computed within a single image over all of its entries. Option 1 only normalizes subsets of a single image's entries, and option 4 is simple thresholding, so neither matches as well.

Now you will implement Layer Normalization. This step should be relatively straightforward: conceptually, the implementation is almost identical to batch normalization. One notable difference is that for Layer Normalization we do not track running statistics; the test-time behavior is identical to training, and we compute the mean and variance of each data point directly.

layernorm_forward

Layer Normalization is defined by the formulas below and is very similar to Batch Normalization; it is even possible to reuse the Batch Normalization code by transposing the input (see the sketch after the forward-pass code below).
$$\mu_i=\frac{1}{D}\sum_{j=1}^{D}x_{ij}\tag{1}$$

$$\sigma_i^2=\frac{1}{D}\sum_{j=1}^{D}(x_{ij}-\mu_i)^2\tag{2}$$

$$\hat{x}_{ij}=\frac{x_{ij}-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}\tag{3}$$

$$y_{ij}=\gamma_j\hat{x}_{ij}+\beta_j\tag{4}$$

def layernorm_forward(x, gamma, beta, ln_param):
    """Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.

    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization are identical, and we do not need to keep track of running averages
    of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift paremeter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get("eps", 1e-5)
    ###########################################################################
    # TODO: Implement the training-time forward pass for layer norm.          #
    # Normalize the incoming data, and scale and  shift the normalized data   #
    #  using gamma and beta.                                                  #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of  batch normalization, and inserting a line or two of  #
    # well-placed code. In particular, can you think of any matrix            #
    # transformations you could perform, that would enable you to copy over   #
    # the batch norm code and leave it almost unchanged?                      #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, D = x.shape
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_var = np.var(x, axis=1, keepdims=True)
    x_norm = (x - x_mean) / np.sqrt(x_var + eps)
    out = gamma * x_norm + beta

    cache = (x, x_mean, x_var, x_norm, gamma, beta, eps)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return out, cache
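
As an aside, here is a sketch of the transpose trick the hint alludes to: reuse the batchnorm_forward implemented earlier in layers.py on x.T, so that the feature axis plays the role of the batch axis. This is only an illustration under that assumption; the cache it returns does not match what the layernorm_backward below expects, so the backward pass would have to be adapted in the same way.

def layernorm_forward_via_batchnorm(x, gamma, beta, ln_param):
    """Illustrative alternative: layer norm expressed through batch norm."""
    eps = ln_param.get("eps", 1e-5)
    N, D = x.shape
    # Normalize each row of x by batch-normalizing each column of x.T,
    # using identity scale/shift so gamma/beta can be applied afterwards
    # in the original orientation.
    x_norm_t, bn_cache = batchnorm_forward(x.T, np.ones(N), np.zeros(N), {"mode": "train", "eps": eps})
    x_norm = x_norm_t.T
    out = gamma * x_norm + beta
    cache = (x_norm, gamma, bn_cache)
    return out, cache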

Run the following test:

The test results are as follows:

Before layer normalization:
  means: [-59.06673243 -47.60782686 -43.31137368 -26.40991744]
  stds:  [10.07429373 28.39478981 35.28360729  4.01831507]

After layer normalization (gamma=1, beta=0)
  means: [ 4.81096644e-16 -7.40148683e-17  2.22044605e-16 -5.92118946e-16]
  stds:  [0.99999995 0.99999999 1.         0.99999969]

After layer normalization (gamma= [3. 3. 3.] , beta= [5. 5. 5.] )
  means: [5. 5. 5. 5.]
  stds:  [2.99999985 2.99999998 2.99999999 2.99999907]
layernorm_backward

For layernorm_backward, the code is essentially the same as batchnorm_backward, with a few minor changes (the sums run over the feature axis instead of the batch axis) that can be verified by re-deriving the gradients.
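
For reference, the per-row chain rule that the code below implements; it parallels the batch-norm derivation, and the $\sum_j(x_{ij}-\mu_i)=0$ term in $\partial L/\partial\mu_i$ is dropped because it vanishes.

$$\frac{\partial L}{\partial \hat{x}_{ij}}=\gamma_j\,\frac{\partial L}{\partial y_{ij}},\qquad
\frac{\partial L}{\partial \sigma_i^2}=-\frac{1}{2}\sum_{j=1}^{D}\frac{\partial L}{\partial \hat{x}_{ij}}\,\frac{\hat{x}_{ij}}{\sigma_i^2+\epsilon},\qquad
\frac{\partial L}{\partial \mu_i}=-\frac{1}{\sqrt{\sigma_i^2+\epsilon}}\sum_{j=1}^{D}\frac{\partial L}{\partial \hat{x}_{ij}}$$

$$\frac{\partial L}{\partial x_{ij}}=\frac{1}{\sqrt{\sigma_i^2+\epsilon}}\,\frac{\partial L}{\partial \hat{x}_{ij}}+\frac{2}{D}\,\frac{\partial L}{\partial \sigma_i^2}\,(x_{ij}-\mu_i)+\frac{1}{D}\,\frac{\partial L}{\partial \mu_i}$$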

def layernorm_backward(dout, cache):
    """Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for layer norm.                       #
    #                                                                         #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization. The hints to the forward pass    #
    # still apply!                                                            #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, x_mean, x_var, x_norm, gamma, beta, eps = cache
    N, D = x.shape

    dx_norm = gamma * dout
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    dx_var = np.sum(-0.5 * dx_norm * x_norm / (x_var + eps), axis=1, keepdims=True)
    dx_mean = np.sum(-dx_norm / np.sqrt(x_var + eps), axis=1, keepdims=True)
    
    dx = dx_norm / np.sqrt(x_var + eps) + 2 / D * dx_var * (x - x_mean) + dx_mean / D

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dgamma, dbeta

Run the following test:

The test results are as follows:

dx error:  1.433616168873336e-09
dgamma error:  4.519489546032799e-12
dbeta error:  2.276445013433725e-12
layernorm_backward_alt

This function is not required by the assignment, but it is very similar to batchnorm_backward_alt and needs only minor changes; it is slightly faster than layernorm_backward.
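
Substituting the intermediate gradients back into $\partial L/\partial x_{ij}$ and collecting terms (just as for batchnorm_backward_alt) gives the compact per-row expression used below:

$$\frac{\partial L}{\partial x_{ij}}=\frac{1}{\sqrt{\sigma_i^2+\epsilon}}\left(g_{ij}-\frac{1}{D}\sum_{k=1}^{D}g_{ik}-\frac{\hat{x}_{ij}}{D}\sum_{k=1}^{D}g_{ik}\,\hat{x}_{ik}\right),\qquad g_{ij}=\gamma_j\,\frac{\partial L}{\partial y_{ij}}$$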

def layernorm_backward_alt(dout, cache):
    """Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for layer norm.                       #
    #                                                                         #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization. The hints to the forward pass    #
    # still apply!                                                            #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, x_mean, x_var, x_norm, gamma, beta, eps = cache
    N, D = x.shape

    dx_norm = gamma * dout
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    dx = (dx_norm - np.mean(dx_norm, axis=1, keepdims=True) - x_norm * np.mean(x_norm * dx_norm, axis=1, keepdims=True)) / np.sqrt(x_var + eps)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dgamma, dbeta
Layer Normalization and Batch Size

We will now run the earlier batch-size experiment using Layer Normalization instead of Batch Normalization. Compared with the previous experiment, you will find that batch size has a noticeably smaller effect on the training history!


Inline Question 4

Under which circumstances might Layer Normalization not work well?

  1. Used in a very deep network
  2. The feature dimension is very small
  3. The regularization term is large

Answer: 2 and 3.

  • Layer Normalization normalizes the features of each individual sample, whereas Batch Normalization normalizes across the samples of a mini-batch. By analogy with the relationship between Batch Normalization and batch size (larger batches give more reliable statistics and faster training), a larger feature dimension gives Layer Normalization more reliable per-sample statistics; when the feature dimension is very small, those estimates become noisy and Layer Normalization works poorly.
  • When the regularization term is large, the model tends to underfit, and Layer Normalization is unlikely to help much in that case.

Q3: Dropout

Dropout is a technique for regularizing neural networks that randomly sets some output activations to zero during the forward pass. In this exercise, you will implement a dropout layer and modify your fully connected network to use dropout optionally.

dropout_forward

The regularization section of the cs231n course notes provides reference code for dropout. Following the Inverted Dropout implementation given there, the approach is:

In training mode:

  1. Generate a mask matrix mask whose entries are 1 with probability p and 0 with probability (1 - p), then divide mask by p (dividing by p here means we do not need to multiply by p at test time).
  2. Multiply the mask matrix element-wise with the input matrix x.

In test mode, the output is simply x.

def dropout_forward(x, dropout_param):
    """Forward pass for inverted dropout.

    Note that this is different from the vanilla version of dropout.
    Here, p is the probability of keeping a neuron output, as opposed to
    the probability of dropping a neuron output.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    """
    p, mode = dropout_param["p"], dropout_param["mode"]
    if "seed" in dropout_param:
        np.random.seed(dropout_param["seed"])

    mask = None
    out = None

    if mode == "train":
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == "test":
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        out = x

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                            END OF YOUR CODE                         #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache

Running tests with p =  0.25
Mean of input:  10.000207878477502
Mean of train-time output:  10.014059116977283
Mean of test-time output:  10.000207878477502
Fraction of train-time output set to zero:  0.749784
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.4
Mean of input:  10.000207878477502
Mean of train-time output:  9.977917658761159
Mean of test-time output:  10.000207878477502
Fraction of train-time output set to zero:  0.600796
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.7
Mean of input:  10.000207878477502
Mean of train-time output:  9.987811912159426
Mean of test-time output:  10.000207878477502
Fraction of train-time output set to zero:  0.30074
Fraction of test-time output set to zero:  0.0

dropout_backward

The backward pass of dropout is derived similarly to that of the ReLU activation:

  • In training mode, dx = dout * mask
  • In test mode, dx = dout
def dropout_backward(dout, cache):
    """Backward pass for inverted dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param["mode"]

    dx = None
    if mode == "train":
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        dx = dout * mask

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == "test":
        dx = dout
    return dx

Inline Question 1

What happens if we do not divide the values passed through inverted dropout by p in the dropout layer? Why does this happen?
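
A standalone numerical check (not part of the assignment code) makes the answer concrete: without the division by p, the expected train-time activation is only p times what the test-time pass sees, so the two modes are mismatched unless the outputs are scaled by p at test time instead.

import numpy as np

np.random.seed(0)
x = np.random.randn(500, 500) + 10.0   # activations with mean ~10
p = 0.5                                # keep probability

keep = np.random.rand(*x.shape) < p
plain = x * keep                       # dropout without the 1/p scaling
inverted = x * keep / p                # inverted dropout, as in this assignment

print(x.mean())         # ~10.0  (what the test-time pass sees)
print(plain.mean())     # ~5.0   -> train-time scale is p times the test-time scale
print(inverted.mean())  # ~10.0  -> scales match, no test-time correction needed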

Fully Connected Networks with Dropout

Modify the loss implementation in fc_net.py to add dropout. This is fairly simple; building on the FullyConnectedNet code above (which already handles batch/layer normalization), the code is as follows:

def loss(self, X, y=None):
	"""Compute loss and gradient for the fully connected net.

    Inputs:
    - X: Array of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

    Returns:
    If y is None, then run a test-time forward pass of the model and return:
    - scores: Array of shape (N, C) giving classification scores, where
        scores[i, c] is the classification score for X[i] and class c.

    If y is not None, then run a training-time forward and backward pass and
    return a tuple of:
    - loss: Scalar value giving the loss
    - grads: Dictionary with the same keys as self.params, mapping parameter
        names to gradients of the loss with respect to those parameters.
    """
    X = X.astype(self.dtype)
    mode = "test" if y is None else "train"

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.use_dropout:
        self.dropout_param["mode"] = mode
    if self.normalization == "batchnorm":
        for bn_param in self.bn_params:
            bn_param["mode"] = mode
    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the fully connected net, computing  #
    # the class scores for X and storing them in the scores variable.          #
    #                                                                          #
    # When using dropout, you'll need to pass self.dropout_param to each       #
    # dropout forward pass.                                                    #
    #                                                                          #
    # When using batch normalization, you'll need to pass self.bn_params[0] to #
    # the forward pass for the first batch normalization layer, pass           #
    # self.bn_params[1] to the forward pass for the second batch normalization #
    # layer, etc.                                                              #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
    a = X
    # Store each layer's cache; it is needed in the backward pass.
    caches = {}
    for i in range(self.num_layers - 1):
        W, b = self.params['W' + str(i + 1)], self.params['b' + str(i + 1)]
        if self.normalization == 'batchnorm':
            gamma, beta = self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)]
            a, caches['layer' + str(i + 1)] = affine_batchnorm_relu_forward(a, W, b, gamma, beta, self.bn_params[i])
        elif self.normalization == 'layernorm':
            gamma, beta = self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)]
            a, caches['layer' + str(i + 1)] = affine_layernorm_relu_forward(a, W, b, gamma, beta, self.bn_params[i])
        else:
            a, caches['layer' + str(i + 1)] = affine_relu_forward(a, W, b)

        if self.use_dropout:
            a, caches['dropout' + str(i + 1)] = dropout_forward(a, self.dropout_param)
    # Last layer: a plain affine transform (no ReLU).
    W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
    scores, caches['layer' + str(self.num_layers)] = affine_forward(a, W, b)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If test mode return early.
    if mode == "test":
        return scores

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the backward pass for the fully connected net. Store the #
    # loss in the loss variable and gradients in the grads dictionary. Compute #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    #                                                                          #
    # When using batch/layer normalization, you don't need to regularize the   #
    # scale and shift parameters.                                              #
    #                                                                          #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    loss, dscores = softmax_loss(scores, y)

    # Add the L2 regularization term to the loss.
    for i in range(self.num_layers):
        W = self.params['W' + str(i + 1)]
        loss += 0.5 * self.reg * np.sum(np.square(W))

    # Backward pass, starting from the last affine layer.
    W = self.params['W' + str(self.num_layers)]
    fc_cache = caches['layer' + str(self.num_layers)]
    da, dW, db = affine_backward(dscores, fc_cache)
    grads['W' + str(self.num_layers)] = dW + self.reg * W
    grads['b' + str(self.num_layers)] = db
    for i in range(self.num_layers - 1, 0, -1):
        W = self.params['W' + str(i)]
        cache = caches['layer' + str(i)]
        if self.use_dropout:
            dropout_cache = caches['dropout' + str(i)]
            da = dropout_backward(da, dropout_cache)
        if self.normalization == 'batchnorm':
            da, dW, db, dgamma, dbeta = affine_batchnorm_relu_backward(da, cache)
            grads['gamma' + str(i)] = dgamma
            grads['beta' + str(i)] = dbeta
        elif self.normalization == 'layernorm':
            da, dW, db, dgamma, dbeta = affine_layernorm_relu_backward(da, cache)
            grads['gamma' + str(i)] = dgamma
            grads['beta' + str(i)] = dbeta
        else:
            da, dW, db = affine_relu_backward(da, cache)

        grads['W' + str(i)] = dW + self.reg * W
        grads['b' + str(i)] = db
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads

Running check with dropout =  1
Initial loss:  2.300479089768492
W1 relative error: 1.03e-07
W2 relative error: 2.21e-05
W3 relative error: 4.56e-07
b1 relative error: 4.66e-09
b2 relative error: 2.09e-09
b3 relative error: 1.69e-10

Running check with dropout =  0.75
Initial loss:  2.302371489704412
W1 relative error: 1.85e-07
W2 relative error: 2.15e-06
W3 relative error: 4.56e-08
b1 relative error: 1.16e-08
b2 relative error: 1.82e-09
b3 relative error: 1.48e-10

Running check with dropout =  0.5
Initial loss:  2.30427592207859
W1 relative error: 3.11e-07
W2 relative error: 2.48e-08
W3 relative error: 6.43e-08
b1 relative error: 5.37e-09
b2 relative error: 1.91e-09
b3 relative error: 1.85e-10

Regularization Experiment

As an experiment, we will train a pair of two-layer networks on 500 training examples: one without dropout and one with a keep probability of 0.25. We will then visualize the training and validation accuracy of the two networks over time.
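
A sketch of how that comparison is typically set up, reusing the FullyConnectedNet and Solver interfaces from earlier; the keyword for the keep probability (dropout_keep_ratio) and the hyperparameters are assumptions inferred from the output below.

from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.solver import Solver

solvers = {}
for keep_prob in [1, 0.25]:
    model = FullyConnectedNet([500], dropout_keep_ratio=keep_prob)
    solver = Solver(model, small_data,          # assumed: dict holding the 500-example subset
                    num_epochs=25, batch_size=100,
                    update_rule="adam", optim_config={"learning_rate": 5e-4},
                    verbose=True, print_every=100)
    print(keep_prob)
    solver.train()
    solvers[keep_prob] = solver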

The test results are as follows:

1
(Iteration 1 / 125) loss: 7.856643
(Epoch 0 / 25) train acc: 0.260000; val_acc: 0.184000
(Epoch 1 / 25) train acc: 0.416000; val_acc: 0.258000
(Epoch 2 / 25) train acc: 0.482000; val_acc: 0.276000
(Epoch 3 / 25) train acc: 0.532000; val_acc: 0.277000
(Epoch 4 / 25) train acc: 0.600000; val_acc: 0.271000
(Epoch 5 / 25) train acc: 0.708000; val_acc: 0.299000
(Epoch 6 / 25) train acc: 0.722000; val_acc: 0.282000
(Epoch 7 / 25) train acc: 0.832000; val_acc: 0.255000
(Epoch 8 / 25) train acc: 0.880000; val_acc: 0.268000
(Epoch 9 / 25) train acc: 0.902000; val_acc: 0.277000
(Epoch 10 / 25) train acc: 0.898000; val_acc: 0.261000
(Epoch 11 / 25) train acc: 0.924000; val_acc: 0.263000
(Epoch 12 / 25) train acc: 0.960000; val_acc: 0.300000
(Epoch 13 / 25) train acc: 0.972000; val_acc: 0.314000
(Epoch 14 / 25) train acc: 0.972000; val_acc: 0.310000
(Epoch 15 / 25) train acc: 0.974000; val_acc: 0.314000
(Epoch 16 / 25) train acc: 0.994000; val_acc: 0.303000
(Epoch 17 / 25) train acc: 0.970000; val_acc: 0.306000
(Epoch 18 / 25) train acc: 0.992000; val_acc: 0.312000
(Epoch 19 / 25) train acc: 0.990000; val_acc: 0.311000
(Epoch 20 / 25) train acc: 0.990000; val_acc: 0.287000
(Iteration 101 / 125) loss: 0.001695
(Epoch 21 / 25) train acc: 0.994000; val_acc: 0.289000
(Epoch 22 / 25) train acc: 0.998000; val_acc: 0.307000
(Epoch 23 / 25) train acc: 0.994000; val_acc: 0.308000
(Epoch 24 / 25) train acc: 0.998000; val_acc: 0.311000
(Epoch 25 / 25) train acc: 0.992000; val_acc: 0.310000

0.25
(Iteration 1 / 125) loss: 17.318478
(Epoch 0 / 25) train acc: 0.230000; val_acc: 0.177000
(Epoch 1 / 25) train acc: 0.378000; val_acc: 0.243000
(Epoch 2 / 25) train acc: 0.402000; val_acc: 0.254000
(Epoch 3 / 25) train acc: 0.502000; val_acc: 0.276000
(Epoch 4 / 25) train acc: 0.528000; val_acc: 0.298000
(Epoch 5 / 25) train acc: 0.562000; val_acc: 0.296000
(Epoch 6 / 25) train acc: 0.626000; val_acc: 0.291000
(Epoch 7 / 25) train acc: 0.622000; val_acc: 0.297000
(Epoch 8 / 25) train acc: 0.688000; val_acc: 0.313000
(Epoch 9 / 25) train acc: 0.712000; val_acc: 0.297000
(Epoch 10 / 25) train acc: 0.724000; val_acc: 0.306000
(Epoch 11 / 25) train acc: 0.768000; val_acc: 0.307000
(Epoch 12 / 25) train acc: 0.774000; val_acc: 0.284000
(Epoch 13 / 25) train acc: 0.828000; val_acc: 0.308000
(Epoch 14 / 25) train acc: 0.812000; val_acc: 0.346000
(Epoch 15 / 25) train acc: 0.848000; val_acc: 0.338000
(Epoch 16 / 25) train acc: 0.842000; val_acc: 0.306000
(Epoch 17 / 25) train acc: 0.856000; val_acc: 0.301000
(Epoch 18 / 25) train acc: 0.860000; val_acc: 0.317000
(Epoch 19 / 25) train acc: 0.882000; val_acc: 0.313000
(Epoch 20 / 25) train acc: 0.866000; val_acc: 0.312000
(Iteration 101 / 125) loss: 4.185203
(Epoch 21 / 25) train acc: 0.894000; val_acc: 0.332000
(Epoch 22 / 25) train acc: 0.898000; val_acc: 0.313000
(Epoch 23 / 25) train acc: 0.928000; val_acc: 0.314000
(Epoch 24 / 25) train acc: 0.922000; val_acc: 0.315000
(Epoch 25 / 25) train acc: 0.926000; val_acc: 0.330000

Inline Question 2

Compare the validation and training accuracies with and without dropout. What do your results suggest about dropout as a regularizer?

Answer: under the same settings, the model with dropout reaches a lower training accuracy than the model without dropout, while its validation accuracy is comparable or slightly higher. This shows that dropout acts as a regularizer and helps keep the model from overfitting.
