# 应用梯度累积算法

## 概述

本教程介绍梯度累积的训练方式，目的是为了解决由于内存不足导致某些大型网络无法训练大`batch_size`的问题。  
传统的训练方式是每次计算得到loss和梯度后，直接用所得梯度对参数进行更新。与传统的训练方式不同，梯度累积引入`mini_batch`的概念，首先对每个`mini_batch`的数据计算loss和梯度，但不立即更新模型参数，而是先对所得梯度进行累加，然后在指定数量（N）个`mini_batch`之后，用累积后的梯度更新网络参数。下次训练前清空过往累积梯度后重新累加，如此往复。  
最终目的是为了达到跟直接用N个mini_batch数据训练几乎同样的效果。 
本例将在MindSpore中应用梯度累积算法，实现对模型的训练。  
体验过程如下：

1. 数据准备。
2. 定义深度神经网络。
3. 训练函数并实现定义梯度累积算法。
4. 调用自定义训练函数进行训练。
5. 使用训练保存的模型参数进行验证。

> 本文档适用于GPU环境。

## 数据准备

下载MNIST_Data数据集并解压到指定位置，执行如下命令：

In [1]:
!wget -N https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/MNIST_Data.zip
!unzip -o MNIST_Data.zip -d ./datasets/
!tree ./datasets/MNIST_Data/

--2020-12-14 15:24:22--  https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/MNIST_Data.zip
Resolving proxy-notebook.modelarts-dev-proxy.com (proxy-notebook.modelarts-dev-proxy.com)... 192.168.0.172
Connecting to proxy-notebook.modelarts-dev-proxy.com (proxy-notebook.modelarts-dev-proxy.com)|192.168.0.172|:8083... connected.
Proxy request sent, awaiting response... 200 OK
Length: 10754903 (10M) [application/zip]
Saving to: ‘MNIST_Data.zip’


2020-12-14 15:24:22 (154 MB/s) - ‘MNIST_Data.zip’ saved [10754903/10754903]

Archive:  MNIST_Data.zip
   creating: ./datasets/MNIST_Data/test/
  inflating: ./datasets/MNIST_Data/test/t10k-images-idx3-ubyte  
  inflating: ./datasets/MNIST_Data/test/t10k-labels-idx1-ubyte  
   creating: ./datasets/MNIST_Data/train/
  inflating: ./datasets/MNIST_Data/train/train-images-idx3-ubyte  
  inflating: ./datasets/MNIST_Data/train/train-labels-idx1-ubyte  
./datasets/MNIST_Data/
├── test
│   ├── t10k-images-idx3-ubyte
│   └──

定义数据集增强函数create_dataset，调用该函数对MNIST原始训练数据集60000张$28\times28$的图片增强为1875个batch，每个batch张量为`(32,1,32,32)`的训练数据集。

In [2]:
import mindspore.dataset.vision.c_transforms as CV
import mindspore.dataset.transforms.c_transforms as C
from mindspore.dataset.vision import Inter
from mindspore import dtype as mstype
import mindspore.dataset as ds


def create_dataset(data_path, batch_size=32, repeat_size=1,
                   num_parallel_workers=1):
    # define dataset
    mnist_ds = ds.MnistDataset(data_path)

    # define some parameters needed for data enhancement and rough justification
    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # according to the parameters, generate the corresponding data enhancement method
    c_trans = [
        CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR),
        CV.Rescale(rescale_nml, shift_nml),
        CV.Rescale(rescale, shift),
        CV.HWC2CHW()
    ]
    type_cast_op = C.TypeCast(mstype.int32)

    # using map to apply operations to a dataset
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=c_trans, input_columns="image", num_parallel_workers=num_parallel_workers)

    # process the generated dataset
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    mnist_ds = mnist_ds.repeat(repeat_size)

    return mnist_ds

## 定义深度神经网络

本例采用LeNet5训练网络对数据集进行训练，其构造方式如下：

In [3]:
import mindspore.nn as nn
from mindspore.common.initializer import Normal

class LeNet5(nn.Cell):
    """Lenet network structure."""
    # define the operator required
    def __init__(self, num_class=10, num_channel=1):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02))
        self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02))
        self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02))
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    # use the preceding operators to construct networks
    def construct(self, x):
        x = self.max_pool2d(self.relu(self.conv1(x)))
        x = self.max_pool2d(self.relu(self.conv2(x)))
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x) 
        return x

## 定义Model函数并在其中进行梯度累积定义

梯度累积计算在Model函数中，这里对Model函数的原始代码进行重构。

重构中需涉及重构的方法主要有五点：

1. 定义梯度累积方法。
2. 定义前向反向传播方法。
3. 定义权重更新方法。
4. 定义梯度累积清除方法。
5. 定义模型训练执行器。

具体实现如下：

### 定义梯度累积方法

需要定义梯度累积的计算方式，并将计算方式注册到计算图中，若不进行注册，计算方法将不能在`nn.Cell`中构建计算图。

In [4]:
import mindspore.ops as ops

_sum_op = ops.MultitypeFuncGraph("grad_sum_op")
_clear_op = ops.MultitypeFuncGraph("clear_op")


@_sum_op.register("Tensor", "Tensor")
def _cumulative_grad(grad_sum, grad):
    """Apply grad sum to cumulative gradient."""
    add = ops.AssignAdd()
    return add(grad_sum, grad)


@_clear_op.register("Tensor", "Tensor")
def _clear_grad_sum(grad_sum, zero):
    """Apply zero to clear grad_sum."""
    success = True
    success = ops.depend(success, ops.assign(grad_sum, zero))
    return success

`_cumulativa_grad`：梯度累积方法，将grad值加到`grad_sum`中，后续计算过程中作用是将`mini_batch`计算出的grad值添加到`grad_sum`中。  
`_clear_grad_sum`：梯度清除方法，后续计算过程中的作用是当累积的梯度值`grad_sum`更新到权重中后，将`grad_sum`值清零。

### 定义前向反向传播方法

前向传播：利用训练前的模型函数，载入数据集中的数据，计算出loss值的过程。  
反向传播：利用loss值和载入的数据，通过优化器函数计算出梯度值，并将梯度值更新到模型函数的权重中的过程。  
这两个过程将在`TrainForwardBackward`中定义。  
MindSpore采用继承`nn.Cell`的方法，并将整体的计算过程在`construct`中实现。

In [5]:
from mindspore.nn import Cell

class TrainForwardBackward(Cell):
    def __init__(self, network, optimizer, grad_sum, sens=1.0):
        super(TrainForwardBackward, self).__init__(auto_prefix=False)
        self.network = network
        self.network.set_grad()
        self.network.add_flags(defer_inline=True)
        self.weights = ParameterTuple(network.trainable_params())
        self.optimizer = optimizer
        self.grad_sum = grad_sum
        self.grad = ops.GradOperation(get_by_list=True, sens_param=True)
        self.sens = sens
        self.hyper_map = ops.HyperMap()

    def construct(self, *inputs):
        weights = self.weights
        loss = self.network(*inputs)
        sens = ops.Fill()(ops.DType()(loss), ops.Shape()(loss), self.sens)
        grads = self.grad(self.network, weights)(*inputs, sens)
        return ops.depend(loss, self.hyper_map(ops.partial(_sum_op), self.grad_sum, grads))

`weights`：即网络中的权重参数。  
`loss`：当前网络参数载入训练数据后的损失值。  
`sens`：创建一个与loss相同类型和张量，将数值1填充其中。  
`grads`：计算出本次`mini_batch`的梯度值。  
`ops.depend`：使用前面的`loss`方法将loss值计算出来。

此方法定义了模型训练过程中前向传播和方向传播的具体过程，并且可以保存出所有权重的参数，计算出当前模型的权重参数下的loss值。

### 定义权重更新方法

执行优化权重的方法，即将`grad_sum`更新到权重参数中。

In [6]:
class TrainOptim(Cell):
    def __init__(self, optimizer, grad_sum):
        super(TrainOptim, self).__init__(auto_prefix=False)
        self.optimizer = optimizer
        self.grad_sum = grad_sum

    def construct(self):
        return self.optimizer(self.grad_sum)

### 定义清除累积梯度的方法

当累积的梯度`grad_sum`更新到权重中后，调用本函数将`grad_sum`值清零，再开始下一次梯度累积。

In [7]:
class TrainClear(Cell):
    def __init__(self, grad_sum, zeros):
        super(TrainClear, self).__init__(auto_prefix=False)
        self.grad_sum = grad_sum
        self.zeros = zeros
        self.hyper_map = ops.HyperMap()

    def construct(self):
        seccess = self.hyper_map(ops.partial(_clear_op), self.grad_sum, self.zeros)
        return seccess

### 定义模型训练执行器

在`GradientAccumulation`定义前向和反向以及梯度累积的执行过程。

In [8]:
import os
import mindspore.nn as nn
from mindspore import ParameterTuple, context, DatasetHelper
from mindspore import save_checkpoint


class GradientAccumulation:
    def __init__(self, network, loss_fn, optimizer):
        self._network = network
        self._loss_fn = loss_fn
        self._optimizer = optimizer

        params = self._optimizer.parameters
        self._grad_sum = params.clone(prefix="grad_sum", init='zeros')
        self._zeros = params.clone(prefix="zeros", init='zeros')
        self._train_forward_backward = self._build_train_forward_backward_network()
        self._train_optim = self._build_train_optim()
        self._train_clear = self._build_train_clear()

    def _build_train_forward_backward_network(self):
        """Build forward and backward network"""
        network = self._network
        network = nn.WithLossCell(network, self._loss_fn)
        loss_scale = 1.0
        network = TrainForwardBackward(network, self._optimizer, self._grad_sum, loss_scale).set_train()
        return network

    def _build_train_optim(self):
        """Build optimizer network"""
        network = TrainOptim(self._optimizer, self._grad_sum).set_train()
        return network

    def _build_train_clear(self):
        """Build clear network"""
        network = TrainClear(self._grad_sum, self._zeros).set_train()
        return network

    def train_process(self, epoch, train_dataset, mini_steps=None):
        """
        Training process. The data would be passed to network directly.
        """
        dataset_helper = DatasetHelper(train_dataset, dataset_sink_mode=False, epoch_num=epoch)

        for i in range(epoch):
            step = 0
            for k, next_element in enumerate(dataset_helper):
                loss = self._train_forward_backward(*next_element)
                if (k + 1) % mini_steps == 0:
                    step += 1
                    print("epoch:", i + 1, "step:", step, "loss is ", loss)
                    self._train_optim()
                    self._train_clear()

            train_dataset.reset()

        save_checkpoint(self._train_forward_backward, "gradient_accumulation.ckpt")

`train_process`：构建训练执行过程，并将梯度累积的方法在其中实现，即每`mini_steps`个`batch`数据训练完成后更新一次权重参数。

### 执行训练

执行训练过程，类似快速入门案例，将损失函数`SoftmaxCrossEntropyWithLogits`，优化器函数`Momentum`和深度网络`LeNet5`传入，自定义模型训练函数`GradientAccumolation`，并调用`train_process`方法，使用数据进行训练。

In [9]:
if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    ds_train_path = "./datasets/MNIST_Data/train/"
    ds_train = create_dataset(ds_train_path, 32)

    net = LeNet5(10)
    net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    net_opt = nn.Momentum(net.trainable_params(), 0.01, 0.9)
    model = GradientAccumulation(net, net_loss, net_opt)

    print("============== Starting Training ==============")
    model.train_process(3, ds_train, mini_steps=4)

epoch: 1 step: 1 loss is  2.302572
epoch: 1 step: 2 loss is  2.3027077
epoch: 1 step: 3 loss is  2.3026032
epoch: 1 step: 4 loss is  2.3029802
epoch: 1 step: 5 loss is  2.3009882
epoch: 1 step: 6 loss is  2.3028584
epoch: 1 step: 7 loss is  2.2963173
epoch: 1 step: 8 loss is  2.301377
epoch: 1 step: 9 loss is  2.3019261
... ...
epoch: 1 step: 461 loss is  2.2829156
epoch: 1 step: 462 loss is  2.2586172
epoch: 1 step: 463 loss is  2.2446578
epoch: 1 step: 464 loss is  2.1804438
epoch: 1 step: 465 loss is  2.1868634
epoch: 1 step: 466 loss is  2.118839
epoch: 1 step: 467 loss is  2.1144428
epoch: 1 step: 468 loss is  1.94902
epoch: 2 step: 1 loss is  1.9981135
epoch: 2 step: 2 loss is  2.0984964
epoch: 2 step: 3 loss is  2.0167308
epoch: 2 step: 4 loss is  2.0224195
epoch: 2 step: 5 loss is  2.0156221
epoch: 2 step: 6 loss is  1.9364308
epoch: 2 step: 7 loss is  1.8101931
... ...
epoch: 2 step: 459 loss is  0.12907082
epoch: 2 step: 460 loss is  0.15356739
epoch: 2 step: 461 loss is  0.3

本例中采用了累积梯度为`mini_steps=4`，即每训练4个batch的数据，进行一次权重参数的更新。最后在目录中保存了模型的权重参数文件`gradient_accumulate.ckpt`。

## 验证累积梯度训练出的模型精度

载入累积梯度训练结束后保存的模型参数`gradient_accumulation.ckpt`文件到神经网络LeNet5中，同时将其与损失函数（net_loss），优化器（net_opt）放入MindSpore的模型函数Model中，重新结合成完整计算图，输入验证数据集进行验证。

In [10]:
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore import Model
from mindspore.nn import Accuracy


ds_eval_path = "./datasets/MNIST_Data/test/"
ds_eval_data = create_dataset(ds_eval_path,32)

param_dict = load_checkpoint("gradient_accumulation.ckpt")
load_param_into_net(net, param_dict)
model = Model(net, net_loss, net_opt, metrics={"Accuracy": Accuracy()})

acc = model.eval(ds_eval_data, dataset_sink_mode=False)
print(acc)

{'Accuracy': 0.96875}


经过验证，使用累积梯度训练方法生成的模型精度大于0.95，此方法训练效果可行。