optimizer.zero_grad() sets the gradients to zero, i.e., it resets the derivative of the loss with respect to every weight to 0.
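
A minimal sketch of what this guards against (using a single made-up scalar parameter, not any real model): backward() adds new gradients onto whatever is already stored in .grad, and zero_grad() clears that accumulation.

    import torch

    # one scalar parameter w; "loss" = 3 * w, so d(loss)/dw = 3
    w = torch.tensor([1.0], requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=0.1)

    (3 * w).sum().backward()
    print(w.grad)             # tensor([3.])

    (3 * w).sum().backward()  # no zero_grad() in between, so the new gradient is added on top
    print(w.grad)             # tensor([6.])

    optimizer.zero_grad()     # clears the accumulated gradient
                              # (recent PyTorch versions set .grad to None rather than a zero tensor)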

While learning PyTorch, I noticed that nearly every batch goes through the following steps:

        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
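
For reference, a self-contained sketch of such a loop; the model, loss, optimizer and data here are toy stand-ins I made up, not the original tutorial's definitions.

    import torch
    import torch.nn as nn

    # toy stand-ins: a small linear model, MSE loss, plain SGD, and one random batch
    net = nn.Linear(3, 1)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

    inputs = torch.randn(8, 3)    # a batch of 8 samples with 3 features each
    labels = torch.randn(8, 1)

    for epoch in range(5):
        optimizer.zero_grad()               # zero the parameter gradients
        outputs = net(inputs)               # forward pass
        loss = criterion(outputs, labels)   # compute the loss
        loss.backward()                     # backward pass: fills .grad of every parameter
        optimizer.step()                    # update the parameters using those gradients
        print(epoch, loss.item())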

I think of these operations as one form of gradient descent. For comparison, here is a simple gradient descent routine I wrote by hand a while ago:

    # gradient descent (assumes dot(), input, label, n and m are already defined; see the setup sketch below)
    weights = [0] * n            # initialize all n weights to zero
    alpha = 0.0001               # learning rate
    max_Iter = 50000
    for i in range(max_Iter):
        loss = 0
        d_weights = [0] * n      # reset the accumulated gradient for this pass
        for k in range(m):       # accumulate over all m samples
            h = dot(input[k], weights)                    # prediction for sample k
            d_weights = [d_weights[j] + (label[k] - h) * input[k][j] for j in range(n)]
            loss += (label[k] - h) * (label[k] - h) / 2   # squared-error loss of sample k
        d_weights = [d_weights[k] / m for k in range(n)]  # average over the batch
        weights = [weights[k] + alpha * d_weights[k] for k in range(n)]  # update step
        if i % 10000 == 0:
            print("Iteration %d loss: %f" % (i, loss / m))
            print(weights)
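
The loop assumes that dot, input, label, n and m already exist. One possible setup, purely illustrative (a made-up toy linear-regression dataset, placed before the loop), is:

    import random

    def dot(a, b):
        # inner product of two equal-length lists
        return sum(x * y for x, y in zip(a, b))

    n = 3       # number of features
    m = 100     # number of samples
    true_w = [1.0, -2.0, 0.5]   # hypothetical ground-truth weights

    random.seed(0)
    # note: the name `input` shadows the built-in, kept only to match the loop above
    input = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
    label = [dot(x, true_w) for x in input]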

You can see that the two actually correspond one to one (a numerical check of this correspondence follows after the comparison):

optimizer.zero_grad() corresponds to d_weights = [0] * n

i.e., initializing the gradients to zero (because the derivative of a batch's loss with respect to the weights is the accumulated sum of the derivatives of every individual sample's loss with respect to the weights)

outputs = net(inputs) corresponds to h = dot(input[k], weights)

i.e., the forward pass that computes the predicted values

loss = criterion(outputs, labels) corresponds to loss += (label[k] - h) * (label[k] - h) / 2

This step is clearly just computing the loss. (The numeric loss value itself is not used when computing the gradients, it only lets us see how large the current loss is; note, though, that in PyTorch the loss tensor is still needed, because it is what backward() is called on.)

loss.backward() corresponds to d_weights = [d_weights[j] + (label[k] - h) * input[k][j] for j in range(n)]

i.e., the backward pass that computes the gradients

optimizer.step() corresponds to weights = [weights[k] + alpha * d_weights[k] for k in range(n)]

i.e., updating all the parameters
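
To check the correspondence numerically, here is a sketch (toy random data, per-sample loss (label - h)^2 / 2 averaged over the batch to match the hand-written code, plain SGD) that performs one batch update both ways and compares the resulting weights:

    import torch

    torch.manual_seed(0)
    m, n, alpha = 10, 3, 0.1
    X = torch.randn(m, n)     # inputs
    y = torch.randn(m)        # labels
    w0 = torch.zeros(n)       # initial weights, like weights = [0] * n

    # --- PyTorch: zero_grad / forward / backward / step ---
    w = w0.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([w], lr=alpha)

    optimizer.zero_grad()                 # d_weights = [0] * n
    h = X @ w                             # h = dot(input[k], weights), for all k at once
    loss = 0.5 * ((y - h) ** 2).mean()    # (label[k] - h)^2 / 2, averaged over the batch
    loss.backward()                       # fills w.grad with d(loss)/dw
    optimizer.step()                      # w <- w - alpha * w.grad

    # --- hand-written update on the same data ---
    h0 = X @ w0                                            # predictions with the initial weights
    d_weights = ((y - h0).unsqueeze(1) * X).mean(dim=0)    # (1/m) * sum_k (label[k] - h) * input[k]
    w_manual = w0 + alpha * d_weights                      # weights[k] + alpha * d_weights[k]

    print(torch.allclose(w.detach(), w_manual))            # True: the two updates agree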

If anything here is wrong, please point it out. Comments and discussion are welcome.
