optimizer.zero_grad() sets the gradients to zero, that is, it resets the derivative of the loss with respect to every weight to 0.
While learning PyTorch I noticed that almost every batch is processed with the following operations:
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
I understand these operations as one iteration of gradient descent. For comparison, here is a simple gradient descent routine I wrote by hand a while ago:
# gradient descent
# (input is an m x n list of feature vectors, label is a list of m targets)
def dot(x, w):
    return sum(x_j * w_j for x_j, w_j in zip(x, w))

weights = [0] * n
alpha = 0.0001
max_iter = 50000
for i in range(max_iter):
    loss = 0
    d_weights = [0] * n   # gradient accumulator, reset at the start of every iteration
    for k in range(m):
        h = dot(input[k], weights)   # forward pass: prediction for sample k
        # (label[k] - h) * x is the negative gradient of the squared error, hence "+" in the update
        d_weights = [d_weights[j] + (label[k] - h) * input[k][j] for j in range(n)]
        loss += (label[k] - h) * (label[k] - h) / 2
    d_weights = [d_weights[j] / m for j in range(n)]
    weights = [weights[j] + alpha * d_weights[j] for j in range(n)]
    if i % 10000 == 0:
        print("Iteration %d loss: %f" % (i, loss / m))
        print(weights)
You can see that the two actually correspond one-to-one:
optimizer.zero_grad() corresponds to d_weights = [0] * n,
i.e. it resets the gradients to zero (the gradient of a batch's loss with respect to a weight is the accumulated sum of the per-sample gradients, so the accumulator has to start from zero; see the short sketch right after this list).
outputs = net(inputs) corresponds to h = dot(input[k], weights),
i.e. the forward pass that computes the prediction.
loss = criterion(outputs, labels) corresponds to loss += (label[k] - h) * (label[k] - h) / 2,
i.e. computing the loss. (In the hand-written version this value is only there so we can see how training is going; the gradient formula never reads it. In PyTorch the line is still required, though, because loss.backward() starts backpropagation from the loss tensor.)
loss.backward() corresponds to d_weights = [d_weights[j] + (label[k] - h) * input[k][j] for j in range(n)],
i.e. the backward pass that computes (and accumulates) the gradients.
optimizer.step() corresponds to weights = [weights[j] + alpha * d_weights[j] for j in range(n)],
i.e. updating all the parameters.
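A quick way to see the accumulation behaviour mentioned above (the tiny tensor w is made up just for this demo): in PyTorch, backward() adds into .grad, and zeroing is what optimizer.zero_grad() does for every parameter.

import torch

w = torch.tensor(1.0, requires_grad=True)
(2 * w).backward()       # d(2w)/dw = 2
print(w.grad)            # tensor(2.)
(3 * w).backward()       # gradients accumulate: 2 + 3
print(w.grad)            # tensor(5.)
w.grad.zero_()           # roughly what optimizer.zero_grad() does for each parameter
print(w.grad)            # tensor(0.)

(Newer PyTorch versions may set .grad to None instead of zeroing it, which has the same effect for the next backward pass.)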
If anything here is wrong, please point it out. Comments and discussion are welcome.
When training a model with PyTorch, you will often see the three lines optimizer.zero_grad(), loss.backward() and optimizer.step() appear one after another inside the training loop, for example:

model = MyModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
for epoch in r...
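The excerpt is cut off at the loop header. A minimal sketch of how such a loop typically continues (num_epochs and a DataLoader named train_loader yielding (inputs, labels) pairs are assumptions, not part of the excerpt):

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()              # clear the gradients left over from the previous step
        outputs = model(inputs)            # forward pass
        loss = criterion(outputs, labels)  # compute the loss
        loss.backward()                    # backward pass: accumulate gradients into .grad
        optimizer.step()                   # update the parameters (SGD with momentum here)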
Reposted from Zhihu: "Why do gradients have to be manually zeroed before backpropagation in PyTorch?" - Pascal's answer - Zhihu
A traditional training function processes one batch like this:
for i, (images, target) in enumerate(train_loader):
    # 1. input output
    images = images.cuda(non_blocking=True)
    target ...
Because of the way backward() works in PyTorch, gradients are accumulated when they are propagated back through the network parameters, not replaced. When processing a batch there is normally no need to mix its gradients with those of other batches, so zero_grad() is called once per batch to reset the parameter gradients to 0.
Alternatively, if you clear the gradients only every two or more batches instead of after every batch, the accumulated gradient corresponds to a larger effective batch size. That larger batch size would otherwise place higher demands on the hardware, so this trick suits situations where you need a bigger batch than your memory allows; a sketch of the pattern follows.
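A minimal sketch of that gradient-accumulation pattern, reusing the images/target loop from the excerpt above (accum_steps and the model/criterion/optimizer names are assumptions for illustration):

accum_steps = 4                                 # effective batch size = accum_steps * batch_size
optimizer.zero_grad()
for i, (images, target) in enumerate(train_loader):
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    loss = criterion(model(images), target)
    (loss / accum_steps).backward()             # scale so the accumulated gradient stays an average
    if (i + 1) % accum_steps == 0:              # update only every accum_steps batches
        optimizer.step()
        optimizer.zero_grad()                   # then clear the accumulated gradients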
optimizer.zero_grad()
Blog: noahsnail.com | CSDN | 简书
1. Introduction
In PyTorch there are two common ways to zero the gradients of the model parameters: model.zero_grad() and optimizer.zero_grad(). Both appear all the time in training code, so what is the difference between them?
2. model.zero_grad()
model.zero_grad() sets the gradients of all model parameters to 0. Its source code looks like this:
for p in self.parameters():
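The excerpt breaks off after the loop header. In older PyTorch releases the body of nn.Module.zero_grad() is roughly the following (newer releases can also set .grad to None via set_to_none=True instead of zeroing in place):

for p in self.parameters():
    if p.grad is not None:
        p.grad.detach_()
        p.grad.zero_()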
When training a model with PyTorch, you usually call four functions in turn while iterating over the epochs: optimizer.zero_grad(), loss.backward(), optimizer.step() and lr_scheduler.step(). They are used as shown below:
train_loader = DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True
)
model = myModel()
criterion = nn.CrossEntropyLoss()
Function of optimizer.zero_grad():
it initializes the gradients to zero, i.e. it resets the derivative of the loss with respect to the weights to 0.
Why does every batch need optimizer.zero_grad()?
Because of how backward() is computed in PyTorch: when gradients are propagated back through the network parameters, they are accumulated rather than replaced. There is obviously no need to mix the gradients of two different batches together, so zero_grad() has to be called once per batch.
Steps that every batch always executes:
optimizer.zero_grad()  # initialize the gradients to zero
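A minimal sketch of the four calls in order (the SGD/StepLR settings and num_epochs are placeholders, since the excerpt does not show how the optimizer and scheduler were built):

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()              # 1. clear the old gradients
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()                    # 2. compute the gradients for this batch
        optimizer.step()                   # 3. update the parameters
    lr_scheduler.step()                    # 4. adjust the learning rate once per epoch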
Most of the code comes from https://morvanzhou.github.io/tutorials/machine-learning/torch. These are just things I jotted down while getting started with PyTorch and threw up here casually (didn't even bother with markdown, lazy, lazy, lazy).
tensor:
import torch
data = [[1, 2], [3, 4]]
tensor = torch.FloatTensor(data)
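For what it's worth, the same tensor can also be built with the newer torch.tensor constructor; torch.FloatTensor still works, it is just the older API:

tensor = torch.tensor(data, dtype=torch.float32)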
How do I reduce the learning rate (lr) of the Adam optimizer in the code below so that the step size is not too large, and add Batch Normalization layers to the model so that it converges more stably?

class MLP(torch.nn.Module):
    def __init__(self, weight_decay=0.01):
        super(MLP, self).__init__()
        self.fc1 = torch.nn.Linear(178, 100)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(100, 50)
        self.fc3 = torch.nn.Linear(50, 5)
        self.dropout = torch.nn.Dropout(p=0.1)
        self.weight_decay = weight_decay

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

    def regularization_loss(self):
        reg_loss = torch.tensor(0.).to(device)
        for name, param in self.named_parameters():
            if 'weight' in name:
                reg_loss += self.weight_decay * torch.norm(param)
        return reg_loss

# device, num_epochs and train_loader are assumed to be defined elsewhere
model = MLP()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        outputs = model(inputs.to(device))
        loss = criterion(outputs, labels.to(device))
        loss += model.regularization_loss()
        loss.backward()
        optimizer.step()
To reduce the learning rate (lr) of the Adam optimizer, set the lr argument when building the optimizer: optimizer = torch.optim.Adam(model.parameters(), lr=0.0001). To add Batch Normalization layers so that the model converges more stably, insert a BatchNorm1d layer (torch.nn.BatchNorm1d) after each hidden linear layer (torch.nn.Linear):

class MLP(torch.nn.Module):
    def __init__(self, weight_decay=0.01):
        super(MLP, self).__init__()
        self.fc1 = torch.nn.Linear(178, 100)
        self.bn1 = torch.nn.BatchNorm1d(100)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(100, 50)
        self.bn2 = torch.nn.BatchNorm1d(50)
        self.fc3 = torch.nn.Linear(50, 5)
        self.dropout = torch.nn.Dropout(p=0.1)
        self.weight_decay = weight_decay

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

    def regularization_loss(self):
        reg_loss = torch.tensor(0.).to(device)
        for name, param in self.named_parameters():
            if 'weight' in name:
                reg_loss += self.weight_decay * torch.norm(param)
        return reg_loss

model = MLP()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        outputs = model(inputs.to(device))
        loss = criterion(outputs, labels.to(device))
        loss += model.regularization_loss()
        loss.backward()
        optimizer.step()
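One practical note on this answer: once the model contains BatchNorm (and Dropout) layers, training and evaluation behave differently, so the usual model.train() / model.eval() switches matter. A minimal sketch (val_inputs is a made-up name for some validation batch):

model.train()            # BatchNorm uses batch statistics, Dropout is active
# ... run the training loop above ...
model.eval()             # BatchNorm uses running statistics, Dropout is disabled
with torch.no_grad():
    val_outputs = model(val_inputs.to(device))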