I'm new to PyTorch and I've been trying to implement a text summarization network. When I call loss.backward(), the following error appears:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10, 1, 1, 400]], which is output 0 of UnsqueezeBackward0, is at version 98; expected version 97 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
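To show the class of error I mean (this is just a toy snippet, not my model): PyTorch raises it whenever a tensor that some operation saved for its backward pass is later modified in place, which bumps the tensor's version counter.

import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid saves its output y for the backward pass
y[0] = 0.0             # in-place write bumps y's version counter
y.sum().backward()     # RuntimeError: ... modified by an inplace operation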
It's a seq2seq model, and I think the problem comes from this code snippet:
final_dists = torch.zeros((batch_size, dec_max_len, extended_vsize))  # to hold the model outputs with the extended vocab
attn_dists = torch.zeros((batch_size, dec_max_len, enc_max_len))      # to retain the attention weights over decoder steps
coverages = torch.zeros((batch_size, dec_max_len, enc_max_len))       # the coverages are retained to compute the coverage loss
inp = self.emb_dropout(self.embedding(dec_batch[:, 0]))               # starting input: <SOS>, shape [batch_size]
# self.prev_coverage is the accumulated coverage
coverage = None  # initially None, but accumulates
with torch.autograd.set_detect_anomaly(True):
    for i in range(1, dec_max_len):
        # NOTE: the outputs, attn_dists, p_gens assignments start from i=1 (DON'T FORGET!)
        vocab_dists, hidden, attn_dists_tmp, p_gen, coverage = self.decoder(inp, hidden, enc_outputs, enc_lens, coverage)
        attn_dists[:, i, :] = attn_dists_tmp.squeeze(1)
        coverages[:, i, :] = coverage.squeeze(1)
        # vocab_dists: [batch_size, 1, dec_vocab_size]  Note: this is the normalized probability
        # hidden: [1, batch_size, dec_hid_dim]
        # attn_dists_tmp: [batch_size, 1, enc_max_len]
        # p_gen: [batch_size, 1]
        # coverage: [batch_size, 1, enc_max_len]
        # ===================================================================
        # Compute the final dist in the pointer-generator network by extending the vocabulary
        vocab_dists_p = p_gen.unsqueeze(-1) * vocab_dists            # [batch_size, 1, dec_vocab_size]; keep vocab_dists for teacher forcing
        attn_dists_tmp = (1 - p_gen).unsqueeze(-1) * attn_dists_tmp  # [batch_size, 1, enc_max_len]; keep attn_dists for later use
        extra_zeros = torch.zeros((batch_size, 1, max_art_oovs)).to(self.device)
        vocab_dists_extended = torch.cat((vocab_dists_p, extra_zeros), dim=2)  # [batch_size, 1, extended_vsize]
        attn_dists_projected = torch.zeros((batch_size, 1, extended_vsize)).to(self.device)
        indices = enc_batch_extend_vocab.clone().unsqueeze(1)        # [batch_size, 1, enc_max_len]
        attn_dists_projected = attn_dists_projected.scatter(2, indices, attn_dists_tmp)
        # We need this, otherwise we would modify a leaf Variable in place
        # attn_dists_projected_clone = attn_dists_projected.clone()
        # attn_dists_projected_clone.scatter_(2, indices, attn_dists_tmp)  # this would project the attention weights
        # attn_dists_projected.scatter_(2, indices, attn_dists_tmp)
        final_dists[:, i, :] = vocab_dists_extended.squeeze(1) + attn_dists_projected.squeeze(1)
        # ===================================================================
        # Teacher forcing: decide whether to feed the prediction or the decoder sequence label
        if random.random() < teacher_forcing_ratio:
            inp = self.emb_dropout(self.embedding(dec_batch[:, i]))
        else:
            inp = self.emb_dropout(self.embedding(vocab_dists.squeeze(1).argmax(1)))
If I remove the for loop and just do one step of updating attn_dists[:, 1, :] etc., with a toy loss computed from the outputs returned by forward, then it works fine.
Does anyone have any idea what is wrong here? I don't see an in-place operation here. Many thanks!
From looking at your code, the problem likely comes from the following lines:
attn_dists[:,i,:]=attn_dists_tmp.squeeze(1)
coverages[:,i,:]=coverage.squeeze(1)
You are performing an in-place operation that conflicts with the graph created by PyTorch for backprop. It should be solved by concatenating the new info at every loop iteration instead (be aware you may run out of memory very soon!):
attn_dists = torch.cat((attn_dists, attn_dists_tmp), dim=1)  # attn_dists_tmp is [batch_size, 1, enc_max_len]
coverages = torch.cat((coverages, coverage), dim=1)          # coverage is [batch_size, 1, enc_max_len]
You should change their initialization as well, otherwise you will end up with a tensor that is twice the size you were accounting for.
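For example, a minimal runnable sketch of the pattern (the per-step computation and the sizes here are just stand-ins, not your decoder): collect the per-step tensors in Python lists and concatenate once after the loop. In your code you would append attn_dists_tmp, coverage and the per-step final distribution the same way, which avoids both the in-place slice assignment and the doubled-size initialization issue.

import torch

batch_size, dec_max_len, enc_max_len = 2, 4, 5          # made-up sizes
x = torch.randn(batch_size, enc_max_len, requires_grad=True)

attn_steps = []
for i in range(1, dec_max_len):
    # pretend per-step decoder output, shape [batch_size, 1, enc_max_len]
    step = torch.softmax(x * i, dim=-1).unsqueeze(1)
    attn_steps.append(step)

# concatenate along the decoder-step dimension after the loop
attn_dists = torch.cat(attn_steps, dim=1)               # [batch_size, dec_max_len-1, enc_max_len]
attn_dists.sum().backward()                              # backprops cleanly, no in-place error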