Generating articles with AI has never been easier, thanks to pre-trained models like GPT-2 and mBART. In this post, we'll walk through, step by step, how to use these models to generate an article on a given topic and then translate it into one of the many languages mBART supports.
mBART (Multilingual BART, where BART stands for Bidirectional and Auto-Regressive Transformers) is a pre-trained transformer-based sequence-to-sequence model developed by Facebook AI. It is pre-trained as a denoising autoencoder on monolingual corpora in many different languages and then fine-tuned for multilingual machine translation; the checkpoint used later in this post is the mBART-50 one-to-many variant, which translates from English into 49 other languages. This makes it well suited for tasks that involve translating between multiple languages.
The
pip install transformers
command in step 1 installs Hugging Face's transformers library, which lets you load and run pre-trained transformer models such as GPT-2 and mBART with just a few lines of code.
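The transformers library needs a deep learning backend to actually run the models. Since the code below uses PyTorch, you may also need to install it; a reasonable setup, assuming neither package is installed yet, is:
pip install torch transformers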
In step 2, we import PyTorch and the GPT-2 classes by running
import torch
and
from transformers import GPT2LMHeadModel, GPT2Tokenizer
. PyTorch is an open-source machine learning library that is used to train and run the GPT-2 model. The
GPT2LMHeadModel
and
GPT2Tokenizer
classes from the transformers library are used to load the GPT-2 model and tokenizer, respectively.
In step 3, we load the GPT-2 tokenizer and model by running
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
and
model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)
. The tokenizer is used to encode the input text and the model is used to generate the output text. We are using the "gpt2-large" checkpoint, which has more parameters than the smaller GPT-2 variants (all sizes are trained on the same data), so it generally produces more coherent text at the cost of more memory and slower generation.
In step 4, we set the topic for the article by running
topic = 'Benefits of Sleeping Early'
. This string serves as the prompt that the GPT-2 model will continue in order to produce the article.
In step 5, we encode the input topic by running
input_ids = tokenizer.encode(topic, return_tensors='pt')
. This converts the topic text into a sequence of token IDs, the numerical format the GPT-2 model works with. The
return_tensors='pt'
argument specifies that the input should be returned as a PyTorch tensor, which is the format that the GPT-2 model requires.
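To see what this encoding looks like, you can inspect the tensor and map the IDs back to tokens; the exact IDs and subword pieces depend on the GPT-2 vocabulary:
print(input_ids)                                                # a 1 x N tensor of integer token IDs
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))   # the corresponding subword tokens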
In step 6, we generate the article by running
output = model.generate(input_ids, max_length=200, num_beams=30, no_repeat_ngram_size=4, early_stopping=True)
. The
model.generate()
function generates text based on the input provided. The
max_length
argument specifies the maximum length of the generated sequence in tokens (not words), including the prompt. The
num_beams
argument sets the number of beams used for beam search, i.e. how many candidate sequences the model keeps at each generation step; higher values explore more alternatives but are slower. The
no_repeat_ngram_size
argument prevents repetition: with a value of 4, no sequence of 4 tokens is allowed to appear more than once in the generated text. The
early_stopping
argument is set to True so that beam search stops as soon as num_beams complete candidate sequences have been found, rather than continuing to search for potentially better ones.
In step 7, we print the generated article by running
print(tokenizer.decode(output[0], skip_special_tokens=True))
. The
tokenizer.decode()
function converts the numerical output of the model back into text. The
skip_special_tokens
argument is set to True to remove any special tokens that the model may have added to the output.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large"
model = GPT2LMHeadModel.from_pretrained("gpt2-large",
pad_token_id=tokenizer.eos_token_id))
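Putting the remaining steps together, the rest of the GPT-2 part of the script (steps 4 through 7 above, with the same arguments) looks like this; the decoded text is stored in a variable, here called article, so it can be passed to mBART afterwards:

topic = 'Benefits of Sleeping Early'
input_ids = tokenizer.encode(topic, return_tensors='pt')
output = model.generate(input_ids, max_length=200, num_beams=30, no_repeat_ngram_size=4, early_stopping=True)
# decode the generated token IDs back into plain text
article = tokenizer.decode(output[0], skip_special_tokens=True)
print(article)

Next, we load mBART and its tokenizer for the translation step: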
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX")
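The translation call itself is a minimal sketch along the following lines, continuing from the loading code above and assuming the generated English article is stored in the article variable from the earlier snippet; Hindi ("hi_IN") is used here only as an example target language:

# encode the English article; src_lang="en_XX" was set when loading the tokenizer
model_inputs = tokenizer(article, return_tensors="pt")
# force the decoder to start with the target language code so mBART translates into that language
generated_tokens = model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"])
# decode the translated token IDs back into text
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])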