Generating articles using AI has never been easier, thanks to the power of pre-trained models like GPT-2 and mBART. In this post, we'll walk you through the step-by-step process of using these models to generate an article on a given topic and then translate it into another language. mBART (Multilingual BART) is a pre-trained transformer-based sequence-to-sequence model developed by Facebook AI. It is pre-trained as a denoising autoencoder on monolingual data from many different languages and then fine-tuned on multilingual machine translation, which allows it to translate between many language pairs.
  • The pip install transformers command in step 1 installs Hugging Face's transformers library, which lets you easily use pre-trained transformer models such as GPT-2 and mBART in your code.
  • In step 2, we import PyTorch and the GPT-2 classes by running import torch and from transformers import GPT2LMHeadModel, GPT2Tokenizer. PyTorch is an open-source machine learning library that is used to train and run models like GPT-2. The GPT2LMHeadModel and GPT2Tokenizer classes from the transformers library are used to load the GPT-2 model and tokenizer, respectively.
  • In step 3, we load the GPT-2 tokenizer and model by running tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large") and model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id). The tokenizer is used to encode the input text and the model is used to generate the output text. We are using the "gpt2-large" version of the model, which has more parameters (about 774 million) than the smaller versions and generally produces more fluent text.
  • In step 4, we set the topic for the article by running topic = 'Benefits of Sleeping Early'. This is the prompt from which the GPT-2 model will generate an article.
  • In step 5, we encode the input topic by running input_ids = tokenizer.encode(topic, return_tensors='pt'). This converts the topic text into the token IDs that the GPT-2 model operates on. The return_tensors='pt' argument specifies that the input should be returned as a PyTorch tensor, which is the format the GPT-2 model expects.
  • In step 6, we generate the article by running output = model.generate(input_ids, max_length=200, num_beams=30, no_repeat_ngram_size=4, early_stopping=True). The model.generate() function generates text based on the input provided. The max_length argument sets the maximum length of the generated sequence in tokens (including the prompt). The num_beams argument sets the number of beams, i.e. candidate sequences explored in parallel during beam search; more beams consider more alternatives at the cost of slower generation. The no_repeat_ngram_size argument prevents any sequence of 4 tokens from appearing more than once in the output, which reduces repetition. With early_stopping=True, the beam search stops as soon as enough complete candidates (ones that have reached the end-of-sequence token) have been found, instead of continuing to explore.
  • In step 7, we print the generated article by running print(tokenizer.decode(output[0], skip_special_tokens=True)). The tokenizer.decode() function converts the numerical output of the model back into text. The skip_special_tokens argument is set to True to remove any special tokens that the model may have added to the output. A complete, runnable sketch of steps 1 to 7 is shown after this list.
  • To translate the generated article, we swap GPT-2 for mBART by loading the mBART-50 one-to-many translation model and its tokenizer, with English ("en_XX") as the source language: from transformers import MBartForConditionalGeneration, MBart50TokenizerFast, then model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt") and tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX"). A sketch of the full translation call also follows the list.
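Putting steps 1 to 7 together, the article-generation part looks like the following minimal sketch. It only combines the code already described above (after pip install transformers from step 1); the topic string is just the example prompt and can be replaced with any other.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Step 3: load the gpt2-large tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

# Step 4: set the topic (the prompt for generation)
topic = 'Benefits of Sleeping Early'

# Step 5: encode the prompt into token IDs as a PyTorch tensor
input_ids = tokenizer.encode(topic, return_tensors='pt')

# Step 6: generate up to 200 tokens with beam search
output = model.generate(input_ids, max_length=200, num_beams=30,
                        no_repeat_ngram_size=4, early_stopping=True)

# Step 7: decode the token IDs back into text and print the article
article = tokenizer.decode(output[0], skip_special_tokens=True)
print(article)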
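For the translation step, here is a minimal sketch that follows the usage documented for the facebook/mbart-large-50-one-to-many-mmt checkpoint. The target language code "fr_XX" (French) and the variable names mbart_model and mbart_tokenizer are illustrative choices rather than part of the original code; the article variable is the text produced by the previous snippet.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the one-to-many mBART-50 model and tokenizer, with English as the source language
mbart_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
mbart_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX")

# Encode the English article, then translate it by forcing the decoder
# to start with the target-language token ("fr_XX" here, as an example)
encoded = mbart_tokenizer(article, return_tensors="pt")
translated_tokens = mbart_model.generate(
    **encoded,
    forced_bos_token_id=mbart_tokenizer.lang_code_to_id["fr_XX"],
)
print(mbart_tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])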