How to Fine-Tune a Transformer Architecture NLP Model



This article explains how to fine-tune a pretrained Transformer architecture model for natural language processing. More specifically, it shows how to fine-tune a condensed version of the pretrained BERT model to create a binary classifier for a subset of the IMDB movie review dataset. The goal is sentiment analysis: accept the text of a movie review (for example, "This movie was a waste of my time.") and output class 0 (negative review) or class 1 (positive review).

You can think of a pretrained Transformer Architecture (TA) model as a kind of English language expert. But that expert knows nothing about movies, so you provide additional training to fine-tune the model so it understands the difference between a positive movie review and a negative review.

There are several pretrained TA models for natural language processing (NLP). The two best known are BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer). TA models are very large, with millions of weight and bias parameters.

TA models have revolutionized NLP, but TA systems are extremely complex, and implementing them from scratch can require hundreds or even thousands of man-hours. Hugging Face (HF) is an open source library that provides pretrained models and a set of APIs for working with them. The HF library greatly reduces the difficulty of implementing an NLP system based on a TA model (see "How to Create a Transformer Architecture Model for Natural Language Processing").

A good way to understand what this article covers is to look at the screenshot of the demo program in Figure 1. The demo program begins by loading a 200-item subset of the IMDB movie review dataset into memory. The full dataset has 50,000 movie reviews: 25,000 for training and 25,000 for testing, each split evenly into 12,500 positive and 12,500 negative reviews. Using the full dataset is very time consuming, so the demo uses only the first 100 positive training reviews and the first 100 negative training reviews.

The movie reviews are in raw text form. The reviews are read into memory and converted to a data structure of integer token IDs. For example, the token ID of the word "movie" is 3185. The tokenized movie review data structure is fed to a PyTorch Dataset object, which is used to send batches of tokenized reviews and their associated class labels to the training code.

After the movie review data is prepared, the demo loads the pretrained DistilBERT model into memory. DistilBERT is a condensed ("distilled") but still large version of the huge BERT model. The uncased version of DistilBERT has about 66 million weights and biases. The demo then fine-tunes the pretrained model by training it with standard PyTorch techniques. At the end of the run, the demo saves the fine-tuned model to file.
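As a preview of that fine-tuning step, the sketch below shows the general pattern: load the pretrained DistilBERT classification model, train it with an ordinary PyTorch loop, and save the weights. The DataLoader named train_loader, the learning rate, the epoch count and the save path are illustrative assumptions, not the demo's exact values.

# minimal fine-tuning sketch (assumptions noted above), not the demo's exact Listing 1
import torch
from transformers import DistilBertForSequenceClassification

device = torch.device("cpu")
model = DistilBertForSequenceClassification.from_pretrained(
  "distilbert-base-uncased")        # default config has 2 output classes
model.to(device)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-5)

for epoch in range(3):              # a few epochs is enough for the tiny 200-item subset
  for batch in train_loader:        # train_loader: assumed DataLoader over the IMDbDataset
    optimizer.zero_grad()
    input_ids = batch["input_ids"].to(device)
    mask = batch["attention_mask"].to(device)
    labels = batch["label"].to(device)
    outputs = model(input_ids, attention_mask=mask, labels=labels)
    outputs.loss.backward()         # HF models compute the loss when labels are supplied
    optimizer.step()

torch.save(model.state_dict(), ".\\Models\\imdb_fine_tuned.pt")  # hypothetical save path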

This article assumes you have intermediate or better familiarity with a C-family programming language, preferably Python, and basic familiarity with PyTorch, but it does not assume you know anything about the Hugging Face library. The complete source code for the demo program is presented in this article, and the code is also available in the accompanying file download.

To run the demo program, you must have Python, PyTorch and HF installed on your machine. The demo program was developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which includes Python 3.7.6), PyTorch version 1.8.0 for CPU installed via pip, and HF transformers version 4.11.3. Installation is not trivial. You can find detailed step-by-step installation instructions for PyTorch in my blog post. Installing the HF transformers library is relatively simple; you can issue the shell command "pip install transformers".

Overall Program Structure
The demo program structure is:
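The sketch below is my reconstruction of that structure from the description that follows; the file name and the exact signatures are assumptions, not the article's verbatim listing.

# imdb_fine_tune.py  (file name is an assumption)
# overall structure of the demo program, reconstructed from the description below
import torch
from torch.utils.data import Dataset
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

class IMDbDataset(Dataset):
  # holds tokenized reviews plus class labels and serves them in batches
  def __init__(self, encodings, labels): ...
  def __getitem__(self, idx): ...
  def __len__(self): ...

def read_imdb(root_dir):
  # helper: read movie review text files and class labels from file into memory
  ...

def main():
  # 0. suppress warnings, set random seeds
  # 1. load the 200-item subset of IMDB reviews
  # 2. tokenize the review text
  # 3. wrap the tokenized data in an IMDbDataset
  # 4. load the pretrained DistilBERT model
  # 5. fine-tune the model with a standard PyTorch training loop
  # 6. save the fine-tuned model
  ...

if __name__ == "__main__":
  main()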

The HF library can use either PyTorch or TensorFlow as its backend; this demo uses PyTorch. IMDbDataset is a program-defined PyTorch Dataset class used to hold the training data and serve it in batches. The read_imdb() function is a helper that reads the movie review text data from file into memory. All program control logic is contained in a single main() function.

The complete demo code, with a few minor edits to save space, is shown in Listing 1. I prefer to indent with two spaces rather than the standard four. The backslash character is used for line continuation to break up long statements.

Listing 1: The complete fine-tuning demo program

Obtaining the IMDB Training Data
The raw IMDB movie review data is available online in compressed form as aclImdb_v1.tar.gz, which must be unzipped and extracted using a utility such as WinZip or 7-Zip on Windows systems. Both utilities are good, but I prefer 7-Zip.

Unzipping produces a root folder named aclImdb ("ACL IMDB"). The root folder contains subdirectories named test and train, and the test and train directories each contain subdirectories named pos and neg. Each of the pos and neg directories contains 12,500 text files, where each file is one movie review.

The file names look like 0_9.txt and 113_3.txt, where the part of the name before the underscore is a 0-based index and the part after the underscore is the actual numeric rating of the review. Ratings of 7, 8, 9 and 10 are positive reviews (all in the pos directory), and ratings of 1, 2, 3 and 4 are negative reviews. The IMDB dataset does not include movie reviews with ratings of 5 or 6 (neither strongly positive nor strongly negative). Note that the actual 1-10 ratings are not used in the demo.

To reduce the IMDB dataset to a size that is manageable for experimentation, I used only the training files and deleted all reviews except the first 100 positive and the first 100 negative, leaving a total of 200 training reviews.

The program-defined read_imdb() function reads the review texts and class labels into memory. It is implemented as:
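A plausible implementation, consistent with the pathlib-based description that follows, is sketched below; it is a reconstruction rather than the verbatim listing, and it assumes pos and neg subfolders under the supplied root directory.

# plausible read_imdb() implementation (a reconstruction, not the verbatim listing)
from pathlib import Path

def read_imdb(root_dir):
  reviews, labels = [], []
  root_dir = Path(root_dir)
  for label_dir in ["pos", "neg"]:
    for f in (root_dir / label_dir).iterdir():
      reviews.append(f.read_text(encoding="utf-8"))
      labels.append(0 if label_dir == "neg" else 1)
  return reviews, labels   # (list of review texts, list of 0/1 class labels)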

The Python pathlib library is relatively new (available since Python 3.4) and more robust than the older os library (which still works). The result of the read_imdb() function is a Python tuple whose first item is a list of review texts and whose second item is a list of the associated class labels, where 0 means a negative review and 1 means a positive review.

Loading the Movie Reviews
The demo program begins execution with the following statements:
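Roughly, and with the seed value and print message as assumptions, those startup statements look like this:

# assumed startup statements: suppress warnings, set seeds for reproducibility
import warnings
import numpy as np
import torch

print("Begin fine-tune of DistilBERT for IMDB sentiment ")
warnings.filterwarnings("ignore")  # keep the Figure 1 output clean
torch.manual_seed(1)               # seed value is an assumption
np.random.seed(1)
device = torch.device("cpu")       # the demo uses the CPU version of PyTorch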

Suppressing warnings is not a good practice, but I did it to keep the output in the Figure 1 screenshot tidy. Setting the torch and NumPy random number seeds is not required, but it is usually a good idea to try to make program runs reproducible.

The movie review texts and their class labels are loaded into memory like so:
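A call consistent with the earlier description would look like the following; the data path is an assumption.

# assumed call to the read_imdb() helper on the 200-item training subset
train_texts, train_labels = read_imdb(".\\DataSmall\\aclImdb\\train")  # path is an assumption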

Tokenizing the Review Text
The demo creates a tokenizer object and then tokenizes the review text with the following two statements:
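Those two statements look like the sketch below, assuming the DistilBertTokenizerFast class from the HF transformers library and the variable names used earlier.

# create the tokenizer associated with distilbert-base-uncased, then tokenize the reviews
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True)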

In general, each HF model has its own associated tokenizer that splits the source sequence text into tokens. This is different from earlier language systems, which often used a general-purpose tokenizer such as spaCy. Therefore, the demo loads the distilbert-base-uncased tokenizer. The result of calling the tokenizer on the IMDB reviews is a data structure with two components: an input_ids field that holds the integer IDs corresponding to the words in the review text, and an attention_mask field that holds 0s and 1s indicating which tokens are active and which should be ignored (typically padding tokens).

The tokenized text, the attention masks and the list of class labels are passed to the IMDbDataset constructor:
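For example, with variable names following the earlier sketches (and therefore assumptions):

# wrap the tokenized reviews and class labels in the program-defined Dataset
train_dataset = IMDbDataset(train_encodings, train_labels)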

The returned result is a PyTorch Dataset object that can serve up training items in batches. Each underlying item is a dictionary with the keys 'input_ids', 'attention_mask' and 'label'.
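A minimal IMDbDataset implementation consistent with that description, plus a DataLoader to serve the batches, might look like this; the batch size is an assumption.

# sketch of the program-defined IMDbDataset class (a reconstruction, not the verbatim listing)
import torch
from torch.utils.data import Dataset, DataLoader

class IMDbDataset(Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings   # tokenizer output: input_ids and attention_mask
    self.labels = labels         # list of 0/1 class labels

  def __getitem__(self, idx):
    # each item is a dictionary of tensors with keys input_ids, attention_mask, label
    item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
    item["label"] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)  # batch size assumed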
