Sentiment Analysis Fine-Tuning with BERT

By Yangming Li

Introduction

In this article, we explore the process of fine-tuning a BERT model for sentiment analysis using the PEFT library and the LoRA (Low-Rank Adaptation) technique. We walk through the practical steps and provide a detailed code example.

Import Libraries

First, we import the necessary libraries for handling data, models, and evaluation.

import argparse
import os

The argparse and os modules are used for handling command-line arguments and file operations.

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

These modules from PyTorch are used for model training, optimization, and data loading.

from peft import (get_peft_config, get_peft_model, get_peft_model_state_dict,
    set_peft_model_state_dict, PeftType, PrefixTuningConfig, PromptEncoderConfig, 
    PromptTuningConfig, LoraConfig)

PEFT is used to implement parameter-efficient fine-tuning techniques such as Prefix Tuning, Prompt Tuning, and LoRA (Low-Rank Adaptation).

import evaluate
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup, set_seed)
from tqdm import tqdm

The evaluate library is used for metrics, datasets for loading data, transformers for handling BERT models, and tqdm for progress bars.

Load and Prepare Dataset

#!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv

The wget command that downloads the dataset is commented out; run it once if the CSV file is not already present locally.

data_file = "./ChnSentiCorp_htl_all.csv"
dataset = load_dataset("csv", data_files=data_file)

The dataset is loaded from a CSV file using the datasets library.
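
As a quick sanity check, you can inspect the loaded object and a sample record (the ChnSentiCorp_htl_all CSV has label and review columns):

print(dataset)              # shows the available splits and column names
print(dataset["train"][0])  # a single record, e.g. {'label': 1, 'review': '...'}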

dataset = dataset.filter(lambda x: x["review"] is not None)
datasets = dataset["train"].train_test_split(0.2, seed=123)

Filter out entries without reviews and split the data into training and test sets (80/20 split).

Tokenizer Setup

model_name_or_path = "/data/pretrained_models/bert/bert-base-uncased"

Specify the local path to the pre-trained BERT checkpoint. Since ChnSentiCorp contains Chinese reviews, a Chinese checkpoint such as bert-base-chinese is the more natural fit; any local BERT-style checkpoint can be substituted here.

if any(k in model_name_or_path for k in ("gpt", "opt", "bloom")):
    padding_side = "left"
else:
    padding_side = "right"

Determine the padding side based on the model type: decoder-only models such as GPT, OPT, and BLOOM require left padding, while encoder models like BERT use right padding.

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side)
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

Load the tokenizer with the specified padding side and set the padding token ID if it is not already defined.
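
To see what the tokenizer produces before mapping the whole dataset, you can encode a single review from the training split (the key names below are typical for BERT tokenizers):

sample = datasets["train"][0]["review"]
encoded = tokenizer(sample, truncation=True, max_length=128)
print(encoded.keys())             # typically dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(encoded["input_ids"][:10])  # the first few token ids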

Tokenization and Dataset Preparation

max_length = 128  # maximum sequence length used for truncation (assumed value)

def process_function(examples):
    tokenized_examples = tokenizer(examples["review"], truncation=True, max_length=max_length)
    tokenized_examples["labels"] = examples["label"]
    return tokenized_examples

Define a function that tokenizes the review text and attaches the labels. The max_length variable, set to 128 here, caps the tokenized sequence length.

tokenized_datasets = datasets.map(process_function, batched=True, remove_columns=datasets["train"].column_names)

Tokenize the dataset using the defined function and remove unnecessary columns.
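
After the map call, the raw text columns are gone and only model inputs remain; a quick check (exact column names can vary slightly by tokenizer):

print(tokenized_datasets["train"].column_names)
# expected along the lines of ['input_ids', 'token_type_ids', 'attention_mask', 'labels']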

Define Metrics

accuracy_metric = evaluate.load("accuracy")

Load the accuracy metric from the evaluate library.

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

Define a function to compute accuracy using predictions and reference labels.
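
The function can be exercised on made-up data to confirm it behaves as expected (the arrays below are fake logits and labels, purely for illustration):

import numpy as np

dummy_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7]])  # pretend model outputs
dummy_labels = np.array([1, 0, 0])                              # pretend references
print(compute_metrics((dummy_logits, dummy_labels)))            # {'accuracy': 0.666...}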

Dataloader Preparation

def collate_fn(examples):
    return tokenizer.pad(examples, padding="longest", return_tensors="pt")

Define a collation function to pad the tokenized inputs for batching.

batch_size = 64
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
eval_dataloader = DataLoader(tokenized_datasets["test"], shuffle=False, collate_fn=collate_fn, batch_size=batch_size)

Create training and evaluation data loaders with the specified batch size.
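
Pulling a single batch is an easy way to verify the collation and padding; the sequence length varies with the longest review in each batch:

batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})
# e.g. {'input_ids': torch.Size([64, seq_len]), 'attention_mask': torch.Size([64, seq_len]), ..., 'labels': torch.Size([64])}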

PEFT Configuration and Model Setup

p_type = "lora"

Specify the PEFT type (LoRA in this case).

if p_type == "prefix-tuning":
    peft_type = PeftType.PREFIX_TUNING
    peft_config = PrefixTuningConfig(task_type="SEQ_CLS", num_virtual_tokens=20)
elif p_type == "prompt-tuning":
    peft_type = PeftType.PROMPT_TUNING
    peft_config = PromptTuningConfig(task_type="SEQ_CLS", num_virtual_tokens=20)
elif p_type == "p-tuning":
    peft_type = PeftType.P_TUNING
    peft_config = PromptEncoderConfig(task_type="SEQ_CLS", num_virtual_tokens=20, encoder_hidden_size=128)
elif p_type == "lora":
    peft_type = PeftType.LORA
    peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.1)

Configure the PEFT type and parameters based on the chosen p_type.
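
For LoRA, PEFT selects target modules from a built-in mapping for known architectures; for BERT-style models these are typically the attention query and value projections. If you prefer to be explicit, or your checkpoint is not covered by the mapping, you can pass them yourself. A sketch, assuming standard BERT module names:

peft_config = LoraConfig(
    task_type="SEQ_CLS",
    inference_mode=False,
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,                      # scaling factor applied to the update
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in BERT-style models
)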

model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=2)

Load the pre-trained BERT model for sequence classification with two labels.

Model Setup and Parameter Printing

if p_type is not None:
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
else:
    def print_trainable_parameters(model):
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            num_params = param.numel()
            # DeepSpeed ZeRO-3 shards parameters, so numel() can be 0; ds_numel holds the true count
            if num_params == 0 and hasattr(param, "ds_numel"):
                num_params = param.ds_numel
            all_param += num_params
            if param.requires_grad:
                trainable_params += num_params
        print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.4f}")
    print_trainable_parameters(model)

If PEFT is enabled, wrap the model with the PEFT configuration and print the trainable parameters. Otherwise, fall back to a small helper that counts and prints the trainable parameters of the plain model.

Optimizer and Scheduler Setup

lr = 3e-4
num_epochs = 3
optimizer = AdamW(params=model.parameters(), lr=lr)

Instantiate an AdamW optimizer with the specified learning rate.
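
With PEFT, the vast majority of parameters are frozen. Passing model.parameters() still works because frozen parameters never receive gradients, but restricting the optimizer to the trainable subset keeps its state smaller; a small variation:

optimizer = AdamW(
    params=[p for p in model.parameters() if p.requires_grad],  # only LoRA and classifier-head weights
    lr=lr,
)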

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=int(0.06 * len(train_dataloader) * num_epochs),
    num_training_steps=len(train_dataloader) * num_epochs,
)

Create a linear learning rate scheduler whose warmup covers the first 6% of the total training steps.

Training and Evaluation Loop

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
metric = evaluate.load("accuracy")

Move the model to the GPU when one is available and load the accuracy metric used during evaluation.

for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

Perform forward and backward passes, optimizer steps, and scheduler updates for each epoch.

    model.eval()
    total_loss = 0.0
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
            total_loss += outputs.loss.item()
        predictions = outputs.logits.argmax(dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])

Evaluate the model by calculating loss and accuracy for the test set after each epoch.

    eval_metric = metric.compute()
    print(f"epoch {epoch}: eval loss {total_loss:.4f}, {eval_metric}")

Compute and print the evaluation metric (accuracy) for each epoch.
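
Once training finishes, the adapter weights can be saved and later re-attached to a fresh base model; a minimal sketch, assuming a local output directory of your choosing:

output_dir = "./bert-lora-sentiment"   # hypothetical output path
model.save_pretrained(output_dir)      # stores only the small adapter weights and config

from peft import PeftModel
base = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=2)
loaded_model = PeftModel.from_pretrained(base, output_dir)  # base model with the saved adapter attached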

Timing and Output

import time
start = time.time()
# ... the training and evaluation loop above runs between these two calls ...
end = time.time()
print("Elapsed time: {:.1f} minutes".format((end - start) / 60))

To time the whole run, record start before the training loop and end after it; the print statement then reports the elapsed time in minutes.