Sentiment Analysis Fine-Tuning with BERT
By Yangming Li
Introduction
In this article, we explore the process of fine-tuning a BERT model for sentiment analysis using the PEFT library and the LoRA technique. We walk through the practical steps and provide a detailed code example.
Import Libraries
First, we import the necessary libraries for handling data, models, and evaluation.
import argparse
import os
The argparse and os modules handle command-line arguments and file operations.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
These modules from PyTorch are used for model training, optimization, and data loading.
from peft import (get_peft_config, get_peft_model, get_peft_model_state_dict,
set_peft_model_state_dict, PeftType, PrefixTuningConfig, PromptEncoderConfig,
PromptTuningConfig, LoraConfig)
PEFT implements parameter-efficient fine-tuning techniques such as Prefix Tuning, Prompt Tuning, P-Tuning, and LoRA (Low-Rank Adaptation).
import evaluate
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup, set_seed)
from tqdm import tqdm
The evaluate library is used for metrics, datasets for loading data, transformers for the BERT model and tokenizer, and tqdm for progress bars.
Load and Prepare Dataset
#!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv
Use wget to download the dataset (the command is commented out on the assumption that the file has already been downloaded).
data_file = "./ChnSentiCorp_htl_all.csv"
dataset = load_dataset("csv", data_files=data_file)
The dataset is loaded from the CSV file using the datasets library.
dataset = dataset.filter(lambda x: x["review"] is not None)
datasets = dataset["train"].train_test_split(0.2, seed=123)
Filter out entries without reviews and split the data into training and test sets (80/20 split).
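As a quick sanity check (purely illustrative; the exact split sizes depend on the downloaded CSV), the resulting splits can be inspected:
print(datasets)              # DatasetDict with "train" and "test" splits
print(datasets["train"][0])  # a single example with "label" and "review" fields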
Tokenizer Setup
model_name_or_path = "/data/pretrained_models/bert/bert-base-uncased"
Specify the path to the pre-trained BERT checkpoint. (The local path points at bert-base-uncased; for this Chinese hotel-review corpus, a Chinese checkpoint such as bert-base-chinese would be the more natural choice.)
if any(k in model_name_or_path for k in ("gpt", "opt", "bloom")):
    padding_side = "left"
else:
    padding_side = "right"
Decoder-only models such as GPT, OPT, and BLOOM are padded on the left; encoder models such as BERT are padded on the right.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side)
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
Load the tokenizer with the specified padding side and set the padding token ID if it is not already defined.
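To see what the tokenizer produces, here is a quick illustrative check (the sample sentence and max_length=32 are made up for this sketch):
sample = tokenizer("the room was clean and the staff were friendly", truncation=True, max_length=32)
print(list(sample.keys()))   # for a BERT tokenizer, typically ['input_ids', 'token_type_ids', 'attention_mask']
print(sample["input_ids"])   # token ids, wrapped in [CLS] ... [SEP]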
Tokenization and Dataset Preparation
max_length = 128  # assumed value; the original snippet relies on max_length being defined elsewhere

def process_function(examples):
    tokenized_examples = tokenizer(examples["review"], truncation=True, max_length=max_length)
    tokenized_examples["labels"] = examples["label"]
    return tokenized_examples
Define a function to tokenize the review text and map the labels.
tokenized_datasets = datasets.map(process_function, batched=True, remove_columns=datasets["train"].column_names)
Tokenize the dataset using the defined function and remove unnecessary columns.
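After mapping, the splits contain only the model inputs and the labels; a quick look (the column names shown are typical for a BERT tokenizer):
print(tokenized_datasets["train"].column_names)  # e.g. ['input_ids', 'token_type_ids', 'attention_mask', 'labels']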
Define Metrics
accuracy_metric = evaluate.load("accuracy")
Load the accuracy metric from the evaluate library.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)
Define a function to compute accuracy using predictions and reference labels.
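As a tiny illustrative check with made-up logits and labels (not part of the original script):
import numpy as np

dummy_logits = np.array([[0.1, 0.9], [0.8, 0.2]])  # predicted classes: 1 and 0
dummy_labels = np.array([1, 1])
print(compute_metrics((dummy_logits, dummy_labels)))  # {'accuracy': 0.5}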
Dataloader Preparation
def collate_fn(examples):
    return tokenizer.pad(examples, padding="longest", return_tensors="pt")
Define a collation function to pad the tokenized inputs for batching.
batch_size = 64
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
eval_dataloader = DataLoader(tokenized_datasets["test"], shuffle=False, collate_fn=collate_fn, batch_size=batch_size)
Create training and evaluation data loaders with the specified batch size.
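To confirm the collation works, one batch can be inspected (illustrative; the sequence length depends on the longest review in the batch):
batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})  # e.g. input_ids, attention_mask, labels tensors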
PEFT Configuration and Model Setup
p_type = "lora"
Specify the PEFT type (LoRA in this case).
if p_type == "prefix-tuning":
    peft_type = PeftType.PREFIX_TUNING
    peft_config = PrefixTuningConfig(task_type="SEQ_CLS", num_virtual_tokens=20)
elif p_type == "prompt-tuning":
    peft_type = PeftType.PROMPT_TUNING
    peft_config = PromptTuningConfig(task_type="SEQ_CLS", num_virtual_tokens=20)
elif p_type == "p-tuning":
    peft_type = PeftType.P_TUNING
    peft_config = PromptEncoderConfig(task_type="SEQ_CLS", num_virtual_tokens=20, encoder_hidden_size=128)
elif p_type == "lora":
    peft_type = PeftType.LORA
    peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.1)
Configure the PEFT type and parameters based on the chosen p_type.
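For intuition, LoRA replaces a full fine-tune of a weight matrix W with a small trainable low-rank update. Below is a minimal sketch of the idea (not the PEFT internals), assuming a 768-dimensional hidden size as in bert-base:
import torch

d, r, alpha = 768, 8, 16
W = torch.randn(d, d)         # frozen pre-trained weight
A = torch.randn(r, d) * 0.01  # trainable low-rank factor
B = torch.zeros(d, r)         # initialised to zero so training starts from the original W
W_adapted = W + (alpha / r) * (B @ A)         # same shape as W; only A and B are trained
print(W_adapted.shape, A.numel() + B.numel(), W.numel())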
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=2)
Load the pre-trained BERT model for sequence classification with two labels.
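Because the classification head is newly initialised on top of the pre-trained encoder, transformers typically warns that some weights were not loaded from the checkpoint; that is expected. A small illustrative check (the class name assumes a BERT checkpoint):
print(type(model).__name__)    # e.g. BertForSequenceClassification
print(model.config.num_labels) # 2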
Model Setup and Parameter Printing
if p_type is not None:
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
else:
    def print_trainable_parameters(model):
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            num_params = param.numel()
            if num_params == 0 and hasattr(param, "ds_numel"):
                num_params = param.ds_numel
            all_param += num_params
            if param.requires_grad:
                trainable_params += num_params
        print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")

    print_trainable_parameters(model)
If PEFT is enabled, wrap the model with the chosen configuration and print the trainable parameters. Otherwise, a small helper counts and prints the trainable parameters of the full model (the ds_numel check covers DeepSpeed ZeRO-partitioned parameters, whose numel() is 0).
Optimizer and Scheduler Setup
lr = 3e-4
num_epochs = 3
optimizer = AdamW(params=model.parameters(), lr=lr)
Instantiate an AdamW optimizer with the specified learning rate.
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=int(0.06 * len(train_dataloader) * num_epochs),
    num_training_steps=len(train_dataloader) * num_epochs,
)
Create a linear learning-rate scheduler that warms up over the first 6% of training steps and then decays linearly to zero.
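The schedule lengths can be checked explicitly (the numbers depend on the actual dataset size after the 80/20 split):
steps_per_epoch = len(train_dataloader)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.06 * total_steps)
print(steps_per_epoch, total_steps, warmup_steps)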
Training and Evaluation Loop
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
metric = evaluate.load("accuracy")
Move the model to the GPU (falling back to the CPU if CUDA is unavailable) and load the accuracy metric.
for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
For each batch in every epoch, perform the forward and backward passes, then step the optimizer and scheduler and reset the gradients.
    model.eval()
    total_loss = 0.0
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        predictions = outputs.logits.argmax(dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
Still inside the epoch loop, switch the model to evaluation mode, accumulate the test loss, and feed the predicted classes and reference labels to the accuracy metric.
    eval_metric = metric.compute()
    print(f"epoch {epoch} loss {total_loss}:", eval_metric)
Compute and print the evaluation metric (accuracy) for each epoch.
Timing and Output
import time

start = time.time()
# ... the training and evaluation loop above runs between these two timestamps ...
end = time.time()
print("Elapsed time: {:.1f} minutes".format((end - start) / 60))
Record the start time before the training and evaluation loop and the end time after it, then print the elapsed time in minutes.