Fine-tuning the BERT model with social media data
Acquiring and preparing the data
The dataset we'll use comes from Kaggle and can be downloaded here: https://www.kaggle.com/datasets/farisdurrani/sentimentsearch (CC BY 4.0 license). For my experiments, I only kept the Facebook and Twitter datasets.
The following snippet reads the CSV files and saves the three splits (train, validation, test) wherever you like; we recommend saving them in Google Cloud Storage.
You can run the script with the following command:
python make_splits --output-dir gs://your-bucket/
import pandas as pd
import argparse
import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(output_dir):
    df = pd.concat([
        pd.read_csv("data/farisdurrani/twitter_filtered.csv"),
        pd.read_csv("data/farisdurrani/facebook_filtered.csv")
    ])
    df = df.dropna(subset=['sentiment'], axis=0)
    df['Target'] = df['sentiment'].apply(lambda x: 1 if x == 0 else np.sign(x) + 1).astype(int)

    df_train, df_ = train_test_split(df, stratify=df['Target'], test_size=0.2)
    df_eval, df_test = train_test_split(df_, stratify=df_['Target'], test_size=0.5)

    print(f"Data will be saved in {output_dir}")
    df_train.to_csv(output_dir + "/train.csv", index=False)
    df_eval.to_csv(output_dir + "/eval.csv", index=False)
    df_test.to_csv(output_dir + "/test.csv", index=False)

    print(f"Train : ({df_train.shape}) samples")
    print(f"Val : ({df_eval.shape}) samples")
    print(f"Test : ({df_test.shape}) samples")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-dir')
    args, _ = parser.parse_known_args()
    make_splits(args.output_dir)
The data looks roughly like this:
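If you want to take a quick look yourself, here is a minimal sketch that previews one of the splits (the gs:// path is a placeholder; the bodyText, sentiment and Target columns come from the split script above):

import pandas as pd

# Preview one split (path is a placeholder; reading gs:// paths requires gcsfs)
df_train = pd.read_csv("gs://your-bucket/train.csv")

# Columns used later in this article: the raw text and the 3-class target (0, 1, 2)
print(df_train[["bodyText", "sentiment", "Target"]].head())
print(df_train["Target"].value_counts())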
Using a small pre-trained BERT model
For our model, we use BERT-Tiny, a lightweight BERT model that has already been pre-trained on an enormous amount of data, but not necessarily on social media data, and not for sentiment analysis, which is why we fine-tune it.
It contains only two layers of 128 units; you can see the full list of models here if you want a larger size.
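If you prefer not to copy the weights to cloud storage first, here is a minimal sketch that pulls BERT-Tiny directly from the Hugging Face Hub (the repo id google/bert_uncased_L-2_H-128_A-2 is an assumption; double-check it on the Hub page):

from transformers import BertTokenizer, BertModel

# BERT-Tiny: 2 transformer layers, hidden size 128 (repo id assumed, verify on the Hub)
tokenizer = BertTokenizer.from_pretrained("google/bert_uncased_L-2_H-128_A-2")
model = BertModel.from_pretrained("google/bert_uncased_L-2_H-128_A-2")

# Optionally save a local copy, which the rest of the article loads from disk
tokenizer.save_pretrained("models/bert_uncased_L-2_H-128_A-2")
model.save_pretrained("models/bert_uncased_L-2_H-128_A-2")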
First, create main.py, a file containing all the required modules:
import pandas as pd
import argparse
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import logging
import os

os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"

def train_and_evaluate(**params):
    pass
    # will be updated as we go
Write your dependencies in a dedicated requirements.txt file:
transformers==4.40.1
torch==2.2.2
pandas==2.0.3
scikit-learn==1.3.2
gcsfs
We load two components to train the model:
- the Tokenizer, which splits the text input into tokens that BERT was trained on;
- the model itself.
Both are available from Huggingface here. You can also download them to your cloud storage, which is what I did, and load them like this:
# Load pretrained tokenizer and bert model
tokenizer = BertTokenizer.from_pretrained('models/bert_uncased_L-2_H-128_A-2/vocab.txt')
model = BertModel.from_pretrained('models/bert_uncased_L-2_H-128_A-2')
Then add the following to the file:
class SentimentBERT(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert_module = bert_model
        self.dropout = nn.Dropout(0.1)
        self.final = nn.Linear(in_features=128, out_features=3, bias=True)

        # Uncomment the lines below if you only want to retrain certain layers.
        # self.bert_module.requires_grad_(False)
        # for param in self.bert_module.encoder.parameters():
        #     param.requires_grad = True

    def forward(self, inputs):
        ids, mask, token_type_ids = inputs['ids'], inputs['mask'], inputs['token_type_ids']
        # print(ids.size(), mask.size(), token_type_ids.size())
        x = self.bert_module(ids, mask, token_type_ids)
        x = self.dropout(x['pooler_output'])
        out = self.final(x)
        return out
Let's take a short break here. If you want to reuse an existing model, you have a few options:
- Transfer learning: you freeze the weights of the model and use it as a "feature extractor", adding extra layers downstream. This is often used in computer vision, where models like VGG, Xception, etc. can be reused to train custom models on small datasets.
- Fine-tuning: you unfreeze all or part of the model weights and retrain the model on a custom dataset. This is the recommended approach when training a custom LLM.
You can learn more about transfer learning and fine-tuning here.
For this model, we chose to unfreeze the whole model (the commented-out lines above show how to retrain only certain layers), but feel free to freeze one or more layers of the pre-trained BERT module and see how it affects performance.
The key here is to add a fully connected layer after the BERT module to "link" it to our classification task, hence a final layer with 3 units. This lets us reuse the pre-trained BERT weights and adapt the model to our task.
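For comparison, here is a minimal sketch of the transfer-learning variant, where the BERT module is frozen and only the classification head is trained (this mirrors the commented-out lines in SentimentBERT; the optimizer and learning rate are placeholders):

import torch

# Freeze the whole BERT module: only the final Linear layer keeps gradients
classifier = SentimentBERT(bert_model=model)
classifier.bert_module.requires_grad_(False)

trainable = [p for p in classifier.parameters() if p.requires_grad]
print(f"Trainable tensors: {len(trainable)}")  # the final layer's weight and bias only

# The optimizer then only sees the classification head
optimizer = torch.optim.Adam(trainable, lr=1e-3)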
Creating a data loader
To create the data loader, we need the tokenizer we loaded above. The tokenizer takes a string as input and returns several outputs, the most important of which are the input_ids: the tokens used to encode the sentence. A token can be a word or part of a word. For example, the word "looking" is made of two tokens: "look" and "##ing".
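A quick illustration with the tokenizer loaded earlier (whether a given word is kept whole or split into word pieces depends on the vocabulary, so treat the printed values as indicative):

print(tokenizer.tokenize("looking"))
# e.g. ['look', '##ing'] if "looking" is not in the vocabulary as a whole word

encoded = tokenizer.encode_plus("looking", add_special_tokens=True)
print(encoded["input_ids"])
# the ids of [CLS], the word piece(s), and [SEP]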
Now, let's create a data loader module to process the dataset.
class BertDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=100):
        super(BertDataset, self).__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.target = self.df['Target']
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        X = self.df['bodyText'].values[idx]
        y = self.target.values[idx]

        inputs = self.tokenizer.encode_plus(
            X,
            pad_to_max_length=True,
            add_special_tokens=True,
            return_attention_mask=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]

        x = {
            'ids': torch.tensor(ids, dtype=torch.long).to(DEVICE),
            'mask': torch.tensor(mask, dtype=torch.long).to(DEVICE),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long).to(DEVICE)
        }
        y = torch.tensor(y, dtype=torch.long).to(DEVICE)

        return x, y
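A minimal usage check, assuming df_train was loaded from the train.csv split created earlier and the tokenizer from above:

from torch.utils.data import DataLoader

# Wrap one split in the dataset and pull a single batch to check the shapes
train_ds = BertDataset(df_train, tokenizer, max_length=100)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

inputs, labels = next(iter(train_loader))
print(inputs['ids'].shape, inputs['mask'].shape, labels.shape)
# expected: torch.Size([32, 100]) torch.Size([32, 100]) torch.Size([32])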
Creating the main script to train the model
First, let's define two functions that handle the training and evaluation steps.
def train(epoch, model, dataloader, loss_fn, optimizer, max_steps=None):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 50
    start_time = time.time()

    for idx, (inputs, label) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(inputs)
        loss = loss_fn(predicted_label, label)
        loss.backward()
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0:
            elapsed = time.time() - start_time
            print(
                "Epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f} | loss {:8.3f} ({:.3f}s)".format(
                    epoch, idx, len(dataloader), total_acc / total_count, loss.item(), elapsed
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

        if max_steps is not None:
            if idx == max_steps:
                return {'loss': loss.item(), 'acc': total_acc / total_count}

    return {'loss': loss.item(), 'acc': total_acc / total_count}
def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (inputs, label) in enumerate(dataloader):
            predicted_label = model(inputs)
            loss = loss_fn(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return {'loss': loss.item(), 'acc': total_acc / total_count}
We're getting closer to having our main script up and running, so it's time to start wiring the pieces together. We have the following:
- a BertDataset class that handles loading the data
- a SentimentBERT model that takes the Tiny-BERT model and adds layers for our custom use case
- train() and evaluate() functions that handle the training and evaluation steps
- a train_and_evaluate() function that brings it all together
We use argparse so the script can be launched with arguments, typically the training/evaluation/test files to run the model on a given dataset, the path where the model will be saved, and training-related parameters (an example command is shown after the full script below).
import pandas as pd
import time
import torch.nn as nn
import torch
import logging
import numpy as np
import argparse

from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel

logging.basicConfig(format='%(asctime)s [%(levelname)s]: %(message)s', level=logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)

# --- CONSTANTS ---
BERT_MODEL_NAME = 'small_bert/bert_en_uncased_L-2_H-128_A-2'

if torch.cuda.is_available():
    logging.info(f"GPU: {torch.cuda.get_device_name(0)} is available.")
    DEVICE = torch.device('cuda')
else:
    logging.info("No GPU available. Training will run on CPU.")
    DEVICE = torch.device('cpu')
# --- Data preparation and tokenization ---
class BertDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=100):
        super(BertDataset, self).__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.target = self.df['Target']
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        X = self.df['bodyText'].values[idx]
        y = self.target.values[idx]

        inputs = self.tokenizer.encode_plus(
            X,
            pad_to_max_length=True,
            add_special_tokens=True,
            return_attention_mask=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]

        x = {
            'ids': torch.tensor(ids, dtype=torch.long).to(DEVICE),
            'mask': torch.tensor(mask, dtype=torch.long).to(DEVICE),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long).to(DEVICE)
        }
        y = torch.tensor(y, dtype=torch.long).to(DEVICE)

        return x, y
# --- Model definition ---
class SentimentBERT(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert_module = bert_model
        self.dropout = nn.Dropout(0.1)
        self.final = nn.Linear(in_features=128, out_features=3, bias=True)

    def forward(self, inputs):
        ids, mask, token_type_ids = inputs['ids'], inputs['mask'], inputs['token_type_ids']
        x = self.bert_module(ids, mask, token_type_ids)
        x = self.dropout(x['pooler_output'])
        out = self.final(x)
        return out
# --- Training loop ---
def train(epoch, model, dataloader, loss_fn, optimizer, max_steps=None):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 50
    start_time = time.time()

    for idx, (inputs, label) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(inputs)
        loss = loss_fn(predicted_label, label)
        loss.backward()
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0:
            elapsed = time.time() - start_time
            print(
                "Epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f} | loss {:8.3f} ({:.3f}s)".format(
                    epoch, idx, len(dataloader), total_acc / total_count, loss.item(), elapsed
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

        if max_steps is not None:
            if idx == max_steps:
                return {'loss': loss.item(), 'acc': total_acc / total_count}

    return {'loss': loss.item(), 'acc': total_acc / total_count}
# --- Validation loop ---
def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (inputs, label) in enumerate(dataloader):
            predicted_label = model(inputs)
            loss = loss_fn(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return {'loss': loss.item(), 'acc': total_acc / total_count}
# --- Main function ---
def train_and_evaluate(**params):

    logging.info("running with the following params :")
    logging.info(params)

    # Load pretrained tokenizer and bert model
    # update the paths to whichever you are using
    tokenizer = BertTokenizer.from_pretrained('models/bert_uncased_L-2_H-128_A-2/vocab.txt')
    model = BertModel.from_pretrained('models/bert_uncased_L-2_H-128_A-2')

    # Training parameters
    epochs = int(params.get('epochs'))
    batch_size = int(params.get('batch_size'))
    learning_rate = float(params.get('learning_rate'))

    # Load the data
    df_train = pd.read_csv(params.get('training_file'))
    df_eval = pd.read_csv(params.get('validation_file'))
    df_test = pd.read_csv(params.get('testing_file'))

    # Create dataloaders
    train_ds = BertDataset(df_train, tokenizer, max_length=100)
    train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
    eval_ds = BertDataset(df_eval, tokenizer, max_length=100)
    eval_loader = DataLoader(dataset=eval_ds, batch_size=batch_size)
    test_ds = BertDataset(df_test, tokenizer, max_length=100)
    test_loader = DataLoader(dataset=test_ds, batch_size=batch_size)

    # Create the model
    classifier = SentimentBERT(bert_model=model).to(DEVICE)
    total_parameters = sum([np.prod(p.size()) for p in classifier.parameters()])
    model_parameters = filter(lambda p: p.requires_grad, classifier.parameters())
    # use a distinct name so we do not shadow the params dict used below
    trainable_parameters = sum([np.prod(p.size()) for p in model_parameters])
    logging.info(f"Total params : {total_parameters} - Trainable : {trainable_parameters} ({trainable_parameters/total_parameters*100}% of total)")

    # Optimizer and loss functions
    optimizer = torch.optim.Adam([p for p in classifier.parameters() if p.requires_grad], learning_rate)
    loss_fn = nn.CrossEntropyLoss()

    # If dry run, we only run one epoch with a single step
    logging.info(f'Training model with {BERT_MODEL_NAME}')
    if args.dry_run:
        logging.info("Dry run mode")
        epochs = 1
        steps_per_epoch = 1
    else:
        steps_per_epoch = None

    # Action !
    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()
        train_metrics = train(epoch, classifier, train_loader, loss_fn=loss_fn, optimizer=optimizer, max_steps=steps_per_epoch)
        eval_metrics = evaluate(classifier, eval_loader, loss_fn=loss_fn)

        print("-" * 59)
        print(
            "End of epoch {:3d} - time: {:5.2f}s - loss: {:.4f} - accuracy: {:.4f} - valid_loss: {:.4f} - valid accuracy {:.4f} ".format(
                epoch, time.time() - epoch_start_time, train_metrics['loss'], train_metrics['acc'], eval_metrics['loss'], eval_metrics['acc']
            )
        )
        print("-" * 59)

    if args.dry_run:
        # If dry run, we do not run the evaluation
        return None

    test_metrics = evaluate(classifier, test_loader, loss_fn=loss_fn)

    metrics = {
        'train': train_metrics,
        'val': eval_metrics,
        'test': test_metrics,
    }
    logging.info(metrics)

    # save model and architecture to single file
    if params.get('job_dir') is None:
        logging.warning("No job dir provided, model will not be saved")
    else:
        logging.info("Saving model to {} ".format(params.get('job_dir')))
        torch.save(classifier.state_dict(), params.get('job_dir'))

    logging.info("Bye bye")
if __name__ == '__main__':

    # Create arguments here
    parser = argparse.ArgumentParser()
    parser.add_argument('--training-file', required=True, type=str)
    parser.add_argument('--validation-file', required=True, type=str)
    parser.add_argument('--testing-file', type=str)
    parser.add_argument('--job-dir', type=str)
    parser.add_argument('--epochs', type=float, default=2)
    parser.add_argument('--batch-size', type=float, default=1024)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--dry-run', action="store_true")

    # Parse them
    args, _ = parser.parse_known_args()

    # Execute training
    train_and_evaluate(**vars(args))
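You can then launch a training run with something like the command below (file paths, the output file name and the hyperparameter values are placeholders; add --dry-run first to sanity-check the pipeline on a single step):

python main.py \
    --training-file gs://your-bucket/train.csv \
    --validation-file gs://your-bucket/eval.csv \
    --testing-file gs://your-bucket/test.csv \
    --job-dir model.pt \
    --epochs 10 \
    --batch-size 1024 \
    --learning-rate 0.01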
That's great, but unfortunately this model takes a long time to train. In fact, with about 4.7 million parameters to train, one step takes about 3 seconds on a 16GB MacBook Pro with an Intel chip.
With 1238 steps to run and 10 epochs to complete, 3 seconds per step adds up to a very long time…
No GPU, no party.