Spam Detection Using Hugging Face
In this post, we will perform spam detection by fine-tuning a pre-trained transformer with the Hugging Face libraries.
Installing Libraries
We need to install the datasets and transformers libraries:
!pip install -qq transformers[sentencepiece] datasets
Data
We will use a slightly modified version of the spam dataset that has already been pre-processed. This file can be found here.
Dataset
The datasets library can be used to create the train/test datasets. These will be used as input to the model since we are using the Trainer API by Hugging Face. Note that we could also build a PyTorch Dataset/DataLoader if we were using our own training pipeline.
We will load the csv using the load_dataset function:
from datasets import load_dataset

raw_dataset = load_dataset('csv', data_files='spam2.csv', column_names=['data', 'labels'], skiprows=1)
This creates a dataset dictionary with the (default) key train.
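We can verify this by printing the dataset dictionary; the output looks roughly like the following (shown as comments):
print(raw_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['data', 'labels'],
#         num_rows: 5572
#     })
# })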
Train/Test Split
The train_test_split method can be used to split the raw dataset into train and test sets:
dataset = raw_dataset['train'].train_test_split(test_size=0.2)
The number of samples in each split can be inspected with
len(dataset['train']), len(dataset['test'])
which returns 4457 and 1115, respectively.
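We can also index into a split to inspect an individual example; each row is a dictionary with our two columns (the message shown here is only an illustrative placeholder):
dataset['train'][0]
# {'data': 'Sorry, I will call later ...', 'labels': 0}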
Transformers
We will use the distilbert-base-uncased checkpoint for our task:
checkpoint = 'distilbert-base-uncased'
Tokenizer and Tokenization
We will use the AutoTokenizer class from the transformers library to create the tokenizer from the checkpoint:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
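As a quick sanity check, we can run the tokenizer on a single made-up message to see the fields it produces; DistilBERT uses input_ids and attention_mask (there is no token_type_ids field):
sample = tokenizer('Congratulations! You have won a free prize.')  # hypothetical message
print(sample.keys())  # dict_keys(['input_ids', 'attention_mask'])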
Now we need to tokenize our data. Note that we tokenize the individual train and test datasets, not the dataset dictionary!
dset_train_tok = dataset['train'].map(lambda x: tokenizer(x['data'], truncation=True, padding=True), batched=True)
dset_test_tok = dataset['test'].map(lambda x: tokenizer(x['data'], truncation=True, padding=True), batched=True)
We need to tell the library to return the tokenized datasets as PyTorch tensors:
dset_train_tok.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
dset_test_tok.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
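After setting the format, indexing into the tokenized dataset returns PyTorch tensors, which is what the Trainer expects (outputs shown as comments):
sample = dset_train_tok[0]
print(type(sample['input_ids']))  # <class 'torch.Tensor'>
print(sample['labels'])           # tensor(0) or tensor(1)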
Model
We will load the model with the AutoModelForSequenceClassification class since we intend to perform a classification task:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
Note that the checkpoint does not include a classification head, so those weights are newly initialized; the library will warn us that the model should be fine-tuned before it is used for predictions.
Trainer
Finally, we set up the Trainer API along with its training arguments:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
'test-trainer', # output directory where information is stored!
per_device_train_batch_size = 16,
per_device_eval_batch_size = 16,
num_train_epochs = 5,
learning_rate=2e-5,
weight_decay = 0.01
)
trainer = Trainer(
model,
training_args,
train_dataset = dset_train_tok,
eval_dataset = dset_test_tok,
tokenizer = tokenizer
)
We can train our model using
trainer.train()
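If we just want the loss on the test split after training, we can also call
trainer.evaluate()
Since we did not pass a compute_metrics function to the Trainer, this reports only the evaluation loss (plus some runtime statistics).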
Evaluation
We can make predictions on our test dataset as follows. Note that the model always outputs logits; we have to use argmax to obtain the predicted class.
import numpy as np

predictions = trainer.predict(dset_test_tok)
preds = np.argmax(predictions.predictions, axis=-1)
In the end, we can plot a confusion matrix or calculate accuracy, precision, etc. using the predictions!
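For instance, here is a minimal sketch using scikit-learn (assuming it is available in the environment):
from sklearn.metrics import accuracy_score, confusion_matrix

labels = predictions.label_ids            # true labels returned by trainer.predict
print(accuracy_score(labels, preds))      # overall accuracy
print(confusion_matrix(labels, preds))    # rows: true class, columns: predicted class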
The complete notebook can be found here.