NLP masked language model fine-tuning with transformers and the IMDB dataset

Using Hugging Face's Datasets, Tokenizers, and Transformers libraries
Author

Eric Vincent

Published

November 2, 2024

!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate

import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# import checkpoint and model

from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
# move the model to the GPU if one is available
model.to(device)
device

# import and create tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# import the dataset 

from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset
Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
# check out a sample of the dataset

sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")

'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the moon" It reminds me of Sinatra's song High Hopes, it is fun and inspirational. The Music is great throughout and my favorite song is sung by the King, Hank (bing Crosby) and Sir "Saggy" Sagamore. OVerall a great family movie or even a great Date movie. This is a movie you can watch over and over again. The princess played by Rhonda Fleming is gorgeous. I love this movie!! If you liked Danny Kaye in the Court Jester then you will definitely like this movie.'
'>>> Label: 1'

'>>> Review: George P. Cosmatos' "Rambo: First Blood Part II" is pure wish-fulfillment. The United States clearly didn't win the war in Vietnam. They caused damage to this country beyond the imaginable and this movie continues the fairy story of the oh-so innocent soldiers. The only bad guys were the leaders of the nation, who made this war happen. The character of Rambo is perfect to notice this. He is extremely patriotic, bemoans that US-Americans didn't appreciate and celebrate the achievements of the single soldier, but has nothing but distrust for leading officers and politicians. Like every film that defends the war (e.g. "We Were Soldiers") also this one avoids the need to give a comprehensible reason for the engagement in South Asia. And for that matter also the reason for every single US-American soldier that was there. Instead, Rambo gets to take revenge for the wounds of a whole nation. It would have been better to work on how to deal with the memories, rather than suppressing them. "Do we get to win this time?" Yes, you do.'
'>>> Label: 0'
# explore dataset with pandas

imdb_dataset.set_format("pandas")
df_train_imdb = imdb_dataset["train"][:]
df_train_imdb[df_train_imdb['label'] == 0]
text label
0 I rented I AM CURIOUS-YELLOW from my video sto... 0
1 "I Am Curious: Yellow" is a risible and preten... 0
2 If only to avoid making this type of film in t... 0
3 This film was probably inspired by Godard's Ma... 0
4 Oh, brother...after hearing about this ridicul... 0
... ... ...
12495 There just isn't enough here. There a few funn... 0
12496 Tainted look at kibbutz life<br /><br />This f... 0
12497 I saw this movie, just now, not when it was re... 0
12498 Any film which begins with a cowhand shagging ... 0
12499 Yes AWA wrestling how can anyone forget about ... 0

12500 rows × 2 columns

frequencies = (
    df_train_imdb["label"]
    .value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={"index": "label", "label": "frequency"})
)
frequencies.head()
   label  frequency
0      0      12500
1      1      12500
imdb_dataset.reset_format()
imdb_dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
# tokenize the reviews

def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        # keep the word IDs so we can apply whole word masking later on
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# batched=True lets the fast tokenizer process examples in batches;
# the text and sentiment label columns are dropped since MLM only needs the tokens
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets
Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})
tokenizer.model_max_length
512
# use a chunk size smaller than the model's 512-token limit so batches fit comfortably on the GPU
chunk_size = 128
# check the token counts of a few tokenized reviews

tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")
'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")
'>>> Concatenated reviews length: 800'
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})
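group_texts drops the trailing chunk whenever it is shorter than chunk_size, so a little text from each batch is discarded. If that is a concern, the last chunk could be padded instead of dropped; a minimal sketch, not used in the rest of this notebook (the group_texts_keep_last name and padding choices are illustrative):

# Sketch of a padded alternative: keep the trailing chunk by padding it to
# chunk_size instead of dropping it ([PAD] positions are ignored thanks to the
# attention mask).
def group_texts_keep_last(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    pad_length = chunk_size - len(result["input_ids"][-1])
    if pad_length > 0:
        result["input_ids"][-1] += [tokenizer.pad_token_id] * pad_length
        result["attention_mask"][-1] += [0] * pad_length
        result["word_ids"][-1] += [None] * pad_length
    result["labels"] = result["input_ids"].copy()
    return result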
tokenizer.decode(lm_datasets["train"][1]["input_ids"])
"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"
tokenizer.decode(lm_datasets["train"][1]["labels"])
"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"
from transformers import DataCollatorForLanguageModeling

# mask 15% of tokens; following BERT, most selected tokens become [MASK] and a
# few are replaced with random tokens or left unchanged
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

'>>> [CLS] i rented [MASK] [MASK] curious - yellow from my [MASK] store because of all the controversy that [MASK] it whennya was first released in 1967. i also heard that at first [MASK] [MASK] seized by u. s. customs if it ever tried to enter this [MASK], therefore being a fan of films [MASK] " [MASK] " i really had to see this for myself [MASK] < br / > < br / > the plot is [MASK] around [MASK] young swedish [MASK] student [MASK] lena who wants to learn everything she can about [MASK]. [MASK] embarrassed she wants info focus her attentions to making some [MASK] of documentary on what the average swede thought about [MASK] [MASK] issues [MASK]'

'>>> as the vietnam war [MASK] race issues transplant the [MASK] states. in between asking politicians マ ordinary denizens of stockholm about their opinions on [MASK], [MASK] has sex with her drama teacher, [MASK], [MASK] married men. < br / > < br / > what kills me about i am curious consecrated yellow [MASK] that 40 het ago [MASK] this was considered pornographic. really, explanation sex and nu [MASK] scenes are few and far between, even then it's not shot [MASK] some cheaply made porn [MASK]. while my countrymen [MASK] find it shocking, in [MASK] sex [MASK] nudity are a major staple in swedish cinema. even ingmar bergman,'
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

'>>> [CLS] i rented i am curious - yellow [MASK] my video store because [MASK] all the controversy that surrounded it when it was first released in 1967 [MASK] i also heard that at first it was seized by u. s. customs if it ever tried to enter [MASK] [MASK], therefore being [MASK] [MASK] of films considered " controversial [MASK] i really [MASK] to see this [MASK] myself. < [MASK] [MASK] > < br / > the plot is centered around a young swedish drama student [MASK] [MASK] [MASK] [MASK] to learn everything she can about life [MASK] in particular she [MASK] to focus her attentions to making [MASK] sort of [MASK] on what [MASK] average [MASK] [MASK] thought about certain political issues such'

'>>> as the vietnam [MASK] and race [MASK] in [MASK] [MASK] states. in between asking politicians and ordinary [MASK] [MASK] [MASK] of stockholm [MASK] their opinions on politics, she has sex with her drama teacher, [MASK], and married men. [MASK] [MASK] / > < br / > what kills [MASK] about i am curious [MASK] yellow is that [MASK] years ago, this was considered [MASK]. really, the sex and [MASK] [MASK] scenes are few [MASK] far between, even then it'[MASK] not shot [MASK] some cheaply [MASK] porno [MASK] while my countrymen mind [MASK] it shocking, in [MASK] sex [MASK] nudity are a [MASK] [MASK] in swedish cinema. [MASK] ingmar [MASK],'
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})
from huggingface_hub import notebook_login

notebook_login()
Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
Cloning https://huggingface.co/evincent18/distilbert-base-uncased-finetuned-imdb into local empty directory.
Using cuda_amp half precision backend
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 64
[16/16 00:02]
>>> Perplexity: 21.94
trainer.train()
The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 10000
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 471
  Number of trainable parameters = 66985530
[471/471 02:42, Epoch 3/3]
Epoch    Training Loss    Validation Loss
1        2.708600         2.489761
2        2.579600         2.422981
3        2.526900         2.435405

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 64
[16/16 01:19]


Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=471, training_loss=2.604787395258618, metrics={'train_runtime': 163.2332, 'train_samples_per_second': 183.786, 'train_steps_per_second': 2.885, 'total_flos': 994208670720000.0, 'train_loss': 2.604787395258618, 'epoch': 3.0})
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 64
[16/16 00:01]
>>> Perplexity: 11.85
trainer.push_to_hub()
# fine-tune with Accelerate

def insert_random_mask(batch):
    # apply the random masking once up front so the evaluation set is fixed
    # across epochs, instead of being re-masked on every evaluation pass
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)
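Because the masking was applied once by the map above, every evaluation pass now sees exactly the same [MASK] positions, which makes perplexity numbers comparable across epochs. A quick peek at one fixed example (a sketch):

# decode one deterministically masked evaluation example
print(tokenizer.decode(eval_dataset[0]["input_ids"]))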
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)
# new & clean version of pretrained model 
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForMaskedLM.

All the weights of DistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForMaskedLM for predictions without further training.
# use AdamW optimizer

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
# specify learning rate scheduler

from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name
'evincent18/distilbert-base-uncased-finetuned-imdb-accelerate'
model_name
'distilbert-base-uncased-finetuned-imdb-accelerate'
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)
Cloning https://huggingface.co/evincent18/distilbert-base-uncased-finetuned-imdb-accelerate into local empty directory.
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        # repeat the per-batch loss so gather() yields one value per example
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
Configuration saved in distilbert-base-uncased-finetuned-imdb-accelerate/config.json
>>> Epoch 0: Perplexity: 11.623045597165843
Model weights saved in distilbert-base-uncased-finetuned-imdb-accelerate/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-imdb-accelerate/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-imdb-accelerate/special_tokens_map.json
Configuration saved in distilbert-base-uncased-finetuned-imdb-accelerate/config.json
>>> Epoch 1: Perplexity: 11.121389323527449
Model weights saved in distilbert-base-uncased-finetuned-imdb-accelerate/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-imdb-accelerate/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-imdb-accelerate/special_tokens_map.json
Configuration saved in distilbert-base-uncased-finetuned-imdb-accelerate/config.json
>>> Epoch 2: Perplexity: 10.923441783481074
Model weights saved in distilbert-base-uncased-finetuned-imdb-accelerate/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-imdb-accelerate/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-imdb-accelerate/special_tokens_map.json
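The per-epoch pushes in the loop use blocking=False, so they return before the upload finishes. A final blocking push (a sketch, reusing the repo object created above) ensures the last checkpoint is fully on the Hub:

# wait for the final upload to complete before leaving the notebook
repo.push_to_hub(commit_message="End of training", blocking=True)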
# load a fine-tuned fill-mask model with the pipeline API (here the reference checkpoint from the Hugging Face course)

from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)
text = "This is a great [MASK]."

preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")
>>> this is a great film.
>>> this is a great movie.
>>> this is a great idea.
>>> this is a great deal.
>>> this is a great adventure.
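The pipeline above loads the reference checkpoint published for the Hugging Face course. To query one of the models pushed earlier in this notebook instead, the same call works with that repo name (a sketch, assuming the pushes above completed):

# fill-mask pipeline from the checkpoint fine-tuned with Accelerate above
my_mask_filler = pipeline(
    "fill-mask", model="evincent18/distilbert-base-uncased-finetuned-imdb-accelerate"
)
for pred in my_mask_filler("This is a great [MASK]."):
    print(f">>> {pred['sequence']}")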