The first (and most important) step of any fine-tuning process is data collection. Here, I extracted title-thumbnail pairs from my channel in a 2-step process.
First, I used YouTube’s search API to extract the video IDs for all the videos on my channel. Second, I used YouTube’s video API to extract the title and thumbnail URL of each of my long-form videos (i.e. longer than 3 min).
# imports
from top_secret import my_key
import requests
from isodate import parse_duration
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from datasets import DatasetDict, Dataset
channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA' # my YouTube channel ID
page_token = None # initialize page token
url = 'https://www.googleapis.com/youtube/v3/search' # YouTube search API

# extract video data across multiple search result pages
video_id_list = []
while page_token != 0:
    params = {
        "key": my_key,
        'channelId': channel_id,
        'part': ["snippet","id"],
        'order': "date",
        'maxResults': 50,
        'pageToken': page_token
    }
    response = requests.get(url, params=params)

    for raw_item in dict(response.json())['items']:

        # only execute for youtube videos
        if raw_item['id']['kind'] != "youtube#video":
            continue

        # grab video ids
        video_id_list.append(raw_item['id']['videoId'])

    try:
        # grab next page token
        page_token = dict(response.json())['nextPageToken']
    except:
        # if no next page token kill while loop
        page_token = 0
Note that you will need a YouTube API key to run the above Python code, which you can create using the Google Cloud Console. To adapt this to your channel, you just need to change the channel_id variable.
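If you'd rather not store the key in a local module like top_secret, a minimal alternative is to read it from an environment variable (the variable name YT_API_KEY below is just my choice):
# alternative to the top_secret import: read the API key from an environment variable
import os
my_key = os.environ["YT_API_KEY"]  # assumes you exported YT_API_KEY beforehand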
# extract video titles and thumbnails
url = "https://www.googleapis.com/youtube/v3/videos"
video_data_list = []

for video_id in video_id_list:
    params = {
        "part": ["snippet","contentDetails"],
        "id": video_id,
        "key": my_key,
    }
    response = requests.get(url, params=params)
    raw_dict = dict(response.json())['items'][0]

    # only process videos longer than 3 minutes
    iso_duration = raw_dict['contentDetails']["duration"]
    if parse_duration(iso_duration).total_seconds() < 180:
        continue

    # extract video data
    video_data = {}
    video_data['video_id'] = video_id
    video_data['title'] = raw_dict['snippet']['title']
    video_data['thumbnail_url'] = raw_dict['snippet']['thumbnails']['high']['url']

    # append data to list
    video_data_list.append(video_data)
As an additional step, I created negative thumbnail-title pairs. We can use these during training not only to give the model examples of which embeddings should be close together (i.e. positive pairs), but also examples of which embeddings should be far apart (i.e. negative pairs).
To do this, I computed the similarity between all possible title pairs using the Sentence Transformers library. Then, for each positive pair, I matched the least similar title as a negative example (ensuring there were no duplicates).
# store data in dataframe
df = pd.DataFrame(video_data_list)

# Load the model
model = SentenceTransformer("all-mpnet-base-v2")
# Encode all titles
embeddings = model.encode(df['title'].to_list())
# compute similarities
similarities = model.similarity(embeddings, embeddings)
# match the least similar title to each title as its negative example
similarities_argsorted = np.argsort(similarities.numpy(), axis=1)
negative_pair_index_list = []
for i in range(len(similarities)):

    # Start with the smallest similarity index for the current row
    j = 0
    index = int(similarities_argsorted[i][j])

    # Ensure the index is unique
    while index in negative_pair_index_list:
        j += 1 # Move to the next smallest index
        index = int(similarities_argsorted[i][j]) # Fetch next smallest index

    negative_pair_index_list.append(index)
# add negative pairs to df
df['title_neg'] = df['title'].iloc[negative_pair_index_list].values
Finally, I created a train-valid-test split and pushed the dataset to the Hugging Face Hub.
# Shuffle the dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into train, validation, and test sets
train_frac = 0.7
valid_frac = 0.15
test_frac = 0.15
# define train and validation size
train_size = int(train_frac * len(df))
valid_size = int(valid_frac * len(df))
# create train, validation, and test datasets
df_train = df[:train_size]
df_valid = df[train_size:train_size + valid_size]
df_test = df[train_size + valid_size:]
# Convert the pandas DataFrames back to Hugging Face Datasets
train_ds = Dataset.from_pandas(df_train)
valid_ds = Dataset.from_pandas(df_valid)
test_ds = Dataset.from_pandas(df_test)
# Combine into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_ds,
    'valid': valid_ds,
    'test': test_ds
})
# push data to hub
dataset_dict.push_to_hub("shawhin/yt-title-thumbnail-pairs")
Although we have all the data we need for fine-tuning, it is still not in a suitable format for training. More specifically, we need to convert our image URLs to PIL image objects and organize our data into (anchor, positive, negative) triplets, i.e., a thumbnail, its corresponding title, and a negative title, respectively.
We can process all three data splits (i.e. train, valid, and test) in the following way using the Hugging Face Datasets library.
from PIL import Image
from datasets import load_dataset

# load dataset
dataset = load_dataset("shawhin/yt-title-thumbnail-pairs")
# define preprocessing function
def preprocess(batch):
    """
    Preprocessing data without augmentations
    """
    # get images from urls
    image_list = [Image.open(requests.get(url, stream=True).raw)
                  for url in batch["thumbnail_url"]]

    # return columns with standard names
    return {
        "anchor": image_list,
        "positive": batch["title"],
        "negative": batch["title_neg"]
    }
# remove columns not relevant to training
columns_to_remove = [col for col in dataset['train'].column_names
                     if col not in ['anchor', 'positive', 'negative']]

# apply transformations
dataset = dataset.map(preprocess, batched=True,
                      remove_columns=columns_to_remove)
It’s important that we order our columns as (anchor, positive, negative) triplets because this is the format expected by the loss function we will use during training (which I learned the hard way).
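As a quick sanity check, you can print the column names after mapping; they should come back in the (anchor, positive, negative) order defined in the preprocess function:
# confirm the column order expected by the loss function
print(dataset["train"].column_names)
# expected: ['anchor', 'positive', 'negative']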
Training involves optimizing a model’s parameters to minimize a loss function. However, this value (i.e. a contrastive loss) is rarely helpful in assessing the model’s performance on a downstream task (e.g. matching titles to thumbnails).
A quantity that is more insightful, in this case, is the model's ability to match a given thumbnail to its correct title among several candidates. This is denoted Recall@1.
We can implement an evaluator compatible with the Sentence Transformers library to compute this metric. Since the code is quite long, I won’t paste it here, but the curious reader can find it in Cell 12 of this notebook.
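For intuition, here is a minimal sketch of what Recall@1 measures for a single split. This is a simplified stand-in for the evaluator (not the notebook code), and it assumes model is the CLIP SentenceTransformer loaded in the next section:
import torch

# encode thumbnails (anchors) and their titles (positives)
img_embs = model.encode(dataset["train"]["anchor"])    # image embeddings
txt_embs = model.encode(dataset["train"]["positive"])  # text embeddings

# cosine similarity between every thumbnail and every title
sims = model.similarity(img_embs, txt_embs)

# Recall@1 = fraction of thumbnails whose most similar title is the correct one
top1 = sims.argmax(dim=1)
recall_at_1 = (top1 == torch.arange(len(top1))).float().mean().item()
print(recall_at_1)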
# function to create new evaluator given data split
def create_recall_evaluator(set_name, k=1):
    """
    Create triplet evaluator for "train", "valid", or "test" split
    """

    return ImageTextRetrievalEvaluator(
        images=dataset[f"{set_name}"]["anchor"],
        texts=dataset[f"{set_name}"]["positive"],
        name=f"yt-title-thumbnail-{set_name}",
        k=k
    )
# Create new evaluator with Recall@k
evaluator_recall_train = create_recall_evaluator("train", k=1)
evaluator_recall_valid = create_recall_evaluator("valid", k=1)
print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))
# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.660377358490566}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.6363636363636364}
We can see the model already has decent performance out-of-the-box, with correct titles being matched 66% of the time.
There are 3 key things we must do before training the model. Namely, choose which parameters to train, pick a loss function, and set hyperparameters.
Trainable Parameters
The key limitation of this project is that I’ve only posted 76 YouTube videos (as of writing this). With the validation and test splits, this leaves only 53 examples for training.
Since we have so few training examples, limiting the number of parameters we train is a good idea. In this case, I only train the final projection layer of the model, which maps the text and image embeddings into a shared vector space. This is about 1M parameters total.
# import model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")# pick specific layers to train (note: you can add more layers to this list)
trainable_layers_list = ['projection']
# Apply freezing configuration
for name, param in model.named_parameters():

    # freeze all params
    param.requires_grad = False

    # unfreeze layers in trainable_layers_list
    if any(layer in name for layer in trainable_layers_list):
        param.requires_grad = True
# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"% of trainable parameters: {100*trainable_params/total_params:.2f}%")
# >> Total parameters: 427,616,513
# >> Trainable parameters: 1,376,256
# >> % of trainable parameters: 0.32%
Loss function
Here, I use the Multiple Negatives Ranking Loss from the Sentence Transformers library (which works with single negatives like in this case). It works by maximizing the similarity between positive pairs while minimizing the similarity between negative pairs. Here’s what the loss function looks like for the single negative case [2].
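Roughly speaking, for a single (anchor, positive, negative) triplet the loss is a softmax cross-entropy over the candidate titles:

L(a, p, n) = -\log \frac{\exp\big(s(a, p)\big)}{\exp\big(s(a, p)\big) + \exp\big(s(a, n)\big)}

where s(·, ·) is a scaled cosine similarity between the image and text embeddings. In practice, the Sentence Transformers implementation also treats the other positives in a batch as additional negatives for each anchor.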
from sentence_transformers.losses import MultipleNegativesRankingLoss

# define loss
loss = MultipleNegativesRankingLoss(model)
Hyperparameters
For hyperparameters, I experimented with a handful of choices manually and picked the choice with the best validation loss and Recall@1 performance. Here are the final choices.
from sentence_transformers import SentenceTransformerTrainingArguments

# hyperparameters
num_epochs = 2
batch_size = 16
lr = 1e-4
finetuned_model_name = "clip-title-thumbnail-embeddings"
train_args = SentenceTransformerTrainingArguments(
output_dir=f"models/{finetuned_model_name}",
num_train_epochs=num_epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
learning_rate=lr,
# Evaluation settings
eval_strategy="epoch",
eval_steps=1,
logging_steps=1,
)
With our loss and hyperparameters defined, we can train the model using the SentenceTransformerTrainer.
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    loss=loss,
    evaluator=[evaluator_recall_train, evaluator_recall_valid],
)
trainer.train()
Model training is an iterative process where you may explore dozens of models for different choices of trainable parameters, loss functions, and hyperparameters.
However, I highly recommend keeping these experiments as simple as possible. If you find yourself spending too much time tweaking training args to get your model to converge, there’s probably something fundamentally wrong with your data (speaking from experience 😅).
As a final step, we can evaluate the model’s Recall@1 score on the testing set. These data were not used for training or hyperparameter tuning, so it gives us an unbiased assessment of the model.
evaluator_recall_test = create_recall_evaluator("test")

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))
print("Test:", evaluator_recall_test(model))
# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.8490566037735849}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.9090909090909091}
# >> Test: {'yt-title-thumbnail-test_Recall@1': 0.75}
We see that the model performs well across all three datasets, with 75% Recall@1 on the test set. In other words, 75% of the time, the model correctly matches a given thumbnail to its original title. Additionally, Recall@1 on the validation set improved by about 27 percentage points (from 64% to 91%)!
Multimodal embedding models, like CLIP, unlock countless 0-shot use cases such as image classification and retrieval. Here, we saw how we can fine-tune such a model to adapt it to a specialized domain (i.e. my YouTube titles and thumbnails).
Although CLIP is a small model by today's standards (~430M parameters) and our training dataset was tiny, the final model still demonstrated strong performance on this task. This highlights the power of fine-tuning.
If you have any questions or suggestions for future content, let me know in the comments 🙂
More on Multimodal AI 👇