Generally, an object detection model is trained with a fixed vocabulary, meaning it can only recognize a predefined set of object categories. In our pipeline, however, we can’t predict in advance which objects will appear in the image, so we need a detection model that is versatile and capable of recognizing a wide range of object classes. To achieve this, I use OWL-ViT [11], an open-vocabulary object detection model. It requires text prompts that specify the objects to be detected.
Another challenge that needs to be addressed is obtaining a high-level idea of the objects present in the image before utilizing the OWL-ViT model, as it requires a text prompt describing the objects. This is where VLMs come to the rescue! First, we pass the image to the VLM with a prompt to identify the high-level objects in the image. These detected objects are then used as text prompts, along with the image, for the OWL-ViT model to generate detections. Next, we plot the detections as bounding boxes on the same image and pass this updated image to the VLM, prompting it to generate a caption. The code for inference is partially adapted from [12].
# Load model directly
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

processor = AutoProcessor.from_pretrained("google/owlvit-base-patch32")
model = AutoModelForZeroShotObjectDetection.from_pretrained("google/owlvit-base-patch32")
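Before wiring OWL-ViT into the full pipeline, it can be handy to sanity-check it on a single image with a couple of hand-written queries. The snippet below is a minimal sketch using the transformers zero-shot object detection pipeline; the image path and candidate labels are made up purely for illustration.
from transformers import pipeline

# Zero-shot object detection pipeline backed by the same OWL-ViT checkpoint
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

# Hypothetical image and queries, just to illustrate the input/output format
predictions = detector(
    "downloaded_images/example.jpg",  # hypothetical path
    candidate_labels=["An image of a dog", "An image of a bicycle"],
)

# Each prediction is a dict with "score", "label" and a "box" in pixel coordinates
for pred in predictions:
    print(pred["label"], round(pred["score"], 2), pred["box"])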
I detect the objects present in each image using the VLM:
IMAGE_QUALITY = "high"
system_prompt_object_detection = """You are provided with an image. You must identify all important objects in the image, and provide a standardized list of objects in the image.
Return your output as follows:
Output: object_1, object_2"""

user_prompt = "Extract the objects from the provided image:"
detected_objects = process_images_in_parallel(
    image_paths,
    system_prompt=system_prompt_object_detection,
    user_prompt=user_prompt,
    model="gpt-4o-mini",
    few_shot_prompt=None,
    detail=IMAGE_QUALITY,
    max_workers=5,
)
detected_objects_cleaned = {}

for key, value in detected_objects.items():
    detected_objects_cleaned[key] = list(set([x.strip() for x in value.replace("Output: ", "").split(",")]))
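For illustration, the cleaned dictionary maps each image path to a deduplicated list of object names that will be used as text queries. The entry below is hypothetical; the actual paths and labels depend on your images.
# Hypothetical entry in detected_objects_cleaned (paths and labels are made up):
# detected_objects_cleaned["downloaded_images/000000025.jpg"]
# -> ["dog", "frisbee", "grass"]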
The detected objects are now passed as text prompts to the OWL-ViT model to obtain predictions for the images. I implement a helper function that predicts the bounding boxes for an image and then draws them on the original image.
from PIL import Image, ImageDraw, ImageFont
import numpy as np
import torch

def detect_and_draw_bounding_boxes(
    image_path,
    text_queries,
    model,
    processor,
    output_path,
    score_threshold=0.2
):
    """
    Detect objects in an image and draw bounding boxes over the original image using PIL.

    Parameters:
    - image_path (str): Path to the image file.
    - text_queries (list of str): List of text queries to process.
    - model: Pretrained model to use for detection.
    - processor: Processor to preprocess image and text queries.
    - output_path (str): Path to save the output image with bounding boxes.
    - score_threshold (float): Threshold to filter out low-confidence predictions.

    Returns:
    - output_image_pil: A PIL Image object with bounding boxes and labels drawn.
    """
    img = Image.open(image_path).convert("RGB")
    orig_w, orig_h = img.size  # original width, height

    inputs = processor(
        text=text_queries,
        images=img,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to("cpu")

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    logits = torch.max(outputs["logits"][0], dim=-1)      # max over text queries, shape (num_boxes,)
    scores = torch.sigmoid(logits.values).cpu().numpy()   # convert logits to probabilities
    labels = logits.indices.cpu().numpy()                 # index of the best-matching text query per box
    boxes_norm = outputs["pred_boxes"][0].cpu().numpy()   # normalized (cx, cy, w, h), shape (num_boxes, 4)

    # Convert normalized center-format boxes to absolute corner coordinates
    converted_boxes = []
    for box in boxes_norm:
        cx, cy, w, h = box
        cx_abs = cx * orig_w
        cy_abs = cy * orig_h
        w_abs = w * orig_w
        h_abs = h * orig_h
        x1 = cx_abs - w_abs / 2.0
        y1 = cy_abs - h_abs / 2.0
        x2 = cx_abs + w_abs / 2.0
        y2 = cy_abs + h_abs / 2.0
        converted_boxes.append((x1, y1, x2, y2))

    draw = ImageDraw.Draw(img)
    for score, (x1, y1, x2, y2), label_idx in zip(scores, converted_boxes, labels):
        if score < score_threshold:
            continue
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)

        label_text = text_queries[label_idx].replace("An image of ", "")
        text_str = f"{label_text}: {score:.2f}"
        # textbbox replaces ImageDraw.textsize, which was removed in Pillow 10
        left, top, right, bottom = draw.textbbox((0, 0), text_str)
        text_w, text_h = right - left, bottom - top
        text_x, text_y = x1, max(0, y1 - text_h)  # place the label slightly above the box
        draw.rectangle(
            [text_x, text_y, text_x + text_w, text_y + text_h],
            fill="white"
        )
        draw.text((text_x, text_y), text_str, fill="red")

    img.save(output_path, "JPEG")
    return img
from tqdm import tqdm

for key, value in tqdm(detected_objects_cleaned.items()):
    value = ["An image of " + x for x in value]
    detect_and_draw_bounding_boxes(key, value, model, processor, "images_with_bounding_boxes/" + key.split("/")[-1], score_threshold=0.15)
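As an aside, recent transformers releases also ship a built-in post-processing helper that performs essentially the same score thresholding and box rescaling as the manual conversion inside the helper above. The sketch below assumes the img, inputs, model, processor and text_queries objects from detect_and_draw_bounding_boxes; the exact method name and behaviour can vary slightly between transformers versions.
import torch

with torch.no_grad():
    outputs = model(**inputs)

# target_sizes expects (height, width) of the original image
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.15
)[0]

for score, label_idx, box in zip(results["scores"], results["labels"], results["boxes"]):
    # boxes are already in absolute (x1, y1, x2, y2) pixel coordinates
    print(f"{score:.2f}", text_queries[label_idx], [round(float(v), 1) for v in box])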
The images with the detected objects plotted are now passed to the VLM for captioning:
IMAGE_QUALITY = "high"

image_paths_obj_detected_guided = [x.replace("downloaded_images", "images_with_bounding_boxes") for x in image_paths]

system_prompt = """You are a helpful assistant that can analyze images and provide captions. You are provided with images that also contain bounding box annotations of the important objects in them, along with their labels.
Analyze the overall image and the provided bounding box information and provide an appropriate caption for the image."""

user_prompt = "Please analyze the following image:"

obj_det_zero_shot_high_quality_captions = process_images_in_parallel(
    image_paths_obj_detected_guided,
    system_prompt=system_prompt,
    user_prompt=user_prompt,
    model="gpt-4o-mini",
    few_shot_prompt=None,
    detail=IMAGE_QUALITY,
    max_workers=5,
)
In this task, given the simple nature of the images we use, the location of the objects does not add any significant information to the VLM. However, Object Detection Guided Prompting can be a powerful tool for more complex tasks, such as Document Understanding, where layout information can be effectively provided through object detection to the VLM for further processing. Additionally, Semantic Segmentation can be employed as a method to guide prompting by providing segmentation masks to the VLM.
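To make the layout idea concrete, one simple way to hand layout information to a VLM is to serialize the detected boxes and labels into the text prompt that accompanies the image. The helper below is purely a hypothetical sketch; the function name and the detections are made up for illustration.
def layout_to_prompt(detections):
    """
    Hypothetical helper: serialize (label, (x1, y1, x2, y2)) detections into a
    textual layout description that can be appended to the VLM prompt.
    """
    lines = [
        f"- {label} at (x1={x1:.0f}, y1={y1:.0f}, x2={x2:.0f}, y2={y2:.0f})"
        for label, (x1, y1, x2, y2) in detections
    ]
    return "Detected layout elements:\n" + "\n".join(lines)

# Made-up detections from a document page, purely for illustration
print(layout_to_prompt([("title", (40, 30, 560, 80)), ("table", (40, 120, 560, 420))]))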
VLMs are a powerful tool in the arsenal of AI engineers and scientists for solving a variety of problems that require a combination of vision and text skills. In this article, I explore prompting strategies in the context of VLMs to effectively use these models for tasks such as image captioning. This is by no means an exhaustive or comprehensive list of prompting strategies. One thing that has become increasingly clear with the advancements in GenAI is the limitless potential for creative and innovative approaches to prompt and guide LLMs and VLMs in solving tasks.
[1] J. Chen, H. Guo, K. Yi, B. Li and M. Elhoseiny, “VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 18009–18019, doi: 10.1109/CVPR52688.2022.01750.
[2] Luo, Z., Xi, Y., Zhang, R., & Ma, J. (2022). A Frustratingly Simple Approach for End-to-End Image Captioning.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22). Curran Associates Inc., Red Hook, NY, USA, Article 1723, 23716–23736.
[4] https://huggingface.co/blog/vision_language_pretraining
[5] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
[6] https://platform.openai.com/docs/guides/vision
[7] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
[8] https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/traditional/#rouge-score
[9] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
[10] https://aws.amazon.com/blogs/machine-learning/foundational-vision-models-and-visual-prompt-engineering-for-autonomous-driving-applications/
[11] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. 2022. Simple Open-Vocabulary Object Detection. In Computer Vision — ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X. Springer-Verlag, Berlin, Heidelberg, 728–755. https://doi.org/10.1007/978-3-031-20080-9_42
[12] https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb
[13] https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/gpt-4-v-prompt-engineering