Search Images with Text: Build a Multimodal AI Engine (Python Tutorial)
- Xuebin Wei

- Jan 22
- 3 min read
Updated: Jan 22
Introduction: Building a Multimodal Search Engine in Python
Social media data is messy. It’s not just a clean spreadsheet of words; it’s a chaotic mix of captions, hashtags, screenshots, and photos. If you are only analyzing the text, you are ignoring half the story.
In this tutorial, we will build a Multimodal Search Engine in Python. Unlike traditional databases that look for matching keywords, this engine understands concepts. It allows you to search for the text "Pizza" and get back a photo of a pizza, or upload a photo of a dog to find tweets about "Golden Retrievers."
Prerequisites: This tutorial runs entirely in the cloud. If you are new to this environment, check out our guide on AI Coding in Google Colab with Gemini to get your T4 GPU runtime set up.
The Tech Stack: Python, MongoDB, and CLIP
To build this, we don't need a massive server farm. We rely on three key components:
- Python: For the logic.
- MongoDB Atlas: To store our data and perform vector search.
- OpenAI CLIP: An open-source model that translates both text and images into the same "Vector Space."
First, we install the necessary libraries: sentence-transformers for our AI model and pymongo for database connectivity.
!pip install sentence-transformers pymongo pillow requests -q

Data Collection & The "Split Strategy"
For this demo, we generate synthetic social media posts combining text and image URLs. However, in a real-world scenario, you would likely be using data scraped from X (Twitter).
Need Data? Read our tutorial on How to Build a Cloud Python Data Pipeline to collect your own tweets.
Need a Database? Take our free course on Database & Data Collection to set up your free tier cluster.
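Once your cluster exists, connecting from the notebook only takes a few lines. The sketch below is illustrative rather than definitive: the connection string, database name, and collection name are placeholders you should swap for your own, and the index described in the comment is created through the Atlas UI.

from pymongo import MongoClient

# Placeholder connection string -- copy yours from the Atlas "Connect" dialog
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net/")
db = client["social_demo"]      # assumed database name
collection = db["posts"]        # assumed collection name

# In the Atlas UI, create a Vector Search index (e.g., "vector_index") on the
# "embedding" field with 512 dimensions and cosine similarity, and register
# "type" as a filter field so text and image documents can be searched separately.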
Why One Tweet = Two Documents
We use a "Split Strategy." Since a single tweet contains both text and an image, we treat them as two separate vectors in our database. This allows us to match a user's query against either the text or the image independently.

Understanding Multimodal Embeddings
This is where the magic happens. We use the CLIP (Contrastive Language-Image Pre-Training) model. It converts our text and our images into lists of numbers (vectors) that share the same mathematical space.
If you've followed our previous guide on Enhanced Twitter Insights with Vector Databases, you know how text embeddings work. Today, we are adding the visual layer.

Here is how we load the model and process the images:
from sentence_transformers import SentenceTransformer
from PIL import Image
# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')
# Allow processing of large images
Image.MAX_IMAGE_PIXELS = None
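To see the shared space in action, here is a small sketch that encodes a caption and a downloaded photo with the same model, compares them with cosine similarity, and then writes the vectors back into the documents we created earlier. The image URL is a placeholder, and the update keys assume the illustrative schema from the split example above.

import requests
from io import BytesIO
from sentence_transformers import util

# Encode a caption and an image into the same 512-dimensional space
text_vec = model.encode("a slice of pepperoni pizza")
response = requests.get("https://example.com/images/pizza.jpg")  # placeholder URL
image = Image.open(BytesIO(response.content))
image_vec = model.encode(image)

# Cosine similarity is high when the words and the pixels describe the same concept
print(util.cos_sim(text_vec, image_vec))

# Store the vectors in MongoDB so they can be served by the vector search index
collection.update_one({"tweet_id": 101, "type": "text"},
                      {"$set": {"embedding": text_vec.tolist()}})
collection.update_one({"tweet_id": 101, "type": "image"},
                      {"$set": {"embedding": image_vec.tolist()}})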

The "Double-Tap" Search Logic
To make the search truly "multimodal," we cannot just run a single query. Text vectors and image vectors tend to cluster in different regions of the embedding space (the so-called "modality gap"), so a single nearest-neighbor search is often dominated by one modality.
To fix this, we define a search function that queries our database twice:
- Search A: Forces the database to find the best Text matches.
- Search B: Forces the database to find the best Image matches.
- Merge: We combine both sets to give the user a rich answer.

def mixed_search(query, num_results=1):
    # Convert the query (a text string or a PIL image) to a vector
    query_vector = model.encode(query).tolist()
    results = []
    # "Double-Tap": run the vector search twice, once restricted to text documents
    # and once to image documents, then combine both result sets. Assumes a Vector
    # Search index named "vector_index" on "embedding" with "type" as a filter field.
    for doc_type in ("text", "image"):
        pipeline = [{"$vectorSearch": {
            "index": "vector_index", "path": "embedding",
            "queryVector": query_vector, "numCandidates": 100,
            "limit": num_results, "filter": {"type": {"$eq": doc_type}}}}]
        results.extend(collection.aggregate(pipeline))
    return results
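Before looking at the notebook output, here is how the same function is called for both query types; the dog photo path is a placeholder local file.

# Text-to-image: search with a plain string
print(mixed_search("Pizza", num_results=2))

# Image-to-text: search with a PIL image
dog_photo = Image.open("dog.jpg")  # placeholder local file
print(mixed_search(dog_photo, num_results=2))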

Testing the Results
Does it work? Let's look at the actual output from our Python notebook.
Test 1: Text-to-Image
We searched for the text string "Pizza". The database ignored the file names and focused on the image content. It successfully returned a photo of a pepperoni pizza even though the file name didn't explicitly say "pizza".

Test 2: Image-to-Text
We fed the system a random picture of a dog. The system returned tweets containing words like "Golden Retriever" and "Puppy." It successfully bridged the gap between pixels and language.
Want to do more with images? Check out our tutorial on AI Magic for Twitter Images to learn how to generate and transform images using Diffusion models.
Conclusion
By moving from simple keyword matching to Multimodal Embeddings, we unlock a deeper understanding of social media data. This architecture is the foundation for modern recommendation systems and content moderation tools.
Get the Code: You can find the complete Jupyter Notebook for this tutorial on our GitHub: View on GitHub