
Search Images with Text: Build a Multimodal AI Engine (Python Tutorial)

Updated: Jan 22


Introduction: Building a Multimodal Search Engine in Python


Social media data is messy. It’s not just a clean spreadsheet of words; it’s a chaotic mix of captions, hashtags, screenshots, and photos. If you are only analyzing the text, you are ignoring half the story.


In this tutorial, we will build a Multimodal Search Engine in Python. Unlike traditional databases that look for matching keywords, this engine understands concepts. It allows you to search for the text "Pizza" and get back a photo of a pizza, or upload a photo of a dog to find tweets about "Golden Retrievers."


Prerequisites: This tutorial runs entirely in the cloud. If you are new to this environment, check out our guide on AI Coding in Google Colab with Gemini to get your T4 GPU runtime set up.


The Tech Stack: Python, MongoDB, and CLIP


To build this, we don't need a massive server farm. We rely on three key components:


  1. Python: For the logic.

  2. MongoDB Atlas: To store our data and perform vector search.

  3. OpenAI CLIP: An open-source model that translates both text and images into the same "Vector Space."


First, we install the necessary libraries: sentence-transformers for the CLIP model, pymongo for database connectivity, and pillow and requests for downloading and handling images.


!pip install sentence-transformers pymongo pillow requests -q

Data Collection & The "Split Strategy"


For this demo, we generate synthetic social media posts that combine text and image URLs. In a real-world scenario, however, you would likely work with data collected from X (Twitter).
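To keep the rest of the walkthrough concrete, here is a minimal stand-in for that synthetic dataset. The structure (an id, a text field, and an image_url) and the placeholder URLs are assumptions made for this sketch, not the exact data used in the notebook.

# Hypothetical synthetic posts: short captions plus placeholder image URLs
posts = [
    {"id": 1, "text": "Homemade pizza night! #foodie",
     "image_url": "https://example.com/images/pizza.jpg"},
    {"id": 2, "text": "My golden retriever puppy loved the park today",
     "image_url": "https://example.com/images/dog.jpg"},
    {"id": 3, "text": "Sunset run along the beach #fitness",
     "image_url": "https://example.com/images/beach.jpg"},
]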



Why One Tweet = Two Documents


We use a "Split Strategy." Since a single tweet contains both text and an image, we store it as two separate documents in our database, each with its own vector. This allows us to match a user's query against either the text or the image independently.


[Figure: The "Split" storage strategy creates separate vectors for text and images.]
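To make the split concrete, this is roughly what the two stored documents might look like. The field names (post_id, type, content, embedding) are choices made for this sketch rather than a required schema, and the embedding values are filled in once the CLIP model is loaded in the next section.

# One tweet becomes two documents that share a post_id
text_doc = {
    "post_id": 1,
    "type": "text",              # which modality this vector represents
    "content": "Homemade pizza night! #foodie",
    "embedding": None,           # 512-dim CLIP text vector, added later
}
image_doc = {
    "post_id": 1,
    "type": "image",
    "content": "https://example.com/images/pizza.jpg",
    "embedding": None,           # 512-dim CLIP image vector, added later
}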

Understanding Multimodal Embeddings


This is where the magic happens. We use the CLIP (Contrastive Language-Image Pre-Training) model. It converts our text and our images into lists of numbers (vectors) that share the same mathematical space.


If you've followed our previous guide on Enhanced Twitter Insights with Vector Databases, you know how text embeddings work. Today, we are adding the visual layer.


[Figure: The shared vector space, where text and images map to the same coordinates.]


Here is how we load the model and configure Pillow to accept large images:


from sentence_transformers import SentenceTransformer
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Allow processing of large images
Image.MAX_IMAGE_PIXELS = None
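With the model loaded, we can embed each post and write the split documents to MongoDB. The following is a minimal sketch, assuming a connection string in a MONGODB_URI environment variable, a social_db.post_vectors collection, the posts list from earlier, and an Atlas Vector Search index (named vector_index here) on the embedding field with type declared as a filter field; adjust these names for your own cluster.

import os
from io import BytesIO

import requests
from pymongo import MongoClient

# Assumed connection details and collection names
client = MongoClient(os.environ["MONGODB_URI"])
collection = client["social_db"]["post_vectors"]

for post in posts:
    # Text document: embed the caption with CLIP's text encoder
    text_doc = {"post_id": post["id"], "type": "text",
                "content": post["text"],
                "embedding": model.encode(post["text"]).tolist()}

    # Image document: download the photo and embed the pixels with CLIP's image encoder
    image = Image.open(BytesIO(requests.get(post["image_url"]).content))
    image_doc = {"post_id": post["id"], "type": "image",
                 "content": post["image_url"],
                 "embedding": model.encode(image).tolist()}

    collection.insert_many([text_doc, image_doc])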

The "Double-Tap" Search Logic


To make the search truly "multimodal," we cannot just run a single query. Text vectors and image vectors tend to cluster in different regions of the shared space, so a single nearest-neighbor query is often dominated by one media type.


To fix this, we define a search function that queries our database twice:


  1. Search A: Restrict the search to text documents and return the best text matches.

  2. Search B: Restrict the search to image documents and return the best image matches.

  3. Merge: Combine both result sets to give the user a rich answer.


[Figure: The "Double-Tap" logic ensures we find relevant results across both media types.]

def mixed_search(query, num_results=1):
    # Convert the query (a text string or a PIL image) into a CLIP vector
    query_vector = model.encode(query).tolist()

    # "Double-Tap": run one vector search per media type, then merge the results
    results = []
    for media_type in ["text", "image"]:
        results.extend(collection.aggregate([{"$vectorSearch": {
            "index": "vector_index",                  # assumed Atlas index name
            "path": "embedding",                      # field holding the CLIP vector
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": num_results,
            "filter": {"type": {"$eq": media_type}},  # restrict to one media type
        }}]))
    return results


Testing the Results


Does it work? Let's look at the actual output from our Python notebook.


Test 1: Text-to-Image


We searched for the text string "Pizza". The database ignored the file names and focused on the image content, successfully returning a photo of a pizza even though nothing in the file name explicitly said "pizza".


[Figure: The pizza image the engine retrieved for the text query "Pizza".]
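As a quick usage sketch (assuming the mixed_search function and collection defined above), the text query is embedded with the same CLIP model and matched against both document types:

for hit in mixed_search("Pizza", num_results=1):
    print(hit["type"], "->", hit["content"])

# Example output (illustrative, based on the placeholder posts above):
#   text  -> Homemade pizza night! #foodie
#   image -> https://example.com/images/pizza.jpg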

Test 2: Image-to-Text


We fed the system a random picture of a dog. The system returned tweets containing words like "Golden Retriever" and "Puppy." It successfully bridged the gap between pixels and language.
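The reverse direction uses the same function; we simply pass a PIL image instead of a string (dog.jpg is a hypothetical local file used for illustration):

# Image-to-text: hand a PIL image straight to mixed_search
dog_photo = Image.open("dog.jpg")   # hypothetical local photo of a dog
for hit in mixed_search(dog_photo, num_results=2):
    print(hit["type"], "->", hit["content"])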


Want to do more with images? Check out our tutorial on AI Magic for Twitter Images to learn how to generate and transform images using Diffusion models.


Conclusion


By moving from simple keyword matching to Multimodal Embeddings, we unlock a deeper understanding of social media data. This architecture is the foundation for modern recommendation systems and content moderation tools.


Get the Code: You can find the complete Jupyter Notebook for this tutorial on our GitHub: View on GitHub
