
LBSocial

Building a Social Media Knowledge Graph with Python & Neo4j

Stop staring at flat rows of data. In this tutorial, we build a high-performance ETL pipeline to transform nested JSON social media posts into a living, connected network using Python and Neo4j. In traditional relational databases, joining complex entities such as users, hashtags, and locations can be slow and counterintuitive; here, we move beyond spreadsheets to build a fully interactive Knowledge Graph.


Watch the full tutorial here

Social Media Knowledge Graph: Building with Python & Neo4j

Step 0: Setting Up Your Neo4j AuraDB Instance


Before coding, we need a home for our graph. Neo4j AuraDB offers a powerful free tier well-suited to this project.


  1. Create an Account: Sign up at Neo4j Aura.

  2. Launch Instance: Click "Create Instance" and select AuraDB Free. This supports up to 200,000 nodes—plenty for our tutorial.

  3. Save Your Credentials: Crucial Step! Download the generated .txt file. This single document contains your Connection URI, Username, and Password. You will need all three to link your Python code to the database.

The Neo4j Aura Console showing the "demo" instance with status RUNNING, along with its ID, node count, and relationship count.

Step 1: The Architecture of a Tweet


A single tweet is a rich, nested object. To build a graph, we must map these "documents" into four distinct node types: Users, Tweets, Places, and Hashtags.


Figure 1: The nested JSON structure of a single dummy tweet object, showing the embedded author, place, and entities data.
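In Python, one of these dummy documents might look like the following sketch (field names are illustrative, based on Figure 1, not the generator's exact output):

```python
tweet = {
    "id": "1001",
    "text": "Graph databases are game changers. #Neo4j",
    "created_at": "2024-05-01T12:30:00Z",
    "author": {"id": "u42", "username": "graph_fan"},     # -> User node
    "place": {"full_name": "New York, NY",
              "geo": {"lat": 40.7128, "lon": -74.0060}},  # -> Place node
    "entities": {"hashtags": [{"tag": "Neo4j"}]},         # -> Hashtag nodes
}
# The tweet's own id/text/created_at become the Tweet node's properties.
```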

By mapping relationships such as POSTED, LOCATED_AT, and TAGGED_WITH, we create a schema that mirrors real-world interactions. For a deeper look at managing data flows in the cloud, check out our guide on Building a Cloud Python Data Pipeline.


Figure 2: The graph schema, mapping relationships between User, Tweet, Place, and Hashtag nodes.

Step 2: Generating Context-Aware Social Data


We use the Faker library to create "Semantic Clusters," keeping our dummy data focused on related topics such as AI, Neo4j, and Python rather than random noise. If you want to speed up this process, leverage AI Coding in Colab with Gemini to help generate your synthetic data structures.


# Install dependencies
!pip install neo4j faker -q

# Sample of the generation strategy
from faker import Faker
fake = Faker()

topic_content = {
    "Neo4j": ["Graph databases are game changers for complex relationships.", "Just learned Cypher!"],
    "AI": ["Generative AI is transforming how we write code every day.", "LLMs are evolving."]
}
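The clusters above can then be expanded into full tweet documents. Here is a standard-library-only sketch of that idea (the notebook itself uses Faker for usernames and timestamps, and the field names are assumptions mirroring the nested structure from Step 1):

```python
import random
import uuid
from datetime import datetime, timezone

def make_dummy_tweets(topics, n=10, seed=None):
    """Expand semantic clusters into n tweet documents.

    Stdlib-only sketch: in the notebook, Faker supplies realistic
    usernames and timestamps instead of the placeholders below.
    """
    rng = random.Random(seed)
    tweets = []
    for _ in range(n):
        topic = rng.choice(sorted(topics))  # pick one semantic cluster
        tweets.append({
            "id": uuid.uuid4().hex,
            "text": f"{rng.choice(topics[topic])} #{topic}",  # cluster-aligned text
            "created_at": datetime.now(timezone.utc).isoformat(),
            "author": {"id": f"u{rng.randint(1, 20)}",
                       "username": f"user_{rng.randint(1, 20)}"},
            "entities": {"hashtags": [{"tag": topic}]},
        })
    return tweets
```

Calling `make_dummy_tweets(topic_content, n=50)` would then yield fifty posts spread across the clusters, each tagged with the hashtag of the topic it came from.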


Step 3: High-Performance Ingestion


To connect Python to Neo4j, store your URI and Password from the downloaded .txt file in Google Colab "Secrets" (the key icon).


The Google Colab sidebar's 'Secrets' tab (key icon), where the 'URI' and 'password' are securely stored for use in the notebook.
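A small helper can read those secrets inside Colab and fall back to environment variables elsewhere, so the notebook also runs locally. This is a sketch; the secret names `URI` and `password` are assumed to match what you saved in the Secrets tab:

```python
import os

def get_secret(name: str) -> str:
    """Fetch a credential from Colab Secrets; outside Colab,
    fall back to an environment variable of the same name."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except ImportError:
        return os.environ[name]
```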

We use the Cypher UNWIND command to batch-process the JSON. This is significantly faster than inserting nodes one at a time.


import json
from neo4j import GraphDatabase

# Using Cypher UNWIND to map JSON to Nodes/Relationships efficiently
query = """
UNWIND $batch AS row
MERGE (u:User {id: row.__expansion_author.id})
SET u.username = row.__expansion_author.username
MERGE (t:Tweet {id: row.id}) // MERGE keeps re-runs idempotent (no duplicate tweets)
SET t.text = row.text, t.created_at = datetime(row.created_at)
MERGE (u)-[:POSTED]->(t)
"""

Step 4: Exploring Your Data


Once the script confirms "Success," head to the Explore tool in the Neo4j console.


  • Natural Language Logic: Try querying "Show me all users in New York" and watch the AI translate your request into Cypher.

  • Manual Expansion: Double-click a User node to "expand" their network and see every tweet they've posted.
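For comparison, a hand-written Cypher version of the "users in New York" request might look like this (the `full_name` property on Place is an assumption about how the location label is stored):

```python
# Cypher a natural-language query like "Show me all users in New York"
# could translate into; run it via session.run(nyc_query) as in Step 3.
nyc_query = """
MATCH (u:User)-[:POSTED]->(:Tweet)-[:LOCATED_AT]->(p:Place)
WHERE p.full_name CONTAINS 'New York'
RETURN DISTINCT u.username
"""
```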


This is just the beginning. To see how to take the search further, check out our tutorial on Building a Multimodal Search Engine with Python.


Screenshot of the Neo4j Bloom interface demonstrating Natural Language Logic. The query "users posted ai hashtags" has been automatically translated into a graph pattern, revealing a central purple Hashtag node (#ai) connected to multiple yellow Tweet nodes and their respective blue User authors.

Conclusion


You’ve gone from an empty project to a living knowledge graph! This pipeline is the foundation for advanced graph analytics and AI-driven insights.



