![[ezgif-28824157dc0dbe.webp]]
The modern news cycle is relentless, and I have always wanted the highest signal-to-noise ratio in anything I read.
For platforms, delivering that signal in a compelling way is the path to growth. I faced this exact challenge: how do you transform a real-time stream of raw news articles into engaging narratives that users actually _want_ to consume?
>The ML pipeline I built scaled to serve approximately **10,000 Daily Active Users (DAU)**, engaged **50,000 Monthly Active Users (MAU)**, and, most tellingly, saw users swipe through nearly **3 million news cards**.
My goal is to give you a look under the hood: a robust semantic clustering engine that identifies unique stories, and LLM-driven summarization that presents them.
### How do you discover signal in the noise?
Every minute, countless articles are published. Many are syndicated, slightly reworded, or follow-ups to the same underlying event. The first critical task was to accurately identify and group these into distinct, evolving stories.
There are two realities:
1. Failure here means a repetitive and frustrating user experience.
2. Success means delivering clarity and balanced viewpoints.
Here is the two-part approach I came up with:
1. Use **Semantic Clustering** to group articles by what they _mean_, not just what words they contain.
2. Drive summarization with Large Language Models to create diverse, high-quality summaries tailored for different consumption habits.
Let's break these down.
### Phase 1: Semantic Clustering of News Articles
The objective here is to take a stream of incoming articles and group those that discuss the same real-world event.
**1. Text Embeddings: Capturing Semantic Essence**
To compare articles meaningfully, we first need to convert their textual content into a numerical format that captures semantic relationships. Simple keyword matching is too brittle. We opted for OpenAI's `text-embedding-3-small` model, a strong performer that offered a good balance of semantic richness, speed, and cost-effectiveness for our scale.
Each article's title and content were embedded into a high-dimensional vector space.
```python
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def get_embedding(text, model="text-embedding-3-small"):
    """
    Generates an embedding for the given text using the specified OpenAI model.
    Text is truncated to the first 500 characters.
    """
    text = text.replace("\n", " ")  # Basic preprocessing
    return client.embeddings.create(input=[text[:500]], model=model).data[0].embedding
```
**2. Hierarchical Agglomerative Clustering (HAC) with Ward's Method**
With articles represented as embedding vectors, the next step is clustering. We chose Hierarchical Agglomerative Clustering (HAC) because the number of distinct news stories isn't known beforehand. HAC builds a hierarchy of clusters, which we can then "cut" at a desired dissimilarity threshold.
Specifically, we used **Ward's linkage method**. Ward's method is designed to minimize the total within-cluster sum of squares (variance). At each step, it merges the pair of clusters that will lead to the minimum increase in the total within-cluster variance. The objective function aims to minimize:
$
E = \sum_{k} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2
$
where $C_k$ is the $k$-th cluster and $\mu_k$ is the centroid (mean vector) of the points $x_i$ in cluster $C_k$.
The distance metric chosen was **cosine distance**. For high-dimensional data like text embeddings, cosine distance (or similarity) is often more effective than Euclidean distance as it measures the angle between vectors, thus capturing orientation (semantic similarity) rather than magnitude.
Cosine similarity between two vectors **A** and **B** is:
$
S_C(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}
$
And cosine distance is:
$
D_C(A, B) = 1 - S_C(A, B)
$
A smaller cosine distance implies greater semantic similarity.
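As a quick sanity check on the two formulas above, here is a toy numpy computation. The vectors are made up for illustration; real embeddings from `text-embedding-3-small` have 1536 dimensions.
```python
import numpy as np

# Toy vectors standing in for two article embeddings (illustrative values only).
A = np.array([0.2, 0.7, 0.1])
B = np.array([0.25, 0.65, 0.05])

cos_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
cos_dist = 1 - cos_sim  # Small distance => semantically similar

print(f"similarity={cos_sim:.4f}, distance={cos_dist:.4f}")
```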
```python
import numpy as np
from scipy.spatial.distance import pdist  # Pairwise distances
from scipy.cluster.hierarchy import linkage, fcluster  # HAC

def perform_clustering(embedding_list, distance_threshold=0.3):
    """
    Performs HAC on a list of embedding vectors and returns flat cluster labels.
    """
    if not embedding_list or len(embedding_list) < 2:
        # Not enough items to cluster; give each item its own label.
        return np.arange(len(embedding_list))

    embeddings_array = np.array(embedding_list)

    # Pairwise cosine distances, returned as a condensed distance matrix.
    distance_matrix = pdist(embeddings_array, metric='cosine')

    # Ward's linkage: each merge minimizes the increase in within-cluster variance.
    Z = linkage(distance_matrix, method='ward')

    # Cut the dendrogram at the given threshold to form flat clusters. Note the
    # cut applies to the linkage (merge) distances in Z, not the raw pairwise
    # cosine distances. The value of 0.3 was determined empirically for our
    # dataset and embedding model.
    clusters = fcluster(Z, t=distance_threshold, criterion='distance')
    return clusters  # Array of cluster labels, one per input embedding
```
After clustering, we implemented a filtering step: clusters containing fewer than a minimum number of articles (e.g., 2) were typically discarded. This helped focus on stories with multiple sources or significant development, reducing noise from isolated or poorly supported articles.
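The filter itself is simple; a minimal sketch (the `min_cluster_size` value and helper name are illustrative):
```python
from collections import Counter

def filter_small_clusters(cluster_labels, min_cluster_size=2):
    """Returns the set of cluster labels with enough supporting articles."""
    counts = Counter(cluster_labels)
    return {label for label, count in counts.items() if count >= min_cluster_size}

# Usage: keep only articles whose cluster survived the size filter.
# kept_labels = filter_small_clusters(perform_clustering(embeddings))
```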
### Phase 2: Summarization with LLMs
Once articles are clustered into distinct stories, the challenge shifts to presenting these stories effectively. A single summary format rarely fits all user needs or content types. We leveraged OpenAI's `gpt-4o` model, orchestrated via Langchain, with prompt management handled by PromptLayer.
**1. Diverse Formats for Different Use Cases**
We aimed to generate multiple summary types for each news story:
- **ELI5 (Explain Like I'm 5):** For complex topics.
- **BulletPoints:** Concise, scannable key facts.
- **W5Points (Who, What, When, Where, Why):** Classic journalistic breakdown.
- **TLDR (Too Long; Didn't Read):** Ultra-brief takeaway.
- **Gist:** The core essence.
- **GenZ:** A more informal, engaging, and often emoji-laden style.
A critical piece for reliability was enforcing **structured output**. We defined Pydantic models for each summary type. Langchain's `JsonOutputParser` then used these models to instruct the LLM to return JSON that strictly conformed to our desired schema (e.g., distinct fields for `title` and `content`). This dramatically reduced parsing errors and improved the consistency of the LLM's output.
```python
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import AzureChatOpenAI  # Or your preferred LLM interface

# --- Pydantic Model Example ---
class BulletPointsSummary(BaseModel):
    """Defines the structure for a bullet-point style news summary."""
    # Descriptions here guide the LLM, often populated from PromptLayer templates
    title: str = Field(description="A catchy, concise title for the bullet-point summary. Max 10 words.")
    content: str = Field(description="A summary of the key points from the news content, presented as markdown bullet points. Each bullet should be impactful. Max 5-7 bullets.")

# --- Simplified Summarization Chain Sketch ---
# Example initialization; assumes Azure OpenAI credentials are set in the environment.
llm = AzureChatOpenAI(deployment_name="gpt-4o", temperature=0.2)

def generate_structured_summary(news_content_str, pydantic_model_class):
    """
    Generates a structured summary for the given news content using the
    specified Pydantic model as the output schema.
    """
    parser = JsonOutputParser(pydantic_object=pydantic_model_class)

    # The prompt template instructs the LLM on its task and response format.
    # {format_instructions} is populated by the parser; {query} receives the news content.
    prompt_template_str = """
    You are an expert news summarizer. Based on the provided news content, generate a summary.
    Adhere strictly to the following JSON schema for your output:
    {format_instructions}

    News Content:
    {query}
    """
    prompt = PromptTemplate(
        template=prompt_template_str,
        input_variables=["query"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # Construct the chain: Prompt -> LLM -> Parser
    chain = prompt | llm | parser
    try:
        # 'news_content_str' is the concatenated text from articles in a cluster.
        return chain.invoke({"query": news_content_str})
    except Exception as e:
        # Handle LLM errors and parsing failures gracefully.
        print(f"Error generating summary for {pydantic_model_class.__name__}: {e}")
        return None  # Or a default error structure
```
**2. Orchestration and Chained Summarization**
To generate all summary formats for a given news cluster efficiently, we used a `ThreadPoolExecutor` to fan out parallel calls to `generate_structured_summary`, one per Pydantic model type.
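A minimal sketch of that fan-out, assuming the `generate_structured_summary` function above and a hypothetical `SUMMARY_MODELS` list of the Pydantic schema classes:
```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry of the schema classes described above.
SUMMARY_MODELS = [BulletPointsSummary]  # ..., plus ELI5, W5Points, TLDR, Gist, etc.

def summarize_all_formats(news_content_str):
    """Generates every summary format for one news cluster in parallel."""
    with ThreadPoolExecutor(max_workers=len(SUMMARY_MODELS)) as executor:
        futures = {
            model.__name__: executor.submit(generate_structured_summary, news_content_str, model)
            for model in SUMMARY_MODELS
        }
        return {name: future.result() for name, future in futures.items()}
```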
The "GenZ" format had a unique twist: it was generated by first creating the "BulletPoints" summary. This bullet-point summary (both title and content) was then fed as input to another LLM call, this time with a prompt specifically engineered to translate the factual bullets into a GenZ-appealing narrative. This form of _chained summarization_ allowed for specialized styling built upon a factual base.
PromptLayer was invaluable here. It allowed us to version, test, and iterate on prompts for each of these summary types independently, ensuring we could fine-tune the output quality for each without a monolithic prompt structure.
### Measuring Success: Engagement is your North Star
This entire pipeline, from raw articles to clustered stories to multi-format summaries, was built to drive user engagement. The numbers speak for themselves:
- **~10,000 Daily Active Users (DAU)**
- **~50,000 Monthly Active Users (MAU)**
- Nearly **3 Million news cards swiped**
The stickiness of the content generated by this system is further evidenced by user swipe behavior. We also tracked a swipe funnel for the news cards:
- **81%** of users swiped at least twice, a strong initial hook.
- **25%** of users who swiped once continued to **10+ swipes**, showing sustained engagement.
This isn't just about one "aha!" moment; it's about delivering a continuous stream of relevant, digestible content that is extremely addictive.
### Key Learnings & Why This Approach Worked
- **Iterative Prompt Engineering is the only way:** The quality of LLM outputs is directly tied to prompt quality. Constant refinement, A/B testing, and detailed instruction were paramount.
- **Use Structured LLM Outputs:** Moving from free-form text to JSON outputs guided by Pydantic schemas was a massive leap in reliability and predictability.
- **Semantic Understanding is Key for Clustering:** Embeddings provided the depth needed to truly group related articles, far surpassing older keyword-based methods. Ward's method with cosine distance proved robust for this task.