When you ask Perplexity a question, it doesn't generate an answer from nothing. It searches the web, retrieves a handful of pages, reads them, and synthesizes a response that quotes from those pages with inline citations. The same is true of ChatGPT's web search mode, Google AI Overviews, and most production AI search engines today.
The architecture behind this is called Retrieval-Augmented Generation (RAG). Understanding it, even at a high level, explains why some pages get cited and most don't, and what to change about your content if you want to be in the cited set.
The RAG Pipeline in Plain English
When a query comes in, a RAG system does three things in sequence:
- Retrieve. Search an index for documents related to the query. This is typically semantic (vector embedding) search, not keyword matching. The system compares the meaning of the query to the meaning of indexed documents.
- Rerank and select. Score the retrieved candidates, pick the top N (usually 3 to 10), and feed them into the language model as context.
- Generate. The model writes an answer using the retrieved context, attributing claims back to specific source documents.
Citation happens at step 3. The model picks spans of text from the retrieved documents that answer the question and attributes them. If your page isn't retrieved (step 1) or doesn't survive the rerank (step 2), it can't be cited. And if it's retrieved but contains no clean spans the model can quote, it won't be cited either.
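To make the pipeline concrete, here is a minimal sketch in Python. The `embed` and `rerank_score` helpers are toy stand-ins (a hashed bag of words and raw lexical overlap), not what any production engine actually runs, and `index` is assumed to be a list of dicts with precomputed `"vector"` and `"text"` fields. The point is the shape of the three steps and where a page can fall out.

```python
import numpy as np

# Toy stand-ins for the real components: production systems use a neural
# embedding model, a cross-encoder reranker, and an LLM for generation.
def embed(text):
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

def rerank_score(query, text):
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def build_context(query, index, top_n=5):
    q_vec = embed(query)
    # 1. Retrieve: pull the documents whose vectors sit closest to the query's.
    retrieved = sorted(index, key=lambda d: cosine(q_vec, d["vector"]), reverse=True)[:50]
    # 2. Rerank and select: rescore the candidates and keep only the top N as context.
    context = sorted(retrieved, key=lambda d: rerank_score(query, d["text"]), reverse=True)[:top_n]
    # 3. Generate: the context goes to the LLM, which writes the answer and cites
    #    spans from it. A page dropped at step 1 or 2 never reaches this stage.
    return context
```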
Why Semantic Retrieval Rewards Entity-Rich Content
Embedding-based retrieval works by mapping text to high-dimensional vectors and finding documents whose vectors are close to the query's. A page's vector is largely determined by the entities in it (named people, products, places, organizations) and the concepts they anchor. A page that names specific entities maps to a distinctive, precisely located vector that sits close to the queries mentioning those entities. A page full of generic language (“we offer industry-leading solutions for enterprise needs”) maps to a vague vector that matches few specific queries well.
Practical implication: entity density matters. A page about “CRM software” that names Salesforce, HubSpot, Pipedrive, and Close.io is more retrievable for “HubSpot vs Salesforce” than a page about CRMs that names none of them. Be specific. Use proper nouns. Mention real products, real people, real places.
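You can see the effect with any off-the-shelf embedding model. The snippet below assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, but the pattern is the same for any embedder: compare a specific query against an entity-rich page and a generic one.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "HubSpot vs Salesforce"
entity_rich = ("Salesforce and HubSpot are the two most widely deployed CRMs; "
               "Pipedrive and Close are common alternatives for smaller sales teams.")
generic = "We offer industry-leading solutions for enterprise needs."

scores = util.cos_sim(model.encode(query), model.encode([entity_rich, generic]))
print(scores)  # expect the entity-rich page to score noticeably higher than the generic one
```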
Why Direct-Answer Paragraphs Get Cited
Once a page is retrieved, the LLM scans it for spans that answer the user's query. The format that makes this easiest is also the format that gets cited most: an H2 that mirrors a likely question, followed by a first sentence that directly answers it.
Compare:
- Citable: H2 “What is a CRM?” followed by “A CRM (Customer Relationship Management) system is software that centralizes customer data across sales, support, and marketing.”
- Not citable: H2 “Understanding modern business tools” followed by “In today's fast-paced environment, organizations face increasing complexity...”
The first version is a clean span the model can quote. The second is a paragraph the model has to summarize, which it's less likely to do.
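If you want a rough way to audit a draft for this, a heuristic like the sketch below, which looks for question-style H2s followed by a first sentence that doesn't open with throat-clearing, can flag sections worth rewriting. It's illustrative only; the engines themselves make this judgment with an LLM, not regexes.

```python
import re

QUESTION_STARTERS = ("what", "why", "how", "when", "which", "who",
                     "is", "are", "does", "do", "can", "should")
THROAT_CLEARING = ("in today's", "in an era", "as we all know")

def audit_sections(markdown_text):
    """Yield (heading, looks_citable) for each H2 section in a markdown draft."""
    for section in re.split(r"^## ", markdown_text, flags=re.MULTILINE)[1:]:
        heading, _, body = section.partition("\n")
        first_sentence = body.strip().split(". ")[0].lower()
        question_like = heading.strip().rstrip("?").lower().startswith(QUESTION_STARTERS)
        opens_directly = not any(p in first_sentence for p in THROAT_CLEARING)
        yield heading.strip(), question_like and opens_directly
```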
Schema Markup and the Grounding Layer
Some AI search engines (notably Google's grounded LLMs and Microsoft Copilot) read schema.org markup directly. A page with proper Article, FAQ, or HowTo schema gives the engine pre-structured data to extract from. This is one of the cheapest GEO wins available. The schema doesn't change your prose, but it gives retrieval and grounding layers extra signal to work with.
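A minimal FAQPage block looks like the sketch below; generating it in Python keeps it in sync with the page content. The field names come from schema.org, and the example question reuses the CRM definition from earlier. The output belongs inside a `<script type="application/ld+json">` tag on the page.

```python
import json

faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What is a CRM?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "A CRM (Customer Relationship Management) system is software that "
                    "centralizes customer data across sales, support, and marketing.",
        },
    }],
}

# Serialize for embedding in a <script type="application/ld+json"> tag.
print(json.dumps(faq_schema, indent=2))
```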
What the Data Says
The Princeton/Georgia Tech 2024 GEO paper measured this empirically:
- Adding outbound citations to authoritative sources: 30 to 40% visibility lift in Perplexity and BingChat.
- Adding concrete statistics: ~30% lift.
- Adding direct quotes from named authorities: ~30% lift.
- Improving fluency and confident tone: smaller but consistent gains.
These are deltas measured against control pages, not heuristics. They're the closest thing we have to a rulebook for AI search citation.
A Practical Checklist
- Mention at least 3 concrete numeric data points or percentages in the body.
- Link out to at least 3 authoritative sources (.gov, .edu, major news, recognized industry publications).
- Structure every H2 as a question or claim, and make the first sentence under it directly answer the H2.
- Include at least one block quote attributed to a named source.
- Name specific products, people, and places, not categories and abstractions.
- Add Article + FAQ or HowTo schema with required fields complete.
- Drop the hedging language from the first paragraph. “Might,” “could possibly,” and “may be” weaken citation likelihood.
That checklist is the practical core of what the GEO Score measures. Run a page through it, fix what fails, and the citation data will tell you whether it worked.
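If you want to automate the first pass, a blunt self-audit along these lines catches the most common failures. It's a sketch, not the GEO Score itself: the thresholds mirror the checklist above, and the simple string checks will miss edge cases.

```python
import re

AUTHORITY_HINTS = (".gov", ".edu")  # extend with publications you consider authoritative

def checklist_report(markdown_text):
    lines = markdown_text.splitlines()
    return {
        "3+ numeric data points": len(re.findall(r"\d[\d,.]*%?", markdown_text)) >= 3,
        "3+ authoritative links": sum(markdown_text.count(h) for h in AUTHORITY_HINTS) >= 3,
        "block quote present": any(line.startswith("> ") for line in lines),
        "schema present": "application/ld+json" in markdown_text,
    }
```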
