Approach
The Easy Part
It is surprisingly easy to generate an llms.txt. Take any agent SDK, hand it a model with built-in web search, maybe throw in a validation tool (not even strictly necessary), and point it at a domain. Within a few minutes you have a markdown file that passes the spec.
I tried this. It works. The file looks reasonable. It has headings, links, descriptions. If you squint, it seems like a perfectly good llms.txt.
But then you ask: is it actually good?
And that's where the real problem starts. There is no metric for "good." The llmstxt.org proposal says the goal is "to provide information to help LLMs use a website at inference time," but it doesn't say how to measure whether a given file achieves that. You're left staring at the output with no way to know if it's useful or just plausible-looking noise.
Defining the Goal
I start by taking an opinionated position:
A good llms.txt allows an agent to answer questions about a website in an accurate and token-efficient manner.
This single sentence makes the problem tractable. Instead of arguing about subjective quality (are the descriptions well-written? are the sections logical?), you can measure something concrete: does the file actually help an LLM do its job?
The word accurate gives the first lever. If the purpose of an llms.txt is to help an agent answer questions correctly, then you need questions with known answers. You need test cases. A question like "What is the pricing for the Pro plan?" paired with an expected answer that can be verified. If the agent gets it right using the llms.txt, the file is doing its job for that question. If it gets it wrong, the file failed.
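Concretely, a test case is nothing more than a question, a verifiable expected answer, and where the answer lives on the site. A hypothetical example (the values are illustrative):

```python
# A hypothetical test case. The expected answer is specific enough
# for a judge model to verify against whatever the QA agent returns.
test_case = {
    "question": "What is the pricing for the Pro plan?",
    "expected_answer": "$49 per month",           # illustrative value
    "source_url": "https://example.com/pricing",  # where the fact lives on the site
}
```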
But accuracy alone isn't enough. An agent is an unbounded loop. Given enough turns, it could scrape the entire website, follow every link, read every page, and eventually find any answer. A terrible llms.txt that says nothing useful would still produce correct answers because the agent would just ignore it and brute-force the site directly. That's not the file being good. That's the agent being persistent.
This is where the second constraint comes in: token efficiency. The llms.txt should accelerate the agent's search. Not every answer will be written directly in the file, but the file should act as a routing table, pointing the agent toward the right pages so it finds answers faster, in fewer tokens, with fewer wrong turns. A proxy for this in the agentic loop is total tokens consumed across the entire interaction: reading the llms.txt, deciding which pages to fetch, fetching them, and synthesizing an answer.
Put these together and you get a quality signal that balances both. A file that enables correct answers with minimal token overhead is a good file. A file that forces the agent into long, expensive explorations, or worse, leads it to wrong answers, is a bad one. This isn't a subjective judgment. It's measurable.
The Score
So there's a conceptual formula. But it needs to become something concrete: a single number that shows up next to a generated llms.txt and tells you whether it's any good.
The evaluation works like this. I auto-generate a test set of question-answer pairs from the crawled pages. These aren't hand-written. An LLM reads the pages and produces factual questions with known answers, things like "What programming languages does the SDK support?" or "What is the rate limit for the free tier?" Then a separate QA agent is given only the generated llms.txt and a fetch_url tool to look up pages on the site, and asked to answer each question.
A judge model compares the agent's answer against the expected answer and rules it correct, incorrect, or unsure. I also record exactly how many tokens the QA agent consumed: every token of the llms.txt it read, every page it fetched, every reasoning step it took.
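A minimal sketch of that harness, with `qa_agent` and `judge` as hypothetical helpers wrapping the underlying model calls:

```python
def run_eval(llms_txt: str, test_cases: list[dict]) -> tuple[float, float]:
    """Return (accuracy, average tokens per question) for one llms.txt.

    qa_agent() answers a question given only the llms.txt and a fetch_url
    tool; judge() compares answers. Both are hypothetical helpers here.
    """
    correct, total_tokens = 0, 0
    for case in test_cases:
        answer, tokens_used = qa_agent(
            question=case["question"],
            llms_txt=llms_txt,       # the only context the agent starts with
            tools=["fetch_url"],     # it may fetch live pages to dig deeper
        )
        verdict = judge(answer, case["expected_answer"])  # "correct" | "incorrect" | "unsure"
        correct += verdict == "correct"
        total_tokens += tokens_used  # llms.txt tokens + fetched pages + reasoning
    return correct / len(test_cases), total_tokens / len(test_cases)
```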
The score formula captures both dimensions in one number:
score = (correct / total) * (1 / (1 + avg_tokens / K)) * 100
K is an estimated baseline for typical agent usage, set to 3000. If the agent averages 3000 tokens per question, the efficiency multiplier halves the score. Get every question right but burn 15000 tokens each time? You score around 16. Get them all right in under 1500 tokens each? You're above 66.
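In code, the score is a direct transcription of that formula:

```python
K = 3000  # estimated baseline tokens per question

def llmstxt_score(correct: int, total: int, avg_tokens: float) -> float:
    """Accuracy scaled by a token-efficiency decay, on a 0-100 scale."""
    accuracy = correct / total
    efficiency = 1 / (1 + avg_tokens / K)
    return accuracy * efficiency * 100

llmstxt_score(3, 3, 15000)  # ~16.7: every answer right, but expensive
llmstxt_score(3, 3, 1500)   # ~66.7: every answer right, cheap lookups
```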
The decay function is what makes this interesting. It models the difference between a one-shot answer (the information was right there in the llms.txt, the agent barely had to think) and a two- or three-shot answer (the agent had to follow a link, read a page, come back).
This is why the score isn't just "percent correct." A file that gets 3/3 correct but costs 3000 tokens per question scores lower than a file that gets 3/3 correct in 300 tokens. The first file is accurate but unhelpful as a routing table. The second file is doing its job.
The Crawler
Before generating anything, I need to understand the site. And crawling is a harder problem than it looks.
A naive BFS crawler starts at the homepage and follows every link. For a 20-page documentation site, this is fine. For a site with thousands of pages, product listings, blog archives, and paginated feeds, BFS will happily crawl for hours and return a mountain of redundant content. I wanted to avoid sinking the crawl budget into low-relevance pages. BFS from the root helps with this compared to DFS, since pages closer to the homepage tend to matter more, but we can do better.
I use Maximal Marginal Relevance (MMR) crawling instead. The idea comes from information retrieval: at each step, pick the next page that is both relevant to the root domain and diverse from what you've already seen. Each candidate URL is scored by its embedding similarity to the homepage content (relevance) minus its maximum similarity to the pages already crawled (redundancy). The formula:
MMR(page) = lambda * relevance(page) - (1 - lambda) * max_similarity(page, visited)
With lambda at 0.5, the crawler strikes a balance: it won't wander off-topic, but it also won't keep fetching variations of the same content. In practice this means it discovers the pricing page, the docs, the about page, and the API reference early (the pages that matter) instead of grinding through 200 blog posts.
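A sketch of the selection step, assuming each page already has an L2-normalized embedding (so a dot product is cosine similarity):

```python
import numpy as np

def pick_next_page(candidates: dict[str, np.ndarray],
                   root_emb: np.ndarray,
                   visited_embs: list[np.ndarray],
                   lam: float = 0.5) -> str:
    """Return the candidate URL with the highest MMR score."""
    def mmr(emb: np.ndarray) -> float:
        relevance = float(emb @ root_emb)                   # similarity to the homepage
        redundancy = max((float(emb @ v) for v in visited_embs), default=0.0)
        return lam * relevance - (1 - lam) * redundancy     # relevance minus redundancy

    return max(candidates, key=lambda url: mmr(candidates[url]))
```

The crawler fetches the winning page, adds its embedding to the visited set, scores any new links it discovers, and repeats until the page budget runs out.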
There's a harder problem I haven't fully solved yet: bot detection. The current fetcher uses a standard HTTP client with a browser-like user agent. This works for most sites, but modern WAFs and bot detection services (Cloudflare, Akamai, DataDome) will block it on more protected sites. I'd like to integrate the emerging Web Bot Auth protocol, a proposed standard where automated agents can identify themselves and prove authorization, rather than pretending to be browsers. The honest approach. Until that protocol matures, the fallback path would be something like BrowserBase's stealth mode or AWS Bedrock's AgentCore Browser, which handle JavaScript rendering and fingerprint challenges. For now, the MMR crawler works well on the majority of sites, and I treat bot-blocked pages gracefully rather than crashing the pipeline.
The Generator Agent
With crawled pages in hand, the generator agent builds the actual llms.txt. This is a Claude-powered agent (using the Strands agent SDK) with a set of purpose-built tools:
- file_read: Read any crawled page, with pagination for long documents. The agent can scan the crawl manifest first, then dive into specific pages it finds interesting.
- file_write: Write files to the sandbox. The agent uses this for scratch notes, intermediate outlines, and the final llms.txt output.
- validate_llmstxt: Check the generated file against the llmstxt.org spec. The agent can validate its own work before finishing.
- fetch_url: Fetch a live URL not covered by the crawl. If the agent reads a page that references an important link the crawler missed, it can go get it directly.
The tools matter because they give the agent agency over its own information gathering. A template-based generator is limited to whatever the crawler found. This agent can notice gaps. If the crawl missed the pricing page but three other pages reference it, the agent can fetch it on its own. If a page is too long to process in one read, it can paginate through it. If it writes something that doesn't validate, it can fix it.
The agent operates in a sandboxed temporary directory. Crawled pages are seeded as flat files. The agent reads, reasons, fetches additional content if needed, writes the llms.txt, validates it, and hands it back. The sandbox is cleaned up automatically when generation finishes.
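A rough sketch of how this wires together, assuming the Strands `Agent`/`@tool` interface; the tool bodies are simplified stand-ins for the real implementations, and `GENERATOR_PROMPT` is the system prompt discussed below:

```python
from strands import Agent, tool

@tool
def file_read(path: str, offset: int = 0, limit: int = 2000) -> str:
    """Read a crawled page from the sandbox, paginated by line range."""
    with open(path) as f:
        return "".join(f.readlines()[offset:offset + limit])

@tool
def file_write(path: str, content: str) -> str:
    """Write scratch notes, outlines, or the final llms.txt into the sandbox."""
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} chars to {path}"

# validate_llmstxt and fetch_url are registered the same way (omitted here).
generator = Agent(system_prompt=GENERATOR_PROMPT, tools=[file_read, file_write])
generator("Read the crawl manifest in ./sandbox and produce ./sandbox/llms.txt")
```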
Closing the Loop
Everything so far (the crawler, the generator agent, the evaluation framework) works. You can point it at a domain and get a scored llms.txt. But there is one missing piece.
How do I know what a good system prompt looks like for the generator agent?
I have good intuition. The current prompt emphasizes exhaustiveness, specificity, factual detail. It works reasonably well. I could iterate on it by hand: try different phrasings, run evaluations, compare scores, adjust. The usual human-in-the-loop optimization.
But I live in an agentic world now. I don't need to move at the speed of humans or devote my own time to prompt iteration. Because I already have everything needed to automate this.
Think about what's been built: a crawl mechanism that can explore any site, an evaluation framework that produces a numeric quality score, and a generator that takes a system prompt and produces an llms.txt. The system prompt is the only free variable. Everything else is infrastructure.
So I combine them. I run a wide crawl across a set of test domains, much wider than I would at runtime when serving an actual client. From that expanded crawl, I can programmatically generate question-answer pairs that span domains far larger than what any single llms.txt could reasonably contain. This means I can effectively test one-shot lookups (answer is directly in the file), two-shot lookups (answer requires following a link), and even three-shot searches (answer requires multiple hops). I have the means to stress-test the generator's output in ways that go far beyond a handful of hand-written questions.
Then I give an optimizer agent an experimentation tool. It can modify the system prompt, trigger a generation run on a test domain, observe the evaluation score, and try again. The feedback loop is fully automated: generate, evaluate, adjust, repeat. The agent sees which prompt changes improve accuracy, which reduce token consumption, which cause regressions. It can run for as long as I'm willing to pay for the tokens.
This is effectively a genetic algorithm. The system prompt is the genome, mutations are the optimizer's edits, and the evaluation score is the fitness function. Candidates that improve the score survive; the rest are discarded. The optimizer doesn't need to understand why a prompt works; it just needs to observe that it scores higher and keep going.
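A sketch of the outer loop, with `optimizer_propose` (ask the optimizer agent for a mutated prompt) and `generate_llmstxt` (run the generator with that prompt) as hypothetical helpers, and `run_eval` being the harness sketched earlier:

```python
def optimize_prompt(seed_prompt: str, test_domain: str, test_cases: list[dict],
                    generations: int = 50) -> str:
    """Hill-climb on the generator's system prompt, using the eval score as fitness."""
    best_prompt, best_score, history = seed_prompt, 0.0, []
    for _ in range(generations):
        candidate = optimizer_propose(best_prompt, history)    # mutate the genome
        llms_txt = generate_llmstxt(candidate, test_domain)    # produce a candidate file
        accuracy, avg_tokens = run_eval(llms_txt, test_cases)  # accuracy + token cost
        fitness = accuracy * (1 / (1 + avg_tokens / K)) * 100  # same score formula as before
        history.append((candidate, fitness))
        if fitness > best_score:                               # improvements survive
            best_prompt, best_score = candidate, fitness
    return best_prompt
```

This particular sketch is a single-lineage hill climb; a population of candidate prompts with crossover would be the fuller genetic-algorithm version, but the fitness function is the same either way.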
This is the architecture's real payoff. The evaluation framework isn't just a report card, it's a fitness function. The crawler isn't just a data collector, it's a training set generator. And the generator agent isn't just a one-shot tool, it's the subject of automated optimization. The same infrastructure that serves users also improves itself.