# Metehan Yesilyurt — metehan.ai (Full Content)

> AI Search and SEO researcher. Publishing original research on LLM visibility, AI-powered search engines, and the next era of discoverability.

## Links

- Website: https://metehan.ai
- X: https://x.com/metehan777
- LinkedIn: https://www.linkedin.com/in/metehanyesilyurt
- GitHub: https://github.com/metehan777
- Substack: https://metehanai.substack.com
- RSS: https://metehan.ai/rss.xml
- Sitemap: https://metehan.ai/sitemap-index.xml
- llms.txt: https://metehan.ai/llms.txt

## Full Articles (103)

### I Crawled 65,000 Pages of My Own Site Without Parsing a Single Line of HTML

URL: https://metehan.ai/blog/http-headers-internal-links/
Date: 2026-05-04

Last week I spoke at SEO Week 2026 in New York City, organized by iPullRank, on April 27th and 30th. It was, easily, the most concentrated room of practitioners thinking about where SEO, AEO, and GEO are actually going.

Somewhere between talks on day one (I think it was during a hallway conversation about how badly LLM crawlers struggle with rendered DOMs), a question I had been chewing on for months finally crystallized:

What is the smallest, fastest, most boring thing a website can do to make itself perfectly understood by every crawler, scraper, and LLM that visits it?

Robots.txt? Sitemaps? llms.txt? Schema.org? They all help. They all also assume the same thing: that a crawler is going to download your HTML, render it, parse it, and politely figure out what you meant. That’s a lot of trust to place in a stranger’s parser.

So I flew home and tried something different. I took my entire internal link graph, every page, every connection, encoded it as compact JSON, base64url-encoded it, and stuffed it directly into an HTTP response header on every single page of my site. Then I added a second header carrying the page’s heading hierarchy.

Not in the body. Not in a sidecar file. In the headers.

Then I crawled 65,000 pages in 99 seconds without parsing one byte of HTML.

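
To make the encoding step concrete, here is a minimal sketch of the round trip: compact JSON, then base64url, then back again on the crawler side. The payload fields (`url`, `links`) and the sample paths are assumptions for illustration; the real Worker in the repo may shape its payload differently.

```python
import base64
import json


def encode_header_payload(payload: dict) -> str:
    """Compact JSON -> UTF-8 bytes -> base64url without '=' padding."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_header_payload(value: str) -> dict:
    """Restore the stripped padding, then decode and parse the JSON."""
    padded = value + "=" * (-len(value) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Hypothetical page record; the field names are illustrative only.
page = {"url": "/countries/tr/", "links": ["/", "/countries/", "/countries/tr/gdp/"]}
header_value = encode_header_payload(page)

print(f"X-Internal-Links: {header_value}")
print(len(header_value), "bytes")
```

Stripping the padding keeps the value free of `=`, and the URL-safe alphabet avoids `+` and `/`, so the result is plain ASCII that fits in a header value.
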
This post is the story of that experiment, the numbers from a fresh 100-URL run I did this morning, the prior art (yes I checked), the things that broke in production, and the open-source code so you can replicate it on your own site.

All the code: github.com/metehan777/http-header-link-graph (Cloudflare Worker + Rust crawler + Python insights pipeline, MIT-licensed).

#### The idea in one sentence

HTTP responses can carry 8 KB to 32 KB of headers depending on the server config. That is more than enough room to ship structured site metadata next to every page for free, on every request, before a single byte of body is sent.

Once that clicks, a lot of things change.

- A WAF blocks the body? The headers still went out.
- A page returns 403? The headers still went out.
- A page returns 500? The headers still went out.
- The page is JavaScript-rendered? The headers went out before the JS even loaded.

Headers travel light, headers travel first, and headers travel even when everything else falls apart. That makes them a beautiful place to publish things you actually want crawlers to see.

#### What I actually built

A toy. On purpose.

A small Cloudflare Worker that serves a real site (data.stateglobe.com, 65k+ pages), but on every response it adds:

Decode X-Internal-Links and you get the full list of internal links from that page:

Decode X-Headings and you get the page’s heading skeleton:

No DOM, no Readability, no Playwright, no Cheerio.

Then I asked one question:

If I act like a scraper, with no HTML parsing at all, can I rebuild the entire internal link graph and the heading hierarchy of every page using only response headers?

Spoiler: yes. And the speed surprised me.

#### Run 1: 65,000 pages, 14 minutes (cold cache)

I wrote a Rust crawler with tokio and reqwest (because once you taste 1,000 RPS you cannot go back), pointed it at the live site with 800 concurrent connections, and watched it grind away.

76 requests per second. For a static-feeling site. Ouch.

The bottleneck was obvious once I looked: every request was hitting my Worker, executing logic, and serializing the link payload from scratch. Cloudflare’s edge cache was sitting there, completely empty, watching me re-render the same HTML for the millionth time.

So I taught the Worker to use caches.default.put and added a sane Cache-Control: public, s-maxage=7200, max-age=300. Then I purged the edge cache once to flush stale stuff.

#### Run 2: 65,000 pages, 99 seconds (warm cache)

Same crawler, same site, same 65,000 pages:

8.6× speedup. Once Cloudflare’s edge had a copy of every response with my custom headers attached, the Worker barely had to lift a finger.

The crawler isn’t even downloading the full body. It is streaming the response headers, decoding a base64 string, and skipping the rest. That’s why 1 KB of header outperforms 50 KB of HTML: the crawler never asked for the 50 KB.

#### Run 3: 100 URLs, both headers, no cache, this morning

After the talk in NYC, a few people asked, “yeah but can you actually fit headings in there too?” So I deployed an updated Worker that also emits X-Headings, then ran a fresh, cache-busted 100-URL probe so we could measure the worst case (every request hits the Worker, no edge cache, both headers populated).

Code is in scripts/probe_100.py. Here is the actual output:

Translating from machine to human:

- 100 pages, 1.88 seconds, all 200s.
- Both headers present and decoded successfully on every single page.
- p95 latency 364 ms while hitting the Worker on every request, no edge cache.
- p95 combined header size 2.3 KB. Comfortably inside the 8 KB budget I aim for.
- The heading distribution is very revealing: every page has exactly 1×H1, 1×H2, plus a long tail of H3/H4. That tells me my templates are consistent, which is exactly the kind of structural signal AEO/GEO systems reward, and I never had to render the page to find that out.

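
The template-consistency check on heading distributions can be sketched as follows. This is a minimal illustration, assuming each decoded X-Headings payload is a list of `{"level", "text"}` objects; that shape, the function names, and the sample pages are hypothetical and not taken from the repo.

```python
from collections import Counter


def level_counts(headings):
    """Count headings per level for one page."""
    return Counter(h["level"] for h in headings)


def pages_breaking_template(pages, expected=None):
    """Return URLs whose H1/H2 counts differ from the expected template."""
    expected = expected or {1: 1, 2: 1}  # exactly one H1 and one H2
    broken = []
    for url, headings in pages.items():
        counts = level_counts(headings)
        if any(counts.get(level, 0) != n for level, n in expected.items()):
            broken.append(url)
    return broken


# Hypothetical decoded payloads keyed by URL.
pages = {
    "/a/": [
        {"level": 1, "text": "A"},
        {"level": 2, "text": "Overview"},
        {"level": 3, "text": "What is A?"},
    ],
    "/b/": [{"level": 2, "text": "Overview"}],  # missing its H1
}

print(pages_breaking_template(pages))
```

Any URL this returns is a page whose skeleton deviates from the template, found without fetching a body.
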
This is what I mean by “headers as a publishing surface.” We are not waiting for a parser to figure out the hierarchy. We are handing it over.

#### Then I built SEO insights from headers alone

After the 65k crawl finished, I had 65,000 JSON payloads sitting in a file. Each one was just { url, links: [...] }. No HTML.

Out of curiosity, I wrote a small Python script (scripts/seo_insights.py) that did one thing: build the full directed graph from those headers and compute SEO metrics on it.

What came back floored me:

- 27,372 pages (41.9%) had zero inbound internal links. Pure orphans, only reachable through the sitemap. On a site that was supposed to be tightly cross-linked.
- 27,659 pages were unreachable by walking from /. Same root cause.
- Click-depth distribution: 54% of pages were 4+ clicks deep from the homepage. 53% sat at depth 6.
- Gini coefficient on inbound links: 0.918. That’s near-monopolistic. The top 10 pages were hoovering up 17% of all internal links.
- Cluster inequality. The country sub-folders had identical structure (301 pages each), but median inbound count was 199 for some countries and 1 for others. Same template, same code, wildly different link distribution.

I have used Screaming Frog, Sitebulb, Ahrefs, OnCrawl, JetOctopus. They are all great. But this hit different. Because it took 99 seconds, cost approximately nothing, and surfaced a structural problem that no amount of “let’s audit the homepage” would have caught.

The crawl wasn’t the product. The crawl was just the delivery vehicle for a much cleaner SEO insight pipeline.

And now, with X-Headings shipping alongside, I can layer on:

- Heading uniqueness across the site (duplicate H1s, missing H2s, broken hierarchy)
- Topical drift between cluster pages
- AEO-friendly question detection (any H3 phrased as a question)
- Snippet candidates per page

…all without rendering a single page.

#### Why this matters for SEO, AEO, and GEO

This is the part that got me genuinely excited at SEO Week, because it generalizes.

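
For reference, the link-graph metrics from the insights pass above (orphans, reachability from /, click depth, Gini on inbound links) need nothing beyond a dictionary of decoded header payloads. The following is a minimal sketch under that assumption; it is not the code in scripts/seo_insights.py, and the function names and tiny sample graph are made up for illustration.

```python
from collections import deque


def inbound_counts(graph):
    """graph maps each crawled URL to the list of internal URLs it links to."""
    counts = {url: 0 for url in graph}
    for source, targets in graph.items():
        for target in set(targets):  # count each source page once per target
            if target != source and target in counts:
                counts[target] += 1
    return counts


def click_depths(graph, root="/"):
    """Breadth-first search from the homepage; unreachable pages are absent."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target in graph and target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Tiny made-up graph: "/c" links out but nothing links to it.
graph = {"/": ["/a", "/b"], "/a": ["/b"], "/b": ["/"], "/c": ["/a"]}

inbound = inbound_counts(graph)
orphans = [url for url, count in inbound.items() if count == 0]
depths = click_depths(graph)
unreachable = [url for url in graph if url not in depths]

print("orphans:", orphans)
print("unreachable from /:", unreachable)
print("gini:", round(gini(inbound.values()), 3))
```

Orphans and unreachable pages are computed separately on purpose: a page linked only from an orphan has an inbound link but still cannot be reached by walking from the homepage.
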
I started with internal links and headings. But the header is just an envelope. You can put almost anything structured inside it.

##### For traditional SEO

- Internal link graph (what I tested): orphans, dead-ends, click-depth, hub identification, Gini concentration.
- Heading skeleton (what I added this week): hierarchy validation, topical clustering, duplicate detection.
- Canonical hints, hreflang, content hash: ride along with every response, no extra request.
- Last-edited timestamp: helps crawlers decide whether to revisit. Cheaper than Last-Modified games.
- Page tier signal: tell crawlers “this is a tier-1 hub” or “this is a tier-3 long-tail page” so they can budget accordingly.

##### For AEO/GEO (Answer Engine Optimization)

This is where it gets fun. AI engines like Perplexity, ChatGPT, Gemini, and the new wave of search agents are extremely sensitive to clarity of structure. They reward sites where the topic, the headings, and the link relationships are obvious.

A header set like:

…lets any answer engine that fetches your URL get a perfect, parser-free abstract before the body even arrives. You are not begging the crawler to “understand” your page. You are handing it the understanding.

You’re publishing a machine-readable “if you quote this page, here is the canonical snippet to use, and here is the license.” That is way more useful for an LLM than scraping a paragraph and hoping it picked the right sentence.

##### For technical SEO operations

- Crawl prioritization at the edge. Inject X-Crawl-Priority: high on hub pages.
- Crawl change detection. A hash of the page’s link graph in the header lets you detect navigation drift between deploys without comparing HTML.
- Independence from rendering. Edge headers are added by your platform, not your CMS. So even if marketing forgets to add canonical tags, the platform still publishes the truth.

#### Has anyone done this before? (I checked, yes and no.)

After SEO Week I spent an hour digging through prior art, because if this idea were obvious someone would have shipped it by 2015. Here is what exists, and how it differs from what I am proposing.

1. RFC 8288 - Link HTTP response header (2017). The canonical standard for putting links in HTTP headers. Used in production for rel="canonical", rel="next"/"prev", rel="preload", rel="api-catalog". It is not used to ship a page’s full internal link graph: it is the wrong shape (one rel per link), and the syntax balloons fast.
2. Cloudflare’s Agent Readiness framework (Q1 2026). Cloudflare’s agent-readiness post lists Link headers as one of three official “discoverability” standards for agents, alongside robots.txt and sitemap.xml. They explicitly say: “agents can discover important resources directly from HTTP response headers… without having to parse any markup.” But they’re still scoped to the standard rels. Nobody at Cloudflare (or anyone else I could find) is pushing a “full link graph in a header” pattern.
3. llms.txt (Sept 2024, Jeremy Howard / Answer.AI). A markdown file at /llms.txt that describes your site for LLMs. Closest to my idea in intent. But it is a separate file (extra round-trip), curated and static, not per-page metadata. My approach: zero extra requests, per-page granularity, machine-precise structured payload instead of a hand-curated reading list.
4. JSON-LD via Accept: application/ld+json content negotiation. A documented pattern to serve a JSON-LD version of a page when the client asks for it. Still a separate request, still the body, still requires the agent to opt in. My approach arrives unconditionally on every response, in the headers, regardless of what the client asked for.
5. Open Graph / structured data via headers (proposals). A few Stack Overflow and Webmasters SE threads where developers asked, “why can’t we put OG/JSON-LD in HTTP headers?” The community answer is always: “search engines don’t support it,” so it stayed theoretical.

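
To put a rough number on the "syntax balloons fast" point from item 1, here is a minimal sketch that measures the same list of paths as an RFC 8288 Link header and as a base64url-encoded compact JSON array. Two assumptions: `rel="related"` is only my stand-in relation type, and the sample paths are short and made up. Base64 adds roughly a third on top of the JSON, so with long URLs the gap narrows and can reverse; measure on your own URLs before drawing conclusions.

```python
import base64
import json


def link_header_size(paths, rel="related"):
    """Bytes needed to list every path as an RFC 8288 link-value."""
    value = ", ".join(f'<{path}>; rel="{rel}"' for path in paths)
    return len(value.encode("utf-8"))


def base64url_json_size(paths):
    """Bytes needed for the same paths as compact JSON, base64url-encoded."""
    raw = json.dumps(paths, separators=(",", ":")).encode("utf-8")
    return len(base64.urlsafe_b64encode(raw).rstrip(b"="))


# 200 short, made-up paths standing in for a hub page's link list.
paths = [f"/countries/c{i:03d}/" for i in range(200)]

print("Link header:   ", link_header_size(paths), "bytes")
print("base64url JSON:", base64url_json_size(paths), "bytes")
```
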
My angle is different. I am not asking Google to parse it. I am publishing it for my own tooling, my own SEO/AEO pipelines, my own monitoring, my own dev team, and any forward-thinking agent that decides to read it.

That inversion is the new bit. The prior art is all about what crawlers ask for. My framing is about what site owners can publish.

If anyone has shipped exactly this before, I genuinely want to hear about it. Reach out and I’ll happily update this post.

#### A real-world warning: I broke my own site doing this

Now the embarrassing part.

About an hour before I sat down to write this post, I deployed the second header (X-Headings) and ran a probe. The 100 sub-page URLs returned 100/100 success.

Then I tried curl https://data.stateglobe.com/.

My homepage was 500-ing. So was /blog. So was /articles. So was /analytics. Every single hub page.

The reason? My homepage had 230 links and a long heading list. X-Internal-Links alone was around 10.8 KB. Add X-Headings at ~6.9 KB on top, and the combined response headers on those hub pages crossed ~17 KB. That breached the response-header size limit on the path between origin and edge, and the platform refused to deliver the response at all. Real users were getting 500s.

The sub-pages, where the link list was shorter (31 links, 1.5 KB) and the heading list was small (8 items, 540 bytes), worked perfectly. The hubs, which are by definition the most important pages on the site, were broken.

This is exactly the kind of thing you don’t want to learn in production.

#### The fix: a defensive header-budget module

After watching my own homepage 500 in production, I extracted a tiny, dependency-free TypeScript module that enforces a combined header-size budget and gracefully truncates the payload before it can blow up your origin.

It’s in the repo at src/headers.ts and works in any modern JS runtime: Cloudflare Workers, Next.js middleware, Deno, Bun, Node 18+.

The interface is minimal:

Internally it does three things:

- Per-header cap (default 6 KB). Each list is shrunk in 10% chunks until its base64url-encoded size fits under the per-header budget.
- Combined hard cap (default 12 KB). If both fit individually but their sum exceeds 12 KB, the heading list is truncated first, then if needed the link list, until the combined size fits.
- Truncation telemetry. When clipping happens, the response gains X-Internal-Links-Truncated: 1 and X-Internal-Links-Original: 230 (and the equivalents for headings), so your monitoring can alert you when budgets are being hit.

I added a stress test in scripts/test-headers-budget.mjs that throws 5,000 links + 5,000 headings at the function. Result: combined output is 11.4 KB, safely under the 12 KB cap, and the response still ships valid (truncated) payloads. No 500. Ever.

#### What’s running on data.stateglobe.com right now

After deploying that module and a defensive combined > 12 KB → truncate rule, the live site is back to 200s on every page, including the homepage and hubs.

Important note about the live demo: to make the experiment safe to run on a real site, I’m intentionally truncating the payload on data.stateglobe.com for testing. Hub pages with 200+ links would otherwise need a chunked-header approach (X-Internal-Links-1, X-Internal-Links-2, …) to ship the full graph. For the public demo, I’d rather you see a clean, 200-OK response with a representative subset of links + headings than a “complete” payload that risks 500ing hubs. Treat the live numbers as a lower bound on what’s possible.

If you want to see the full, untruncated technique, run the local Worker in the repo: it is a small demo site, so no truncation is needed.

#### Read this list before you ship anything

So please, before you ship this on anything that matters:

- HTTP response header size limits are real and origin-dependent. Cloudflare’s default response-header limit is around 16 KB. Many origins enforce 8 KB or stricter.
If your combined headers don’t fit, your origin returns 5xx, to crawlers and to humans.
- Hub pages are the danger zone. A homepage with 200+ links and a 50-item heading map can easily blow past the limit. Test every hub before rollout.
- Always cap the payload defensively. Use attachStructuralHeaders or roll your own equivalent. Set a hard byte limit (e.g. 6 KB per header, 12 KB combined) and gracefully truncate when over budget. Better to ship 50 of 200 links than to 500 the page.
- Cache it at the edge. Workers don’t cache by default. You have to explicitly call caches.default.put with a real Cache-Control. Otherwise every crawl hits compute, and you’ll see 76 RPS instead of 660.
- Edge caches are sticky. After deploying a new header shape, purge once, otherwise old responses keep getting served.
- HEAD requests are not your friend. I tried switching the crawler to HEAD to skip body bytes. Cloudflare and many origins respond differently to HEAD, and I lost the headers. Stick with GET; the body is cheap to drop on the client side.
- Treat missing headers as crawl-incomplete, not as data. When my first insights pass tagged 256 pages as “dead ends,” they were actually pages where the header was missing on that fetch. Re-crawl those before drawing conclusions.
- Scope this to your own domains. This is a publishing technique for site owners, not a bypass tool. Don’t go ramming custom headers through someone else’s WAF and call it a day.

#### Important: do not roll this out alone, especially in enterprise

I want to be very direct about this, because it will save someone’s job.

This experiment is fun on a personal site. It is not a thing you ship to an enterprise site without your dev team in the room. Specifically your platform/SRE team, your security team, and whoever owns your CDN config.

Why?

- This change touches your edge layer, your origin response budget, and your bot/security ruleset all at once.
None of those are “the SEO team’s domain.”
- A misconfigured rule can 5xx your homepage to real users (literally what happened to me an hour ago, in a controlled experiment).
- Many enterprise stacks (Akamai, Fastly, NetScaler, F5, custom WAFs) have their own response-header rewriting, header-size limits, and stripping behaviors that aren’t documented anywhere obvious. You will only find out by testing.
- You almost certainly need to coordinate with logging and observability. Custom headers can show up in access logs, in CDN logs, in SIEM rules, in compliance audits. Surprise nobody.
- Custom headers can also get stripped by intermediate proxies (some corporate/enterprise WANs aggressively prune unknown headers). Test from the actual customer egress paths if your audience includes corporate networks.

The right rollout is:

1. Build it on a staging hostname.
2. Get sign-off from platform/SRE on header-size limits, caching strategy, and observability impact.
3. Get sign-off from security on the new bot/agent surface.
4. Roll out to a single product directory first (e.g. /blog/* only), monitor 5xx for a week.
5. Then expand to the rest.

If you are an SEO doing this solo on a serious site, you are going to have a bad week. Bring your team in early. They will probably also help you avoid the homepage-500 trap I just walked into.

#### The architecture, in three boring sentences

1. Generate a small JSON payload per page describing its links / headings / topic. Cache it.
2. On every response, attach the payload as a base64url-encoded HTTP header. Cap the size so combined headers stay safely under your origin’s limit.
3. Cache the response at the edge so the second crawl is essentially free.

That’s it. There is no machine learning, no fancy infra, no protocol change. It is the kind of thing a single dev can ship in an afternoon and the kind of thing a careless dev can use to break production in five minutes. Be the first kind.

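
The "cap the size" step is the one that protects production, so here is a minimal Python sketch of the budgeting logic described earlier: a per-header cap, a combined cap that trims headings before links, and truncation telemetry. This is not the TypeScript module in src/headers.ts. The function names are hypothetical, "10% chunks" is interpreted as 10% of the items still remaining, the `X-Headings-Truncated` and `X-Headings-Original` names are my guess at "the equivalents for headings", and only header values (not names) are counted against the budget.

```python
import base64
import json

PER_HEADER_CAP = 6 * 1024   # bytes per encoded header value
COMBINED_CAP = 12 * 1024    # bytes for both values together


def encode(items):
    """Compact JSON, then base64url without padding."""
    raw = json.dumps(items, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def drop_tenth(items):
    """Remove roughly 10% of the remaining items from the tail (at least one)."""
    return items[: len(items) - max(1, len(items) // 10)]


def shrink_to_fit(items, cap):
    """Keep trimming until the encoded value fits under the cap."""
    kept = list(items)
    while kept and len(encode(kept)) > cap:
        kept = drop_tenth(kept)
    return kept


def build_structural_headers(links, headings,
                             per_cap=PER_HEADER_CAP, combined_cap=COMBINED_CAP):
    kept_links = shrink_to_fit(links, per_cap)
    kept_headings = shrink_to_fit(headings, per_cap)

    # Combined cap: trim headings first, and links only once headings are gone.
    while len(encode(kept_links)) + len(encode(kept_headings)) > combined_cap:
        if kept_headings:
            kept_headings = drop_tenth(kept_headings)
        elif kept_links:
            kept_links = drop_tenth(kept_links)
        else:
            break  # nothing left to trim

    headers = {
        "X-Internal-Links": encode(kept_links),
        "X-Headings": encode(kept_headings),
    }
    # Telemetry so monitoring can alert when budgets are being hit.
    if len(kept_links) < len(links):
        headers["X-Internal-Links-Truncated"] = "1"
        headers["X-Internal-Links-Original"] = str(len(links))
    if len(kept_headings) < len(headings):
        headers["X-Headings-Truncated"] = "1"  # assumed header name
        headers["X-Headings-Original"] = str(len(headings))  # assumed header name
    return headers


# Stress input in the spirit of the 5,000 + 5,000 test; the data is made up.
links = [f"/page/{i}/" for i in range(5000)]
headings = [{"level": 3, "text": f"Heading {i}"} for i in range(5000)]
headers = build_structural_headers(links, headings)
combined = len(headers["X-Internal-Links"]) + len(headers["X-Headings"])
print(combined, "bytes combined")
```

Trimming from the tail keeps whatever ordering the page already had, so if links are emitted in document order the navigation and above-the-fold links survive truncation first.
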
#### The bigger picture: headers as a publishing surface

Most of SEO has trained us to think of “the content” as the thing on the page. The body. The DOM. The words.

But every HTTP response is actually two things stacked on top of each other:

For 30 years, we shoved everything into the body and asked machines to figure it out. Crawlers got stronger, parsers got smarter, and we kept paying the cost of that translation, every single request, forever.

Meanwhile, the headers above the body, the most efficient part of every HTTP response on the planet, were sitting there carrying Content-Type: text/html and not much else.

I think that is going to change. As LLMs and answer engines become the dominant consumers of the web, they will reward sites that publish structured intent. And there is no faster, cheaper, more universal place to publish structured intent than HTTP headers.

You don’t need a new protocol. You don’t need a new standard. You don’t need WebSub or ActivityPub or some W3C committee. You just need to put the JSON in the header and serve it. Carefully.

The crawler does the rest in milliseconds.

#### Where I’m taking this next

A few directions I’m exploring this week:

- A defensive header budget library. ✅ Shipped: src/headers.ts. Pure, dependency-free, drop into any Worker or middleware. Stress-tested with 5,000 links + 5,000 headings. Will not let your origin 500.
- Topic embeddings as a header. A 256-dim quantized vector base64’d. LLMs can compare pages without reading them.
- A crawl-budget protocol. A header that says “I have 65,000 pages, 240 hubs, last full re-index was 14 hours ago.” Let crawlers use that to decide how aggressively to revisit.
- A public hosted “header crawler” API. Take any site you own, pass the base URL, and watch a 99-second full audit happen. The Rust crawler in the repo already does this; I want to host it as a free SaaS for the SEO community.
- AEO-friendly header pack.
X-Headings + X-Page-Topic + X-Cite-Snippet + X-Page-Summary: a reference set with sensible defaults and hard size caps.
- Chunked-header support. X-Internal-Links-1, X-Internal-Links-2, … so very large hubs can ship the full graph without truncation.

If any of this excites you, the code is on GitHub. Open an issue, send a PR, or just ping me. This is the kind of weekend rabbit hole that is genuinely more fun with collaborators.

You can also check the outputs; I’m building my own custom reporting schema.

- Homepage in nodes (HTML): https://metehan.ai/graph-input/link-graph
- JSON report: https://metehan.ai/graph-input/seo-graph-report.json
- CSV report: https://metehan.ai/graph-input/seo-header-report.csv
- Header summary: https://metehan.ai/graph-input/seo-header-summary.json

#### Closing thought

I went into this experiment thinking I would solve a crawling problem. I came out of SEO Week 2026 thinking I had stumbled onto a different way of looking at the entire SEO/AEO/GEO stack: the header is the most under-used publishing surface on the web, and it is going to matter more, not less, as machines do more of the reading.

65,000 pages. 99 seconds. No HTML parsed. A 41% orphan rate I would never have caught with traditional tools. A 100-URL fresh probe with both headers, both decoded, in 1.88 seconds. One spectacularly broken homepage as a free lesson. And about a hundred new ideas for what to put in a header next.

If you have a site you own, an afternoon to spare, and your dev team in Slack, try it. The first thing you will notice is how small the change is. The second thing you will notice is how much it changes how you think about your site.

Thanks again to Michael King and the iPullRank team for putting on SEO Week 2026. This idea wouldn’t exist without that room.

Code, sample reports, and the full Worker: github.com/metehan777/http-header-link-graph

---

### How I Turned Meta AI's Brain Scanner Model Into a Free SEO Tool

URL: https://metehan.ai/blog/how-i-turned-meta-ais-brain-scanner-model-into-a-free-seo-tool/
Date: 2026-03-30

Using fMRI Brain Data to Score Content Before Publishing

The tool tells you how the brain actually responds to your titles, intros, and SERP screenshots before you publish anything.

You can try it here: NeuralSEO on Hugging Face

#### The problem with current SEO tools & AI

Every SEO tool or AI model on the market measures the same thing: what already happened. Rankings, clicks, impressions, keyword difficulty. All lagging indicators. You publish, wait, and hope. AI models are trying to predict what you asked for, and they do it blind.

But the real question has always been: will a human brain actually pay attention to this?

That's not a metaphor. It's literally measurable. And now, thanks to neuroscience research from Meta AI, we can predict it before publishing.

#### What is Meta AI TRIBE v2?

TRIBE v2 stands for TRImodal Brain Encoder v2. It's a foundation model from Meta AI's FAIR lab. It was trained on fMRI recordings, actual brain scans, collected while 700+ volunteers watched videos, listened to audio, and read text. It's a multilingual model.

The model learned to predict how the human cortex responds to any input across roughly 20,000 cortical vertices. Feed it a sentence, and it tells you which brain regions activate, how strongly, and for how long.

I looked at this and immediately thought: this can help with SEO. And of course, it's experimental.

#### What NeuralSEO does

I built three core analysis tools on top of TRIBE v2.

##### 1. Neural Screenshot Analyzer

Upload a screenshot of a Google SERP, ChatGPT response, Perplexity answer, or Google AI Mode result. The tool splits the screenshot into layout regions like title, snippet, sidebar and so on.
It crops each region and runs it through TRIBE v2's visual inference pipeline. It scores each element by neural attention activation. Then it draws live scored overlays directly on the image so you can see exactly which parts of the page grab the brain's attention.

This is the closest thing to eye-tracking without actual eye-tracking hardware.

##### 2. Intro Paragraph Analyzer

Paste your opening paragraph (auto-trimmed to 600 characters). TRIBE v2 scores it across four neural dimensions:

- Hook Strength: does the opening trigger frontal attention networks?
- Engagement: global neural activation level
- Salience: does it stand out from noise?
- Retention: will the reader's brain encode this into memory?

You get a 0 to 100 neural score, a radar chart breakdown, and optional Gemini-powered rewrite recommendations.

##### 3. Neural CTR Predictor

Enter a keyword. Gemini generates 10 to 20 dynamic title tag variants. Each title runs through TRIBE v2 individually, scored by frontal attention network activation and salience response. You get a ranked list of predicted organic CTR before you publish, without A/B testing.

#### How brain signals map to SEO signals

Here's how TRIBE v2's brain activation patterns map to SEO-relevant signals:

These aren't traditional SEO metrics. They're neurological proxies, directional signals based on how the human brain processes content.

#### Technical architecture

The stack runs on Hugging Face Spaces with GPU allocation:

- Model: facebook/tribev2, Meta's trimodal brain encoder
- Inference: Text goes through TTS audio, then word-level timestamps via faster-whisper, then TRIBE v2 fMRI prediction
- Visual pipeline: Image becomes a short MP4 video via moviepy, then goes through TRIBE v2 visual inference
- Title generation: Google Gemini Flash generates dynamic variants and TRIBE v2 scores them
- Frontend: Gradio with custom dark theme, procedural Three.js brain visualization (the brain part still needs development)
- Brain viewer: Procedural mesh with 5 cortical regions that light up based on actual analysis scores

The text pipeline is particularly interesting. TRIBE v2 was trained on multimodal data, so even for text analysis, the input goes through a TTS step to generate audio, which is then transcribed with word-level timestamps. This gives the model the temporal dynamics it needs to predict brain activation patterns over time.

#### Limitations

This needs to be said clearly.

- Neural scores are directional signals, not ground truth ranking guarantees. Google's ranking algorithm doesn't use fMRI data. Or do they? Google has different patterns.
- GPU quotas are real. On Hugging Face's free tier, large batches can time out. Use smaller inputs when possible.
- First request is slow. The TRIBE v2 model weighs around 6 GB and loads on the first inference call.
- Non-commercial use only. TRIBE v2 is licensed CC BY-NC 4.0.

#### Why I built this

I've been in SEO for years, and the gap between what we measure and what actually matters to users has always bothered me. We optimize for algorithms, but algorithms are trying to approximate what humans want. TRIBE v2 skips the algorithm entirely. It predicts the human response directly.

Is it perfect? No. Is it a useful signal? I believe so. At minimum, it's a fundamentally different lens on content quality, one grounded in neuroscience rather than keyword density.

#### Try it

NeuralSEO is free and open:

If you find it useful or have feedback, reach out:

- Newsletter: metehanai.substack.com
- Website: metehan.ai
- X: @metehan777
- LinkedIn: metehanyesilyurt

NeuralSEO is built on Meta AI's TRIBE v2 (CC BY-NC 4.0). For non-commercial use only.

---

### I Turned 16 Months of Google Search Console Data Into a Vector Database.

URL: https://metehan.ai/blog/i-turned-16-months-of-google-search-console-data-into-a-vector-database-heres-what-i-learned/
Date: 2026-03-18

I use OpenClaw as my daily SEO automation agent. It handles a lot of the repetitive work for me, but I kept running into the same limitation: OpenClaw is great at executing tasks and running skills, but it doesn't have deep awareness of my historical search performance. It knows what I tell it in the moment. It doesn't know that a query cluster has been declining for three months or that a page I published in January is now cannibalizing an older one. It tries to work with the GSC API, but it crashes and then starts working on "made up" data by itself...

So I built a tool that pulls 16 months of GSC data, embeds it into a local ChromaDB vector database, and lets me ask questions in plain English using Gemini, Grok, or Claude. I also wired up Parallel.ai to scrape competitor pages so the AI can tell me what content I'm missing.

The tool works. But building it taught me more about when vector databases make sense and when they don't than any tutorial ever could.

#### What I Built

The pipeline is straightforward:

1. Extract all GSC data via the API (queries, pages, clicks, impressions, CTR, position, dates)
2. Aggregate the raw rows into query-page pair documents with computed trends (rising, declining, stable, new)
3. Embed everything using Gemini's embedding model and store it in ChromaDB
4. Query the vector database semantically and feed the results to an LLM for analysis

I added three LLM providers because they each bring something different. Gemini Flash is fast and free, good for quick checks.
Grok has a 2M token context window, so I can send it 400 documents from the vector DB instead of 50. Claude tends to give more strategic, nuanced recommendations.

For deeper analysis, I integrated Parallel.ai's search and extract APIs. This lets me scrape my own pages and competitor pages, then feed everything to the AI for a side-by-side content gap analysis. The GSC data tells me how I rank. The scraped content tells me why.

#### The Honest Problem With This Approach

Here's the thing I don't see people talk about enough: GSC data is structured. It's rows and columns. Queries, numbers, dates. This is exactly what SQL databases were designed for.

When I ask the vector database "find queries with high impressions but low CTR," it doesn't actually do math. It embeds that sentence and finds documents whose text is semantically similar to it. That's a fundamentally different operation than SELECT * FROM gsc WHERE impressions > 1000 AND ctr < 0.03.

The vector DB might return a query with 200 impressions because its text happened to be close in embedding space. It might miss a query with 50,000 impressions because the text representation didn't match. There's no guarantee of numerical correctness.

For aggregations like "top 10 pages by clicks" or "average CTR by device type," SQL would give me the exact right answer every time. The vector DB gives me a best-effort approximation based on text similarity.

#### Where the Vector DB Actually Helps

That said, there are things the vector DB does that SQL simply can't.

If I ask "what content about AI is performing on my site?", the vector DB finds queries like "neural network tutorial," "transformer architecture explained," and "deep learning vs machine learning." None of those contain the words "AI," but they're all semantically related. To get this from SQL, I'd need to manually build keyword lists for every possible topic. That doesn't scale.

The natural language interface is genuinely useful.
I can ask vague, exploratory questions like "what's happening with my technical SEO content?" and get relevant data back. With SQL, I'd need to know exactly what I'm looking for before I can write the query.

The vector DB also surfaces connections I wouldn't think to look for. When related queries cluster together in embedding space, patterns emerge that I'd miss scrolling through spreadsheets.

#### The Comparison Nobody Makes

I keep seeing people build GSC MCP servers for real-time lookups. That works for quick checks, but you can't do bulk historical analysis across 16 months of data through an MCP. Every question is an API call, and you hit rate limits fast on complex analyses.

Here's how the three approaches actually compare:

The honest answer is that the ideal setup would combine SQL for exact numerical queries with a vector DB for semantic discovery, with an LLM layer on top of both. My tool leans into the vector DB side, which makes it strong for exploratory analysis but less precise for exact metric filtering.

#### What Actually Matters

The part of this project that adds the most value isn't the vector database itself. It's the data processing pipeline.

The raw GSC API returns one row per query per page per date. For a site with decent traffic over 16 months, that's millions of rows. My data processor aggregates all of that into meaningful documents: total clicks, average CTR, weighted average position, and a trend classification based on comparing recent performance to historical performance.

That aggregation step turns noise into signal before anything touches the vector DB or the LLM. Without it, you'd be embedding raw API rows, which is mostly useless.

The second most valuable piece is the Parallel.ai integration. GSC data only tells you what's happening. It doesn't tell you what your competitors are doing differently.
By scraping actual page content and feeding it alongside GSC metrics to the LLM, the analysis goes from "your CTR is low" to "your competitors have comparison tables and FAQ sections that you're missing."

#### How to Use It

The tool is open source: github.com/metehan777/vectordb-gsc

Setup takes a few minutes. You need a Google Cloud service account with Search Console API access, a free Gemini API key for embeddings and analysis, and optionally API keys for Grok, Claude, or Parallel.ai.

First run:

Then ask anything:

#### What I'd Do Differently

If I were starting over, I'd add DuckDB alongside ChromaDB. Use SQL for any question involving specific numbers or thresholds, and the vector DB for semantic discovery and topic clustering. The LLM would decide which backend to query based on the question.

I'd also experiment with embedding the data differently. Right now, each document is a text description like "Query: seo tools, Page: /blog/seo, Clicks: 150, Impressions: 5000." The numerical values get lost in embedding space. A better approach might be to store metrics separately as metadata and only embed the semantic content (query text, page URL, topic).

But the current version works well enough for what I actually use it for: finding patterns, discovering opportunities, and getting content recommendations that are grounded in real data rather than generic advice.

The code is MIT licensed. If you find it useful or want to improve it, contributions are welcome.

---

### I Built a 60,000-Page AI Website for $10: GPTBot Crawled It 30,000+

URL: https://metehan.ai/blog/i-built-a-60-000-page-ai-website-for-10-gptbot-crawled-it-30-000-times-in-12-hours/
Date: 2026-03-04

A wild experiment.

#### Why I Built This

Let me be clear upfront: this website is designed purely as an experiment. I wanted to observe what real traffic looks like on a large-scale programmatic SEO site and more importantly, how AI crawlers behave in the wild.
This is not a guide on how to build a sustainable business with AI-generated content. If you create programmatic SEO pages only for traffic, one of two things will happen:

- Your traffic will tank within weeks due to deindexing. Google's systems are increasingly good at detecting thin, templated content at scale.
- You'll see initial traction, then a slow bleed over a few months as manual reviews or algorithm updates catch up.

I've seen this pattern play out repeatedly across the industry. Generating pages is now so cheap that the barrier is essentially zero, which is exactly why Google has gotten aggressive about it.

So why build it? Because the interesting part isn't the SEO. It's the bots. I did not expect GPTBot to crawl a brand-new, zero-backlink domain at the scale it did. That was the real discovery.

#### The Experiment

I built StateGlobe.com, a Statista-style statistics website covering digital marketing, SEO, content marketing, and web technology across 200 countries. Every single page was generated by AI.

- 60,000 pages generated in under 30 minutes
- Total API cost: less than $10
- Model used: `gpt-4.1-nano` via OpenAI's Chat Completions API
- Hosted on Cloudflare Workers + D1 (serverless, edge-rendered)

The entire project is open source.

#### The Tech Stack

##### Content Generation Pipeline

The pipeline is straightforward:

1. Taxonomy: 300 statistical topics × 200 countries = 60,000 unique combinations
2. Generation: A Node.js script fires real-time API calls to `gpt-4.1-nano` with controlled concurrency (50 parallel requests) and a token bucket rate limiter
3. Output: Each page gets a title, meta description, 5 key statistics, 3 analysis paragraphs, and 2 FAQ items, all as structured JSON
4. Import: Results are bulk-imported into Cloudflare D1 (serverless SQLite)

The prompt asks the model to produce 2026 projections based on industry trends. Each response costs a fraction of a cent: the calls use `gpt-4.1-nano` with `max_tokens: 700` and `response_format: json_object`.
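The generation step above (a topic × country taxonomy, 50 parallel requests, and a token bucket) can be sketched roughly as follows. This is a minimal illustration in Python rather than the original Node.js script. `TokenBucket`, `fake_generate`, and `generate_all` are hypothetical names, the refill rate is an assumed value, and the model call is replaced by a local stub so the sketch runs without an API key.

```python
import asyncio
import itertools
import time


class TokenBucket:
    """Refills `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                elapsed = now - self.updated
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Wait just long enough for one token to refill.
                await asyncio.sleep((1 - self.tokens) / self.rate)


async def fake_generate(topic: str, country: str) -> dict:
    """Stub standing in for the real model call (hypothetical)."""
    await asyncio.sleep(0)  # yield to the event loop like a network call would
    return {
        "slug": f"/{country}/{topic}-statistics",
        "title": f"{topic} statistics for {country}",
    }


async def generate_all(topics, countries, concurrency=50, rate=1000.0):
    # Built inside the coroutine so the primitives bind to the running loop.
    bucket = TokenBucket(rate=rate, capacity=concurrency)
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(topic: str, country: str) -> dict:
        async with semaphore:       # at most `concurrency` calls in flight
            await bucket.acquire()  # and no faster than the bucket refills
            return await fake_generate(topic, country)

    pairs = itertools.product(topics, countries)
    return await asyncio.gather(*(worker(t, c) for t, c in pairs))


pages = asyncio.run(
    generate_all(
        ["seo-budget-allocation", "content-marketing-roi"],
        ["brazil", "japan", "turkey"],
    )
)
```

The semaphore caps how many requests are open at once, while the bucket caps how fast new ones start. Those are separate limits, which is presumably why the pipeline uses both.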
##### Cloudflare Architecture

- Cloudflare Workers: Edge-rendered HTML, no build step, no static files. Every page is assembled on-demand from D1 data
- Cloudflare D1: Serverless SQLite storing all 60,000 pages and visit analytics
- Dynamic OG Images: Generated on-the-fly as PNGs using `@resvg/resvg-wasm` with the Inter font loaded from a CDN. No pre-generated images, no storage costs
- Clean URLs: `/brazil/seo-budget-allocation-statistics`. No .html extensions, proper 404 headers
- SEO: Structured data (Article, FAQPage, BreadcrumbList), XML sitemaps (paginated), canonical URLs, internal linking (same topic across countries, same country across topics)

Total hosting cost: effectively free on Cloudflare's free tier.

#### What Happened Next: The Bots Arrived

Within minutes of deploying, GPTBot started crawling. Hard.

##### First 12 Hours

- 29,000+ requests from GPTBot alone
- GPTBot was hitting the site at roughly 1 request per second, systematically crawling through pages
- OAI-SearchBot and ChatGPT-User also showed up
- GoogleOther appeared with 60+ requests
- Googlebot, AhrefsBot, and PerplexityBot followed

##### 3-Hour Snapshot After Server-Side Tracking Enabled

By the time server-side tracking was fully operational, Cloudflare's own analytics showed 37,000+ total requests to the worker. GPTBot was by far the most aggressive crawler, more active than Googlebot by orders of magnitude.

##### The Part I Didn't Expect

I've launched plenty of sites before. I expected Googlebot to show up, maybe some SEO tool crawlers. That's normal.

What I did not expect was OpenAI's GPTBot hitting a brand-new domain, one with zero backlinks, zero social shares, and no Search Console submission, at 1 request per second within minutes of deployment. It found the site through the XML sitemap and just started consuming everything.

This raises serious questions. If GPTBot is this aggressive on fresh domains, how much of the open web is it processing daily? And what does it mean for site owners who haven't explicitly blocked it in robots.txt?
For context: Googlebot made 11 requests in the same period that GPTBot made 5,200+. That's roughly a 470x difference in crawl intensity.

#### Building Public Analytics

I wanted anyone to see what was happening in real time, so I built a public analytics dashboard at stateglobe.com/analytics.

##### Server-Side Tracking

Initially, I used a client-side beacon (`navigator.sendBeacon`). But most bots don't execute JavaScript, so the beacon was missing their traffic. The fix was server-side tracking:

- Every request to the Worker records `page_slug`, `user_agent`, `country` (from `cf-ipcountry`), and `is_bot` directly into D1
- Bot detection runs against a pattern list (GPTBot, Googlebot, AhrefsBot, etc.)
- `ctx.waitUntil()` ensures the D1 write completes without blocking the response
- The client-side beacon was removed entirely, leaving one clean tracking path

##### The Dashboard Shows

- Human vs. bot traffic with separate summary cards (today, this week, all time)
- Daily traffic chart (inline SVG line chart, last 30 days, human vs. bot)
- Top pages, top bots, top human user agents
- Recent visits with bot badges, paginated
- Pre-tracking estimates included in totals with clear notes

##### IP Verification: Catching Spoofed Bots

User agent strings can be spoofed by anyone. A curl request with `-A "GPTBot"` would be counted as a real bot visit. So I implemented IP verification for OpenAI bots:

- Downloaded the official IP ranges from openai.com/gptbot.json, searchbot.json, and chatgpt-user.json
- Built a CIDR matching engine directly in the Worker that parses IP ranges into bitmasks for efficient lookup
- When a request claims to be GPTBot, OAI-SearchBot, or ChatGPT-User, the source IP (from Cloudflare's `cf-connecting-ip` header) is checked against the official ranges
- If the IP doesn't match, the request is classified as human (potential spoofer)

This means the bot counts on the analytics page are verified: only requests from OpenAI's actual infrastructure count as OpenAI bot traffic.
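The bitmask check described above can be illustrated with a short sketch. It is written in Python rather than the Worker's JavaScript, it handles IPv4 only, and the function names are hypothetical. The CIDR blocks are reserved documentation ranges used as placeholders, not OpenAI's published ranges.

```python
from typing import List, Tuple

# Placeholder ranges (reserved documentation blocks), NOT OpenAI's real ranges.
PLACEHOLDER_CIDRS = ["192.0.2.0/24", "198.51.100.0/28"]

# User agent tokens that trigger IP verification, matched case-insensitively.
OPENAI_AGENTS = ("gptbot", "oai-searchbot", "chatgpt-user")


def ip_to_int(ip: str) -> int:
    """Pack a dotted-quad IPv4 address into one 32-bit integer."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def parse_cidr(cidr: str) -> Tuple[int, int]:
    """Return (network, mask) so membership is one AND plus one compare."""
    base, prefix = cidr.split("/")
    mask = (0xFFFFFFFF << (32 - int(prefix))) & 0xFFFFFFFF
    return ip_to_int(base) & mask, mask


# Parse once at startup; each request then costs only integer operations.
RANGES: List[Tuple[int, int]] = [parse_cidr(c) for c in PLACEHOLDER_CIDRS]


def in_ranges(ip: str, ranges: List[Tuple[int, int]]) -> bool:
    value = ip_to_int(ip)
    return any((value & mask) == network for network, mask in ranges)


def classify(user_agent: str, source_ip: str, ranges=RANGES) -> str:
    """'bot' if an OpenAI user agent comes from a listed range, 'human' if the
    user agent claims OpenAI but the IP does not match, 'other' otherwise."""
    claims_openai = any(name in user_agent.lower() for name in OPENAI_AGENTS)
    if not claims_openai:
        return "other"  # left to the general pattern-list detection
    return "bot" if in_ranges(source_ip, ranges) else "human"
```

With these placeholder ranges, `classify("GPTBot/1.0", "203.0.113.9")` is treated as a spoofed request and classified as human, because the address falls outside both blocks.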
I verified this works: a curl request from Turkey with the GPTBot user agent correctly shows up as a human visit, not a bot.

#### Hiding Content from AI Crawlers

Here's an interesting twist. I added an "About This Experiment" box on the homepage explaining that the site is AI-generated. But I didn't want GPTBot to read it and potentially use it to discount the content.

The solution: render it client-side only. The HTML source contains an empty
and a