# Metehan Yesilyurt - metehan.ai (Full Content)

> AI Search and SEO researcher. Publishing original research on LLM visibility, AI-powered search engines, and the next era of discoverability.

## Links

- Website: https://metehan.ai
- X: https://x.com/metehan777
- LinkedIn: https://www.linkedin.com/in/metehanyesilyurt
- GitHub: https://github.com/metehan777
- Substack: https://metehanai.substack.com
- RSS: https://metehan.ai/rss.xml
- Sitemap: https://metehan.ai/sitemap-index.xml
- llms.txt: https://metehan.ai/llms.txt

## Full Articles (171)

### I Built a 60,000-Page AI Website for $10: GPTBot Crawled It 30,000+ Times in 12 Hours

URL: https://metehan.ai/blog/i-built-a-60-000-page-ai-website-for-10-gptbot-crawled-it-30-000-times-in-12-hours/
Date: 2026-03-04
Category: experiment

A wild experiment.

## Why I Built This

Let me be clear upfront: **this website is designed purely as an experiment.** I wanted to observe what real traffic looks like on a large-scale programmatic SEO site and, more importantly, how AI crawlers behave in the wild.

This is not a guide on how to build a sustainable business with AI-generated content. If you create programmatic SEO pages only for traffic, one of two things will happen:

1. **Your traffic will tank within weeks** due to deindexing. Google's systems are increasingly good at detecting thin, templated content at scale.
2. **You'll see initial traction, then a slow bleed over a few months** as manual reviews or algorithm updates catch up.

I've seen this pattern play out repeatedly across the industry. The economics of generating pages are now so cheap that the barrier is essentially zero, which is exactly why Google has gotten aggressive about it.

So why build it? Because **the interesting part isn't the SEO. It's the bots.** I did not expect GPTBot to crawl a brand-new, zero-backlink domain at the scale it did. That was the real discovery.
## The Experiment

I built [StateGlobe.com](https://stateglobe.com), a Statista-style statistics website covering digital marketing, SEO, content marketing, and web technology across 200 countries. Every single page was generated by AI.

* **60,000 pages** generated in under 30 minutes
* **Total API cost: less than $10**
* **Model used: `gpt-4.1-nano`** via OpenAI's Chat Completions API
* **Hosted on Cloudflare Workers + D1** (serverless, edge-rendered)

The entire project is open source.

## The Tech Stack

### Content Generation Pipeline

The pipeline is straightforward:

1. **Taxonomy**: 300 statistical topics × 200 countries = 60,000 unique combinations
2. **Generation**: A Node.js script fires real-time API calls to `gpt-4.1-nano` with controlled concurrency (50 parallel requests) and a token bucket rate limiter
3. **Output**: Each page gets a title, meta description, 5 key statistics, 3 analysis paragraphs, and 2 FAQ items, all as structured JSON
4. **Import**: Results are bulk-imported into Cloudflare D1 (serverless SQLite)

The prompt asks the model to produce 2026 projections based on industry trends. Each response costs a fraction of a cent: the calls use `gpt-4.1-nano` with `max_tokens: 700` and `response_format: json_object`.

### Cloudflare Architecture

* **Cloudflare Workers**: Edge-rendered HTML, no build step, no static files. Every page is assembled on-demand from D1 data
* **Cloudflare D1**: Serverless SQLite storing all 60,000 pages and visit analytics
* **Dynamic OG Images**: Generated on-the-fly as PNGs using `@resvg/resvg-wasm` with the Inter font loaded from a CDN. No pre-generated images, no storage costs
* **Clean URLs**: `/brazil/seo-budget-allocation-statistics`, with no `.html` extensions and proper 404 responses
* **SEO**: Structured data (Article, FAQPage, BreadcrumbList), XML sitemaps (paginated), canonical URLs, internal linking (same topic across countries, same country across topics)

Total hosting cost: effectively free on Cloudflare's free tier.
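The generation step in the pipeline above pairs bounded concurrency with a token bucket rate limiter. Here is a minimal TypeScript sketch of that combination. It is not the project's actual script: the names `TokenBucket` and `runAll`, the injectable clock, and any capacity or refill values are assumptions made for illustration.

```typescript
// Hypothetical token bucket: holds up to `capacity` tokens and refills
// continuously at `refillPerSecond`. The clock is injectable for testing.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full so the first burst goes out immediately
    this.lastRefillMs = now();
  }

  // Add the tokens earned since the last call, capped at capacity.
  private refill(): void {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }

  // Consume one token if available; return whether the caller may proceed.
  tryTake(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  // Milliseconds until at least one token is available (0 if one already is).
  msUntilNextToken(): number {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000);
  }
}

// Run `worker` over all items with at most `concurrency` calls in flight,
// each call waiting for a token first. Results keep the input order.
async function runAll<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency: number,
  bucket: TokenBucket,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      while (!bucket.tryTake()) {
        await new Promise((resolve) => setTimeout(resolve, bucket.msUntilNextToken()));
      }
      results[index] = await worker(items[index] as T);
    }
  }
  const laneCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With the figures from the post, a caller would pass `concurrency = 50`. The bucket's capacity and refill rate would be picked to stay under the account's API rate limit, which the post does not state.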
## What Happened Next: The Bots Arrived

Within minutes of deploying, **GPTBot** started crawling. Hard.

### First 12 Hours

* **29,000+ requests from GPTBot alone**
* GPTBot was hitting the site at roughly **1 request per second**, systematically crawling through pages
* OAI-SearchBot and ChatGPT-User also showed up
* GoogleOther appeared with 60+ requests
* Googlebot, AhrefsBot, and PerplexityBot followed

### 3-Hour Snapshot After Server-Side Tracking Was Enabled

| Bot           | Requests |
| ------------- | -------- |
| GPTBot        | 5,200+   |
| GoogleOther   | 140+     |
| OAI-SearchBot | 94       |
| Googlebot     | 11       |
| AhrefsBot     | 7        |
| PerplexityBot | 2        |
| ChatGPT-User  | 1        |

By the time server-side tracking was fully operational, Cloudflare's own analytics showed **37,000+ total requests** to the worker.

![39k requests, almost all by GPTBot](/wp-content/uploads/39k-requests.png)

GPTBot was by far the most aggressive crawler, more active than Googlebot by orders of magnitude.

### The Part I Didn't Expect

I've launched plenty of sites before. I expected Googlebot to show up, maybe some SEO tool crawlers. That's normal.

What I did **not** expect was OpenAI's GPTBot hitting a brand-new domain (zero backlinks, zero social shares, no Search Console submission) at **1 request per second** within minutes of deployment. It found the site through the XML sitemap and just started consuming everything.

This raises serious questions. If GPTBot is this aggressive on fresh domains, how much of the open web is it processing daily? And what does it mean for site owners who haven't explicitly blocked it in `robots.txt`?

For context: Googlebot made 11 requests in the same period that GPTBot made 5,200+. That's roughly a **470x difference** in crawl intensity.

## Building Public Analytics

I wanted anyone to see what was happening in real time, so I built a **public analytics dashboard** at [stateglobe.com/analytics](https://stateglobe.com/analytics).
### Server-Side Tracking

Initially, I used a client-side beacon (`navigator.sendBeacon`). But bots don't execute JavaScript, so I was missing all bot traffic. The fix was server-side tracking:

* Every request to the Worker records `page_slug`, `user_agent`, `country` (from `cf-ipcountry`), and `is_bot` directly into D1
* Bot detection runs against a pattern list (GPTBot, Googlebot, AhrefsBot, etc.)
* `ctx.waitUntil()` ensures the D1 write completes without blocking the response
* The client-side beacon was removed entirely, leaving one clean tracking path

### The Dashboard Shows

* **Human vs. Bot traffic** with separate summary cards (today, this week, all time)
* **Daily traffic chart** (inline SVG line chart, last 30 days, human vs. bot)
* **Top pages**, **Top bots**, **Top human user agents**
* **Recent visits** with bot badges, paginated
* Pre-tracking estimates included in totals with clear notes

## IP Verification: Catching Spoofed Bots

User agent strings can be spoofed by anyone. A `curl` request with `-A "GPTBot"` would be counted as a real bot visit. So I implemented **IP verification for OpenAI bots**:

1. Downloaded the official IP ranges from [openai.com/gptbot.json](https://openai.com/gptbot.json), [searchbot.json](https://openai.com/searchbot.json), and [chatgpt-user.json](https://openai.com/chatgpt-user.json)
2. Built a CIDR matching engine directly in the Worker that parses IP ranges into bitmasks for efficient lookup
3. When a request claims to be GPTBot, OAI-SearchBot, or ChatGPT-User, the source IP (from Cloudflare's `cf-connecting-ip` header) is checked against the official ranges
4. **If the IP doesn't match, the request is classified as human** (potential spoofer)

This means the bot counts on the analytics page are verified: only requests from OpenAI's actual infrastructure count as OpenAI bot traffic.

I verified this works: a `curl` request from Turkey with the GPTBot user agent correctly shows up as a human visit, not a bot.
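The verification steps above can be sketched in TypeScript roughly as follows. This is an illustration under assumptions, not the site's actual Worker code: the function names, the `Classification` labels, and the example ranges (RFC 5737 documentation blocks standing in for OpenAI's published lists) are all hypothetical, and the sketch handles IPv4 only.

```typescript
// Placeholder ranges for illustration only (RFC 5737 documentation blocks).
// Real values would come from the JSON files OpenAI publishes.
const EXAMPLE_OPENAI_CIDRS: string[] = ["192.0.2.0/24", "198.51.100.16/28"];

const OPENAI_UA_PATTERN = /GPTBot|OAI-SearchBot|ChatGPT-User/i;
const OTHER_BOT_PATTERN = /Googlebot|GoogleOther|AhrefsBot|PerplexityBot/i;

interface CidrRange {
  base: number; // network address as an unsigned 32-bit integer
  mask: number; // bitmask with the prefix bits set
}

type Classification = "verified-openai-bot" | "other-bot" | "human";

// Convert a dotted-quad IPv4 string to an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// Parse "a.b.c.d/n" into a base address and bitmask, or null if malformed.
function parseCidr(cidr: string): CidrRange | null {
  const [addr, prefixText] = cidr.split("/");
  if (addr === undefined || prefixText === undefined || !/^\d{1,2}$/.test(prefixText)) return null;
  const ip = ipv4ToInt(addr);
  const prefix = Number(prefixText);
  if (ip === null || prefix > 32) return null;
  // A shift by 32 is a no-op in JavaScript, so /0 needs its own case.
  const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
  return { base: (ip & mask) >>> 0, mask };
}

// True if the IP falls inside any range. Non-IPv4 input is treated as unverified.
function ipInRanges(ip: string, ranges: CidrRange[]): boolean {
  const value = ipv4ToInt(ip);
  if (value === null) return false;
  return ranges.some((range) => ((value & range.mask) >>> 0) === range.base);
}

// A claimed OpenAI user agent only counts as a bot when the source IP verifies;
// otherwise it is classified as human (a potential spoofer).
function classifyVisit(userAgent: string, ip: string, ranges: CidrRange[]): Classification {
  if (OPENAI_UA_PATTERN.test(userAgent)) {
    return ipInRanges(ip, ranges) ? "verified-openai-bot" : "human";
  }
  return OTHER_BOT_PATTERN.test(userAgent) ? "other-bot" : "human";
}

const OPENAI_RANGES: CidrRange[] = EXAMPLE_OPENAI_CIDRS
  .map((cidr) => parseCidr(cidr))
  .filter((range): range is CidrRange => range !== null);
```

In a Worker, `classifyVisit` would be called with the request's `user-agent` header and the `cf-connecting-ip` value. Parsing the ranges once at module load keeps each per-request check to a handful of integer comparisons.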
## Hiding Content from AI Crawlers

Here's an interesting twist. I added an "About This Experiment" box on the homepage explaining that the site is AI-generated. But I didn't want GPTBot to read it and potentially use it to discount the content.

The solution: **render it client-side only**. The HTML source contains an empty `
` and a `