# Metehan Yesilyurt — metehan.ai (Full Content)

> AI Search and SEO researcher. Publishing original research on LLM visibility, AI-powered search engines, and the next era of discoverability.

## Links

- Website: https://metehan.ai
- X: https://x.com/metehan777
- LinkedIn: https://www.linkedin.com/in/metehanyesilyurt
- GitHub: https://github.com/metehan777
- Substack: https://metehanai.substack.com
- RSS: https://metehan.ai/rss.xml
- Sitemap: https://metehan.ai/sitemap-index.xml
- llms.txt: https://metehan.ai/llms.txt

## Full Articles (173)

### How I Turned Meta AI's Brain Scanner Model Into a Free SEO Tool

URL: https://metehan.ai/blog/how-i-turned-meta-ais-brain-scanner-model-into-a-free-seo-tool/
Date: 2026-03-30
Category: featured-research

## Using fMRI Brain Data to Score Content Before Publishing

The tool tells you how the brain actually responds to your titles, intros, and SERP screenshots before you publish anything. You can try it here: [NeuralSEO on Hugging Face](https://huggingface.co/spaces/metehan777/neuralseo)

## The problem with current SEO tools & AI

Every SEO tool or AI model on the market measures the same thing: what already happened. Rankings, clicks, impressions, keyword difficulty. All lagging indicators. You publish, wait, and hope. AI models try to predict whatever you ask for, but they're blind to the reader. The real question has always been: will a human brain actually pay attention to this?

That's not a metaphor. It's literally measurable. And now, thanks to neuroscience research from Meta AI, we can predict it before publishing.

## What is Meta AI TRIBE v2?

[TRIBE v2](https://huggingface.co/facebook/tribev2) stands for TRImodal Brain Encoder v2. It's a foundation model from Meta AI's FAIR lab. It was trained on fMRI recordings (actual brain scans) collected while 700+ volunteers watched videos, listened to audio, and read text. It's a multilingual model.

![](/wp-content/uploads/screenshot-at-mar-30-11-39-18.png "TRIBE v2 predicts fMRI activity in the well known language network. This approach can effectively identify the brain areas typically associated with...")

The model learned to predict how the human cortex responds to any input across roughly 20,000 cortical vertices. Feed it a sentence, and it tells you which brain regions activate, how strongly, and for how long.

I looked at this and immediately thought: this could be useful for SEO. And of course, **it's experimental.**

## What NeuralSEO does

I built three core analysis tools on top of TRIBE v2.

### 1. Neural Screenshot Analyzer

Upload a screenshot of a Google SERP, ChatGPT response, Perplexity answer, or Google AI Mode result. The tool splits the screenshot into layout regions like title, snippet, sidebar and so on. It crops each region and runs it through TRIBE v2's visual inference pipeline. It scores each element by neural attention activation. Then it draws live scored overlays directly on the image so you can see exactly which parts of the page grab the brain's attention.

This is the closest thing to eye-tracking without actual eye-tracking hardware.

### 2. Intro Paragraph Analyzer

Paste your opening paragraph (auto-trimmed to 600 characters). TRIBE v2 scores it across four neural dimensions:

* Hook Strength: does the opening trigger frontal attention networks?
* Engagement: global neural activation level
* Salience: does it stand out from noise?
* Retention: will the reader's brain encode this into memory?
You get a 0 to 100 neural score, a radar chart breakdown, and optional Gemini-powered rewrite recommendations.

![](/wp-content/uploads/title-experiment.png)

### 3. Neural CTR Predictor

Enter a keyword. Gemini generates 10 to 20 dynamic title tag variants. Each title runs through TRIBE v2 individually, scored by frontal attention network activation and salience response. You get a ranked list of predicted organic CTR before you publish, without A/B testing.

## How brain signals map to SEO signals

Here's how TRIBE v2's brain activation patterns map to SEO-relevant signals:

| Neural Signal                            | SEO Meaning                             |
| ---------------------------------------- | --------------------------------------- |
| Language comprehension activation        | Readability and clarity                 |
| Frontal attention networks               | Will readers stay or bounce?            |
| Activation entropy (spatial complexity)  | E-E-A-T proxy: expert vs. thin content  |
| Salience network                         | Does your title demand attention?       |
| Default Mode Network (inverse)           | Mind-wandering risk = bounce rate risk  |

These aren't traditional SEO metrics. They're neurological proxies, directional signals based on how the human brain processes content.

## Technical architecture

The stack runs on Hugging Face Spaces with GPU allocation:

* Model: facebook/tribev2, Meta's trimodal brain encoder
* Inference: text goes through TTS audio, then word-level timestamps via faster-whisper, then TRIBE v2 fMRI prediction (a minimal sketch of this step appears just before the Try it section below)
* Visual pipeline: an image becomes a short MP4 video via moviepy, then goes through TRIBE v2 visual inference
* Title generation: Google Gemini Flash generates dynamic variants and TRIBE v2 scores them
* Frontend: Gradio with a custom dark theme and a procedural Three.js brain visualization (the brain part still needs development)
* Brain viewer: procedural mesh with 5 cortical regions that light up based on actual analysis scores

The text pipeline is particularly interesting. TRIBE v2 was trained on multimodal data, so even for text analysis, the input goes through a TTS step to generate audio, which is then transcribed with word-level timestamps. This gives the model the temporal dynamics it needs to predict brain activation patterns over time.

![](/wp-content/uploads/meta-tribe-v2.png)

## Limitations

This needs to be said clearly.

Neural scores are directional signals, not ground-truth ranking guarantees. Google's ranking algorithm doesn't use fMRI data (or does it?); it works from different signals entirely.

GPU quotas are real. On Hugging Face's free tier, large batches can time out. Use smaller inputs when possible.

The first request is slow. The TRIBE v2 model weighs around 6 GB and loads on the first inference call.

Non-commercial use only. TRIBE v2 is licensed CC BY-NC 4.0.

![](/wp-content/uploads/screencapture-huggingface-co-spaces-metehan777-neuralseo-2026-03-29-23_46_01.png)

## Why I built this

I've been in SEO for years, and the gap between what we measure and what actually matters to users has always bothered me. We optimize for algorithms, but algorithms are trying to approximate what humans want. TRIBE v2 skips the algorithm entirely. It predicts the human response directly.

Is it perfect? No. Is it a useful signal? I believe so. At minimum, it's a fundamentally different lens on content quality, one grounded in neuroscience rather than keyword density.
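As mentioned under Technical architecture, here is a minimal sketch of the word-timestamp step of the text pipeline. It assumes faster-whisper is installed; the TTS input file and the TRIBE v2 call are stand-ins, not the actual NeuralSEO inference code.

```python
# Sketch of the text pipeline: text -> TTS audio -> word timestamps -> TRIBE v2.
# Only the faster-whisper call below is a real API; the rest is stubbed out.
from faster_whisper import WhisperModel

def words_with_timestamps(audio_path: str) -> list[tuple[str, float, float]]:
    """Transcribe the TTS audio and return (word, start_sec, end_sec) tuples."""
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    timed_words = []
    for segment in segments:
        for word in segment.words:
            timed_words.append((word.word, word.start, word.end))
    return timed_words

if __name__ == "__main__":
    # "intro_paragraph_tts.wav" is a hypothetical TTS output file.
    words = words_with_timestamps("intro_paragraph_tts.wav")
    # Hypothetical downstream step: feed the timed words (plus the audio) to
    # TRIBE v2 and read off predicted activation per cortical vertex.
    # activation = tribe_v2.predict(audio="intro_paragraph_tts.wav", words=words)
```

The point is only to show why the pipeline needs per-word timing: TRIBE v2 predicts activation over time, so it has to know when each word arrives.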
## Try it

NeuralSEO is free and open. If you find it useful or have feedback, reach out:

* Newsletter: [metehanai.substack.com](https://metehanai.substack.com)
* Website: [metehan.ai](https://metehan.ai)
* X: [@metehan777](https://x.com/metehan777)
* LinkedIn: [metehanyesilyurt](https://linkedin.com/in/metehanyesilyurt)

- - -

NeuralSEO is built on Meta AI's TRIBE v2 (CC BY-NC 4.0). For non-commercial use only.

---

### I Turned 16 Months of Google Search Console Data Into a Vector Database. Here's What I Learned.

URL: https://metehan.ai/blog/i-turned-16-months-of-google-search-console-data-into-a-vector-database-heres-what-i-learned/
Date: 2026-03-18
Category: experiment

I use [OpenClaw](https://github.com/openclaw/openclaw) as my daily SEO automation agent. It handles a lot of the repetitive work for me, but I kept running into the same limitation: OpenClaw is great at executing tasks and running skills, but it doesn't have deep awareness of my historical search performance. It knows what I tell it in the moment. It doesn't know that a query cluster has been declining for three months or that a page I published in January is now cannibalizing an older one. It tries to work with the GSC API, but when those calls crash it starts working from "made up" data on its own...

So I built a tool that pulls 16 months of GSC data, embeds it into a local ChromaDB vector database, and lets me ask questions in plain English using Gemini, Grok, or Claude. I also wired up Parallel.ai to scrape competitor pages so the AI can tell me what content I'm missing.

The tool works. But building it taught me more about when vector databases make sense and when they don't than any tutorial ever could.

![](/wp-content/uploads/google-search-console-cli-vectordb-2.png)

## What I Built

The pipeline is straightforward:

1. Extract all GSC data via the API (queries, pages, clicks, impressions, CTR, position, dates)
2. Aggregate the raw rows into query-page pair documents with computed trends (rising, declining, stable, new)
3. Embed everything using Gemini's embedding model and store it in ChromaDB
4. Query the vector database semantically and feed the results to an LLM for analysis

I added three LLM providers because they each bring something different. Gemini Flash is fast and free, good for quick checks. Grok has a 2M token context window, so I can send it 400 documents from the vector DB instead of 50. Claude tends to give more strategic, nuanced recommendations.

For deeper analysis, I integrated Parallel.ai's search and extract APIs. This lets me scrape my own pages and competitor pages, then feed everything to the AI for a side-by-side content gap analysis. The GSC data tells me how I rank. The scraped content tells me why.

## The Honest Problem With This Approach

Here's the thing I don't see people talk about enough: GSC data is structured. It's rows and columns. Queries, numbers, dates. This is exactly what SQL databases were designed for.

When I ask the vector database "find queries with high impressions but low CTR," it doesn't actually do math. It embeds that sentence and finds documents whose text is semantically similar to it. That's a fundamentally different operation than `SELECT * FROM gsc WHERE impressions > 1000 AND ctr < 0.03`.

The vector DB might return a query with 200 impressions because its text happened to be close in embedding space. It might miss a query with 50,000 impressions because the text representation didn't match. There's no guarantee of numerical correctness.
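To make that difference concrete, here is roughly what the two lookups look like side by side. This is a sketch, not the repo's actual code: the collection name, persistence path, and table schema are assumptions.

```python
# Semantic lookup: the question itself is embedded and matched by similarity.
# Nothing in this call filters on the actual numbers.
import chromadb

client = chromadb.PersistentClient(path="./gsc_vectors")   # path is an assumption
collection = client.get_collection("gsc_documents")        # name is an assumption
fuzzy_hits = collection.query(
    query_texts=["queries with high impressions but low CTR"],
    n_results=20,
)

# Exact lookup: the same question as a numeric filter. This returns precisely
# the rows that satisfy the thresholds, assuming the same data lives in SQLite.
import sqlite3

conn = sqlite3.connect("gsc.db")                            # hypothetical SQL export
exact_hits = conn.execute(
    "SELECT query, page, impressions, ctr FROM gsc "
    "WHERE impressions > 1000 AND ctr < 0.03"
).fetchall()
```

The first call can surface anything whose text sounds like the question; only the second is guaranteed to respect the thresholds.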
For aggregations like "top 10 pages by clicks" or "average CTR by device type," SQL would give me the exact right answer every time. The vector DB gives me a best-effort approximation based on text similarity.

## Where the Vector DB Actually Helps

That said, there are things the vector DB does that SQL simply can't.

If I ask "what content about AI is performing on my site?", the vector DB finds queries like "neural network tutorial," "transformer architecture explained," and "deep learning vs machine learning." None of those contain the words "AI," but they're all semantically related. To get this from SQL, I'd need to manually build keyword lists for every possible topic. That doesn't scale.

The natural language interface is genuinely useful. I can ask vague, exploratory questions like "what's happening with my technical SEO content?" and get relevant data back. With SQL, I'd need to know exactly what I'm looking for before I can write the query.

The vector DB also surfaces connections I wouldn't think to look for. When related queries cluster together in embedding space, patterns emerge that I'd miss scrolling through spreadsheets.

## The Comparison Nobody Makes

I keep seeing people build GSC MCP servers for real-time lookups. That works for quick checks, but you can't do bulk historical analysis across 16 months of data through an MCP. Every question is an API call, and you hit rate limits fast on complex analyses.

Here's how the three approaches actually compare:

|                             | Vector DB                 | GSC MCP Server        | SQL DB                   |
| --------------------------- | ------------------------- | --------------------- | ------------------------ |
| Data freshness              | Stale (needs refresh)     | Real-time             | Stale (needs import)     |
| Numeric precision           | Fuzzy                     | Exact                 | Exact                    |
| Semantic search             | Yes                       | No                    | No                       |
| Natural language queries    | Yes                       | No                    | No                       |
| 16 months of history        | Yes                       | Limited by API quotas | Yes                      |
| Speed                       | Instant (local)           | Slow (API calls)      | Instant (local)          |
| Competitor content analysis | Yes (via Parallel.ai)     | No                    | No                       |
| Best for                    | Discovery and exploration | Live quick lookups    | Precise metric filtering |

The honest answer is that the ideal setup would combine SQL for exact numerical queries with a vector DB for semantic discovery, with an LLM layer on top of both. My tool leans into the vector DB side, which makes it strong for exploratory analysis but less precise for exact metric filtering.

## What Actually Matters

The part of this project that adds the most value isn't the vector database itself. It's the data processing pipeline.

The raw GSC API returns one row per query per page per date. For a site with decent traffic over 16 months, that's millions of rows. My data processor aggregates all of that into meaningful documents: total clicks, average CTR, weighted average position, and a trend classification based on comparing recent performance to historical performance.

That aggregation step turns noise into signal before anything touches the vector DB or the LLM. Without it, you'd be embedding raw API rows, which is mostly useless.

The second most valuable piece is the Parallel.ai integration. GSC data only tells you what's happening. It doesn't tell you what your competitors are doing differently. By scraping actual page content and feeding it alongside GSC metrics to the LLM, the analysis goes from "your CTR is low" to "your competitors have comparison tables and FAQ sections that you're missing."
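To illustrate the aggregation step described above, here is a minimal sketch assuming the raw GSC rows are already in a pandas DataFrame with `date`, `query`, `page`, `clicks`, `impressions`, and `position` columns. The column names and the 28-day/10% trend thresholds are my own assumptions, not the repo's exact logic.

```python
import pandas as pd

def classify_trend(group: pd.DataFrame) -> str:
    """Compare the last 28 days of clicks to the older baseline for one query-page pair."""
    cutoff = group["date"].max() - pd.Timedelta(days=28)
    recent = group.loc[group["date"] >= cutoff, "clicks"]
    older = group.loc[group["date"] < cutoff, "clicks"]
    if older.empty:
        return "new"
    if recent.empty:
        return "declining"
    change = (recent.mean() - older.mean()) / max(older.mean(), 1.0)
    if change > 0.10:
        return "rising"
    if change < -0.10:
        return "declining"
    return "stable"

def aggregate(rows: pd.DataFrame) -> pd.DataFrame:
    """Collapse one-row-per-query-per-page-per-date data into one document per pair."""
    docs = []
    for (query, page), group in rows.groupby(["query", "page"]):
        impressions = group["impressions"].sum()
        docs.append({
            "query": query,
            "page": page,
            "clicks": group["clicks"].sum(),
            "impressions": impressions,
            "avg_ctr": group["clicks"].sum() / max(impressions, 1),
            "avg_position": (group["position"] * group["impressions"]).sum() / max(impressions, 1),
            "trend": classify_trend(group),
        })
    return pd.DataFrame(docs)
```

Each resulting row is what gets turned into a text document and embedded, which is why the aggregation matters more than the database choice.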
![](/wp-content/uploads/vector-database-gsc.png)

## How to Use It

The tool is open source: [github.com/metehan777/vectordb-gsc](https://github.com/metehan777/vectordb-gsc)

Setup takes a few minutes. You need a Google Cloud service account with Search Console API access, a free Gemini API key for embeddings and analysis, and optionally API keys for Grok, Claude, or Parallel.ai.

```bash
git clone https://github.com/metehan777/vectordb-gsc.git
cd vectordb-gsc
pip install -r requirements.txt
```

First run:

```bash
python main.py extract   # pulls 16 months of data
python main.py process   # embeds into ChromaDB
```

Then ask anything:

```bash
python main.py ask "which queries are declining?" --grok
python main.py audit "https://yoursite.com/page/" --grok
python main.py compete "your target keyword" --claude
```

## What I'd Do Differently

If I were starting over, I'd add DuckDB alongside ChromaDB. Use SQL for any question involving specific numbers or thresholds, and the vector DB for semantic discovery and topic clustering. The LLM would decide which backend to query based on the question.

I'd also experiment with embedding the data differently. Right now, each document is a text description like "Query: seo tools, Page: /blog/seo, Clicks: 150, Impressions: 5000." The numerical values get lost in embedding space. A better approach might be to store metrics separately as metadata and only embed the semantic content (query text, page URL, topic).

But the current version works well enough for what I actually use it for: finding patterns, discovering opportunities, and getting content recommendations that are grounded in real data rather than generic advice.

The code is MIT licensed. If you find it useful or want to improve it, contributions are welcome.

---

### I Built a 60,000-Page AI Website for $10: GPTBot Crawled It 30,000+ Times in 12 Hours

URL: https://metehan.ai/blog/i-built-a-60-000-page-ai-website-for-10-gptbot-crawled-it-30-000-times-in-12-hours/
Date: 2026-03-04
Category: experiment

A wild experiment.

## Why I Built This

Let me be clear upfront: **this website is designed purely as an experiment.** I wanted to observe what real traffic looks like on a large-scale programmatic SEO site and, more importantly, how AI crawlers behave in the wild.

This is not a guide on how to build a sustainable business with AI-generated content. If you create programmatic SEO pages only for traffic, one of two things will happen:

1. **Your traffic will tank within weeks** due to deindexing; Google's systems are increasingly good at detecting thin, templated content at scale
2. **You'll see initial traction, then a slow bleed over a few months** as manual reviews or algorithm updates catch up

I've seen this pattern play out repeatedly across the industry. The economics of generating pages are now so cheap that the barrier is essentially zero, which is exactly why Google has gotten aggressive about it.

So why build it? Because **the interesting part isn't the SEO. It's the bots.** I did not expect GPTBot to crawl a brand-new, zero-backlink domain at the scale it did. That was the real discovery.

## The Experiment

I built [StateGlobe.com](https://stateglobe.com), a Statista-style statistics website covering digital marketing, SEO, content marketing, and web technology across 200 countries. Every single page was generated by AI.
* **60,000 pages** generated in under 30 minutes
* **Total API cost: less than $10**
* **Model used: `gpt-4.1-nano`** via OpenAI's Chat Completions API
* **Hosted on Cloudflare Workers + D1** (serverless, edge-rendered)

The entire project is open source.

## The Tech Stack

### Content Generation Pipeline

The pipeline is straightforward:

1. **Taxonomy**: 300 statistical topics × 200 countries = 60,000 unique combinations
2. **Generation**: A Node.js script fires real-time API calls to `gpt-4.1-nano` with controlled concurrency (50 parallel requests) and a token bucket rate limiter
3. **Output**: Each page gets a title, meta description, 5 key statistics, 3 analysis paragraphs, and 2 FAQ items, all as structured JSON
4. **Import**: Results are bulk-imported into Cloudflare D1 (serverless SQLite)

The prompt asks the model to produce 2026 projections based on industry trends. Each response costs fractions of a cent: `gpt-4.1-nano` with `max_tokens: 700` and `response_format: json_object`.

### Cloudflare Architecture

* **Cloudflare Workers**: Edge-rendered HTML, no build step, no static files. Every page is assembled on-demand from D1 data
* **Cloudflare D1**: Serverless SQLite storing all 60,000 pages and visit analytics
* **Dynamic OG Images**: Generated on-the-fly as PNGs using `@resvg/resvg-wasm` with the Inter font loaded from a CDN. No pre-generated images, no storage costs
* **Clean URLs**: `/brazil/seo-budget-allocation-statistics`, no `.html` extensions, proper 404 headers
* **SEO**: Structured data (Article, FAQPage, BreadcrumbList), XML sitemaps (paginated), canonical URLs, internal linking (same topic across countries, same country across topics)

Total hosting cost: effectively free on Cloudflare's free tier.

## What Happened Next: The Bots Arrived

Within minutes of deploying, **GPTBot** started crawling. Hard.

### First 12 Hours

* **29,000+ requests from GPTBot alone**
* GPTBot was hitting the site at roughly **1 request per second**, systematically crawling through pages
* OAI-SearchBot and ChatGPT-User also showed up
* GoogleOther appeared with 60+ requests
* Googlebot, AhrefsBot, and PerplexityBot followed

### 3-Hour Snapshot After Server-Side Tracking Was Enabled

| Bot           | Requests |
| ------------- | -------- |
| GPTBot        | 5,200+   |
| GoogleOther   | 140+     |
| OAI-SearchBot | 94       |
| Googlebot     | 11       |
| AhrefsBot     | 7        |
| PerplexityBot | 2        |
| ChatGPT-User  | 1        |

By the time server-side tracking was fully operational, Cloudflare's own analytics showed **37,000+ total requests** to the worker.

![39k requests, almost all from GPTBot](/wp-content/uploads/39k-requests.png)

GPTBot was by far the most aggressive crawler, more active than Googlebot by orders of magnitude.

### The Part I Didn't Expect

I've launched plenty of sites before. I expected Googlebot to show up, maybe some SEO tool crawlers. That's normal.

What I did **not** expect was OpenAI's GPTBot hitting a brand-new domain with zero backlinks, zero social shares, and no Search Console submission at **1 request per second** within minutes of deployment. It found the site through the XML sitemap and just started consuming everything.

This raises serious questions. If GPTBot is this aggressive on fresh domains, how much of the open web is it processing daily? And what does it mean for site owners who haven't explicitly blocked it in `robots.txt`?

For context: Googlebot made 11 requests in the same period that GPTBot made 5,200+. That's a **470x difference** in crawl intensity.
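If you want to check whether your own `robots.txt` currently lets these crawlers through, Python's standard library can tell you in a few lines. The domain below is a placeholder; point it at your own site.

```python
from urllib.robotparser import RobotFileParser

# Placeholder domain: replace with your own robots.txt URL.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

No explicit `Disallow` for these user agents means they are allowed by default, which is exactly the situation most site owners are in.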
## Building Public Analytics

I wanted anyone to see what was happening in real time, so I built a **public analytics dashboard** at [stateglobe.com/analytics](https://stateglobe.com/analytics).

### Server-Side Tracking

Initially, I used a client-side beacon (`navigator.sendBeacon`). But bots don't execute JavaScript, so I was missing all bot traffic. The fix was server-side tracking:

* Every request to the Worker records `page_slug`, `user_agent`, `country` (from `cf-ipcountry`), and `is_bot` directly into D1
* Bot detection runs against a pattern list (GPTBot, Googlebot, AhrefsBot, etc.)
* `ctx.waitUntil()` ensures the D1 write completes without blocking the response
* The client-side beacon was removed entirely, leaving one clean tracking path

### The Dashboard Shows

* **Human vs. Bot traffic** with separate summary cards (today, this week, all time)
* **Daily traffic chart** (inline SVG line chart, last 30 days, human vs. bot)
* **Top pages**, **Top bots**, **Top human user agents**
* **Recent visits** with bot badges, paginated
* Pre-tracking estimates included in totals with clear notes

## IP Verification: Catching Spoofed Bots

User agent strings can be spoofed by anyone. A `curl` request with `-A "GPTBot"` would be counted as a real bot visit. So I implemented **IP verification for OpenAI bots**:

1. Downloaded the official IP ranges from [openai.com/gptbot.json](https://openai.com/gptbot.json), [searchbot.json](https://openai.com/searchbot.json), and [chatgpt-user.json](https://openai.com/chatgpt-user.json)
2. Built a CIDR matching engine directly in the Worker that parses IP ranges into bitmasks for efficient lookup
3. When a request claims to be GPTBot, OAI-SearchBot, or ChatGPT-User, the source IP (from Cloudflare's `cf-connecting-ip` header) is checked against the official ranges
4. **If the IP doesn't match -> classified as human** (potential spoofer)

This means the bot counts on the analytics page are verified: only requests from OpenAI's actual infrastructure count as OpenAI bot traffic. I verified this works: a `curl` request from Turkey with the GPTBot user agent correctly shows up as a human visit, not a bot.
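The real matching runs as bitmask checks inside the Worker, in JavaScript. As an illustration of the same verification idea, here is a sketch in Python using the standard `ipaddress` module. The JSON field names are assumptions based on the common `prefixes`/`ipv4Prefix` layout, so adjust the parsing if OpenAI's files differ.

```python
import ipaddress
import json
import urllib.request

# The published range files for OpenAI's crawlers (same URLs referenced above).
BOT_RANGE_URLS = {
    "GPTBot": "https://openai.com/gptbot.json",
    "OAI-SearchBot": "https://openai.com/searchbot.json",
    "ChatGPT-User": "https://openai.com/chatgpt-user.json",
}

def official_networks(url: str) -> list:
    """Fetch and parse a published IP range file into network objects.

    Assumes entries shaped like {"prefixes": [{"ipv4Prefix": "..."}]};
    this is an assumption about the file format, not a documented guarantee.
    """
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    networks = []
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_verified_openai_bot(claimed_agent: str, source_ip: str) -> bool:
    """Count a request as an OpenAI bot only if its IP falls inside the official ranges."""
    url = BOT_RANGE_URLS.get(claimed_agent)
    if url is None:
        return False
    ip = ipaddress.ip_address(source_ip)
    return any(ip in network for network in official_networks(url))
```

In the Worker, the equivalent check runs against the `cf-connecting-ip` header before the visit is written to D1, which is what keeps spoofed `curl -A "GPTBot"` requests out of the bot totals.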
## Hiding Content from AI Crawlers

Here's an interesting twist. I added an "About This Experiment" box on the homepage explaining that the site is AI-generated. But I didn't want GPTBot to read it and potentially use it to discount the content.

The solution: **render it client-side only**. The HTML source contains an empty `<div>` and a `<script>` that fills it in the browser, so crawlers that don't execute JavaScript never see the disclosure.