{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Google Discover Architecture: Clusters, Classifiers, OG Tags, NAIADES - What SDK Telemetry Reveals",
  "description": "Google Discover Architecture: Clusters, Classifiers, OG Tags, NAIADES - What SDK Telemetry Reveals",
  "datePublished": "2026-02-24T00:00:00.000Z",
  "dateModified": "2026-02-24T00:00:00.000Z",
  "url": "https://metehan.ai/blog/google-discover-architecture/",
  "category": "featured-research",
  "tags": [],
  "image": "/wp-content/uploads/2026/02/discover-2.png",
  "wordCount": 2582,
  "readTime": "13 min",
  "articleBody": "Google Discover serves content to hundreds of millions of users every day, yet its internal mechanics remain largely opaque. Most SEO guidance about Discover comes from Google's own documentation or anecdotal publisher observations. In this post, I want to share a different perspective: what we can learn by examining the observable telemetry, event naming conventions, and client-side state of the Google App.\n\n\\-UPDATED-\n\n**Important note: Everything described below reflects what the client-side code reveals at a specific point in time. This is Google. They can change any of these systems, ranking signals, pipeline stages, telemetry counters, feature flags, on the server side at any moment, without any client update. What you read here is a snapshot, not a permanent blueprint. Treat it as a lens into how these systems work today, not a guarantee of how they will work tomorrow.**\n\nWhere something is an inference rather than a direct observation, I say so. I needed to handle a \"great amount of data\" and removed many parts.\n\nThink of this as reading the nutrition label on a packaged food. You cannot see the factory, but the label tells you quite a lot about what is inside.\n\n[Special thanks to Valentin Pletzer](https://www.linkedin.com/in/valentinpletzer/)\n\n[This post is a summary. I created a full dashboard. It's accessible at metehanai.substack.com's latest post.](https://metehanai.substack.com)\n\n## The Content Pipeline\n\nDiscover's content pipeline can be mapped to several observable stages, each leaving distinct telemetry traces:\n\n1. **Content Ingestion:** Google crawls and indexes content. Entity extraction assigns Knowledge Graph MIDs (`/m/xxxxx`) or (`/g/xxxxx`) to recognized topics.\n2. **Structured Data & Open Graph Parsing:** The client-side parser extracts page metadata with a defined priority order: **Schema.org JSON-LD is checked first**, followed by Open Graph tags, then Twitter Card tags, then generic HTML meta tags. 
This is a hardcoded fallback chain, not a preference. If JSON-LD has the field, OG tags for that field are never reached.\n3. **Content Classification:** Content is assigned to cluster types and classified for the feed hierarchy.\n4. **Content Filtering** — The system operates two separate filter levels: `filter_collection_status` (publisher/domain level) and `filter_entity_status` (single URL level). These are tracked as Discover-specific streamz metrics at `/client_streamz/android_gsa/discover/app_content/`.\n5. **User Interest Matching:** Content entity MIDs are matched against the user's interest profile through the NAIADES personalization system (verified subtypes below).\n6. **Ranking (Server-Side):** Ranking happens server-side. The client-side code reveals what data is packaged and sent to the server, but the actual ranking models are not observable from the client.\n7. **Feed Assembly:** Content is organized and delivered to the device via gRPC streaming, background WorkManager sync, beacon push, or cache.\n8. **Feedback Loop:** User interactions (dismissals, follows, saves) feed back into personalization. Dismissed content is permanently tombstoned.\n\nWhat is notable here is the *ordering*. The collection-level filter runs *before* interest matching and ranking. This means a publisher blocked at the collection level never even reaches the ranking stage, regardless of how relevant their content might be to a user.\n\n## Structured Data & Open Graph: What the Code Actually Parses\n\nPublishers often wonder which meta tags Discover actually uses. The decompiled parser class (`dkpg.java`, self-described as `SchemaOrg{parsedMetatags, jsonLdScripts}`) reveals the exact tags and their priority order.\n\n**The critical finding:** Schema.org JSON-LD structured data is checked **first** for title, author, and publisher — not Open Graph tags. OG tags are the fallback. 
This is hardcoded in the fallback chain logic.\n\n### Verified Fallback Chains (from the decompiled Java sources)\n\n**Title:**\n\n1. Schema.org structured data (`TEXT_TYPE_TITLE`)\n2. `og:title` (via `property` attribute)\n3. `twitter:title` (via `name` attribute)\n4. `title` (generic `name` attribute)\n\n**Author:**\n\n1. Schema.org structured data (`TEXT_TYPE_AUTHOR`)\n2. `author` (via `name` attribute)\n\n**Publisher:**\n\n1. Schema.org structured data (`TEXT_TYPE_PUBLISHER`)\n2. `og:site_name` (via `property` attribute)\n\n**Image:**\n\n1. `og:image` (via `property` attribute)\n2. `twitter:image` (via `name` attribute)\n3. `og:image:secure_url` (via `property` attribute)\n4. `twitter:image:src` (via `name` attribute)\n5. `image` (generic `name` attribute)\n\n**Language:**\n\n1. Primary language detection (execution-based)\n2. `og:locale` (via `property` attribute)\n3. `inLanguage` (Schema.org JSON-LD)\n4. Hardcoded fallback to `\"en\"` (conditional on a server config flag)\n\n### Paywall Classification (from `dkri.java:173-176`, `dkqd.java`)\n\nThe system checks paywall status in this exact order:\n\n1. First: `isAccessibleForFree` (Schema.org JSON-LD boolean) — defaults to `true` if absent\n2. Then: `article:content_tier`, whose recognized values are exactly three strings:\n\n   * `\"free\"` (the expected default)\n   * `\"metered\"` (counted as paywalled)\n   * `\"locked\"` (counted as paywalled)\n\nIf multiple `article:content_tier` values are found on the same page, the code logs a warning: `\"More than one content tier found\"` (event code 38468). Use only one value.\n\n### Blocking Meta Tags (from `dkri.java:304-416`)\n\nTwo meta tags halt the pipeline entirely with an exception (`dkma`):\n\n* **`nopagereadaloud`** triggers `DISALLOWED_FOR_READOUT`\n* **`notranslate`** triggers `DISALLOWED_FOR_TRANSLATION`\n\nWhen either is detected as a meta tag, the system throws an error and stops processing that page. 
If your CMS or translation plugin injects `notranslate` as a meta tag, your content may not enter this parsing pipeline.\n\n### JSON/OG Rewrite in Action\n\n![DonanimHaber OG Source Code](/wp-content/uploads/2026/02/og-rewrite-1-scaled.png)\n\n![DonanimHaber SERP Title](/wp-content/uploads/2026/02/title-in-serp.png)\n\nNow let's see how the same article appears in Discover:\n\n![DonanimHaber OG Discover](/wp-content/uploads/2026/02/og-rewrite.jpg)\n\n*And this is the og:image link: <https://www.donanimhaber.com/images/images/haber/202436/src_340x1912xtaalas-yapay-zek-ciplerinde-devrim-yaratabilir.jpg>*\n\n*This JPG link appears only in the og:image and twitter:image tags, not in the Schema.org markup. (Is that proven? No: the Schema.org tag points to a different, [1400x788px image](https://www.donanimhaber.com/images/images/haber/202436/1400x788taalas-yapay-zek-ciplerinde-devrim-yaratabilir.jpg). It's Google; you can draw your own conclusion.)*\n\n## Content Filtering: The Two-Level Architecture\n\nDiscover's content filtering operates at multiple levels, each tracked by its own Discover-specific streamz metric; the identified ones are below:\n\n* **Collection level** (`/client_streamz/android_gsa/discover/app_content/filter_collection_status`) — blocks content from a publisher/domain. Parameterized by `reason`.\n* **Entity level** (`/client_streamz/android_gsa/discover/app_content/filter_entity_status`) — blocks a single URL. Parameterized by `reason`.\n\nWhen a user selects \"Don't show content from \\[Publisher]\" in the card menu, this triggers the collection-level filter. A single article that generates enough negative feedback can suppress an entire publication: the reaction applies to all content from that domain, not just the triggering article, and the suppression can extend to publisher-level filtering.\n\n**Important caveat:** These are client-side telemetry counters. 
They confirm the filter mechanisms exist and are tracked, but the exact server-side thresholds, recovery mechanisms, and how \"reason\" values map to user actions are not observable from the client. Google can change these configurations at any time without a client update.\n\n### Tombstoning (from `bska.java`)\n\nDismissed content is permanently recorded with a `tombstoned` boolean field on the content state object. The content state also tracks `stallState`, `lastKnownState` (INSERTED/REMOVED/UNKNOWN), and `purged` — creating a complete lifecycle record. Tombstoned content does not resurface.\n\n## The NAIADES Personalization System (from `fiqc.java`)\n\nThe code reveals a personalization system called NAIADES with multiple content subtypes, all confirmed as enum values:\n\n| Subtype                                                          | Enum Value | What It Suggests                                     |\n| ---------------------------------------------------------------- | ---------- | ---------------------------------------------------- |\n| `SUBTYPE_PERSONAL_UPDATE_MID_BASED_NAIADES`                      | 793        | Entity/Knowledge Graph-based personalization         |\n| `SUBTYPE_PERSONAL_UPDATE_QUERY_BASED_NAIADES`                    | 792        | Search query-based personalization                   |\n| `SUBTYPE_PERSONAL_UPDATE_QUERY_BASED_NAIADES_PERSISTENT_LOGGING` | 805        | Same with persistent logging                         |\n| `SUBTYPE_PERSONAL_UPDATE_RECALL_BOOST`                           | 797        | Increases retrieval priority from the candidate pool |\n| `SUBTYPE_PERSONAL_UPDATE_WPAS`                                   | 800        | Web Publisher Articles Signal                        |\n| `SUBTYPE_PERSONAL_UPDATE_WPAS_PERSISTENT_LOGGING`                | 811        | Same with persistent logging                         |\n| `SUBTYPE_PERSONAL_UPDATE_AIM_THREAD_NAIADES`                     | 856        | AIM (AI Mode) thread-based 
|\n\n**`WPAS` (Web Publisher Articles Signal)** likely corresponds to Google News Publisher Center registration, meaning content from registered publishers gets a distinct classification in the personalization pipeline. **`RECALL_BOOST`** suggests increased retrieval priority from the candidate pool, boosting content during retrieval, before ranking. (Note: we cannot observe the server-side configuration.)\n\n**Caveat:** These are enum names in a content subtype classification system. They confirm the categories exist, but how much weight each subtype carries in ranking is a server-side decision we cannot observe.\n\n## Suppression and Counterfactual Experiments\n\nDiscover runs counterfactual experiments. The code confirms:\n\n* `SHOW_SKIPPED_DUE_TO_COUNTERFACTUAL` (from `fevu.java`, enum value 16) — content withheld for A/B testing\n* `VISIBILITY_REPRESSED_COUNTERFACTUAL` (from `eyxv.java`) — a Visual Element logging state used across the Google App (not Discover-specific) that marks elements deliberately suppressed for experiment measurement\n* `background_refresh_rug_pull_count` (from `bupa.java`, under `/client_streamz/android_gsa/discover/`) — a Discover-specific counter tracking cases where content was pushed to the feed and then removed during a background refresh (verified as an exact string)\n\n![](/wp-content/uploads/2026/02/discover-4-scaled.png)\n\nThe \"rug pull\" counter is particularly notable. It tracks cases where content was delivered to the feed and then retroactively removed. This means Discover can withdraw content that was already in the feed, not just filter it before display.\n\n## The Beacon Push System\n\nMost Discover content arrives through pull-based feed requests, but there is also a push channel. 
The Beacon system allows Google's servers to proactively push content to a user's device.\n\nFrom the decompiled code (`bqmt.java`), Beacon currently handles **exactly two content types**:\n\n* **Sports scores** (`SportsScoreAmbientDataDocument`) — ordinal 0\n* **Investment/finance recaps** (`InvestmentRecapAmbientDataDocument`) — ordinal 1\n* Anything else triggers `\"Unsupported BeaconContent type: %s\"` and is rejected\n\nBeacon has its own metrics (from `bupa.java`):\n\n```\n/client_streamz/android_gsa/discover/beacon/incoming_sports_notifications_count\n/client_streamz/android_gsa/discover/beacon/donated_sports_documents_count\n/client_streamz/android_gsa/discover/beacon/dropped_sports_notifications_count\n/client_streamz/android_gsa/discover/beacon/appsearch_cleared_count\n```\n\nSports content has **10+ dedicated notification types** (from `fkld.java`): `SPORTS_AWARENESS_NOTIF`, `SPORTS_GAME_CRICKET_MILESTONE_NOTIF`, `SPORTS_BREAKING_NEWS_NOTIF`, `SPORTS_LIVE_ACTIVITY_NOTIF`, `SPORTS_PREGAME_ANALYSIS_AIM_NOTIF`, `SPORTS_LEAGUE_INSIGHTS_AIM_NOTIF`, `SPORTS_STANDINGS_NOTIF`, and more. General breaking news has just one: `BREAKING_NEWS_NOTIF`. The structural investment in sports notification infrastructure is significantly larger, and it may share very similar infrastructure with Google News. (This part appears dynamic, so it can change at any time.)\n\n## Freshness Buckets\n\nThe code contains time-based bucketing logic (from `bemp.java:215`):\n\n```java\ndays < 1  → \"0_DAYS\"\ndays < 8  → \"1_TO_7_DAYS\"\ndays < 15 → \"8_TO_14_DAYS\"\ndays < 31 → \"15_TO_30_DAYS\"\ndays < 61 → \"31_TO_60_DAYS\"\ndays >= 61 → \"TAIL\"\n```\n\n**Important correction:** In the decompiled code, this bucketing logic appears in a gesture settings context (`GestureSettingsPreferenceFragment`), not in a Discover-specific class. 
The bucket names and time ranges are confirmed as exact strings, but their direct connection to Discover's content freshness scoring cannot be verified from the client side alone. The bucketing pattern is consistent with how Google typically handles content age, but I cannot prove these specific buckets are used for Discover feed ranking.\n\n## 13 Cluster Types\n\nEvery card in the Discover feed belongs to a cluster. The following cluster type names are observable:\n\n* `neoncluster` — the primary content cluster\n* `geotargetingstories` — location-based stories\n* `deeptrends` and `deeptrendsfable` — trending topic narratives\n* `freshvideos` — recent video content\n* `mustntmiss` — priority/must-read content\n* `newsstoriesheadlines` — breaking news\n* `homestack` — widget cards (weather, sports scores)\n* `garamondrelatedarticlegrouping` — related article groups\n* `trendingugc` — user-generated trending content\n* `signinlure` — sign-in prompts\n* `iospromo` — cross-platform promotion\n* `moonstone` — an internal-codename cluster\n\n`mustntmiss` suggests there is a priority queue of content the system considers essential to show. `garamondrelatedarticlegrouping` hints that the system can create related-article groupings — combining separate articles under a shared topic heading.\n\n## Real-Time Feed Delivery\n\nDiscover does not simply fetch a static list of cards. 
The code reveals a persistent gRPC connection architecture with distinct service endpoints (verified from `ehdf.java`):\n\n* `google.internal.discover.discofeed.feedrenderer.v1.DiscoverFeedRenderer` — standard feed with `QueryInteractiveFeed` and `QueryNextPage`\n* `google.internal.discover.discofeed.streamingfeedrenderer.v1.DiscoverStreamingFeedRenderer` — streaming variant with `QueryStreamingFeed`\n* `google.internal.discover.discofeed.actions.v1.DiscoverActions` — `UploadActions` and `BackgroundUploadActions`\n* `google.internal.discover.discofeed.reactions.v1.DiscoverReactions` — `ListReactions`\n* `google.internal.discover.discofeed.recommendations.v1.StoryRecommendations`\n* `google.internal.discover.discofeed.homestack.v1.DiscoverHomestackFeedRenderer`\n\nWhat this means for publishers: your content does not wait for the user to pull-to-refresh. The streaming feed renderer keeps a live connection. The server can inject new cards, reorder existing ones, or remove stale content mid-session. The feed is a living stream, not a snapshot.\n\n## What This Means for Publishers\n\nLet me be clear about what this analysis is and is not. It is a set of observations about how the Google App's client-side systems are instrumented. It is not a reverse-engineering of server-side ranking algorithms, which remain on Google's servers and are not directly observable.\n\nThat said, some practical observations emerge:\n\n* **Schema.org JSON-LD takes priority over OG tags.** The parser checks structured data first for title, author, and publisher. OG tags are the fallback. If you only implement OG tags without JSON-LD markup, you're relying on the second-choice path.\n* **Images are essential.** The image fallback chain is five levels deep — the system tries hard to find an image. Use images at least 1200px wide for hero card eligibility.\n* **`og:title` is packaged and sent to Google's servers** as part of the ContentMetadata payload. 
Whether it's a direct ranking input is plausible but unconfirmed from client-side observation alone. Either way, it is part of the data that informs server-side decisions.\n* **Collection-level blocking is tracked as a distinct metric.** The `filter_collection_status` counter confirms this mechanism exists at the publisher/domain level. **However, we can only observe the telemetry counter, not the server-side thresholds or recovery mechanisms. Google can change these at any time.**\n* **Publisher Center registration creates a distinct signal.** The `WPAS` (Web Publisher Articles Signal) subtype means registered publishers get different classification treatment in the NAIADES personalization system.\n* **`article:content_tier` matters.** The parser explicitly recognizes `free`, `metered`, and `locked`. Use one value only — multiple values trigger a warning.\n* **`notranslate` and `nopagereadaloud` meta tags can halt the parsing pipeline.** If your CMS injects these, you need to run experiments and filter Discover traffic.\n* **User dismissals are permanent.** Content is tombstoned and does not resurface.\n* **Sports publishers have a structural advantage** in the push notification pipeline. 10+ dedicated notification types vs 1 for general breaking news. 
The client-side code confirms this structure; we cannot see the server-side configuration.\n\n## My Corrections & Caveats\n\nDuring fact-checking, several items required correction:\n\n| Original Claim | Correction |\n| --- | --- |\n| \"Exactly 6 OG tags are parsed\" | The parser handles 6 OG tags but also parses twitter:image, twitter:title, twitter:image:src, author, title, inLanguage, isAccessibleForFree, image (generic), and Schema.org JSON-LD. The total number of parsed tags is significantly more than 6 |\n| `EVERGREEN_VIBRANT` is a content classification type | It is a **UI color palette name** in `XgadsContext` for theming, alongside CANDY_VIBRANT, GLACIER_VIBRANT, LEMON_VIBRANT etc. Not a content type |\n| `engagement_time_msec` is a Discover-specific engagement signal | It is a **standard Firebase Analytics (GA4) parameter** (`_et` on the wire), used by every app with Firebase Analytics. 
It measures app-level engagement, not article-level engagement                                                |\n| `freshness_delta_in_seconds` / `staleness_in_hours` are Discover content metrics | In the decompiled code, these appear in **Smartspace weather/air quality metrics**, not in Discover-specific classes                                                                                                                  |\n| Freshness buckets (1_TO_7_DAYS etc.) are Discover-specific                       | The bucket strings exist but appear in a **gesture settings** context, not a confirmed Discover class                                                                                                                                 |\n| `VISIBILITY_REPRESSED_COUNTERFACTUAL` is Discover-specific                       | It is a **Google-wide Visual Element logging state** shared across the entire Google App (Assistant, Lens, Search, etc.)                                                                                                              |\n| `isCollectionHiddenFromEmberFeed` is about Discover feed filtering               | It is about the **Ember tab** (a separate visual image discovery tab in the Google App), not the Discover feed                                                                                                                        |\n| `PCTR_MODEL_TRIGGERED` confirms a Discover pCTR model                            | **Not found** in this SDK version. This doesn't mean it doesn't exist. It may be in server-side config or a different SDK version                                                                                                     |\n| OG tags are the primary parsing target                                           | **Schema.org JSON-LD structured data is checked first** for title, author, and publisher. OG tags are the fallback. The OG tags have been in action for a long time.                                                
|\n\n## Methodology Note\n\nAll findings in this analysis are derived from decompiling the Google App: 87,498 classes across 13 DEX files. Of these, 95.5% were obfuscated under package `p000/` with no ProGuard mapping file available.\n\nDeobfuscation was performed through string literal analysis, class hierarchy tracing, gRPC endpoint extraction, and Hilt dependency injection graph reconstruction. Where findings are confirmed via exact string matches in decompiled source, they are labeled as verified. Where findings are inferences based on naming conventions or code proximity, that is noted.\n\n*No server-side systems were accessed. Server-side ranking models, experiment allocations, and pipeline stages can all change independently of the client. What we can observe is the instrumentation: the questions the system asks and the answers it records, which reveal the architecture even as the parameters shift underneath.*",
  "author": {
    "@type": "Person",
    "name": "Metehan Yesilyurt",
    "url": "https://metehan.ai",
    "sameAs": [
      "https://x.com/metehan777",
      "https://www.linkedin.com/in/metehanyesilyurt",
      "https://github.com/metehan777"
    ]
  },
  "publisher": {
    "@type": "Person",
    "name": "Metehan Yesilyurt",
    "url": "https://metehan.ai"
  },
  "alternateFormat": {
    "html": "https://metehan.ai/blog/google-discover-architecture/",
    "json": "https://metehan.ai/api/post/google-discover-architecture.json",
    "rss": "https://metehan.ai/rss.xml"
  }
}