The Authority Signal AI Learned Before It Ever Heard of Your Brand

Jim Wrubel
7/1/2026

Before an AI model answers a single question about your brand, it's already made up its mind about which websites it trusts. That call was made during training, long before anyone typed a prompt. And it was shaped, in large part, by a free public dataset most marketers have never heard of: Common Crawl.
This matters for anyone who runs earned media. A placement on a site the AI treats as authoritative can echo through thousands of AI answers. A placement on a site the AI barely notices can be invisible in that same channel. The difference isn't about traffic or old-school prestige. It's about where a site sits in the structure of the web itself.
The good news? That structure is measurable. There's a number that predicts it well, and you can look yours up in seconds with our free Common Crawl Rank Checker.
Key Takeaways
- Common Crawl is one of the largest sources of AI training data. In the GPT-3 paper, filtered Common Crawl makes up roughly 80% of the tokens in the training dataset and about 60% of the training mix by sampling weight.
- Common Crawl uses a metric called Harmonic Centrality to decide which sites to crawl most often. High-centrality sites show up more in the data models learn from.
- Harmonic Centrality is a strong proxy for how much authority an AI model assigns to a site, because it rewards being deeply connected in the web, not just heavily linked.
- For PR and marketing leaders, this adds a new dimension to earned media value. The authority of the host site in the web graph shapes how much an AI-era placement is worth.
- You can improve your standing over time. Wikidata, Wikipedia, and earned links with brand mentions from highly authoritative sites all move the needle.
What Common Crawl Actually Is
Common Crawl is a nonprofit that's been crawling the public web since 2008. Every month it visits billions of pages, saves what it finds, and publishes the whole thing as a free, open dataset. Anyone can download it. Researchers use it. Startups use it. And the companies building large language models lean on it hard.
How hard? When OpenAI trained GPT-3, filtered Common Crawl data made up most of the training tokens. Other major models use it heavily too. When people say an AI model was "trained on the internet," Common Crawl is a big chunk of what they mean by "the internet."
The crawler that does this work is called CCBot. If you've ever wondered whether it visits your site, you can read its full profile in our bot directory entry for CCBot. It's one of the most important bots on the web that almost nobody outside the field talks about.
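If you want a quick first check, here's a minimal sketch of testing whether a robots.txt file permits CCBot to fetch a page. It uses Python's standard-library robots.txt parser. The robots.txt content and the example.com URLs are made up so the example runs offline. Note that this only tells you whether CCBot is allowed in, not whether it has actually visited; for that you'd need your server logs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
# In practice you would read the file served at https://<your-domain>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "CCBot" is the user-agent token Common Crawl's crawler identifies with.
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    for path in ("/blog/launch", "/private/roadmap"):
        url = f"https://www.example.com{path}"  # placeholder domain
        verdict = "allowed" if ccbot_allowed(SAMPLE_ROBOTS_TXT, url) else "blocked"
        print(path, "->", verdict)
```

With the sample rules above, the blog page is allowed and the page under /private/ is blocked.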
Here's the part that matters for you. Common Crawl can't crawl every page every month; the web's just too big. So it has to choose. And how it chooses is where authority comes in.
The Number That Decides Who Gets Crawled
To decide what to crawl, Common Crawl builds a map of the web called a web graph. Every website is a dot. Every link between sites is a line connecting two dots. Do that across the whole web and you get a giant network with hundreds of millions of dots.
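From this map, Common Crawl calculates a score for each site called Harmonic Centrality. The math is involved, but the idea is simple. Harmonic Centrality measures how close a site is to everything else on the web. For every other site, it finds the shortest chain of links that leads to you and gives more credit the shorter that chain is; sites that can't reach you at all add nothing. In effect it asks: starting from everywhere else on the web, how few clicks does it take to reach your door?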
Think of it like a city. A shop on the main square, one you can reach quickly from every direction, is central. A shop down a dead-end road on the edge of town isn't, even if a lot of people happen to live on that one road. Harmonic Centrality rewards the shop on the square.
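To make the city picture concrete, here's a small sketch that computes Harmonic Centrality on a toy "town" of ten sites. It uses the textbook definition: add up 1/distance for every site that can reach you, where distance is the fewest links needed. The site names and links are invented, and the scores Common Crawl publishes come from the full web graph and may be scaled differently, so treat this as an illustration of the idea rather than their pipeline.

```python
from collections import deque

# Toy "town" of sites. Each key links to the sites in its list.
# All names are made up for illustration.
TOWN_LINKS = {
    "north": ["square"],
    "south": ["square"],
    "east": ["square"],
    "west": ["square"],
    "square": ["north", "south", "east", "west"],
    "house_a": ["deadend_shop"],
    "house_b": ["deadend_shop"],
    "house_c": ["deadend_shop"],
    "house_d": ["deadend_shop"],
    "deadend_shop": ["west"],
}


def harmonic_centrality(links, target):
    """Sum of 1/distance over every site that can reach `target` via links."""
    # Build the reversed graph: for each site, who links TO it.
    inbound = {}
    for source, targets in links.items():
        for linked in targets:
            inbound.setdefault(linked, []).append(source)

    # Breadth-first search backwards from the target gives the shortest
    # link distance from every site that can reach it.
    distance = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for source in inbound.get(node, []):
            if source not in distance:
                distance[source] = distance[node] + 1
                queue.append(source)

    # Sites that cannot reach the target never enter `distance`, so they add 0.
    return sum(1.0 / d for site, d in distance.items() if site != target)


if __name__ == "__main__":
    for shop in ("square", "deadend_shop"):
        print(shop, round(harmonic_centrality(TOWN_LINKS, shop), 2))
    # square 5.83
    # deadend_shop 4.0
```

Both shops have exactly four sites linking straight to them. The square still scores higher, because the rest of the town can reach it in a few hops, while nothing outside the dead-end road can reach the other shop at all.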
There's a second, related number you'll also see: PageRank. It's the same idea Google made famous. PageRank measures authority based on the quality and quantity of links pointing at you. It's useful, but it has a weakness. Because it counts links, it can be inflated by link farms and other tricks. Harmonic Centrality is harder to fake, because it depends on your position in the whole network, not just how many arrows point your way. That's why it's such a trustworthy signal of real authority.
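Here's a rough sketch of why link farms can move PageRank. It's a textbook-style power iteration with a damping factor of 0.85, not Google's or Common Crawl's actual implementation, and the three-site graph and the farm pages are invented for the example.

```python
# Tiny made-up web: a news site links to a blog and a shop, and both link back.
BASE_LINKS = {
    "news": ["blog", "shop"],
    "blog": ["news"],
    "shop": ["news"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of outbound links."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Pages with no outbound links spread their rank evenly over everyone.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every other page splits its rank equally among the pages it links to.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for linked in targets:
                    new_rank[linked] += share
        rank = new_rank
    return rank


def with_link_farm(links, target, farm_size):
    """Copy the graph and add `farm_size` throwaway pages that link to `target`."""
    farmed = {site: list(targets) for site, targets in links.items()}
    for i in range(farm_size):
        farmed[f"farm{i}"] = [target]
    return farmed


if __name__ == "__main__":
    before = pagerank(BASE_LINKS)
    after = pagerank(with_link_farm(BASE_LINKS, "shop", 20))
    print("before:", round(before["blog"], 3), round(before["shop"], 3))
    print("after: ", round(after["blog"], 3), round(after["shop"], 3))
```

Before the farm, the blog and the shop tie, since they hold identical positions in the graph. After twenty throwaway pages point at the shop, it pulls ahead of the blog even though nothing about its real standing has changed.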

Now connect the two ideas. Common Crawl crawls high-centrality sites more often. Those sites therefore show up more in the data AI models train on. The more a model sees a site during training, the more familiar and authoritative that site feels when the model later writes an answer. Central sites get a head start that compounds.
It's no coincidence that the sites at the very top of the web graph, like Wikipedia, YouTube, and major news outlets, are also the sites AI models cite most often. They were central, so they got crawled, so they got learned, so they're trusted.
Why This Belongs on a PR Leader's Radar
For years, the value of an earned media placement came down to a few familiar things. Reach. Domain authority. The prestige of the outlet. Whether the coverage was positive. Those still matter, and they always will.
But AI has added a new variable, and it's one traditional coverage reports miss completely. When a customer asks an AI assistant for recommendations, the assistant leans on what it learned during training and on what it can find in real time. Both of those channels favor sites that are central in the web graph. A story about your brand on a highly central site doesn't just reach that site's readers. It becomes part of the raw material AI uses to describe your category.
That reframes a question every PR leader should be asking. Not just "did we get the placement?" but "is the site we got it on one that AI treats as authoritative?" Two placements that look identical on a coverage report can carry very different value in the AI channel, purely because of where each host site sits in the web graph.
This is exactly the gap our AI Placement Value Score was built to close. It combines a site's organic authority, which includes the same Common Crawl PageRank signal we've been talking about, with how accessible the site is to AI crawlers and how much weight it carries in AI training data. Your Common Crawl rank is one of the core inputs. A strong rank tends to lift the placement value of any coverage you earn on that domain.
How to Improve Your Standing
Here's the encouraging part. Your position in the web graph isn't fixed. It reflects the choices you and your team make over time. You can't buy your way to the center of the web, but you can earn your way there. A few moves matter most.
Claim and maintain your Wikidata entity. Wikidata is the structured knowledge base that sits alongside Wikipedia and feeds facts across the web, including into AI systems. An accurate, complete Wikidata entry for your brand connects you to the highly central knowledge core of the web. It's one of the most direct ways to become part of the neighborhood AI already trusts. And if you qualify for a Wikipedia article, that helps even more.
Chase authority, not volume. This is the single biggest shift in thinking. Link topology beats link volume. One link from a site that sits deep in the web's core can do more for your Harmonic Centrality than dozens of links from isolated, low-centrality pages. One story in a major, deeply connected publication outweighs a hundred mentions on sites nobody links to. It's about the quality of your neighborhood, not the number of your links.
Run earned media that lands real coverage on central sites. This is where PR and this metric meet head-on. When your earned media lands brand mentions and links on authoritative, well-connected outlets, you're not just reaching their audience. You're strengthening your own position in the web graph, which feeds forward into how AI sees you. Every placement on a central site is a small deposit into your long-term AI authority.
Be consistent about your brand entity. Use the same brand name, the same core facts, and consistent structured data across the sites you control and the coverage you earn. Consistency helps both the web graph and AI models connect all your mentions to a single, recognizable entity.
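As one way to keep structured data consistent, here's a minimal sketch that generates schema.org Organization markup in JSON-LD from a single source of truth, so every site you control publishes the same name, URL, and profile links. The brand name, the URLs, and the Wikidata ID are placeholders, and which properties you include is a judgment call for your team.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build schema.org Organization markup as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs lists other pages that describe this same entity.
        "sameAs": list(same_as),
    }


if __name__ == "__main__":
    # All values below are placeholders, not a real brand or Wikidata item.
    markup = organization_jsonld(
        name="Example Brand",
        url="https://www.example.com",
        same_as=[
            "https://www.wikidata.org/wiki/Q00000000",
            "https://www.linkedin.com/company/example-brand",
        ],
    )
    # Paste the printed block into the <head> of each site you control.
    print('<script type="application/ld+json">')
    print(json.dumps(markup, indent=2))
    print("</script>")
```

Generating the markup from one function call, instead of hand-editing it per site, is a simple guard against the small drifts in naming that make one brand look like several.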
None of this happens overnight. Centrality builds over months and quarters, the same way reputation does. But that's also why it's defensible. A competitor can't buy their way past it in a week.
See Where You Stand
You don't have to guess where your brand sits in all this. Our free Common Crawl Rank Checker looks up any domain in the Common Crawl web graph and shows you three things in seconds: where you land as a percentile of the whole web, your Harmonic Centrality and PageRank, and a plain-language authority tier from Elite to Emerging. It also runs a full AI Placement Value Score, so you can see not just where you stand, but what your standing is worth in the AI channel.
Check your own domain. Then check the sites where your best coverage has landed. The results will tell you which of your placements are working hardest in the channel that increasingly shapes how buyers find brands. And when you're ready to turn that insight into a plan, a free Spyglasses account lets you check any domain and build an earned media strategy around the sites that actually move AI.
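If you're curious what a lookup like this involves, here's a rough sketch that reads a few rows in the style of the domain rank files Common Crawl publishes with its web graph and turns a rank position into a percentile. The column layout (tab-separated, domain names written in reverse such as com.example) is an assumption based on those public files, so check the header of whichever release you download. The rows and numbers are invented, and this isn't necessarily how the Rank Checker itself computes its results.

```python
import csv
import io

# Made-up sample rows. Assumed columns: harmonic centrality position and value,
# PageRank position and value, then the domain in reversed notation.
SAMPLE_RANKS_TSV = (
    "#harmonicc_pos\t#harmonicc_val\t#pr_pos\t#pr_val\t#host_rev\n"
    "1\t30000000.0\t1\t0.01\tcom.example-video\n"
    "2\t29000000.0\t3\t0.006\torg.example-encyclopedia\n"
    "3\t21000000.0\t2\t0.008\tcom.example-news\n"
    "4\t12000000.0\t4\t0.0001\tcom.example-brand\n"
)


def reverse_domain(domain):
    """Turn 'example.com' into 'com.example' to match the assumed file format."""
    return ".".join(reversed(domain.lower().split(".")))


def load_ranks(tsv_text):
    """Parse the tab-separated rank rows into a dict keyed by reversed domain."""
    ranks = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip the header and blank lines
        hc_pos, hc_val, pr_pos, pr_val, host_rev = row[:5]
        ranks[host_rev] = {
            "harmonic_pos": int(hc_pos),
            "harmonic_val": float(hc_val),
            "pagerank_pos": int(pr_pos),
            "pagerank_val": float(pr_val),
        }
    return ranks


def percentile(position, total_domains):
    """Share of domains ranked at or below this position (position 1 = 100.0)."""
    return 100.0 * (total_domains - position + 1) / total_domains


if __name__ == "__main__":
    ranks = load_ranks(SAMPLE_RANKS_TSV)
    entry = ranks[reverse_domain("example-brand.com")]
    print(entry["harmonic_pos"], percentile(entry["harmonic_pos"], len(ranks)))
```

In a real release the file covers far more than four domains, so you'd pass the true domain count as `total_domains` rather than the size of a sample.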
Glossary
What is Common Crawl?
Common Crawl is a nonprofit that crawls the public web and releases the results as free, open datasets. Because the data is free and enormous, it's become one of the primary sources of training data for large language models. If an AI model was "trained on the web," Common Crawl is a big part of what that means.
What is Harmonic Centrality?
Harmonic Centrality is a network science metric that measures how central a node is in a graph. On the web, it measures how easily a website can be reached from all other websites through short link paths. A high score means the site is deeply connected to the rest of the web. It's harder to manipulate than link-count metrics, which makes it a strong signal of real authority.
What is PageRank?
PageRank is a metric that scores a page's authority based on the number and quality of links pointing to it. It's the algorithm that originally powered Google Search. Common Crawl publishes a PageRank value for domains in its web graph. It's useful, but because it's based on link counts, it can be inflated by artificial linking.
What is CCBot?
CCBot is the web crawler Common Crawl uses to gather content from the web. When it visits your site, the pages it collects can end up in the datasets that train AI models. You can read its full profile, including how it identifies itself, in our CCBot directory entry.