Ask HN: Is Common Crawl used exhaustively by any search engine?

8 points by n1xis10t 15 hours ago

The Common Crawl has about 300 billion pages in it, and if you downloaded all of it in extracted text format it would only take up about 816 TB compressed. If someone were to make a search engine with this, I think it would be more comprehensive than Bing, and possibly pretty similar to Google. The only CC-based search engines that I know of use a tiny fraction of what they have available. Do you know of any that use the whole thing?
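
For scale, this is the kind of back-of-the-envelope check I mean: list one crawl's WET (extracted text) files and extrapolate from a small sample. The crawl ID is just an example, and I'm assuming data.commoncrawl.org answers HEAD requests with a Content-Length:

  import gzip
  import random

  import requests

  CC = "https://data.commoncrawl.org/"
  CRAWL = "CC-MAIN-2024-10"  # example crawl ID; substitute a current one

  # List the WET (extracted plain-text) files for one monthly crawl.
  listing = requests.get(f"{CC}crawl-data/{CRAWL}/wet.paths.gz", timeout=60)
  listing.raise_for_status()
  paths = gzip.decompress(listing.content).decode().splitlines()

  # HEAD a small random sample and extrapolate the crawl's compressed text size
  # (assumes the server reports Content-Length for these files).
  sample = random.sample(paths, 20)
  sizes = [int(requests.head(CC + p, timeout=60,
                             allow_redirects=True).headers["Content-Length"])
           for p in sample]
  est_tb = sum(sizes) / len(sizes) * len(paths) / 1e12
  print(f"{len(paths)} WET files, roughly {est_tb:.1f} TB compressed in {CRAWL}")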

agentbox 8 hours ago

To my knowledge, no public search engine indexes the full Common Crawl corpus. Projects like Neeva (before shutting down) and some academic prototypes used parts of it for evaluation, but none have managed to process all 300B pages continuously.

The biggest practical barriers are deduplication, spam filtering, and keeping the index fresh — CC snapshots are monthly but the quality varies a lot.
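
To give a sense of the dedup problem, here's a bare-bones MinHash sketch (word 5-gram shingles, a 64-value signature, all parameters picked arbitrarily). At CC scale you'd bucket signatures with LSH rather than comparing documents pairwise:

  import hashlib
  import random
  import re

  P = (1 << 61) - 1  # large prime for the hash permutations
  _rng = random.Random(0)
  COEFFS = [(_rng.randrange(1, P), _rng.randrange(0, P)) for _ in range(64)]

  def shingles(text, n=5):
      # Word 5-grams as the unit of comparison between documents.
      words = re.findall(r"\w+", text.lower())
      return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

  def minhash(text):
      # 64-value signature: for each random permutation, keep the smallest shingle hash.
      base = [int(hashlib.md5(s.encode()).hexdigest(), 16) % P for s in shingles(text)]
      if not base:
          return [0] * len(COEFFS)
      return [min((a * h + b) % P for h in base) for a, b in COEFFS]

  def est_jaccard(sig_a, sig_b):
      # Fraction of matching positions estimates Jaccard similarity of the shingle sets.
      return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)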

For experimentation, you can look at projects like CCNet, open-source Elasticsearch indexing pipelines, or small-scale engines such as Marginalia Search, which use subsets for niche purposes.
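
If you only want to poke at a subset, the CDX URL index plus HTTP range requests lets you pull individual records without downloading whole crawls. Rough sketch, with the crawl ID again just an example:

  import gzip
  import json

  import requests

  INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # example crawl
  DATA = "https://data.commoncrawl.org/"

  # Ask the URL index for every capture of one site in this crawl.
  resp = requests.get(INDEX, params={"url": "example.com/*", "output": "json"},
                      timeout=60)
  resp.raise_for_status()
  captures = [json.loads(line) for line in resp.text.splitlines()]
  print(f"{len(captures)} captures")

  # Pull a single record with an HTTP range request into the WARC file,
  # so nothing gets downloaded beyond the records you actually want.
  c = captures[0]
  start, length = int(c["offset"]), int(c["length"])
  warc = requests.get(DATA + c["filename"],
                      headers={"Range": f"bytes={start}-{start + length - 1}"},
                      timeout=60)
  record = gzip.decompress(warc.content)  # one gzipped WARC record
  print(record[:400].decode("utf-8", errors="replace"))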

  • n1xis10t 35 minutes ago

    For freshness, I wonder how much their news crawls (which I’m pretty sure are weekly) would help.
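
    Something like this is what I have in mind for pulling the news WARC listings, assuming there's a warc.paths.gz per month under crawl-data/CC-NEWS/ (I haven't verified that layout):

      import gzip

      import requests

      CC = "https://data.commoncrawl.org/"
      # Assumed layout: one warc.paths.gz listing per year/month of CC-NEWS.
      listing = requests.get(f"{CC}crawl-data/CC-NEWS/2024/01/warc.paths.gz", timeout=60)
      listing.raise_for_status()
      warcs = gzip.decompress(listing.content).decode().splitlines()
      print(f"{len(warcs)} news WARC files listed for 2024-01; last listed: {warcs[-1]}")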

    Thanks for the suggestions. Have you worked at all with the Common Crawl?