

If you’ve been doing SEO for a while, one of the papers you may have read describes how Google was attempting to index content found on the Web that might be difficult for their crawlers to access, such as financial statements from the SEC. The search engine would have to access this information by filling out a form and guessing good queries because that was the only way to access the information – they couldn’t crawl it without querying it first. This paper describes efforts that Google undertook to access that information:

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.
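The mechanics of “surfacing” in that quote are easy to sketch: pick candidate values for a form’s inputs, pre-compute the GET URLs those submissions would produce, and hand the resulting pages to the ordinary indexing pipeline. Below is a minimal Python illustration of that idea; the form URL, field names, and candidate values are all hypothetical, and the real system described in the paper has to choose good query templates and input values rather than naively enumerating every combination.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical search form: its action URL and candidate values
# guessed for each of its input fields.
FORM_ACTION = "https://example.com/used-cars/search"
FIELDS = {
    "make": ["honda", "toyota"],
    "zip": ["94035", "10011"],
}

def precompute_submissions(action, fields):
    """Enumerate form submissions and return the resulting GET URLs.

    Each URL can then be fetched and the resulting HTML page added to
    the search index like any ordinary web page.
    """
    names = list(fields)
    urls = []
    for values in product(*(fields[name] for name in names)):
        urls.append(action + "?" + urlencode(dict(zip(names, values))))
    return urls

for url in precompute_submissions(FORM_ACTION, FIELDS):
    print(url)  # e.g. https://example.com/used-cars/search?make=honda&zip=94035
```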

A few years ago, I wrote a blog post that I titled “Solving Different URLs with Similar Text (DUST),” which described some of the difficulties that a search engine might have indexing some URLs.

A paper I came across this week sort of combines those topics. It uses the names of entities found in sources such as Freebase (names of phones, like “iPhone”) to better guess queries and crawl deep-web pages that focus upon products found on commerce sites – pages that could be difficult to reach otherwise, and that Google might have to fill out forms to access.

The paper is: Crawling Deep Web Entity Pages (pdf)
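To make the entity-driven querying concrete, here is a small sketch under stated assumptions: the entity names that would come from a knowledge base such as Freebase are stood in for by a plain list, and the shopping site’s search interface is assumed to take a single `q` parameter. Every name and URL here is illustrative.

```python
from urllib.parse import urlencode

# Stand-ins for entity names pulled from a knowledge base such as
# Freebase; a real crawler would read these from a dump of a category
# (e.g. consumer products). All values here are illustrative.
ENTITY_NAMES = ["iPhone", "Galaxy Nexus", "Lumia 920"]

# Hypothetical shopping-site search interface taking a `q` parameter.
SEARCH_URL = "https://shop.example.com/search"

def entity_query_urls(names, search_url):
    """Turn each known entity name into a crawlable search-results URL."""
    return [search_url + "?" + urlencode({"q": name}) for name in names]

for url in entity_query_urls(ENTITY_NAMES, SEARCH_URL):
    print(url)  # each result page links out to deep-web entity pages
```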

Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for various purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built specializing in crawling entity-oriented deep-web sites.

The image below is from the paper and illustrates how the process described within it works:

A diagram of the entity crawl system described in the paper
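Reading the diagram as a loop gives a rough picture of how such a system could operate: seed the crawl with entity names from the knowledge base, submit each name to the site’s search interface, keep the result pages, and mine them for further entity names to query. The sketch below is my own loose paraphrase of that loop, not the paper’s implementation; `search` and `extract_entities` are site-specific stand-ins.

```python
from collections import deque

def crawl_entity_pages(seed_names, search, extract_entities, max_pages=1000):
    """Entity-oriented crawl loop (a loose paraphrase, not the paper's code).

    `search(name)` submits one query and returns the fetched result page;
    `extract_entities(page)` mines a page for new entity names to try.
    """
    seen = set(seed_names)
    queue = deque(seed_names)
    pages = []
    while queue and len(pages) < max_pages:
        page = search(queue.popleft())
        pages.append(page)
        for name in extract_entities(page):
            if name not in seen:      # only queue names we haven't queried
                seen.add(name)
                queue.append(name)
    return pages
```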
