Abstract
We describe the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distribution module enables researchers worldwide to retrieve pages from WebBase and stream them across the Internet at high speed. The advantage for researchers is that they need not each crawl the Web themselves before beginning their research. WebBase has been used by scores of research and teaching organizations worldwide, mostly for investigations into Web topology and linguistic content analysis. After describing the system's architecture, we explain our engineering decisions for each of the WebBase components and present performance measurements for each.
Index Terms
Stanford WebBase components and applications