Stanford WebBase components and applications

Published: 01 May 2006

Abstract

We describe the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distribution module enables researchers world-wide to retrieve pages from WebBase, and stream them across the Internet at high speed. The advantage for the researchers is that they need not all crawl the Web before beginning their research. WebBase has been used by scores of research and teaching organizations world-wide, mostly for investigations into Web topology and linguistic content analysis. After describing the system's architecture, we explain our engineering decisions for each of the WebBase components, and present respective performance measurements.



Reviews

Jie Tang

Stanford WebBase, a Web search and retrieval tool, has been used by scores of research and teaching organizations, mostly for investigations into Web topology and linguistic content analysis. This paper describes the WebBase system, presents its main engineering and design tradeoffs, and discusses experimental results for, and an analysis of, some components, as well as the performance of the whole system.

The crawler, indexer, and query engine are three of the most important modules in a Web search system. The principles of crawlers and indexers are not difficult to understand. However, making them function efficiently and effectively in practice, especially on the Web today, still requires a thorough and principled investigation. Web page distribution allows researchers from different areas to share the power and facilities of the system.

The paper focuses on the crawler, the indexer, and Web page distribution, and presents the engineering and design tradeoffs of these three components. For the crawler, the main issues addressed are parallelization and courtesy. The authors describe the design and implementation of the parallelization, analyze it experimentally, and briefly consider the courtesy requirements. For the indexer, they consider how to build one that is highly parallel, distributed, efficient, and easy to query. A mixed-list storage scheme was designed to enhance the performance of the Berkeley DB-based inverted index storage. For Web page distribution, the authors describe a delivery interface for retrieving pages from WebBase across the network. The distribution is designed for high performance by allowing maximum use of parallel processing on both the server and the client sides. An example illustrates the distribution facility.

The paper may be suitable for three kinds of readers: researchers who want to take advantage of WebBase to conduct research on the Web (for example, on Web mining); researchers who want to investigate the issues of parallel crawling, high-performance indexing, and large-scale document distribution; and researchers from industry who are interested in designing and implementing a large-scale search engine system. Some knowledge of network technology and information retrieval techniques is recommended before reading this paper.

Online Computing Reviews Service
