Java Lab 7: A Better Web Crawler


Last week's web crawler was not particularly efficient. This week you will extend your web crawler to use Java threading, so that multiple web pages can be crawled in parallel. This will produce a significant performance improvement, because the time each crawler thread spends waiting for network operations to complete can be overlapped with processing in the other threads.

For a detailed introduction to multithreaded programming in Java (a very complex topic!), you can read the threading tutorial here. Most importantly, read this sub-section.

Extending the Web Crawler

This week, you will extend and modify your program from last week:

  1. Implement a class called URLPool, which will store a list of all URLs to be searched, along with the relative "level" of each of those URLs (also known as a "search depth"). The first URL you search will be at search depth 0, URLs found on that page will be at search depth 1, etc. You should store URLs and their search depth together as instances of a class called URLDepthPair, as you did last week. A LinkedList is recommended for storing the items, since it can perform the required operations very efficiently.

    There should be a way for the user of the URLPool class to fetch a URL-depth pair from the pool, removing it from the list in the same operation. There should also be a way to add a URL-depth pair to the pool. Both of these operations need to be thread-safe, since multiple threads will be interacting with the URLPool at the same time.

    Unlike the FIFO example discussed in class, the URL pool shouldn't have a maximum size limit. It really just needs the list of pending URLs, the list of scanned URLs, and one other field discussed below. (A sketch of such a pool appears after this list.)

  2. To perform the web crawling in multiple threads, you should create a CrawlerTask class that implements the Runnable interface. Each CrawlerTask instance should have a reference to one instance of the URLPool class described above. (Note that all of the CrawlerTask instances share that single pool!) The crawler's job should be to:

    1. Retrieve a URL-depth pair from the pool, waiting if one is not immediately available.
    2. Retrieve the web page referred to by the URL.
    3. Search the page for more URLs. For each URL found on the page, the crawler task should add a new URL-depth pair to the URL pool. The new pairs should have a depth that is one more than the depth of the URL that was retrieved in step 1.
    4. Go back to step 1!

    This should continue until there are no more (URL, depth) pairs in the pool to crawl. (A sketch of such a crawler task appears after this list.)

  3. Since your web crawler will be spawning some number of crawler threads, update your program to accept a third command-line parameter specifying the number of crawler threads to spawn. Your main function will operate something like this (a sketch appears after this list):

    1. Process command-line arguments. Inform the user of any input errors.
    2. Create an instance of the URL pool, and put the user-specified URL into the pool with a depth of 0.
    3. Create the number of crawler tasks (and threads to run them) requested by the user. Each crawler task should be given a reference to the URL pool you instantiated.
    4. Wait for the web crawling to complete! (More on this later.)
    5. Print out the resulting list of URLs that have been found.
  4. Make sure to synchronize your URL pool object at any and all critical points, since it must now be thread-safe.

  5. Don't have your crawlers continuously poll the URL pool for another URL when it is empty. Instead, have them wait efficiently when no URL is available. You should implement this by having the URL pool's "get URL" method wait() internally if no URL is currently available. Correspondingly, the URL pool's "add URL" method should notify() when a new URL is added to the pool.

    Note that crawler threads won't themselves be performing any of these synchronize / wait / notify operations. This is for the same reason that the URL pool hides the details of how URLs are stored and retrieved: encapsulation! Just like you want users of the URL pool to not have to worry about implementation details, you also want them to not have to worry about threading details.
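To make steps 1, 4, and 5 above more concrete, here is a minimal sketch of what a thread-safe URL pool might look like. It assumes last week's URLDepthPair class with a (String url, int depth) constructor and getURL() / getDepth() accessors; the waitingThreads counter is the "one other field" mentioned in step 1, and it is used below to detect when the crawl is complete. Treat this as a starting point, not a finished implementation.

    import java.util.LinkedList;
    import java.util.List;

    // Sketch of a thread-safe pool of URL-depth pairs. Assumes last week's
    // URLDepthPair class with getURL() and getDepth() accessors.
    public class URLPool {
        // Pairs that still need to be crawled.
        private LinkedList<URLDepthPair> pendingURLs = new LinkedList<URLDepthPair>();

        // Pairs that have already been handed out to crawler threads
        // (the "scanned URLs" list).
        private LinkedList<URLDepthPair> seenURLs = new LinkedList<URLDepthPair>();

        // Number of threads currently blocked in get(), waiting for work.
        private int waitingThreads = 0;

        // Add a pair to the pool and wake up one waiting crawler thread.
        public synchronized void add(URLDepthPair pair) {
            pendingURLs.addLast(pair);
            notify();
        }

        // Remove and return the next pair, blocking until one is available.
        public synchronized URLDepthPair get() throws InterruptedException {
            while (pendingURLs.isEmpty()) {
                waitingThreads++;
                wait();
                waitingThreads--;
            }
            URLDepthPair pair = pendingURLs.removeFirst();
            seenURLs.addLast(pair);
            return pair;
        }

        // How many crawler threads are currently waiting for work.
        public synchronized int getWaitCount() {
            return waitingThreads;
        }

        // A copy of everything handed out so far, for the final report.
        public synchronized List<URLDepthPair> getSeenList() {
            return new LinkedList<URLDepthPair>(seenURLs);
        }
    }

Notice that all of the synchronize / wait() / notify() details live inside URLPool itself, which is exactly the encapsulation point made in step 5.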
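Here is a corresponding sketch of a crawler task for step 2. The retrieveURLs() helper is hypothetical; it stands in for whatever page-downloading and link-scanning code you wrote last week. The maximum search depth is assumed to still come from the command line, as it did in last week's lab.

    import java.util.LinkedList;
    import java.util.List;

    // Sketch of a crawler task (step 2). Every instance shares one URLPool.
    public class CrawlerTask implements Runnable {
        private URLPool pool;
        private int maxDepth;

        public CrawlerTask(URLPool pool, int maxDepth) {
            this.pool = pool;
            this.maxDepth = maxDepth;
        }

        public void run() {
            try {
                while (true) {
                    // Step 1: get a pair, blocking if none is available yet.
                    URLDepthPair pair = pool.get();

                    // Don't descend past the maximum search depth.
                    if (pair.getDepth() >= maxDepth)
                        continue;

                    // Steps 2-3: fetch the page, then add every link found
                    // on it, one level deeper than the page just crawled.
                    for (String link : retrieveURLs(pair.getURL())) {
                        pool.add(new URLDepthPair(link, pair.getDepth() + 1));
                    }
                    // Step 4: loop back around and get the next pair.
                }
            } catch (InterruptedException e) {
                // The thread was interrupted; just let run() return.
            }
        }

        // Hypothetical helper: download the page at urlString and return the
        // URLs found on it. Last week's socket/scanning code goes here.
        private List<String> retrieveURLs(String urlString) {
            List<String> found = new LinkedList<String>();
            // ... open a connection, read the page, scan for links ...
            return found;
        }
    }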
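Finally, here is a sketch of the main-function flow from step 3, assuming the command-line arguments are <URL> <depth> <numThreads> (adjust the order to match your own program). It uses the pool's waiting-thread count to decide when the crawl has finished: once every crawler thread is blocked waiting for work, the pool must be empty and no new URLs can ever be added.

    // Sketch of the main program (step 3). Argument order is assumed to be
    // <URL> <depth> <numThreads>.
    public class Crawler {
        public static void main(String[] args) {
            if (args.length != 3) {
                System.out.println("usage: java Crawler <URL> <depth> <numThreads>");
                System.exit(1);
            }

            int maxDepth = 0;
            int numThreads = 0;
            try {
                maxDepth = Integer.parseInt(args[1]);
                numThreads = Integer.parseInt(args[2]);
            } catch (NumberFormatException e) {
                System.out.println("depth and numThreads must be integers");
                System.exit(1);
            }

            // The single pool shared by every crawler task.
            URLPool pool = new URLPool();
            pool.add(new URLDepthPair(args[0], 0));

            // Spawn the requested number of crawler threads.
            for (int i = 0; i < numThreads; i++) {
                new Thread(new CrawlerTask(pool, maxDepth)).start();
            }

            // Wait for the crawl to complete: once every crawler thread is
            // blocked in get() and nothing is pending, no new work can
            // ever appear, so the crawl is done.
            while (pool.getWaitCount() != numThreads) {
                try {
                    Thread.sleep(100);  // check again in a moment
                } catch (InterruptedException e) {
                    // ignore and keep checking
                }
            }

            // Print every URL that was found, then exit. System.exit() also
            // stops the crawler threads still blocked in get().
            for (URLDepthPair pair : pool.getSeenList()) {
                System.out.println(pair.getURL());
            }
            System.exit(0);
        }
    }

With this sketch, a command like "java Crawler http://www.example.com/ 3 4" would crawl to a maximum depth of 3 using 4 crawler threads.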

Design Advice

Here are some helpful hints for implementing Lab 7!

Extra Credit


Copyright (c) 2004-2011, California Institute of Technology.
Last updated on February 23, 2011.