Lab 7: A Better Web Surfer


Last week's web crawler was not particularly efficient. This week, we'll harness the power of Java's multithreaded architecture to make multiple queries in parallel. For a brief introduction to multithreaded programming in Java (a very complex topic), go here. Most importantly, read this sub-section.


Program to write

This week, you have to extend and modify your program from last week:
  1. Implement a class called URLPool whose instances store a list of all URLs to be searched, along with the relative "level" of each of those URLs (also known as its "search depth"). The first URL you search will be at search depth 0, URLs found on that page will be at search depth 1, etc. You should store URLs and their search depths together as instances of a class called URLDepthPair, as you did last week. We recommend you use a Vector to store the items.

    There should be a way for users of the URLPool class to extract a URL and its search depth from the list, removing it from the list in the same operation. There should also be a way to insert another URL with an associated search depth into the pool. None of these methods should gratuitously expose the underlying storage implementation. (A sketch of one possible URLPool, including the synchronization and wait()/notifyAll() handling described in items 4 and 5 below, appears after this list.)

  2. This time around, create a Crawler class which implements Runnable and holds a reference to one of your URLPool objects described above. The crawler's job should be to:

    1. Wait until a (URL, search depth) pair is available in the pool,
    2. remove it from the pool,
    3. write it to System.out (both the URL and the depth),
    4. if the search depth is >= a maximum search depth you specify, go back to #1; otherwise,
    5. crawl the URL and place all new-found URLs in the pool with a search depth one more than the search depth of the original pair, and
    6. go back to #1.

    Continue until there are no more (URL, depth) pairs in the pool to crawl. (A Crawler sketch following this outline also appears after the list.)

  3. Since your crawlers will pull URLs from the pool and crawl them in the background, you should accept a third command-line parameter indicating the number of crawlers to spawn. Create all of these crawlers, point them at your URL pool, and seed the pool with the passed-in base URL at search depth 0. Note that there will be a separate thread for each crawler. When all the crawlers have finished (i.e., when there are no more (URL, depth) pairs in the pool to crawl), there must be a way to shut down all the crawler threads and exit gracefully. (One way to detect this condition is shown in the main-class sketch after this list.)

  4. Make sure to synchronize access to your URL pool object at any and all critical points, since it must now be thread-safe.

  5. Don't have your crawlers continuously poll the URL pool for another URL when it is empty. Instead, have a crawler wait() when the pool is empty, and have every crawler call notifyAll() whenever it adds another URL to the pool.
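
Here is a minimal sketch of what a thread-safe URLPool might look like. It is only one way to arrange things: the URLDepthPair shown (with getURL() and getDepth() accessors), the method names addPair(), getPair(), and getWaitCount(), and the use of a LinkedList in place of the recommended Vector are all choices made for these sketches, not requirements. The synchronized methods and the wait()/notifyAll() calls correspond to items 4 and 5 above.

    import java.util.LinkedList;

    // Illustrative URLDepthPair: a URL string plus its search depth.
    class URLDepthPair {
        private final String url;
        private final int depth;

        public URLDepthPair(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }

        public String getURL() { return url; }
        public int getDepth()  { return depth; }

        public String toString() { return "[depth " + depth + "] " + url; }
    }

    // One possible URLPool: all access goes through synchronized methods so
    // several crawler threads can share a single instance safely.
    public class URLPool {
        // Internal storage; never handed out directly to callers.
        private final LinkedList<URLDepthPair> pending = new LinkedList<>();

        // How many threads are currently blocked in getPair(); the main class
        // can use this to detect when the whole crawl is finished.
        private int waitingThreads = 0;

        // Add a pair and wake up any crawler waiting for work.
        public synchronized void addPair(URLDepthPair pair) {
            pending.addLast(pair);
            notifyAll();
        }

        // Remove and return the next pair, blocking while the pool is empty.
        public synchronized URLDepthPair getPair() throws InterruptedException {
            while (pending.isEmpty()) {
                waitingThreads++;
                wait();            // releases the lock until notifyAll()
                waitingThreads--;
            }
            return pending.removeFirst();
        }

        public synchronized int getWaitCount() {
            return waitingThreads;
        }
    }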
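
A Crawler following the six steps in item 2 might then look like the sketch below, assuming the URLPool above. The getURLsFromPage() helper is a placeholder for your page-scanning code from last week; its name and signature are assumptions.

    import java.util.Collections;
    import java.util.List;

    public class Crawler implements Runnable {
        private final URLPool pool;
        private final int maxDepth;

        public Crawler(URLPool pool, int maxDepth) {
            this.pool = pool;
            this.maxDepth = maxDepth;
        }

        public void run() {
            try {
                while (true) {
                    // Steps 1-2: block until a pair is available, then take it.
                    URLDepthPair pair = pool.getPair();

                    // Step 3: report the URL and its depth.
                    System.out.println(pair);

                    // Step 4: don't crawl past the maximum search depth.
                    if (pair.getDepth() >= maxDepth) {
                        continue;
                    }

                    // Step 5: crawl the page and add every link found,
                    // one level deeper than the current pair.
                    for (String url : getURLsFromPage(pair.getURL())) {
                        pool.addPair(new URLDepthPair(url, pair.getDepth() + 1));
                    }
                }
            } catch (InterruptedException e) {
                // The main thread interrupts the crawlers to shut them down.
            }
        }

        // Placeholder for last week's code: fetch the page and return the
        // URLs it links to.
        private List<String> getURLsFromPage(String url) {
            return Collections.emptyList();
        }
    }

Because several crawler threads print to System.out, lines from different threads may interleave in the output; that is fine for this assignment.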
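
Finally, a rough sketch of how the main class might parse the arguments, spawn the crawlers, seed the pool, and shut everything down. The class name WebCrawlerMain and the use of getWaitCount() to detect completion are assumptions of these sketches; the idea is that when every crawler is blocked waiting on an empty pool, no thread can add more work, so the crawl is over.

    public class WebCrawlerMain {
        public static void main(String[] args) throws InterruptedException {
            if (args.length != 3) {
                System.err.println("usage: java WebCrawlerMain <URL> <maxDepth> <numCrawlers>");
                System.exit(1);
            }

            int maxDepth    = Integer.parseInt(args[1]);
            int numCrawlers = Integer.parseInt(args[2]);

            // Seed the pool with the base URL at search depth 0.
            URLPool pool = new URLPool();
            pool.addPair(new URLDepthPair(args[0], 0));

            // Spawn one thread per crawler, all sharing the same pool.
            Thread[] crawlers = new Thread[numCrawlers];
            for (int i = 0; i < numCrawlers; i++) {
                crawlers[i] = new Thread(new Crawler(pool, maxDepth));
                crawlers[i].start();
            }

            // When every crawler is waiting on an empty pool, the crawl is
            // done; interrupt the threads so they exit gracefully. (Polling
            // the count like this is one simple approach, not the only one.)
            while (pool.getWaitCount() < numCrawlers) {
                Thread.sleep(100);
            }
            for (Thread t : crawlers) {
                t.interrupt();
            }
        }
    }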


Design advice


Extra credit