Lab 7: A Better Web Surfer


Last week's web crawler was not particularly efficient. This week, we'll harness the power of Java's multithreaded architecture to make multiple queries in parallel. For a brief introduction to multithreaded programming in Java (a very complex topic), go here. Most importantly, read this sub-section.


Program to write

This week, you have to extend and modify your program from last week:
  1. Implement a class called URLPool whose instances store a list of all URLs to be searched, along with the relative "level" of each of those URLs (also known as its "search depth"). The first URL you search will be at search depth 0, URLs found on that page will be at search depth 1, etc. You should store URLs and their search depths together as instances of a class called URLDepthPair, as you did last week. We recommend you use a Vector to store the items.

    There should be a way for users of the URLPool class to extract a URL and its search depth from the list, removing it from the list in the same operation. There should also be a way to insert another URL with an associated search depth into the pool. None of these methods should gratuitously expose the underlying storage implementation. (A sketch of one possible URLPool, including the synchronization and wait()/notifyAll() handling described in items 4 and 5 below, appears after this list.)

  2. This time around, create a Crawler class which implements Runnable and holds a reference to one of your URLPool objects described above. The crawler's job should be to:

    1. Wait until a (URL, search depth) pair is available in the pool,
    2. remove it from the pool,
    3. write it to System.out (both the URL and the depth),
    4. if the search depth is >= a maximum search depth you specify, go back to #1; otherwise,
    5. crawl the URL and place all new-found URLs in the pool with a search depth one more than the search depth of the original pair, and
    6. go back to #1.

    Continue until there are no more (URL, depth) pairs in the pool to crawl. (A Crawler sketch following this outline also appears after the list.)

  3. Since your crawlers will pull URLs from the pool and crawl them in the background, you should accept a third command-line parameter indicating the number of crawlers to spawn. Create all of these crawlers, point them at your URL pool, and seed the pool with the passed-in base URL at search depth 0. Note that there will be a separate thread for each crawler. When all the crawlers have finished (i.e., when there are no more (URL, depth) pairs in the pool to crawl), there must be a way to shut down all the crawler threads and exit gracefully. (One way to detect this condition is shown in the main-class sketch after this list.)

  4. Make sure to synchronize access to your URL pool object at any and all critical points, since it must now be thread-safe.

  5. Don't have your crawlers continuously poll the URL pool for another URL when it is empty. Instead, have a crawler wait() when the pool is empty, and have every crawler call notifyAll() whenever it adds another URL to the pool.
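
Here is a minimal sketch of what a thread-safe URLPool might look like. It is only one way to arrange things: the URLDepthPair shown (with getURL() and getDepth() accessors), the method names addPair(), getPair(), and getWaitCount(), and the use of a LinkedList in place of the recommended Vector are all choices made for these sketches, not requirements. The synchronized methods and the wait()/notifyAll() calls correspond to items 4 and 5 above.

    import java.util.LinkedList;

    // Illustrative URLDepthPair: a URL string plus its search depth.
    class URLDepthPair {
        private final String url;
        private final int depth;

        public URLDepthPair(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }

        public String getURL() { return url; }
        public int getDepth()  { return depth; }

        public String toString() { return "[depth " + depth + "] " + url; }
    }

    // One possible URLPool: all access goes through synchronized methods so
    // several crawler threads can share a single instance safely.
    public class URLPool {
        // Internal storage; never handed out directly to callers.
        private final LinkedList<URLDepthPair> pending = new LinkedList<>();

        // How many threads are currently blocked in getPair(); the main class
        // can use this to detect when the whole crawl is finished.
        private int waitingThreads = 0;

        // Add a pair and wake up any crawler waiting for work.
        public synchronized void addPair(URLDepthPair pair) {
            pending.addLast(pair);
            notifyAll();
        }

        // Remove and return the next pair, blocking while the pool is empty.
        public synchronized URLDepthPair getPair() throws InterruptedException {
            while (pending.isEmpty()) {
                waitingThreads++;
                wait();            // releases the lock until notifyAll()
                waitingThreads--;
            }
            return pending.removeFirst();
        }

        public synchronized int getWaitCount() {
            return waitingThreads;
        }
    }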
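
A Crawler following the six steps in item 2 might then look like the sketch below, assuming the URLPool above. The getURLsFromPage() helper is a placeholder for your page-scanning code from last week; its name and signature are assumptions.

    import java.util.Collections;
    import java.util.List;

    public class Crawler implements Runnable {
        private final URLPool pool;
        private final int maxDepth;

        public Crawler(URLPool pool, int maxDepth) {
            this.pool = pool;
            this.maxDepth = maxDepth;
        }

        public void run() {
            try {
                while (true) {
                    // Steps 1-2: block until a pair is available, then take it.
                    URLDepthPair pair = pool.getPair();

                    // Step 3: report the URL and its depth.
                    System.out.println(pair);

                    // Step 4: don't crawl past the maximum search depth.
                    if (pair.getDepth() >= maxDepth) {
                        continue;
                    }

                    // Step 5: crawl the page and add every link found,
                    // one level deeper than the current pair.
                    for (String url : getURLsFromPage(pair.getURL())) {
                        pool.addPair(new URLDepthPair(url, pair.getDepth() + 1));
                    }
                }
            } catch (InterruptedException e) {
                // The main thread interrupts the crawlers to shut them down.
            }
        }

        // Placeholder for last week's code: fetch the page and return the
        // URLs it links to.
        private List<String> getURLsFromPage(String url) {
            return Collections.emptyList();
        }
    }

Because several crawler threads print to System.out, lines from different threads may interleave in the output; that is fine for this assignment.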
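
Finally, a rough sketch of how the main class might parse the arguments, spawn the crawlers, seed the pool, and shut everything down. The class name WebCrawlerMain and the use of getWaitCount() to detect completion are assumptions of these sketches; the idea is that when every crawler is blocked waiting on an empty pool, no thread can add more work, so the crawl is over.

    public class WebCrawlerMain {
        public static void main(String[] args) throws InterruptedException {
            if (args.length != 3) {
                System.err.println("usage: java WebCrawlerMain <URL> <maxDepth> <numCrawlers>");
                System.exit(1);
            }

            int maxDepth    = Integer.parseInt(args[1]);
            int numCrawlers = Integer.parseInt(args[2]);

            // Seed the pool with the base URL at search depth 0.
            URLPool pool = new URLPool();
            pool.addPair(new URLDepthPair(args[0], 0));

            // Spawn one thread per crawler, all sharing the same pool.
            Thread[] crawlers = new Thread[numCrawlers];
            for (int i = 0; i < numCrawlers; i++) {
                crawlers[i] = new Thread(new Crawler(pool, maxDepth));
                crawlers[i].start();
            }

            // When every crawler is waiting on an empty pool, the crawl is
            // done; interrupt the threads so they exit gracefully. (Polling
            // the count like this is one simple approach, not the only one.)
            while (pool.getWaitCount() < numCrawlers) {
                Thread.sleep(100);
            }
            for (Thread t : crawlers) {
                t.interrupt();
            }
        }
    }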


Design advice


Extra credit