Last week's web crawler was not particularly efficient. This week you will extend your web crawler to use Java threading, so that multiple web pages can be crawled in parallel. This will produce a significant performance improvement, because the time each crawler thread spends waiting for network operations to complete can be overlapped with processing in other threads.
For a detailed introduction to multithreaded programming in Java (a very complex topic!), you can read the threading tutorial here. Most importantly, read this sub-section.
This week, you will extend and modify your program from last week:
Implement a class called URLPool, which will store a list of all URLs to be searched, along with the relative "level" of each of those URLs (also known as a "search depth"). The first URL you search will be at search depth 0, URLs found on that page will be search depth 1, etc. You should store URLs and their search depth together as instances of a class called URLDepthPair like you did last week. A LinkedList is recommended for storing the items, since it can perform the required operations very efficiently.
There should be a way for the user of the URLPool class to fetch a URL-depth pair from the pool, removing it from the list in the same operation. There should also be a way to add a URL-depth pair to the pool. Both of these operations need to be thread-safe, since multiple threads will be interacting with the URLPool at the same time.
Unlike the FIFO example discussed in class, the URL pool shouldn't have a maximum size limit. It really just needs the list of pending URLs, the list of scanned URLs, and one other field discussed below.
To perform the web crawling in multiple threads, you should create a CrawlerTask class that implements the Runnable interface. Each CrawlerTask instance should have a reference to one instance of the URLPool class described above. (Note that all of the CrawlerTask instances share that single pool!) The crawler's job should be to get a URL-depth pair from the pool, download the web page it refers to, scan that page for more URLs, and add each URL it finds back to the pool at a depth one greater than the page's own depth.
This should continue until there are no more (URL, depth) pairs in the pool to crawl.
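As a rough sketch, a crawler task might be organized as follows. The names used here (URLPool.get(), URLPool.add(), a URLDepthPair constructor taking a URL string and a depth, and a getDepth() accessor) are illustrative assumptions based on last week's lab, not requirements; adapt them to your own classes.

    import java.util.LinkedList;
    import java.util.List;

    // Illustrative sketch only -- adapt it to your own classes from last week.
    public class CrawlerTask implements Runnable {
        private URLPool pool;   // shared by every crawler task

        public CrawlerTask(URLPool pool) {
            this.pool = pool;
        }

        public void run() {
            // The loop never exits on its own; get() simply blocks when the pool
            // is empty, and main() shuts the program down once all threads are idle.
            while (true) {
                URLDepthPair pair = pool.get();

                // Download the page and pull out its links, using last week's code.
                List<String> links = getAllLinks(pair);

                // Every link found goes back into the pool, one level deeper.
                for (String link : links) {
                    pool.add(new URLDepthPair(link, pair.getDepth() + 1));
                }
            }
        }

        // Placeholder for last week's download/scan code.
        private List<String> getAllLinks(URLDepthPair pair) {
            return new LinkedList<String>();
        }
    }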
Since your web crawler will be spawning some number of crawler threads, update your program to accept a third command-line parameter specifying the number of crawler threads to spawn. Your main function will operate something like the sketch below.
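Here is a minimal sketch of how main might be arranged, assuming a top-level class named Crawler, a URLPool constructor that takes the maximum depth, and a getWaitCount() accessor like the one described in the hints below (all of these names are illustrative):

    public class Crawler {
        public static void main(String[] args) {
            if (args.length != 3) {
                System.out.println("usage: java Crawler <URL> <depth> <numThreads>");
                System.exit(1);
            }

            String startURL = args[0];
            int maxDepth = Integer.parseInt(args[1]);
            int numThreads = Integer.parseInt(args[2]);

            // Create the shared pool and seed it with the starting URL at depth 0.
            URLPool pool = new URLPool(maxDepth);
            pool.add(new URLDepthPair(startURL, 0));

            // Spawn the requested number of crawler threads, all sharing one pool.
            for (int i = 0; i < numThreads; i++) {
                new Thread(new CrawlerTask(pool)).start();
            }

            // Wait until every crawler thread is idle, then report and exit.
            // (See the hints below for the waiting-thread count and polling.)
            while (pool.getWaitCount() != numThreads) {
                try {
                    Thread.sleep(100);   // 0.1 second between checks
                } catch (InterruptedException e) {
                    // Ignored:  just check the count again.
                }
            }

            // Print the URLs the pool has seen, however your design exposes them.
            System.exit(0);
        }
    }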
Make sure to synchronize your URL pool object at any and all critical points, since it must now be thread-safe.
Don't have your crawlers continuously poll the URL pool for another URL when it is empty. Instead, have them wait efficiently when no URL is available. You should implement this by having the URL pool's "get URL" method wait() internally, if no URL is currently available. Correspondingly, the URL pool's "add URL" method should notify() when a new URL is added to the pool.
Note that crawler threads won't themselves be performing any of these synchronize / wait / notify operations. This is for the same reason that the URL pool hides the details of how URLs are stored and retrieved: encapsulation! Just like you want users of the URL pool to not have to worry about implementation details, you also want them to not have to worry about threading details.
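To make this concrete, here is one possible shape for the pool's fields and its blocking "get" operation. The field names, the maxDepth constructor argument, and the waitCount counter (explained further in the hints below) are all illustrative choices, not requirements:

    import java.util.LinkedList;

    // Sketch of the pool's fields and its blocking "get" operation.
    public class URLPool {
        private LinkedList<URLDepthPair> pendingURLs = new LinkedList<URLDepthPair>();
        private LinkedList<URLDepthPair> seenURLs = new LinkedList<URLDepthPair>();
        private int waitCount = 0;    // threads currently waiting inside get()
        private int maxDepth;

        public URLPool(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        /** Remove and return the next pending pair, waiting if none is available. */
        public synchronized URLDepthPair get() {
            while (pendingURLs.isEmpty()) {
                waitCount++;
                try {
                    wait();   // releases the lock until add() calls notify()
                } catch (InterruptedException e) {
                    // Ignored:  loop around and re-check the list.
                }
                waitCount--;
            }
            return pendingURLs.removeFirst();
        }
    }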
Here are some helpful hints for implementing Lab 7!
You can probably reuse big chunks of your code from last week with little modification. The URLDepthPair class probably won't need to be modified at all. Most of the web-crawling code can also be reused. The main differences are that the URL-download/page-crawl code is now in a class that implements Runnable, and the code will now get and add URLs on a URLPool instance.
You need to synchronize access to the URLPool's internal fields, since it will be accessed from multiple threads. As discussed in class, the most straightforward approach is to use synchronized methods, and avoid trying to code synchronized blocks within the class' methods. You don't need to synchronize the URLPool constructor! Think about which methods need to be synchronized.
Write your URLPool's methods to use wait() and notify() so that crawler threads can efficiently wait for new URLs to become available.
Have the URLPool manage which URLs go into the list of pending URLs, based on the depth of each URL added to the pool. If the URL's depth is below the max depth, go ahead and add the URL-pair to the pending queue. If not, just add the URL to the "seen" list, and don't enqueue it for later crawling.
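Continuing the URLPool sketch from above, the "add" side might look like this. Recording every pair on the seen list while enqueueing only the shallow ones is one reasonable reading of the rule; adjust it to your own design:

    /** Add a pair to the pool; only pairs below the maximum depth get queued. */
    public synchronized void add(URLDepthPair pair) {
        // Record every URL we encounter so it can be reported at the end.
        seenURLs.addLast(pair);

        if (pair.getDepth() < maxDepth) {
            // Still within range:  enqueue it and wake one waiting crawler thread.
            pendingURLs.addLast(pair);
            notify();
        }
        // Otherwise the URL is remembered as seen but never crawled.
    }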
The trickiest part of this lab is to figure out a way to exit the program once there are no more URLs to crawl (i.e. once the maximum search depth has been reached). When this happens, all the CrawlerTask threads will be wait()ing for a new URL in the URLPool. We recommend that your URLPool should keep track of how many threads are waiting for a new URL, by adding an int field that is incremented immediately before calling wait(), and that is decremented immediately after wait() returns.
Once you have a count of the number of waiting threads, you can also implement a (synchronized!) method that returns the number of waiting threads. When this count equals the total number of crawler threads, it's time to exit the program. You can monitor this count in your main() function, and call System.exit() to shut down. This is indeed another example of much-dreaded polling, but you can mitigate its cost by having your main method sleep for a short time between checks so as not to waste too many processor cycles. Somewhere between 0.1 second and 1 second is probably a good amount of time to sleep.
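The accessor itself is tiny; making it synchronized guarantees that main always reads an up-to-date value of the counter:

    /** Returns how many crawler threads are currently waiting for a URL. */
    public synchronized int getWaitCount() {
        return waitCount;
    }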
Update your URL-depth pair to use the java.net.URL class, and upgrade your web crawler to follow relative URLs as well as absolute URLs. This will require a bit of rework, but shouldn't be too difficult.
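The two-argument java.net.URL constructor does most of the work: given the URL of the page a link appeared on, it resolves relative links and leaves absolute ones alone. A small illustration (the URLs are made up):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class RelativeURLDemo {
        public static void main(String[] args) throws MalformedURLException {
            URL base = new URL("http://www.example.com/docs/index.html");

            // A relative link is resolved against the page it appeared on.
            URL relative = new URL(base, "../images/logo.png");
            System.out.println(relative);   // http://www.example.com/images/logo.png

            // An absolute link is left alone by the same constructor.
            URL absolute = new URL(base, "http://www.other.org/page.html");
            System.out.println(absolute);   // http://www.other.org/page.html
        }
    }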
Exiting your program by calling System.exit() isn't the cleanest thing to do. Figure out a cleaner way to exit the program once all the crawling is done.
Implement a master listing of URLs that have been traversed, and avoid revisiting old links. Use one of the Java collection classes to make this easy. Some kind of set that supports constant-time lookups and inserts would be most desirable.
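A java.util.HashSet of URL strings works well here: lookups and inserts are constant-time on average, and add() returns false when the element is already present, so one call both records a URL and tells you whether it is new. For example:

    import java.util.HashSet;
    import java.util.Set;

    public class VisitedSetDemo {
        public static void main(String[] args) {
            Set<String> visited = new HashSet<String>();

            // add() returns true only the first time a given URL is inserted.
            System.out.println(visited.add("http://www.example.com/"));  // true  (new URL)
            System.out.println(visited.add("http://www.example.com/"));  // false (already seen)
        }
    }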
Add another optional command-line parameter to specify how long a crawler thread should wait for a server to return the requested web page. Use this time as a threshold for the start of page receipt, and then support an additional maxPatience parameter indicating the point at which the crawler task will abort crawling that page and move on to the next one.
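One plausible way to implement limits like these, assuming your crawler downloads pages through java.net.URLConnection (if it still talks to the server through a raw Socket, Socket.setSoTimeout plays a similar role), is to set the standard connect and read timeouts, both given in milliseconds:

    import java.io.IOException;
    import java.net.URL;
    import java.net.URLConnection;

    public class TimeoutDemo {
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://www.example.com/");
            URLConnection conn = url.openConnection();

            // Maximum time to wait for the connection to be established.
            conn.setConnectTimeout(5000);    // 5 seconds

            // Maximum time to wait for data once the connection is open --
            // roughly the "maxPatience" idea described above.
            conn.setReadTimeout(10000);      // 10 seconds

            // A java.net.SocketTimeoutException is thrown if either limit is hit,
            // which the crawler task can catch and treat as "give up on this page."
            conn.getInputStream().close();
        }
    }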
Go ahead and finish your Google-competitor since you did extra credit #4 from last week ;-)