Last week's web crawler was not particularly efficient. This week, we'll harness the power of Java's multithreaded architecture to make multiple queries in parallel. For a brief introduction to multithreaded programming in Java (a very complex topic), go here. Most importantly, read this sub-section.
Create a URLPool class whose instances store a list of all URLs to be searched, along with the relative "level" of each of those URLs (also known as a "search depth"). The first URL you search will be at search depth 0, URLs found on that page will be at search depth 1, etc. You should store each URL and its search depth together as an instance of a class called URLDepthPair, like you did last week. We recommend you use a Vector to store the items.

There should be a way for the user of the URLPool class to extract a URL and its search depth from the list and have it removed from the list in a single step. There should also be a way to insert another URL with an associated search depth into the pool. None of these methods should gratuitously expose the underlying storage implementation.
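As a rough sketch of what this interface could look like (the method names get, put, and size are our own choices, not requirements; thread-safety with wait()/notifyAll() comes later in the assignment):

```java
import java.util.LinkedList;

// A minimal URLDepthPair, as in last week's lab.
class URLDepthPair {
    private final String url;
    private final int depth;
    URLDepthPair(String url, int depth) { this.url = url; this.depth = depth; }
    String getURL() { return url; }
    int getDepth() { return depth; }
    public String toString() { return depth + "\t" + url; }
}

// Hypothetical sketch of the URLPool API.  A Vector works just as well as
// the LinkedList used here; the point is that callers never see the list.
class URLPool {
    private final LinkedList<URLDepthPair> pairs = new LinkedList<>();

    // Extract the next (URL, depth) pair and remove it, in a single step.
    URLDepthPair get() {
        return pairs.removeFirst();
    }

    // Insert another URL with its associated search depth.
    void put(URLDepthPair pair) {
        pairs.addLast(pair);
    }

    int size() {
        return pairs.size();
    }
}
```

Note that get() hands back a URLDepthPair rather than, say, the list itself, so the underlying storage stays hidden.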
This time around, create a Crawler class which implements Runnable and holds a reference to one of your URLPool objects described above. The crawler's job is essentially the same as last week's: take (URL, depth) pairs from the pool, retrieve each page, print what it finds to System.out (both the URL and the depth), add newly discovered URLs back to the pool at the next search depth, and continue until there are no more (URL, depth) pairs in the pool to crawl.
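A sketch of that run loop might look like the following (scanPage is a placeholder for last week's page-fetching code, and this single-threaded sketch has get() return null on an empty pool; the threaded version blocks instead, as described later in the assignment):

```java
import java.util.LinkedList;
import java.util.List;

// Minimal supporting types so the sketch compiles; the real versions
// come from last week's lab.
class URLDepthPair {
    final String url;
    final int depth;
    URLDepthPair(String url, int depth) { this.url = url; this.depth = depth; }
}

class URLPool {
    private final LinkedList<URLDepthPair> pairs = new LinkedList<>();
    synchronized void put(URLDepthPair p) { pairs.addLast(p); }
    synchronized URLDepthPair get() {
        return pairs.isEmpty() ? null : pairs.removeFirst();
    }
}

// Sketch of a Runnable crawler holding a reference to the shared pool.
class Crawler implements Runnable {
    private final URLPool pool;
    private final int maxDepth;

    Crawler(URLPool pool, int maxDepth) {
        this.pool = pool;
        this.maxDepth = maxDepth;
    }

    public void run() {
        URLDepthPair pair;
        while ((pair = pool.get()) != null) {
            // Print both the URL and the depth.
            System.out.println(pair.depth + "\t" + pair.url);
            if (pair.depth < maxDepth) {
                // Add each discovered link back to the pool at depth + 1.
                for (String link : scanPage(pair.url)) {
                    pool.put(new URLDepthPair(link, pair.depth + 1));
                }
            }
        }
    }

    // Placeholder: the real method opens a socket and scans the page for URLs.
    private List<String> scanPage(String url) {
        return List.of();
    }
}
```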
Note that there will be a separate thread for each crawler. When all the crawlers have finished (i.e. when there are no more (URL, depth) pairs in the pool to crawl), there must be a way to shut down all the crawler threads and exit gracefully.
To avoid busy-waiting, have a crawler wait() when the pool is empty, and have your other crawlers always notifyAll() whenever they add another URL to the pool.
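Inside the pool, that pattern could be sketched as follows (the waitCount field and getWaitCount() method anticipate the shutdown logic discussed later; all names here are our own):

```java
import java.util.LinkedList;

class URLDepthPair {
    final String url;
    final int depth;
    URLDepthPair(String url, int depth) { this.url = url; this.depth = depth; }
}

// Sketch of the wait()/notifyAll() pattern inside the pool.
class URLPool {
    private final LinkedList<URLDepthPair> pairs = new LinkedList<>();
    private int waitCount = 0;   // threads currently blocked in get()

    synchronized URLDepthPair get() throws InterruptedException {
        while (pairs.isEmpty()) {
            waitCount++;         // increment before calling wait()...
            wait();              // releases the lock until someone notifies
            waitCount--;         // ...and decrement after waking up
        }
        return pairs.removeFirst();
    }

    synchronized void put(URLDepthPair pair) {
        pairs.addLast(pair);
        notifyAll();             // wake any crawlers waiting for work
    }

    synchronized int getWaitCount() {
        return waitCount;
    }
}
```

Because both methods are synchronized on the pool itself, callers never need to do any locking of their own.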
Your URLDepthPair class probably won't need to be modified at all (although feel free to do so if you can make it better). Most of your Crawler class can also be re-used. However, now your Crawler class has to be Runnable, but it doesn't have to take care of storing URLDepthPairs, so it will have to be modified accordingly.
Use the wait()/notifyAll() pattern that we discussed in class to make sure that your URLPool class is thread-safe. You shouldn't need any synchronized methods, or any use of wait() or notifyAll(), outside of this class (do you understand why?).
Eventually, all of your Crawler threads will be wait()ing at the same point in the URLPool. We recommend you have an int field in the URLPool class that counts the number of threads waiting at this point (increment it before calling wait() and decrement it after). Also, have a synchronized method that returns the count of waiting threads. When this count equals the total number of threads, it's time to exit the program. You can monitor this in the main() routine and call System.exit() to shut down the Java Virtual Machine (killing all threads). It's best to have the main thread sleep for a short time between checks so as not to waste too many processor cycles.
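Putting those pieces together, the monitoring loop in main() might look like this (a self-contained sketch: the crawler threads here are stubs that just drain the pool, and the thread count would normally come from the command line):

```java
import java.util.LinkedList;

class URLDepthPair {
    final String url;
    final int depth;
    URLDepthPair(String url, int depth) { this.url = url; this.depth = depth; }
}

class URLPool {
    private final LinkedList<URLDepthPair> pairs = new LinkedList<>();
    private int waitCount = 0;

    synchronized URLDepthPair get() throws InterruptedException {
        while (pairs.isEmpty()) {
            waitCount++;
            wait();
            waitCount--;
        }
        return pairs.removeFirst();
    }

    synchronized void put(URLDepthPair p) { pairs.addLast(p); notifyAll(); }
    synchronized int getWaitCount() { return waitCount; }
}

class CrawlerApp {
    public static void main(String[] args) throws InterruptedException {
        int numThreads = 4;                    // normally a command-line argument
        URLPool pool = new URLPool();
        pool.put(new URLDepthPair("http://example.com", 0));

        for (int i = 0; i < numThreads; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        URLDepthPair pair = pool.get();
                        System.out.println(pair.depth + "\t" + pair.url);
                        // a real crawler would fetch the page and add links here
                    }
                } catch (InterruptedException e) {
                    // shutting down
                }
            }).start();
        }

        // Once every crawler is wait()ing, no thread can produce more work.
        while (pool.getWaitCount() < numThreads) {
            Thread.sleep(100);                 // don't busy-wait in main either
        }
        System.exit(0);                        // kills all waiting threads
    }
}
```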
You might also give your page-retrieval code a maxPatience parameter indicating the point at which the socket will be ignored and closed regardless.
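One way to realize such a limit (a sketch; maxPatience is the name from the text, everything else here is our own choice) is Socket.setSoTimeout, which makes any blocked read throw SocketTimeoutException instead of hanging forever:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.net.SocketTimeoutException;

class TimeoutDemo {
    // Read one line from the server, but give up after maxPatience
    // milliseconds of silence and return null so the crawler can move on.
    static String readLineWithPatience(String host, int port, int maxPatience)
            throws IOException {
        try (Socket sock = new Socket(host, port)) {
            // Reads blocking longer than maxPatience throw
            // SocketTimeoutException instead of stalling the thread.
            sock.setSoTimeout(maxPatience);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream()));
            try {
                return in.readLine();
            } catch (SocketTimeoutException e) {
                return null;   // out of patience: abandon this page
            }
        }   // try-with-resources closes the socket either way
    }
}
```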