Java Lab 6: Surfing the Web


Goals

In this lab, you get to write a rudimentary web crawler. Your crawler will automatically download web pages from the internet, search for new links within those pages, and repeat. This week's crawler will be just about the simplest one imaginable: it will look for new URLs (web page locations) on each page, collect them, and print them out at the end. More sophisticated web crawlers are used to do things like index the Internet's content or scavenge for email addresses to spam; if you've ever used a search engine, you've queried against the data that a web crawler generated.

Terminology

Program To Write

Here is the specification of the program you are to write.

  1. The program should accept two parameters on the command line:

    1. a string representing the URL at which to start browsing
    2. a positive integer representing a maximum search depth (see below)

    If the correct arguments are not supplied, the program should immediately stop and print out a usage message, e.g.:

      usage: java Crawler <URL> <depth>

    (To learn more about processing command-line arguments in Java, you can read this page. A small argument-checking sketch also appears after this list.)

  2. The program should store the URL as a String along with its depth (which is 0 to start). You should create a special class to represent [URL, depth] pairs.

  3. The program should connect to the host specified in the URL on port 80 using a Socket (see below) and request the specified web page.

  4. The program should parse the returned text, if any, line by line for any substrings which have the format:

      <a href="[any URL starting with http://]">

    Found URLs should be stored, along with a new depth value in a LinkedList of (URL, depth) pairs (see below for more about LinkedLists). The new depth value should be one more than the depth value of the URL corresponding to the page being parsed.

  5. The program should then close the socket connection to the host.

  6. The program should then repeat steps 3 to 5 on each new URL, as long as the depth corresponding to the URL is less than the maximum depth. Note that when a particular URL is retrieved and searched, the search depth goes up by 1. If a URL's depth has reached the maximum depth (or greater), do not retrieve or search that web page.

  7. Finally, the program should print out all the URLs visited along with their search depths.
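
To give a rough idea of step 1, the argument checking in main might look something like the sketch below. The variable names (startURL, maxDepth) are only suggestions, and exactly how you report errors is up to you.

  public static void main(String[] args) {
      // Expect exactly two arguments: the starting URL and the maximum depth.
      if (args.length != 2) {
          System.out.println("usage: java Crawler <URL> <depth>");
          System.exit(1);
      }

      String startURL = args[0];
      int maxDepth = 0;
      try {
          maxDepth = Integer.parseInt(args[1]);
      } catch (NumberFormatException e) {
          System.out.println("usage: java Crawler <URL> <depth>");
          System.exit(1);
      }

      // ... create the crawler, crawl starting from (startURL, depth 0),
      //     and print the results at the end.
  }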

Assumptions

Useful classes and methods

As always, see the Java API for more details. These classes and methods should get you started. Note that most of these methods throw various kinds of exceptions, which you will have to handle. Again, see the Java API to find out what they are.

Socket

To use Sockets you have to include this line in your program:

  import java.net.*;

Constructor

Socket(String host, int port) creates a new Socket from a String representing the host and a port number, and makes the connection.
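
For example, connecting to a web server on the standard HTTP port might look like this (the host name here is only a placeholder):

  Socket my_socket = new Socket("www.example.com", 80);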

Methods

Streams

To use streams you have to include this line in your program:

  import java.io.*;

To use Sockets effectively, you will want to convert the InputStream and OutputStream associated with the Socket to something more usable. InputStream and OutputStream instances are very primitive objects; they can only read or write bytes and arrays of bytes (not even chars). Since you want to read and write characters, you need objects that convert between bytes and chars and work with whole lines. Unfortunately, the Java API does this in somewhat different ways for input and output.

Input Streams

For input streams you can use the InputStreamReader class as follows:

  InputStreamReader in = new InputStreamReader(my_socket.getInputStream());

Now in is an InputStreamReader which can read characters from the Socket. However, this still isn't very friendly, because you have to work with individual chars or arrays of chars. It would be nice to be able to read in whole lines. For this you can use the BufferedReader class. You can create a BufferedReader from an InputStreamReader instance and then call the readLine method of the BufferedReader, which reads in a whole line from the other end of the socket.
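
Putting these two pieces together, reading the server's response line by line might look like this sketch (variable names are only illustrative):

  BufferedReader reader =
      new BufferedReader(new InputStreamReader(my_socket.getInputStream()));

  String line;
  while ((line = reader.readLine()) != null) {
      // Scan this line for links here.  readLine returns null
      // when the other end of the socket has no more data to send.
  }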

Output Streams

Output streams are a bit simpler. You can create a PrintWriter instance directly from the socket's OutputStream object and then call its println method to send a line of text to the other end of the socket. You should use this constructor:

  PrintWriter(OutputStream out, boolean autoFlush)

with autoFlush set to true. This will flush the output buffer after each println.
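
For example, sending a request to the server might look something like the sketch below. The exact request lines your crawler sends are up to you; the HTTP/1.0-style request shown here is only one simple possibility.

  PrintWriter writer = new PrintWriter(my_socket.getOutputStream(), true);
  writer.println("GET / HTTP/1.0");   // ask for the root page of the host
  writer.println();                   // a blank line ends the HTTP request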

String methods

You'll find these String methods useful. See the API for documentation.

NOTE: Do not use == to compare Strings for equality! It will only return true if the two Strings are the same object. If you want to compare the contents of the two Strings, use the equals method.
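
For instance, methods such as indexOf, substring, and startsWith are handy when scanning a line for links. The snippet below is only a sketch of the idea; in your real code the literal strings should be the constants described in the Constants section below.

  int tagStart = line.indexOf("<a href=\"");
  if (tagStart != -1) {
      // Skip past the link text, then take everything up to the closing quote.
      int urlStart = tagStart + "<a href=\"".length();
      int urlEnd = line.indexOf("\"", urlStart);
      if (urlEnd != -1) {
          String url = line.substring(urlStart, urlEnd);
          if (url.startsWith("http://")) {
              // Found a usable URL.
          }
      }
  }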

Lists

Lists are very much like arrays of Objects except that they can expand or contract easily and as needed, and they don't have the bracket-notation for looking up individual items. To use Lists you have to include this line in your program:

  import java.util.*;

You should store the (URL, depth) pairs in a LinkedList, which is a specific implementation of List. Create one like this:

  LinkedList<URLDepthPair> myList = new LinkedList<URLDepthPair>();

and then see the API for lots of useful methods on Lists and the different implementations of Lists. (Specifically, you will notice that different implementations of List provide different features. This is why LinkedList is recommended; a few of its features are particularly well-suited to this assignment.)

The special syntax for creating a LinkedList above makes use of the new Java 1.5 generics support. It means that you don't have to cast objects that you store in or retrieve from the list, which will save you a few headaches.
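
For example, here are a few of the List methods you are likely to find useful (the variable name and the URL are only placeholders):

  LinkedList<URLDepthPair> pending = new LinkedList<URLDepthPair>();

  pending.add(new URLDepthPair("http://www.example.com/", 0));  // add to the end
  boolean anyLeft = !pending.isEmpty();                         // is anything left?
  URLDepthPair next = pending.removeFirst();                    // remove the first pair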

Exceptions

When you find something that looks like a URL but doesn't start with "http://", you should throw a MalformedURLException, which is part of the Java API.
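
For example (just a sketch; in your real code, use a constant for "http://" as described in the Constants section below):

  if (!url.startsWith("http://")) {
      throw new MalformedURLException("URL does not start with http://: " + url);
  }

Note that MalformedURLException is a checked exception, so any method that throws it must declare it in its throws clause, or the caller must catch it.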

Design Advice

Here are some design guidelines for the implementation of your web crawler. (Not doing these things usually results in redos... hint hint!)

URL-Depth Pairs

As mentioned above, you should create a special URLDepthPair class, each instance of which includes a String field representing a URL and an int representing a search depth. The class should also have a toString method which returns the contents of the pair as a string. This makes outputting the results of your web crawl much easier.

Individual URLs need to be broken apart into their components. This URL parsing and manipulation should be part of the URL-depth pair class you create. Good object-oriented design dictates that if a particular class is going to store a certain kind of data, then any manipulation of that data should also be implemented in that class. So, if you write any functions to break a URL apart, or to check if a URL is valid, put them in this class!
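
A skeleton of such a class might look like the sketch below. Aside from toString, the field, method, and constant names are only suggestions:

  public class URLDepthPair {
      // The prefix that every valid URL must start with (see Constants below).
      public static final String URL_PREFIX = "http://";

      private String url;
      private int depth;

      public URLDepthPair(String url, int depth) {
          this.url = url;
          this.depth = depth;
      }

      public String getURL() {
          return url;
      }

      public int getDepth() {
          return depth;
      }

      // Returns the contents of the pair; used when printing the results.
      public String toString() {
          return "[" + url + ", depth " + depth + "]";
      }

      // URL manipulation belongs in this class; this is one example check.
      public static boolean isValidURL(String url) {
          return url.startsWith(URL_PREFIX);
      }
  }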

Crawlers

As mentioned above, you should design a Crawler class which will implement the main functionality of the application. This class should have a getSites method that will return a list of all URL-depth pairs that have been visited. You can call this in your main method once crawling is completed; retrieve the list, then iterate through it and print all of the URLs.
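
For example, once crawling is finished, the printing step in main might look roughly like this (the crawl method name is hypothetical; getSites is the method described above):

  Crawler crawler = new Crawler();
  crawler.crawl(startURL, maxDepth);   // hypothetical method that does the actual work

  for (URLDepthPair pair : crawler.getSites()) {
      System.out.println(pair);        // uses URLDepthPair's toString method
  }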

The easiest way to keep track of the sites visited is to have two lists, one for all the sites seen so far, and one that only includes sites that have not yet been processed. You should iterate through all the sites that haven't been processed, removing each site before you download its contents, and every time you find a new URL you should put it into the not-processed list. When the not-processed list is empty then you're all done - you've found all the sites.
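
Putting that together, the heart of the crawler could be structured roughly as in the sketch below (the field and method names are only one possibility):

  LinkedList<URLDepthPair> pending = new LinkedList<URLDepthPair>();
  LinkedList<URLDepthPair> seen = new LinkedList<URLDepthPair>();

  pending.add(new URLDepthPair(startURL, 0));

  while (!pending.isEmpty()) {
      URLDepthPair pair = pending.removeFirst();   // remove before processing
      seen.add(pair);

      // Don't retrieve pages whose depth has reached the maximum.
      if (pair.getDepth() >= maxDepth)
          continue;

      // Open a socket to the pair's host, request the page, scan each line
      // for new URLs, add each one to pending with depth (pair.getDepth() + 1),
      // and close the socket.  (See the sections above for the details.)
  }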

Although you might think to yourself, "Opening a socket on a URL is an operation involving the URL, and should therefore be implemented on the URL-depth pair class," that would be a little too specialized for the purposes of the pair class. It is really just a place to store URLs and depth values, with a few extra utilities provided. The crawler is the class that is navigating web pages and searching for URLs, so the crawler class should contain the code that actually opens and closes sockets.

You will need to create a new Socket instance for each URL that you are downloading text from. Make sure to close the socket when you are finished scanning that webpage, so that the operating system doesn't run out of networking resources! (There are only so many sockets that a computer can keep open at once.) Also, don't use recursion to search more deeply nested web pages; implement this functionality as a loop. This will also make your web crawler much more efficient as far as resource usage is concerned.

Constants!

Your program will undoubtedly have strings like "http://" and "a href=\"" in it, and you will probably be tempted to just repeat these strings wherever you need them. You will also need the lengths of these strings for various string operations, so you may be tempted to hard-code those lengths into your code too. Don't do this!!! It makes for very unmaintainable code! If you make a typo, or later change how you search, you will have to update many different lines of code.

Instead, create String constants in your classes. For example, you might have this:

    public static final String URL_PREFIX = "http://";

Now, if you need this string, instead of directly hard-coding it, just use the URL_PREFIX constant. If you need its length, you are in luck - URL_PREFIX is a String object, so you can call URL_PREFIX.length() and get its length back.

You should also think about where you put these constants. You should only have each constant appear once in your whole project, and you should put the constant where it makes the most sense. For example, since the URL prefix is necessary to tell if a URL is valid, you should put that constant in your URL-depth pair class. If you have another constant for HTML links, put that into your crawler class. If the crawler ever needs the URL prefix, it can simply refer to the URL-depth pair constant, instead of duplicating that constant itself.
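
For instance, the crawler class might keep its own constant for the link text while reusing the pair class's URL prefix (the names here are only illustrative):

  // In the crawler class:
  public static final String LINK_PREFIX = "<a href=\"";

  // Elsewhere in the crawler, reuse the constant from the pair class
  // instead of duplicating the "http://" literal:
  boolean valid = url.startsWith(URLDepthPair.URL_PREFIX);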

Extra Credit


[end lab 6] Copyright (C) 2003-2008, California Institute of Technology.
Last updated February 19, 2008