Lab 6: Surfing the Web


Goals

In this lab, we are going to write a rudimentary web crawler. A web crawler is a program that automatically downloads web pages from the internet, searches them for some information, and uses that information to look for new pages. The web crawler we'll write will be just about the simplest one imaginable: it will simply look for new URLs (web page locations) on each page, collect them, and print them out at the end. More sophisticated web crawlers are used to do things like index the internet for search engines; if you've ever used Google or AltaVista, you've used the output of a web crawler whether you realized it or not. This lab will introduce you to some of the technologies that make web crawling possible.


Terminology


Program to write

Here is the specification of the program you are to write.
  1. The program should accept two parameters on the command line:

    1. a string representing the URL at which to start browsing and
    2. a positive integer representing a maximum search depth (see below).

    If the correct arguments are not supplied, the program should immediately stop and print out a usage message, e.g.

    usage: java Lab6 <URL> <depth>
    

  2. The program should store the URL as a String along with its depth (which is 0 to start). You should create a special class to hold (URL, depth) pairs (see below).

  3. The program should connect on port 80 to the host specified in the URL, using a Socket (see below), and request the specified web page.

  4. The program should parse the returned text (if any) line by line for any substrings which have the format

    <a href="[any URL starting with http://]">,

    storing all the URLs, along with a new depth value, in a Vector of (URL, depth) pairs (see below for more about Vectors). The new depth value should be one more than the depth value of the URL corresponding to the page being parsed. If the new depth value is greater than or equal to the maximum depth, don't store the pair.

  5. The program should then close the socket connection to the host.

  6. The program should then recurse through steps 3 to 6 on each new URL, as long as the depth corresponding to that URL is less than the maximum depth. Note that each time you recurse, the search depth goes up by 1.

  7. Finally, the program should print out all the URLs visited (the URL strings, not the contents of the web page corresponding to the URL) along with their search depths.
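
For concreteness, here is one way the (URL, depth) pair class mentioned in steps 2 and 4 might look. This is only a sketch: the lab does not require these particular names, so treat URLDepthPair and its methods as illustrative.

    // One possible (URL, depth) pair class; the name URLDepthPair is illustrative only.
    public class URLDepthPair {
        private String url;    // the URL, stored as a String
        private int depth;     // the search depth at which this URL was found

        public URLDepthPair(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }

        public String getURL() { return url; }
        public int getDepth() { return depth; }

        public String toString() { return url + " [depth " + depth + "]"; }
    }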

Assumptions


Useful classes and methods

As always, see the Java API for more details. These classes and methods should get you started. Note that most of these methods throw various kinds of exceptions, which you will have to handle. Again, see the Java API to find out what they are.

Socket

To use Sockets you have to include this line in your program:
import java.net.*;

Constructor

Socket(String host, int port) creates a new Socket from a String representing the host and a port number, and makes the connection.
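
For example, assuming you have already pulled the host name out of the URL into a String called host (how you do that is up to you), opening and later closing the connection might look like this:

    try {
        Socket my_socket = new Socket(host, 80);   // connect to the web server on port 80
        // ... send the request and read the reply here ...
        my_socket.close();                         // step 5: close the connection when done
    } catch (UnknownHostException e) {
        System.out.println("unknown host: " + host);
    } catch (IOException e) {
        System.out.println("couldn't talk to " + host);
    }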

Methods

Streams

To use streams you have to include this line in your program:
import java.io.*;

To use Sockets effectively, you will want to convert the InputStream and OutputStream associated with the Socket into something more usable. InputStream and OutputStream instances are very primitive objects; they can only read or write bytes or arrays of bytes (not even chars!). Since you want to read and write characters, you need objects that convert between bytes and chars and that can handle whole lines at a time. Unfortunately, the Java API does this in somewhat different ways for input and output.

Input streams

For input streams you can use the InputStreamReader class as follows:
    InputStreamReader in = new InputStreamReader(my_socket.getInputStream());
and now in is an InputStreamReader which can read characters from the Socket. However, this isn't very friendly yet, because you still have to work with individual chars or arrays of chars. It would be nice to be able to read in whole lines at a time. For this you can use the BufferedReader class. You can create a BufferedReader given an InputStreamReader instance and then call the readLine method of the BufferedReader. This will read in a whole line from the other end of the socket. You should also use the ready method of the BufferedReader to check that the input stream has data available to read (this is not guaranteed when using a socket).
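
Putting these pieces together, reading the page line by line might look like the sketch below (my_socket is assumed to be an already-connected Socket, and the try/catch for IOException is omitted). You can also call ready() before a read to check whether data is currently available on the stream.

    BufferedReader reader =
        new BufferedReader(new InputStreamReader(my_socket.getInputStream()));
    String line = reader.readLine();        // read one whole line from the server
    while (line != null) {                  // readLine returns null at the end of the input
        // ... look for <a href="http://..."> substrings in line ...
        line = reader.readLine();
    }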

Output streams

Output streams are a bit simpler. You can create a PrintWriter instance directly from the socket's OutputStream object and then call its println method to send a line of text to the other end of the socket. You should use this constructor:
PrintWriter(OutputStream out, boolean autoFlush)
with autoFlush set to true. This will flush the output buffer after each println.
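
For example, once the socket is connected you could send a simple HTTP request like this (a sketch: the path "/" and the HTTP/1.0 request format are just one common choice, host is assumed to hold the host name, and the try/catch for IOException is omitted):

    PrintWriter writer = new PrintWriter(my_socket.getOutputStream(), true);
    writer.println("GET / HTTP/1.0");       // ask the server for the page at path "/"
    writer.println("Host: " + host);        // say which host we want the page from
    writer.println();                       // a blank line ends the HTTP request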

String methods

You'll find a number of String methods useful; see the API for documentation. NOTE: Do not use == to compare Strings for equality! It will only return true if the two references point to the very same String object. If you want to compare the contents of two Strings, use the equals method.
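
As a small illustration, indexOf, substring, startsWith, and equals are all genuine methods of java.lang.String that could be combined like this; which methods you actually use, and how, is up to you:

    String line = "<a href=\"http://www.example.com/\">";

    int start = line.indexOf("http://");             // -1 if "http://" does not occur in line
    if (start != -1) {
        int end = line.indexOf("\"", start);          // the quote that closes the URL
        String url = line.substring(start, end);      // "http://www.example.com/"

        // Compare String contents with equals or startsWith, never with ==.
        if (url.startsWith("http://")) {
            System.out.println("found URL: " + url);
        }
    }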

Vector

Vectors are very much like arrays, except that they store any kind of Object (which means any class instance but not primitive types like ints), and they can expand or contract as needed. To use Vectors you have to include this line in your program:
import java.util.*;

You should store the (URL, depth) pairs in a Vector. See the API for lots of useful methods on Vectors. Also note that you have to do a type cast to retrieve an object from a Vector.
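
For example, assuming the URLDepthPair class sketched earlier in this handout:

    Vector pending = new Vector();                   // holds (URL, depth) pairs
    pending.addElement(new URLDepthPair("http://www.example.com/", 0));

    // Retrieving an element requires a cast back to the stored type.
    URLDepthPair pair = (URLDepthPair) pending.elementAt(0);
    System.out.println(pair.getURL() + " at depth " + pair.getDepth());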

Exceptions

When you find something that looks like a URL but doesn't start with "http://", you should throw a MalformedURLException, which is part of the Java API.
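
For instance, the check might look like the following sketch; note that MalformedURLException is a checked exception (it lives in java.net), so the method that throws it must either declare it with a throws clause or be called inside a try/catch:

    if (!url.startsWith("http://")) {
        throw new MalformedURLException("not an http URL: " + url);
    }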


Design advice


Extra Credit