Java Lab 6: Surfing the Web


Goals

In this lab, you get to write a rudimentary web crawler. Your crawler will automatically download web pages from the internet, search for new links within those pages, and repeat. This week's crawler will be just about the simplest one imaginable: it will look for new URLs (web page locations) on each page, collect them, and print them out at the end. More sophisticated web crawlers are used to do things like index the Internet's content or scavenge for email addresses to spam; if you've ever used a search engine, you've queried against the data that a web crawler generated.

Terminology

Program To Write

Here is the specification of the program you are to write.

  1. The program should accept two parameters on the command line:

    1. a string representing the URL at which to start browsing
    2. a positive integer representing a maximum search depth (see below)

    If the correct arguments are not supplied, the program should immediately stop and print out a usage message, e.g.:

      usage: java Crawler <URL> <depth>

    (To learn more about processing command-line arguments in Java, you can read this page. A small argument-checking sketch also appears after this list.)

  2. The program should store the URL as a String along with its depth (which is 0 to start). You should create a special class to represent [URL, depth] pairs.

  3. The program should connect to the host specified in the URL on port 80 using a Socket (see below) and request the specified web page.

  4. The program should parse the returned text, if any, line by line for any substrings which have the format:

      <a href="[any URL starting with http://]">

    Found URLs should be stored, along with a new depth value in a LinkedList of (URL, depth) pairs (see below for more about LinkedLists). The new depth value should be one more than the depth value of the URL corresponding to the page being parsed.

  5. The program should then close the socket connection to the host.

  6. The program should then repeat steps 3 to 5 on each new URL, as long as the depth corresponding to the URL is less than the maximum depth. Note that when a particular URL is retrieved and searched, the search depth goes up by 1. If a URL's depth has reached the maximum depth (or greater), do not retrieve or search that web page.

  7. Finally, the program should print out all the URLs visited along with their search depths.
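
To give a rough idea of step 1, the argument checking in main might look something like the sketch below. The variable names (startURL, maxDepth) are only suggestions, and exactly how you report errors is up to you.

  public static void main(String[] args) {
      // Expect exactly two arguments: the starting URL and the maximum depth.
      if (args.length != 2) {
          System.out.println("usage: java Crawler <URL> <depth>");
          System.exit(1);
      }

      String startURL = args[0];
      int maxDepth = 0;
      try {
          maxDepth = Integer.parseInt(args[1]);
      } catch (NumberFormatException e) {
          System.out.println("usage: java Crawler <URL> <depth>");
          System.exit(1);
      }

      // ... create the crawler, crawl starting from (startURL, depth 0),
      //     and print the results at the end.
  }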

Assumptions

Useful classes and methods

As always, see the Java API for more details. These classes and methods should get you started. Note that most of these methods throw various kinds of exceptions, which you will have to handle. Again, see the Java API to find out what they are.

Socket

To use Sockets you have to include this line in your program:

  import java.net.*;

Constructor

Socket(String host, int port) creates a new Socket from a String representing the host and a port number, and makes the connection.
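
For example, connecting to a web server on the standard HTTP port might look like this (the host name here is only a placeholder):

  Socket my_socket = new Socket("www.example.com", 80);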

Methods

Streams

To use streams you have to include this line in your program:

  import java.io.*;

To use Sockets effectively, you will want to convert the InputStream and OutputStream associated with the Socket to something more usable. InputStream and OutputStream instances are very primitive objects; they can only read or write bytes and arrays of bytes (not even chars). Since you want to read and write characters, you need objects that convert between bytes and chars and work with whole lines. Unfortunately, the Java API does this in somewhat different ways for input and output.

Input Streams

For input streams you can use the InputStreamReader class as follows:

  InputStreamReader in = new InputStreamReader(my_socket.getInputStream());

Now in is an InputStreamReader which can read characters from the Socket. However, this still isn't very friendly, because you have to work with individual chars or arrays of chars. It would be nice to be able to read in whole lines. For this you can use the BufferedReader class. You can create a BufferedReader from an InputStreamReader instance and then call the readLine method of the BufferedReader, which reads in a whole line from the other end of the socket.
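
Putting these two pieces together, reading the server's response line by line might look like this sketch (variable names are only illustrative):

  BufferedReader reader =
      new BufferedReader(new InputStreamReader(my_socket.getInputStream()));

  String line;
  while ((line = reader.readLine()) != null) {
      // Scan this line for links here.  readLine returns null
      // when the other end of the socket has no more data to send.
  }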

Output Streams

Output streams are a bit simpler. You can create a PrintWriter instance directly from the socket's OutputStream object and then call its println method to send a line of text to the other end of the socket. You should use this constructor:

  PrintWriter(OutputStream out, boolean autoFlush)

with autoFlush set to true. This will flush the output buffer after each println.
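
For example, sending a request to the server might look something like the sketch below. The exact request lines your crawler sends are up to you; the HTTP/1.0-style request shown here is only one simple possibility.

  PrintWriter writer = new PrintWriter(my_socket.getOutputStream(), true);
  writer.println("GET / HTTP/1.0");   // ask for the root page of the host
  writer.println();                   // a blank line ends the HTTP request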

String methods

You'll find these String methods useful. See the API for documentation.

NOTE: Do not use == to compare Strings for equality! It will only return true if the two Strings are the same object. If you want to compare the contents of the two Strings, use the equals method.
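
For instance, methods such as indexOf, substring, and startsWith are handy when scanning a line for links. The snippet below is only a sketch of the idea; in your real code the literal strings should be the constants described in the Constants section below.

  int tagStart = line.indexOf("<a href=\"");
  if (tagStart != -1) {
      // Skip past the link text, then take everything up to the closing quote.
      int urlStart = tagStart + "<a href=\"".length();
      int urlEnd = line.indexOf("\"", urlStart);
      if (urlEnd != -1) {
          String url = line.substring(urlStart, urlEnd);
          if (url.startsWith("http://")) {
              // Found a usable URL.
          }
      }
  }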

Lists

Lists are very much like arrays of Objects except that they can expand or contract easily and as needed, and they don't have the bracket-notation for looking up individual items. To use Lists you have to include this line in your program:

  import java.util.*;

You should store the (URL, depth) pairs in a LinkedList, which is a specific implementation of List. Create one like this:

  LinkedList<URLDepthPair> myList = new LinkedList<URLDepthPair>();

and then see the API for lots of useful methods on Lists and the different implementations of Lists. (Specifically, you will notice that different implementations of List provide different features. This is why LinkedList is recommended; a few of its features are particularly well-suited to this assignment.)

The special syntax for creating a LinkedList above makes use of the new Java 1.5 generics support. It means that you don't have to cast objects that you store in or retrieve from the list, which will save you a few headaches.
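
For example, here are a few of the List methods you are likely to find useful (the variable name and the URL are only placeholders):

  LinkedList<URLDepthPair> pending = new LinkedList<URLDepthPair>();

  pending.add(new URLDepthPair("http://www.example.com/", 0));  // add to the end
  boolean anyLeft = !pending.isEmpty();                         // is anything left?
  URLDepthPair next = pending.removeFirst();                    // remove the first pair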

Exceptions

When you find something that looks like a URL but doesn't start with "http://", you should throw a MalformedURLException, which is part of the Java API.
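
For example (just a sketch; in your real code, use a constant for "http://" as described in the Constants section below):

  if (!url.startsWith("http://")) {
      throw new MalformedURLException("URL does not start with http://: " + url);
  }

Note that MalformedURLException is a checked exception, so any method that throws it must declare it in its throws clause, or the caller must catch it.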

Design Advice

Here are some design guidelines for the implementation of your web crawler. (Not doing these things usually results in redos... hint hint!)

URL-Depth Pairs

As mentioned above, you should create a special URLDepthPair class, each instance of which includes a String field representing a URL and an int representing a search depth. The class should also have a toString method which returns the contents of the pair as a string. This makes outputting the results of your web crawl much easier.

Individual URLs need to be broken apart into their components. This URL parsing and manipulation should be part of the URL-depth pair class you create. Good object-oriented design dictates that if a particular class is going to store a certain kind of data, then any manipulation of that data should also be implemented in that class. So, if you write any functions to break a URL apart, or to check if a URL is valid, put them in this class!
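
A skeleton of such a class might look like the sketch below. Aside from toString, the field, method, and constant names are only suggestions:

  public class URLDepthPair {
      // The prefix that every valid URL must start with (see Constants below).
      public static final String URL_PREFIX = "http://";

      private String url;
      private int depth;

      public URLDepthPair(String url, int depth) {
          this.url = url;
          this.depth = depth;
      }

      public String getURL() {
          return url;
      }

      public int getDepth() {
          return depth;
      }

      // Returns the contents of the pair; used when printing the results.
      public String toString() {
          return "[" + url + ", depth " + depth + "]";
      }

      // URL manipulation belongs in this class; this is one example check.
      public static boolean isValidURL(String url) {
          return url.startsWith(URL_PREFIX);
      }
  }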

Crawlers

As mentioned above, you should design a Crawler class which will implement the main functionality of the application. This class should have a getSites method that will return a list of all URL-depth pairs that have been visited. You can call this in your main method once crawling is completed; retrieve the list, then iterate through it and print all of the URLs.
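
For example, once crawling is finished, the printing step in main might look roughly like this (the crawl method name is hypothetical; getSites is the method described above):

  Crawler crawler = new Crawler();
  crawler.crawl(startURL, maxDepth);   // hypothetical method that does the actual work

  for (URLDepthPair pair : crawler.getSites()) {
      System.out.println(pair);        // uses URLDepthPair's toString method
  }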

The easiest way to keep track of the sites visited is to have two lists, one for all the sites seen so far, and one that only includes sites that have not yet been processed. You should iterate through all the sites that haven't been processed, removing each site before you download its contents, and every time you find a new URL you should put it into the not-processed list. When the not-processed list is empty then you're all done - you've found all the sites.
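
Putting that together, the heart of the crawler could be structured roughly as in the sketch below (the field and method names are only one possibility):

  LinkedList<URLDepthPair> pending = new LinkedList<URLDepthPair>();
  LinkedList<URLDepthPair> seen = new LinkedList<URLDepthPair>();

  pending.add(new URLDepthPair(startURL, 0));

  while (!pending.isEmpty()) {
      URLDepthPair pair = pending.removeFirst();   // remove before processing
      seen.add(pair);

      // Don't retrieve pages whose depth has reached the maximum.
      if (pair.getDepth() >= maxDepth)
          continue;

      // Open a socket to the pair's host, request the page, scan each line
      // for new URLs, add each one to pending with depth (pair.getDepth() + 1),
      // and close the socket.  (See the sections above for the details.)
  }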

Although you might think to yourself, "Opening a socket on a URL is an operation involving the URL, and should therefore be implemented on the URL-depth pair class," that would be a little too specialized for the purposes of the pair class. It is really just a place to store URLs and depth values, with a few extra utilities provided. The crawler is the class that is navigating web pages and searching for URLs, so the crawler class should contain the code that actually opens and closes sockets.

You will need to create a new Socket instance for each URL that you are downloading text from. Make sure to close the socket when you are finished scanning that webpage, so that the operating system doesn't run out of networking resources! (There are only so many sockets that a computer can keep open at once.) Also, don't use recursion to search more deeply nested web pages; implement this functionality as a loop. This will also make your web crawler much more efficient as far as resource usage is concerned.

Constants!

Your program will undoubtedly have strings like "http://" and "a href=\"" in it, and you will probably be tempted to just repeat these strings wherever you need them. You will also need the lengths of these strings for various string operations, so you may be tempted to hard-code those lengths into your code too. Don't do this!!! It makes for very unmaintainable code! If you make a typo, or later change how you search, you will have to update many different lines of code.

Instead, create String constants in your classes. For example, you might have this:

    public static final String URL_PREFIX = "http://";

Now, if you need this string, instead of directly hard-coding it, just use the URL_PREFIX constant. If you need its length, you are in luck - URL_PREFIX is a String object, so you can call URL_PREFIX.length() and get its length back.

You should also think about where you put these constants. You should only have each constant appear once in your whole project, and you should put the constant where it makes the most sense. For example, since the URL prefix is necessary to tell if a URL is valid, you should put that constant in your URL-depth pair class. If you have another constant for HTML links, put that into your crawler class. If the crawler ever needs the URL prefix, it can simply refer to the URL-depth pair constant, instead of duplicating that constant itself.
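
For instance, the crawler class might keep its own constant for the link text while reusing the pair class's URL prefix (the names here are only illustrative):

  // In the crawler class:
  public static final String LINK_PREFIX = "<a href=\"";

  // Elsewhere in the crawler, reuse the constant from the pair class
  // instead of duplicating the "http://" literal:
  boolean valid = url.startsWith(URLDepthPair.URL_PREFIX);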

Extra Credit


[end lab 6] Copyright (C) 2003-2008, California Institute of Technology.
Last updated February 19, 2008