Lab 6: Surfing the Web


Goals

In this lab, we are going to write a rudimentary web crawler. A web crawler is a program that automatically downloads web pages from the internet, searches them for some information, and uses that information to look for new pages. The web crawler we'll write will be just about the simplest one imaginable: it will simply look for new URLs (web page locations) on each page, collect them, and print them out at the end. More sophisticated web crawlers are used to do things like index the internet for search engines; if you've ever used Google or AltaVista, you've used the output of a web crawler whether you realized it or not. This lab will introduce you to some of the technologies that make web crawling possible.


Terminology


Program to write

Here is the specification of the program you are to write.
  1. The program should accept two parameters on the command line:

    1. a string representing the URL at which to start browsing and
    2. a positive integer representing a maximum search depth (see below).

    If the correct arguments are not supplied, the program should immediately stop and print out a usage message, e.g.

    usage: java Lab6 <URL> <depth>
    

  2. The program should store the URL as a String along with its depth (which is 0 to start). You should create a special class to hold (URL, depth) pairs (see below).

  3. The program should connect on port 80 to the host specified in the URL, using a Socket (see below), and request the specified web page.

  4. The program should parse the returned text (if any) line by line for any substrings which have the format

    <a href="[any URL starting with http://]">,

    storing all the URLs, along with a new depth value, in a Vector of (URL, depth) pairs (see below for more about Vectors). The new depth value should be one more than the depth value of the URL corresponding to the page being parsed. If the new depth value is greater than or equal to the maximum depth, don't store the pair.

  5. The program should then close the socket connection to the host.

  6. The program should then recurse through steps 3 to 6 on each new URL, as long as the depth corresponding to that URL is less than the maximum depth. Note that each time you recurse, the search depth goes up by 1.

  7. Finally, the program should print out all the URLs visited (the URL strings, not the contents of the web page corresponding to the URL) along with their search depths.
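
For concreteness, here is one way the (URL, depth) pair class mentioned in steps 2 and 4 might look. This is only a sketch: the lab does not require these particular names, so treat URLDepthPair and its methods as illustrative.

    // One possible (URL, depth) pair class; the name URLDepthPair is illustrative only.
    public class URLDepthPair {
        private String url;    // the URL, stored as a String
        private int depth;     // the search depth at which this URL was found

        public URLDepthPair(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }

        public String getURL() { return url; }
        public int getDepth() { return depth; }

        public String toString() { return url + " [depth " + depth + "]"; }
    }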

Assumptions


Useful classes and methods

As always, see the Java API for more details. These classes and methods should get you started. Note that most of these methods throw various kinds of exceptions, which you will have to handle. Again, see the Java API to find out what they are.

Socket

To use Sockets you have to include this line in your program:
import java.net.*;

Constructor

Socket(String host, int port) creates a new Socket from a String representing the host and a port number, and makes the connection.
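
For example, assuming you have already pulled the host name out of the URL into a String called host (how you do that is up to you), opening and later closing the connection might look like this:

    try {
        Socket my_socket = new Socket(host, 80);   // connect to the web server on port 80
        // ... send the request and read the reply here ...
        my_socket.close();                         // step 5: close the connection when done
    } catch (UnknownHostException e) {
        System.out.println("unknown host: " + host);
    } catch (IOException e) {
        System.out.println("couldn't talk to " + host);
    }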

Methods

Streams

To use streams you have to include this line in your program:
import java.io.*;

To use Sockets effectively, you will want to convert the InputStream and OutputStream associated with the Socket into something more usable. InputStream and OutputStream instances are very primitive objects; they can only read or write bytes or arrays of bytes (not even chars!). Since you want to read and write characters, you need objects that convert between bytes and chars and that can handle whole lines at a time. Unfortunately, the Java API does this in somewhat different ways for input and output.

Input streams

For input streams you can use the InputStreamReader class as follows:
    InputStreamReader in = new InputStreamReader(my_socket.getInputStream());
and now in is an InputStreamReader which can read characters from the Socket. However, this isn't very friendly yet, because you still have to work with individual chars or arrays of chars. It would be nice to be able to read in whole lines at a time. For this you can use the BufferedReader class. You can create a BufferedReader given an InputStreamReader instance and then call the readLine method of the BufferedReader. This will read in a whole line from the other end of the socket. You should also use the ready method of the BufferedReader to check that the input stream has data available to read (this is not guaranteed when using a socket).
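
Putting these pieces together, reading the page line by line might look like the sketch below (my_socket is assumed to be an already-connected Socket, and the try/catch for IOException is omitted). You can also call ready() before a read to check whether data is currently available on the stream.

    BufferedReader reader =
        new BufferedReader(new InputStreamReader(my_socket.getInputStream()));
    String line = reader.readLine();        // read one whole line from the server
    while (line != null) {                  // readLine returns null at the end of the input
        // ... look for <a href="http://..."> substrings in line ...
        line = reader.readLine();
    }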

Output streams

Output streams are a bit simpler. You can create a PrintWriter instance directly from the socket's OutputStream object and then call its println method to send a line of text to the other end of the socket. You should use this constructor:
PrintWriter(OutputStream out, boolean autoFlush)
with autoFlush set to true. This will flush the output buffer after each println.
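
For example, once the socket is connected you could send a simple HTTP request like this (a sketch: the path "/" and the HTTP/1.0 request format are just one common choice, host is assumed to hold the host name, and the try/catch for IOException is omitted):

    PrintWriter writer = new PrintWriter(my_socket.getOutputStream(), true);
    writer.println("GET / HTTP/1.0");       // ask the server for the page at path "/"
    writer.println("Host: " + host);        // say which host we want the page from
    writer.println();                       // a blank line ends the HTTP request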

String methods

You'll find a number of String methods useful; see the API for documentation. NOTE: Do not use == to compare Strings for equality! It will only return true if the two references point to the very same String object. If you want to compare the contents of two Strings, use the equals method.
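
As a small illustration, indexOf, substring, startsWith, and equals are all genuine methods of java.lang.String that could be combined like this; which methods you actually use, and how, is up to you:

    String line = "<a href=\"http://www.example.com/\">";

    int start = line.indexOf("http://");             // -1 if "http://" does not occur in line
    if (start != -1) {
        int end = line.indexOf("\"", start);          // the quote that closes the URL
        String url = line.substring(start, end);      // "http://www.example.com/"

        // Compare String contents with equals or startsWith, never with ==.
        if (url.startsWith("http://")) {
            System.out.println("found URL: " + url);
        }
    }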

Vector

Vectors are very much like arrays, except that they store any kind of Object (which means any class instance but not primitive types like ints), and they can expand or contract as needed. To use Vectors you have to include this line in your program:
import java.util.*;

You should store the (URL, depth) pairs in a Vector. See the API for lots of useful methods on Vectors. Also note that you have to do a type cast to retrieve an object from a Vector.
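
For example, assuming the URLDepthPair class sketched earlier in this handout:

    Vector pending = new Vector();                   // holds (URL, depth) pairs
    pending.addElement(new URLDepthPair("http://www.example.com/", 0));

    // Retrieving an element requires a cast back to the stored type.
    URLDepthPair pair = (URLDepthPair) pending.elementAt(0);
    System.out.println(pair.getURL() + " at depth " + pair.getDepth());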

Exceptions

When you find something that looks like a URL but doesn't start with "http://", you should throw a MalformedURLException, which is part of the Java API.
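
For instance, the check might look like the following sketch; note that MalformedURLException is a checked exception (it lives in java.net), so the method that throws it must either declare it with a throws clause or be called inside a try/catch:

    if (!url.startsWith("http://")) {
        throw new MalformedURLException("not an http URL: " + url);
    }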


Design advice


Extra Credit