"http://", followed by the location of the web server (e.g.
"www.cs.caltech.edu"; this is often called the "host"), followed by the path of the web page on the server (e.g.
"/courses/cs11"). If the last name in the path doesn't end in ".html" then the actual web page is assumed by most browsers to be
NOTE: There are other kinds of valid URLs as well, starting with (for
"ftp://". We won't bother with
these for this assignment.
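The host/path split described above can be done with plain String operations. The following is only a sketch: the class name UrlParts and its method names are made up for illustration, it assumes the URL really does start with "http://", and it treats a URL with no path as having the path "/".

```java
class UrlParts {
    static final String PREFIX = "http://";

    // Returns the host part of the URL, e.g. "www.cs.caltech.edu".
    // Assumes the URL starts with "http://".
    static String host(String url) {
        String rest = url.substring(PREFIX.length());
        int slash = rest.indexOf('/');
        return (slash == -1) ? rest : rest.substring(0, slash);
    }

    // Returns the path part of the URL, e.g. "/courses/cs11".
    // A URL with no path at all is given the path "/" (an assumption of this sketch).
    static String path(String url) {
        String rest = url.substring(PREFIX.length());
        int slash = rest.indexOf('/');
        return (slash == -1) ? "/" : rest.substring(slash);
    }

    public static void main(String[] args) {
        String url = "http://www.cs.caltech.edu/courses/cs11";
        System.out.println(host(url));  // www.cs.caltech.edu
        System.out.println(path(url));  // /courses/cs11
    }
}
```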
<a href="http://path-to-host/path/to/file.html">some text</a>
Note the URL embedded in the href attribute. The hyperlink will be displayed with the text "some text" highlighted in a different color; clicking on it will cause the browser to display a new page corresponding to the new URL. Note that HTML tags are not case-sensitive.
GET /cs/courses/cs11 HTTP/1.1
Host: www.caltech.edu
Connection: close

Note that the HTTP request MUST end with a blank line, or the request will be ignored.
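One way to produce a request like the one above is to print it line by line, finishing with an empty println for the blank line. This is a sketch only: RequestSketch and sendRequest are hypothetical names, and the output goes to a StringWriter here so you can see exactly what would be sent. Note that println uses the platform's line separator; the HTTP specification calls for CR LF line endings, so a stricter program would write "\r\n" explicitly.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class RequestSketch {
    // Writes a minimal HTTP/1.1 GET request for the given host and path.
    static void sendRequest(PrintWriter out, String host, String path) {
        out.println("GET " + path + " HTTP/1.1");
        out.println("Host: " + host);
        out.println("Connection: close");
        out.println();   // the blank line that ends the request
        out.flush();
    }

    public static void main(String[] args) {
        // A StringWriter stands in for the socket so the request can be inspected.
        StringWriter buffer = new StringWriter();
        sendRequest(new PrintWriter(buffer), "www.caltech.edu", "/cs/courses/cs11");
        System.out.print(buffer);
    }
}
```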
If the correct arguments are not supplied, the program should immediately stop and print out a usage message, e.g.:
usage: java Lab6 <URL> <depth>
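The argument check might look something like the sketch below. The class name ArgsSketch and the helper parseDepth are hypothetical, and treating a negative or non-numeric depth as "incorrect arguments" is an assumption of this sketch rather than a stated requirement.

```java
class ArgsSketch {
    // Returns the requested depth, or -1 if the arguments are unusable.
    // Rejecting negative depths is an assumption made for this sketch.
    static int parseDepth(String[] args) {
        if (args.length != 2) {
            return -1;
        }
        try {
            int depth = Integer.parseInt(args[1]);
            return (depth >= 0) ? depth : -1;
        } catch (NumberFormatException e) {
            return -1;   // second argument was not an integer
        }
    }

    public static void main(String[] args) {
        int depth = parseDepth(args);
        if (depth < 0) {
            System.out.println("usage: java Lab6 <URL> <depth>");
            System.exit(1);
        }
        System.out.println("Crawling " + args[0] + " to depth " + depth);
    }
}
```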
String along with its depth (which is 0 to start). You should create a special class to hold (URL, depth) pairs (see below).
Socket (see below) and request the specified web page,
<a href="[any URL starting with
storing all the URLs, along with a new depth value, in a Vector
of (URL, depth) pairs (see below for more about Vectors). The
new depth value should be one more than the depth value of the URL
corresponding to the page being parsed. If the depth value is greater than
or equal to the maximum depth value, don't store the pair.
http://; common examples include
ftp://. If you find these you should discard them.
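The link search and the depth rule can be sketched with indexOf and substring. Everything named here (LinkSketch, extractLinks, shouldStore) is hypothetical. The sketch also makes simplifying assumptions: it only matches the exact lowercase text <a href=" with double quotes, and it expects each link to sit entirely on one line.

```java
import java.util.Vector;

class LinkSketch {
    static final String LINK_START = "<a href=\"";
    static final String URL_PREFIX = "http://";

    // Collects every URL starting with "http://" found in a link on one line of HTML.
    // Links using other schemes (ftp://, mailto:, ...) are discarded.
    static Vector<String> extractLinks(String line) {
        Vector<String> links = new Vector<String>();
        int pos = 0;
        while (true) {
            int start = line.indexOf(LINK_START, pos);
            if (start == -1) {
                break;                       // no more links on this line
            }
            start += LINK_START.length();
            int end = line.indexOf('"', start);
            if (end == -1) {
                break;                       // closing quote missing; give up on this line
            }
            String url = line.substring(start, end);
            if (url.startsWith(URL_PREFIX)) {
                links.add(url);
            }
            pos = end + 1;
        }
        return links;
    }

    // Depth rule: a link found on a page at parentDepth gets depth parentDepth + 1,
    // and is stored only if that new depth is below the maximum.
    static boolean shouldStore(int parentDepth, int maxDepth) {
        return parentDepth + 1 < maxDepth;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://www.caltech.edu/\">home</a> "
                    + "<a href=\"ftp://ftp.caltech.edu/\">files</a>";
        System.out.println(extractLinks(line));
    }
}
```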
BufferedReader no longer indicates it is
ready() that the site has finished sending you the page. (This is not always true in real life.)
java.net. This class can parse a URL and download the entire page of a URL all by itself. Since we want you to get practice using sockets and string operations, that would defeat our purposes.
To use Sockets you have to include this line in your program:
import java.net.*;
Socket(String host, int port) creates a new Socket from a String representing the host and a port number, and makes the connection.
void setSoTimeout(int timeout) sets the timeout of the Socket in milliseconds. You should call this after creating the Socket so it knows how long to wait for data transfers from the other side. Otherwise it will wait forever, which is not a good design, since sometimes sockets can stop sending data for any number of reasons.
InputStream getInputStream() returns an InputStream associated with the Socket. This allows the Socket to receive data from the other side of the connection.
OutputStream getOutputStream() returns an OutputStream associated with the Socket. This allows the Socket to send data to the other side of the connection.
void close() closes the Socket.
import java.io.*;
To use Sockets effectively, you will want to convert the InputStream and OutputStream associated with the Socket to something more usable. InputStream and OutputStream instances are very primitive objects; they can only read or write bytes or arrays of bytes (not even chars!). Since you want to read and write characters, you have to have objects that convert between bytes and chars and print whole lines. Unfortunately, the Java API does this in somewhat different ways for input and output.
For input, you can use the InputStreamReader class as follows:
InputStreamReader in = new InputStreamReader(my_socket.getInputStream());
and now in is an InputStreamReader which can read characters from the Socket. However, this still isn't very friendly because you still have to work with individual chars or arrays of chars. It would be nice to be able to read in whole lines at a time. For this you can use the BufferedReader class. You can create a BufferedReader from an InputStreamReader instance and then call the readLine method of the BufferedReader. This will read in a whole line from the other end of the socket. You should also use the ready method of the BufferedReader to check that the input stream is ready to be read (this is not guaranteed when using a socket).
For output, you can create a PrintWriter instance directly from the socket's OutputStream object and then call its println method to send a line of text to the other end of the socket. You should use this constructor:
PrintWriter(OutputStream out, boolean autoFlush)
with autoFlush set to true. This will flush the output buffer after each println call.
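Putting the Socket and stream pieces together might look roughly like the sketch below. The names FetchSketch, exchange and fetch are made up, as are the port constant (80, the usual HTTP port) and the 3000 ms timeout. One deliberate difference from the text: instead of polling ready(), this sketch reads until readLine() returns null, which happens when the server closes the connection after a "Connection: close" request. The stream handling is split into its own method so it can be tried without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.Vector;

class FetchSketch {
    static final int HTTP_PORT = 80;       // assumed: the standard HTTP port
    static final int TIMEOUT_MS = 3000;    // assumed: an arbitrary 3 second timeout

    // Sends a GET request on rawOut and returns every line received on rawIn.
    static Vector<String> exchange(InputStream rawIn, OutputStream rawOut,
                                   String host, String path) throws IOException {
        PrintWriter out = new PrintWriter(rawOut, true);   // autoFlush on println
        out.println("GET " + path + " HTTP/1.1");
        out.println("Host: " + host);
        out.println("Connection: close");
        out.println();                                     // blank line ends the request

        BufferedReader in = new BufferedReader(new InputStreamReader(rawIn));
        Vector<String> lines = new Vector<String>();
        String line;
        while ((line = in.readLine()) != null) {           // null means end of stream
            lines.add(line);
        }
        return lines;
    }

    // Opens a Socket to the host, fetches the page, and always closes the Socket.
    static Vector<String> fetch(String host, String path) throws IOException {
        Socket socket = new Socket(host, HTTP_PORT);
        try {
            socket.setSoTimeout(TIMEOUT_MS);
            return exchange(socket.getInputStream(), socket.getOutputStream(), host, path);
        } finally {
            socket.close();
        }
    }
}
```

The returned lines include the HTTP response headers as well as the page itself, so a crawler would still need to scan them for links.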
You may find the following String methods useful. See the API for documentation.
boolean equals(Object anObject)
String substring(int beginIndex)
String substring(int beginIndex, int endIndex)
Don't use == to compare Strings for equality! It will only return true if the two Strings are the same object. If you want to compare the contents of the two Strings, use the equals method.
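A small illustration of the difference between == and equals, along with the two substring methods. The class and helper names are hypothetical; new String(a) is used only to force a second object with the same characters.

```java
class EqualsSketch {
    // True only when both variables refer to the very same String object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // True when the two Strings contain the same characters.
    static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String a = "http://www.caltech.edu";
        String b = new String(a);   // same characters, but a distinct object

        System.out.println(sameObject(a, b));     // false
        System.out.println(sameContents(a, b));   // true

        System.out.println(a.substring(7));       // www.caltech.edu
        System.out.println(a.substring(0, 7));    // http://
    }
}
```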
Vectors are very much like arrays, except that they store any kind of Object (which means any class instance but not primitive types like ints), and they can expand or contract as needed. To use Vectors you have to include this line in your program:
import java.util.*;
You should store the (URL, depth) pairs in a Vector. See the API for lots of useful methods on Vectors. Also note that you have to do a type cast to retrieve an object from a Vector.
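Here is a short sketch of adding to a Vector and casting on the way out. The names VectorSketch and takeFirst are hypothetical. The Vector is declared as Vector<Object> to mirror the "stores any Object" behavior described above; with Java 5 or later you could declare a Vector<String> instead and drop the cast.

```java
import java.util.Vector;

class VectorSketch {
    // Removes and returns the first element, cast back to a String.
    // The cast fails with a ClassCastException if the element is not a String.
    static String takeFirst(Vector<Object> items) {
        Object first = items.remove(0);
        return (String) first;
    }

    public static void main(String[] args) {
        Vector<Object> pending = new Vector<Object>();
        pending.add("http://www.caltech.edu/");
        pending.add("http://www.cs.caltech.edu/courses/cs11");

        String url = takeFirst(pending);
        System.out.println(url);              // http://www.caltech.edu/
        System.out.println(pending.size());   // 1
    }
}
```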
"http://", you should throw a
MalformedURLException, which is part of the java API.
Lab6 class which creates a Crawler instance, gives it a starting URL and depth, starts it, and prints the results when it's done.
URLDepthPair class, each instance of which includes String fields representing a URL and an int representing a search depth. You should also have a toString method which will print out the contents of the pair. It's useful to also have methods for extracting the host and path components of a URL.
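A bare-bones version of such a pair class could look like the sketch below. The accessor names, the output format of toString, and the choice to do the "http://" check in the constructor (throwing MalformedURLException) are all assumptions made for illustration. The host and path extraction methods are left out to keep the sketch short.

```java
import java.net.MalformedURLException;

class URLDepthPair {
    static final String URL_PREFIX = "http://";

    private final String url;
    private final int depth;

    // Rejects any URL that does not start with "http://".
    URLDepthPair(String url, int depth) throws MalformedURLException {
        if (!url.startsWith(URL_PREFIX)) {
            throw new MalformedURLException("URL must start with " + URL_PREFIX + ": " + url);
        }
        this.url = url;
        this.depth = depth;
    }

    String getURL() {
        return url;
    }

    int getDepth() {
        return depth;
    }

    // The "(url, depth)" format is just one possible choice.
    public String toString() {
        return "(" + url + ", " + depth + ")";
    }
}
```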
Crawler class which will implement the main functionality of the application. This class should have a getSites method that will return a Vector of site pairs (to be used by the
Vectors, one for all the sites seen so far, and one that only includes sites that have not been processed. You should iterate through all the sites that haven't been processed, removing each site before you download its contents, and every time you find a new URL you should put it into the not-processed vector. When the not-processed vector is empty you've found all the sites.
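The two-Vector loop described above might be sketched as follows. To keep the example runnable without a network, the downloading step is hidden behind a hypothetical LinkSource interface; Pair, crawl and alreadySeen are likewise made-up names. Two choices here are assumptions of the sketch: a URL is only added if it has not been seen before, and a page is not downloaded at all when its links would be too deep to store.

```java
import java.util.Vector;

class CrawlSketch {
    // A minimal (URL, depth) holder for this sketch.
    static class Pair {
        final String url;
        final int depth;
        Pair(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Hypothetical stand-in for "download the page and find its links".
    interface LinkSource {
        Vector<String> linksOn(String url);
    }

    static boolean alreadySeen(Vector<Pair> seen, String url) {
        for (Pair p : seen) {
            if (p.url.equals(url)) {
                return true;
            }
        }
        return false;
    }

    static Vector<Pair> crawl(String startUrl, int maxDepth, LinkSource source) {
        Vector<Pair> seen = new Vector<Pair>();      // every site found so far
        Vector<Pair> pending = new Vector<Pair>();   // sites not yet processed
        Pair start = new Pair(startUrl, 0);
        seen.add(start);
        pending.add(start);

        while (!pending.isEmpty()) {
            Pair current = pending.remove(0);        // remove before downloading
            int newDepth = current.depth + 1;
            if (newDepth >= maxDepth) {
                continue;                            // its links would be too deep to store
            }
            for (String link : source.linksOn(current.url)) {
                if (!alreadySeen(seen, link)) {
                    Pair found = new Pair(link, newDepth);
                    seen.add(found);
                    pending.add(found);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Fake web: every page links to the same two URLs.
        LinkSource fake = url -> {
            Vector<String> links = new Vector<String>();
            links.add("http://www.caltech.edu/a");
            links.add("http://www.caltech.edu/b");
            return links;
        };
        for (Pair p : crawl("http://www.caltech.edu/", 2, fake)) {
            System.out.println(p.url + " at depth " + p.depth);
        }
    }
}
```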
Socket instance for each URL that you are downloading text from. Don't forget to close the socket when you're finished using it.
Vector of links when finished. Send new URLs to this pool as soon as each becomes available.
NOTE: For humor-impaired people, the last example is not to be taken seriously. But by all means, give it a shot ;-)