"http://", followed by the location of the web server (e.g. "www.cs.caltech.edu"; this is often called the "host"), followed by the path of the web page on the server (e.g. "/courses/cs11"). If the last name in the path doesn't end in ".html", most browsers assume the actual web page is "index.html".
NOTE: There are other kinds of valid URLs as well, starting with (for instance) "mailto:" or "ftp://". We won't bother with these for this assignment.
    <a href="http://path-to-host/path/to/file.html">some text</a>

Note the URL embedded in the href attribute. The hyperlink will be displayed with the text "some text" highlighted in a different color; clicking on it will cause the browser to display the page at that URL. Note that HTML tags are not case-sensitive.
    GET /cs/courses/cs11 HTTP/1.1
    Host: www.caltech.edu
    Connection: close

Note that the HTTP request MUST end with a blank line, or the request will be ignored.
If the correct arguments are not supplied, the program should immediately stop and print out a usage message, e.g.

    usage: java Lab6 <URL> <depth>
String, along with its depth (which is 0 to start). You should create a special class to hold (URL, depth) pairs (see below).
Socket (see below) and request the specified web page, <a href="[any URL starting with http://]">, storing all the URLs, along with a new depth value, in a Vector of (URL, depth) pairs (see below for more about Vectors). The new depth value should be one more than the depth value of the URL corresponding to the page being parsed. If the depth value is greater than or equal to the maximum depth value, don't store the pair.
"http://"; common examples include "mailto:" and "ftp://". If you find these you should discard them.
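As a sketch of this link-scanning step, here is one way to pull URLs out of a line of HTML. It assumes links always appear as a lowercase a href=" with the URL in double quotes on a single line — a simplification of real HTML, but in the spirit of the assignment. The class and constant names are our own, not required by the lab:

```java
import java.util.*;

public class LinkExtractor {
    // Assumed pattern for a link; real HTML also allows "A HREF", extra
    // whitespace, single quotes, etc.
    public static final String URL_INDICATOR = "a href=\"";
    public static final String URL_PREFIX = "http://";

    // Scan one line of HTML and return every quoted href value that
    // starts with "http://"; other schemes (mailto:, ftp://) are skipped.
    public static Vector extractURLs(String line) {
        Vector urls = new Vector();
        int start = 0;
        while (true) {
            int idx = line.indexOf(URL_INDICATOR, start);
            if (idx == -1)
                break;                      // no more links on this line
            int urlStart = idx + URL_INDICATOR.length();
            int urlEnd = line.indexOf("\"", urlStart);
            if (urlEnd == -1)
                break;                      // malformed link; give up on line
            String url = line.substring(urlStart, urlEnd);
            if (url.startsWith(URL_PREFIX)) // discard mailto:, ftp://, etc.
                urls.addElement(url);
            start = urlEnd + 1;             // continue past this link
        }
        return urls;
    }
}
```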
BufferedReader no longer indicates it is ready() that the site has finished sending you the page. (This is not always true in real life.)
URL class in java.net. This class can parse a URL and download the entire page all by itself. Since we want you to get practice using sockets and string operations, that would defeat our purposes.
Socket

To use Sockets you have to include this line in your program:

    import java.net.*;

Socket(String host, int port)
    Creates a new Socket from a String representing the host and a port number, and makes the connection.

void setSoTimeout(int timeout)
    Sets the timeout of the Socket in milliseconds. You should call this after creating the Socket so it knows how long to wait for data transfers from the other side. Otherwise it will wait forever, which is not a good design, since sockets can stop sending data for any number of reasons.

InputStream getInputStream()
    Returns an InputStream associated with the Socket. This allows the Socket to receive data from the other side of the connection.

OutputStream getOutputStream()
    Returns an OutputStream associated with the Socket. This allows the Socket to send data to the other side of the connection.

void close()
    Closes the Socket.
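These calls fit together roughly as follows. This is only a sketch: port 80 and the 3-second timeout are our assumptions, and the hostname comes from the command line rather than being hard-coded:

```java
import java.io.*;
import java.net.*;

public class SocketSketch {
    public static final int HTTP_PORT = 80;     // standard web server port
    public static final int TIMEOUT_MS = 3000;  // assumption: 3 s is plenty

    // Open a connection to a web host and set its timeout.
    public static Socket openConnection(String host) throws IOException {
        Socket sock = new Socket(host, HTTP_PORT);  // connects immediately
        sock.setSoTimeout(TIMEOUT_MS);  // don't wait forever on a silent server
        return sock;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java SocketSketch <host>");
            return;
        }
        Socket sock = openConnection(args[0]);
        InputStream in = sock.getInputStream();    // bytes from the server
        OutputStream out = sock.getOutputStream(); // bytes to the server
        // ... converse with the server here ...
        sock.close();  // always release the connection when finished
    }
}
```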
    import java.io.*;

To use Sockets effectively, you will want to convert the InputStream and OutputStream associated with the Socket to something more usable. InputStream and OutputStream instances are very primitive objects; they can only read or write bytes or arrays of bytes (not even chars!). Since you want to read and write characters, you have to have objects that convert between bytes and chars and print whole lines. Unfortunately, the Java API does this in somewhat different ways for input and output.
InputStreamReader class as follows:

    InputStreamReader in = new InputStreamReader(my_socket.getInputStream());

and now in is an InputStreamReader which can read characters from the Socket. However, this still isn't very friendly, because you still have to work with individual chars or arrays of chars. It would be nice to be able to read in whole lines at a time. For this you can use the BufferedReader class. You can create a BufferedReader given an InputStreamReader instance and then call the readLine method of the BufferedReader. This will read in a whole line from the other end of the socket. You should also use the ready method of the BufferedReader to check that the input stream is capable of reading data (this is not guaranteed when using a socket).
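A sketch of the resulting read loop, using the ready()-means-more-data convention described above. The method works on any BufferedReader, not just one wrapped around a socket, so it can be tried out without a network connection; with a socket you would construct the reader as new BufferedReader(new InputStreamReader(my_socket.getInputStream())):

```java
import java.io.*;

public class PageReader {
    // Read lines from the reader until ready() reports no more data
    // (or the stream actually ends), collecting them into one String.
    public static String readAvailable(BufferedReader in) throws IOException {
        StringBuffer page = new StringBuffer();
        while (in.ready()) {
            String line = in.readLine();  // one whole line, newline stripped
            if (line == null)
                break;                    // end of stream
            page.append(line).append("\n");
        }
        return page.toString();
    }
}
```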
PrintWriter instance directly from the socket's OutputStream object and then call its println method to send a line of text to the other end of the socket. You should use this constructor:

    PrintWriter(OutputStream out, boolean autoFlush)

with autoFlush set to true. This will flush the output buffer after each println.
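Putting the request format from earlier together with PrintWriter, a helper along these lines (the class and method names are our own) writes the whole request. Note the mandatory blank line at the end:

```java
import java.io.*;

public class RequestWriter {
    // Write an HTTP/1.1 GET request for the given host and path.
    // The blank line at the end is mandatory: without it the server
    // keeps waiting for more headers and the request is ignored.
    public static void sendRequest(PrintWriter out, String host, String path) {
        out.println("GET " + path + " HTTP/1.1");
        out.println("Host: " + host);
        out.println("Connection: close");
        out.println();  // the required blank line
    }
}
```

With a real socket you would construct the writer as new PrintWriter(my_socket.getOutputStream(), true), so each println is flushed immediately.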
String methods useful. See the API for documentation.

    boolean equals(Object anObject)
    String substring(int beginIndex)
    String substring(int beginIndex, int endIndex)
== to compare Strings for equality! It will only return true if the two Strings are the same object, not merely the same contents. If you want to compare the contents of two Strings, use the equals method.
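A two-line demonstration of the pitfall (new String is used to force a distinct object with identical contents):

```java
public class StringCompare {
    public static void main(String[] args) {
        String a = "http://";
        String b = new String("http://");  // same contents, distinct object

        System.out.println(a.equals(b));   // true: compares contents
        System.out.println(a == b);        // false: compares object identity
    }
}
```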
Vector

Vectors are very much like arrays, except that they store any kind of Object (which means any class instance, but not primitive types like ints), and they can expand or contract as needed. To use Vectors you have to include this line in your program:

    import java.util.*;

You should store the (URL, depth) pairs in a Vector. See the API for lots of useful methods on Vectors. Also note that you have to do a type cast to retrieve an object from a Vector.
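For example, storing and retrieving a String (note the cast on the way out):

```java
import java.util.*;

public class VectorDemo {
    public static void main(String[] args) {
        Vector urls = new Vector();                  // holds Objects of any class
        urls.addElement("http://www.caltech.edu/");  // a String is an Object

        // elementAt returns an Object, so a cast back to String is required:
        String first = (String) urls.elementAt(0);
        System.out.println(first + " (size " + urls.size() + ")");
    }
}
```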
"http://", you should throw a MalformedURLException, which is part of the Java API.
Lab6 class which creates a Crawler instance, gives it a starting URL and depth, starts it, and prints the results when it's done.
URLDepthPair class, each instance of which includes a String field representing a URL and an int field representing a search depth. You should also have a toString method which will print out the contents of the pair. It's also useful to have methods for extracting the host and path components of a URL.
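One possible shape for this class. The accessor names and the splitting rule (host is everything between "http://" and the first following "/"; path is everything from that "/" on, or "/" if there is none) are our choices, not requirements:

```java
public class URLDepthPair {
    public static final String URL_PREFIX = "http://";

    private String url;
    private int depth;

    public URLDepthPair(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    public String getURL() { return url; }
    public int getDepth()  { return depth; }

    public String toString() {
        return "[" + url + ", depth " + depth + "]";
    }

    // Host component: between "http://" and the next "/".
    public String getWebHost() {
        int start = URL_PREFIX.length();
        int slash = url.indexOf("/", start);
        return (slash == -1) ? url.substring(start) : url.substring(start, slash);
    }

    // Path component: from that "/" onward; "/" if there is none.
    public String getDocPath() {
        int slash = url.indexOf("/", URL_PREFIX.length());
        return (slash == -1) ? "/" : url.substring(slash);
    }
}
```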
Crawler class which will implement the main functionality of the application. This class should have a getSites method that will return a Vector of site pairs (to be used by the Lab6 class).
Vectors, one for all the sites seen so far, and one that only includes sites that have not been processed. You should iterate through all the sites that haven't been processed, removing each site before you download its contents, and every time you find a new URL you should put it into the not-processed vector. When the not-processed vector is empty, you've found all the sites.
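The two-Vector traversal can be sketched like this. To keep the sketch testable offline, downloading a page is replaced by a Hashtable lookup (a stand-in for the Socket work), the pair is a tiny nested class (a stand-in for URLDepthPair), and the depth rule is read as "don't follow links from a page already at maximum depth" — adjust to match your own reading of the spec:

```java
import java.util.*;

public class CrawlerSketch {
    private static class Pair {  // stand-in for the URLDepthPair class
        String url; int depth;
        Pair(String url, int depth) { this.url = url; this.depth = depth; }
    }

    private Hashtable fakeWeb;  // url -> Vector of linked urls (stand-in for
                                // "open a Socket and parse the page")
    private int maxDepth;
    private Vector seenURLs = new Vector();  // all sites seen so far

    public CrawlerSketch(Hashtable fakeWeb, int maxDepth) {
        this.fakeWeb = fakeWeb;
        this.maxDepth = maxDepth;
    }

    public Vector getSites() { return seenURLs; }

    public void crawl(String startURL) {
        Vector pending = new Vector();  // sites not yet processed
        pending.addElement(new Pair(startURL, 0));
        seenURLs.addElement(startURL);
        while (!pending.isEmpty()) {
            // remove the site *before* downloading/parsing its contents
            Pair pair = (Pair) pending.remove(0);
            if (pair.depth >= maxDepth)
                continue;               // at max depth: don't follow its links
            Vector links = (Vector) fakeWeb.get(pair.url);
            for (int i = 0; links != null && i < links.size(); i++) {
                String link = (String) links.elementAt(i);
                if (!seenURLs.contains(link)) {       // skip duplicates
                    seenURLs.addElement(link);
                    pending.addElement(new Pair(link, pair.depth + 1));
                }
            }
        }
    }
}
```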
Socket instance for each URL that you are downloading text from. Don't forget to close the socket when you're finished using it.
Vector of links when finished. Send new URLs to this pool as soon as each becomes available.
NOTE: For humor-impaired people, the last example is not to be taken seriously. But by all means, give it a shot ;-)