CS11 Erlang - Lab 7

Exposing RSS Feeds via YAWS

YAWS is an open-source web server written in Erlang. ("YAWS" stands for "Yet Another WebServer".) Although Erlang includes a webserver framework in its standard libraries (see the httpd module), it is pretty complicated to set up, and YAWS is much easier to use. Specifically, you can create dynamic web-pages that use Erlang to generate content on the server. This will make it very easy to generate our aggregate RSS feeds.

YAWS on the CS Cluster

If you are doing your Erlang assignments on the CS cluster, YAWS is installed in the directory "/cs/courses/cs11/software/yaws-1.80". To use YAWS from the Erlang prompt, you need to add the YAWS binaries to the Erlang code path, so that the Erlang emulator can find the module files. You can do this with the code:add_path(Dir) function, like this:

code:add_path("/cs/courses/cs11/software/yaws-1.80/lib/yaws/ebin").

You might find it much easier to create a .erlang file in your lab7 directory that contains this line of code.
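As a rough sketch, such a .erlang file could look like the following. It assumes the CS cluster path given above, and it assumes that your Erlang release reads a .erlang file from the directory where the shell is started. If yours does not pick it up, placing the file in your home directory is another option.

```erlang
%% .erlang - expressions in this file are evaluated when the Erlang shell starts.
%% Assumption: the YAWS install location from the CS cluster; adjust it if
%% you installed YAWS somewhere else.
code:add_path("/cs/courses/cs11/software/yaws-1.80/lib/yaws/ebin").
```

To check that the path took effect, you can evaluate code:which(yaws) at the Erlang prompt. It returns the location of the yaws module file, or the atom non_existing if the module cannot be found.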

YAWS on Your Own Machine

Alternatively, if you want to do your assignment on your own computer, you can download YAWS from this URL. We recommend version 1.80, since this is what is also installed on the CS cluster.

Once you have downloaded and unpacked the distribution, you will need to build and install the software. This package uses the traditional "configure ; make ; make install" sequence of operations. Something like this should work on Linux, Cygwin, or MacOS X:

# Tell the installer to put YAWS into /usr/local/yaws-1.80
./configure --prefix=/usr/local/yaws-1.80

make

# On Cygwin, just type "make install"
# On Linux, you may need to use "su" instead of "sudo"
# On MacOS X:
sudo make install

Once the make install step is complete, you should have a directory at /usr/local/yaws-1.80 that contains the webserver. As in the previous section, you can use code:add_path() to make the YAWS code visible to the Erlang emulator. However, the path will be more like this:

code:add_path("/usr/local/yaws-1.80/lib/yaws/ebin").

Running YAWS

YAWS can be run in a standalone mode, or in an embedded mode; we will be using the embedded mode to start it as part of our RSS feed-aggregator project. This webpage describes how to run YAWS in embedded mode, but it's actually very simple:

Within your lab7 directory, create a subdirectory called docroot. This is where YAWS will serve web pages from. Create a simple test page in the docroot directory called test.html, something like this:

<html>
<body>
<h1>IT WORKS!</h1>

<p>This is a web page served by yaws.</p>
</body>
</html>

Once you have created this file, go back to the lab7 directory (where your .erlang file is), start up the Erlang shell, and then start YAWS in embedded mode:

1> yaws:start_embedded("docroot").

=INFO REPORT==== 4-Mar-2009::12:26:28 ===
Yaws: Listening to 127.0.0.1:8000 for servers
 - http://localhost:8000 under docroot
ok
2>

Once you have started YAWS, you should be able to go to the URL http://localhost:8000/test.html, and see your test file displayed in the web browser.

RSS Aggregates Exposed via YAWS

Once you have YAWS set up, you can put an rss.yaws file into your docroot directory. This file will allow you to request a queue's aggregated feed from the webserver by putting the queue's name in the query string of the rss.yaws URL.

YAWS allows dynamic content to be generated by wrapping Erlang code in <erl> tags. One of the functions specified must be an out(Arg) function, where Arg specifies the HTTP request information. By providing an implementation of this function, we can get YAWS to deliver our RSS feeds for us.
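To illustrate the mechanism on its own, here is a minimal sketch of a dynamic page. The file name hello.yaws and the greeting text are made up for this example; the {html, ...} return value is the same form that the 404 branch of rss.yaws uses.

```erlang
<erl>
%% hello.yaws (hypothetical file name), placed in the docroot directory.
%% The request information is ignored here, hence the underscore.
out(_Arg) ->
    {html, "<p>Hello from Erlang!</p>"}.
</erl>
```

With YAWS running as described earlier, requesting this page should run out/1 on the server and return the generated paragraph.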

The rss.yaws file looks like this:

<erl>
out(Arg) ->
    % Figure out the name of the queue that was requested, as well as the actual
    % request URL, so that we can build our feed XML.

    RequestURL = yaws_api:format_url(yaws_api:request_url(Arg)),
    QueueName = list_to_atom(Arg#arg.querydata),

    % Make sure the specified name actually corresponds to a queue!  If not, we
    % simply report a 404 "Not Found" response.
    QPid = whereis(QueueName),
    if
        QPid == undefined ->
            [{status, 404},
             {html, [io_lib:format("<h1>404 Queue ~p Not Found</h1>", [QueueName]),
                     io_lib:format("<p>No queue with name ~p on this server.</p>",
                         [QueueName])
                    ]
             }];

        true ->
            {content, "application/xhtml+xml",
             rss_queue:get_feed_xml(QueueName, RequestURL) }
    end.
</erl>

(You can download this file here: rss.yaws)

As you can see above, the name of the queue is retrieved from the HTTP request, along with the entire request URL. Then, if the queue doesn't exist, the page generates a "404 Not Found" response. However, if the queue does exist, the page returns a response of the "application/xhtml+xml" content-type, which is what your RSS feed-reader expects. The key function above is the rss_queue:get_feed_xml(QueueName, RequestURL) function, which you must write this week.

Generating RSS Feed XML

RSS feed XML is pretty straightforward, although there are a few nuances to it. Right now, you have an rss_queue:get_all(pid() | atom()) function that returns a list of #xmlElement{name=item} elements, and this is about 90% of the effort. However, you must still wrap it in an RSS 2.0 document that looks something like this:

<?xml version="1.0"?>
<rss ... version="2.0">
    <channel>
        <title>feed_name</title>
        <description>Aggregated feed queue feed_name</description>
        <link>The HTTP request URL</link>
        
        [ RSS Feed Items Go Here! ]
    </channel>
</rss>

This is what your implementation of rss_queue:get_feed_xml/2 needs to produce.


IMPORTANT NOTE!

For your RSS feed XML to parse correctly, you must include references to all XML namespaces that the feed items use. For example, you will see XML like this:

<feedburner:origLink> ... </feedburner:origLink>

If we were cool, we would propagate these XML namespace directives all the way from our feed-sources to the final output, but we aren't, so instead we will do something easy: just include the common XML namespace directives on our output <rss> element. Like this:

<rss xmlns:media="http://search.yahoo.com/mrss/"
     xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"
     xmlns:digg="http://digg.com/docs/diggrss/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     version="2.0">

Of course, in your Erlang code, you can do something like this:

"<rss xmlns:media=\"http://search.yahoo.com/mrss/\""
    " xmlns:feedburner=\"http://rssnamespace.org/feedburner/ext/1.0\""
    " xmlns:digg=\"http://digg.com/docs/diggrss/\""
    " xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
    " version=\"2.0\">\n"

(Remember that Erlang will automatically concatenate two adjacent string literals, at compile-time.)
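Putting the wrapper document and the namespace directives together, one possible shape for the XML-building step is sketched below. This is only an illustration, not the required solution. The module name feed_xml_sketch, the function feed_xml/3, and the escape/1 helper are hypothetical names, and the sketch takes the item list as an argument instead of calling rss_queue:get_all itself. It relies on xmerl:export_simple_content/2 to turn the #xmlElement items back into text.

```erlang
-module(feed_xml_sketch).
-export([feed_xml/3, escape/1]).

%% Hypothetical helper: wraps already-fetched feed items in an RSS 2.0 document.
%%   Name  - the queue name, as an atom
%%   URL   - the HTTP request URL, as a string
%%   Items - a list of #xmlElement{name=item} records
feed_xml(Name, URL, Items) ->
    NameStr = escape(atom_to_list(Name)),
    %% Serialize the item elements back to XML text.
    ItemsXML = xmerl:export_simple_content(Items, xmerl_xml),
    lists:flatten([
        "<?xml version=\"1.0\"?>\n",
        "<rss xmlns:media=\"http://search.yahoo.com/mrss/\""
            " xmlns:feedburner=\"http://rssnamespace.org/feedburner/ext/1.0\""
            " xmlns:digg=\"http://digg.com/docs/diggrss/\""
            " xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
            " version=\"2.0\">\n",
        "<channel>\n",
        "<title>", NameStr, "</title>\n",
        "<description>Aggregated feed queue ", NameStr, "</description>\n",
        "<link>", escape(URL), "</link>\n",
        ItemsXML, "\n",
        "</channel>\n",
        "</rss>\n"]).

%% Escape the characters that are not allowed to appear literally in XML text.
escape(Text) ->
    lists:flatmap(fun escape_char/1, Text).

escape_char($<) -> "&lt;";
escape_char($>) -> "&gt;";
escape_char($&) -> "&amp;";
escape_char(C)  -> [C].
```

Escaping the name and URL is a design choice made for this sketch: a request URL can contain characters such as "&" that would otherwise make the output malformed.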


Some hints for your implementation of rss_queue:get_feed_xml/2:

Testing

The easiest way to test your new system is to try out the rss_queue:get_feed_xml/2 function separately, before you try it through YAWS. (YAWS seems to die when the Erlang code in a page throws an exception, so you need to restart the Erlang emulator in those situations.)
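One way to test the function separately is to feed its output back into the xmerl parser and see whether it is accepted. The sketch below assumes that get_feed_xml/2 returns a flat string; the module name feed_check and the function well_formed/1 are hypothetical names invented for this example.

```erlang
-module(feed_check).
-export([well_formed/1]).

%% Returns true if XML (a string) parses as a well-formed document,
%% and false if the xmerl scanner rejects it.
well_formed(XML) ->
    try xmerl_scan:string(XML, [{quiet, true}]) of
        {_Doc, _Rest} -> true
    catch
        %% xmerl_scan signals a parse failure by exiting; treat any
        %% failure here as "not well-formed".
        _:_ -> false
    end.
```

From the Erlang shell, a call such as feed_check:well_formed(rss_queue:get_feed_xml(my_queue, "http://localhost:8000/rss.yaws?my_queue")) would then check the feed for a queue registered as my_queue (a placeholder name). Note that this only checks well-formedness, not whether the document is valid RSS 2.0.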

Once you know that your new function works correctly, you can start up YAWS and see if you can retrieve RSS XML feeds from your webserver. If you are using Firefox, you can visit some of the feed URLs, and the browser should recognize each one as an RSS feed. If this is what you see, congratulations! You are finished!

Make sure to leave your work in your ~/cs11/erlang/lab7 directory.


Copyright (C) 2009, California Institute of Technology. All rights reserved.
Last updated March 5, 2009.