CS11 Erlang - Lab 3 - Parsing RSS Feeds

For the rest of the assignments this term, we will be building a simple RSS feed aggregator. Erlang happens to be particularly well suited to the kinds of operations that an RSS feed aggregator needs to perform, so we will benefit a lot from the abstractions that Erlang provides.

This week we are going to write the code to process RSS 2.0 feed documents. You may find it helpful to consult an RSS 2.0 specification for this assignment; if so, this website contains a simple and clear description of the spec. We won't worry about pulling feeds from the Internet yet; rather, we will work with XML files that contain RSS feeds.

We will also be working with the xmerl XML parser that comes with Erlang/OTP. You should look at the xmerl documentation that ships with Erlang; there are six different xmerl packages for working with XML! However, we will concentrate only on xmerl_scan, since this is the module that contains the XML parser.

The xmerl_scan module has two functions we can use for document parsing, xmerl_scan:file/1 and xmerl_scan:string/1. This week you will use the first of these, since you will be working with data files on the local filesystem. Read the documentation for what these functions produce, but it is very simple: each returns a tuple whose first element is an #xmlElement record corresponding to the root node of the document, and whose second element is "the rest of the input data", meaning any input left over after the document was parsed (normally the empty list).

You can try these functions from the Erlang shell, although you should load the xmerl record definitions before doing anything else. Here are some data files to try:

You can do something like this from the Erlang shell:

    1> rr(xmerl).
    [a list of record types]
    2> xmerl_scan:file("digg-science-rss1.xml").
    {#xmlElement{...}, []}

Of course, these results will be quite long, because every little part of the XML document is represented as a separate record. You can look at the raw XML documents to see what they contain, and then look at the corresponding set of data structures for how xmerl represents them.

Note: If you get the cryptic {error,enoent} response, this means that xmerl_scan:file couldn't find the file you specified (enoent is the POSIX code for "no such file or directory"). This return value is documented on the file:open function, which xmerl_scan:file uses internally.

It is also very helpful to look at the actual xmerl header file containing the record definitions. To do this, you can ask Erlang where the xmerl.hrl header file is stored:

    3> code:lib_dir(xmerl).

This command will print out the path to where the xmerl library is stored in your local Erlang installation. Go to the directory reported by the above command, and then look at include/xmerl.hrl. This file contains the well-documented record-declarations for XML elements. You will specifically want to look at the #xmlElement.attributes and #xmlElement.content fields, as well as the #xmlAttribute.name and #xmlAttribute.value fields.
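
For example, continuing the hypothetical shell session from above (with the record definitions still loaded via rr(xmerl)), you can bind the root element and inspect these fields directly. The output is omitted here, since it depends on which feed you load:

    4> {Root, _Rest} = xmerl_scan:file("digg-science-rss1.xml").
    5> Root#xmlElement.name.
    6> [{A#xmlAttribute.name, A#xmlAttribute.value} || A <- Root#xmlElement.attributes].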

RSS Feed Parsing Tasks

  1. Create a module rss_parse.

  2. Create and export a function is_rss2_feed/1 that takes the output of an xmerl_scan:file/1 or xmerl_scan:string/1 call, and returns true if the input is an RSS 2.0 feed (and false otherwise).

    We are not going to be very sophisticated here. We aren't going to do XML schema validation or anything like that; we are going to implement a very simple test. Just make sure that the root element of the XML document is an rss element with an attribute version="2.0". You can ignore all of the other details on the root element, such as what XML namespaces (xmlns specs) are listed, and what other attributes are specified. Just return true if the root element looks something like <rss version="2.0">. (A possible starting point is sketched after this task list.)

  3. Create and export a function get_feed_items(RSS2Feed) that takes the root #xmlElement record of an RSS 2.0 XML document, and returns a list of the <item> elements (and their contents) in the document.

    RSS feeds consist of feed-items, and the <item> element is what represents these feed items. Again, you can look at the raw XML documents to get a better idea of what goes into these feed items; the answer really is, just about anything! Since we are going to be routing and filtering RSS feed items, and then eventually serving them as another XML feed, we might as well just work with them in the original XML, especially because they can contain such a wide variety of data values. At least for our purposes, it wouldn't make any sense to convert them to an internal format, and then convert them back into XML at the end. So we will just work with the XML elements that represent the feed items.

    Note that different RSS feeds may wrap feed-items in different elements, based on the site providing the feed! For example, digg.com wraps feed-items in a <channel> element, whereas cnn.com does not. Your function should be smart enough to just search through the entire document structure and find all <item> elements, returning them in a list. (One way to do this is sketched after this task list.)

  4. Create and export a function get_item_time(Item) that takes an #xmlElement corresponding to an RSS 2.0 feed item, and returns the item's publication time as a single integer, representing seconds since the Gregorian year 0. (Seriously. Just keep reading.)

    This may seem like a miserably complex task, but fortunately Erlang already contains the tools to make this a relatively straightforward process. You can use httpd_util:convert_request_date/1 to convert a raw string into the Erlang form of date/time values: {{Year,Month,Day},{Hour,Min,Sec}}. Then, you can use the calendar:datetime_to_gregorian_seconds/1 function to convert this date/time value into a single integer. (A sketch of this conversion appears after this task list.)

    Your function should return the atom bad_date if the date doesn't get parsed correctly by httpd_util:convert_request_date/1. (Bad dates are dangerous; just ask Sallah.)

  5. Create and export a function compare_feed_items(OldItem, NewItem) that takes two #xmlElement records corresponding to RSS feed items, and then returns an atom based on the relationship between the old and new feed-items. The details of what atoms to return are outlined below. But first, why do we need this function?

    RSS readers present a feed as if it were a stream of items that are sent one-by-one to your reader for your perusal. However, this is not the case! Rather, when the reader accesses the feed URL, it receives a document containing a snapshot of all feed-items currently in publication. The reader must keep track of the RSS feed-items that it has already seen from the publisher, and when it sees a new item in the feed snapshot, it must add that item to its local list. However, if the reader already has the same item stored locally, then the item is not new, and there's no point in notifying the user about it again.

    With this important detail in mind, this compare_feed_items/2 function will help classify incoming RSS feed items so that the aggregator can know when an item is actually new, or just an old item that we have seen again for the umpteenth time. So, the function should return the following atoms:

    One difficulty in implementing the above functionality is that xmerl records positional data in the parsed XML representation, such as the parent of an XML element, or an element's position with respect to its siblings. This means that two logically identical nodes can't be compared directly. Here is a piece of code that will strip that positional data out, so that the stripped elements can be compared with a simple =:= check:

    % @private
    % @doc Given an XML element of some kind, this helper function will go through
    %      and remove any details about other XML elements, for example in the
    %      "parents" or "pos" fields.
    %
    % @spec extract_xml(Node::xmlAny()) -> xmlAny()
    %
    extract_xml(Elem = #xmlElement{}) ->
        Elem#xmlElement{parents=[], pos=0,
            content=lists:map(fun extract_xml/1, Elem#xmlElement.content),
            attributes=lists:map(fun extract_xml/1, Elem#xmlElement.attributes)};
    extract_xml(Attr = #xmlAttribute{}) ->
        Attr#xmlAttribute{parents=[], pos=0};
    extract_xml(Text = #xmlText{}) ->
        Text#xmlText{parents=[], pos=0};
    extract_xml(Comment = #xmlComment{}) ->
        Comment#xmlComment{parents=[], pos=0};
    extract_xml(Other) ->
        Other.
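
The sketches below are possible starting points for tasks 2 through 4. They are only sketches: they assume your module pulls in the xmerl record definitions with -include_lib("xmerl/include/xmerl.hrl"), and you will still need to adapt and document them yourself. For task 2, one simple approach is to pattern-match on the tuple returned by xmerl_scan, check that the root element is named rss, and then scan its attribute list for a version="2.0" attribute:

    % Possible sketch for is_rss2_feed/1. The argument is assumed to be the
    % {XmlElement, Rest} tuple produced by xmerl_scan:file/1 or xmerl_scan:string/1.
    is_rss2_feed({Root = #xmlElement{name = rss}, _Rest}) ->
        % True when any attribute of the root element is version="2.0".
        lists:any(fun(#xmlAttribute{name = version, value = "2.0"}) -> true;
                      (_) -> false
                  end,
                  Root#xmlElement.attributes);
    is_rss2_feed(_Other) ->
        false.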
        
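For task 3, one convenient possibility is to let xmerl do the searching for you: xmerl_xpath:string/2 takes an XPath expression and a parsed element, and the expression "//item" matches every <item> element at any depth, no matter how the feed nests its channels. (Walking #xmlElement.content recursively yourself works just as well.)

    % Possible sketch for get_feed_items/1. RSS2Feed is assumed to be the root
    % #xmlElement record of the parsed document.
    get_feed_items(RSS2Feed) ->
        % "//item" selects all <item> elements anywhere below the root.
        xmerl_xpath:string("//item", RSS2Feed).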

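For task 4, the sketch below assumes the publication time of an item lives in its <pubDate> child element, as a text date string in the usual RSS format (for example, "Wed, 02 Oct 2002 13:00:00 GMT"):

    % Possible sketch for get_item_time/1. Item is assumed to be the #xmlElement
    % record of a single <item>.
    get_item_time(Item) ->
        case xmerl_xpath:string("pubDate/text()", Item) of
            [#xmlText{value = DateStr}] ->
                % convert_request_date/1 returns {{Y,M,D},{H,Min,S}} on success,
                % or the atom bad_date if it cannot parse the string.
                case httpd_util:convert_request_date(DateStr) of
                    bad_date -> bad_date;
                    DateTime -> calendar:datetime_to_gregorian_seconds(DateTime)
                end;
            _NoPubDate ->
                bad_date
        end.
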
Once you have these functions coded up, it will be quite a bit easier to work with RSS feeds and feed-items in the rest of your program. Make sure to clearly document all of your functions. You will be relying on them quite heavily throughout the project.

You should also try retrieving your own RSS 2.0 feeds from various websites, such as CNN.com or BBC News. You can also try the Slashdot RSS feed, which is an RSS 1.0 feed, and should properly be flagged as NOT an RSS 2.0 feed by your helper functions.


Copyright (C) 2009-2012, California Institute of Technology. All Rights Reserved.
Last updated October 29, 2012.