CS11 Advanced C++ Track Lab 1 - Regular Expressions

Regular Expressions

Regular expressions (abbreviated "regex", and pronounced with a hard "g" like "go", not a soft "g" like "George") are a very powerful way of specifying patterns to match against text strings. The language for specifying a regular expression is very simple and compact, but they can be used in numerous ways. For example, if you want to find all the IP addresses in a string, you can specify a regex for IP addresses and then look at all the matches. If you want to validate a credit card number, or a date, or any other kind of data value, it is typically very easy to write a regex that will identify valid or invalid values. As an example, here is a regex to match floating-point numbers: "-?[0-9]+(\.[0-9+])?"

In this assignment you will complete the basic implementation of a regular expression engine. This is certainly not the fastest way to implement regexes, in fact, it is probably one of the slowest. But it still illustrates some of the fundamental principles of how regular expressions work, and it will show you why some regexes can be extremely slow to evaluate. With C++11, a regular expression library was added to the C++ Standard Library, and normally you would want to use that implementation, or some other efficient regex implementation. But, this week you will build your own.

It should be stated up front: We are not going to implement the full regular expression syntax! There are some very sophisticated features in regular expression parsing that greatly complicate the implementation. We are picking out some very basic aspects of regex to implement this week.

Overview of Regular Expressions

There are typically two operations we want to perform with regular expressions:

We can find the first occurrence of a regex pattern in a string. This will identify the first substring in the input string where the regex matches. It is also possible that the regex pattern won't occur anywhere in the input string.
We can try to match a string against a regex pattern, which means that the entire string matches against the regular expression. You can implement "match" in terms of "find" - if the "find" operation identifies the entire input string, then it is a match.

For simplicity, we will discuss matches first, and then the more general case of finding substrings afterward.

Matching Operations

Here is an overview of the kinds of regular expressions we will support in this assignment:

Most characters in the regex match against themselves. For example, the regex "abc" will match the string "abc". However, there are a number of characters that have special meaning in regular expressions.
The "." (period) is an example of a character with special meaning in a regex: it will match against any single character. Thus, the regex "a.c" will match against "abc", "aqc", "a?c", and so forth.
If you wish to identify actual "." characters in the target string, you can escape them with a "\" backslash character. For example, the regex "a\.c" regex will only match against the string "a.c", but not "abc" or any other string.
You can escape any character with special meaning in a regular expression, using the backslash character, including escaping the backslash character itself.
If you wish to match against a subset of characters - known as a character class - you can use the "[]" operator. For example, the regex "a[bdr]c" will match against the strings "abc", "adc" and "arc", but not "abdc", "ac" or "axc".
You can also exclude a subset of characters - a negated character class - with the "[^]" operator. For example, the regex "a[^bdr]c" will match against any three-letter string that starts with "a", ends with "c", and does not have a "b", "d" or "r" as the second letter. So, "axc" would match, but "arc" and "abc" would both not match.

Repeated and Optional Matches

Regular expressions also have the ability to repeat matches, or to optionally match (or not match) a character.

For example, following a character with the "?" operator makes the match optional. If we wanted to match both American-English and British-English spellings of the word "color" (or "colour"), we could specify the regex "colou?r", which will optionally include or exclude the "u".
The "*" modifier will cause the previous character to be matched zero or more times. For example, "ab*c" will match "ac", or "abc", or "abbbbbbbc". (This operator is also known as the "Kleene star," which is only relevant to this assignment because it's named after Stephen Kleene, inventor of regular expressions.)
Similarly, the "+" modifier will cause the previous character to be matched one or more times. The regex "ab+c" will not match the string "ac", but it will match "abc" or "abbbbbbc". (Note, the regex "ab+c" is equivalent to the regex "abb*c".)

It is important to note that the "*" and "+" operators are greedy: in other words, they will consume as much of the input string as possible. This can sometimes result in unexpected behaviors, and it makes the regex evaluation engine more complex. (This is why you don't have to implement that piece of the puzzle; it will be given to you.) This will be explained in a bit more detail when we discuss the "find" operation.

Other Regex Features

There are a few other highly useful regex features, which we will not be implementing because they would greatly complicate the implementation. Nonetheless, you should still know about them:

Character classes can also specify ranges of characters to make it easy to match letters or digits. For example, the regex "[0-9]+" will match nonnegative integers, and the regex "-?[0-9]+" would match all integers.
Regular expressions can group sub-expressions with parentheses "()". Additionally, all the various modifiers, such as "?", "*" and "+" can be applied to groups, as well as to individual characters. Thus, the regex "(ab)+a" would match the strings "aba", "ababababa", or any sequence of "a" characters interleaved with "b" characters.
The "re1|re2" operator lets you match either the subexpression re1 or re2. This operator is very low precedence, so if you were trying to match the words "far" and "for" with a single regex, you would need to specify "f(a|o)r", since "fa|or" would only match the strings "fa" and "or", but nothing else.

There are numerous other features that regular expressions support, but this should give you a good sense of their power.

Find Operations

You can also "find" occurrences of a regular expression in a string; unlike a "match" operation, the "find" operation simply identifies the first substring that matches the specified regex. We will specify the result of a "find" operation as the region of the string that matched the regex, indicated by the starting index (inclusive) and the ending index (exclusive).

For example, searching for the regex "abc" in the string "abcabc" will identify the region [0, 3) in the string. Similarly, finding "abc" on the string "defabcdef" will return the region [3, 6).

Recall that the "*" and "+" operators are greedy; performing "find" operations is where the greedy nature of these operators becomes evident. For example, given the regex "a.*b" and the string "abbbb" will find the region [0, 5) rather than simply [0, 2), since the ".*" operator will consume as many characters as possible while still achieving a match.

Backtracking

This is also where the "backtracking" nature of regular expressions becomes evident. As the regex is evaluated, a greedy operator may consume too much of the input string. If subsequent parts of the regex won't match, then the regex engine must backtrack until it can successfully match subsequent operations. If the engine must backtrack all the way through everything it has attempted, then it indicates a total matching failure - and the engine gives up.

For example, with the above regex "a.*b" and the string "aabbbcccc", the ".*" operator might attempt to consume the region [1, 9) - that is, everything after the first "a" character. But this means the final "b" in the regex won't match. So, the engine backs up one step, and tries again - the ".*" only matches on the range [1, 8), leaving the final "c". Of course, this still doesn't match the "b" required by the regex. So, the engine backs up again, and again, and again, and again, until it finally finds the "b" that the regex requires.

The code given to you implements this simple backtracking operation. This is also where some of the engine's complexity comes from.

Note that there are other, much more efficient ways to implement regular-expression processing; backtracking is the simplest and most understandable approach, but it is also often the slowest for even marginally complex regular expressions.

Your Tasks

You are provided with an initial version of the following files, which you should download into a working directory:

regex.h contains the regular expression operators, and other supporting types and functions
regex.cpp contains basic implementation for regex.h
engine.h contains declarations for the regular expression matching engine
engine.cpp contains some basic implementation for the engine
testbase.cpp, testbase.h and test_regex.cpp are part of the test suite for your regular expression engine

1. Regex matching operators

Your first task will be to implement the four operators we are supporting in our regular expressions. This work will go into regex.h and regex.cpp.

You will need to create a separate class for each of the operations described below; the suggested class names and other details are given below. All of these classes should derive from the provided RegexOperator. (You should also notice that the RegexOperator class is not quite ready to be used as a base class; you will need to make some changes to it, per the guidelines for base classes in the lecture.)

Match a specific character (suggested class name: MatchChar; the specific character to match is specified as a char argument to the constructor)
Match any character (suggested class name: MatchAny; this class doesn't require any data members, since it matches any character)
Match any character from a subset of characters (suggested class name: MatchFromSubset; the collection of characters should be provided as a C++ string object)
Match any character not in a subset of characters (suggested class name: ExcludeFromSubset; the collection of characters should be provided as a C++ string object)

All matching operations are provided via the following function signature:

bool match(const string &s, Range &r) const

This function returns true if the operator "matches" the character at s[r.start], or false otherwise.
The range r is an in-out parameter - the caller passes information to the function via r, and the function is expected to pass some information back to the caller through r.
Initially, r.start is the character to try to match, and r.start == r.end.
If the match() implementation will return true, r should also be updated to indicate the region that matched; for our operations, we will always match one character, so r.end will simply be set to r.start + 1.
If no match is possible (i.e. the match() function will return false), then r should be left unchanged.
Note 1: This function should always return false if r.start is past the end of the input-string s.
Note 2: This function doesn't need to call pushMatch() etc.; this is handled by the engine code.

Add this as a pure-virtual function to RegexOperator, and then provide an implementation to each of the subclasses, doing the appropriate operation. (You will notice that the implementations of the above operations are actually very simple, especially if you use the facilities provided by the C++ string class.)

NOTE: If you store the results of any of the string::find_xxxx member functions into variables, use string::size_type for the variable's type. Otherwise, you might be unpleasantly surprised with difficult-to-find bugs.

2. Parsing regular expressions

Your next task is to provide an implementation of the parseRegex() function, which turns a regex string into a vector of RegexOperator* objects that implement the regular expression. The vector returned by this function should match the order of the incoming regex: regular expressions are evaluated left-to-right (or from low index to high index), and the vector will also be evaluated in the same order.

This function is pretty straightforward to implement, although you will have to think carefully about how you implement it. You will need to handle escaped characters (characters preceded by "\"), special characters like the "." wildcard operator, the "?" optional-match modifier, and the "*" and "+" repeat modifiers. And of course, you also need to be able to match anything else that is specified, as a "regular character."

You can map all of these situations to the four matching operations from Step 1, with the addition that some of them may require setting the "minimum repeat" and "maximum repeat" counts to something other than their default values of 1. For example, the Kleene star "*" could set the last operator's minimum-repeat to 0, and the maximum-repeat to -1.

For this week, you can assume that all inputs to this function will be valid. We will add error-checking in a future assignment.

3. Implement `find()` and `match()` operations

Once you have the above implemented, you can switch over to engine.cc and implement the find() and match() operations. These are very straightforward to implement:

The find() operation should use the findAtIndex() helper (provided) to find the first match of the regular expression regex in the string s, and return a Range object indicating where the match occurred. Do this by attempting to find a match at index 0 of the string s, then at index 1, and so forth, returning the first match that is found.
The match() operation can be implemented in terms of the find() operation: if find() returns a range that starts at 0 and covers the entire input string, then match() should return true.

The findAtIndex() helper function implements the core of the backtracking algorithm. If it finds a match, it will return a Range object indicating the region that matched. If it doesn't find a match, it will return a Range object with both start and end set to -1.

4. Create a `Makefile` for the project

Since this project contains four different .cpp files, create a Makefile that builds them into the executable test_regex. Your makefile should also have an all target and a clean target. Make sure to back up your files before testing your clean target!

5. Make sure all of the tests pass!

If any of the tests in the test_regex suite don't pass, you should make sure to figure out why, and get them passing.

All Done!

Once you have completed all of the above tasks, you can submit the following files on codepost:

regex.h and regex.cpp
engine.h and engine.cpp
Makefile

You don't need to submit any of the test-code files; we will use fresh copies of them when we test your work.