Regular Expressions

Regular expressions (abbreviated "regex", and pronounced with a hard "g" like "go", not a soft "g" like "George") are a very powerful way of specifying patterns to match against text strings. The language for specifying a regular expression is very simple and compact, but they can be used in numerous ways. For example, if you want to find all the IP addresses in a string, you can specify a regex for IP addresses and then look at all the matches. If you want to validate a credit card number, or a date, or any other kind of data value, it is typically very easy to write a regex that will identify valid or invalid values. As an example, here is a regex to match floating-point numbers: "-?[0-9]+(\.[0-9+])?"

In this assignment you will complete the basic implementation of a regular expression engine. This is certainly not the fastest way to implement regexes, in fact, it is probably one of the slowest. But it still illustrates some of the fundamental principles of how regular expressions work, and it will show you why some regexes can be extremely slow to evaluate. With C++11, a regular expression library was added to the C++ Standard Library, and normally you would want to use that implementation, or some other efficient regex implementation. But, this week you will build your own.

It should be stated up front: We are not going to implement the full regular expression syntax! There are some very sophisticated features in regular expression parsing that greatly complicate the implementation. We are picking out some very basic aspects of regex to implement this week.

Overview of Regular Expressions

There are typically two operations we want to perform with regular expressions:

For simplicity, we will discuss matches first, and then the more general case of finding substrings afterward.

Matching Operations

Here is an overview of the kinds of regular expressions we will support in this assignment:

Repeated and Optional Matches

Regular expressions also have the ability to repeat matches, or to optionally match (or not match) a character.

It is important to note that the "*" and "+" operators are greedy: in other words, they will consume as much of the input string as possible. This can sometimes result in unexpected behaviors, and it makes the regex evaluation engine more complex. (This is why you don't have to implement that piece of the puzzle; it will be given to you.) This will be explained in a bit more detail when we discuss the "find" operation.

Other Regex Features

There are a few other highly useful regex features, which we will not be implementing because they would greatly complicate the implementation. Nonetheless, you should still know about them:

There are numerous other features that regular expressions support, but this should give you a good sense of their power.

Find Operations

You can also "find" occurrences of a regular expression in a string; unlike a "match" operation, the "find" operation simply identifies the first substring that matches the specified regex. We will specify the result of a "find" operation as the region of the string that matched the regex, indicated by the starting index (inclusive) and the ending index (exclusive).

For example, searching for the regex "abc" in the string "abcabc" will identify the region [0, 3) in the string. Similarly, finding "abc" on the string "defabcdef" will return the region [3, 6).

Recall that the "*" and "+" operators are greedy; performing "find" operations is where the greedy nature of these operators becomes evident. For example, given the regex "a.*b" and the string "abbbb" will find the region [0, 5) rather than simply [0, 2), since the ".*" operator will consume as many characters as possible while still achieving a match.

Backtracking

This is also where the "backtracking" nature of regular expressions becomes evident. As the regex is evaluated, a greedy operator may consume too much of the input string. If subsequent parts of the regex won't match, then the regex engine must backtrack until it can successfully match subsequent operations. If the engine must backtrack all the way through everything it has attempted, then it indicates a total matching failure - and the engine gives up.

For example, with the above regex "a.*b" and the string "aabbbcccc", the ".*" operator might attempt to consume the region [1, 9) - that is, everything after the first "a" character. But this means the final "b" in the regex won't match. So, the engine backs up one step, and tries again - the ".*" only matches on the range [1, 8), leaving the final "c". Of course, this still doesn't match the "b" required by the regex. So, the engine backs up again, and again, and again, and again, until it finally finds the "b" that the regex requires.

The code given to you implements this simple backtracking operation. This is also where some of the engine's complexity comes from.

Note that there are other, much more efficient ways to implement regular-expression processing; backtracking is the simplest and most understandable approach, but it is also often the slowest for even marginally complex regular expressions.

Your Tasks

You are provided with an initial version of the following files, which you should download into a working directory:

1. Regex matching operators

Your first task will be to implement the four operators we are supporting in our regular expressions. This work will go into regex.h and regex.cpp.

You will need to create a separate class for each of the operations described below; the suggested class names and other details are given below. All of these classes should derive from the provided RegexOperator. (You should also notice that the RegexOperator class is not quite ready to be used as a base class; you will need to make some changes to it, per the guidelines for base classes in the lecture.)

All matching operations are provided via the following function signature:

bool match(const string &s, Range &r) const

Add this as a pure-virtual function to RegexOperator, and then provide an implementation to each of the subclasses, doing the appropriate operation. (You will notice that the implementations of the above operations are actually very simple, especially if you use the facilities provided by the C++ string class.)

NOTE: If you store the results of any of the string::find_xxxx member functions into variables, use string::size_type for the variable's type. Otherwise, you might be unpleasantly surprised with difficult-to-find bugs.

2. Parsing regular expressions

Your next task is to provide an implementation of the parseRegex() function, which turns a regex string into a vector of RegexOperator* objects that implement the regular expression. The vector returned by this function should match the order of the incoming regex: regular expressions are evaluated left-to-right (or from low index to high index), and the vector will also be evaluated in the same order.

This function is pretty straightforward to implement, although you will have to think carefully about how you implement it. You will need to handle escaped characters (characters preceded by "\"), special characters like the "." wildcard operator, the "?" optional-match modifier, and the "*" and "+" repeat modifiers. And of course, you also need to be able to match anything else that is specified, as a "regular character."

You can map all of these situations to the four matching operations from Step 1, with the addition that some of them may require setting the "minimum repeat" and "maximum repeat" counts to something other than their default values of 1. For example, the Kleene star "*" could set the last operator's minimum-repeat to 0, and the maximum-repeat to -1.

For this week, you can assume that all inputs to this function will be valid. We will add error-checking in a future assignment.

3. Implement find() and match() operations

Once you have the above implemented, you can switch over to engine.cc and implement the find() and match() operations. These are very straightforward to implement:

The findAtIndex() helper function implements the core of the backtracking algorithm. If it finds a match, it will return a Range object indicating the region that matched. If it doesn't find a match, it will return a Range object with both start and end set to -1.

4. Create a Makefile for the project

Since this project contains four different .cpp files, create a Makefile that builds them into the executable test_regex. Your makefile should also have an all target and a clean target. Make sure to back up your files before testing your clean target!

5. Make sure all of the tests pass!

If any of the tests in the test_regex suite don't pass, you should make sure to figure out why, and get them passing.

All Done!

Once you have completed all of the above tasks, you can submit the following files on codepost:

You don't need to submit any of the test-code files; we will use fresh copies of them when we test your work.