Regular expressions: greedy and lazy matching

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

This is a great tutorial on regular expressions:

As you’ve seen, a greedy quantifier will try to match as much as it possibly can and only give back matched characters as needed. Every time the engine greedily consumes one more character (or repeated token in general), it has to remember that it made that choice. It will therefore persist its current state and store it so it can come back to it later in a process we call backtracking. When the regular expression engine backtracks, it performs another match attempt at a different position in the pattern.

Storing this backtracking position doesn’t come for free, and neither does the actual backtracking process. Because of that it’s desirable to minimize the amount of backtracking we’re forcing the engine to do. While this isn’t too much of a problem for successful matches in small inputs, this kind of optimization is even more relevant for large input strings.

Let’s assume the singleline flag is set (so that the dot will match any character) and consider the following pattern proposed in the StackOverflow thread:

\[(.*),(.*)\]

Note that the opening and closing brackets needed to be escaped because they’re special characters in a regular expression. With a preceding backslash, the regex engine treats them as literals rather than character class boundaries.

Here’s how the pattern is matched against some input:

First, it tries to match an opening bracket: \[

After that, it tries to match (and save) “any amount of anything”: (.*)

Now it tries to match the seprator, a literal comma: ,

Again, it tries to match (and save) “any amount of anything”: (.*)

Finally, it tries to match a closing bracket: ]

So far, so good — but where’s the problem?

Bad Performance and Incorrect Matches

Once the regex engine encounters the first .*, it’ll match every character until the end of the input because the star quantifier is greedy. However, the token following the “anything” is a comma, which means that the regex engine has to backtrack until its current position is in front of a comma. The same goes for the second .* and the closing bracket.

The .* pattern does one thing extremely well, and that’s creating a huge amount of backtracking positions that need to be saved by the regex engine. That’s why this kind of greedy matching behavior can lead to extremely poor performance when executed. Even worse, eagerly consuming that much input can result in undesired matches, as the following input shows:

Points: [x1,y1] and [x2,y2]

The values matched by the capturing groups of the above pattern are x1,y1] and [x2 and y2, which is most likely not what you wanted to match. Because there was no restriction, .* kept consuming input characters until the end and after that only gave up as many characters as needed to get a successful input match.

If you want to play around with this pattern a bit, feel free to use this regex fiddle.

Post external references

  1. 1
    http://blog.mariusschulz.com/2014/06/03/why-using-in-regular-expressions-is-almost-never-what-you-actually-want
Source