Regex

From Jonathan Gardner's Tech Wiki
Jump to: navigation, search

Introduction

Regular expressions (regexes, regexen) are a Chomsky Type 3 grammar, although backtracking can give you almost all the features of a Type 2 grammar. (See Parsing.)

General Regex Concepts

Character Matches

. Any char.
char char itself, unless char has a special meaning.
\char char, if it has a special meaning.
[chars] Grouping. Any one of the contents of chars is acceptable. Note that ranges may be specified, for instance [A-Z].
\w A word. [A-Za-z0-9_]
\W A non-word.
\d A numeral.
\D A non-numeral.
\s A space (tab, space, newline...)
\S A non-space.


Positional Matches

\b Word boundary.
\B Not a word boundary.
\< \> Beginning, end of word. Not perl or python.
^ Beginning of line (multiline mode) or beginning of text.
\A Only the beginning of the text. Not grep
$ End of line or end of text. Not grep.
\Z Only at the end of the text. Not grep.
(?=pattern) Positive look-ahead. Matches this position if the characters following it match pattern. Not grep
(?!pattern) Negative look-ahead. Matches this position if the characters following it do not match pattern. Not grep.
(?<=pattern) Positive look-behind. Matches the position if the characters preceeding it match pattern. Not grep
(?<!pattern) Negative look-behing. Matches the position if the characters preceeding it do not match pattern'. Not grep.

Multipliers

* Zero or one or many. (Kleene Star)
+ One or many.
 ? Zero or one.
{n} Exactly n times.
{n,m} n to m times, inclusive.
{n,} n or more times.
{,m} m or fewer times.

By default, multipliers are greedy. They try to match as much as possible. You can ask for a non-greedy multiplier that tries to match as few as possible by appending the suffix '?'.

Backtracking

The algorithm for regex matching works pretty much like this:

  • Try to match everything.
  • If that doesn't work, don't grab so much with the greedy multipliers, or grab more with the non-greedy multipliers, backtracking.
  • If that doesn't work, try to match on the next character in the string.

The backtracking bit can take a long time for long strings. Consider this regex:

[a-z]*foo

If we wanted to match this with the string "abcdefoo", we would have to backtrack twice. See, the [a-z]* bit will "eat" up all the string--all the way to the end--without leaving room for the final "foo". So it has to backtrack and eat a little less each time until it has only eaten "abcde". At that point, the rest of the regex will match.

You can see how can go bad if you are trying to match a very, very long string with "foo" near the beginning. That's why you may want to give the non-greedy hint.

Regex in grep

grep has its own way of doing regexes. Some of the typical regex stuff works, but most of it has to be escaped with a '\'. Note that it doesn't use \b for word boundaries, but \< and \> like ViM.

Regex in ViM

ViM works pretty much like grep in the default mode.

Two things:

  • '/' searches forward, while '?' searches backward. (These are normal mode commands.)
  • You can use 'n' and 'N' to find the next or previous matches. The directions are opposite when you are using '?'.

When substituting, the substitution string treats '&' specially. I always get hung up on this. '&' is the same as $0 in perl. If you want a literal '&', use '\&'.

Regex in Perl

Perl's regexes are by far the most powerful and useful. I use the full power of it regularly.

I like the look-ahead and look-behind patterns. Once you get your head around it, they are really useful.

I also like the non-capturing grouping parens: (?:...) I always use this unless I am interested in the group, which is rare.

Regex in Python

Python has adopted perl's regexes in the 're' module.

I wish there was a way to do like perl:

if (/x(y)z/) {
  # do something with $1--'y'.
}

But the closest you can get is this:

m = re.match(s)
if m:
  # Do something with m.group(1)
   

Ehh, can't complain too loudly, though, since Python's way is a million times more readable than perl's.

Regex in Javascript

Javascript is really weak, but the regexes aren't bad.

I recommend reading Mozilla's documentation on regexes here: https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions

Writing Your Own Regex Parser

It's actually fairly trivial to write your own regex parser from scratch. I'd encourage you to do it--once--just to see what it is like.

However, keep in mind that there are some very advanced algorithms that dramatically reduce the time it takes to match regular expressions with strings that are not trivial. For this reason, most people just use the Perl Regex engine for their purposes.