Abstract

These notes summarize the .NET regular expression constructs used to build patterns.

Anchors

Anchors cause a match to succeed or fail depending on the current position, but do not advance through the string.

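Anchor     Match must…
^ or \A    start at the beginning of the string (with RegexOptions.Multiline, ^ also matches at the start of each line)
$ or \Z    occur at the end of the string or before a final \n (with RegexOptions.Multiline, $ also matches at the end of each line)
\z         occur at the very end of the string
\G         occur at the point where the previous match ended; for example, \G\(\d\) matches “(1)”, “(3)”, “(5)” in “(1)(3)(5)[7](9)”
\b         occur on a boundary between a \w and a \W character
\B         NOT occur on a \b boundary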
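
The difference between the anchors is easiest to see side by side. The following is a minimal sketch, assuming a console program with top-level statements; the sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "(1)(3)(5)[7](9)";

// \G pins each match to the position where the previous match ended,
// so matching stops at "[7]" and never reaches "(9)".
int contiguous = Regex.Matches(input, @"\G\(\d\)").Count;

// Without \G the engine skips "[7]" and also finds "(9)".
int anywhere = Regex.Matches(input, @"\(\d\)").Count;

// \Z tolerates one trailing newline; \z requires the true end of the string.
bool upperZ = Regex.IsMatch("done\n", @"done\Z");
bool lowerZ = Regex.IsMatch("done\n", @"done\z");

Console.WriteLine($"{contiguous} {anywhere} {upperZ} {lowerZ}"); // 3 4 True False
```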

Character Escapes

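Escaped char   Matches
\a             Bell character (\u0007)
\b             Backspace (\u0008), but only inside a character class such as [\b]; elsewhere \b is the word-boundary anchor
\t             Tab
\r             Carriage return (note: this is not the same as \n)
\v             Vertical tab
\f             Form feed
\n             Newline
\e             Escape character (\u001B)
\\             Backslash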

Character Classes

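Class      Matches any single…
[ae]       character in the group (case-sensitive by default)
[^aei]     character NOT in the group (case-sensitive by default)
[A-Z]      character in the range
.          character except \n (with RegexOptions.Singleline, \n as well)
\p{Lu}     character in the Unicode category Lu (uppercase letter)
\P{Lu}     character NOT in the Unicode category Lu
\w         “word” character (letter, decimal digit, or connector punctuation such as underscore)
\W         non-word character
\s         white-space character (space, tab, carriage return, newline, form feed, vertical tab)
\S         non-white-space character
\d         decimal digit (any Unicode decimal digit by default; only [0-9] with RegexOptions.ECMAScript)
\D         non-digit character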

Character Class Subtraction

[base_group-[excluded_group]]

Use          To match
[a-z-[m]]    any character from a to z except m
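
A short sketch of character class subtraction in use. The inputs are invented for illustration; only Regex.Match and Regex.Replace from System.Text.RegularExpressions are used.

```csharp
using System;
using System.Text.RegularExpressions;

// [a-z-[m]] accepts a through z except m, so the run stops at the first "m".
string run = Regex.Match("hammer", @"[a-z-[m]]+").Value;

// Subtracting the vowels from a-z leaves the lowercase consonants.
string masked = Regex.Replace("regex", @"[a-z-[aeiou]]", "_");

Console.WriteLine(run);    // ha
Console.WriteLine(masked); // _e_e_
```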

Custom Character Classes

  • [aeiouAEIOU] matches any vowel, both lowercase and uppercase.
  • [a-zA-Z0-9] matches any lowercase letter, uppercase letter, or digit.

Comments

(?# comment)    an inline comment
# comment       an end-of-line comment (requires RegexOptions.IgnorePatternWhitespace)

Conditional Evaluation

(?(subexpression)yes|no) or (?(name)yes|no) where:

  • subexpression is a subexpression to match
  • name is the name of a capturing group
  • yes is the subexpression to match if subexpression is matched OR if name is a valid, non-empty captured group
  • no is the subexpression to match if subexpression is not matched OR if name is not a valid, non-empty captured group

Consider a text document with paragraphs marked with a <PRIVATE> tag, and this regex pattern that uses conditional evaluation to assign the contents of paragraphs intended for public and private use to different capturing groups:

^(?<Pvt>\<PRIVATE\>\s)?(?(Pvt)((\w+\p{P}?\s)+)|((\w+\p{P}?\s)+))\r?$

The pattern breaks down as follows (it is intended for use with RegexOptions.Multiline, so that ^ and $ match at line boundaries):

  • Begin the match at the beginning of the line ( ^ )
  • (?<Pvt>\<PRIVATE\>\s)?
    • Match zero or one instances ( ? )
      • of <PRIVATE> followed by a whitespace character ( \s )
    • and assign the match to a capturing group named Pvt ( (?<Pvt>…) )
  • (?(Pvt)((\w+\p{P}?\s)+)
    • If the Pvt capturing group exists ( ?(Pvt) )
      • match one or more instances ( + )
        • of one or more ( + )
        • word characters ( \w )
        • followed by zero or one ( ? )
        • punctuation characters ( \p{P} )
        • followed by a single whitespace character ( \s )
      • and assign the match to the first capturing group ( () )
  • |((\w+\p{P}?\s)+)
    • Or if the Pvt capturing group does not exist ( | ),
      • match one or more instances ( + )
        • of one or more ( + )
        • word characters ( \w )
        • followed by zero or one ( ? )
        • punctuation characters ( \p{P} )
        • followed by a single whitespace character ( \s )
      • and assign the match to the third capturing group ( () )
  • Match an optional carriage return ( \r? ), then end the match at the end of the line or string ( $ )
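
The same idea can be tried on single lines. This is a simplified sketch, not the pattern above: it captures the rest of the line with .+ and uses two hypothetical group names, secret and open, to show which branch of the conditional ran.

```csharp
using System;
using System.Text.RegularExpressions;

// If the optional Pvt group matched, the "yes" branch fills "secret";
// otherwise the "no" branch fills "open". Both names are illustrative.
string pattern = @"^(?<Pvt><PRIVATE>\s)?(?(Pvt)(?<secret>.+)|(?<open>.+))$";

Match tagged = Regex.Match("<PRIVATE> Keep this internal.", pattern);
Match untagged = Regex.Match("Share this freely.", pattern);

string privateText = tagged.Groups["secret"].Value;
string publicText = untagged.Groups["open"].Value;

Console.WriteLine($"private: {privateText}"); // private: Keep this internal.
Console.WriteLine($"public: {publicText}");   // public: Share this freely.
```

Matching line by line avoids depending on how line endings interact with $ in multiline mode.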


Grouping

Atomic Group

An atomic group disables backtracking into its subexpression: once the subexpression has matched, the engine keeps that first match and never retries it with a shorter or different one:

(?> subexpression)

For example, consider the regular expression (a+)\w:

  • Matches one or more “a” characters ( a+ )
  • along with a word character that follows the sequence of “a” characters ( \w )
  • and assigns the sequence of “a” characters to the first capturing group ( () )

However, if the run of “a” characters extends to the end of the input, the engine backtracks so that the final “a” is matched by \w and is not included in the capture group. The regular expression ((?>a+))\w prevents this behavior: all consecutive “a” characters are matched without backtracking, so in that case the match fails instead.
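
A minimal sketch comparing the two patterns on an input made up for illustration, a string that consists only of “a” characters:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "aaaa";

// Backtracking version: a+ gives back its last "a" so that \w can match it.
Match backtracking = Regex.Match(input, @"(a+)\w");

// Atomic version: a+ keeps every "a", so nothing is left for \w and the match fails.
Match atomic = Regex.Match(input, @"((?>a+))\w");

Console.WriteLine(backtracking.Value);           // aaaa
Console.WriteLine(backtracking.Groups[1].Value); // aaa
Console.WriteLine(atomic.Success);               // False
```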

Other Grouping Constructs

Construct                  Description
(subexpression)            Capture the matched subexpression and assign it a 1-based ordinal
(?<name> subexpression)    Capture the matched subexpression into a named group
(?'name' subexpression)    Same as (?<name> subexpression)
(?: subexpression)         A non-capturing group
(?> subexpression)         An atomic group; the engine keeps the first match found for the subexpression and does not backtrack into it

By default, unnamed groups are numbered in the order in which they appear in the regular expression pattern, starting at 1. For example, the pattern (\w+?) matches one or more (+) word characters (\w), but as few as possible (?). Refer to this group later in the pattern via \1.

Lookahead & Lookbehind Assertions

Lookahead and lookbehind assertions are zero-width anchors that, without moving the pointer, look ahead or behind to test a condition.

Construct               Name                   Asserts that…
(?= subexpression)      Positive lookahead     what follows the current position in the string is subexpression
(?! subexpression)      Negative lookahead     what follows the current position in the string is NOT subexpression
(?<= subexpression)     Positive lookbehind    what precedes the current position in the string is subexpression
(?<! subexpression)     Negative lookbehind    what precedes the current position in the string is NOT subexpression

  • Positive lookahead, in other words: match the expression only if the subexpression matches. Example use case: match uppercase words that are followed by a non-punctuation character: @"\b[A-Z]+\b(?=\P{P})"
  • Negative lookahead, in other words: match the expression only if the subexpression fails to match. Example use case: match words that do not begin with “non”: @"\b(?!non)\w+\b"
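
The four assertions can be exercised on one sample sentence. This is an illustrative sketch; the input strings and variable names are invented here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "Cost: $25 for 3 items";

// Positive lookahead: digits that are followed by " items".
string beforeItems = Regex.Match(input, @"\d+(?=\sitems)").Value;

// Positive lookbehind: digits that are preceded by a dollar sign.
string afterDollar = Regex.Match(input, @"(?<=\$)\d+").Value;

// Negative lookbehind: a whole number that is not preceded by a dollar sign.
string notPrice = Regex.Match(input, @"(?<!\$)\b\d+").Value;

// Negative lookahead: words that do not begin with "non".
string[] words = Regex.Matches("nonsense is not nonfat", @"\b(?!non)\w+\b")
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();

Console.WriteLine($"{beforeItems} {afterDollar} {notPrice}"); // 3 25 3
Console.WriteLine(string.Join(",", words));                   // is,not
```

In every case the lookaround text itself is excluded from the match value.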

Quantifiers

Lazy quantifiers (??, *?, +?, {n,m}?) instruct the backtracking engine to try the minimum number of repetitions first, matching as few times as possible. Greedy quantifiers (?, *, +, {n,m}) match as many times as possible:

  • In .+(\d+)\., the greedy quantifier .+ consumes as much as it can, so the group (\d+) captures only the last digit of a number.
  • In .+?(\d+)\., the lazy quantifier .+? consumes as little as it can, so the group (\d+) captures the entire number.

Quantifier   Matches the previous element…                       Pattern     Matches
*            zero or more times                                  a.*c        “abcbc” in “abcbc”
*?           zero or more times, but as few times as possible    a.*?c       “abc” in “abcbc”
+            one or more times                                   be+         “bee” in “been”
+?           one or more times, but as few times as possible     be+?        “be” in “been”
?            zero or one time                                    rai?        “rai” in “rain”
??           zero or one time, but as few times as possible      rai??       “ra” in “rain”
{n}          exactly n times                                     ,\d{3}      “,043” in “1,043.6”
{n}?         exactly n times (same as {n})                       ,\d{3}?     “,043” in “1,043.6”
{n,}         at least n times                                    \d{2,}      “166” in “166”
{n,}?        at least n times, but as few times as possible      \d{2,}?     “16” in “166”
{n,m}        between n and m times                               \d{3,5}     “19302” in “193024”
{n,m}?       between n and m times, but as few times as possible \d{3,5}?    “193”, “024” in “193024”
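
The greedy and lazy behavior described in this section can be checked with a short sketch. The sample sentence is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "This sentence ends with the number 107325.";

// Greedy .+ takes everything it can and leaves (\d+) a single digit.
string greedy = Regex.Match(input, @".+(\d+)\.").Groups[1].Value;

// Lazy .+? stops as early as possible, so (\d+) receives the whole number.
string lazy = Regex.Match(input, @".+?(\d+)\.").Groups[1].Value;

// The same contrast applies to a bounded quantifier.
string atLeastTwo = Regex.Match("166", @"\d{2,}").Value;
string atLeastTwoLazy = Regex.Match("166", @"\d{2,}?").Value;

Console.WriteLine($"{greedy} {lazy}");               // 5 107325
Console.WriteLine($"{atLeastTwo} {atLeastTwoLazy}"); // 166 16
```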

Backreference Constructs

Backreferences allow a previously matched subexpression to be identified later in the same regular expression.

Construct    Description
\n           Match the value of subexpression n, where n is a digit
\k<name>     Match the value of the subexpression named name
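
A small sketch of both backreference forms, finding doubled characters in a sample string. The group name ch is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "trellis llama webbing";

// \1 repeats whatever the first (numbered) group just captured.
string[] byNumber = Regex.Matches(input, @"(\w)\1")
                         .Cast<Match>().Select(m => m.Value).ToArray();

// \k<ch> does the same for a named group.
string[] byName = Regex.Matches(input, @"(?<ch>\w)\k<ch>")
                       .Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(",", byNumber)); // ll,ll,bb
Console.WriteLine(string.Join(",", byName));   // ll,ll,bb
```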

Alternation Constructs

Substitutions

Character    Description
$1           Substitutes the substring matched by group 1
${foo}       Substitutes the substring matched by the group named foo
$$           Substitutes a literal “$” character
$&           Substitutes a copy of the whole match
$`           Substitutes all of the text of the input string before the match
$'           Substitutes all of the text of the input string after the match
$+           Substitutes the last group that was captured
$_           Substitutes the entire input string
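
Substitutions are used in the replacement string passed to Regex.Replace. A brief sketch with invented inputs and hypothetical group names (year, month, day):

```csharp
using System;
using System.Text.RegularExpressions;

// Numbered groups: swap two words.
string swapped = Regex.Replace("hello world", @"(\w+) (\w+)", "$2 $1");

// Named groups: reorder the parts of a date.
string date = Regex.Replace(
    "2024-05-17",
    @"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})",
    "${day}/${month}/${year}");

// $$ emits a literal dollar sign, followed here by group 1.
string price = Regex.Replace("10 apples", @"(\d+)", "$$$1");

// $& is the whole match; $_ is the entire input string.
string wrapped = Regex.Replace("abc", "b", "[$&]");
string whole = Regex.Replace("abc", "b", "$_");

Console.WriteLine(swapped); // world hello
Console.WriteLine(date);    // 17/05/2024
Console.WriteLine(price);   // $10 apples
Console.WriteLine(wrapped); // a[b]c
Console.WriteLine(whole);   // aabcc
```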

Best Practices

Security

When processing untrusted input, always pass a timeout.

A threat actor can pass malicious input, knowing that the input will be matched against a regular expression pattern, and cause a denial of service condition by crafting that input to take excessively long to process. A timeout mitigates this.

Regex vs String Methods

Use String methods when searching for a specific string.
Use Regex when searching for a pattern rather than a fixed string.

Use the Best Technique for a Specific Regex Operation

  • Validate a match with IsMatch().
  • Retrieve a single match with Match(). Retrieve subsequent matches with Match.NextMatch().
  • Replace matched text with Replace().
  • Escape characters that may be interpreted as regex operators with Escape(). Reverse the escaping with Unescape().
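
The methods listed above can be sketched together as follows. The inputs are invented for illustration, and the static Regex methods are used for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

// Validate: does the input contain a run of digits at all?
bool hasNumber = Regex.IsMatch("order-42", @"\d+");

// Retrieve one match, then walk to the next one.
Match first = Regex.Match("a1b22c333", @"\d+");
Match second = first.NextMatch();

// Replace every run of digits.
string masked = Regex.Replace("a1b22", @"\d+", "#");

// Escape turns operators into literals; Unescape reverses it.
string escaped = Regex.Escape("1+1=2?");
string restored = Regex.Unescape(escaped);

Console.WriteLine($"{hasNumber} {first.Value} {second.Value} {masked}"); // True 1 22 a#b#
Console.WriteLine($"{escaped} {restored}");                              // 1\+1=2\? 1+1=2?
```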

Use Timeouts

  • Use the Regex(string, RegexOptions, TimeSpan) constructor to pass a timeout value.
  • Set an application-wide default timeout by calling AppDomain.CurrentDomain.SetData("REGEX_DEFAULT_MATCH_TIMEOUT", TimeSpan) early in startup, before any Regex is used.
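
The sketch below shows a per-instance timeout guarding against a pattern that backtracks heavily. The pattern, the 250 ms budget, and the hostile input are assumptions chosen for illustration; with the default backtracking engine this input is expected to exceed the budget, but exact timing depends on the runtime.

```csharp
using System;
using System.Text.RegularExpressions;

// Nested quantifiers such as (a+)+ can backtrack exponentially on near-miss input.
var regex = new Regex(@"^(a+)+$", RegexOptions.None, TimeSpan.FromMilliseconds(250));

// Well-formed input matches quickly.
bool benign = regex.IsMatch("aaaa");

// Hypothetical hostile input: a long run of 'a' plus one character that forces failure.
string hostile = new string('a', 40) + "!";

bool timedOut = false;
try
{
    regex.IsMatch(hostile);
}
catch (RegexMatchTimeoutException ex)
{
    // The exception reports the timeout that was exceeded.
    timedOut = true;
    Console.WriteLine($"Gave up after {ex.MatchTimeout.TotalMilliseconds} ms");
}

Console.WriteLine(timedOut ? "Rejected input" : "Match completed in time");
```

Catching RegexMatchTimeoutException lets the caller reject the input instead of stalling the request.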