Regular Expressions

Overview

This page contains recommendations for using regular expressions.

General

  • Do not use regular expressions if there is a clean non-regex solution, for example, searching for a substring or using if conditions.

  • Use a regular expression engine that guarantees linear-time matching, at least when the expression itself is user-provided or when a "hard-coded" expression is matched against user-controlled data; see the Linear time regular expression matching implementation section.

Clarification

Many regex engines rely on backtracking, which in the worst case makes matching time grow exponentially with the input size; see the Vulnerability Mitigation: Regular Expression Denial of Service (ReDoS) page.
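
As a minimal sketch of the first recommendation above (prefer a clean non-regex solution), the Python snippet below replaces three simple patterns with built-in string operations. The function names and sample checks are hypothetical and chosen only for illustration.

```python
# Hypothetical helpers: each replaces a simple regex with a string operation.

def has_admin_prefix(username: str) -> bool:
    # Instead of re.match(r'^admin_', username)
    return username.startswith("admin_")


def mentions_error(log_line: str) -> bool:
    # Instead of re.search(r'ERROR', log_line)
    return "ERROR" in log_line


def is_short_decimal(value: str) -> bool:
    # Instead of re.fullmatch(r'[0-9]{1,3}', value).
    # isascii() rejects non-ASCII digits that str.isdigit() would accept.
    return 1 <= len(value) <= 3 and value.isascii() and value.isdigit()


is_short_decimal('137')
# => True
is_short_decimal('137\nabc')
# => False
```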

  • Do not use multi-line matching mode in regexes that are used for validation. If multi-line mode cannot be avoided, make sure that full-string matching with ^...$ works as expected, or rewrite the regex with end-of-string anchors such as \A...\z (\A...\Z in Python).

    • Remember that in some engines multi-line matching is the default mode, for example in the built-in regex engine in Ruby.

Clarification

In multi-line mode, ^ and $ are matched differently: $ matches not only at the end of the string but also at the end of each line. So if a validation uses a regular expression in multi-line mode, an attacker can use a newline character (\x0a) to bypass it. Consider the following Python snippet, which does not enable multi-line mode.

import re

p = re.compile(r'^\d{1,3}$')

p.match('137') is not None
# => True
p.match('1337') is not None
# => False
p.match('abc') is not None
# => False
p.match('137\nabc') is not None
# => False

The regex in the snippet matches whole strings of one to three digits. However, enabling multi-line mode completely changes this behaviour.

import re

p = re.compile(r'^\d{1,3}$', re.MULTILINE)

p.match('137') is not None
# => True
p.match('1337') is not None
# => False
p.match('abc') is not None
# => False
p.match('137\nabc') is not None
# => True

In multi-line mode the string 137\nabc is matched, because $ now also matches at the end of the first line. To avoid this behaviour, disable multi-line mode (this is the preferred solution) or rewrite the regex using \A and \Z:

import re

# preferred
p = re.compile(r'^\d{1,3}$')

p.match('137\nabc') is not None
# => False

# or
p = re.compile(r'\A\d{1,3}\Z', re.MULTILINE)

p.match('137\nabc') is not None
# => False
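
One caveat about the preferred variant, shown as a short sketch: even without multi-line mode, Python's $ also matches just before a single trailing newline, so ^...$ accepts '137\n'. If that matters for the validation, \Z or the fullmatch method is stricter. The snippet uses only the standard re module; the variable names are illustrative.

```python
import re

loose = re.compile(r'^\d{1,3}$')
strict = re.compile(r'\d{1,3}')

loose.match('137\n') is not None
# => True, because $ also matches before a trailing newline
strict.fullmatch('137\n') is not None
# => False
strict.fullmatch('137') is not None
# => True
strict.fullmatch('137\nabc') is not None
# => False
```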

  • Validate strings before matching them, at least for length and allowed characters; see the Input Validation page.

  • Use the following practices to simplify regular expressions and reduce the likelihood of problems with catastrophic backtracking:

    • Avoid nested quantifiers, for example (a+)+.

    • Try to be as precise as possible and avoid the . pattern.

    • Use reasonable ranges, for example {1,10}, for repeating patterns instead of unbounded * and + patterns.

    • Narrow character classes to what is actually needed, for example [ab] instead of [a-z0-9].
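
The practices above, together with the input validation recommendation, can be sketched in Python as follows. The pattern, the length limit, and the function name are hypothetical and only illustrate the idea: check the length first, use an explicit character class, and bound every repetition.

```python
import re

MAX_LENGTH = 32  # hypothetical limit, checked before any regex runs

# Explicit character classes and bounded repetition: no '.', '*' or '+',
# and no nested quantifiers.
USERNAME_RE = re.compile(r'[a-z][a-z0-9_]{2,31}')


def is_valid_username(value: str) -> bool:
    if len(value) > MAX_LENGTH:
        return False
    # fullmatch requires the whole string to match, so no anchors are needed.
    return USERNAME_RE.fullmatch(value) is not None


is_valid_username('alice_01')
# => True
is_valid_username('alice\nadmin')
# => False
is_valid_username('a' * 1000)
# => False
```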

Detecting catastrophic backtracking

You can use doyensec/regexploit to detect regexes that are prone to catastrophic backtracking and can therefore lead to ReDoS.

regexploit is not guaranteed to find every vulnerable regex; it is just one relatively easy way to check your patterns.

$ python3 -m venv .env
$ source .env/bin/activate
$ pip install regexploit
$ regexploit
v\w*_\w*_\w*$
Pattern: v\w*_\w*_\w*$
---
Worst-case complexity: 3 ⭐⭐⭐ (cubic)
Repeated character: [5f:_]
Final character to cause backtracking: [^WORD]
Example: 'v' + '_' * 3456 + '!'
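
To sanity-check a finding like the one above, the reported example can be scaled down and timed with the engine you actually use. The sketch below does this with Python's built-in re module. The helper name time_search and the chosen sizes are assumptions for illustration, and timings depend on the machine, so no expected output is shown. The repeat count from the regexploit example (3456) is deliberately reduced, because cubic growth makes the full-size input very slow.

```python
import re
import time

pattern = re.compile(r'v\w*_\w*_\w*$')


def time_search(n: int) -> float:
    """Return the seconds spent searching 'v' + '_' * n + '!'."""
    text = 'v' + '_' * n + '!'
    start = time.perf_counter()
    result = pattern.search(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing '!' prevents any match
    return elapsed


for n in (50, 100, 200):
    print(n, time_search(n))
```

If the pattern really behaves cubically in the engine under test, doubling n should multiply the measured time by roughly eight.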

Linear time regular expression matching implementation

The RE2 engine guarantees matching in time linear in the size of the input. Prefer a library that is based on RE2.

In Go, use the standard regexp package, which accepts RE2 syntax and guarantees matching in time linear in the size of the input.

package main

import (
    "fmt"
    "regexp"
)

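func main() {
    inputData := "some text to match"
    // MatchString reports whether any substring matches; anchor the
    // pattern with ^ and $ if the whole string must match.
    match, err := regexp.MatchString("[a-z]{1,16}", inputData)
    if err != nil {
        // An error here means the pattern failed to compile.
        fmt.Println("Error:", err)
        return
    }
    fmt.Println("Match:", match)
}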
