Regular expression (regex) performance: The fundamental guide
There is a good chance that you come across regular expressions from time to time if you are in any way involved in IT operations or development. They are not easy to learn, and successfully matching a string with a regex can be a very rewarding feeling after a long debugging session.
Once you feel familiar and experienced enough, the next step is to learn how to write efficient regular expressions that do not suddenly degrade in performance.
There are many ways to optimize regular expressions and many depend on the input you are running the expression against. This post aims to arm you with the most important techniques.
I expect that you are already somewhat experienced with regular expressions. If you are not, this great site should get you started with a nice interactive tutorial.
Be specific
Tell the regex engine what you are trying to match. Avoid .* whenever you can, because it leads to horrible backtracking (the regex engine jumping to the end of the input and then going back character by character). Giving as much detail as possible about the kind of characters (digit, whitespace, letter?), the sequence they appear in (four digits next, then a whitespace, then two letters) and where to expect them is key to performance.
Many people will have seen a regular expression that tries to match a date before. I benchmarked three regular expressions that search for an ISO 8601 date (on Java 8, using the regex engine from java.util.regex):
Input string:
2015-12-15T07:36:25+00:00 sundaysister kernel[0]:
**** [IOBluetoothHostControllerUSBTransport][ClearFeatureInterruptEndpointHalt] --
successfully posting another read for the mInt0InterruptPipe --
mInterruptPipeInOutstandingIOCount = 1 -- this = 0xb800
veryDescriptiveMatch: ^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}
descriptiveMatch: .{4}-.{2}-.{2}T.{2}:.{2}:.{2}\+.{2}:.{2}
openMatch: .*-.*-.*T.*:.*:.*\+
You might think that the openMatch variant is written unrealistically badly, but I have seen expressions like it in the wild, and it shows nicely how much slower unspecific regular expressions are.
Let’s look at the benchmark results:
# Run complete. Total time: 00:03:53
Benchmark Mode Cnt Score Error Units
DateRegexBenchmark.descriptiveMatch avgt 50 0.237 ± 0.003 us/op
DateRegexBenchmark.openMatch avgt 50 32.325 ± 0.361 us/op
DateRegexBenchmark.veryDescriptiveMatch avgt 50 0.190 ± 0.002 us/op
By telling the regular expression engine very specifically what we are searching for, we not only get more reliable results but also see a huge performance improvement.
If you can be specific, be specific.
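For illustration, here is a minimal sketch (not part of the benchmark code below) that uses the very descriptive pattern to pull the timestamp out of a log line. Compiling the Pattern once and reusing it also matters for performance:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecificDateExample {
    // The "very descriptive" pattern from above, compiled once and reused.
    private static final Pattern ISO_8601 = Pattern.compile(
            "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\+\\d{2}:\\d{2}");

    public static void main(String[] args) {
        String line = "2015-12-15T07:36:25+00:00 sundaysister kernel[0]: ...";
        Matcher matcher = ISO_8601.matcher(line);
        if (matcher.find()) {
            System.out.println(matcher.group()); // 2015-12-15T07:36:25+00:00
        }
    }
}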
Lazy quantifiers
Replacing greedy quantifiers (for example .*) with lazy quantifiers (.*?) can improve performance, in many cases without changing the result. A greedy quantifier tells the regex engine to run to the end of the input and then backtrack until it finds what you are looking for, which can be slow. Changing it to a lazy quantifier makes the engine read from the beginning of the string until it finds a match.
You might already smell a caveat here: if the part you are trying to match is closer to the end of the input, a lazy quantifier can actually hurt performance.
The following benchmark showcases this behavior:
Input strings:
"early":
at=info status=302 method=GET
path="/repo/debian/dists/trusty/1.2/binary-amd64/Packages.gz"
host=packages.graylog2.org request_id=f9de9767-2aa1-4e64-8d82-36f8ace006e1
fwd="54.215.46.35" dyno=web.1 connect=0ms service=1ms
bytes=287
"late":
at=info method=GET path="/repo/debian/dists/trusty/1.2/binary-amd64/Packages.gz"
host=packages.graylog2.org request_id=f9de9767-2aa1-4e64-8d82-36f8ace006e1
fwd="54.215.46.35" dyno=web.1 connect=0ms service=1ms status=302 bytes=287
Regular expressions:
"greedy":
.*status=(\d{3}).*
"lazy":
.*?status=(\d{3}).*
The two input strings have the status=302 that we want to match at different positions: One has it closer to the beginning and one has it closer to the end of the string.
# Run complete. Total time: 00:05:07
Benchmark Mode Cnt Score Error Units
LazyQuantifierBenchmark.greedyMatchEarly avgt 50 2.432 ± 0.053 us/op
LazyQuantifierBenchmark.greedyMatchLate avgt 50 0.788 ± 0.045 us/op
LazyQuantifierBenchmark.lazyMatchEarly avgt 50 0.623 ± 0.003 us/op
LazyQuantifierBenchmark.lazyMatchLate avgt 50 2.281 ± 0.154 us/op
You can see that performance changes drastically depending on where the part we want to match is located in the input: the greedy regex is slow with an early match and fast with a late match, while the lazy regex is fast with an early match and slow with a late match.
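The benchmark class for this section is not reproduced here, but a minimal sketch of how it could look, structured like the DateRegexBenchmark shown further down, might be (the abbreviated input strings are placeholders for the full log lines above):

import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class LazyQuantifierBenchmark {
    // Abbreviated versions of the "early" and "late" log lines shown above.
    private static final String EARLY = "at=info status=302 method=GET path=\"...\" fwd=\"...\" bytes=287";
    private static final String LATE = "at=info method=GET path=\"...\" fwd=\"...\" status=302 bytes=287";

    private static Pattern GREEDY;
    private static Pattern LAZY;

    @Setup
    public void prepare() {
        GREEDY = Pattern.compile(".*status=(\\d{3}).*");
        LAZY = Pattern.compile(".*?status=(\\d{3}).*");
    }

    @Benchmark
    public void greedyMatchEarly(Blackhole bh) {
        bh.consume(GREEDY.matcher(EARLY).matches());
    }

    @Benchmark
    public void greedyMatchLate(Blackhole bh) {
        bh.consume(GREEDY.matcher(LATE).matches());
    }

    @Benchmark
    public void lazyMatchEarly(Blackhole bh) {
        bh.consume(LAZY.matcher(EARLY).matches());
    }

    @Benchmark
    public void lazyMatchLate(Blackhole bh) {
        bh.consume(LAZY.matcher(LATE).matches());
    }
}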
Benchmark, benchmark, benchmark
As always in computing, you should never make any change without measuring the result. Regex tips in particular depend heavily on the input string, and even on whether the matched part is closer to the beginning or the end of it, as we witnessed above.
Given the complexity of regular expressions, there is no way to be sure that a change makes a regular expression faster or that this particular change is the best way to attack the problem. The only solution for this is to benchmark the performance.
If you are benchmarking Java code, I recommend taking a look at the Java Microbenchmark Harness which is easy to get started with and a pleasure to use. This is a great blog post giving a good overview.
Here is the JMH code that was used to run a benchmark showcased in this post:
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class DateRegexBenchmark {
    private static final String MESSAGE = "2015-12-15T07:36:25+00:00 sundaysister ...";

    private static Pattern VERY_DESCRIPTIVE;
    private static Pattern DESCRIPTIVE;
    private static Pattern OPEN;

    @Setup
    public void prepare() {
        // Compile the patterns once so that Pattern.compile() is not part of the measurement.
        VERY_DESCRIPTIVE = Pattern.compile("^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\+\\d{2}:\\d{2}");
        DESCRIPTIVE = Pattern.compile(".{4}-.{2}-.{2}T.{2}:.{2}:.{2}\\+.{2}:.{2}");
        OPEN = Pattern.compile(".*-.*-.*T.*:.*:.*\\+");
    }

    @Benchmark
    public void veryDescriptiveMatch(Blackhole bh) {
        bh.consume(VERY_DESCRIPTIVE.matcher(MESSAGE).matches());
    }

    @Benchmark
    public void descriptiveMatch(Blackhole bh) {
        bh.consume(DESCRIPTIVE.matcher(MESSAGE).matches());
    }

    @Benchmark
    public void openMatch(Blackhole bh) {
        bh.consume(OPEN.matcher(MESSAGE).matches());
    }
}
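Two details matter in this setup: the patterns are compiled once in the @Setup method so that Pattern.compile() is not part of the measurement, and the match results are passed to a Blackhole so that the JIT compiler cannot eliminate the calls as dead code.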
Benchmark with different inputs
Make sure to benchmark your regular expressions with different strings as inputs. Look at this example of a regular expression that ends up in catastrophic backtracking with certain inputs:
Regular expression: (a+)+Z
"unexpectedInput1": aaaaaaaaaaaaaaaaaaaaaa
"unexpectedInput2": aaaaaaaaaaaaaaaaaaaaaaa
"expectedInput": aaaaaaaaaaaaaaaaaaaaaaaZ
# Run complete. Total time: 00:02:35
Benchmark Mode Cnt Score Error Units
UnexpectedInputBenchmark.expectedInput avgt 50 0.186 ± 0.034 us/op
UnexpectedInputBenchmark.unexpectedInput1 avgt 50 246381.032 ± 23649.682 us/op
UnexpectedInputBenchmark.unexpectedInput2 avgt 50 453904.349 ± 23412.333 us/op
The regular expression expects the input string to end with the character 'Z' and performs well under those circumstances. If the string does not end with 'Z', however, you can see the compute time growing exponentially with the length of the input string.
This is catastrophic backtracking in action where unexpected input can ruin your application performance. Attackers can even use this for Denial of Service (DoS) attacks on your systems if your regular expressions are vulnerable. This type of attack is called ReDoS.
Keep this in mind, and run your regex benchmarks with unexpected or even possibly malicious payloads whenever you can.
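The root of the problem is the nested quantifier: (a+)+ describes exactly the same strings as a plain a+, but gives the engine exponentially many ways to split the input into groups while backtracking. Here is a minimal sketch (not part of the benchmark code above) that demonstrates the difference and the trivial fix:

import java.util.regex.Pattern;

public class RedosExample {
    public static void main(String[] args) {
        String suspicious = "aaaaaaaaaaaaaaaaaaaaaaa"; // 23 'a's, no trailing 'Z'

        // Nested quantifier: backtracks exponentially when the match fails.
        Pattern vulnerable = Pattern.compile("(a+)+Z");
        // Equivalent pattern without the nested quantifier: fails immediately.
        Pattern safe = Pattern.compile("a+Z");

        time(vulnerable, suspicious); // hundreds of milliseconds at this length, roughly doubling per extra 'a'
        time(safe, suspicious);       // microseconds
    }

    private static void time(Pattern pattern, String input) {
        long start = System.nanoTime();
        boolean matched = pattern.matcher(input).matches();
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println(pattern.pattern() + " -> " + matched + " in " + micros + " us");
    }
}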
Continuously run the benchmarks
Regular expressions are hard. They are hard to read, and even harder if you did not write them yourself or are not very familiar with the problem they are solving. It is very easy to change a regex in a way that makes it fail catastrophically with certain inputs or that simply ruins its performance.
A system that runs tens of thousands of regular expressions per second can degrade terribly after a change that only makes a difference in the microsecond range for a single run.
Just like unit tests, you should run benchmarks of your regular expressions for every new revision of your code and get notified about the results. Tools like Jenkins or Travis CI are very suitable for this. (In practice you would probably run and test larger parts of the business logic and not isolated regular expressions.)
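As a very coarse sketch, assuming JUnit 5 (in a real setup you would rather run the JMH benchmarks in CI and compare their reports), a guard-rail test with a generous timeout can at least catch catastrophic regressions; the pattern and input here are placeholders:

import static org.junit.jupiter.api.Assertions.assertTimeoutPreemptively;

import java.time.Duration;
import java.util.regex.Pattern;

import org.junit.jupiter.api.Test;

class RegexRegressionTest {
    // Placeholder pattern and worst-case input; use the expressions and inputs from your own system.
    private static final Pattern STATUS = Pattern.compile(".*?status=(\\d{3}).*");
    private static final String WORST_CASE_LINE = "at=info method=GET path=\"...\" status=302 bytes=287";

    @Test
    void statusRegexStaysFast() {
        // Fails the build if a change makes the regex pathologically slow on this input.
        assertTimeoutPreemptively(Duration.ofMillis(50), () ->
                STATUS.matcher(WORST_CASE_LINE).matches());
    }
}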
Try out other regex engines
In the end, a regex engine is nothing but a library, and there is a high chance that your favorite programming language has several engines available. For example, this blog post compares Java regex libraries.
But be careful: regex engines may behave differently and give you unexpected results. Something that matched before might no longer match, and vice versa. Don't just swap out a regex engine without good test coverage of the affected parts.
Do you really need a regex?
Let's not fool ourselves: regular expressions are comparatively slow. Evaluate carefully whether there is a reasonable alternative. Classic string operations can be orders of magnitude faster, and libraries such as Google's Guava let you build complex string manipulations in a DSL-like, readable structure.
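As a sketch of such an alternative (plain JDK string operations; the helper method is hypothetical and not from this post), the status code from the key=value log line above can be extracted without a regex:

public class StatusCodeExtractor {
    // Returns the value of "status=" from a key=value formatted log line, or null if it is absent.
    static String extractStatus(String line) {
        int keyStart = line.indexOf("status=");
        if (keyStart < 0) {
            return null;
        }
        int valueStart = keyStart + "status=".length();
        int valueEnd = line.indexOf(' ', valueStart);
        return valueEnd < 0 ? line.substring(valueStart) : line.substring(valueStart, valueEnd);
    }

    public static void main(String[] args) {
        System.out.println(extractStatus("at=info status=302 method=GET bytes=287")); // prints 302
    }
}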
Remember that trying any of this also only makes sense if you are benchmarking and comparing the results.