Regex vs index() performance
Date: Jan 30 2012 | Tags: perl, regex, index, performance, benchmarkRegular expression or index()
Perl articles and books suggest not using regular expressions for matching strings if index() function can be used, i.e. if you know what string you are matching. I always thought that the difference would be ten times or more. To check, I created a simple perl script and ran the comparison on different number of iterations. This was not a complete scientific approach, but rather a test to see how the test values change and what can be concluded from the figures.
Benchmark
I assumed that there would be no difference in running regex or index() function for several times, like one or ten times. I was interested in 100, 1000, 10000, ... runs.
The first test was to match a string somewhere in the input string for given number of times. As this repetitiveness was implemented as for loop, time needed to execute number of loops with no operations was also measured and subtracted from the results.
| #iterations | index | regex | count | index(net) | regex(net) |
|---|---|---|---|---|---|
| 100 | 0 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 | 0 |
| 10000 | 0 | 0 | 0 | 0 | 0 |
| 100000 | 0.01 | 0.03 | 0 | 0.01 | 0.03 |
| 1000000 | 0.23 | 0.27 | 0.08 | 0.15 | 0.19 |
| 10000000 | 2.23 | 2.78 | 0.72 | 1.51 | 2.06 |
| 100000000 | 22.53 | 28.03 | 7.55 | 14.98 | 20.48 |
The net value (value measured for for loop subtracted from the measured value for index() and regex tests) is added to the two rightmost columns.
This graph illustrates the results.
The second test was an attempt to see if there is difference if the matched string is found at the beginnng. Two regular expressions were used, one as before ( /$match/ ), and one with caret operator to indicate that we want to match only at the beginning of the string. The results and a graphical representation follow.
| #iterations | index | regex | ^regex | count | index(net) | regex(net) | ^regex(net) |
|---|---|---|---|---|---|---|---|
| 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10000 | 0 | 0.02 | 0.02 | 0 | 0 | 0.02 | 0.02 |
| 100000 | 0.03 | 0.03 | 0.06 | 0 | 0.03 | 0.03 | 0.06 |
| 1000000 | 0.25 | 0.27 | 0.58 | 0.09 | 0.16 | 0.18 | 0.49 |
| 10000000 | 2.39 | 2.75 | 5.65 | 0.81 | 1.58 | 1.94 | 4.84 |
| 100000000 | 23.7 | 28.08 | 69.19 | 7.72 | 15.98 | 20.36 | 61.47 |
Conclusion
Two conclusions can be drawn from the tests.
- Both index() and regex display linear growth following the number of iterations.
- Regex is somewhat slower (up to 30 percent, though more data is needed for better analysis) but not dramatically slower. This means that unless performance is crucial, regex can be used.
A final observation: matching with caret operator in regex slows it by at least two times. At the moment I can't think of the reason for this.
The script
This script was used to perform tests. Three similar functions were written. If they were combined into a single function, I'd have to use if ... then to determine whether index() or regex should be used, and that would affect the results.
use strict;
use warnings;
use Benchmark;
# Prepare input data
my $input = "IEJMBMFXZZDIGNBLKZODUTCFFKBJKCMCYTSSBHXY";
my $match = "MBMFXZZ";
my $count = $ARGV[0]
|| die "Specify number of iterations as command-line argument!";
# Benchmark execution times
my $tc = get_times_count();
my $ti = get_times_index();
my $tr = get_times_regex();
# Output data
print "count ($count iterations): $tc\n";
print "index ($count iterations): $ti\n";
print "regex ($count iterations): $tr\n";
exit 0;
# Benchmark for 'for' loop only
sub get_times_count {
my $t0 = Benchmark->new();
for (my $i = 0; $i < $count; $i++) {
}
my $t1 = Benchmark->new();
return timestr(timediff($t1, $t0));
}
# Benchmark for index function
sub get_times_index {
my $t0 = Benchmark->new();
for (my $i = 0; $i < $count; $i++) {
$_ = index($match, $input);
}
my $t1 = Benchmark->new();
return timestr(timediff($t1, $t0));
}
# Benchmark for regular expressions
sub get_times_regex {
my $t0 = Benchmark->new();
for (my $i = 0; $i < $count; $i++) {
$input =~ /$match/;
}
my $t1 = Benchmark->new();
return timestr(timediff($t1, $t0));
}
0 comments