System Edge | blog by dmars

Regex vs index() performance

Date: Jan 30 2012 | Tags: perl, regex, index, performance, benchmark

Regular expression or index()

Perl articles and books suggest not using regular expressions for matching strings if index() function can be used, i.e. if you know what string you are matching. I always thought that the difference would be ten times or more. To check, I created a simple perl script and ran the comparison on different number of iterations. This was not a complete scientific approach, but rather a test to see how the test values change and what can be concluded from the figures.

Benchmark

I assumed that there would be no difference in running regex or index() function for several times, like one or ten times. I was interested in 100, 1000, 10000, ... runs.

The first test was to match a string somewhere in the input string for given number of times. As this repetitiveness was implemented as for loop, time needed to execute number of loops with no operations was also measured and subtracted from the results.

#iterationsindexregexcountindex(net)regex(net)
10000000
100000000
1000000000
1000000.010.0300.010.03
10000000.230.270.080.150.19
100000002.232.780.721.512.06
10000000022.5328.037.5514.9820.48

The net value (value measured for for loop subtracted from the measured value for index() and regex tests) is added to the two rightmost columns.

This graph illustrates the results.

Time [s] to match in the middle of the string

The second test was an attempt to see if there is difference if the matched string is found at the beginnng. Two regular expressions were used, one as before ( /$match/ ), and one with caret operator to indicate that we want to match only at the beginning of the string. The results and a graphical representation follow.

#iterationsindexregex^regexcountindex(net)regex(net)^regex(net)
1000000000
10000000000
1000000.020.02000.020.02
1000000.030.030.0600.030.030.06
10000000.250.270.580.090.160.180.49
100000002.392.755.650.811.581.944.84
10000000023.728.0869.197.7215.9820.3661.47
Time [s] to match at the beginning of the string

Conclusion

Two conclusions can be drawn from the tests.

  1. Both index() and regex display linear growth following the number of iterations.
  2. Regex is somewhat slower (up to 30 percent, though more data is needed for better analysis) but not dramatically slower. This means that unless performance is crucial, regex can be used.

A final observation: matching with caret operator in regex slows it by at least two times. At the moment I can't think of the reason for this.

The script

This script was used to perform tests. Three similar functions were written. If they were combined into a single function, I'd have to use if ... then to determine whether index() or regex should be used, and that would affect the results.

use strict;
use warnings;
use Benchmark;

# Prepare input data
my $input = "IEJMBMFXZZDIGNBLKZODUTCFFKBJKCMCYTSSBHXY";
my $match = "MBMFXZZ";
my $count = $ARGV[0] 
    || die "Specify number of iterations as command-line argument!";

# Benchmark execution times
my $tc = get_times_count();
my $ti = get_times_index();
my $tr = get_times_regex();

# Output data
print "count ($count iterations): $tc\n";
print "index ($count iterations): $ti\n";
print "regex ($count iterations): $tr\n";

exit 0;

# Benchmark for 'for' loop only
sub get_times_count {
    my $t0 = Benchmark->new();
    for (my $i = 0; $i < $count; $i++) {
    }
    my $t1 = Benchmark->new();
    return timestr(timediff($t1, $t0));
}


# Benchmark for index function
sub get_times_index {
    my $t0 = Benchmark->new();
    for (my $i = 0; $i < $count; $i++) {
        $_ = index($match, $input);
    }
    my $t1 = Benchmark->new();
    return timestr(timediff($t1, $t0));
}

# Benchmark for regular expressions
sub get_times_regex {
    my $t0 = Benchmark->new();
    for (my $i = 0; $i < $count; $i++) {
        $input =~ /$match/;
    }
    my $t1 = Benchmark->new();
    return timestr(timediff($t1, $t0));
}

What do you think?




0 comments