Regular expressions: Rust vs F# vs Scala

Let's implement the following task: read the first 10M lines from a text file in the following format:

{Idents = [|"System"; "StubHelpers"; "StubHelpers"|];
Namespace = Some [|"System"; "StubHelpers"|];
IsPublic = true;
TopRequireQualifiedAccessParent = null;
IsAttribute = false;}
{Idents = [|"Microsoft"; "Win32"; "IAssemblyEnum"|];
Namespace = Some [|"Microsoft"; "Win32"|];
IsPublic = true;
TopRequireQualifiedAccessParent = null;
IsAttribute = false;}
then find all lines containing the Microsoft namespace and format the type names the usual way, e.g. "Microsoft.Win32.IAssemblyEnum".
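To make the task concrete, here is a minimal sketch of the transformation (Python, purely illustrative; each benchmarked implementation below does the same thing): match a line against the pattern, capture everything from "Microsoft" up to the closing `|]`, split the capture on the delimiters and re-join with dots.

```python
import re

# The same pattern the implementations below use: capture everything
# from "Microsoft" up to the closing |] of the array literal.
PATTERN = re.compile(r'\{.*(?P<name>Microsoft.*)\|\]')

def cleanup(line):
    m = PATTERN.search(line)
    if m is None:
        return None
    # Split the capture on the delimiters (; " and space), drop empty pieces.
    parts = [p for p in re.split(r'[;" ]', m.group("name")) if p]
    return ".".join(parts)

print(cleanup('{Idents = [|"Microsoft"; "Win32"; "IAssemblyEnum"|];'))
# prints Microsoft.Win32.IAssemblyEnum
```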

First, F#:

open System.Diagnostics
open System.Text.RegularExpressions
open System.IO
open System
open FSharp.Text.RegexProvider

type Pattern = Regex< @"\{.*(?<name>Microsoft.*)\|\]">

let search file pattern =
    let r = Pattern (RegexOptions.ExplicitCapture ||| RegexOptions.Compiled)
    let delims = [|';'; '"'; ' '|]
    File.ReadLines file
    |> Seq.take 10000000
    |> Seq.choose (fun line ->
        match r.Match line with
        | m when m.Success -> Some m.name.Value
        | _ -> None)
    |> Seq.map (fun name ->
        String.Join(".", name.Split(delims, StringSplitOptions.RemoveEmptyEntries)))
    |> Seq.toArray

[<EntryPoint>]
let main _ =
    let sw = Stopwatch.StartNew()
    let results = search @"d:\big.txt" @"\{.*(?<name>Microsoft.*)\|\]"
    sw.Stop()
    printfn "Found %d lines, Elapsed %O" results.Length sw.Elapsed
    0
Now Rust:

extern crate time;
extern crate regex;
extern crate itertools;

use time::{PreciseTime, Duration};
use std::io::prelude::*;
use std::io::BufReader;
use std::fs::File;
use regex::Regex;
use itertools::*;

fn tc<T, F: Fn() -> T>(f: F) -> (T, Duration) {
    let start = PreciseTime::now();
    let res = f();
    (res, start.to(PreciseTime::now()))
}

#[derive(Debug)]
enum Error {
    Io(std::io::Error),
    Regex(regex::Error),
}

impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Error {
        Error::Io(e)
    }
}

impl From<regex::Error> for Error {
    fn from(e: regex::Error) -> Error {
        Error::Regex(e)
    }
}

// Builds a closure that tests equality against any of the given characters.
macro_rules! any { ($x:expr, $($y:expr),*) => (|e| e == $x $(|| e == $y)*) }

fn cleanup_name(line: &str) -> String {
    line.split(any!(' ', ';', '"'))
        .filter(|x| !x.is_empty())
        .join(".")
}

fn search(file: &str, pattern: &str) -> Result<Vec<String>, Error> {
    let r = try!(Regex::new(pattern));
    let f = try!(File::open(file));
    let buff = BufReader::new(f);
    buff.lines()
        .take(10_000_000)
        .fold_results(Vec::new(), |mut acc, line| {
            match r.captures(&line).and_then(|caps| caps.name("name")) {
                Some(name) => acc.push(cleanup_name(name)),
                None => ()
            }
            acc
        })
        .map_err(Error::Io)
}

fn main() {
    let (res, elapsed) = tc(|| {
        search("d:\\big.txt", r"(?i)\{.*(?P<name>microsoft.*)\|\]").unwrap()
    });
    println!("Res count = {}, Elapsed {}", res.len(), elapsed);
}
After several launches the file was cached by the OS and both implementations were no longer IO-bound. The F# version took 29 seconds and 31 MB of RAM at peak; the Rust one took 11 seconds and 18 MB.

The Rust code is twice as long as the F# one, but it handles all possible errors explicitly: no surprises at runtime at all. The F# code may throw exceptions (who knows which ones? Nobody). It's possible to wrap all calls into the .NET framework with `Choice.attempt (fun _ -> ...)`, then define custom error types for the regex-related code, for the IO code and a top-level one, but the result would be even longer than the Rust version, harder to read, and it would still give no guarantee that we've caught all possible exceptions.
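The kind of wrapping described above can be sketched like this (Python used purely as neutral pseudocode; `attempt` is a hypothetical helper, not an F# or .NET API): every call that may throw is converted into an explicit ok/error value, which is what Rust's `Result` gives for free.

```python
# Hypothetical helper: convert an exception-throwing call into an
# explicit ("ok", value) / ("error", exception) pair, the moral
# equivalent of Rust's Result or F#'s Choice.
def attempt(f):
    try:
        return ("ok", f())
    except Exception as e:
        return ("error", e)

ok = attempt(lambda: int("42"))      # succeeds with the parsed value
err = attempt(lambda: int("oops"))   # fails with the captured exception
```

The cost the paragraph above describes is exactly this: every call site must be routed through such a wrapper, and nothing stops a forgotten call from throwing anyway.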

Update 4 Jan 2016: Scala added:

import scala.io._
import scala.util.matching.Regex

object Utils {
  def time[A](f: => A): (A, Long) = {
    val s = System.nanoTime
    val ret = f
    (ret, System.nanoTime - s)
  }
}

object Search {
  def search(file: String, pattern: Regex): List[String] = {
    val delims = Array(';', '"', ' ')
    Source
      .fromFile(file)
      .getLines
      .take(10000000)
      .collect { case pattern(name) => name }
      .map(_
        split delims
        filter (_.nonEmpty)
        mkString ".")
      .toList
  }
}

object Main {
  def main(args: Array[String]) = {
    val (results, elapsed) = Utils.time {
      Search.search("d:\\big.txt", """\{.*(?<name>Microsoft.*)\|\].*""".r)
    }
    println(s"Found ${results.length} lines, elapsed ${elapsed / 1e6} ms")
  }
}
Ok, it turns out that regex performance may depend on whether the pattern is case sensitive or not. What's worse, I tested F# with a case-insensitive pattern, but Rust with a case-sensitive one. Anyway, as I've upgraded my machine recently (i5-750 => i7-4790K), I've rerun the F# and Rust versions in both regex modes and added Scala to the mix. First, case sensitive mode:
  • F# (F# 4.0, .NET 4.6.1) - 4.8 secs
  • Scala (2.11.7, Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65) - 3.5 secs
  • Rust (1.7.0-nightly (bfb4212ee 2016-01-01)) - 5.9 secs
Now, case insensitive:
  • F# (F# 4.0, .NET 4.6.1) - 15.5 secs
  • Scala (2.11.7, Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65) - 3.2 secs
  • Rust (1.7.0-nightly (bfb4212ee 2016-01-01)) - 6.1 secs

Although the case sensitive patterns perform roughly the same on all the platforms, it's quite surprising that Rust is not the winner.

Scala is somehow faster in case insensitive mode, Rust is slightly slower, and now the question is: what's wrong with the .NET implementation? It's more than 3 times slower than its own case sensitive mode and than all the others.
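One way to check whether such a penalty comes from the regex engine itself (rather than from IO or string handling) is to time the two modes over the same in-memory input. A rough sketch of that kind of micro-benchmark (Python, illustrative only; absolute numbers and the size of the gap differ per engine):

```python
import re
import time

# Synthetic in-memory input: the same line repeated, so IO is out of the picture.
lines = ['{Idents = [|"Microsoft"; "Win32"; "IAssemblyEnum"|];'] * 100_000

def bench(pattern, flags=0):
    r = re.compile(pattern, flags)
    start = time.perf_counter()
    hits = sum(1 for line in lines if r.search(line))
    return hits, time.perf_counter() - start

hits_cs, t_cs = bench(r'\{.*(?P<name>Microsoft.*)\|\]')
hits_ci, t_ci = bench(r'\{.*(?P<name>microsoft.*)\|\]', re.IGNORECASE)

# Both modes must find exactly the same lines; only the timings may differ.
print(hits_cs, t_cs, hits_ci, t_ci)
```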

Update 4 Jan 2016: D added.

import std.stdio;
import std.range;
import std.algorithm;
import std.regex;
import std.container;
import std.datetime;

auto stopTime(void function() f) {
    auto sw = StopWatch(AutoStart.yes);
    f();
    sw.stop();
    return sw.peek.msecs;
}

auto search(string file, string pattern) {
    auto r = regex(pattern);
    auto f = File(file, "r");
    auto lineNum = 0;
    Array!string result;
    while (lineNum++ < 10_000_000 && !f.eof) {
        auto line = f.readln();
        auto m = matchFirst(line, r);
        if (!m.empty) {
            auto res =
                m["name"]
                .splitter!(a => a == ' ' || a == ';' || a == '"')
                .filter!(a => !a.empty)
                .join(".");
            result ~= res;
        }
    }
    return result;
}

int main(string[] argv) {
    auto elapsed = stopTime(function { search("d:\\big.txt", r"\{.*(?P<name>Microsoft.*)\|\]"); });
    writefln("Elapsed: %s", elapsed);
    return 0;
}

  • regex - 10.6 s (DMD), 7.8 s (LDC)
  • ctRegex! - 6.9 s (DMD), 6.6 s (LDC)

Update 6 Jan 2016: Elixir added:

defmodule BigFileRegexElixir do
  {:ok, file} = File.open "d:\\big.txt", [:read]
  {:ok, r} = Regex.compile "\\{.*(?<name>Microsoft.*)\\|\\]"

  file
  |> IO.stream(:line)
  |> Enum.take(10_000_000)
  |> Enum.map(fn line -> Regex.named_captures(r, line) end)
  |> Enum.filter_map(fn c -> c != nil end, fn c ->
       c["name"]
       |> String.split([" ", ";", "\""], trim: true)
       |> Enum.join(".")
     end)
  |> Enum.to_list

  File.close file
end
It takes 56 seconds to finish.

Update 6 Jan 2016: Haskell added:


module Main where

import System.IO
import Text.Regex.PCRE
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.IO as T.IO

regex :: T.Text -> [[T.Text]]
regex line =
    let l = T.unpack line in
    let groups = l =~ "\\{.*(Microsoft.*)\\|\\].*" :: [[String]] in
    map (map T.pack) groups

firstGroup :: [[T.Text]] -> Maybe T.Text
firstGroup [[_, x]] = Just x
firstGroup _ = Nothing

splitLine :: T.Text -> [T.Text]
splitLine = T.split (\x -> x == ' ' || x == ';' || x == '"')

-- drop the empty pieces left by adjacent delimiters before re-joining
cleanupName :: T.Text -> T.Text
cleanupName = T.intercalate (T.pack ".") . filter (not . T.null) . splitLine

main :: IO ()
main = do
    handle <- openFile "d:\\big.txt" ReadMode
    contents <- T.IO.hGetContents handle
    let matchedLines = map regex . take 10000000 . T.lines $ contents
    let dirtyLines = mapMaybe firstGroup matchedLines
    let cleanLines = map cleanupName dirtyLines
    let count = length cleanLines
    putStrLn . show $ count
    hClose handle
It takes 20 seconds.

Update 7 Jan 2016: Nemerle added:


using Nemerle.Collections;
using Nemerle.Text;
using Nemerle.Utility;
using System;
using System.IO;
using System.Console;
using System.Linq;
using System.Text.RegularExpressions;
using System.Diagnostics;

module Program {
    search(file: string): array[string] {
        def delims = array[';', '"', ' '];
        File.ReadLines(file)
            .Take(10_000_000)
            .MapLazy(line =>
                regexp match(line) {
                    | @"\{.*(?<name>Microsoft.*)\|\].*" => name
                    | _ => null })
            .MapToArrayFiltered(
                _ != null,
                name => String.Join(".", name.Split(delims, StringSplitOptions.RemoveEmptyEntries)))
    }

    Main(): void {
        def sw = Stopwatch.StartNew();
        def results = search(@"d:\big.txt");
        sw.Stop();
        WriteLine($"Found $(results.Length) lines, Elapsed $(sw.Elapsed)");
        _ = ReadLine();
    }
}
It takes 3.8 seconds (case sensitive) and 7.1 seconds (case insensitive).

Update 8 Jan 2016: Nemerle PEG added:


using System;
using System.Collections.Generic;
using System.Console;
using System.Linq;
using System.IO;
using System.Diagnostics;
using Nemerle.Collections;
using Nemerle.Peg;

[PegGrammar(Options = EmitDebugSources, start,
    grammar
    {
        any = ['\u0000'..'\uFFFF'];
        quote: void = "\"";
        ident: string = (!quote any)* quote (!quote any)+ quote;
        start: string = (!quote any)* quote "Microsoft" quote ident+;
    })]
public class Parser {
    ident(_: NToken, x: NToken): string {
        GetText(x)
    }

    start(_: NToken, ms: NToken, t: List[string]): string {
        GetText(ms) + "." + string.Join(".", t)
    }
}

module Program {
    search(file: string): array[string] {
        def parser = Parser();
        File.ReadLines(file)
            .Take(10_000_000)
            .MapLazy(parser.Parse)
            .MapToArrayFiltered(Option.IsSome, _.Value)
    }

    Main() : void {
        def sw = Stopwatch.StartNew();
        def results = search(@"d:\big.txt");
        sw.Stop();
        WriteLine($"Found $(results.Length) lines, Elapsed $(sw.Elapsed)");
    }
}
It takes 4.1 seconds.
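The grammar above sidesteps the regex engine entirely: it scans for quoted tokens and accepts a line whose first quoted token is "Microsoft". A rough non-regex equivalent of that idea (Python, illustrative only; not the exact Nemerle PEG semantics):

```python
# Collect the quoted tokens on a line; accept it if the first one is
# "Microsoft" and at least one more token follows, as the PEG rule requires.
def parse(line):
    tokens = line.split('"')[1::2]  # text between successive quote pairs
    if len(tokens) >= 2 and tokens[0] == "Microsoft":
        return ".".join(tokens)
    return None

print(parse('{Idents = [|"Microsoft"; "Win32"; "IAssemblyEnum"|];'))
# prints Microsoft.Win32.IAssemblyEnum
```

Skipping backtracking regex matching in favor of a single linear scan is plausibly why the PEG version is competitive with the fastest regex implementations above.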

All results so far:

              Case sensitive   Case insensitive
F#            4.80             15.50
Scala         3.50             3.20
Rust          5.90             6.10
D (DMD)       6.90             -
D (LDC)       6.60             -
Elixir        56.00            -
Haskell       20.00            -
Nemerle       3.80             7.10
Nemerle PEG   4.10             4.20



Update 3 Dec 2017: Rust's regex crate updated and code cleanup, F# run on .NET Core 2.0:

F#

open System.Diagnostics
open System.Text.RegularExpressions
open System.IO
open System

let search file pattern =
    let r = Regex(pattern, RegexOptions.ExplicitCapture ||| RegexOptions.Compiled ||| RegexOptions.IgnoreCase)
    let delims = [|';'; '"'; ' '|]
    [| for line in File.ReadLines file |> Seq.take 10_000_000 do
        match r.Match line with
        | m when m.Success ->
            // use the named group, not the whole match
            yield m.Groups.["name"].Value.Split(delims, StringSplitOptions.RemoveEmptyEntries) |> String.concat "."
        | _ -> () |]

[<EntryPoint>]
let main _ =
    let sw = Stopwatch.StartNew()
    let results = search @"d:\big.txt" @"\{.*(?<name>Microsoft.*)\|\]"
    sw.Stop()
    printfn "Found %d lines, Elapsed %O" results.Length sw.Elapsed
    0

Rust

#![feature(universal_impl_trait)]

extern crate time;
extern crate regex;
extern crate itertools;
#[macro_use]
extern crate error_chain;

use time::{PreciseTime, Duration};
use std::io::prelude::*;
use std::io::BufReader;
use std::fs::File;
use regex::Regex;

mod errors {
    error_chain! {
        foreign_links {
            Io(::std::io::Error);
            Regex(::regex::Error);
        }
    }
}

use errors::*;

fn tc<T>(f: impl Fn() -> Result<T>) -> Result<(T, Duration)> {
    let start = PreciseTime::now();
    let res = f()?;
    Ok((res, start.to(PreciseTime::now())))
}

fn cleanup_name(line: &str) -> String {
    use itertools::*;
    line.split(|x| x == ' ' || x == ';' || x == '"')
        .filter(|x| !x.is_empty())
        .join(".")
}

fn search(file: &str, pattern: &str) -> Result<Vec<String>> {
    let r = Regex::new(pattern)?;
    let f = File::open(file)?;
    let buff = BufReader::new(f);
    let mut res = Vec::new();
    for line in buff.lines().take(10_000_000) {
        let line = line?;
        if let Some(name) = r.captures(&line).and_then(|caps| caps.name("name")) {
            res.push(cleanup_name(name.as_str()))
        }
    }
    Ok(res)
}

fn run() -> Result<()> {
    let pat = r"(?i)\{.*(?P<name>microsoft.*)\|\]";
    let (res, elapsed) = tc(|| search("d:\\big.txt", pat))?;
    println!("Res count = {}, Elapsed {}", res.len(), elapsed);
    Ok(())
}

fn main() {
    if let Err(ref e) = run() {
        use std::io::Write;
        use error_chain::ChainedError;
        let stderr = &mut ::std::io::stderr();
        let errmsg = "Error writing to stderr";
        writeln!(stderr, "{}", e.display_chain()).expect(errmsg);
        std::process::exit(1);
    }
}

The Rust version now runs 2x faster; F# is slower in case sensitive mode and faster in case insensitive mode on Core:

All results so far:


              Case sensitive   Case insensitive
F#            6.57             9.56
Scala         3.50             3.20
Rust          2.97             3.02
D (DMD)       6.90             -
D (LDC)       6.60             -
Elixir        56.00            -
Haskell       20.00            -
Nemerle       3.80             7.10
Nemerle PEG   4.10             4.20

I removed Elixir from the chart as it's a clear outlier and its results made the chart harder to read.

Comments

Anonymous said…
To be fair, regular expressions are known to be a bit of a weak spot in Rust currently. Efforts are in progress to improve the implementation, but they're still a work in progress, performance-wise.
Anonymous said…
Nice post. Did you compile Rust code with optimizations (--release flag in cargo)?
Vasily said…
@V: yes, of course I ran it via `cargo run --release`. In debug mode it runs ~30 times slower (!), 177 seconds vs 6 seconds.
jneem said…
Could you make the full input file available? I'm trying to improve rust's regex performance, and it would be nice to have more benchmarks to analyze.
Vasily said…
it is here https://drive.google.com/open?id=0B8HLQUKik9VtUWlOaHJPdG0xbnM
Anonymous said…
It would be interesting to see memory usage.
Unknown said…
What compilation flags did you use for D?
Unknown said…
D supports a variant of stdio.readln that re-uses the buffer, what's the performance like if you change it to that?
Unknown said…
Answering my own question, for D using readln(buf) shaves half a second off the D time on my laptop.
Unknown said…
FYI, the Rust version also allocates a new string per line.
kozzi11 said…
My results:

D(LDC): 5.778s
D(GDC): 5.612s
D(DMD): 5.267s
scalac: 5.748s
rustc: 9.287s
Anonymous said…
I posted a faster version on the dlang forums, you can find it here if you want to give it a whirl:

https://paste.ee/p/Bb1Ns

you may need to alter the filename on line 10, you appear to be on Windows and I adjusted it for my Linux OS.

It runs twice as fast as the original ctRegex implementation in your blog post for me. The output is the same, feel free to verify.

Elapsed: 3187
~/regex 3.17s user 0.04s system 99% cpu 3.217 total

Elapsed: 6283
~/regex2 6.15s user 0.14s system 99% cpu 6.307 total


compiled with ldc -O3 -release -boundscheck=off -singleobj regex.d
dmd version is ~70% slower now
Bye.
raichoo said…
I took the liberty to modify the Haskell example to use ByteString instead of constantly packing
and unpacking Text. This now runs in about 5 seconds on my laptop. https://gist.github.com/raichoo/3ae8dd14699b1f731916
Unknown said…
FYI, running this on the latest Rust regex library (0.1.51) gives a 2-3x boost.
KubuS said…
I bet .NET is converting all strings to UPPER-CASE or lower-case before comparison instead of comparing strings on char/ASCII code level.
Elixir code should not be placed directly inside defmodule, when you don't want compile time execution. Besides, comparing a compile time of Elixir code with run times for other languages is unfair.
Besides, take a look at https://github.com/elixir-lang/flow for best practices of parallel computations in Elixir.
Vasily said…
I updated results on:

* the latest Rust compiler and regex crate
* F# 4.1 on .NET Core 2.0

Rust is not the leader, F# results is somewhat better.
Vasily said…
Typo, Rust is _now_ the leader.
