Regular expressions: Rust vs F# vs Scala
Let's implement the following task: read first 10M lines from a text file of the following format:
then find all lines containing Microsoft namespace in them, and format the type names the usual way, like "Microsoft.Win32.IAssemblyEnum".
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{Idents = [|"System"; "StubHelpers"; "StubHelpers"|]; | |
Namespace = Some [|"System"; "StubHelpers"|]; | |
IsPublic = true; | |
TopRequireQualifiedAccessParent = null; | |
IsAttribute = false;} | |
{Idents = [|"Microsoft"; "Win32"; "IAssemblyEnum"|]; | |
Namespace = Some [|"Microsoft"; "Win32"|]; | |
IsPublic = true; | |
TopRequireQualifiedAccessParent = null; | |
IsAttribute = false;} |
First, F#:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
open System.Diagnostics | |
open System.Text.RegularExpressions | |
open System.IO | |
open System | |
open FSharp.Text.RegexProvider | |
type Pattern = Regex< @"\{.*(?<name>Microsoft.*)\|\]"> | |
let search file pattern = | |
let r = Pattern (RegexOptions.ExplicitCapture ||| RegexOptions.Compiled) | |
let delims = [|';'; '"'; ' '|] | |
File.ReadLines file | |
|> Seq.take 10000000 | |
|> Seq.choose (fun line -> | |
match r.Match line with | |
| m when m.Success -> Some m.name.Value | |
| _ -> None) | |
|> Seq.map (fun name -> | |
String.Join(".", name.Split(delims, StringSplitOptions.RemoveEmptyEntries))) | |
|> Seq.toArray | |
[<EntryPoint>] | |
let main _ = | |
let sw = Stopwatch.StartNew() | |
let results = search @"d:\big.txt" @"\{.*(?<name>Microsoft.*)\|\]" | |
sw.Stop() | |
printfn "Found %d lines, Elapsed %O" results.Length sw.Elapsed | |
0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
extern crate time; | |
extern crate regex; | |
extern crate itertools; | |
use time::{PreciseTime, Duration}; | |
use std::io::prelude::*; | |
use std::io::BufReader; | |
use std::fs::File; | |
use regex::Regex; | |
use itertools::*; | |
fn tc<T, F: Fn() -> T>(f: F) -> (T, Duration) { | |
let start = PreciseTime::now(); | |
let res = f(); | |
(res, start.to(PreciseTime::now())) | |
} | |
#[derive(Debug)] | |
enum Error { | |
Io(std::io::Error), | |
Regex(regex::Error), | |
} | |
impl From<std::io::Error> for Error { | |
fn from(e: std::io::Error) -> Error { | |
Error::Io(e) | |
} | |
} | |
impl From<regex::Error> for Error { | |
fn from(e: regex::Error) -> Error { | |
Error::Regex(e) | |
} | |
} | |
macro_rules! any { ($x:expr, $($y:expr),*) => (|e| e == $x $(|| e == $y)*) } | |
fn cleanup_name(line: &str) -> String { | |
line.split(any!(' ', ';', '"')) | |
.filter(|x| !x.is_empty()) | |
.join(".") | |
} | |
fn search(file: &str, pattern: &str) -> Result<Vec<String>, Error> { | |
let r = try!(Regex::new(pattern)); | |
let f = try!(File::open(file)); | |
let buff = BufReader::new(f); | |
buff.lines() | |
.take(10_000_000) | |
.fold_results(Vec::new(), |mut acc, line| { | |
match r.captures(&line).and_then(|caps| caps.name("name")) { | |
Some(name) => acc.push(cleanup_name(name)), | |
None => () | |
} | |
acc | |
}) | |
.map_err(Error::Io) | |
} | |
fn main() { | |
let (res, elapsed) = tc(|| { | |
search("d:\\big.txt", r"(?i)\{.*(?P<name>microsoft.*)\|\]").unwrap() | |
}); | |
println!("Res count = {}, Elapsed {}", res.len(), elapsed); | |
} |
The Rust code is as twice as long as F# one, but it's handling all possible errors explicitly - no surprises at runtime at all. The F# code may throw some exceptions (who knows what kind of them? Nobody). It's possible to wrap all calls to .NET framework with `Choice.attempt (fun _ -> ...)`, then define custom Error types for regex related code, for IO one and a top-level one, and the code'd be even longer then Rust's, hard to read and it would still give no guarantee that we catch all possible exceptions.
Update 4 Jan 2016: Scala added:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import scala.io._ | |
import scala.util.matching.Regex | |
object Utils { | |
def time[A](f: => A): (A, Long) = { | |
val s = System.nanoTime | |
val ret = f | |
(ret, System.nanoTime - s) | |
} | |
} | |
object Search { | |
def search(file: String, pattern: Regex): List[String] = { | |
val delims = Array(';', '"', ' ') | |
Source | |
.fromFile(file) | |
.getLines | |
.take(10000000) | |
.collect { case pattern(name) => name } | |
.map(_ | |
split delims | |
filter (_.nonEmpty) | |
mkString ".") | |
.toList | |
} | |
} | |
object Main { | |
def main(args: Array[String]) = { | |
val (results, elapsed) = Utils.time { | |
Search.search("d:\\big.txt", """\{.*(?<name>Microsoft.*)\|\].*"""r) | |
} | |
println(s"Found ${results.length} lines, elapsed ${elapsed / 1e6} ms") | |
} | |
} |
- F# (F# 4.0, .NET 4.6.1) - 4.8 secs
- Scala (2.11.7, Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65) - 3.5 secs
- Rust (1.7.0-nightly (bfb4212ee 2016-01-01) - 5.9 secs
Now, case insensitive:
Although case sensitive patterns performs roughly the same on all the platforms, it's quite surprising that Rust is not the winner.
Scala is faster in case insensitive mode (?), Rust is slightly slower and now the question: what's wrong with .NET implementation?.. It performs more than 3 times slower that case sensitive and the others.
Update 4 Jan 2016: D added.
- F# (F# 4.0, .NET 4.6.1) - 15.5 secs
- Scala (2.11.7, Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65) - 3.2 secs
- Rust (1.7.0-nightly (bfb4212ee 2016-01-01) - 6.1 secs
Although case sensitive patterns performs roughly the same on all the platforms, it's quite surprising that Rust is not the winner.
Scala is faster in case insensitive mode (?), Rust is slightly slower and now the question: what's wrong with .NET implementation?.. It performs more than 3 times slower that case sensitive and the others.
Update 4 Jan 2016: D added.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import std.stdio; | |
import std.range; | |
import std.algorithm; | |
import std.regex; | |
import std.container; | |
import std.datetime; | |
auto stopTime(void function() f) { | |
auto sw = StopWatch(AutoStart.yes); | |
f(); | |
sw.stop(); | |
return sw.peek.msecs; | |
} | |
auto search(string file, string pattern) { | |
auto r = regex(pattern); | |
auto f = new File(file, "r"); | |
auto lineNum = 0; | |
Array!string result; | |
while (lineNum++ < 10_000_000 && !f.eof) { | |
auto line = f.readln(); | |
auto m = matchFirst(line, r); | |
if (!m.empty) { | |
auto res = | |
m["name"] | |
.splitter!(a => a == ' ' || a == ';' || a == '\"') | |
.filter!(a => a != """") | |
.join("."); | |
result ~= res; | |
} | |
} | |
return result; | |
} | |
int main(string[] argv) { | |
auto elapsed = stopTime(function { search("d:\\big.txt", r"\{.*(?P<name>Microsoft.*)\|\]"); }); | |
writefln("Elapsed: %s", elapsed); | |
return 0; | |
} |
- regex - 10.6 s (DMD), 7.8 s (LDC)
- ctRegex! - 6.9 s (DMD), 6.6 s (LDC)
Update 6 Jan 2016: Elixir added:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
defmodule BigFileRegexElixir do | |
{:ok, file} = File.open "d:\\big.txt", [:read] | |
{:ok, r} = Regex.compile "\\{.*(?<name>Microsoft.*)\\|\\]" | |
file | |
|> IO.stream(:line) | |
|> Enum.take(10_000_000) | |
|> Enum.map(fn line -> Regex.named_captures(r, line) end) | |
|> Enum.filter_map(fn c -> c != nil end, fn c -> | |
c["name"] | |
|> String.split([" ", ";", "\""], trim: true) | |
|> Enum.join(".") | |
end) | |
|> Enum.to_list | |
File.close file | |
end |
Update 6 Jan 2016: Haskell added:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
module Main where | |
import System.IO | |
import Text.Regex.PCRE | |
import Data.Maybe | |
import qualified Data.Text as T | |
import qualified Data.Text.IO as T.IO | |
regex :: T.Text -> [[T.Text]] | |
regex line = | |
let l = T.unpack line in | |
let groups = l =~ "\\{.*(Microsoft.*)\\|\\].*" :: [[String]] in | |
map (map T.pack) groups | |
firstGroup :: [[T.Text]] -> Maybe T.Text | |
firstGroup [[_, x]] = Just x | |
firstGroup _ = Nothing | |
splitLine :: T.Text -> [T.Text] | |
splitLine = T.split (\x -> x == ' ' || x == ';' || x == '"') | |
cleanupName :: T.Text -> T.Text | |
cleanupName = T.intercalate (T.pack ".") . splitLine | |
main :: IO () | |
main = do | |
handle <- openFile "d:\\big.txt" ReadMode | |
contents <- T.IO.hGetContents handle | |
let matchedLines = map regex . take 100 . T.lines $ contents | |
let dirtyLines = mapMaybe firstGroup matchedLines | |
let cleanLines = map cleanupName dirtyLines | |
let count = length cleanLines | |
putStrLn . show $ count | |
hClose handle |
Update 7 Jan 2016: Nemerle added:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
using Nemerle.Collections; | |
using Nemerle.Text; | |
using Nemerle.Utility; | |
using System; | |
using System.IO; | |
using System.Console; | |
using System.Linq; | |
using System.Text.RegularExpressions; | |
using System.Diagnostics; | |
module Program { | |
search(file: string): array[string] { | |
def delims = array[';', '"', ' ']; | |
File.ReadLines(file) | |
.Take(10_000_000) | |
.MapLazy(line => | |
regexp match(line) { | |
| @"\{.*(?<name>Microsoft.*)\|\].*" => name | |
| _ => null }) | |
.MapToArrayFiltered( | |
_ != null, | |
name => String.Join(".", name.Split(delims, StringSplitOptions.RemoveEmptyEntries))) | |
} | |
Main(): void { | |
def sw = Stopwatch.StartNew(); | |
def results = search(@"d:\big.txt"); | |
sw.Stop(); | |
WriteLine($"Found $(results.Length) lines, Elapsed $(sw.Elapsed)"); | |
_ = ReadLine(); | |
} | |
} |
Update 8 Jan 2016: Nemerle PEG added:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
using System; | |
using System.Collections.Generic; | |
using System.Console; | |
using System.Linq; | |
using System.IO; | |
using System.Diagnostics; | |
using Nemerle.Collections; | |
using Nemerle.Peg; | |
[PegGrammar(Optionts = EmitDebugSources, start, | |
grammar | |
{ | |
any = ['\u0000'..'\uFFFF']; | |
quote: void = "\""; | |
ident: string = (!quote any)* quote (!quote any)+ quote; | |
start: string = (!quote any)* quote "Microsoft" quote ident+; | |
})] | |
public class Parser { | |
ident(_: NToken, x: NToken): string { | |
GetText(x) | |
} | |
start(_: NToken, ms: NToken, t: List[string]): string { | |
GetText(ms) + "." + string.Join(".", t) | |
} | |
} | |
module Program { | |
search(file: string): array[string] { | |
def parser = Parser(); | |
File.ReadLines(file) | |
.Take(10_000_000) | |
.MapLazy(parser.Parse) | |
.MapToArrayFiltered(Option.IsSome, _.Value) | |
} | |
Main() : void { | |
def sw = Stopwatch.StartNew(); | |
def results = search(@"d:\big.txt"); | |
sw.Stop(); | |
WriteLine($"Found $(results.Length) lines, Elapsed $(sw.Elapsed)"); | |
} | |
} |
All results so far:
Case sensitive | Case insensitive | |
F# | 4,80 | 15,50 |
Scala | 3,50 | 3,20 |
Rust | 5,90 | 6,10 |
DMD | 6,90 | |
LDC | 6,60 | |
Elixir | 56,00 | |
Hakell | 20,00 | |
Nemerle | 3,80 | 7,10 |
Nemerle PEG | 4,10 | 4,20 |
Update 3 Dec 2017: Rust's regex crate updated and code cleanup, F# run on .NET Core 2.0:
F#
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
open System.Diagnostics | |
open System.Text.RegularExpressions | |
open System.IO | |
open System | |
let search file pattern = | |
let r = Regex(pattern, RegexOptions.ExplicitCapture ||| RegexOptions.Compiled ||| RegexOptions.IgnoreCase) | |
let delims = [|';'; '"'; ' '|] | |
[| for line in File.ReadLines file |> Seq.take 10_000_000 do | |
match r.Match line with | |
| m when m.Success -> | |
yield m.Value.Split(delims, StringSplitOptions.RemoveEmptyEntries) |> String.concat "." | |
| _ -> () |] | |
[<EntryPoint>] | |
let main _ = | |
let sw = Stopwatch.StartNew() | |
let results = search @"d:\big.txt" @"\{.*(?<name>Microsoft.*)\|\]" | |
sw.Stop() | |
printfn "Found %d lines, Elapsed %O" results.Length sw.Elapsed | |
0 |
Rust
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#![feature(universal_impl_trait)] | |
extern crate time; | |
extern crate regex; | |
extern crate itertools; | |
#[macro_use] | |
extern crate error_chain; | |
use time::{PreciseTime, Duration}; | |
use std::io::prelude::*; | |
use std::io::BufReader; | |
use std::fs::File; | |
use regex::Regex; | |
mod errors { | |
error_chain! { | |
foreign_links { | |
Io(::std::io::Error); | |
Regex(::regex::Error); | |
} | |
} | |
} | |
use errors::*; | |
fn tc<T>(f: impl Fn() -> Result<T>) -> Result<(T, Duration)> { | |
let start = PreciseTime::now(); | |
let res = f()?; | |
Ok((res, start.to(PreciseTime::now()))) | |
} | |
fn cleanup_name(line: &str) -> String { | |
use itertools::*; | |
line.split(|x| x == ' ' || x == ';' || x == '"') | |
.filter(|x| !x.is_empty()) | |
.join(".") | |
} | |
fn search(file: &str, pattern: &str) -> Result<Vec<String>> { | |
let r = Regex::new(pattern)?; | |
let f = File::open(file)?; | |
let buff = BufReader::new(f); | |
let mut res = Vec::new(); | |
for line in buff.lines().take(10_000_000) { | |
let line = line?; | |
if let Some(name) = r.captures(&line).and_then(|caps| caps.name("name")) { | |
res.push(cleanup_name(name.as_str())) | |
} | |
} | |
Ok(res) | |
} | |
fn run() -> Result<()> { | |
let pat = r"(?i)\{.*(?P<name>microsoft.*)\|\]"; | |
let (res, elapsed) = tc(|| { | |
search("d:\\big.txt", pat) | |
})?; | |
println!("Res count = {}, Elapsed {}", res.len(), elapsed); | |
Ok(()) | |
} | |
fn main() { | |
if let Err(ref e) = run() { | |
use std::io::Write; | |
use error_chain::ChainedError; | |
let stderr = &mut ::std::io::stderr(); | |
let errmsg = "Error writing to stderr"; | |
writeln!(stderr, "{}", e.display_chain()).expect(errmsg); | |
std::process::exit(1); | |
} | |
} |
Rust version now performs 2x faster, F# is slower on case sensitive and faster on case insensitive on Core:
All results so far:
Case sensitive | Case insensitive | |
F# | 6,57 | 9,56 |
Scala | 3,50 | 3,20 |
Rust | 2,97 | 3,02 |
DMD | 6,90 | |
LDC | 6,60 | |
Elixir | 56,00 | |
Hakell | 20,00 | |
Nemerle | 3,80 | 7,10 |
Nemerle PEG | 4,10 | 4,20 |
I removed Elixir from the chart as it's a clear outsider and its results make the chart read harder:
Comments
D(LDC): 5.778s
D(GDC): 5.612s
D(DMD): 5.267s
scalac: 5.748s
rustc: 9.287s
https://paste.ee/p/Bb1Ns
you may need to alter the filename on line 10, you appear to be on Windows and I adjusted it for my Linux OS.
It runs twice as fast as the original ctRegex implementation in your blog post for me. The output is the same, feel free to verify.
Elapsed: 3187
~/regex 3.17s user 0.04s system 99% cpu 3.217 total
Elapsed: 6283
~/regex2 6.15s user 0.14s system 99% cpu 6.307 total
compiled with ldc -O3 -release -boundscheck=off -singleobj regex.d
dmd version is ~70% slower now
Bye.
and unpacking Text. This now runs in about 5 seconds on my laptop. https://gist.github.com/raichoo/3ae8dd14699b1f731916
* the latest Rust compiler and regex crate
* F# 4.1 on .NET Core 2.0
Rust is not the leader, F# results is somewhat better.
.Net Online Course
Dot Net Training in Chennai | Dot Net Training in anna nagar | Dot Net Training in omr | Dot Net Training in porur | Dot Net Training in tambaram | Dot Net Training in velachery