Optimize block parsing by replacing regexes #137

robinst · 2018-09-09T23:53:07Z

Block parsers use regex matching to check whether a line matches that block. That means for every line, we try a bunch of regexes. These can add up. The use of Matcher showed up in a profiling session on Android.

One way to fix that would be to check only one character first, and only if it matches what the block parser expects, run the full regex. But that seems like a bit of a strange mix.

Instead, this PR replaces all the regex use in block parsers with fairly simple looping code.

This gives a nice speedup for the benchmarks. Before:

Benchmark                 Mode  Cnt    Score   Error  Units
SpecBenchmark.wholeSpec  thrpt  200  109.766 ± 0.366  ops/s
SpecBenchmark.examples   thrpt  200  257.182 ± 0.789  ops/s

After:

Benchmark                 Mode  Cnt    Score   Error  Units
SpecBenchmark.wholeSpec  thrpt  200  127.930 ± 0.600  ops/s
SpecBenchmark.examples   thrpt  200  305.566 ± 0.959  ops/s

So that's about a 15% to 18% improvement! 🎉

There's probably some potential for improvement in inline parsing as well, but haven't looked at that yet.

The use of Matcher in there showed up showed up in a profiling session on Android. Using Matcher also means doing more short-lived allocations. Replace it with some simple looping code. This gives a nice speedup for the whole spec benchmark (which is a long document with quite a few fenced code blocks). Before: Benchmark Mode Cnt Score Error Units SpecBenchmark.wholeSpec thrpt 200 104.297 ± 1.124 ops/s After: SpecBenchmark.wholeSpec thrpt 200 124.558 ± 0.485 ops/s

The speedup should be mostly in the common case where there's no match, in that case the new code just checks the first character and then aborts whereas the old code would try a full regex match.

…-parsing

codecov · 2018-09-09T23:58:11Z

Codecov Report

Merging #137 into master will increase coverage by 0.13%.
The diff coverage is 98.1%.

@@             Coverage Diff              @@
##             master     #137      +/-   ##
============================================
+ Coverage     95.18%   95.32%   +0.13%     
  Complexity      161      161              
============================================
  Files            93       93              
  Lines          2619     2714      +95     
  Branches        327      356      +29     
============================================
+ Hits           2493     2587      +94     
+ Misses           63       62       -1     
- Partials         63       65       +2

Impacted Files	Coverage Δ	Complexity Δ
...ain/java/org/commonmark/internal/util/Parsing.java	`96.36% <100%> (+8.48%)`	`0 <0> (ø)`	⬇️
.../java/org/commonmark/internal/ListBlockParser.java	`96.84% <100%> (+0.84%)`	`0 <0> (ø)`	⬇️
...in/java/org/commonmark/internal/HeadingParser.java	`100% <100%> (ø)`	`0 <0> (ø)`	⬇️
...g/commonmark/internal/IndentedCodeBlockParser.java	`92.59% <91.66%> (-2.87%)`	`0 <0> (ø)`
...a/org/commonmark/internal/ThematicBreakParser.java	`96.15% <93.33%> (-3.85%)`	`0 <0> (ø)`
...org/commonmark/internal/FencedCodeBlockParser.java	`98.36% <96.96%> (-1.64%)`	`0 <0> (ø)`

robinst · 2018-09-12T04:57:27Z

Here's some numbers from Android. Before:

Simple (Parse): avg=0.05ms median=0.04ms min=0.04ms max=0.05ms
Short (Parse): avg=0.28ms median=0.28ms min=0.26ms max=0.32ms
Medium (Parse): avg=2.55ms median=2.53ms min=2.42ms max=2.78ms
Long (Parse): avg=27.90ms median=27.67ms min=26.69ms max=29.35ms
UltraLong (Parse): avg=671.34ms median=667.70ms min=662.73ms max=712.41ms

After:

Simple (Parse): avg=0.05ms median=0.05ms min=0.04ms max=0.05ms
Short (Parse): avg=0.28ms median=0.27ms min=0.26ms max=0.31ms
Medium (Parse): avg=1.84ms median=1.84ms min=1.75ms max=1.97ms
Long (Parse): avg=20.05ms median=19.84ms min=19.50ms max=20.91ms
UltraLong (Parse): avg=602.48ms median=599.91ms min=589.54ms max=647.74ms

For simple and short only have a single paragraph. Medium and long have multiple.

So it looks like with text that has a lot of blocks, this reduces parsing time by about 30%, nice :).

robinst added 7 commits August 13, 2018 13:40

Optimize heading block parsing by replacing regex matching

da3bde8

The speedup should be mostly in the common case where there's no match, in that case the new code just checks the first character and then aborts whereas the old code would try a full regex match.

Merge branch 'optimize-fenced-code-block-parsing' into optimize-block…

c2f8a82

…-parsing

Use new Parsing utils for FencedCodeBlockParser

7abda68

Replace regex matching in ThematicBreakParser

ad397cb

Replace regex matching in ListBlockParser

69af683

Replace regex matching and simplify IndentedCodeBlockParser

0facee1

robinst merged commit 0a15efa into master Sep 17, 2018

robinst deleted the optimize-block-parsing branch November 29, 2018 04:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize block parsing by replacing regexes #137

Optimize block parsing by replacing regexes #137

robinst commented Sep 9, 2018

codecov bot commented Sep 9, 2018 •

edited

Loading

robinst commented Sep 12, 2018

Optimize block parsing by replacing regexes #137

Optimize block parsing by replacing regexes #137

Conversation

robinst commented Sep 9, 2018

codecov bot commented Sep 9, 2018 • edited Loading

Codecov Report

robinst commented Sep 12, 2018

codecov bot commented Sep 9, 2018 •

edited

Loading