Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize block parsing by replacing regexes #137

Merged
merged 7 commits into from
Sep 17, 2018
Merged

Conversation

robinst
Copy link
Collaborator

@robinst robinst commented Sep 9, 2018

Block parsers use regex matching to check whether a line matches that block. That means for every line, we try a bunch of regexes. These can add up. The use of Matcher showed up in a profiling session on Android.

One way to fix that would be to check only one character first, and only if it matches what the block parser expects, run the full regex. But that seems like a bit of a strange mix.

Instead, this PR replaces all the regex use in block parsers with fairly simple looping code.

This gives a nice speedup for the benchmarks. Before:

Benchmark                 Mode  Cnt    Score   Error  Units
SpecBenchmark.wholeSpec  thrpt  200  109.766 ± 0.366  ops/s
SpecBenchmark.examples   thrpt  200  257.182 ± 0.789  ops/s

After:

Benchmark                 Mode  Cnt    Score   Error  Units
SpecBenchmark.wholeSpec  thrpt  200  127.930 ± 0.600  ops/s
SpecBenchmark.examples   thrpt  200  305.566 ± 0.959  ops/s

So that's about a 15% to 18% improvement! 🎉

There's probably some potential for improvement in inline parsing as well, but haven't looked at that yet.

The use of Matcher in there showed up showed up in a profiling session
on Android. Using Matcher also means doing more short-lived allocations.

Replace it with some simple looping code. This gives a nice speedup for
the whole spec benchmark (which is a long document with quite a few
fenced code blocks). Before:

Benchmark                 Mode  Cnt    Score   Error  Units
SpecBenchmark.wholeSpec  thrpt  200  104.297 ± 1.124  ops/s

After:

SpecBenchmark.wholeSpec  thrpt  200  124.558 ± 0.485  ops/s
The speedup should be mostly in the common case where there's no match,
in that case the new code just checks the first character and then
aborts whereas the old code would try a full regex match.
@codecov
Copy link

codecov bot commented Sep 9, 2018

Codecov Report

Merging #137 into master will increase coverage by 0.13%.
The diff coverage is 98.1%.

@@             Coverage Diff              @@
##             master     #137      +/-   ##
============================================
+ Coverage     95.18%   95.32%   +0.13%     
  Complexity      161      161              
============================================
  Files            93       93              
  Lines          2619     2714      +95     
  Branches        327      356      +29     
============================================
+ Hits           2493     2587      +94     
+ Misses           63       62       -1     
- Partials         63       65       +2
Impacted Files Coverage Δ Complexity Δ
...ain/java/org/commonmark/internal/util/Parsing.java 96.36% <100%> (+8.48%) 0 <0> (ø) ⬇️
.../java/org/commonmark/internal/ListBlockParser.java 96.84% <100%> (+0.84%) 0 <0> (ø) ⬇️
...in/java/org/commonmark/internal/HeadingParser.java 100% <100%> (ø) 0 <0> (ø) ⬇️
...g/commonmark/internal/IndentedCodeBlockParser.java 92.59% <91.66%> (-2.87%) 0 <0> (ø)
...a/org/commonmark/internal/ThematicBreakParser.java 96.15% <93.33%> (-3.85%) 0 <0> (ø)
...org/commonmark/internal/FencedCodeBlockParser.java 98.36% <96.96%> (-1.64%) 0 <0> (ø)

@robinst
Copy link
Collaborator Author

robinst commented Sep 12, 2018

Here's some numbers from Android. Before:

Simple (Parse): avg=0.05ms median=0.04ms min=0.04ms max=0.05ms
Short (Parse): avg=0.28ms median=0.28ms min=0.26ms max=0.32ms
Medium (Parse): avg=2.55ms median=2.53ms min=2.42ms max=2.78ms
Long (Parse): avg=27.90ms median=27.67ms min=26.69ms max=29.35ms
UltraLong (Parse): avg=671.34ms median=667.70ms min=662.73ms max=712.41ms

After:

Simple (Parse): avg=0.05ms median=0.05ms min=0.04ms max=0.05ms
Short (Parse): avg=0.28ms median=0.27ms min=0.26ms max=0.31ms
Medium (Parse): avg=1.84ms median=1.84ms min=1.75ms max=1.97ms
Long (Parse): avg=20.05ms median=19.84ms min=19.50ms max=20.91ms
UltraLong (Parse): avg=602.48ms median=599.91ms min=589.54ms max=647.74ms

For simple and short only have a single paragraph. Medium and long have multiple.

So it looks like with text that has a lot of blocks, this reduces parsing time by about 30%, nice :).

@robinst robinst merged commit 0a15efa into master Sep 17, 2018
@robinst robinst deleted the optimize-block-parsing branch November 29, 2018 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant