Excerpts from the RavenDB Performance team report: Optimizing Memory Comparisons, size does matter

architecture (588) rss
bugs (440) rss
challanges (123) rss
community (372) rss
databases (468) rss
design (878) rss
development (604) rss
hibernating-practices (69) rss
miscellaneous (591) rss
performance (375) rss
programming (1049) rss
raven (1389) rss
ravendb.net (470) rss
reviews (183) rss
uberprof (41) rss

2024
- April (1)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
- July (5)
- June (15)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- February (13)
- January (16)
- April (14)
- March (15)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
- August (16)
- July (11)
2020
- December (10)
- November (13)
- October (15)
- August (9)
- July (9)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
- September (6)
- June (17)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- January (23)
- August (8)
- February (19)
2018
- December (15)
- October (19)
- September (18)
- May (23)
- April (15)
- March (23)
- February (19)
- November (14)
- August (23)
- July (20)
- June (20)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- May (21)
- April (21)
- June (24)
- August (23)
- July (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
- October (22)
- September (18)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- January (36)
- February (33)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
- August (96)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- April (73)
- March (124)
- February (102)
- January (68)
- May (115)
2006
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- February (78)
- March (143)
- November (53)
- January (64)
- December (95)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- January (31)
- February (84)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

RavenDB Workshops - Deep dive into practical use of Document Data Modeling

Jan 29 2015

Excerpts from the RavenDB Performance team reportOptimizing Memory Comparisons, size does matter

time to read 3 min | 502 words

Note, this post was written by Federico. In the previous post after inspecting the decompiled source using ILSpy we were able to uncover potential things we could do. In this fragment we have a pretty optimized method to compare an entire 4 bytes per loop. What if we could do that on 8 bytes?

To achieve that we will use a ulong instead of a uint. This type of optimization makes sense for 2 reasons.

Most of our users are already running RavenDB in x64 where the native word is 8 bytes and Voron is compiled on x64 only. But even if that were not true, since the late 2000’ most CPUs would have a 64 bytes L1 cache line with half a cycle cost for a hit. So even if you can’t handle 64 bits in one go and the JIT or processor have to issue 2 instructions you are still getting a L1 cache hit and no pipeline stall. Which is GREAT Smile .

So without farther ado, this is the resulting code:

Ayende’s note: In the code, the lp += (IntPtr)8/8; is actually defined as lp += 1; What is actually happening is that we are increasing by 8 bytes (size of ulong), and this is how ILSpy decided to represent that for some reason.
The actual IL generated for this is good:

It is just that the translation here is kind of strange.

Therefore the question to ask here is: Will skipping over the parts of the memory block that is equal at a faster rate will compensate for the cost of doing a final check with 8 bytes instead of 4 bytes?

Well the answer is a resounding yes. It won’t have much impact in the first 32 bytes (around 3% or less). We won’t lose, but we won’t win much either. But after that it skyrocket.

// Bandwidth optimization kicks in
Size:       32 Original:     535 Optimized:   442 Gain:    5.01%
Size:       64 Original:     607 Optimized:   493 Gain:    7.08%
Size:    128 Original:     752 Optimized:   573 Gain:   11.77%
Size:     256 Original: 1,080 Optimized:   695 Gain:   35.69%
Size:     512 Original: 1,837 Optimized:   943 Gain:   74.40%
Size: 1,024 Original: 3,200 Optimized: 1,317 Gain: 122.25%
Size: 2,048 Original: 5,135 Optimized: 2,110 Gain: 123.13%
Size: 4,096 Original: 8,753 Optimized: 3,690 Gain: 117.29%

Those are real measurements. You can see that when bandwidth optimization kicks in the gains start to get really high. This means that changing the bandwidth size alone from 4 byte to 8 bytes got us an order of magnitude improvement stabilizing around 120%.

Not bad for 2 lines of work.

Tweet Share Share 15 comments

Tags:

Comments

29 Jan 2015
12:25 PM

alex

I guess I should wait for the rest of the series, but I am assuming that you are limiting this to architectures where alignment is not _required_, and misalignment of either lhs or rhs is not that costly.

29 Jan 2015
12:37 PM

Ayende Rahien

Alex, Yes, and no. We are currently looking at x64 systems, yes. Which are very forgiving for alignment errors, but we actually can't enforce alignment. We make sure to do at least alignment on 2, but we need to pack data tightly, and more than that would be very wasteful, so we don't bother.

30 Jan 2015
00:01 AM

Federico Lois

@alex While there are architectures where alignment is required (older ARM) the new ARM11 and Cortex-A/R already support unaligned access. However, I seriously doubt we will run the RavenDB engine on such a processor :)

I haven't done the benchmarks yet to corroborate it (so I am still performing a couple of operations to work in aligned fashion in the current code), but there are plenty of reports that the performance hit for newer architectures is a myth.

http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/

{quote}On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.{/quote} http://www.agner.org/optimize/blog/read.php?i=142&v=t

Needless to say that I will get rid of those instructions when that is benchmarked.

30 Jan 2015
11:30 AM

Thomas Levesque

Did you benchmark it against a P/Invoke call to memcmp ? In my own benchmark memcmp is a bit faster.

30 Jan 2015
11:53 AM

Ayende Rahien

Thomas, Note that you have to be careful whatever you are benching it in x86 or x64

30 Jan 2015
13:19 PM

Thomas Levesque

Good point... Also, I just realized that my test was run with optimizations disabled, which made the results useless.

I tested both, and here's the result:

32 bits: memcmp: 00:00:00.1036329 custom: 00:00:00.0593725

64 bits: memcmp: 00:00:00.0462231 custom: 00:00:00.0501936

So on 64 bits, memcmp is slightly faster, but on 32 bits the custom comparison is clearly faster...

Here's the code for my benchmark: https://gist.github.com/thomaslevesque/558763165d7e49341076

30 Jan 2015
13:23 PM

Ayende Rahien

Thomas, What sizes are you testing this on? In smaller size (very common for us), our optmized code is really much faster than memcmp. On larger stuff, memcmp is somewhat faster.

30 Jan 2015
13:47 PM

Thomas Levesque

I tested on rather large arrays (4MB), but I get the same results with 4K. Anyway, my implementation isn't exactly the same as yours (since I don't have your full implementation), which might explain the difference.

30 Jan 2015
15:17 PM

Federico Lois

@Thomas There are at least 2 reasons why you are getting better results. I will also speak about similar mistakes I made. The perils of microbenchmarking :)

The first is that you are reusing your buffer. That means that for smaller arrays you have an L1/L2 cache hit which in real life is a condition that would never hold. That also means that you are not checking the case of 4mb buffers, your are checking the case of the position where the buffers differ. Which is actually the worst case scenario every single time (they are equal). That will skew your distribution and show the case where P/Invoke will really shine. But that is not a measurement of your application behavior. You can't beat SSE2 operations in .Net YET :D.

The second is that you are working with too small iterations, therefore the jitter caused by CPU scheduling, the GC, or even other processes will be very high.

In a couple of posts (don't remember which one) we work out those details on how to perform an amortized analysis based on some more reasonable assumptions. Which are the optimization benchmark tests we used, before unleash the optimized algorithm on RavenDB itself.

30 Jan 2015
19:28 PM

Federico Lois

@Thomas I have been working on your code and devised a couple of ideas based on it. They will be added in a future commit. :)

This is the final tally:

Size: 2 Optimized: 193 Native: 448 Gain: 132,1244% Size: 4 Optimized: 261 Native: 426 Gain: 63,21839% Size: 8 Optimized: 277 Native: 388 Gain: 40,0722% Size: 16 Optimized: 317 Native: 427 Gain: 34,70031% Size: 32 Optimized: 351 Native: 457 Gain: 30,19943% Size: 64 Optimized: 411 Native: 497 Gain: 20,92458% Size: 128 Optimized: 544 Native: 628 Gain: 15,44118% Size: 256 Optimized: 887 Native: 949 Gain: 6,989849% Size: 512 Optimized: 1530 Native: 1552 Gain: 1,437914% Size: 1024 Optimized: 2961 Native: 3141 Gain: 6,07903% Size: 2048 Optimized: 4505 Native: 4418 Gain: -1,93119% Size: 4096 Optimized: 6760 Native: 6841 Gain: 1,19822%

As you see the native implementation will still suffer when faced with a non-worst case scenario (you don't need to look into all the bytes to decide). If you know you are comparing lots of statistically equal memory, you would certainly not use our version, our is tuned to the case we expect many of those to be different (usual case when traversing the trees).

In your benchmark, even though your results are correct for the worst case, most of the jitter was caused by the test harness. It accounted for more than 50% of the whole cost of execution t.Run(). I will talk about it in a later post but the .Start() .Stop() causes lots of jittering in the results. And those errors are additive.

Having said that, I also discovered a source of jittering in my own benchmark code too which penalized the native version more than it should be. It didn't changed radically the results, but for the sake of completeness I have to say it.

30 Jan 2015
21:35 PM

Ayende Rahien

Thomas, Both 4MB & 4KB are huge, when talking about smaller, we talk about 8 bytes, 32 bytes, etc. Think comparing keys in index

31 Jan 2015
17:44 PM

alex

@Federico. At the peril of focusing too much on a single microbenchmark. It does seem (on at least my system, and likely on many modern Intel processors), that a little misalignment is not that bad. Also at a certain size of the compared buffer (around 256/512) it makes more sense to incur the cost of a call to memcmp (which will likely be doing REP SCAS* ops).

See some results vs. the "MemoryUtils.Compare" version I just copied from the Voron source: https://gist.github.com/anonymous/67f270d9ce353ea19d1a

01 Feb 2015
23:50 PM

Federico Lois

@alex Actually I saw this comment today. I benchmarked your routine against our latest one (not in the code yet https://gist.github.com/anonymous/d113f06e6b007e87173b ).

The results:

Size: 2 Optimized: 231 Alt: 245 Native: 514 Gain: -5,714285% Size: 4 Optimized: 250 Alt: 226 Native: 368 Gain: 10,61947% Size: 8 Optimized: 275 Alt: 276 Native: 342 Gain: -0,3623188% Size: 16 Optimized: 291 Alt: 317 Native: 383 Gain: -8,201891% Size: 32 Optimized: 318 Alt: 339 Native: 396 Gain: -6,194693% Size: 64 Optimized: 378 Alt: 401 Native: 457 Gain: -5,73566% Size: 128 Optimized: 541 Alt: 556 Native: 576 Gain: -2,697843% Size: 256 Optimized: 880 Alt: 879 Native: 929 Gain: 0,1137614% Size: 512 Optimized: 1633 Alt: 1669 Native: 1665 Gain: -2,156979%

As you can see we followed a similar path.

About the REP SCAS, havent looked into memcmp itself, but benchmarks suggest that when you account for the P/Invoke the difference is not good enough to surpass a managed version for less that 64kb. We still call memcmp after 512 because the performance is similar and if there is a change in memcmp for specific architectures we are going to pick it if we need it.

02 Feb 2015
18:32 PM

alex

@Federico. I see that it is indeed the same approach with some optimizations along the decision path. Good to know this approach holds.

Regarding the P/Invoke overhead: what I am seeing is that when the average size to compare is above around 1K (i.e. well before 64K) memcmp is faster than 8 byte (unaligned) compares in a loop.

03 Feb 2015
00:33 AM

Federico Lois

The issue is that we always have measurement bias, basically the computer :)

We had to retire some optimizations that worked great on older hardware but not on newer and viceversa. Testing optimizations at this level requires to be very careful, you should not overfit for a particular architecture because you will pay on the rest.

We had an optimization that was AWESOME on first generation i5/i7 and Core2 but on the fourth generation i7 saying it was bad is an understatement, it was awful.

Knowing those differences is because we go full memcmp at the 512 threshold. We prefer to err in the low-side but not on the upper where it hurt more.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB

Excerpts from the RavenDB Performance team reportOptimizing Memory Comparisons, size does matter

More posts in "Excerpts from the RavenDB Performance team report" series:

Comments

Comment preview

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed

Oren Eini

CEO of RavenDB

More posts in "Excerpts from the RavenDB Performance team report" series:

Comments

Comment preview

Markdown formatting

Phrase Emphasis

Links

Images

Headers

Lists

Blockquotes

Horizontal Rules

Manual Line Breaks

Fenced Code Blocks

Header IDs

Tables

Definition Lists

Footnotes

Abbreviations

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication