A lot of time in code is spent comparing memory, particularly when operating on streams. Memory is compared, and most frequently, it is not a match.
This led me into curious territory when memcmp() became a serious bottleneck in my data recovery code. I had written a program which opens a device such as /dev/rdisk0, reads it in chunks of a few MB’s each, and then, in simplified terms, it would memcmp() the start of filesystem blocks against a known list of file headers to find out where a file started.
This worked fine for a while until it was requested that I add a ‘forensics mode’ to the software. There, instead of searching every start of a block on a hard disk for a file header, it was to search for files embedded inside other files. Every single byte on the drive would have to be memcmp()’d against the list of file signatures for a file header.
After finding the code running extremely slowly by this addition I began to seek ways to speed it up. To my surprise, I found out that using my own modified version of memcmp() made the program dramatically faster.
Due to the fact that this was during the era of OS X version 10.3, I assumed that more optimized versions of the standard library were still yet to come due to the nascent state of the operating system, and left it at that.
To my surprise, revisiting this problem now in 2017, around a decade later, I find my version of the memory comparison routine to still be a lot faster, given the following conditions:
- The two values are not aligned on a 4 byte or 8 byte boundary:
When comparing 4 bytes or 8 bytes, the native memcmp() will be faster than this replacement. I would assume that most of the time when memcmp() is used it’s not on 32-bit or 64-bit boundaries, when instead you can just typecast the values as ints or long longs and compare them as numbers.
- The two values being compared are not equal:
If the two values are equal, again, the native memcmp() will be faster. It is a reasonable assumption that most of the time when memcmp() executes the two values will not be equal. For example, it may be searching through a stream for a jpeg header for the bytes “FFD8FF”. It’ll be comparing billions of bytes against “FFD8FF” looking for a header, so more than 99% of the time the result of the comparison will unequal.
The memcmp() replacement is extremely simple. Just take the two values cmp1 and cmp2 that need comparing, typecast them to a long, and, if they are different, return the result of cmp1 – cmp2. Like this:
As stated previously, this method is only faster if cmp1 and cmp2 are not equal. If they are equal this will be slower. So this is a very practical approach to speeding things up, and it is based upon the assumption that the majority of the times memcmp()’s result will not be equal. If the first 4 bytes are equal, it simply return’s memcmp()’s result as it is unable to compete with the speed of the native memcmp() in that case.
Here is some code for running a simple test to get a sense of the speed difference it can make. I also put it on github for easy access.
And the output:
Native memcmp: 3.910372
Modified memcmp: 2.955118
Which is 35% faster for the modified memcmp(). Given that in my use case this directly addressed the bottleneck, it resulted in an instantaneous ~35% increase in the speed of the data recovery software.
What makes this possible is the recent advancement in Intel CPU’s whereby accessing unaligned data has no performance penalty at all. Comparing two 4-byte int’s is faster than comparing two single-byte chars, and it takes fewer CPU cycles to do it.
Considering that memory comparison is one of the 5 or so basic instructions in a CPU’s instruction set, it still strikes me as strange that the standard standard library implementation of it would not be faster than a simple customized version.