Today I continued optimizing the memory-copy code from the earlier posts:

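1. Reworked the small-copy path: the threshold grew from 64 bytes to 128 bytes, and the copies now go through xmm registers instead of int, which improves small-copy performance by 80%.
2. Reworked the medium-copy path: instead of 4 xmm registers copying in parallel, it now uses 8 in parallel plus prefetch, a gain of roughly 20%.
3. Removed the branch that handled aligning the head of the destination address; a single xmm copy now brings the destination into alignment, which improves performance by 10%.
4. Added a test case: to stay closer to real workloads, there is now a random-access test that copies at random positions with random lengths inside a 10MB region (definitely larger than L2).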
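To make item 3 concrete, here is a minimal sketch of aligning the destination with one unconditional xmm copy instead of a branchy byte-by-byte head loop. It is not the FastMemcpy code: the function name `copy_align_dst_sketch`, the 32-byte fallback threshold, and the simple one-register main loop are all my assumptions for illustration, and it requires an x86 target with SSE2.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: illustrates the head-alignment trick only. */
static void *copy_align_dst_sketch(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* Assumed threshold: below 32 bytes the trick does not apply. */
    if (size < 32) {
        return memcpy(dst, src, size);
    }

    /* One unaligned 16-byte copy covers the head, whatever the alignment. */
    __m128i head = _mm_loadu_si128((const __m128i *)s);
    _mm_storeu_si128((__m128i *)d, head);

    /* Skip 1..16 bytes so that d lands on a 16-byte boundary.
       All skipped bytes were already written by the head copy. */
    size_t skip = 16 - ((uintptr_t)d & 15);
    d += skip;
    s += skip;
    size -= skip;

    /* Main loop: unaligned loads, aligned stores. */
    while (size >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_store_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        size -= 16;
    }

    /* Tail: one unaligned 16-byte copy that ends exactly at the last byte.
       It overlaps bytes that were already copied, which is harmless. */
    if (size > 0) {
        __m128i tail = _mm_loadu_si128((const __m128i *)(s + size - 16));
        _mm_storeu_si128((__m128i *)(d + size - 16), tail);
    }
    return dst;
}
```

The point of the design is that the head copy runs unconditionally: when the destination is already aligned, `skip` is simply 16 and nothing is wasted, so no alignment check is needed before the main loop. Like `memcpy`, the sketch assumes the source and destination do not overlap.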

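To keep random number generation from skewing the results, the random numbers are generated ahead of time. On average, the final performance is more than 2x that of the standard library shipped with gcc4.9: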

https://github.com/skywind3000/FastMemcpy
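The random-access test with pre-generated random numbers can be sketched roughly as follows. This is my own illustration rather than the benchmark in the repository: the names `copy_case`, `gen_cases`, and `run_cases_ms`, the xorshift generator, and the use of `clock()` for timing are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any replacement under test. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* One pre-generated copy: where to read, where to write, how much. */
typedef struct {
    size_t src_off;
    size_t dst_off;
    size_t len;
} copy_case;

/* Small deterministic PRNG (xorshift32); state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill cases[] before timing starts, so the PRNG cost is not measured.
   Requires 1 <= max_len <= buf_size and seed != 0. */
static void gen_cases(copy_case *cases, size_t n, size_t buf_size,
                      size_t max_len, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = 0; i < n; i++) {
        size_t len = 1 + xorshift32(&state) % max_len;
        size_t span = buf_size - len + 1;  /* number of valid start offsets */
        cases[i].len = len;
        cases[i].src_off = xorshift32(&state) % span;
        cases[i].dst_off = xorshift32(&state) % span;
    }
}

/* Time only the copies; src and dst are separate buffers of buf_size bytes. */
static double run_cases_ms(copy_fn fn, unsigned char *dst,
                           const unsigned char *src,
                           const copy_case *cases, size_t n)
{
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++) {
        fn(dst + cases[i].dst_off, src + cases[i].src_off, cases[i].len);
    }
    clock_t t1 = clock();
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

To compare two implementations, generate the cases once and pass the same array to `run_cases_ms` for each function (for example `memcpy` and `memcpy_fast`), so both see an identical sequence of offsets and lengths.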

Results for the latest code (compare against the old table to see whether the new version improved):

benchmark(size=32 bytes, times=16777216):  
result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260 ms  
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250 ms  
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266 ms  
result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234 ms

benchmark(size=64 bytes, times=16777216):  
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281 ms  
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328 ms  
result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343 ms  
result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344 ms

benchmark(size=512 bytes, times=8388608):  
result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218 ms  
result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484 ms  
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546 ms  
result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515 ms

benchmark(size=1024 bytes, times=4194304):  
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172 ms  
result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453 ms  
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437 ms  
result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452 ms

benchmark(size=4096 bytes, times=524288):  
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms  
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202 ms  
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203 ms  
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218 ms

benchmark(size=8192 bytes, times=262144):  
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms  
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202 ms  
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203 ms  
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203 ms

benchmark(size=1048576 bytes, times=2048):  
result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191 ms  
result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281 ms  
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328 ms  
result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312 ms

benchmark(size=4194304 bytes, times=512):  
result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406 ms  
result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421 ms  
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468 ms  
result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452 ms

benchmark(size=8388608 bytes, times=256):  
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452 ms  
result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468 ms  
result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514 ms  
result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472 ms

benchmark random access:  
memcpy_fast=515ms memcpy=1014ms

The old results were:

result: gcc4.9 (msvc 2012 got a similar result):  
  
benchmark(size=32 bytes, times=16777216):  
result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249 ms  
result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271 ms  
result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269 ms  
result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260 ms  
  
benchmark(size=64 bytes, times=16777216):  
result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300 ms  
result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328 ms  
result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339 ms  
result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361 ms  
  
benchmark(size=512 bytes, times=8388608):  
result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241 ms  
result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519 ms  
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509 ms  
result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520 ms  
  
benchmark(size=1024 bytes, times=4194304):  
result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179 ms  
result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430 ms  
result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430 ms  
result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455 ms  
  
benchmark(size=4096 bytes, times=524288):  
result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80 ms  
result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205 ms  
result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224 ms  
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200 ms  
  
benchmark(size=8192 bytes, times=262144):  
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78 ms  
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222 ms  
result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210 ms  
result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230 ms  
  
benchmark(size=1048576 bytes, times=2048):  
result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201 ms  
result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270 ms  
result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361 ms  
result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321 ms  
  
benchmark(size=4194304 bytes, times=512):  
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391 ms  
result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407 ms  
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453 ms  
result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439 ms  
  
benchmark(size=8388608 bytes, times=256):  
result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422 ms  
result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407 ms  
result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516 ms  
result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436 ms

benchmark random access:  
memcpy_fast=594ms memcpy=1161ms

Earlier posts:

Memory Copy Optimization (1): Small Memory Copy Optimization

Memory Copy Optimization (2): Full-Size Copy Optimization