内存拷贝优化(3)-深入优化
今天继续在原来内存拷贝代码上优化:
1. 修改了小内存方案:由原来64字节扩大为128字节,由 int 改为 xmm,小内存性能提升 80%
2. 修改了中内存方案:从4个xmm寄存器并行拷贝改为8个并行拷贝+prefetch,提升20%左右
3. 去除目标地址头部对齐的分支判断,用一次xmm拷贝完成目标对齐,性能替升10%。
4. 增加测试用例:为贴近实际,增加了随机访问,10MB空间内(绝对大于L2尺寸)随机位置和长度的测试
为避免随机数生成影响结果,提前生成随机数,最终平均性能达到gcc4.9配套标准库的2倍以上:
https://github.com/skywind3000/FastMemcpy
最新代码测试结果(可以对比老的表看新版本性能是否有所提升):
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266 ms
result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234 ms
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343 ms
result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344 ms
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218 ms
result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546 ms
result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515 ms
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172 ms
result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437 ms
result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452 ms
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218 ms
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203 ms
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191 ms
result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281 ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328 ms
result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312 ms
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406 ms
result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421 ms
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468 ms
result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452 ms
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452 ms
result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468 ms
result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514 ms
result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472 ms
benchmark random access:
memcpy_fast=515ms memcpy=1014ms
老的测试结果为:
result: gcc4.9 (msvc 2012 got a similar result):
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249 ms
result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271 ms
result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269 ms
result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260 ms
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300 ms
result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339 ms
result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361 ms
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241 ms
result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509 ms
result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520 ms
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179 ms
result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430 ms
result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430 ms
result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455 ms
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80 ms
result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205 ms
result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200 ms
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222 ms
result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210 ms
result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230 ms
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201 ms
result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270 ms
result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361 ms
result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321 ms
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391 ms
result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453 ms
result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439 ms
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422 ms
result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516 ms
result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436 ms
benchmark random access:
memcpy_fast=594ms memcpy=1161ms
旧文索引: