Speeding up RGB to grayscale conversion in Rust by a factor of 2.2 – and various other multimedia related processing loops


In the previous blog post I wrote about how to write an RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I’m going to write about how to optimize the processing loop of that filter, without resorting to unsafe code or SIMD instructions, by staying with plain, safe Rust code.

I also tried to implement the processing loop with faster, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation it is currently missing some features needed to express this specific algorithm in a meaningful way. Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, it could very well be a good replacement for the ORC library that GStreamer uses for the same purpose in various places. ORC works by JIT-compiling a minimal “array operation language” to SIMD assembly for your specific CPU (and has support for x86 MMX/SSE, PPC Altivec, ARM NEON, etc.).

If someone wants to prove me wrong and implement this with faster, feel free to do so and I’ll link to your solution and include it in the benchmark results below.

All code below can be found in this Git repository.

Table of Contents

  1. Baseline Implementation
  2. First Optimization – Assertions
  3. First Optimization – Assertions Try 2
  4. Second Optimization – Iterate a bit more
  5. Third Optimization – Getting rid of the bounds check finally

Baseline Implementation

This is what the baseline implementation looks like.

// Luma coefficients for R, G, B and the padding byte in 16.16 fixed point
// (BT.601: 0.299, 0.587 and 0.114 scaled by 65536; the x coefficient is 0).
const RGB_Y: [u32; 4] = [19595, 38470, 7471, 0];

pub fn bgrx_to_gray_chunks_no_asserts(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;
 
    for (in_line, out_line) in in_data
        .chunks(in_stride)
        .zip(out_data.chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .chunks(4)
            .zip(out_line[..out_line_bytes].chunks_mut(4))
        {
            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);
 
            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;
            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}

This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to u32, multiplies them with a constant coefficient array, converts the result back to u8 and stores that same value in all 4 bytes of the output BGRx chunk.
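For illustration, this is roughly how the function would be called for a tightly packed 1920x1080 BGRx frame, where both strides equal the line length in bytes (the variable names here are just for the example):

// Example call for a tightly packed 1920x1080 BGRx frame (no row padding),
// i.e. both strides are width * 4 bytes
let (width, height) = (1920usize, 1080usize);
let in_frame = vec![0u8; width * height * 4];
let mut out_frame = vec![0u8; width * height * 4];

bgrx_to_gray_chunks_no_asserts(&in_frame, &mut out_frame, width * 4, width * 4, width);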

So what can be improved here? For starters, let’s write a small benchmark so that we know whether any of our changes actually improve something. This uses the (unfortunately still) unstable benchmark feature of Cargo.

#![feature(test)]
#![feature(exact_chunks)]
 
extern crate test;
 
pub fn bgrx_to_gray_chunks_no_asserts(...)
    [...]
}
 
#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    use std::iter;
 
    fn create_vec(w: usize, h: usize) -> Vec<u8> {
        iter::repeat(0).take(w * h * 4).collect::<_>()
    }
 
    #[bench]
    fn bench_chunks_1920x1080_no_asserts(b: &mut Bencher) {
        let i = test::black_box(create_vec(1920, 1080));
        let mut o = test::black_box(create_vec(1920, 1080));
 
        b.iter(|| bgrx_to_gray_chunks_no_asserts(&i, &mut o, 1920 * 4, 1920 * 4, 1920));
    }
}

This can be run with cargo bench and prints the number of nanoseconds each iteration of the closure took. To really measure only the processing itself, allocation and initialization of the input/output frames happen outside of the closure; we’re not interested in timing those.

First Optimization – Assertions

To actually start optimizing this function, let’s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the Godbolt Compiler Explorer website. Select “rustc nightly” and use “-C opt-level=3” for the compiler flags, and then copy & paste your code in there. Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and “Scroll to assembly”.

Alternatively you can use cargo rustc --release -- -C opt-level=3 --emit asm and check the assembly file that is written to the target/release/deps directory.

What we see then for our inner loop is something like the following

.LBB4_19:
  cmp r15, r11
  mov r13, r11
  cmova r13, r15
  mov rdx, r8
  sub rdx, r13
  je .LBB4_34
  cmp rdx, 3
  jb .LBB4_35
  inc r9
  movzx edx, byte ptr [rbx - 1]
  movzx ecx, byte ptr [rbx - 2]
  movzx esi, byte ptr [rbx]
  imul esi, esi, 19595
  imul edx, edx, 38470
  imul ecx, ecx, 7471
  add ecx, edx
  add ecx, esi
  shr ecx, 16
  mov byte ptr [r10 - 3], cl
  mov byte ptr [r10 - 2], cl
  mov byte ptr [r10 - 1], cl
  mov byte ptr [r10], cl
  add r10, 4
  add r8, -4
  add r15, -4
  add rbx, 4
  cmp r9, r14
  jb .LBB4_19

This is already quite optimized. For each loop iteration the first few instructions do some bounds checking and, if that fails, jump to the .LBB4_34 or .LBB4_35 labels. How do we know that this is bounds checking? Scroll down in the assembly to where these labels are defined and you’ll see something like the following

.LBB4_34:
  lea rdi, [rip + .Lpanic_bounds_check_loc.D]
  xor esi, esi
  xor edx, edx
  call core::panicking::panic_bounds_check@PLT
  ud2
.LBB4_35:
  cmp r15, r11
  cmova r11, r15
  sub r8, r11
  lea rdi, [rip + .Lpanic_bounds_check_loc.F]
  mov esi, 2
  mov rdx, r8
  call core::panicking::panic_bounds_check@PLT
  ud2

Also if you check (with the colors, or the “scroll to source” feature) which Rust code these correspond to, you’ll see that it’s the first and third access to the 4-byte slice that contains our BGRx values.

Afterwards in the assembly, the following steps happen:

  0. incrementing of the “loop counter” representing the number of iterations we’re going to do (r9),
  1. the actual reading of the B, G and R values and their conversion to u32 (the 3 movzx; note that the reading of the x value is optimized away, as the compiler sees that it is always multiplied by 0 later),
  2. the multiplications with the array elements (the 3 imul),
  3. combining of the results and the division, i.e. shift (the 2 add and the shr),
  4. storing of the result in the output (the 4 mov).

Afterwards the slice pointers are increased by 4 (rbx and r10) and the lengths (used for bounds checking) are decreased by 4 (r8 and r15). Finally there’s a check (cmp) to see if r9 (our loop counter) is at the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.

Generally, what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (cmp, cmov) and jumps (the instructions starting with j). These are all things that slow down our code.

So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely seems wasteful to do two checks instead of one for the two slices (the checks cover both slices at once, but Godbolt does not detect that and attributes them only to the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.

As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.

To have just a single check instead of two, you can insert an assert_eq!(in_p.len(), 4) at the beginning of the inner loop, and the same for the output slice, as sketched below. Now we only have a single bounds check left per iteration.
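For illustration, the inner loop then starts like this (the rest of the body stays exactly as in the baseline version):

for (in_p, out_p) in in_line[..in_line_bytes]
    .chunks(4)
    .zip(out_line[..out_line_bytes].chunks_mut(4))
{
    // Tell the compiler (and check at runtime) that both chunks are 4 bytes
    assert_eq!(in_p.len(), 4);
    assert_eq!(out_p.len(), 4);

    // ... same grayscale calculation as before ...
}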

As a next step we might want to try to move this knowledge outside the inner loop so that no bounds checking is left in there at all. We might then add assertions like the following outside the outer loop, to give the compiler all the knowledge we have

assert_eq!(in_data.len() % 4, 0);
assert_eq!(out_data.len() % 4, 0);
assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);
 
assert!(in_line_bytes <= in_stride);
assert!(out_line_bytes <= out_stride);

Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea, so let’s just keep them. At least they can serve as some kind of documentation of the invariants of this code for future readers.

So let’s benchmark these two implementations now. The results on my machine are the following

test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns/iter (+/- 166,555)

This is surprising: the version with the assertions is actually slower by a factor of ~1.1, even though it contains fewer conditions. So let’s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions

.LBB4_19:
  cmp rbx, r11
  mov r9, r11
  cmova r9, rbx
  mov r14, r12
  sub r14, r9
  lea rax, [r14 - 1]
  mov qword ptr [rbp - 120], rax
  mov qword ptr [rbp - 128], r13
  mov qword ptr [rbp - 136], r10
  cmp r14, 5
  jne .LBB4_33
  inc rcx
  [...]

While this indeed has only one jump for the bounds checking, as expected, the number of comparisons is the same and, even worse, 3 memory writes to the stack happen right before the jump. If we follow the .LBB4_33 label we will see that the assert_eq! macro is going to do something with core::fmt::Debug. This is setting up the information needed for printing the assertion failure, the “expected X equals Y” output. This is certainly not good, and it is the reason why everything is slower now.

First Optimization – Assertions Try 2

All the additional instructions and memory writes were happening because the assert_eq! macro outputs something user-friendly that actually contains the values of both sides. Let’s try again with the assert! macro instead.
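The only change compared to the previous attempt is the form of the two assertions at the top of the inner loop:

// Same checks as before, but assert! only has to prepare a static panic
// message instead of capturing both values for an "expected X == Y" report
assert!(in_p.len() == 4);
assert!(out_p.len() == 4);

Running the benchmark again gives the following results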

test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns/iter (+/- 97,084)

This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with assert_eq! 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would’ve expected

.LBB4_19:
  cmp rbx, r12
  mov r13, r12
  cmova r13, rbx
  add r13, r14
  jne .LBB4_33
  inc r9
  [...]

One cmp less, only one jump left. And no memory writes anymore!

So keep in mind that assert_eq! is more user-friendly, but quite a bit more expensive than assert! even in the “good case”.

Second Optimization – Iterate a bit more

This is still not very satisfying though. No bounds checking should be needed at all, as each chunk is going to be exactly 4 bytes; we’re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let’s try something different. The zip iterator is done when the shorter of the two iterators is done, and there are optimizations implemented specifically for zipped slice iterators. Let’s make use of that and replace the grayscale value calculation with

let grey = in_p.iter()
    .zip(RGB_Y.iter())
    .map(|(i, c)| u32::from(*i) * c)
    .sum::<u32>() / 65536;

If we run that through our benchmark after removing the assert!(in_p.len() == 4) (and the same for the output slice), these are the results

test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum   ... bench:  11,393,600 ns/iter (+/- 347,958)

We’re actually 2.9 times slower! Even when adding back the assert!(in_p.len() == 4) assertion (and the same for the output slice) we’re still slower

test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum   ... bench:  11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench:  10,420,442 ns/iter (+/- 242,379)

If we look at the assembly of the assertion-less variant, it’s a complete mess now

.LBB0_19:
  cmp rbx, r13
  mov rcx, r13
  cmova rcx, rbx
  mov rdx, r8
  sub rdx, rcx
  cmp rdx, 4
  mov r11d, 4
  cmovb r11, rdx
  test r11, r11
  je .LBB0_20
  movzx ecx, byte ptr [r15 - 2]
  imul ecx, ecx, 19595
  cmp r11, 1
  jbe .LBB0_22
  movzx esi, byte ptr [r15 - 1]
  imul esi, esi, 38470
  add esi, ecx
  movzx ecx, byte ptr [r15]
  imul ecx, ecx, 7471
  add ecx, esi
  test rdx, rdx
  jne .LBB0_23
  jmp .LBB0_35
.LBB0_20:
  xor ecx, ecx
.LBB0_22:
  test rdx, rdx
  je .LBB0_35
.LBB0_23:
  shr ecx, 16
  mov byte ptr [r10 - 3], cl
  mov byte ptr [r10 - 2], cl
  cmp rdx, 3
  jb .LBB0_36
  inc r9
  mov byte ptr [r10 - 1], cl
  mov byte ptr [r10], cl
  add r10, 4
  add r8, -4
  add rbx, -4
  add r15, 4
  cmp r9, r14
  jb .LBB0_19

In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the added noise, the compiler was not even able to optimize away the bounds check for the output slice anymore (the .LBB0_35 cases). It was able to unroll the iterator (note that the 3 imul multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, but it couldn’t do anything meaningful with the information it apparently had (it must’ve understood that each chunk has 4 bytes!). This looks to me like something going wrong somewhere in the optimizer.

If we take a look at the variant with the assertions, things look much better

.LBB3_19:
  cmp r11, r12
  mov r13, r12
  cmova r13, r11
  add r13, r14
  jne .LBB3_33
  inc r9
  movzx ecx, byte ptr [rdx - 2]
  imul r13d, ecx, 19595
  movzx ecx, byte ptr [rdx - 1]
  imul ecx, ecx, 38470
  add ecx, r13d
  movzx ebx, byte ptr [rdx]
  imul ebx, ebx, 7471
  add ebx, ecx
  shr ebx, 16
  mov byte ptr [r10 - 3], bl
  mov byte ptr [r10 - 2], bl
  mov byte ptr [r10 - 1], bl
  mov byte ptr [r10], bl
  add r10, 4
  add r11, -4
  add r14, 4
  add rdx, 4
  cmp r9, r15
  jb .LBB3_19

This is literally the same as the assertion version we had before, except that the reads from the input slice, the multiplications and the additions happen in iterator order instead of being batched together. It’s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it’s still many times slower than the original version. The reason must be the instruction ordering: the previous version had all memory reads batched and then all the arithmetic batched, which is apparently much better for the CPU’s pipelining (while waiting for pending instructions to finish, it can already execute later instructions that have no dependencies on them).

It’s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all information it needs for that if no iterator is involved, and both versions are leading to exactly the same assembly except for the order of instructions. This also seems like something fishy.

Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.

Third Optimization – Getting rid of the bounds check finally

Let’s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.

So why don’t we write a new chunks iterator that always returns exactly the requested number of items, instead of potentially fewer for the very last iteration? And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have an API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It’s basically the slice equivalent of the exact_chunks iterator of the ndarray crate.

Because it is functionally different from the existing iterator, and not just an optimization, I also submitted it for inclusion in Rust’s standard library, and it is nowadays available as an unstable feature in nightly, like all newly added API. Nonetheless, the same can also be implemented in your own code with basically the same effect, as sketched below; there are no dependencies on standard library internals.
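A minimal sketch of the idea, as a standalone iterator rather than a slice method (this is an illustration, not the actual standard library code): it only ever hands out slices of exactly chunk_size elements, produced by split_at(), and simply stops once fewer than chunk_size elements are left.

struct ExactChunks<'a, T: 'a> {
    v: &'a [T],
    chunk_size: usize,
}

impl<'a, T> Iterator for ExactChunks<'a, T> {
    type Item = &'a [T];

    fn next(&mut self) -> Option<&'a [T]> {
        if self.v.len() < self.chunk_size {
            // Fewer than chunk_size elements left: ignore the remainder
            None
        } else {
            // split_at() returns a first slice of exactly chunk_size elements
            let (fst, snd) = self.v.split_at(self.chunk_size);
            self.v = snd;
            Some(fst)
        }
    }
}

fn exact_chunks<'a, T>(slice: &'a [T], chunk_size: usize) -> ExactChunks<'a, T> {
    assert!(chunk_size != 0);
    ExactChunks { v: slice, chunk_size: chunk_size }
}

A mutable counterpart can be built the same way on top of split_at_mut() (with a small mem::replace() dance to keep the borrow checker happy), and after inlining with a constant chunk size of 4 the compiler can see the exact length of every chunk.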

So, let’s use our new exact_chunks iterator that is guaranteed (by API) to always give us exactly 4 bytes. In our case this is exactly equivalent to the normal chunks as by construction our slices always have a length that is a multiple of 4, but the compiler can’t infer that information. The resulting code looks as follows

pub fn bgrx_to_gray_exact_chunks(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);
 
    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;
 
    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);
 
    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .exact_chunks(4)
            .zip(out_line[..out_line_bytes].exact_chunks_mut(4))
        {
            assert!(in_p.len() == 4);
            assert!(out_p.len() == 4);
 
            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);
 
            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;
            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}

It’s exactly the same as the previous version with assertions, except for using exact_chunks instead of chunks (and the same for the mutable iterator). The resulting benchmark of all our variants now looks as follows

test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum   ... bench:  11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench:  10,420,442 ns/iter (+/- 242,379)
test tests::bench_exact_chunks_1920x1080      ... bench:   2,007,459 ns/iter (+/- 112,287)

Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore

.LBB0_10:
  movzx edx, byte ptr [rsi - 2]
  movzx r15d, byte ptr [rsi - 1]
  movzx r12d, byte ptr [rsi]
  imul r13d, edx, 19595
  imul edx, r15d, 38470
  add edx, r13d
  imul ebx, r12d, 7471
  add ebx, edx
  shr ebx, 16
  mov byte ptr [rcx - 3], bl
  mov byte ptr [rcx - 2], bl
  mov byte ptr [rcx - 1], bl
  mov byte ptr [rcx], bl
  add rcx, 4
  add rsi, 4
  dec r10
  jne .LBB0_10

Also thanks to this the compiler is able to apply some more optimizations, and we end up with only a single loop counter for the number of iterations (r10, which is decremented) plus the two pointers (rcx and rsi) that are advanced in each iteration. There is no tracking of the remaining slice lengths anymore, as there was in the assembly of the original version (and the versions with assertions).

Summary

So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators, without falling back to unsafe code or SIMD. The optimizations the Rust compiler applies are quite impressive, and the marketing line about zero-cost abstractions really holds up here.

The same approach should also work for many similar algorithms, and thus for many multimedia-related processing loops where you iterate over slices and operate on fixed-size chunks.

The above also shows that as a first step it’s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like the bounds checks). In many cases this can be achieved by adding assertions in strategic places, or, as in this case, by switching to a slightly different abstraction that is closer to the actual requirements (although I believe the compiler should be able to produce the same code from the normal chunks iterator with the help of assertions; making that possible probably requires improvements to the LLVM optimizer).

And if all that does not help, there’s still the escape hatch of unsafe (for using functions like slice::get_unchecked() or going down to raw pointers) and the possibility of using SIMD instructions (via faster or stdsimd directly). But in the end this should be a last resort for those small parts of your code where optimizations are needed and the compiler can’t easily be convinced to do them for you.
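For illustration only, here is a sketch of what that unsafe escape hatch could look like for the body of our inner loop (this is not what the final code uses, and it relies entirely on the surrounding code guaranteeing that each chunk really is 4 bytes long):

// Sketch only: skips all bounds checks. This is undefined behaviour if
// in_p or out_p are ever shorter than 4 bytes, so the surrounding code
// must guarantee that invariant.
let (b, g, r, x) = unsafe {
    (
        u32::from(*in_p.get_unchecked(0)),
        u32::from(*in_p.get_unchecked(1)),
        u32::from(*in_p.get_unchecked(2)),
        u32::from(*in_p.get_unchecked(3)),
    )
};

let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
let grey = grey as u8;

unsafe {
    *out_p.get_unchecked_mut(0) = grey;
    *out_p.get_unchecked_mut(1) = grey;
    *out_p.get_unchecked_mut(2) = grey;
    *out_p.get_unchecked_mut(3) = grey;
}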
