Blitting around

Today, as another filler until I get enough time to resume OSdeving, I had some time to try out a quick performance test that has important implications for PC OSdeving : is it still possible, as of today, to get decent graphics performance out of software rendering in case a dedicated GPU driver is not available ? I’m not talking about displaying millions of triangles per second here, just displaying bitmap animations at a modern screen’s native refresh rate. Well, let’s find out !

Testing scenario

To answer this question, I decided to test the blitting (bitmap copy) performance of a relatively modern CPU, my home computer’s Intel Core i3-3220, using various blitting algorithms ranging from the simplest to the most sophisticated, and making variable amounts of assumptions.

To be representative of state-of-the-art computer screens, my performance target was to achieve 100 full screen refreshes per second on the buffer equivalent of a 4K screen (4096×2160 pixels) operating at 32 bits per pixels. To prevent excessive compiler optimization from biasing the test, and to help with code testing, the input data was marked volatile and randomized.

To put things into perspective, such blitting performance requires moving data around at 7.1 GB/s (accounting for both read and writes), which is a bit less than a quarter of this CPU’s theoretical maximal memory bandwidth (25.6 GB/s). But of course, the memory bus is only part of the story, and when other considerations such as TLB misses, cache invalidation or precise RAM characteristics are taken into account, it remains to be seen whether such performance is achievable. Enter this test.

The test program was written in C++11 and AMD64 assembly. It makes use of GCC intrinsics for low-level data manipulation. The algorithms I experimented with were the following :

  • Naive C array copy from the source buffer to the destination (“the compiler will optimize it out”), with optional OpenMP parallelization from which I expected little gain as the performance is unlikely to be compute-bound.
  • Standard C and C++ library functions, memcpy and std::copy()
  • Standard AMD64 string copy instruction REP MOVSQ
  • Optimized SSE2 assembly (assuming 128-bit data alignment)
  • Unoptimized SSE2 assembly code (no alignment assumptions)
  • Optimized AVX assembly code (assuming 256-bit data alignment)
  • Unoptimized AVX assembly code (no alignment assumptions)

As low-level assembly code is notoriously tricky to write, I added a blitter checking mode to the test program in order to assert that the blitter preconditions are met and that the random input buffer data is properly copied to the output buffer at the end of the algorithm.

Show me the code !

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <stdint.h>
#include <string.h>

// === PROGRAM PARAMETERS AND GLOBAL VARIABLES ===

// Screen target parameters
using RGBAColor = uint32_t;
const size_t SCREEN_WIDTH = 4096;
const size_t SCREEN_HEIGHT = 2160;
const int TARGET_FRAME_RATE = 100;

// Algorithm parameters go here
const int AVERAGING = 1000;
#define CURRENT_BLITTER blit_avx_unaligned
#define DATA_ALIGNMENT 32
//#define CHECK_BLITTER

// The backbuffer must be volatile to prevent undue compiler optimizations
const size_t TOTAL_PIXELS = SCREEN_WIDTH * SCREEN_HEIGHT;
volatile RGBAColor __attribute__ ((aligned (DATA_ALIGNMENT))) backbuffer[TOTAL_PIXELS];
RGBAColor __attribute__ ((aligned (DATA_ALIGNMENT))) framebuffer[TOTAL_PIXELS];


// === BLITTING FUNCTIONS ===

// Naive blitter
void blit_naive(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
    #pragma omp for
    for(size_t pixel = 0; pixel < pixel_count; ++pixel)
    {
        framebuffer[pixel] = backbuffer[pixel];
    }
}

// memcpy-based blitter
void blit_memcpy(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
    memcpy((void *)output, (void *)input, pixel_count*sizeof(RGBAColor));
}

// C++ std::copy blitter
void blit_stdcopy(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
    std::copy(output, output+pixel_count, input);
}

// Standard x86 quadword blitter. It assumes that the host is 64-bit and that pixel_count*sizeof(RGBAColor) is a multiple of 8.
// NOTE : This blitter must be compiled at O0 optimization or GCC will optimize it out :-(
void blit_movsq(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
#ifdef CHECK_BLITTER
    // Blitter precondition checks
    std::cout << "Checking MOVSQ blitter preconditions...";
    if(sizeof(void *) != sizeof(uint64_t)) std::abort();
    if(pixel_count*sizeof(RGBAColor) % 8) std::abort();
    std::cout << " SUCCESS" << std::endl;
#endif

    // This is the MOVSQ blitter code
    size_t qword_amount = pixel_count*sizeof(RGBAColor)/8;
    __asm__ __volatile__ (
        // RSI is the source pointer, RDI is the destination pointer, RCX holds the amount of iterations
        "REP MOVSQ;"
        : "+S"(input), "+D"(output)
        : "c"(qword_amount)
        : "cc", "memory"
    );
}

// SSE2 optimized blitter. It assumes that the host is 64-bit, that framebuffer addresses are 128-bit aligned, and that pixel_count*sizeof(RGBAColor) is a multiple of 16.
void blit_sse2(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
#ifdef CHECK_BLITTER
    // Blitter precondition checks
    std::cout << "Checking optimized SSE2 blitter preconditions...";
    if(sizeof(void *) != sizeof(uint64_t)) std::abort();
    if((uint64_t)input % 16) std::abort();
    if((uint64_t)output % 16) std::abort();
    if(pixel_count*sizeof(RGBAColor) % 16) std::abort();
    std::cout << " SUCCESS" << std::endl;
#endif

    // This is the SSE2 blitter code
    volatile RGBAColor* input_end = input + pixel_count;
    __asm__ __volatile__(
        // RAX is the source pointer, RBX is the final source pointer, RCX is the destination pointer, XMM0 is used to hold memory fetches
        "1:"
        "MOVDQA  (%%rax), %%xmm0;"
        "MOVNTDQ %%xmm0, (%%rcx);"
        "ADDQ $16, %%rax;"
        "ADDQ $16, %%rcx;"
        "CMPQ %%rax, %%rbx;"
        "JNE 1b;"
        : "+a"(input), "+c"(output)
        : "b"(input_end)
        : "cc", "xmm0", "memory"
    );
}

// SSE2 unaligned blitter. Still assumes that the host is 64-bit and that pixel_count*sizeof(RGBAColor) is a multiple of 16, but removes pointer alignment constraints.
void blit_sse2_unaligned(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
#ifdef CHECK_BLITTER
    // Blitter precondition checks
    std::cout << "Checking unaligned SSE2 blitter preconditions...";
    if(sizeof(void *) != sizeof(uint64_t)) std::abort();
    if(pixel_count*sizeof(RGBAColor) % 16) std::abort();
    std::cout << " SUCCESS" << std::endl;
#endif

    // This is the SSE2 blitter code
    volatile RGBAColor* input_end = input + pixel_count;
    __asm__ __volatile__(
        // RAX is the source pointer, RBX is the final source pointer, RCX is the destination pointer, XMM0 is used to hold memory fetches
        "1:"
        "MOVDQU (%%rax), %%xmm0;"
        "MOVDQU %%xmm0, (%%rcx);"
        "ADDQ $16, %%rax;"
        "ADDQ $16, %%rcx;"
        "CMPQ %%rax, %%rbx;"
        "JNE 1b;"
        : "+a"(input), "+c"(output)
        : "b"(input_end)
        : "cc", "xmm0", "memory"
    );
}

// AVX optimized blitter. Assumes that the host is 64-bit, that framebuffer addresses are 256-bit aligned, and that pixel_count*sizeof(RGBAColor) is a multiple of 32.
void blit_avx(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
#ifdef CHECK_BLITTER
    // Blitter precondition checks
    std::cout << "Checking AVX blitter preconditions...";
    if(sizeof(void *) != sizeof(uint64_t)) std::abort();
    if((uint64_t)input % 32) std::abort();
    if((uint64_t)output % 32) std::abort();
    if(pixel_count*sizeof(RGBAColor) % 32) std::abort();
    std::cout << " SUCCESS" << std::endl;
#endif

    // This is the AVX blitter code
    volatile RGBAColor* input_end = input + pixel_count;
    __asm__ __volatile__(
        // RAX is the source pointer, RBX is the final source pointer, RCX is the destination pointer, YMM0 is used to hold memory fetches
        "1:"
        "VMOVDQA  (%%rax), %%ymm0;"
        "VMOVNTDQ %%ymm0, (%%rcx);"
        "ADDQ $32, %%rax;"
        "ADDQ $32, %%rcx;"
        "CMPQ %%rax, %%rbx;"
        "JNE 1b;"
        : "+a"(input), "+c"(output)
        : "b"(input_end)
        : "cc", "ymm0", "memory"
    );
}

// AVX unaligned blitter. Still assumes that the host is 64-bit and that pixel_count*sizeof(RGBAColor) is a multiple of 32, but removes pointer alignment constraints.
void blit_avx_unaligned(volatile RGBAColor input[], RGBAColor output[], size_t pixel_count)
{
#ifdef CHECK_BLITTER
    // Blitter precondition checks
    std::cout << "Checking unaligned AVX blitter preconditions...";
    if(sizeof(void *) != sizeof(uint64_t)) std::abort();
    if(pixel_count*sizeof(RGBAColor) % 32) std::abort();
    std::cout << " SUCCESS" << std::endl;
#endif

    // This is the AVX blitter code
    volatile RGBAColor* input_end = input + pixel_count;
    __asm__ __volatile__(
        // RAX is the source pointer, RBX is the final source pointer, RCX is the destination pointer, YMM0 is used to hold memory fetches
        "1:"
        "VMOVDQU (%%rax), %%ymm0;"
        "VMOVDQU %%ymm0, (%%rcx);"
        "ADDQ $32, %%rax;"
        "ADDQ $32, %%rcx;"
        "CMPQ %%rax, %%rbx;"
        "JNE 1b;"
        : "+a"(input), "+c"(output)
        : "b"(input_end)
        : "cc", "ymm0", "memory"
    );
}


// === PERFORMANCE TESTS ===

int main()
{
    // Prepare for performance measurements
    std::cout << "Evaluating software blitting viability at resolution " << SCREEN_WIDTH << "x" << SCREEN_HEIGHT << std::endl;
    namespace chrono = std::chrono;
    chrono::time_point<chrono::high_resolution_clock> start, end;

    // Fill the backbuffer with garbage
    std::cout << "Filling the backbuffer..." << std::endl;
    std::random_device rd;
    std::default_random_engine eng(rd());
    std::uniform_int_distribution<RGBAColor> uniform_dist(0, 0xffffffff);
    for(size_t pixel = 0; pixel < TOTAL_PIXELS; ++pixel)
    {
        backbuffer[pixel] = uniform_dist(eng);
    }

    // Measure the time it takes to blit a frame
    std::cout << "Blitting " << AVERAGING << " frames..." << std::endl;
    start = chrono::high_resolution_clock::now();
    #pragma omp parallel default(none) shared(backbuffer, framebuffer)
    {
        for(int i = 0; i < AVERAGING; ++i)
        {
            // Run the blitter
            CURRENT_BLITTER(backbuffer, framebuffer, TOTAL_PIXELS);
            
            #ifdef CHECK_BLITTER
                // Optionally check proper blitter operation and abort
                std::cout << "Checking blitter results...";
                for(size_t pixel = 0; pixel < TOTAL_PIXELS; ++pixel)
                {
                    if(framebuffer[pixel] != backbuffer[pixel]) std::abort();
                }
                std::cout << " SUCCESS" << std::endl;
                std::abort();
            #endif
        }
    }
    end = chrono::high_resolution_clock::now();

    // Analyze results
    chrono::duration<double> elapsed_seconds = end - start;
    double total_seconds = elapsed_seconds.count();
    double seconds_per_frame = total_seconds / (double)AVERAGING;
    double frames_per_second = 1.0d / seconds_per_frame;

    // Display them
    std::cout << "=== TEST COMPLETE === " << std::endl;
    std::cout << "Total blitting time : " << total_seconds << " s" << std::endl;
    std::cout << "Average time per frame : " << seconds_per_frame << " s" << std::endl;
    std::cout << "Frame rate : " << frames_per_second << " FPS" << std::endl;
    std::cout << "Performance goal met : " << (frames_per_second > TARGET_FRAME_RATE ? "YES" : "NO") << std::endl;

    return 0;
}

Results

This code was compiled with gcc version 4.9.3, using flags “–std=c++11 -march=native”, and optionally “-fopenmp” for the OpenMP tests. The blitter was selected by changing the CURRENT_BLITTER define and recompiling the program.

C/++ algorithms were tested at O3 optimization level, while assembly code was tested at O0 optimization level to prevent the compiler from being a bit too clever and removing the blitting loop altogether. I also tested the naive algorithm without any compiler optimization enabled out of curiosity.

Performance results had, on my Linux x86_64 machine, an accuracy of about 0.1 ms/frame. Here they are, with FPS numbers rounded towards 0 :

  • Naive algorithm : 18.5 ms/frame (54 FPS) without optimization, 11.6 ms/frame (86 FPS) with O3 optimization, 12.0 ms/frame (83 FPS) in OpenMP mode.
  • memcpy() : 11.7 ms/frame (85 FPS)
  • std::copy() : 10.5 ms/frame (95 FPS)
  • AMD64 string copy : 8.5 ms/frame (117 FPS)
  • Optimized SSE2 : 7.8 ms/frame (128 FPS)
  • Unoptimized SSE2 : 11.5 ms/frame (86 FPS)
  • Optimized AVX : 7.8 ms/frame (128 FPS)
  • Unoptimized AVX : 11.5 ms/frame (86 FPS)

Discussion

These results are, overall, very interesting.

One can first conclude that using this specific compiler, the performance target is not reachable without using assembly code. It is somewhat surprising that GCC apparently does not use the simple REP MOVSQ memory copy instruction in this scenario, as it is a standard instruction supported by all x86_64 processors, that has no data alignment constraints and achieves quite decent performance when compared to alternatives.

It is also worth emphasizing that when using SIMD instructions to move data around, data alignment is of critical importance. Unaligned SSE2 or AVX memory accesses perform much worse than REP MOVSQ. This is most likely because in the case of the later instruction, the CPU is aware that many memory fetches are incoming and can prefetch aligned blocks of data.

When alignment constraints are met, SSE2 and AVX both seem to saturate the memory bus well, reaching equivalent performance within the measurement margin of error. This allows them to perform slightly better than REP MOVSQ, at the cost of a somewhat lower portability. In the case of SSE2, however, the portability disadvantage is counterbalanced by the fact that this SIMD extension to the x86 instruction set is very old (2001) and supported even by the lowest-end processors these days.

For this specific generation of CPUs (Ivy Bridge), the use of AVX instructions consequently seems unnecessary.

In the end, the stated performance target of 100 refreshes per second at 4K resolution seems attainable on all modern hardware. But there are quite a number of caveats :

  • Performance margins are very tight, as draw code only has 1.5 to 2.2 ms per frame to do its work before a screen refresh must start. This might be fine for simple drawing code (e.g. “UI button changes color”), but is quite problematic for large-scale blitting operations like scrolling.
  • Since x86 CPUs have no DMA controller handy for this kind of task, a CPU core must be kept busy during screen refreshes, possibly consuming a lot of power along the way.
  • Other CPU cores may be accessing the RAM at the same time, degrading the blitting performance accordingly.
  • Due to the above, realistic software rendering code should use damaged display region detection, in order to avoid refreshing screen regions that haven’t changed, and keep full-screen refreshes as a worst case scenario (e.g. during video playback).
  • The most optimized code has a lot of preconditions that must be met in order to reach its peak performance. Failure to meet these constraints will result in instant failure with a #GP CPU exception raised, or an infinite RAM copy loop, both likely leading to program termination.

To rapidly conclude on the other code, one can first point out that OpenMP parallelization is, as expected, useless when it comes to delivering better memory transfer performance. The CPU memory bus is obviously saturated during this test, one way or another, so adding more CPU cores to the task does not help at all and actually hurts performance quite a bit (likely due to API overhead).

Also, the performance difference between memcpy() and std::copy() is interesting. This difference likely originates from the fact that memcpy() must use a generic memory copy algorithm, whereas std::copy, as a C++ template, is aware of some details of the memory being copied, and can thus perform a couple of compile-time optimizations related to e.g. data alignment.

Implications

The main and most important implication of these results is that, on the tested hardware at least, projects aiming at simulating GPU hardware on a CPU like llvmpipe are very much pointless. The memory bus of a modern CPU is simply not fast enough to do the kind of work that is asked of a modern GPU, and adding a layer of hardware emulation to software graphics rendering will only lead to hurt its constrained performance further.

When doing software graphics rendering on a modern screen, performance margins are very tight. This means that any software rendering algorithm must make the most of the underlying hardware’s performance characteristics. As it turns out, when a GPU driver is not available, all a CPU can get is a framebuffer to write bitmaps into (through VESA VBE or UEFI GOP), so we should better make this blitting as fast as possible by acknowledging that it happens under the hood.

This is bad news for GUI toolkit developers, because it means that the grand unification that was promised over and over again between CPU and GPU rendering code is still not happening. What will most likely happen is that

  • Either graphic libraries will keep around two separate code backends for software and hardware UI rendering
  • Or software rendering users will, as usual in the open-source world, be left in the cold and asked to use either software GPU emulation of inadequate performance, or buggy GPU drivers.

Finally, it is important to point out that these issues do not arise on last-generation 1920×1080 screens operating at 60 Hz. On these screens, even an unoptimized naive blitter achieves a decent 4.3 ms/frame (232 FPS), and it is possible to achieve 2.0 ms/frame (500 FPS) with a REP MOVSQ blitter or 1.7 ms/frame (578 FPS) with an AVX blitter.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s