[Food for thought] Towards a “vintage”, more elegant alternative to screen tearing

Everyone has encountered the phenomenon of screen tearing at some point: random horizontal breaks appearing in a computer’s video output. If you think you never have, here’s an example of an image with artificially enhanced tearing that is sure to revive some bad video playback memories, along with a tearing-free version of the same image for comparison. In this post, I’m going to discuss why tearing happens, how it can sometimes be avoided, and an interesting fallback method to reduce its visual impact in situations where it can’t.

Why tearing happens

While pretty much everyone around here has probably dealt with tearing at some point, what you may not know is that at the core of screen tearing lies nothing but a good old race condition. Basically, when a moving picture is displayed on screen, software continuously blits new pictures into video memory, whose contents are in turn regularly read out to the screen by the GPU. However, there is no synchronization mechanism in place to ensure that both events cannot happen at the same time. Consequently, the GPU will regularly display half-drawn pictures on the screen, resulting in the jagged picture effect shown earlier.

If the software-side picture drawing routine and the GPU-side screen refresh operated at exactly the same rate, moving pictures would exhibit a static jagged line, corresponding to the position the software blit has reached at the moment the screen is refreshed. Most of the time they don’t, so the line between what has been updated and what hasn’t is constantly moving, both figuratively and literally.

Avoiding tearing entirely

Solutions against screen tearing do exist. The first one is vertical synchronization (VSync), which revolves around the GPU notifying software in some way of the periods during which it is not refreshing the screen. That is not a very good solution, because it gives software relatively little time to draw a picture on screen. It is, however, remarkably easy to implement in hardware (just route a screen refresh clock to a CPU interrupt line), so GPU manufacturers like to leave you with that. And here I am assuming that they bother to actually wire up that clock interrupt; sometimes software has to constantly poll the video hardware until it receives the VSync green light, wasting precious CPU cycles.
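
To make the polling scenario concrete, here is a minimal sketch in C, assuming x86 hardware that still exposes the VGA-compatible input status register at I/O port 0x3DA (bit 3 of which is set during vertical retrace) and an inb() port-read helper supplied by the kernel; the port number, bit position and helper are assumptions, not something every modern card is guaranteed to honour.

    /* Busy-wait for vertical retrace by polling the VGA input status register. */
    #include <stdint.h>

    static inline uint8_t inb(uint16_t port) {
        uint8_t value;
        __asm__ volatile("inb %1, %0" : "=a"(value) : "Nd"(port));
        return value;
    }

    #define VGA_INPUT_STATUS 0x3DA
    #define VSYNC_BIT        0x08

    void wait_for_vsync(void) {
        /* Wait for any ongoing retrace to end... */
        while (inb(VGA_INPUT_STATUS) & VSYNC_BIT) { }
        /* ...then wait for the next one to begin. */
        while (!(inb(VGA_INPUT_STATUS) & VSYNC_BIT)) { }
    }

Software would call wait_for_vsync() right before blitting and then race to finish before the retrace period ends, which is exactly the tight timing budget criticized above.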

The second solution is multiple buffering, which revolves around having at least two video memory buffers and a pointer that can be made to point to either of them. In this scheme, software draws into one buffer while the screen is refreshed from the other, then the buffer pointer is flipped and the next screen refresh uses the new image. Using more than two buffers lets software keep drawing when it has finished one picture but the current screen refresh is not over yet, allowing continuous operation of the software blitter and thus better display responsiveness.
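
A minimal sketch of the flip, assuming a hypothetical set_display_start() driver hook that tells the GPU which buffer the next screen refresh should scan out from (buffer sizes and names are illustrative):

    #include <stdint.h>

    #define WIDTH  1024
    #define HEIGHT 768

    static uint32_t buffer_a[WIDTH * HEIGHT];
    static uint32_t buffer_b[WIDTH * HEIGHT];

    /* Hypothetical GPU driver hook: points scanout at a new buffer, ideally
       taking effect on the next vertical retrace. */
    extern void set_display_start(const uint32_t *buffer);

    void present_frames(void (*draw_frame)(uint32_t *pixels)) {
        uint32_t *back  = buffer_a;   /* software draws here                    */
        uint32_t *front = buffer_b;   /* the GPU is currently scanning this out */

        for (;;) {
            draw_frame(back);         /* render the next picture off-screen     */
            set_display_start(back);  /* flip: the back buffer becomes visible  */

            uint32_t *tmp = front;    /* swap roles for the next frame          */
            front = back;
            back  = tmp;
        }
    }

With a third buffer, draw_frame() could start on yet another picture instead of waiting for the flip to take effect, which is the responsiveness gain mentioned above.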

Multiple buffering is by far the best solution against tearing, but it requires that GPU manufacturers stop hardcoding memory addresses in their hardware and learn the use of pointers. For someone who cannot even follow a standards document, that’s an incredibly hard task, and so they generally end up either giving up or implementing it in a nonstandard way. Because although you may not know it, multiple buffering really is a key area of GPU differentiation that just cannot be implemented the same way as the neighbouring manufacturer does it. Yeah, right…

Noise as a fallback path

So in effect, there are three possible situations:

  • We have a reliable multiple buffering mechanism in place for the current GPU, and we know how to use it: everything is fine, no tearing will occur. This will rarely happen, and when it does it will likely require GPU-specific drivers.
  • We have a vertical synchronization mechanism in place, we know how to use it, and we can draw video frames fast enough: no tearing will occur, but we may have to hog the CPU for polling purposes, which is not always appropriate.
  • We have neither multiple buffering nor VSync paired with an extremely optimized blitter: tearing is bound to occur at some point. This is likely to be the common case for a hobby OS project like TOSP.

In the latter two cases, we may decide that some amount of screen artifacts is acceptable after all. But it would be preferable to get a visual artifact that is more subtle than regular tearing’s jagged lines. Here, I’ll show that by playing with the persistence of human vision, this can be done, provided you can afford higher CPU or GPU costs at the video memory blitting step.

To understand why, we have to go back to the question of what makes screen tearing so blatantly obvious to the eye. The reason is that our visual system is extremely good at shape analysis, and one of the easiest shapes for it to pick out is a jagged line, which is precisely what we’re dealing with when tearing occurs, and thus what makes the effect so visible. My claim, then, is that if tearing artifacts were replaced by a fully random mix of pixels from the old and the new picture, the effect would be more subtle, especially when we’re talking about pictures that each last at most 1/60th of a second.

To convince yourself of that, compare the previous tearing simulation with a variant where tearing is replaced by random noise in the updated screen region. Obviously, there still are artifacts, but of a different kind, similar to those of vintage video tapes, which in my opinion works better on moving pictures. Note that for the examples above, the rate at which your computer can display animated GIF frames directly determines how smooth the noise-induced blur looks. You may want to display these pictures at smaller sizes, or even save them and open them in other software that can display animated GIFs. If I had more time, I’d build proper videos, but for now I don’t have the will to learn how to do that.

Producing noise in practice

How tearing can be replaced by noise depends on how the system’s picture blitter works. If the fastest way to blit a picture into video memory is to fill individual pixels or small chunks thereof, then the noise effect shown earlier is trivially achieved by randomizing the order in which pixels are drawn when a picture is blitted. This has some performance cost, since a lookup table has to be traversed and computer memories work faster when they are accessed sequentially, but it shouldn’t add too much overhead considering how slow communication between the CPU and video memory is to begin with.
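
Here is a minimal sketch of that idea, with a pixel permutation shuffled once at startup (a Fisher–Yates shuffle) and then reused for every blit; the resolution, the use of rand() as a randomness source, and the flat 32-bit framebuffer layout are all illustrative assumptions:

    #include <stdint.h>
    #include <stdlib.h>

    #define WIDTH  640
    #define HEIGHT 480
    #define PIXELS (WIDTH * HEIGHT)

    /* Lookup table: the order in which pixels are written during a blit. */
    static uint32_t blit_order[PIXELS];

    void init_blit_order(void) {
        for (uint32_t i = 0; i < PIXELS; i++)
            blit_order[i] = i;
        for (uint32_t i = PIXELS - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            uint32_t j = (uint32_t)rand() % (i + 1);
            uint32_t tmp = blit_order[i];
            blit_order[i] = blit_order[j];
            blit_order[j] = tmp;
        }
    }

    /* Copy the new picture into video memory one pixel at a time, in a random
       order, so that a refresh caught mid-blit shows noise instead of a line. */
    void noisy_blit(volatile uint32_t *video_memory, const uint32_t *new_picture) {
        for (uint32_t i = 0; i < PIXELS; i++) {
            uint32_t p = blit_order[i];
            video_memory[p] = new_picture[p];
        }
    }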

Note that in this scenario, each individual frame displayed by the screen will contain a different, slowly varying fraction of the initial and final pictures, so the actual effect will be more subtle than constant random noise, in turn making shapes significantly more recognizable. Here’s an example, to be compared with the random noise scenario (again, try it at smaller sizes and/or with a standalone GIF viewer). If we know the screen refresh rate, we can deliberately run video memory updates at a slightly lower or higher rate, so as to control how fast the displayed fraction of new vs. old picture changes. This will require some fine-tuning, though, because if that fraction changes too slowly, it will produce visible variations of the noise contrast over time.
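
As a sketch of that pacing idea, one could offset the update rate from the refresh rate by a small, tunable amount; the 2% offset and the sleep_ns() timer helper below are assumptions, not anything prescribed above:

    #include <stdint.h>

    extern void sleep_ns(uint64_t ns);   /* placeholder for an OS timer API */

    /* Run video memory updates slightly slower than the screen refresh so the
       old/new pixel mix visible on each refresh drifts at a controlled speed. */
    void pace_updates(double refresh_hz, void (*update_video_memory)(void)) {
        const double   update_hz = refresh_hz * 0.98;  /* ~2% below refresh */
        const uint64_t period_ns = (uint64_t)(1e9 / update_hz);

        for (;;) {
            update_video_memory();   /* e.g. the noisy_blit() sketched above */
            sleep_ns(period_ns);     /* a real implementation would use absolute
                                        deadlines to account for blit duration */
        }
    }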

If dedicated GPU drivers are available, however, or if you are dealing with a computer that has really smart firmware, there may be an integrated blitter that is much faster at committing whole pictures to video memory than individual pixels. In this case, the random noise effect can be approximated through the use of several intermediate frames per picture, each frame representing a different mix between the picture that was initially displayed and the picture that is about to be displayed. Such intermediate pictures can be computed very quickly using pre-computed noisy binary masks. The number of intermediate frames should probably depend on how fast the GPU can actually do the blitting: the more intermediate frames, the less noticeable tearing within them will be, but the higher the cost.
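
A possible sketch of that masked blending, assuming a hypothetical fast_blit() whole-frame blitter and illustrative sizes; each pixel is assigned a random step at which it switches from the old picture to the new one, so the fraction of new pixels grows smoothly across the intermediate frames:

    #include <stdint.h>
    #include <stdlib.h>

    #define PIXELS    (640 * 480)
    #define NUM_MASKS 4        /* intermediate frames per picture change */

    /* masks[k][p] == 1 means pixel p already shows the new picture at step k. */
    static uint8_t masks[NUM_MASKS][PIXELS];

    void init_masks(void) {
        for (int p = 0; p < PIXELS; p++) {
            int flip_step = rand() % (NUM_MASKS + 1);  /* when this pixel flips */
            for (int k = 0; k < NUM_MASKS; k++)
                masks[k][p] = (k >= flip_step);
        }
    }

    extern void fast_blit(const uint32_t *frame);      /* hypothetical blitter */

    void present_with_noise(const uint32_t *old_pic, const uint32_t *new_pic,
                            uint32_t *scratch) {
        for (int k = 0; k < NUM_MASKS; k++) {
            for (int p = 0; p < PIXELS; p++)
                scratch[p] = masks[k][p] ? new_pic[p] : old_pic[p];
            fast_blit(scratch);                        /* push intermediate mix */
        }
        fast_blit(new_pic);                            /* finish with the clean picture */
    }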

6 thoughts on “[Food for thought] Towards a “vintage”, more elegant alternative to screen tearing”

  1. Alfman April 25, 2013 / 10:55 pm

    Hadrien,

    I wouldn’t have thought to produce noise to hide the scanline race. I mostly remember screen tearing being a big problem with the 8 bit palette color modes. If the palette was continuously updated (I have some demos that did this), the tearing could happen across most of the screen.

    You are targeting VBE 2.0 right? Are you sure there exists VESA hardware that doesn’t support page flipping? I don’t know what the caveats are.

    http://www.petesqbsite.com/sections/tutorials/tuts/vbe3.pdf

    FUNCTION 07H – SET/GET DISPLAY START

    This required function selects the pixel to be displayed in the upper left corner of the display. This
    function can be used to pan and scroll around logical screens that are larger than the displayed
    screen. This function can also be used to rapidly switch between two different displayed screens
    for double buffered animation effects.
    For the VBE 2.0 32-bit protected mode version, the value passed in DX:CX is the 32 bit offset in
    display memory, aligned to a plane boundary. For planar modes this means the value is the byte
    offset in memory, but in 8+ bits per pixel modes this is the offset from the start of memory divided
    by 4. Hence the value passed in is identical to the value that would be programmed into the
    standard VGA CRTC start address register. Note that it is up to the protected mode application
    to keep track of the color depth and scan line length to calculate the new start address. If a value
    that is out of range is programmed, unpredictable results will occur. For VBE 3.0 the application
    program may optionally pass the missing two bits of information in the top two bits of DX, to
    allow for pixel perfect horizontal panning. For example (32-bit protected mode interface only):

    VBE 3.0 defines seven new subfunctions (02h, 03h, 04h, 05h, 06h, 82h, 83h) to support hardware
    triple buffering and stereoscopic LC shutter glasses. Functions 02h and 03h schedule a display
    start address change to occur during the next vertical retrace, and returns immediately. Function
    04h can then be used by the application to determine if the scheduled flip has occurred or not,
    which can be used for hardware triple buffering to avoid writing to the page being displayed by
    the CRT controller. Functions 04h and 05h are used to enable and disable free running hardware
    stereoscopic mode. Functions 82h and 83h schedule the display start address change to occur, and
    then wait until the address has changed before returning.

    I wasn’t able to test this software, but he describes using the function to achieve smooth animation.

    http://gameprogrammer.com/1-vbe.html

    This function lets you chose which part of the logical display is viewable on the screen. You can use this function to do smooth scrolling or to do “page flipping” to get smooth animation. The setVisiblePage function in vg.cpp uses this function for just that purpose and the pixel and rect demos in demos.cpp use setVisiblepage. Look there for examples of how to use function 07h.

  2. Hadrien May 14, 2013 / 8:26 am

    Regarding framebuffer support, I’m planning to support VBE 2+ first, add support for the UEFI equivalent (called Graphical Output Protocol) later, and then leave room for other standard framebuffer devices that may emerge in the future.

    I still don’t know if I will support run-time calls to VBE functions, since it involves writing a full x86 real mode emulator, as the protected mode interface to VBE does not work in 64-bit mode. UEFI GOP, for its part, forbids run-time calls to its functions by design.

    After looking through the VBE spec, it seems that the VBE 2.0 function 07h may or may not work for multiple buffering purposes, in a hardware-dependent way. To clarify, the spec only specifies that it will change the base address of the CRT controller’s framebuffer, and does not explicitly state whether the GPU firmware will wait for pending display refreshes to complete before doing so. If it doesn’t, then tearing will still occur.

    This is different from the triple buffering extensions introduced by VBE 3.0, which explicitly support synchronization with display refreshes. Alas, according to Brendan’s tests, not all modern GPU hardware bothers to support VBE 3.0, perhaps because that spec was released at a time when such BIOS extensions were already falling into disuse, replaced by proprietary hardware drivers.

    Thus, I’ll probably still need some kind of fallback mode against tearing on a lot of hardware, even if I can also support more optimal solutions on some of it.

  3. Alfman May 14, 2013 / 10:33 pm

    Hadrien,

    “I still don’t know if I will support run-time calls to VBE functions, since it involves writing a full x86 real mode emulator, as the protected mode interface to VBE does not work in 64-bit mode. UEFI GOP, for its part, forbids run-time calls to its functions by design.”

    I wouldn’t even attempt the route of a real mode emulator, even the real mode BIOS calls may be switching to PM internally to access the memory mapped display buffer.

    Are you sure you cannot run a 32bit PM firmware on AMD64 using the exact same mechanics that you’d use to run a 32bit PM program on AMD64? Doesn’t 64bit linux still support vesa? I really don’t know the answers to these questions, but it’d be interesting to learn about.

    “After looking through the VBE spec, it seems that the VBE 2.0 function 07h may or may not work for multiple buffering purposes, in a hardware-dependent way. To clarify, the spec only specifies that it will change the base address of the CRT controller’s framebuffer, and does not explicitly state whether the GPU firmware will wait for pending display refreshes to complete before doing so. If it doesn’t, then tearing will still occur.”

    That doesn’t sound right, VBE1.2 function 7 called with BX = 80h is supposed to “Set Display Start during Vertical Retrace”.

    “This is different from the triple buffering extensions introduced by VBE 3.0, which explicitly support synchronization with display refreshes. Alas, according to Brendan’s tests, not all modern GPU hardware bothers to support VBE 3.0, perhaps because that spec was released at a time when such BIOS extensions were already falling into disuse, replaced by proprietary hardware drivers.”

    I don’t really think you need the VBE 3.0 extensions. The document I linked earlier lists some of the new sub-functions in VBE 3.0 (BX = 82h, 83h); according to the doc (on page 51):

    “Functions 02h and 82h are preferable because they allow for correct page flipping operation in all color depths. Functions 00h and 80h have problems in 24bpp modes where each pixel is represented as three bytes, since there are some combinations of (x,y) starting addresses that may not map”.

    For page flipping this limitation isn’t really applicable, since your driver can choose arbitrary addresses that do map, and 24-bit modes are rarely used anyway. I found the section on “Using Hardware Triple Buffering” informative as well. It seems like it should work in a hardware-independent way. The one thing that disappoints me greatly is that there’s no standard VESA callback mechanism for screen refresh, only polling. Nevertheless, I’d wager that the screen’s refresh rate is predictable enough that you could sync a system timer with it with high accuracy (like a virtual phase-locked loop). Sure, it’s a hack, but if such a hack works for high resolution audio drivers (PulseAudio), then it should be a piece of cake for a 60–80 Hz refresh signal.

    “Thus, I’ll probably still need some kind of fallback mode against tearing on a lot of hardware, even if I can also support more optimal solutions on some of it.”

    I always look forward to reading about what you come up with :)

  4. Hadrien May 14, 2013 / 11:30 pm

    I wouldn’t even attempt the route of a real mode emulator, even the real mode BIOS calls may be switching to PM internally to access the memory mapped display buffer.

    Are you sure you cannot run a 32bit PM firmware on AMD64 using the exact same mechanics that you’d use to run a 32bit PM program on AMD64? Doesn’t 64bit linux still support vesa? I really don’t know the answers to these questions, but it’d be interesting to learn about.

    I remember reading on OSdev about the protected mode VESA interface relying on x86’s virtual 8086 mode, which is not supported by long mode’s protected mode emulation. The VBE spec also suggests this by referring to “dual-mode BIOS code”, which is designed to run in “either real mode or 16-bit protected mode”. No absolute certitude at this point though.

    What I know is that on Linux x64, VESA calls are made using an x86 emulator called v86d, as alluded to here. I don’t know how much work it would be to port that to a non-linux system, but it would probably be interesting to have a look at the source.

    That doesn’t sound right, VBE1.2 function 7 called with BX = 80h is supposed to “Set Display Start during Vertical Retrace”.

    Indeed, I mixed that one up with the VBE 3.0 functions 82h and 83h, my mistake. Then if the display refresh rate is known (and it should be), double buffering should be possible on VBE 2.0, using a CPU timer together with that function (see the sketch at the end of this comment).

    I don’t really think you need the VBE 3.0 extensions. The document I linked earlier lists some of the new sub-functions in VBE 3.0 (BX = 82h, 83h); according to the doc (on page 51):

    “Functions 02h and 82h are preferable because they allow for correct page flipping operation in all color depths. Functions 00h and 80h have problems in 24bpp modes where each pixel is represented as three bytes, since there are some combinations of (x,y) starting addresses that may not map”.

    For page flipping this limitation isn’t really applicable, since your driver can choose arbitrary addresses that do map, and 24-bit modes are rarely used anyway.

    Really? I thought that using these was common at some point, since these modes offer the most memory-efficient packing of screen pixels (even if they become a pain to access along the way). Are 32-bit modes always available nowadays?

    I found the section on “Using Hardware Triple Buffering” informative as well. It seems like it should work in a hardware-independent way. The one thing that disappoints me greatly is that there’s no standard VESA callback mechanism for screen refresh, only polling. Nevertheless, I’d wager that the screen’s refresh rate is predictable enough that you could sync a system timer with it with high accuracy (like a virtual phase-locked loop). Sure, it’s a hack, but if such a hack works for high resolution audio drivers (PulseAudio), then it should be a piece of cake for a 60–80 Hz refresh signal.

    Totally agree. As long as the screen refresh rate is known, capping display updates to it should be a piece of cake with modern nanosecond-accurate APIC CPU timers.

    If the screen refresh rate isn’t known for some reason, perhaps a decent fallback would be to assume a refresh rate of 60 Hz, which is the lowest (and most common) refresh rate encountered on modern monitors. Doing it that way would prevent users from benefitting from faster refresh rates when they are available, but then again I’m not sure if it’s possible to go faster with software rendering anyway.
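
    As a rough illustration of what such a flip could look like through the VBE 2.0 interface, here is a sketch, assuming a hypothetical realmode_int10() helper (e.g. backed by an x86 emulator such as v86d) that runs INT 10h with the supplied register values; the register layout follows the function 07h description quoted earlier in this thread:

        #include <stdint.h>

        struct realmode_regs { uint16_t ax, bx, cx, dx; };

        /* Hypothetical helper: executes INT 10h in real mode (or an emulator)
           with the given register values. */
        extern void realmode_int10(struct realmode_regs *regs);

        /* VBE function 07h, subfunction 80h: set display start during vertical
           retrace. Flipping between scan line 0 and the visible height gives
           double buffering, assuming a logical screen two screens tall. */
        void vbe_flip_to_line(uint16_t first_scan_line) {
            struct realmode_regs regs = {
                .ax = 0x4F07,           /* VBE set/get display start          */
                .bx = 0x0080,           /* subfunction: set during retrace    */
                .cx = 0,                /* first displayed pixel in scan line */
                .dx = first_scan_line,  /* first displayed scan line          */
            };
            realmode_int10(&regs);
        }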

  5. Alfman May 15, 2013 / 7:12 am

    Hadrien,

    “Really? I thought that using these was common at some point, since these modes offer the most memory-efficient packing of screen pixels (even if they become a pain to access along the way). Are 32-bit modes always available nowadays?”

    I have no idea :(
    None of my current cards support 24bit modes. When I did video programming back in the day, my software buffers were 32bit, but the blitter could handle conversion to other pixel representations. This saved me from needing to implement a multitude of drawing routines and having to bitmask the color channels on every pixel as every color channel was always a byte.

    With the 16bit blitters, color banding was extremely noticeable unless I dithered the output first to average color representation across more pixels at the expense of pixel definition. This worked extremely well, at least for natural photos. I’ve heard of adaptive algorithms that can use dithering in scenes with lots of gradients and switch to accurate pixel modes for text.

    “If the screen refresh rate isn’t known for some reason, perhaps a decent fallback would be to assume a refresh rate of 60 Hz, which is the lowest (and most common) refresh rate encountered on modern monitors.”

    Why not poll the VESA refresh status function to take an accurate measurement? You could also re-poll VESA periodically, slightly ahead of and slightly behind the scheduled interval, to catch timer drift. I’m assuming you can solve the 64-bit VESA call problem. If not, we talked about initializing graphics before going into 64-bit mode; you might also initialize the refresh rate timer before entering 64-bit mode, albeit without the benefit of resyncing it later.
