FFmpeg x86 Assembly

2026-02-23

x86 SIMD ffmpeg

Notes from three lectures on writing x86 assembly for FFmpeg (from the FFmpeg asm-lessons). Covers registers, SIMD fundamentals, branching, loops, instruction sets, and common optimization tricks used in video processing.

Registers

What are they?

Registers are areas in the CPU where data can be processed. CPUs don't operate on memory directly:

Data in memory → loaded into registers → processed → written back to memory

Generally in assembly you cannot directly copy data from one memory location to another without first passing that data through a register.

General Purpose Registers (GPRs)

GPRs can contain either data (up to a 64-bit value) or a memory address (a pointer). A value in a GPR can be processed through operations like addition, multiplication, shifting, etc. FFmpeg hides much of the platform-specific complexity of GPRs behind x86inc.asm.

Vector Registers (SIMD)

SIMD registers contain multiple data elements. There are various types:

mm registers — MMX, 64-bit, historical and rarely used
xmm registers — XMM, 128-bit, widely available
ymm registers — YMM, 256-bit, some complications
zmm registers — ZMM, 512-bit, limited availability

Most calculations in video compression/decompression are integer-based.

An xmm register holding 16 bytes can be interpreted as:

16 bytes — 8-bit data
8 words — 16-bit data
4 doublewords — 32-bit data
2 quadwords — 64-bit data
1 double quadword — 128-bit data
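
As a rough illustration of these interpretations, the C sketch below treats 16 raw bytes as a stand-in for one xmm register and reads them back at different lane widths. The names xmm_model, lane_word, lane_dword and lane_count are hypothetical and exist only for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one xmm register: 16 raw bytes. */
typedef struct { uint8_t bytes[16]; } xmm_model;

/* Read lane `i` of the register as a 16-bit word (8 lanes in total). */
static uint16_t lane_word(const xmm_model *r, int i)
{
    uint16_t v;
    memcpy(&v, r->bytes + 2 * i, sizeof v);
    return v;
}

/* Read lane `i` as a 32-bit doubleword (4 lanes in total). */
static uint32_t lane_dword(const xmm_model *r, int i)
{
    uint32_t v;
    memcpy(&v, r->bytes + 4 * i, sizeof v);
    return v;
}

/* Number of lanes for a given element size in bytes. */
static int lane_count(int elem_size)
{
    return (int)sizeof(xmm_model) / elem_size;
}
```

The bits never change; only the instruction chosen (for example paddb versus paddw) decides how wide each lane is.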

x86inc.asm

x86inc.asm is a lightweight abstraction layer used in FFmpeg, x264, and dav1d. One useful thing it does is label GPRs as r0, r1, r2, etc., so you don't have to remember platform-specific register names. GPRs are generally just scaffolding, so this makes life easier.

Simple Scalar Assembly

Scalar assembly operates on individual data items, one at a time:

mov r0q, 3
inc r0q
dec r0q
imul r0q, 5

mov r0q, 3 — immediate value 3 stored into r0 as a quadword
inc r0q — increments r0 to 4
dec r0q — decrements back to 3
imul r0q, 5 — multiplies by 5, r0 now contains 15

Mnemonics

Human-readable instructions like mov and inc, assembled into machine code by the assembler, are known as mnemonics. FFmpeg uses lowercase (you'll see uppercase online).

First SIMD Function

%include "x86inc.asm"

SECTION .text

;static void add_values(uint8_t *src, const uint8_t *src2)
INIT_XMM sse2
cglobal add_values, 2, 2, 2, src, src2
    movu m0, [srcq]
    movu m1, [src2q]

    paddb m0, m1

    movu [srcq], m0

    RET

Breakdown:

%include "x86inc.asm" — include the helper header for macros like cglobal
SECTION .text — executable code section (vs .data for constants)
INIT_XMM sse2 — initialize to use XMM registers with the SSE2 instruction set
cglobal add_values, 2, 2, 2, src, src2 — define function with 2 args, 2 GPRs, 2 XMM registers
movu m0, [srcq] — 128-bit unaligned load from *src (brackets = dereference, like *ptr in C)
paddb m0, m1 — packed bytewise addition. "p" = packed (vector), "b" = byte granularity
movu [srcq], m0 — store result back to *src
RET — function return macro

Note: vector registers are referred to as m0, m1, etc. (abstracted form), not their full names like xmm0. The q suffix on srcq refers to pointer size (64-bit on 64-bit systems), but the underlying load is 128-bit.
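
For comparison, here is a minimal scalar C sketch of what the function above computes. The name add_values_c is an assumption for this example. It handles the same 16 bytes one at a time, and each byte wraps modulo 256 just as paddb does.

```c
#include <stdint.h>

/* Scalar model of the assembly above: add 16 bytes of src2 into src.
 * Each byte wraps modulo 256, matching paddb (no saturation). */
static void add_values_c(uint8_t *src, const uint8_t *src2)
{
    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(src[i] + src2[i]);
}
```

The SIMD version does all 16 additions with a single paddb.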

Branches & Loops

Labels and Jumps

mov r0q, 3
.loop:
    dec r0q
    jmp .loop

jmp moves execution to the label. The dot prefix makes it a local label, allowing the same name across multiple functions. The above is an infinite loop (no boundary check).

FLAGS Register

The FLAGS register contains flags like Zero-Flag, Sign-Flag, and Overflow-Flag, set based on the output of most non-mov scalar instructions (arithmetic, shifts, etc.).

mov r0q, 3
.loop:
    ; do something
    dec r0q
    jg .loop    ; jump if greater than zero

Equivalent C:

int i = 3;
do {
    // do something
    i--;
} while (i > 0);

A more natural C-style for loop:

for (int i = 0; i < 3; i++) {
    // do something
}

Roughly translates to:

xor r0q, r0q
.loop:
    ; do something
    inc r0q
    cmp r0q, 3
    jl .loop    ; jump if (r0q - 3) < 0, i.e. r0q < 3

xor r0q, r0q is a common way to zero a register. On some CPUs it is faster than mov r0q, 0 because no immediate value has to be loaded, and it has a shorter encoding. The same idiom works on SIMD registers: pxor m0, m0.

cmp subtracts the second operand from the first (without storing the result) and sets FLAGS. The extra cmp instruction is why the countdown form is preferred — fewer instructions generally means faster code.

Common Jump Mnemonics

Mnemonic    Description                                     FLAGS
JE/JZ       Jump if Equal/Zero                              ZF = 1
JNE/JNZ     Jump if Not Equal/Not Zero                      ZF = 0
JG/JNLE     Jump if Greater (signed)                        ZF = 0 and SF = OF
JGE/JNL     Jump if Greater or Equal (signed)               SF = OF
JL/JNGE     Jump if Less (signed)                           SF ≠ OF
JLE/JNG     Jump if Less or Equal (signed)                  ZF = 1 or SF ≠ OF
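
To make the FLAGS column concrete, the sketch below models how cmp derives ZF, SF and OF from a 64-bit subtraction and how three of the jumps read them. It is a simplified model with hypothetical names (flags_model, cmp_model, jl_taken and so on), not a description of any real API.

```c
#include <stdbool.h>
#include <stdint.h>

/* The three flags used by the signed jumps in the table. */
typedef struct { bool zf, sf, of; } flags_model;

/* Model of `cmp a, b`: compute a - b, keep only the flags. */
static flags_model cmp_model(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t r  = ua - ub;               /* wraps like the hardware */
    flags_model f;
    f.zf = (r == 0);
    f.sf = (r >> 63) != 0;               /* sign bit of the result */
    /* Signed overflow: operands differ in sign and the result's sign
     * differs from the first operand's. */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 63) != 0;
    return f;
}

static bool jl_taken(flags_model f) { return f.sf != f.of; }
static bool jg_taken(flags_model f) { return !f.zf && f.sf == f.of; }
static bool je_taken(flags_model f) { return f.zf; }
```

Comparing SF with OF, rather than testing SF alone, is what keeps signed comparisons correct when the subtraction overflows, for example when comparing INT64_MIN with 1.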

Constants

SECTION_RODATA

constants_1: db 1,2,3,4
constants_2: times 2 dw 4,3,2,1

SECTION_RODATA — read-only data section
db (declare byte) — equivalent to uint8_t constants_1[4] = {1,2,3,4};
times 2 dw — repeats the declared words: uint16_t constants_2[8] = {4,3,2,1, 4,3,2,1};

These labels (converted to memory addresses by the assembler) can be used in loads but not stores, since they are read-only.

Offsets

Offsets are the distance (in bytes) between consecutive elements in memory. In C, the compiler precalculates these offsets for you. In assembly, you do it yourself.

Syntax for memory address calculations:

[base + scale*index + disp]

base — a GPR (usually a pointer from a C function argument)
scale — can be 1, 2, 4, or 8 (default: 1)
index — a GPR (usually a loop counter)
disp — an integer displacement (up to 32-bit)

x86inc provides the constant mmsize, which gives you the size of the SIMD register you are working with.

;static void simple_loop(const uint8_t *src)
INIT_XMM sse2
cglobal simple_loop, 1, 2, 2, src
    mov r1q, 3
.loop:
    movu m0, [srcq]
    movu m1, [srcq+2*r1q+3+mmsize]

    ; do something

    add srcq, mmsize

    dec r1q
    jg .loop

    RET

The assembler precalculates the displacement constant in [srcq+2*r1q+3+mmsize].
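
As a worked example, the sketch below computes which byte offset (relative to the original src pointer) the second load touches on each iteration of simple_loop. It assumes mmsize is 16, as with xmm registers. The function name second_load_offset is hypothetical.

```c
#include <stddef.h>

enum { MMSIZE = 16 };  /* mmsize for xmm registers */

/* Byte offset, relative to the original src pointer, of the second load
 * in simple_loop on iteration `iter` (0-based). Models
 * [srcq + 2*r1q + 3 + mmsize] with srcq advanced by mmsize per iteration
 * and r1q counting down from 3. */
static ptrdiff_t second_load_offset(int iter)
{
    ptrdiff_t src_advance = (ptrdiff_t)iter * MMSIZE; /* add srcq, mmsize */
    ptrdiff_t index       = 3 - iter;                 /* r1q: 3, 2, 1 */
    ptrdiff_t disp        = 3 + MMSIZE;               /* folded by assembler: 19 */
    return src_advance + 2 * index + disp;
}
```

Under these assumptions the three iterations read from offsets 25, 39 and 53.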

LEA (Load Effective Address)

LEA lets you perform a multiplication by the scale and up to two additions in a single instruction, which is faster than issuing several separate instructions. Despite the name, it's used for normal arithmetic as well as address calculations.

lea r0q, [r1q + 8*r2q + 5]

Does not affect r1q or r2q, and does not affect FLAGS. The equivalent without LEA would require multiple instructions and a temporary register:

mov r0q, r1q
mov r3q, r2q
sal r3q, 3       ; shift arithmetic left 3 = * 8
add r3q, 5
add r0q, r3q

Scale is limited to 1, 2, 4, or 8, but multiplication by these values plus a fixed offset is very common in practice.
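
The equivalence between the single lea and the longer sequence can be checked with a small C model. Both function names are hypothetical, and unsigned 64-bit arithmetic stands in for the registers.

```c
#include <stdint.h>

/* What `lea r0q, [r1q + 8*r2q + 5]` computes, as one expression. */
static uint64_t lea_model(uint64_t r1, uint64_t r2)
{
    return r1 + 8 * r2 + 5;
}

/* The same result via the longer sequence, one step per instruction. */
static uint64_t multi_instruction_model(uint64_t r1, uint64_t r2)
{
    uint64_t r0 = r1;   /* copy r1 into r0        */
    uint64_t r3 = r2;   /* copy r2 into temporary */
    r3 <<= 3;           /* sal r3q, 3  (times 8)  */
    r3 += 5;            /* add r3q, 5             */
    r0 += r3;           /* add r0q, r3q           */
    return r0;
}
```

Note that the longer sequence also changes FLAGS and needs a scratch register, which the lea form avoids.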

Instruction Sets

MMX (1997) — first SIMD in Intel processors, 64-bit registers, historic
SSE (1999) — 128-bit registers
SSE2 (2000) — many new instructions
SSSE3 (2006) — introduced pshufb, arguably the most important instruction in video processing
SSE4 (2008) — packed minimum and maximum, among others
AVX (2011) — 256-bit registers (float only), new three-operand syntax
AVX2 (2013) — 256-bit integer instructions
AVX-512 (2017) — 512-bit registers, operation masks. Initially had CPU frequency downscaling issues. Full 512-bit shuffle with vpermb
AVX-512 ICL (2019) — no more clock frequency downscaling
AVX10 — upcoming

Instruction sets can be removed as well as added. AVX-512 was removed from Alder Lake.

FFmpeg uses function pointers that default to the C implementation and are replaced with a particular instruction set variant at runtime. CPU detection is done once. This allows optimized functions to be toggled on and off, and it avoids hardcoding a specific instruction set, which would make perfectly functional hardware obsolete.
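
The dispatch pattern described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: CPU_FLAG_SSE2_MODEL, dsp_context_model, dsp_init_model and the stand-in function are hypothetical names, not FFmpeg's actual API, and the "optimized" variant simply reuses the C logic.

```c
#include <stdint.h>

/* Hypothetical CPU feature bit; real code would query the CPU once. */
#define CPU_FLAG_SSE2_MODEL 0x1u

typedef void (*add_values_fn)(uint8_t *src, const uint8_t *src2);

/* Portable C fallback. */
static void add_values_c(uint8_t *src, const uint8_t *src2)
{
    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(src[i] + src2[i]);
}

/* Stand-in for the assembly version; here it just reuses the C logic. */
static void add_values_sse2_standin(uint8_t *src, const uint8_t *src2)
{
    add_values_c(src, src2);
}

typedef struct { add_values_fn add_values; } dsp_context_model;

/* Start from the C version, then override if the feature is present. */
static void dsp_init_model(dsp_context_model *c, unsigned cpu_flags)
{
    c->add_values = add_values_c;
    if (cpu_flags & CPU_FLAG_SSE2_MODEL)
        c->add_values = add_values_sse2_standin;
}
```

Callers only ever go through the pointer, so the same binary runs on CPUs with and without the instruction set.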

Real-World Availability (Steam Survey, Nov 2024)

Instruction Set     Availability
SSE2                100%
SSE3                100%
SSSE3               99.86%
SSE4.1              99.80%
AVX                 97.39%
AVX2                94.44%
AVX-512             14.09%

For FFmpeg with billions of users, even 0.1% is a massive number of users and bug reports.

Pointer Offset Trick

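This version takes the original add_values function and adds a width argument. The width is declared as ptrdiff_t rather than int so that it fills the whole 64-bit register: with a 32-bit int, the upper 32 bits of the register are not guaranteed to be zero and can contain arbitrary values, which would corrupt the pointer arithmetic below.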

;static void add_values(uint8_t *src, const uint8_t *src2, ptrdiff_t width)
INIT_XMM sse2
cglobal add_values, 3, 3, 2, src, src2, width
    add srcq, widthq
    add src2q, widthq
    neg widthq

.loop:
    movu m0, [srcq+widthq]
    movu m1, [src2q+widthq]

    paddb m0, m1

    movu [srcq+widthq], m0
    add widthq, mmsize
    jl .loop

    RET

The trick:

add srcq, widthq / add src2q, widthq — advance both pointers to the end of the buffer
neg widthq — negate width so it becomes negative
[srcq+widthq] — on first iteration, the negative offset points back to the original start
add widthq, mmsize / jl .loop — widthq approaches zero by mmsize each iteration. It serves as both the pointer offset and the loop counter, saving a cmp instruction
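
The same trick can be written out in C to see how the negative offset walks up to zero. This is a sketch that assumes width is a positive multiple of the vector size, since the loop always processes a full vector. The function name add_values_offset_trick is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { MMSIZE = 16 };  /* bytes handled per iteration, as with xmm */

/* C model of the loop above. Assumes width is a positive multiple of MMSIZE. */
static void add_values_offset_trick(uint8_t *src, const uint8_t *src2,
                                    ptrdiff_t width)
{
    src  += width;           /* add srcq, widthq  */
    src2 += width;           /* add src2q, widthq */
    ptrdiff_t off = -width;  /* neg widthq        */

    do {
        /* One vector's worth of bytes at [src+off] and [src2+off]. */
        for (int i = 0; i < MMSIZE; i++)
            src[off + i] = (uint8_t)(src[off + i] + src2[off + i]);
        off += MMSIZE;       /* add widthq, mmsize */
    } while (off < 0);       /* jl .loop           */
}
```

The loop condition is just the sign of the offset, so the add that advances the offset also provides the flags for the jump.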

Alignment

Many CPUs load/store data faster when it's aligned (memory address divisible by SIMD register size). Where possible, FFmpeg uses mova (aligned move) instead of movu.

av_malloc provides aligned heap memory
DECLARE_ALIGNED provides aligned stack memory
Using mova on an unaligned address causes a segfault
Alignment must match register size: 16 for xmm, 32 for ymm, 64 for zmm

SECTION_RODATA 64    ; align beginning of RODATA to 64 bytes
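
On the C side, whether a pointer satisfies a given alignment can be checked with a small helper like the sketch below. Both function names are assumptions for this example, and the alignment is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `p` is aligned to `alignment` bytes (a power of two such as
 * 16, 32 or 64 for xmm, ymm and zmm respectively). */
static bool is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: may a mova-style access of `mmsize` bytes be used? */
static bool can_use_aligned_move(const void *p, unsigned mmsize)
{
    return is_aligned(p, mmsize);
}
```

A check like this is mostly useful in tests and assertions; in hot code the alignment is normally guaranteed by how the buffer was allocated.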

Range Expansion

When the result of a byte operation exceeds 255, it overflows: paddb, for example, wraps around modulo 256. To handle this, we expand to a larger intermediate size (e.g. bytes to words).

Unsigned: Zero Extension

punpcklbw (unpack low bytes to words) and punpckhbw (unpack high bytes to words) interleave bytes from two registers. If the source register is all zeros, this interleaves bytes with zeros — performing zero extension.

pxor m2, m2         ; zero out m2

movu m0, [srcq]
movu m1, m0         ; copy of m0
punpcklbw m0, m2    ; low bytes zero-extended to words
punpckhbw m1, m2    ; high bytes zero-extended to words

m0 and m1 now contain the original bytes zero-extended to words. (Three-operand AVX makes the second movu unnecessary.)
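
A scalar C model of this zero extension is sketched below. The function name zero_extend_bytes is hypothetical, and the lo and hi arrays stand in for m0 and m1 after the unpacks.

```c
#include <stdint.h>

/* Model of punpcklbw/punpckhbw against an all-zero register: each source
 * byte becomes the low byte of a word whose high byte is zero. */
static void zero_extend_bytes(const uint8_t src[16],
                              uint16_t lo[8], uint16_t hi[8])
{
    const uint8_t zero = 0;  /* a byte of the zeroed register m2 */
    for (int i = 0; i < 8; i++) {
        lo[i] = (uint16_t)(src[i]     | (zero << 8)); /* punpcklbw */
        hi[i] = (uint16_t)(src[i + 8] | (zero << 8)); /* punpckhbw */
    }
}
```

Once widened, two values such as 240 and 241 can be added as words without wrapping.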

Signed: Sign Extension

Sign extension pads MSBs with the sign bit. For example, -2 as int8_t is 0b11111110. Sign-extended to int16_t: 0b1111111111111110.

pcmpgtb (compare greater than byte) against zero sets all bits to 1 if the byte is negative, 0 otherwise. Interleaving with punpckX performs sign extension:

pxor m2, m2         ; zero out m2

movu m0, [srcq]
movu m1, m0         ; copy of m0

pcmpgtb m2, m0      ; m2 bytes = 0xFF if negative, 0x00 otherwise
punpcklbw m0, m2    ; sign-extended low bytes
punpckhbw m1, m2    ; sign-extended high bytes
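
The sign-extension sequence can be modelled in scalar C as below. The names sign_mask and sign_extend_bytes are hypothetical, and the casts assume the usual two's complement conversions found on mainstream compilers.

```c
#include <stdint.h>

/* Model of pcmpgtb m2, m0 with m2 = 0: 0xFF where the byte is negative. */
static uint8_t sign_mask(uint8_t b)
{
    return (0 > (int8_t)b) ? 0xFF : 0x00;
}

/* Interleave each byte with its mask, as punpcklbw/punpckhbw would. */
static void sign_extend_bytes(const uint8_t src[16],
                              int16_t lo[8], int16_t hi[8])
{
    for (int i = 0; i < 8; i++) {
        lo[i] = (int16_t)(uint16_t)(src[i]     | (sign_mask(src[i])     << 8));
        hi[i] = (int16_t)(uint16_t)(src[i + 8] | (sign_mask(src[i + 8]) << 8));
    }
}
```

With this model the byte 0xFE (-2 as int8_t) becomes the word 0xFFFE, which is still -2 as int16_t.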

Packing

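packuswb (pack words to bytes with unsigned saturation) and packsswb (pack with signed saturation) go from words back to bytes, combining two registers of words into one register of bytes: the destination's words become the low half and the source's words the high half. Both treat the input words as signed; values outside the target byte range are saturated (clamped) rather than wrapped.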
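
A scalar C sketch of packuswb on 128-bit registers follows. The names saturate_u8 and packuswb_model are hypothetical. The input words are treated as signed and clamped to the range 0 to 255.

```c
#include <stdint.h>

/* Clamp a signed word to the unsigned byte range, as packuswb does. */
static uint8_t saturate_u8(int16_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Model of packuswb dst, src for 128-bit registers: dst's 8 words fill
 * output bytes 0..7 and src's 8 words fill output bytes 8..15. */
static void packuswb_model(const int16_t dst_words[8],
                           const int16_t src_words[8], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = saturate_u8(dst_words[i]);
        out[i + 8] = saturate_u8(src_words[i]);
    }
}
```

This is the natural counterpart to the range expansion above: widen, compute in words, then pack back down with clamping.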

Shuffles

Shuffles (also called permutes) are arguably the most important instructions in video processing. pshufb (packed shuffle bytes), available since SSSE3, is the key variant.

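For each output byte, the corresponding byte of the source (the mask) selects which byte of the destination register to copy: for a 128-bit register, its low 4 bits are the index. When the mask byte's MSB is set, the output byte is zeroed instead. Analogous C:

uint8_t tmp[16];
memcpy(tmp, dst, 16);                 // snapshot dst before overwriting it
for (int i = 0; i < 16; i++) {
    if (src[i] & 0x80)                // MSB set: zero the output byte
        dst[i] = 0;
    else
        dst[i] = tmp[src[i] & 0x0F];  // low 4 bits index the snapshot
}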

SECTION_RODATA 64

shuffle_mask: db 4, 3, 1, 2, -1, 2, 3, 7, 5, 4, 3, 8, 12, 13, 15, -1

SECTION .text

movu m0, [srcq]
movu m1, [shuffle_mask]
pshufb m0, m1       ; shuffle m0 based on m1

-1 as a byte is 0b11111111 (two's complement), so the MSB is set and the output byte is zeroed. Any value with the MSB set would work; -1 is used because it is easy to read.
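
As a worked example, the sketch below applies the shuffle_mask above to a known input using a scalar model of the 128-bit pshufb. The function name pshufb_model is hypothetical, and the mask is written with 0xFF in place of -1 so that it fits a uint8_t array.

```c
#include <stdint.h>
#include <string.h>

/* Model of the 128-bit pshufb dst, mask. */
static void pshufb_model(uint8_t dst[16], const uint8_t mask[16])
{
    uint8_t tmp[16];
    memcpy(tmp, dst, 16);
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : tmp[mask[i] & 0x0F];
}

/* The shuffle_mask from the example; -1 becomes 0xFF as a byte. */
static const uint8_t shuffle_mask[16] = {
    4, 3, 1, 2, 0xFF, 2, 3, 7, 5, 4, 3, 8, 12, 13, 15, 0xFF
};
```

If the input bytes are 100, 101, ..., 115, the first output byte is 104 (input byte 4), while output bytes 4 and 15 are zero because their mask entries have the MSB set.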