FFmpeg x86 Assembly
Notes from three lectures on writing x86 assembly for FFmpeg (from the FFmpeg asm-lessons). Covers registers, SIMD fundamentals, branching, loops, instruction sets, and common optimization tricks used in video processing.
Registers
What are they?
Registers are areas in the CPU where data can be processed. CPUs don't operate on memory directly:
Data in memory → loaded into registers → processed → written back to memory
Generally in assembly you cannot directly copy data from one memory location to another without first passing that data through a register.
General Purpose Registers (GPRs)
GPRs can contain either data (up to a 64-bit value) or a memory address (a pointer). A value in a GPR can be processed through operations like addition, multiplication, shifting, etc. In FFmpeg, much of the GPR complexity (such as platform-specific register names) is abstracted away by x86inc.asm.
Vector Registers (SIMD)
SIMD registers contain multiple data elements. There are various types:
- mm registers: MMX, 64-bit, historical and rarely used
- xmm registers: XMM, 128-bit, widely available
- ymm registers: YMM, 256-bit, some complications
- zmm registers: ZMM, 512-bit, limited availability
Most calculations in video compression/decompression are integer-based.
An xmm register holding 16 bytes can be interpreted as:
- 16 bytes — 8-bit data
- 8 words — 16-bit data
- 4 doublewords — 32-bit data
- 2 quadwords — 64-bit data
- 1 double quadword — 128-bit data
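The list above can be sketched in C as one 16-byte value viewed at different lane widths. This is only an illustration: the names xmm_view, xmm_lanes and xmm_word are hypothetical and not part of FFmpeg or x86inc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: one 128-bit value viewed at each lane width. */
typedef union {
    uint8_t  b[16];  /* 16 bytes       */
    uint16_t w[8];   /* 8 words        */
    uint32_t d[4];   /* 4 doublewords  */
    uint64_t q[2];   /* 2 quadwords    */
} xmm_view;

/* Number of lanes for a given lane size in bytes (1, 2, 4, 8 or 16). */
static int xmm_lanes(int lane_bytes)
{
    assert(lane_bytes == 1 || lane_bytes == 2 || lane_bytes == 4 ||
           lane_bytes == 8 || lane_bytes == 16);
    return (int)sizeof(xmm_view) / lane_bytes;
}

/* Read word lane i from 16 raw bytes, little-endian as on x86. */
static uint16_t xmm_word(const uint8_t bytes[16], int i)
{
    assert(i >= 0 && i < 8);
    return (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}
```

The register itself carries no type; the instruction (for example paddb versus paddw) decides which view applies.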
x86inc.asm
x86inc.asm is a lightweight abstraction layer used in FFmpeg, x264, and dav1d.
One useful thing it does is label GPRs as r0, r1, r2, etc.,
so you don't have to remember platform-specific register names.
GPRs are generally just scaffolding, so this makes life easier.
Simple Scalar Assembly
Scalar assembly operates on individual data items, one at a time:
mov r0q, 3
inc r0q
dec r0q
imul r0q, 5
- mov r0q, 3: stores the immediate value 3 into r0 as a quadword
- inc r0q: increments r0 to 4
- dec r0q: decrements r0 back to 3
- imul r0q, 5: multiplies r0 by 5, so r0 now contains 15
Mnemonics
Human-readable instructions like mov and inc,
assembled into machine code by the assembler, are known as mnemonics.
FFmpeg uses lowercase (you'll see uppercase online).
First SIMD Function
%include "x86inc.asm"
SECTION .text
;static void add_values(uint8_t *src, const uint8_t *src2)
INIT_XMM sse2
cglobal add_values, 2, 2, 2, src, src2
movu m0, [srcq]
movu m1, [src2q]
paddb m0, m1
movu [srcq], m0
RET
Breakdown:
- %include "x86inc.asm": includes the helper header that provides macros like cglobal
- SECTION .text: executable code section (as opposed to .data for constants)
- INIT_XMM sse2: initialize to use XMM registers with the SSE2 instruction set
- cglobal add_values, 2, 2, 2, src, src2: defines a function with 2 arguments, 2 GPRs, and 2 XMM registers
- movu m0, [srcq]: 128-bit unaligned load from *src (brackets dereference, like *ptr in C)
- movu m1, [src2q]: the same kind of load from *src2
- paddb m0, m1: packed bytewise addition; "p" means packed (vector), "b" means byte granularity
- movu [srcq], m0: stores the result back to *src
- RET: function return macro
Note: vector registers are referred to as m0, m1, etc. (abstracted form),
not their full names like xmm0.
The q suffix on srcq refers to pointer size (64-bit on 64-bit systems),
but the underlying load is 128-bit.
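As a rough scalar reference for what the function above computes, the following C sketch adds 16 bytes lane by lane. The name add_values_model is hypothetical; the point is that paddb wraps each byte modulo 256 rather than carrying into the neighbouring lane.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the SSE2 add_values: one paddb over 16 byte lanes.
 * Each lane wraps modulo 256, with no carry between lanes. */
static void add_values_model(uint8_t *src, const uint8_t *src2)
{
    assert(src && src2);
    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(src[i] + src2[i]);
}
```

For example, a lane holding 250 plus a lane holding 10 yields 4, not 260.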
Branches & Loops
Labels and Jumps
mov r0q, 3
.loop:
dec r0q
jmp .loop
jmp moves execution to the label. The dot prefix makes it a
local label, allowing the same name across multiple functions.
The above is an infinite loop (no boundary check).
FLAGS Register
The FLAGS register contains flags like Zero-Flag, Sign-Flag, and Overflow-Flag,
set based on the output of most non-mov scalar instructions (arithmetic, shifts, etc.).
mov r0q, 3
.loop:
; do something
dec r0q
jg .loop ; jump if greater than zero
Equivalent C:
int i = 3;
do {
// do something
i--;
} while (i > 0);
A more natural C-style for loop:
for (int i = 0; i < 3; i++) {
// do something
}
Roughly translates to:
xor r0q, r0q
.loop:
; do something
inc r0q
cmp r0q, 3
jl .loop ; jump if (r0q - 3) < 0, i.e. r0q < 3
xor r0q, r0q is a common way to zero a register — on some
systems faster than mov r0q, 0 because no actual load takes place.
Also works on SIMD registers: pxor m0, m0.
cmp subtracts the second operand from the first (without storing the result)
and sets FLAGS. The extra cmp instruction is why the countdown form is preferred
— fewer instructions generally means faster code.
Common Jump Mnemonics
| Mnemonic | Description | FLAGS |
| --- | --- | --- |
| JE/JZ | Jump if Equal/Zero | ZF = 1 |
| JNE/JNZ | Jump if Not Equal/Not Zero | ZF = 0 |
| JG/JNLE | Jump if Greater (signed) | ZF = 0 and SF = OF |
| JGE/JNL | Jump if Greater or Equal (signed) | SF = OF |
| JL/JNGE | Jump if Less (signed) | SF ≠ OF |
| JLE/JNG | Jump if Less or Equal (signed) | ZF = 1 or SF ≠ OF |
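To see why these flag combinations correspond to signed comparisons, here is a small C model of cmp followed by a conditional jump. It is a sketch under the assumption of 64-bit operands; cpu_flags, cmp_flags and the takes_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical struct holding the three flags that the signed jumps read. */
typedef struct { bool zf, sf, of; } cpu_flags;

/* Model of `cmp a, b`: compute a - b, discard the result, keep the flags. */
static cpu_flags cmp_flags(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t r  = ua - ub;                 /* wraps like the hardware subtraction */
    cpu_flags f;
    f.zf = (r == 0);                       /* result is zero          */
    f.sf = (r >> 63) != 0;                 /* result's sign bit       */
    /* Signed overflow: a and b differ in sign, and the result's sign differs from a. */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 63) != 0;
    assert(!f.zf || (!f.sf && !f.of));     /* equal operands never set SF or OF */
    return f;
}

static bool takes_je(cpu_flags f) { return f.zf; }
static bool takes_jg(cpu_flags f) { return !f.zf && f.sf == f.of; }
static bool takes_jl(cpu_flags f) { return f.sf != f.of; }
```

The overflow case is what makes SF alone insufficient: comparing the most negative 64-bit value with 1 produces a positive-looking result, yet SF ≠ OF still reports "less".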
Constants
SECTION_RODATA
constants_1: db 1,2,3,4
constants_2: times 2 dw 4,3,2,1
- SECTION_RODATA: read-only data section
- db (declare byte): equivalent to uint8_t constants_1[4] = {1,2,3,4};
- times 2 dw: repeats the declared words, equivalent to uint16_t constants_2[8] = {4,3,2,1, 4,3,2,1};
These labels (converted to memory addresses by the assembler) can be used in loads but not stores, since they are read-only.
Offsets
Offsets are the distance (in bytes) between consecutive elements in memory. In C, the compiler precalculates these offsets for you. In assembly, you do it yourself.
Syntax for memory address calculations:
[base + scale*index + disp]
- base: a GPR (usually a pointer from a C function argument)
- scale: can be 1, 2, 4, or 8 (default: 1)
- index: a GPR (usually a loop counter)
- disp: an integer displacement (up to 32-bit)
x86inc provides the constant mmsize, which gives you the size of the
SIMD register you are working with.
;static void simple_loop(const uint8_t *src)
INIT_XMM sse2
cglobal simple_loop, 1, 2, 2, src
mov r1q, 3
.loop:
movu m0, [srcq]
movu m1, [srcq+2*r1q+3+mmsize]
; do something
add srcq, mmsize
dec r1q
jg .loop
RET
The assembler precalculates the displacement constant in
[srcq+2*r1q+3+mmsize].
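The address arithmetic in simple_loop can be traced with a short C sketch. The helper names effective_offset and second_load_offset are hypothetical, and mmsize is assumed to be 16 because the function uses INIT_XMM.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset selected by [base + scale*index + disp], relative to base. */
static ptrdiff_t effective_offset(int scale, ptrdiff_t index, ptrdiff_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return scale * index + disp;
}

/* Offset of the second load in simple_loop, relative to the ORIGINAL src,
 * for iteration k = 0, 1, 2. The counter r1q takes the values 3, 2, 1
 * while srcq has already advanced by k * mmsize. */
static ptrdiff_t second_load_offset(int k)
{
    const ptrdiff_t mmsize = 16;           /* assumed: xmm registers */
    assert(k >= 0 && k < 3);
    return k * mmsize + effective_offset(2, 3 - k, 3 + mmsize);
}
```

Here 3 + mmsize is the part the assembler folds into a single displacement (19), so only 2*r1q is computed at run time.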
LEA (Load Effective Address)
Lets you perform multiplication and addition in one instruction — faster than using multiple instructions. Despite the name, it's used for normal arithmetic as well as address calculations.
lea r0q, [r1q + 8*r2q + 5]
Does not affect r1q or r2q, and does not affect FLAGS.
The equivalent without LEA would require multiple instructions and a temporary register:
mov r0q, r1q
mov r3q, r2q
sal r3q, 3 ; shift arithmetic left 3 = * 8
add r3q, 5
add r0q, r3q
Scale is limited to 1, 2, 4, or 8, but multiplication by these values plus a fixed offset is very common in practice.
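As a quick sanity check that the two forms agree, here is a C sketch of both. The function names are hypothetical; unsigned 64-bit arithmetic is used so that wraparound matches what the registers do.

```c
#include <assert.h>
#include <stdint.h>

/* Model of lea r0q, [r1q + 8*r2q + 5]; arithmetic wraps modulo 2^64. */
static uint64_t lea_model(uint64_t r1, uint64_t r2)
{
    return r1 + 8 * r2 + 5;
}

/* The same value computed step by step, mirroring the version without LEA. */
static uint64_t no_lea_model(uint64_t r1, uint64_t r2)
{
    uint64_t r0 = r1;      /* copy r1q into r0q             */
    uint64_t r3 = r2;      /* copy r2q into a temporary     */
    r3 <<= 3;              /* shift left by 3, i.e. times 8 */
    r3 += 5;               /* add the fixed offset          */
    r0 += r3;              /* combine                       */
    assert(r0 == lea_model(r1, r2));
    return r0;
}
```

The step-by-step version also overwrites FLAGS at each arithmetic instruction, which the single lea avoids.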
Instruction Sets
- MMX (1997): first SIMD in Intel processors, 64-bit registers, historic
- SSE (1999): 128-bit registers
- SSE2 (2000): many new instructions
- SSSE3 (2006): introduced pshufb, arguably the most important instruction in video processing
- SSE4 (2008): packed minimum and maximum, among others
- AVX (2011): 256-bit registers (float only), new three-operand syntax
- AVX2 (2013): 256-bit integer instructions
- AVX-512 (2017): 512-bit registers, operation masks. Initially had CPU frequency downscaling issues. Full 512-bit shuffle with vpermb
- AVX-512 ICL (2019): no more clock frequency downscaling
- AVX10: upcoming
Instruction sets can be removed as well as added. AVX-512 was removed from Alder Lake.
FFmpeg uses function pointers that default to C and are replaced with a particular instruction set variant at runtime. Detection is done once. This allows optimized functions to be toggled on/off and avoids hardcoding a specific instruction set (which makes perfectly functional hardware obsolete).
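The function-pointer pattern described above might look roughly like the following C sketch. Everything here is hypothetical (the flag names, dsp_context, dsp_init, and the stand-in functions); FFmpeg's real CPU flags and init functions are named and structured differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU feature bits, for illustration only. */
enum { CPU_SSE2 = 1 << 0, CPU_AVX2 = 1 << 1 };

typedef void (*add_values_fn)(uint8_t *src, const uint8_t *src2);

/* Plain C fallback. */
static void add_values_c(uint8_t *src, const uint8_t *src2)
{
    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(src[i] + src2[i]);
}

/* Stand-ins for the assembly versions; in this sketch they reuse the C code. */
static void add_values_sse2(uint8_t *src, const uint8_t *src2) { add_values_c(src, src2); }
static void add_values_avx2(uint8_t *src, const uint8_t *src2) { add_values_c(src, src2); }

typedef struct { add_values_fn add_values; } dsp_context;

/* Called once at startup: default to C, then override with the best variant
 * the detected CPU supports. Later checks win, so order goes oldest to newest. */
static void dsp_init(dsp_context *c, int cpu_flags)
{
    assert(c);
    c->add_values = add_values_c;
    if (cpu_flags & CPU_SSE2) c->add_values = add_values_sse2;
    if (cpu_flags & CPU_AVX2) c->add_values = add_values_avx2;
}
```

Callers then invoke c->add_values(...) without knowing which variant was selected, and masking bits out of cpu_flags is enough to toggle an optimized version off.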
Real-World Availability (Steam Survey, Nov 2024)
| Instruction Set | Availability |
| --- | --- |
| SSE2 | 100% |
| SSE3 | 100% |
| SSSE3 | 99.86% |
| SSE4.1 | 99.80% |
| AVX | 97.39% |
| AVX2 | 94.44% |
| AVX-512 | 14.09% |
For FFmpeg with billions of users, even 0.1% is a massive number of users and bug reports.
Pointer Offset Trick
Take the original add_values function and add a width argument.
We use ptrdiff_t instead of int so that the upper 32 bits of the 64-bit
argument register are well defined (zero for a non-negative width). With an int argument they can contain arbitrary values, which breaks pointer arithmetic on widthq.
;static void add_values(uint8_t *src, const uint8_t *src2, ptrdiff_t width)
INIT_XMM sse2
cglobal add_values, 3, 3, 2, src, src2, width
add srcq, widthq
add src2q, widthq
neg widthq
.loop:
movu m0, [srcq+widthq]
movu m1, [src2q+widthq]
paddb m0, m1
movu [srcq+widthq], m0
add widthq, mmsize
jl .loop
RET
The trick:
- add srcq, widthq / add src2q, widthq: advance both pointers to the end of the buffer
- neg widthq: negate width so it becomes negative
- [srcq+widthq]: on the first iteration, the negative offset points back to the original start
- add widthq, mmsize / jl .loop: widthq approaches zero by mmsize each iteration. It serves as both the pointer offset and the loop counter, saving a cmp instruction
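The same trick can be written in C to make the index arithmetic visible. This is a scalar sketch, not FFmpeg code: the name add_values_width and the MMSIZE macro are hypothetical, and width is assumed to be a positive multiple of the register size, as the assembly loop also requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MMSIZE 16   /* assumed: bytes per xmm register */

static void add_values_width(uint8_t *src, const uint8_t *src2, ptrdiff_t width)
{
    assert(width > 0 && width % MMSIZE == 0);
    src  += width;             /* point one past the end of each buffer */
    src2 += width;
    width = -width;            /* negative offset back to the start     */
    do {
        /* One vector's worth of work: load, bytewise add, store. */
        for (int i = 0; i < MMSIZE; i++)
            src[width + i] = (uint8_t)(src[width + i] + src2[width + i]);
        width += MMSIZE;       /* the add also decides the loop exit    */
    } while (width < 0);       /* loop while the offset is still negative */
}
```

A single variable acts as both offset and counter, so the loop needs no separate comparison against an end value.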
Alignment
Many CPUs load/store data faster when it's aligned (memory address divisible by SIMD register size).
Where possible, FFmpeg uses mova (aligned move) instead of movu.
- av_malloc provides aligned heap memory
- DECLARE_ALIGNED provides aligned stack memory
- Using mova on an unaligned address causes a segfault
- Alignment must match register size: 16 for xmm, 32 for ymm, 64 for zmm
SECTION_RODATA 64 ; align beginning of RODATA to 64 bytes
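"Address divisible by the register size" can be checked from C with a one-line test, sketched below. The helper name is_aligned is hypothetical; it assumes the alignment is a power of two, which holds for 16, 32 and 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the address p is a multiple of align (align must be a power of two). */
static bool is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A check like this in a debug build can catch a misaligned pointer before it reaches an aligned load.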
Range Expansion
When a byte value exceeds 255 after an operation, it overflows. To handle this, we expand to a larger intermediate size (e.g. bytes to words).
Unsigned: Zero Extension
punpcklbw (unpack low bytes to words) and punpckhbw
(unpack high bytes to words) interleave bytes from two registers.
If the source register is all zeros, this interleaves bytes with zeros —
performing zero extension.
pxor m2, m2 ; zero out m2
movu m0, [srcq]
movu m1, m0 ; copy of m0
punpcklbw m0, m2 ; low bytes zero-extended to words
punpckhbw m1, m2 ; high bytes zero-extended to words
m0 and m1 now contain the original bytes zero-extended to words.
(Three-operand AVX makes the second movu unnecessary.)
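A scalar C model of the interleave may help show why a zero register gives zero extension. The names punpck_bw and word_lane are hypothetical; unlike the instruction, this sketch writes to a separate output array instead of overwriting its first operand.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of punpcklbw (high = 0) or punpckhbw (high = 1) on 16-byte
 * registers: interleave 8 bytes of dst with 8 bytes of src, dst byte first.
 * out must not alias dst or src. */
static void punpck_bw(uint8_t out[16], const uint8_t dst[16],
                      const uint8_t src[16], int high)
{
    assert(high == 0 || high == 1);
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = dst[8 * high + i];   /* low byte of word i  */
        out[2 * i + 1] = src[8 * high + i];   /* high byte of word i */
    }
}

/* Read word lane i (little-endian, as on x86). */
static uint16_t word_lane(const uint8_t reg[16], int i)
{
    assert(i >= 0 && i < 8);
    return (uint16_t)(reg[2 * i] | (reg[2 * i + 1] << 8));
}
```

Because x86 is little-endian, the data byte lands in the low half of each word and the zero byte in the high half, so each word equals the original byte value.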
Signed: Sign Extension
Sign extension pads MSBs with the sign bit. For example, -2
as int8_t is 0b11111110. Sign-extended to int16_t:
0b1111111111111110.
pcmpgtb (compare greater than byte) against zero sets all bits to 1
if the byte is negative, 0 otherwise. Interleaving with punpckX
performs sign extension:
pxor m2, m2 ; zero out m2
movu m0, [srcq]
movu m1, m0 ; copy of m0
pcmpgtb m2, m0 ; m2 bytes = 0xFF if negative, 0x00 otherwise
punpcklbw m0, m2 ; sign-extended low bytes
punpckhbw m1, m2 ; sign-extended high bytes
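The per-lane effect of the compare-then-interleave sequence can be modelled in C as follows. The names pcmpgtb_lane and sign_extend_lane are hypothetical, and the casts assume the usual two's complement conversions that x86 compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* One byte lane of pcmpgtb a, b: 0xFF where a > b as signed bytes, else 0x00. */
static uint8_t pcmpgtb_lane(int8_t a, int8_t b)
{
    return a > b ? 0xFF : 0x00;
}

/* One word lane after the interleave: data byte in the low half,
 * comparison mask in the high half. */
static int16_t sign_extend_lane(uint8_t byte)
{
    uint8_t  mask = pcmpgtb_lane(0, (int8_t)byte);   /* 0 > byte: byte is negative */
    uint16_t word = (uint16_t)(byte | (mask << 8));
    int16_t  out  = (int16_t)word;
    assert(out == (int8_t)byte);                     /* same value, wider type */
    return out;
}
```

For the byte 0xFE (-2) the mask is 0xFF, giving the word 0xFFFE, which is -2 as int16_t.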
Packing
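packuswb (pack words to bytes with unsigned saturation) and packsswb
(pack with signed saturation) go from words back to bytes, combining two registers of words into one register of bytes: the first operand's words fill the low half and the second operand's words fill the high half.
Values outside the byte range are saturated (clamped) rather than wrapped.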
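Saturation is easiest to see in a scalar model. The sketch below uses hypothetical names (sat_u8, sat_s8, packuswb_model) and treats the input words as signed, which is how both pack instructions interpret them.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a signed word to 0..255, as packuswb does per lane. */
static uint8_t sat_u8(int16_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Clamp a signed word to -128..127, as packsswb does per lane. */
static int8_t sat_s8(int16_t v)
{
    return v < -128 ? -128 : v > 127 ? 127 : (int8_t)v;
}

/* Scalar model of packuswb dst, src on 16-byte registers: the 8 words of dst
 * become the low 8 bytes, the 8 words of src become the high 8 bytes. */
static void packuswb_model(uint8_t out[16], const int16_t dst[8],
                           const int16_t src[8])
{
    assert(out && dst && src);
    for (int i = 0; i < 8; i++) {
        out[i]     = sat_u8(dst[i]);
        out[i + 8] = sat_u8(src[i]);
    }
}
```

This pairs naturally with the range expansion above: widen to words, do the arithmetic without overflow, then pack back down with clamping.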
Shuffles
Shuffles (or permutes) are arguably the most important instructions in video processing.
pshufb (packed shuffle bytes), available from SSSE3 onward, is the key variant.
For each output byte, the low four bits of the corresponding source byte are used as an index into the destination. When the MSB of the source byte is set, the output byte is zeroed instead. Analogous C:
uint8_t tmp[16];
memcpy(tmp, dst, 16);
for (int i = 0; i < 16; i++) {
if (src[i] & 0x80)
dst[i] = 0;
else
dst[i] = tmp[src[i] & 0x0F]; // only the low 4 bits select the byte
}
SECTION_RODATA 64
shuffle_mask: db 4, 3, 1, 2, -1, 2, 3, 7, 5, 4, 3, 8, 12, 13, 15, -1
SECTION .text
movu m0, [srcq]
movu m1, [shuffle_mask]
pshufb m0, m1 ; shuffle m0 based on m1
-1 as a byte is 0b11111111 (two's complement),
so the MSB is set and the output byte is zeroed. Writing -1 in the mask is simply an easy-to-read way of marking those bytes.