Open-source SYCL 2020 kernels for AV2 (AOM Video 2): DCT, SAD, loop filter, and intra prediction. 3–5× speedup on NVIDIA RTX 4090, Intel Arc A770, and AMD RX 7900 XTX.
Tested on Intel Core i9-13900K + NVIDIA RTX 4090, Ubuntu 22.04, DPC++ 2024.0
Encoding throughput comparison
Per-kernel GPU performance
No vendor lock-in. Write once, deploy everywhere.
| GPU | Architecture | Backend | DCT | SAD | Loop Filter | Intra |
|---|---|---|---|---|---|---|
| 🟢 NVIDIA RTX 4090 | Ada Lovelace | CUDA | ✅ | ✅ | ✅ | ✅ |
| 🟢 NVIDIA RTX 3080 | Ampere | CUDA | ✅ | ✅ | ✅ | ✅ |
| 🟢 Intel Arc A770 | Alchemist | Level Zero | ✅ | ✅ | ✅ | ✅ |
| 🟢 Intel Xe LP | Xe LPG | Level Zero | ✅ | ✅ | ✅ | ✅ |
| 🟡 AMD RX 7900 XTX | RDNA 3 | HIP | 🔄 | 🔄 | 🔄 | 🔄 |

✅ Full Support · 🔄 Experimental · 🟢 Tested · 🟡 Community Reported
End-to-end encoding throughput gains. Not microbenchmarks: real-world AV2 encode tests on multiple GPUs.
Uses Unified Shared Memory (USM), device selectors, and async handlers. Compile with Intel DPC++, AdaptiveCpp, or NVIDIA clang-CUDA.
RTCD-compatible (run-time CPU detection dispatch). Replace fdct8x8_cpu() with avm::sycl::fdct8x8(). Works with FFmpeg, OpenCV, GStreamer.
Intelligent device scoring. Automatically picks the best GPU at runtime, with no manual configuration needed.
If no GPU is available (macOS, driver error, device busy), it automatically falls back to an optimized CPU implementation.
Unit tests + performance benchmarks validated on Intel Xeon Gold 6530 + Arc A770. 100% test pass rate.
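The device scoring and CPU fallback described above can be pictured roughly as follows. This is a minimal sketch in plain C++, not the project's actual selection logic: `DeviceInfo`, `score`, `pick_gpu`, `select_backend`, and the scoring weights are all hypothetical names and values chosen for illustration, and a real build would presumably populate the device list from SYCL device queries instead of by hand.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical device description (an assumption for this sketch).
struct DeviceInfo {
    std::string name;
    bool is_gpu;
    bool available;       // false models a busy device or a driver error
    int compute_units;
    std::size_t mem_mib;
};

// Illustrative heuristic only: favors compute units, then memory size.
// Non-GPU or unavailable devices get a negative score and are never chosen.
inline long score(const DeviceInfo& d) {
    if (!d.is_gpu || !d.available) return -1;
    return static_cast<long>(d.compute_units) * 100 +
           static_cast<long>(d.mem_mib / 1024);
}

// Returns the index of the best-scoring usable GPU, or -1 if there is none.
inline int pick_gpu(const std::vector<DeviceInfo>& devs) {
    int best = -1;
    long best_score = -1;
    for (std::size_t i = 0; i < devs.size(); ++i) {
        const long s = score(devs[i]);
        if (s > best_score) {
            best_score = s;
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Chooses the GPU name when one is usable, otherwise reports the CPU fallback.
inline std::string select_backend(const std::vector<DeviceInfo>& devs) {
    const int gpu = pick_gpu(devs);
    return gpu >= 0 ? devs[static_cast<std::size_t>(gpu)].name
                    : std::string("cpu-fallback");
}
```

With a list holding two usable GPUs and a host CPU, `select_backend` returns the name of the higher-scoring GPU; marking both GPUs unavailable makes it return `"cpu-fallback"`, which is the point where a caller would route work to the CPU kernels.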
CUDA is NVIDIA-only. AVM SYCL runs everywhere.
Zero integration cost. Just replace the function call.
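As a rough illustration of what "replace the function call" can look like under an RTCD-style function-pointer table, here is a self-contained sketch. The names `fdct8x8_cpu` and `avm::sycl::fdct8x8` come from the text above, but their signatures are not given there, so the prototype below is an assumption. `naive_fdct8x8`, `fdct8x8_fn`, and `rtcd_init` are hypothetical helpers, and both kernels are stand-ins that share a naive floating-point DCT-II body instead of a real codec transform or a real SYCL submission.

```cpp
#include <cmath>
#include <cstdint>

// Assumed prototype; the project's real signature is not shown in this README.
typedef void (*fdct8x8_fn)(const std::int16_t* in, std::int32_t* out, int stride);

// Placeholder body: naive orthonormal 2D DCT-II on an 8x8 block.
// It is not bit-exact with any codec's integer transform.
inline void naive_fdct8x8(const std::int16_t* in, std::int32_t* out, int stride) {
    const double pi = std::acos(-1.0);
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int y = 0; y < 8; ++y)
                for (int x = 0; x < 8; ++x)
                    sum += in[y * stride + x] *
                           std::cos((2 * y + 1) * u * pi / 16.0) *
                           std::cos((2 * x + 1) * v * pi / 16.0);
            const double cu = (u == 0) ? std::sqrt(0.5) : 1.0;
            const double cv = (v == 0) ? std::sqrt(0.5) : 1.0;
            out[u * 8 + v] =
                static_cast<std::int32_t>(std::lround(0.25 * cu * cv * sum));
        }
    }
}

// Stand-in for the existing CPU kernel.
inline void fdct8x8_cpu(const std::int16_t* in, std::int32_t* out, int stride) {
    naive_fdct8x8(in, out, stride);
}

namespace avm {
namespace sycl {
// Stand-in for the GPU kernel; a real version would submit work to a SYCL queue.
inline void fdct8x8(const std::int16_t* in, std::int32_t* out, int stride) {
    naive_fdct8x8(in, out, stride);
}
}  // namespace sycl
}  // namespace avm

// RTCD-style dispatch pointer: call sites use fdct8x8(...) and never change.
static fdct8x8_fn fdct8x8 = &fdct8x8_cpu;

// Hypothetical init hook: rebinds the pointer once at startup.
inline void rtcd_init(bool gpu_available) {
    fdct8x8 = gpu_available ? &avm::sycl::fdct8x8 : &fdct8x8_cpu;
}
```

Call sites keep invoking `fdct8x8(in, out, stride)`; only the one-time `rtcd_init` call decides which implementation backs the pointer.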