⬡ SYCL 2020 C++17 BSD-3-Clause 🌐 Cross-platform

GPU-Accelerated AV2 Video Encoding
from a Single Codebase

Open-source SYCL 2020 kernels for AV2 (AOM Video 2) — DCT, SAD, loop filter, and intra prediction. 3–5× speedup on NVIDIA RTX 4090, Intel Arc A770, and AMD RX 7900 XTX.

⭐ Star on GitHub ⚡ Quick Start 📖 Documentation

3–5×

Encoding Speedup

GPU Vendors Supported

50 μs

DCT 8×8 on RTX 4090

Zero

Integration Cost

Performance

Real-World Benchmarks

Tested on Intel Core i9-13900K + NVIDIA RTX 4090, Ubuntu 22.04, DPC++ 2024.0

Encoding throughput comparison

Per-kernel GPU performance

4K Real-time Encoding

3.2× faster

12 fps → 38 fps on NVIDIA RTX 4090

1080p Batch Transcode

4.2× faster

0.5× → 2.1× real-time on RTX 4090

DCT 8×8 Kernel

3.6× faster

180 μs → 50 μs · 19,945 transforms/sec

SAD 16×16

5.1× faster

Motion estimation on Intel Arc A770

Hardware

Every Major GPU. One Codebase.

No vendor lock-in. Write once, deploy everywhere.

GPU	Architecture	Backend	DCT	SAD	Loop Filter	Intra
🟢 NVIDIA RTX 4090	Ada Lovelace	CUDA	✅	✅	✅	✅
🟢 NVIDIA RTX 3080	Ampere	CUDA	✅	✅	✅	✅
🟢 Intel Arc A770	Alchemist	Level Zero	✅	✅	✅	✅
🟢 Intel Xe LP	Xe LPG	Level Zero	✅	✅	✅	✅
🟡 AMD RX 7900 XTX	RDNA 3	HIP	🔄	🔄	🔄	🔄

✅ Full Support 🔄 Experimental 🟢 Tested 🟡 Community Reported

Why AVM SYCL

Built for Production

🚀

3–5× Real Speedup

End-to-end encoding throughput gains. Not microbenchmarks — real-world AV2 encode tests on multiple GPUs.

⬡

SYCL 2020 Standard

Uses Unified Shared Memory (USM), device selectors, and async handlers. Compile with Intel DPC++, AdaptiveCpp, or NVIDIA clang-CUDA.

🔄

Drop-in Replacement

RTCD-compatible. Replace fdct8x8_cpu() with avm::sycl::fdct8x8(). Works with FFmpeg, OpenCV, GStreamer.

🎯

Auto GPU Selection

Intelligent device scoring. Automatically picks the best GPU at runtime — no manual configuration needed.

🔒

CPU Fallback

If GPU is unavailable (macOS, driver error, device busy), automatically falls back to optimized CPU implementation.

🧪

Well Tested

Unit tests + performance benchmarks validated on Intel Xeon Gold 6530 + Arc A770. 100% test pass rate.

Comparison

Why Not Just Use CUDA?

CUDA is NVIDIA-only. AVM SYCL runs everywhere.

🔴 CUDA / ROCm / OpenCL

❌ NVIDIA only (CUDA)
❌ AMD only (ROCm/HIP)
❌ No cross-vendor portability
❌ Maintain 3 separate codebases
❌ Different optimization strategies per vendor
❌ Hard to integrate in cross-platform apps

🟢 AVM SYCL

✅ NVIDIA (CUDA backend)
✅ Intel (Level Zero backend)
✅ AMD (HIP backend)
✅ ARM Mali (OpenCL backend)
✅ Single codebase, zero porting effort
✅ Works in FFmpeg, GStreamer, OpenCV

Get Started

3 Lines to GPU Acceleration

Zero integration cost. Just replace the function call.

// your existing code
int16_t input[64] = {...};
int32_t output[64];
fdct8x8_cpu(input, output);       // slow

// replace with AVM SYCL — that's it
auto& ctx = avm::sycl::SYCLContext::instance();
ctx.initialize();                      // auto-detect best GPU
avm::sycl::fdct8x8(ctx.queue(), input, output);  // 3.6× faster
  

⭐ Get the Code 📖 Architecture Guide 🔗 FFmpeg/GStreamer Integration

GPU-Accelerated AV2 Video Encodingfrom a Single Codebase