โฌก SYCL 2020 C++17 BSD-3-Clause ๐ŸŒ Cross-platform

GPU-Accelerated AV2 Video Encoding
from a Single Codebase

Open-source SYCL 2020 kernels for AV2 (AOM Video 2) โ€” DCT, SAD, loop filter, and intra prediction. 3โ€“5ร— speedup on NVIDIA RTX 4090, Intel Arc A770, and AMD RX 7900 XTX.

โญ Star on GitHub โšก Quick Start ๐Ÿ“– Documentation
3โ€“5ร—
Encoding Speedup
4
GPU Vendors Supported
50 ฮผs
DCT 8ร—8 on RTX 4090
Zero
Integration Cost
Performance

Real-World Benchmarks

Tested on Intel Core i9-13900K + NVIDIA RTX 4090, Ubuntu 22.04, DPC++ 2024.0

AV2 GPU Acceleration Benchmark

Encoding throughput comparison

AV2 Kernel Performance

Per-kernel GPU performance

4K Real-time Encoding

3.2ร— faster
12 fps โ†’ 38 fps on NVIDIA RTX 4090

1080p Batch Transcode

4.2ร— faster
0.5ร— โ†’ 2.1ร— real-time on RTX 4090

DCT 8ร—8 Kernel

3.6ร— faster
180 ฮผs โ†’ 50 ฮผs ยท 19,945 transforms/sec

SAD 16ร—16

5.1ร— faster
Motion estimation on Intel Arc A770
Hardware

Every Major GPU. One Codebase.

No vendor lock-in. Write once, deploy everywhere.

GPU Architecture Backend DCT SAD Loop Filter Intra
๐ŸŸข NVIDIA RTX 4090
Ada Lovelace CUDA โœ… โœ… โœ… โœ…
๐ŸŸข NVIDIA RTX 3080
Ampere CUDA โœ… โœ… โœ… โœ…
๐ŸŸข Intel Arc A770
Alchemist Level Zero โœ… โœ… โœ… โœ…
๐ŸŸข Intel Xe LP
Xe LPG Level Zero โœ… โœ… โœ… โœ…
๐ŸŸก AMD RX 7900 XTX
RDNA 3 HIP ๐Ÿ”„ ๐Ÿ”„ ๐Ÿ”„ ๐Ÿ”„

โœ… Full Support    ๐Ÿ”„ Experimental    ๐ŸŸข Tested    ๐ŸŸก Community Reported

Why AVM SYCL

Built for Production

๐Ÿš€

3โ€“5ร— Real Speedup

End-to-end encoding throughput gains. Not microbenchmarks โ€” real-world AV2 encode tests on multiple GPUs.

โฌก

SYCL 2020 Standard

Uses Unified Shared Memory (USM), device selectors, and async handlers. Compile with Intel DPC++, AdaptiveCpp, or NVIDIA clang-CUDA.

๐Ÿ”„

Drop-in Replacement

RTCD-compatible. Replace fdct8x8_cpu() with avm::sycl::fdct8x8(). Works with FFmpeg, OpenCV, GStreamer.

๐ŸŽฏ

Auto GPU Selection

Intelligent device scoring. Automatically picks the best GPU at runtime โ€” no manual configuration needed.

๐Ÿ”’

CPU Fallback

If GPU is unavailable (macOS, driver error, device busy), automatically falls back to optimized CPU implementation.

๐Ÿงช

Well Tested

Unit tests + performance benchmarks validated on Intel Xeon Gold 6530 + Arc A770. 100% test pass rate.

Comparison

Why Not Just Use CUDA?

CUDA is NVIDIA-only. AVM SYCL runs everywhere.

๐Ÿ”ด CUDA / ROCm / OpenCL

  • โŒ NVIDIA only (CUDA)
  • โŒ AMD only (ROCm/HIP)
  • โŒ No cross-vendor portability
  • โŒ Maintain 3 separate codebases
  • โŒ Different optimization strategies per vendor
  • โŒ Hard to integrate in cross-platform apps

๐ŸŸข AVM SYCL

  • โœ… NVIDIA (CUDA backend)
  • โœ… Intel (Level Zero backend)
  • โœ… AMD (HIP backend)
  • โœ… ARM Mali (OpenCL backend)
  • โœ… Single codebase, zero porting effort
  • โœ… Works in FFmpeg, GStreamer, OpenCV
Get Started

3 Lines to GPU Acceleration

Zero integration cost. Just replace the function call.

// your existing code int16_t input[64] = {...}; int32_t output[64]; fdct8x8_cpu(input, output); // slow // replace with AVM SYCL โ€” that's it auto& ctx = avm::sycl::SYCLContext::instance(); ctx.initialize(); // auto-detect best GPU avm::sycl::fdct8x8(ctx.queue(), input, output); // 3.6ร— faster
โญ Get the Code ๐Ÿ“– Architecture Guide ๐Ÿ”— FFmpeg/GStreamer Integration