Echo-DSRN - Triton Kernel Benchmark Report - PyTorch (native) vs Triton Legacy (sequential) vs Triton 3-Pass (new)

#10
by mrs83 - opened
ethicalabs.ai org
edited 9 days ago

Echo-DSRN Triton Kernel Benchmark Report

System Specifications

| Component | Value |
| --- | --- |
| OS | Linux 6.17.0-14-generic |
| CPU | AMD RYZEN AI MAX+ 395 w/ Radeon 8060S |
| Python | 3.13.12 |
| PyTorch | 2.10.0+rocm7.1 |

GPU Specifications

| Metric | Value |
| --- | --- |
| Name | AMD Radeon 8060S |
| Total Memory | 96.00 GB |
| Compute Capability | 11.5 |
| Multi Processors | 20 |

Disclaimer: The throughput metrics (TPS) in this report refer specifically to the isolated DSRN slow-state update kernels. These values represent raw kernel performance and do not reflect end-to-end model generation speeds.
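To make the TPS column easier to interpret, here is a minimal sketch of how a raw kernel throughput figure relates to the measured time. The report does not state the batch size or the exact formula used; the helper names below are hypothetical, and the batch size of 4 is only an inference from the reported numbers (for example 16.76 ms and 1955516 TPS at T=8192 on the Radeon 8060S).

```python
def raw_kernel_tps(batch_size: int, seq_len: int, total_time_ms: float) -> float:
    """Tokens per second for one timed kernel call (assumed definition)."""
    return batch_size * seq_len / (total_time_ms / 1000.0)


def implied_tokens(tps: float, total_time_ms: float) -> float:
    """Number of tokens a reported (TPS, time) pair implies per call."""
    return tps * total_time_ms / 1000.0


# Reported PyTorch row at T=8192 on the Radeon 8060S: 16.76 ms, 1955516 TPS.
tokens = implied_tokens(1955516, 16.76)
implied_batch = tokens / 8192  # close to 4, so batch size 4 is assumed here

# Recomputing TPS from the rounded time gives a value near the reported one;
# small differences are expected because the times are rounded to 0.01 ms.
print(f"implied batch size: {implied_batch:.2f}")
print(f"recomputed TPS at T=128: {raw_kernel_tps(4, 128, 0.55):.0f}")
```

Because this counts only the tokens pushed through the isolated slow-state update kernel, it says nothing about end-to-end generation speed, as the disclaimer notes.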

Kernel Performance Results

Sequence Length T=128

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 0.55 | 20.28 | 931138 |
| Triton (Legacy) | 0.14 | 12.62 | 3655291 |
| Triton (3-Pass) | 0.23 | 8.51 | 2200967 |

Sequence Length T=512

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 0.91 | 81.04 | 2253689 |
| Triton (Legacy) | 0.40 | 50.36 | 5102426 |
| Triton (3-Pass) | 0.37 | 33.97 | 5503546 |

Sequence Length T=1024

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 2.04 | 162.04 | 2007393 |
| Triton (Legacy) | 0.91 | 100.70 | 4504895 |
| Triton (3-Pass) | 0.69 | 67.92 | 5906577 |

Sequence Length T=2048

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 4.41 | 324.04 | 1858820 |
| Triton (Legacy) | 2.02 | 201.36 | 4053194 |
| Triton (3-Pass) | 1.62 | 135.82 | 5048226 |

Sequence Length T=4096

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 8.55 | 648.05 | 1916386 |
| Triton (Legacy) | 3.98 | 402.69 | 4119082 |
| Triton (3-Pass) | 3.24 | 271.61 | 5060568 |

Sequence Length T=8192

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 16.76 | 1296.07 | 1955516 |
| Triton (Legacy) | 7.90 | 805.34 | 4150341 |
| Triton (3-Pass) | 6.08 | 543.19 | 5387080 |
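
The timing columns above can be turned into relative speedups, which makes the crossover between the two Triton kernels easier to see. The following is a small sketch that uses only the times reported for the Radeon 8060S system; the variable and function names are illustrative and not part of the benchmark code.

```python
# Total kernel time in ms per sequence length T, copied from the tables above
# (Radeon 8060S system): (PyTorch, Triton legacy, Triton 3-pass).
times_ms = {
    128: (0.55, 0.14, 0.23),
    512: (0.91, 0.40, 0.37),
    1024: (2.04, 0.91, 0.69),
    2048: (4.41, 2.02, 1.62),
    4096: (8.55, 3.98, 3.24),
    8192: (16.76, 7.90, 6.08),
}


def speedups(times):
    """Return {T: (speedup vs PyTorch, speedup vs Triton legacy)} for 3-pass."""
    out = {}
    for t, (pytorch, legacy, three_pass) in times.items():
        out[t] = (pytorch / three_pass, legacy / three_pass)
    return out


result = speedups(times_ms)
for t, (vs_pytorch, vs_legacy) in result.items():
    print(f"T={t:>5}: 3-pass is {vs_pytorch:.2f}x PyTorch, "
          f"{vs_legacy:.2f}x Triton legacy")
```

A ratio below 1 means the 3-pass kernel is slower than the baseline. On this system that is the case only against the legacy Triton kernel at T=128; from T=512 upward the 3-pass kernel is the fastest of the three.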

System Specifications

| Component | Value |
| --- | --- |
| OS | Linux 6.17.0-19-generic |
| CPU | AMD Ryzen 7 7700 8-Core Processor |
| Python | 3.13.12 |
| PyTorch | 2.10.0+rocm7.1 |

GPU Specifications

| Metric | Value |
| --- | --- |
| Name | AMD Radeon AI PRO R9700 |
| Total Memory | 31.86 GB |
| Compute Capability | 12.0 |
| Multi Processors | 32 |

Disclaimer: The throughput metrics (TPS) in this report refer specifically to the isolated DSRN slow-state update kernels. These values represent raw kernel performance and do not reflect end-to-end model generation speeds.

Kernel Performance Results

Sequence Length T=128

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 0.73 | 20.28 | 696828 |
| Triton (Legacy) | 0.17 | 12.62 | 2940147 |
| Triton (3-Pass) | 0.32 | 8.51 | 1611378 |

Sequence Length T=512

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 1.03 | 81.04 | 1987169 |
| Triton (Legacy) | 0.31 | 50.36 | 6566224 |
| Triton (3-Pass) | 0.35 | 33.97 | 5907795 |

Sequence Length T=1024

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 1.41 | 162.04 | 2902789 |
| Triton (Legacy) | 0.60 | 100.70 | 6878000 |
| Triton (3-Pass) | 0.37 | 67.92 | 11095956 |

Sequence Length T=2048

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 3.18 | 324.04 | 2572318 |
| Triton (Legacy) | 1.26 | 201.36 | 6494856 |
| Triton (3-Pass) | 0.60 | 135.82 | 13661381 |

Sequence Length T=4096

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 4.75 | 648.05 | 3448440 |
| Triton (Legacy) | 2.73 | 402.69 | 5995208 |
| Triton (3-Pass) | 1.25 | 271.61 | 13058581 |

Sequence Length T=8192

| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
| --- | --- | --- | --- |
| PyTorch (Legacy) | 9.09 | 1296.07 | 3606687 |
| Triton (Legacy) | 5.51 | 805.34 | 5943563 |
| Triton (3-Pass) | 2.46 | 543.19 | 13294154 |
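
The peak-memory columns are identical in the tables for both systems, so they can be analysed once. The sketch below, with illustrative names that are not taken from the benchmark code, checks how peak memory scales with T and how the Triton kernels compare with the PyTorch baseline. It makes no assumption about what the memory consists of; it only divides the reported values.

```python
seq_lens = [128, 512, 1024, 2048, 4096, 8192]

# Peak memory in MB per implementation, copied from the tables above
# (the values are the same for the Radeon 8060S and the Radeon AI PRO R9700).
peak_mb = {
    "PyTorch (Legacy)": [20.28, 81.04, 162.04, 324.04, 648.05, 1296.07],
    "Triton (Legacy)": [12.62, 50.36, 100.70, 201.36, 402.69, 805.34],
    "Triton (3-Pass)": [8.51, 33.97, 67.92, 135.82, 271.61, 543.19],
}


def mb_per_position(mb_values, lengths):
    """Peak MB divided by T; a near-constant result indicates linear scaling."""
    return [mb / t for mb, t in zip(mb_values, lengths)]


def ratio_to_baseline(name, baseline="PyTorch (Legacy)"):
    """Peak memory of `name` relative to the baseline at each T."""
    return [m / b for m, b in zip(peak_mb[name], peak_mb[baseline])]


for name, values in peak_mb.items():
    per_pos = mb_per_position(values, seq_lens)
    print(f"{name}: {min(per_pos):.4f} to {max(per_pos):.4f} MB per position")

three_pass_ratio = ratio_to_baseline("Triton (3-Pass)")
print(f"3-pass / PyTorch peak memory at T=8192: {three_pass_ratio[-1]:.2f}")
```

The per-position values stay nearly constant across T for each implementation, which is consistent with peak memory growing roughly linearly with sequence length in these measurements.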
