Echo-DSRN - Triton Kernel Benchmark Report - PyTorch (native) vs Triton Legacy (sequential) vs Triton 3-Pass (new)
#10
by mrs83 - opened
Echo-DSRN Triton Kernel Benchmark Report
System Specifications (AMD Radeon 8060S)
| Component | Value |
|---|---|
| OS | Linux 6.17.0-14-generic |
| CPU | AMD RYZEN AI MAX+ 395 w/ Radeon 8060S |
| Python | 3.13.12 |
| PyTorch | 2.10.0+rocm7.1 |
GPU Specifications
| Metric | Value |
|---|---|
| Name | AMD Radeon 8060S |
| Total Memory | 96.00 GB |
| Compute Capability | 11.5 |
| Multi Processors | 20 |
Disclaimer: The throughput figures (TPS, tokens per second) in this report refer specifically to the isolated DSRN slow-state update kernels. They represent raw kernel performance and do not reflect end-to-end model generation speed.
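As a sanity check on what "Raw Kernel TPS" measures, the sketch below assumes TPS means tokens per second, computed as batch × T ÷ elapsed time, and back-solves the batch size from the PyTorch rows of the Radeon 8060S tables in this report. Both the formula and the resulting batch size are inferences from the published numbers, not something the report states.

```python
# (T, total time in ms, reported TPS) copied from the PyTorch rows
# of the AMD Radeon 8060S tables in this report.
rows = [
    (128, 0.55, 931138),
    (512, 0.91, 2253689),
    (1024, 2.04, 2007393),
    (2048, 4.41, 1858820),
    (4096, 8.55, 1916386),
    (8192, 16.76, 1955516),
]

# Assumption (not stated in the report): TPS = (batch * T) / seconds,
# which rearranges to batch = TPS * seconds / T.
implied_batches = [tps * (ms / 1000.0) / T for T, ms, tps in rows]

for (T, ms, tps), batch in zip(rows, implied_batches):
    print(f"T={T:5d}  implied batch = {batch:.3f}")
```

Every row lands very close to 4, which suggests the kernels were benchmarked with a batch of 4 sequences. The small deviations come from the times being rounded to two decimals.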
Kernel Performance Results
Sequence Length T=128
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 0.55 | 20.28 | 931138 |
| Triton (Legacy) | 0.14 | 12.62 | 3655291 |
| Triton (3-Pass) | 0.23 | 8.51 | 2200967 |
Sequence Length T=512
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 0.91 | 81.04 | 2253689 |
| Triton (Legacy) | 0.40 | 50.36 | 5102426 |
| Triton (3-Pass) | 0.37 | 33.97 | 5503546 |
Sequence Length T=1024
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 2.04 | 162.04 | 2007393 |
| Triton (Legacy) | 0.91 | 100.70 | 4504895 |
| Triton (3-Pass) | 0.69 | 67.92 | 5906577 |
Sequence Length T=2048
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 4.41 | 324.04 | 1858820 |
| Triton (Legacy) | 2.02 | 201.36 | 4053194 |
| Triton (3-Pass) | 1.62 | 135.82 | 5048226 |
Sequence Length T=4096
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 8.55 | 648.05 | 1916386 |
| Triton (Legacy) | 3.98 | 402.69 | 4119082 |
| Triton (3-Pass) | 3.24 | 271.61 | 5060568 |
Sequence Length T=8192
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 16.76 | 1296.07 | 1955516 |
| Triton (Legacy) | 7.90 | 805.34 | 4150341 |
| Triton (3-Pass) | 6.08 | 543.19 | 5387080 |
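Peak memory in the tables above grows almost linearly with T, so the savings can be summarized as one ratio per implementation. The following sketch only re-derives those ratios from the published figures; the variable names are illustrative and not part of the benchmark code.

```python
# Peak memory in MB from the tables above:
# T -> (PyTorch, Triton Legacy, Triton 3-Pass)
peak_mb = {
    128: (20.28, 12.62, 8.51),
    512: (81.04, 50.36, 33.97),
    1024: (162.04, 100.70, 67.92),
    2048: (324.04, 201.36, 135.82),
    4096: (648.05, 402.69, 271.61),
    8192: (1296.07, 805.34, 543.19),
}

# Fraction of the PyTorch peak that each Triton kernel needs.
legacy_ratio = {T: leg / pt for T, (pt, leg, tp) in peak_mb.items()}
three_pass_ratio = {T: tp / pt for T, (pt, leg, tp) in peak_mb.items()}

# PyTorch peak memory per sequence position, to check linear scaling in T.
mb_per_position = {T: pt / T for T, (pt, leg, tp) in peak_mb.items()}

for T in peak_mb:
    print(
        f"T={T:5d}  legacy/pytorch={legacy_ratio[T]:.3f}  "
        f"3-pass/pytorch={three_pass_ratio[T]:.3f}  "
        f"pytorch MB per position={mb_per_position[T]:.4f}"
    )
```

On these numbers the legacy Triton kernel needs roughly 62% of the PyTorch peak and the 3-Pass kernel roughly 42%, at every sequence length tested.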
System Specifications (AMD Radeon AI PRO R9700)
| Component | Value |
|---|---|
| OS | Linux 6.17.0-19-generic |
| CPU | AMD Ryzen 7 7700 8-Core Processor |
| Python | 3.13.12 |
| PyTorch | 2.10.0+rocm7.1 |
GPU Specifications
| Metric | Value |
|---|---|
| Name | AMD Radeon AI PRO R9700 |
| Total Memory | 31.86 GB |
| Compute Capability | 12.0 |
| Multi Processors | 32 |
Kernel Performance Results
Sequence Length T=128
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 0.73 | 20.28 | 696828 |
| Triton (Legacy) | 0.17 | 12.62 | 2940147 |
| Triton (3-Pass) | 0.32 | 8.51 | 1611378 |
Sequence Length T=512
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 1.03 | 81.04 | 1987169 |
| Triton (Legacy) | 0.31 | 50.36 | 6566224 |
| Triton (3-Pass) | 0.35 | 33.97 | 5907795 |
Sequence Length T=1024
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 1.41 | 162.04 | 2902789 |
| Triton (Legacy) | 0.60 | 100.70 | 6878000 |
| Triton (3-Pass) | 0.37 | 67.92 | 11095956 |
Sequence Length T=2048
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 3.18 | 324.04 | 2572318 |
| Triton (Legacy) | 1.26 | 201.36 | 6494856 |
| Triton (3-Pass) | 0.60 | 135.82 | 13661381 |
Sequence Length T=4096
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 4.75 | 648.05 | 3448440 |
| Triton (Legacy) | 2.73 | 402.69 | 5995208 |
| Triton (3-Pass) | 1.25 | 271.61 | 13058581 |
Sequence Length T=8192
| Implementation | Total Time (ms) | Peak Memory (MB) | Raw Kernel TPS |
|---|---|---|---|
| PyTorch (Legacy) | 9.09 | 1296.07 | 3606687 |
| Triton (Legacy) | 5.51 | 805.34 | 5943563 |
| Triton (3-Pass) | 2.46 | 543.19 | 13294154 |
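To compare the two Triton kernels across both GPUs, the sketch below divides the legacy kernel's time by the 3-Pass kernel's time for each sequence length and finds the smallest T at which 3-Pass is faster. It uses only the "Total Time (ms)" values from the tables above; because those are rounded to two decimals, the ratios are approximate.

```python
# Total time in ms per sequence length: T -> (Triton Legacy, Triton 3-Pass).
times_ms = {
    "AMD Radeon 8060S": {
        128: (0.14, 0.23), 512: (0.40, 0.37), 1024: (0.91, 0.69),
        2048: (2.02, 1.62), 4096: (3.98, 3.24), 8192: (7.90, 6.08),
    },
    "AMD Radeon AI PRO R9700": {
        128: (0.17, 0.32), 512: (0.31, 0.35), 1024: (0.60, 0.37),
        2048: (1.26, 0.60), 4096: (2.73, 1.25), 8192: (5.51, 2.46),
    },
}

# Speedup of 3-Pass over Legacy; values above 1.0 mean 3-Pass is faster.
speedup = {
    gpu: {T: legacy / three_pass for T, (legacy, three_pass) in rows.items()}
    for gpu, rows in times_ms.items()
}

# Smallest benchmarked T at which 3-Pass beats Legacy on each GPU.
crossover = {
    gpu: min(T for T, s in ratios.items() if s > 1.0)
    for gpu, ratios in speedup.items()
}

for gpu, ratios in speedup.items():
    print(gpu, "crossover at T =", crossover[gpu])
    for T, s in ratios.items():
        print(f"  T={T:5d}  3-pass speedup over legacy = {s:.2f}x")
```

On these figures the 3-Pass kernel overtakes the legacy kernel at T=512 on the Radeon 8060S and at T=1024 on the Radeon AI PRO R9700. At T=8192 it is about 1.3x faster on the 8060S and about 2.2x faster on the R9700, while the legacy kernel remains faster at T=128 on both.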