onnxruntime/onnxruntime/contrib_ops
Tianlei Wu 09e5724f3b
[CUDA] Fix beam search of num_beams > 32 (#23599)
### Description
* Pass `topk_scores` to the beam scorer in the slow topk path.
* Add an environment variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable/disable fast topk.
* Add a test case for the slow topk path.
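
A minimal sketch of how the new variable might be toggled from a Python test harness. Only the variable name comes from this change; the helper `fast_topk` is hypothetical, and the `"1"`/`"0"` values and the point at which ONNX Runtime reads the variable are assumptions, so check the source before relying on them.

```python
import os
from contextlib import contextmanager

# Variable name taken from the description above.
FAST_TOPK_ENV = "ORT_BEAM_SEARCH_USE_FAST_TOPK"


@contextmanager
def fast_topk(enabled):
    """Temporarily set the variable (hypothetical helper).

    Assumption: "1" enables and "0" disables the fast topk path.
    """
    previous = os.environ.get(FAST_TOPK_ENV)
    os.environ[FAST_TOPK_ENV] = "1" if enabled else "0"
    try:
        yield
    finally:
        # Restore whatever was set before entering the block.
        if previous is None:
            os.environ.pop(FAST_TOPK_ENV, None)
        else:
            os.environ[FAST_TOPK_ENV] = previous
```

To force the slow path for a small beam count, the session would be created and run inside `with fast_topk(False):`, so the variable is set for the whole lifetime of the beam search operator.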

### Motivation and Context

This bug was introduced in
https://github.com/microsoft/onnxruntime/pull/16272

Beam search uses a fast CUDA kernel when the number of beams is <= 32.
When the beam size exceeds that threshold, another code path (a slower
CUDA kernel) computes the topk. In this slow topk path, `topk_scores`
must be passed to the beam scorer, but it was not.
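
The path selection and the data the scorer needs can be illustrated with a small pure-Python sketch. This is not the CUDA implementation: the function names and the `"fast"`/`"slow"` labels are hypothetical, and only the threshold of 32 and the requirement to hand the topk scores to the scorer come from the description.

```python
import heapq

FAST_TOPK_MAX_BEAMS = 32  # threshold quoted in the description


def select_topk_path(num_beams, fast_topk_enabled=True):
    """Return "fast" or "slow" (hypothetical labels) for a beam count."""
    if fast_topk_enabled and num_beams <= FAST_TOPK_MAX_BEAMS:
        return "fast"
    return "slow"


def slow_topk(scores, k):
    """Reference topk: return (topk_scores, topk_indices), best first."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [s for _, s in best], [i for i, _ in best]


def score_candidates(topk_scores, topk_indices):
    """Stand-in for the beam scorer: pairs each index with its score."""
    return list(zip(topk_indices, topk_scores))
```

In this sketch the scorer can only rank candidates correctly if it receives the scores that belong to the selected indices, which is the step the slow path was missing.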

This bug causes incorrect results when num_beams > 32. It went
unnoticed because such large beam sizes are rarely used.
2025-02-06 16:50:31 -08:00
cpu [CUDA] Fix beam search of num_beams > 32 (#23599) 2025-02-06 16:50:31 -08:00
cuda [CUDA] Fix beam search of num_beams > 32 (#23599) 2025-02-06 16:50:31 -08:00
js [JS/WebGPU] GroupQueryAttention rewrite (#20946) 2024-10-23 10:14:09 -07:00
rocm Update BiasGelu fusion and related ops (#23518) 2025-01-30 22:53:59 -08:00
webgpu Implement Flash Attention 2 for webgpu EP (#23576) 2025-02-06 16:32:05 -08:00