onnxruntime/onnxruntime
Tianlei Wu 09e5724f3b
[CUDA] Fix beam search of num_beams > 32 (#23599)
### Description
* Pass `topk_scores` to the beam scorer in the slow topk path.
* Add an environment variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable or disable fast topk.
* Add a test case for the slow topk path.
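The path selection described above can be sketched as a small dispatch function. This is an illustrative sketch, not the ONNX Runtime implementation. The 32-beam threshold comes from this commit. How `ORT_BEAM_SEARCH_USE_FAST_TOPK` is parsed is an assumption here: the sketch treats fast topk as enabled by default and disabled by `"0"` or `"false"`. The function name `use_fast_topk` is hypothetical.

```python
import os

# Threshold from the commit description: the fast CUDA kernel handles num_beams <= 32.
FAST_TOPK_MAX_BEAMS = 32


def use_fast_topk(num_beams: int, env=None) -> bool:
    """Return True if the fast topk path would be taken (hypothetical helper).

    Assumption: ORT_BEAM_SEARCH_USE_FAST_TOPK defaults to enabled, and a value
    of "0" or "false" disables it. The real parsing rules in ORT may differ.
    """
    env = os.environ if env is None else env
    raw = env.get("ORT_BEAM_SEARCH_USE_FAST_TOPK", "1").strip().lower()
    enabled = raw not in ("0", "false")
    # Fast path only when it is enabled AND the beam count fits the fast kernel.
    return enabled and num_beams <= FAST_TOPK_MAX_BEAMS
```

One plausible benefit of such a switch is that a test can force the slow path with a small beam size, e.g. `use_fast_topk(4, {"ORT_BEAM_SEARCH_USE_FAST_TOPK": "0"})` returns `False`, instead of needing more than 32 beams.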

### Motivation and Context

This bug was introduced in
https://github.com/microsoft/onnxruntime/pull/16272

Beam search uses a fast CUDA kernel to compute topk when the number of beams
is <= 32. When the beam size exceeds that threshold, another code path (a
slower CUDA kernel) computes topk. In this `slow topk path`, `topk_scores`
should be passed to the beam scorer, but it was not.
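The contract the fix restores can be illustrated with a single-batch Python sketch: whichever topk implementation runs, both its scores and its indices must reach the beam scorer. Everything here is hypothetical and stands in for the CUDA code: `fast_topk` (heap-based) and `slow_topk` (full sort) are placeholders for the two kernels, `beam_scorer` is a toy that maps flat indices back to `(beam, token, score)`, and keeping `2 * num_beams` candidates is an assumed convention, not something stated in this commit.

```python
import heapq
from typing import List, Tuple

TopK = Tuple[List[float], List[int]]


def fast_topk(scores: List[float], k: int) -> TopK:
    # Placeholder for the fast kernel: heap-based selection of the k largest scores.
    pairs = heapq.nlargest(k, enumerate(scores), key=lambda p: p[1])
    return [s for _, s in pairs], [i for i, _ in pairs]


def slow_topk(scores: List[float], k: int) -> TopK:
    # Placeholder for the slower general kernel: sort all indices by score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order


def beam_scorer(topk_scores, topk_indices, vocab_size):
    # Toy scorer: split each flat index into (beam, token) and keep its score.
    return [(idx // vocab_size, idx % vocab_size, s)
            for s, idx in zip(topk_scores, topk_indices)]


def select_candidates(flat_scores, num_beams, vocab_size, use_fast):
    # flat_scores has num_beams * vocab_size entries for one batch item.
    k = 2 * num_beams  # assumed convention: keep 2 * num_beams candidates
    topk = fast_topk if use_fast else slow_topk
    topk_scores, topk_indices = topk(flat_scores, k)
    # Both paths must hand topk_scores to the scorer; dropping them on the
    # slow path is the kind of bug this commit describes.
    return beam_scorer(topk_scores, topk_indices, vocab_size)
```

With 2 beams over a 4-token vocabulary, both paths should produce identical candidates; a test that compares them (or forces the slow path) catches a missing `topk_scores` argument.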

This bug causes incorrect results when num_beams > 32. It went unnoticed
because such large beam sizes are rarely used.
2025-02-06 16:50:31 -08:00
contrib_ops [CUDA] Fix beam search of num_beams > 32 (#23599) 2025-02-06 16:50:31 -08:00
core OpenVINO EP Weights Sharing Feature (#23553) 2025-02-06 14:57:38 -08:00
lora
python [TensorRT EP] support TensorRT 10.8-GA (#23592) 2025-02-06 10:05:57 -08:00
test [CUDA] Fix beam search of num_beams > 32 (#23599) 2025-02-06 16:50:31 -08:00
tool/etw
wasm [WebNN] Fixed WebNN Module undefined issue (#22795) 2024-11-11 21:31:24 -08:00
__init__.py Use ruff as the formatter to replace black-isort (#23397) 2025-01-16 11:14:15 -08:00
ReformatSource.ps1
ReformatSourcePython.bat
VSCodeCoverage.runsettings