Remove early stopping from LLaMA end-to-end benchmarking (#20033)

### Description

This PR removes early stopping from the end-to-end LLaMA-2 benchmark script.

### Motivation and Context

This allows models to always generate the requested number of new tokens.
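
To illustrate the effect, here is a minimal sketch of a decoding loop with and without an early exit on EOS. It is not the code in `benchmark_e2e.py`: the function `run_decoding`, the `eos_after` schedule, and the token ids are hypothetical stand-ins for a real model and sampler, and only the `has_eos` idea mirrors the diff.

```python
EOS_TOKEN_ID = 2    # hypothetical EOS id, chosen only for this sketch
OTHER_TOKEN_ID = 7  # hypothetical non-EOS token id


def run_decoding(max_new_tokens, eos_after, early_stop):
    """Toy decoding loop; returns the number of decoding steps executed.

    eos_after[i] is the step at which batch entry i starts emitting EOS.
    It replaces a real inference run so the sketch stays deterministic.
    """
    has_eos = [False] * len(eos_after)
    steps = 0
    while steps < max_new_tokens:
        # Stand-in for one inference run plus sampling: one token per batch entry.
        tokens_to_add = [
            EOS_TOKEN_ID if steps >= first_eos else OTHER_TOKEN_ID
            for first_eos in eos_after
        ]
        # Track which batch entries have produced EOS so far.
        has_eos = [done or tok == EOS_TOKEN_ID for done, tok in zip(has_eos, tokens_to_add)]
        steps += 1
        # Early stopping: leave the loop once every batch entry has hit EOS.
        if early_stop and all(has_eos):
            break
    return steps


steps_with_early_stop = run_decoding(10, eos_after=[2, 4], early_stop=True)
steps_without_early_stop = run_decoding(10, eos_after=[2, 4], early_stop=False)
```

With the early exit, the number of steps depends on when the batch happens to emit EOS. Without it, the loop always runs for the requested number of new tokens.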
2026-07-15 18:23:41 +00:00 · 2024-03-22 14:44:34 -07:00 · 2024-03-22 14:44:34 -07:00 · f9cddd2cf5
commit f9cddd2cf5
parent 7e84ba0ea3
1 changed file with 0 additions and 4 deletions
--- a/onnxruntime/python/tools/transformers/models/llama/benchmark_e2e.py
+++ b/onnxruntime/python/tools/transformers/models/llama/benchmark_e2e.py
@@ -400,11 +400,7 @@ def main():
                sampling_times.append(sampling_end_time - sampling_start_time)

                all_token_ids = torch.cat([all_token_ids, tokens_to_add], dim=-1)
-
-                # Return early if all batch entries have reached EOS token id
                current_length += 1
-                if torch.all(has_eos) or current_length > max_length:
-                    break

                # Update inputs for next inference run
                inputs["input_ids"] = tokens_to_add