onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-28 22:56:32 +00:00

Author	SHA1	Message	Date
Frank Dong	92f66de702	remove llama 70b (#21396 ) Remove llama 70b model due to security reason. We need add shard code in HF to enable model shardding for llama-70b, these codes are not merged into main branch as HF forks want a more general solution instead of doing shard for specify model. shared code is kept here: https://github.com/frank-dong-ms/transformers/tree/frdong/shard_llama we kept llama-70b related code here for internal use: https://github.com/frank-dong-ms/onnxruntime/tree/frdong/llama_70b	2024-07-18 12:12:10 -07:00
kunal-vaishnavi	72a3bde330	Add GQA on CPU in LLaMA scripts (#20720 ) ### Description This PR adds support for adding GroupQueryAttention (GQA) in models that are running on CPU. ### Motivation and Context Previously, the LLaMA scripts supported creating models that have GQA for CUDA only. With the recently added support for [GQA on CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where `num_attention_heads != num_key_value_heads` can now use the GQA op and [run much faster on CPU](https://github.com/microsoft/onnxruntime/pull/20598).	2024-05-17 23:23:57 -07:00
kunal-vaishnavi	6e4516cef9	Fix parity checker in LLaMA scripts (#20301 ) ### Description This PR fixes the parity checker in the LLaMA scripts by adding the following. - Enable buffer sharing manually with `use_buffer_share` instead of `use_gqa` - Get max sequence length from model's config ### Motivation and Context This PR fixes an issue with running the parity checker on other large-language models where `GroupQueryAttention` can be used without buffer sharing enabled.	2024-04-14 17:59:14 -07:00
kunal-vaishnavi	6238e9c0af	Add LLaMA end-to-end benchmarking (#19985 ) ### Description This PR adds a benchmarking script to measure end-to-end performance and saves the results in a CSV file. ### Motivation and Context With this PR, end-to-end performance can be easily measured for many large-language models such as LLaMA-2. The performance numbers for LLaMA-2 are located [here](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/llama).	2024-03-21 18:59:05 -07:00
kunal-vaishnavi	a3ecb63267	Update LLaMA attention fusions (#19200 ) ### Description This PR updates the LLaMA-2 attention fusions by adding the following. - Loading the PyTorch model from Hugging Face with the `LlamaAttention` class before exporting - Updating the attention mask pattern matching to support another case This PR also fixes [this issue](https://github.com/microsoft/onnxruntime/issues/19040). ### Motivation and Context Recent changes to Hugging Face's `transformers` library break the existing pattern matching. Since the attention fusions aim to change the graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to `LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, ultimately it does not matter what nodes comprise the `Set of Attention Nodes` because they will all be removed and replaced by the `Attention Op` in the end. Therefore, it does not matter whether the `LlamaAttention` class or a different attention class is used to load the PyTorch model before exporting because the expected graphs after the attention fusions will look identical no matter the attention class chosen. By loading the PyTorch model with the `LlamaAttention` class instead of other attention classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then exporting it to ONNX, the existing pattern matching will continue to work.	2024-01-19 11:09:24 -08:00
Abhishek Jindal	2f93d97fd0	Add cuda visible devices for Mistral benchmark (#18764 ) ### Description <!-- Describe your changes. --> Add cuda visible devices for Mistral benchmark as it is not working for Torch compile and throwing an error. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Error: File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/_inductor/triton_heuristics.py", line 556, in run return launcher( File "<string>", line 8, in launcher RuntimeError: Triton Error [CUDA]: invalid device context	2023-12-08 23:12:48 -08:00
petermcaughan	871c52977a	Mistral Optimization & Benchmarking Support (#18225 ) ### Description As a prerequisite for this model running correctly, two PRs need to be merged: - GQA Sliding Window Attention: https://github.com/microsoft/onnxruntime/tree/aciddelgado/gqa_local - MHA Fusion: https://github.com/frankdongms/onnxruntime/tree/frdong/llama_70b This PR adds optimization, quantization, and benchmarking support for Mistral. The README included describes steps to export, optimize, and benchmark Mistral models, but won't function correctly without the two above branches being merged first. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com> Co-authored-by: Abhishek Jindal <abjindal@microsoft.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>	2023-12-05 15:39:17 -08:00
Frank Dong	a46c79d211	fix llama2-70b bug, add document (#18398 ) 1. fix dist setting bug for LLaMA2-70b distributed convert and benchmark 2. Add instruction in README for how to benchmark LLaMA2-70b distribute inference	2023-11-10 21:59:23 -08:00
kunal-vaishnavi	c8def0cc51	Add LLaMA GQA ragged batching (#18337 ) This PR updates replacing MHA with GQA and updates the LLaMA scripts for the modified GQA op. It is related to the changes in [this PR](https://github.com/microsoft/onnxruntime/pull/18283). ### Motivation and Context This PR allows us to run LLaMA with the GQA op end-to-end using ragged batching (i.e. batched inputs of different lengths).	2023-11-08 09:36:28 -08:00
Frank Dong	dabd395fdf	llama 70b model fusion and shardding (#18175 ) ### Description Support llama-70b model fusion and shardding ### Motivation and Context This change enables shard and export llama-70b model into Onnx as this model is too large for single GPU. This change also fuses llama-70b model with repeat_kv pattern different with llama-7b and llama-13b.	2023-11-02 06:03:59 -07:00
kunal-vaishnavi	b79ea74819	Add updates to LLaMA scripts (#18076 ) ### Description This PR adds a few updates to scripts in the LLaMA folder: - Fixes the precision re-naming in the LLaMA export - Adds a "prerequisites" section in the README - Adds IO binding synchronizations during benchmarking for other EPs ### Motivation and Context - With precision re-naming, the LLaMA parity check does not produce errors when creating the FP32 CPU model - The "prerequisites" section shows that there are specific package versions needed - This allows for benchmarking with other EPs besides CPU and CUDA	2023-10-26 21:54:23 -07:00
kunal-vaishnavi	2a17d5cf32	LLaMA Model Optimization (#18021 ) ### Description This PR contains fusion-level and kernel-level optimizations for [Meta's LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/). Some of the added optimizations include: - SimplifiedLayerNorm changes - Fusions for multiple variants - SkipSimplifiedLayerNorm changes - Kernel support for CPU - Rotary embeddings (previously did not exist) - Fusions for multiple variants - CPU and CUDA kernels - Supports interleaving and non-interleaving in the same kernels - Optimized cache that requires half of its originally exported sizes - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)` - Multi-head attention - Support for 2D and 3D attention masks - Group query attention (for FP16 CUDA and INT4 CUDA) - Integration with flash attention v2 and past-present buffer sharing - Removes need for `attention_mask` input as it is supported in the kernel - 4 bit quantization - `block_size` parameter is available for customizing - Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Support combinations of the below variants (ex: export ORT version and run with Optimum) Supported variants of LLaMA-2 include: - [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama) - Produces one ONNX file that is already optimized (and quantized if requested) - Integrates with Optimum - [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Already exported and available off-the-shelf - Faster versions of those models will be uploaded there soon - [Hugging Face version](https://huggingface.co/meta-llama) - Models that end with `-hf` - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that export the model to ONNX differently - Note that while some older versions are supported, it is recommended to use the latest package versions. ### Usage To use the optimizations, please see `README.md` for details. Please note the various `requirements.txt` files for the package versions recommended in order to use these changes. To run the ORT transformer optimizer separately, run the script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0 ``` ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/14997 - https://github.com/microsoft/onnxruntime/issues/16254 - https://github.com/microsoft/onnxruntime/issues/17681 - https://github.com/microsoft/onnxruntime/issues/17925 - https://github.com/microsoft/onnxruntime-inference-examples/issues/320 This PR uses changes from the following PRs: - https://github.com/pytorch/pytorch/pull/104468 - https://github.com/pytorch/pytorch/pull/109759 - https://github.com/microsoft/onnxruntime/pull/17020 - https://github.com/microsoft/onnxruntime/pull/17674 - https://github.com/microsoft/onnxruntime/pull/17890 - https://github.com/microsoft/onnxruntime/pull/17920 - https://github.com/huggingface/transformers/pull/26162 - https://github.com/huggingface/optimum/pull/1257 - https://github.com/huggingface/optimum/pull/1289 - https://github.com/huggingface/optimum/pull/1462 ### New TorchDynamo Exporter (experimental stage) This PR uses changes from the following issues and PRs to begin supporting the [new TorchDynamo exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter): - https://github.com/huggingface/transformers/pull/26307 - https://github.com/pytorch/pytorch/issues/104903 - https://github.com/pytorch/pytorch/pull/105040 - https://github.com/microsoft/onnxscript/pull/847 - https://github.com/microsoft/onnxscript/pull/862 - https://github.com/microsoft/onnxscript/issues/493	2023-10-23 13:00:56 -07:00
kunal-vaishnavi	edac3ef150	Add LLaMA scripts (#17020 ) ### Description This PR adds the following scripts for LLaMA: - LLaMA conversion (support for TorchScript and Dynamo exporters) - LLaMA parity - LLaMA benchmark - LLaMA quantization - LLaMA integration with [Hugging Face Optimum](https://github.com/huggingface/optimum) ### Motivation and Context This PR adds scripts for using LLaMA. There is a [follow-up PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding scripts for Whisper.	2023-08-22 18:05:11 -07:00

13 commits