onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-29 03:30:52 +00:00

Author	SHA1	Message	Date
petermcaughan	871c52977a	Mistral Optimization & Benchmarking Support (#18225) ### Description As a prerequisite for this model running correctly, two PRs need to be merged: - GQA Sliding Window Attention: https://github.com/microsoft/onnxruntime/tree/aciddelgado/gqa_local - MHA Fusion: https://github.com/frankdongms/onnxruntime/tree/frdong/llama_70b This PR adds optimization, quantization, and benchmarking support for Mistral. The README included describes steps to export, optimize, and benchmark Mistral models, but won't function correctly without the two above branches being merged first. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com> Co-authored-by: Abhishek Jindal <abjindal@microsoft.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>	2023-12-05 15:39:17 -08:00
Frank Dong	a46c79d211	fix llama2-70b bug, add document (#18398) 1. Fix dist setting bug for LLaMA2-70b distributed convert and benchmark 2. Add instructions in README for how to benchmark LLaMA2-70b distributed inference	2023-11-10 21:59:23 -08:00
kunal-vaishnavi	c8def0cc51	Add LLaMA GQA ragged batching (#18337) This PR updates the replacement of MHA with GQA and updates the LLaMA scripts for the modified GQA op. It is related to the changes in [this PR](https://github.com/microsoft/onnxruntime/pull/18283). ### Motivation and Context This PR allows us to run LLaMA with the GQA op end-to-end using ragged batching (i.e. batched inputs of different lengths).	2023-11-08 09:36:28 -08:00
Frank Dong	dabd395fdf	llama 70b model fusion and sharding (#18175) ### Description Support llama-70b model fusion and sharding ### Motivation and Context This change enables sharding and exporting the llama-70b model to ONNX, as this model is too large for a single GPU. This change also fuses the llama-70b model, whose repeat_kv pattern differs from that of llama-7b and llama-13b.	2023-11-02 06:03:59 -07:00
kunal-vaishnavi	b79ea74819	Add updates to LLaMA scripts (#18076) ### Description This PR adds a few updates to scripts in the LLaMA folder: - Fixes the precision re-naming in the LLaMA export - Adds a "prerequisites" section in the README - Adds IO binding synchronizations during benchmarking for other EPs ### Motivation and Context - With precision re-naming, the LLaMA parity check does not produce errors when creating the FP32 CPU model - The "prerequisites" section shows that there are specific package versions needed - This allows for benchmarking with other EPs besides CPU and CUDA	2023-10-26 21:54:23 -07:00
kunal-vaishnavi	2a17d5cf32	LLaMA Model Optimization (#18021) ### Description This PR contains fusion-level and kernel-level optimizations for [Meta's LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/). Some of the added optimizations include: - SimplifiedLayerNorm changes - Fusions for multiple variants - SkipSimplifiedLayerNorm changes - Kernel support for CPU - Rotary embeddings (previously did not exist) - Fusions for multiple variants - CPU and CUDA kernels - Supports interleaving and non-interleaving in the same kernels - Optimized cache that requires half of its originally exported sizes - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)` - Multi-head attention - Support for 2D and 3D attention masks - Group query attention (for FP16 CUDA and INT4 CUDA) - Integration with flash attention v2 and past-present buffer sharing - Removes need for `attention_mask` input as it is supported in the kernel - 4 bit quantization - `block_size` parameter is available for customizing - Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Support combinations of the below variants (ex: export ORT version and run with Optimum) Supported variants of LLaMA-2 include: - [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama) - Produces one ONNX file that is already optimized (and quantized if requested) - Integrates with Optimum - [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Already exported and available off-the-shelf - Faster versions of those models will be uploaded there soon - [Hugging Face version](https://huggingface.co/meta-llama) - Models that end with `-hf` - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that export the model to ONNX differently - Note that while some older versions are supported, it is recommended to use the latest package versions. ### Usage To use the optimizations, please see `README.md` for details. Please note the various `requirements.txt` files for the package versions recommended in order to use these changes. To run the ORT transformer optimizer separately, run the script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0 ``` ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/14997 - https://github.com/microsoft/onnxruntime/issues/16254 - https://github.com/microsoft/onnxruntime/issues/17681 - https://github.com/microsoft/onnxruntime/issues/17925 - https://github.com/microsoft/onnxruntime-inference-examples/issues/320 This PR uses changes from the following PRs: - https://github.com/pytorch/pytorch/pull/104468 - https://github.com/pytorch/pytorch/pull/109759 - https://github.com/microsoft/onnxruntime/pull/17020 - https://github.com/microsoft/onnxruntime/pull/17674 - https://github.com/microsoft/onnxruntime/pull/17890 - https://github.com/microsoft/onnxruntime/pull/17920 - https://github.com/huggingface/transformers/pull/26162 - https://github.com/huggingface/optimum/pull/1257 - https://github.com/huggingface/optimum/pull/1289 - https://github.com/huggingface/optimum/pull/1462 ### New TorchDynamo Exporter (experimental stage) This PR uses changes from the following issues and PRs to begin supporting the [new TorchDynamo exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter): - https://github.com/huggingface/transformers/pull/26307 - https://github.com/pytorch/pytorch/issues/104903 - https://github.com/pytorch/pytorch/pull/105040 - https://github.com/microsoft/onnxscript/pull/847 - https://github.com/microsoft/onnxscript/pull/862 - https://github.com/microsoft/onnxscript/issues/493	2023-10-23 13:00:56 -07:00
kunal-vaishnavi	edac3ef150	Add LLaMA scripts (#17020) ### Description This PR adds the following scripts for LLaMA: - LLaMA conversion (support for TorchScript and Dynamo exporters) - LLaMA parity - LLaMA benchmark - LLaMA quantization - LLaMA integration with [Hugging Face Optimum](https://github.com/huggingface/optimum) ### Motivation and Context This PR adds scripts for using LLaMA. There is a [follow-up PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding scripts for Whisper.	2023-08-22 18:05:11 -07:00

7 commits
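
Note on the rotary embedding cache mentioned in commit 2a17d5cf32 (#18021): the message says the cache shrinks from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)`. The sketch below illustrates why that can be lossless. It is not code from the PR. It assumes the full-size cache is two concatenated copies of the same half, as in common rotary embedding implementations, and the function names `rotary_cache` and `expand_full` and the base of 10000 are hypothetical choices for illustration only.

```python
import math


def rotary_cache(max_sequence_length, head_size, base=10000.0):
    """Build cos/sin caches of shape (max_sequence_length, head_size // 2).

    `base` is an assumed default, not a value taken from the commit.
    """
    half = head_size // 2
    # One inverse frequency per pair of head dimensions.
    inv_freq = [base ** (-2.0 * i / head_size) for i in range(half)]
    cos_cache = [[math.cos(pos * f) for f in inv_freq]
                 for pos in range(max_sequence_length)]
    sin_cache = [[math.sin(pos * f) for f in inv_freq]
                 for pos in range(max_sequence_length)]
    return cos_cache, sin_cache


def expand_full(cache):
    """Rebuild the (max_sequence_length, head_size) layout by repeating each row.

    Assumes the full-size layout is the half cache concatenated with itself.
    """
    return [row + row for row in cache]


if __name__ == "__main__":
    cos, sin = rotary_cache(max_sequence_length=8, head_size=4)
    print(len(cos), len(cos[0]))               # half-size cache dimensions
    print(len(expand_full(cos)[0]))            # full-size row length
```

Under that assumption the second half of every full-size row duplicates the first, so storing only `head_size / 2` columns keeps all the information while halving memory.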