Commit graph

13 commits

Author SHA1 Message Date
Frank Dong
92f66de702
remove llama 70b (#21396)
Remove llama 70b model due to security reason.

We need add shard code in HF to enable model shardding for llama-70b,
these codes are not merged into main branch as HF forks want a more
general solution instead of doing shard for specify model. shared code
is kept here:
https://github.com/frank-dong-ms/transformers/tree/frdong/shard_llama

we kept llama-70b related code here for internal use:
https://github.com/frank-dong-ms/onnxruntime/tree/frdong/llama_70b
2024-07-18 12:12:10 -07:00
kunal-vaishnavi
72a3bde330
Add GQA on CPU in LLaMA scripts (#20720)
### Description
This PR adds support for adding GroupQueryAttention (GQA) in models that
are running on CPU.

### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](https://github.com/microsoft/onnxruntime/pull/20598).
2024-05-17 23:23:57 -07:00
kunal-vaishnavi
6e4516cef9
Fix parity checker in LLaMA scripts (#20301)
### Description
This PR fixes the parity checker in the LLaMA scripts by adding the
following.
- Enable buffer sharing manually with `use_buffer_share` instead of
`use_gqa`
- Get max sequence length from model's config


### Motivation and Context
This PR fixes an issue with running the parity checker on other
large-language models where `GroupQueryAttention` can be used without
buffer sharing enabled.
2024-04-14 17:59:14 -07:00
kunal-vaishnavi
6238e9c0af
Add LLaMA end-to-end benchmarking (#19985)
### Description

This PR adds a benchmarking script to measure end-to-end performance and
saves the results in a CSV file.

### Motivation and Context

With this PR, end-to-end performance can be easily measured for many
large-language models such as LLaMA-2. The performance numbers for
LLaMA-2 are located
[here](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/llama).
2024-03-21 18:59:05 -07:00
kunal-vaishnavi
a3ecb63267
Update LLaMA attention fusions (#19200)
### Description
This PR updates the LLaMA-2 attention fusions by adding the following.

- Loading the PyTorch model from Hugging Face with the `LlamaAttention`
class before exporting
- Updating the attention mask pattern matching to support another case

This PR also fixes [this
issue](https://github.com/microsoft/onnxruntime/issues/19040).

### Motivation and Context
Recent changes to Hugging Face's `transformers` library break the
existing pattern matching. Since the attention fusions aim to change the
graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to
`LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, ultimately
it does not matter what nodes comprise the `Set of Attention Nodes`
because they will all be removed and replaced by the `Attention Op` in
the end.

Therefore, it does not matter whether the `LlamaAttention` class or a
different attention class is used to load the PyTorch model before
exporting because the expected graphs after the attention fusions will
look identical no matter the attention class chosen. By loading the
PyTorch model with the `LlamaAttention` class instead of other attention
classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then
exporting it to ONNX, the existing pattern matching will continue to
work.
2024-01-19 11:09:24 -08:00
Abhishek Jindal
2f93d97fd0
Add cuda visible devices for Mistral benchmark (#18764)
### Description
<!-- Describe your changes. -->
Add cuda visible devices for Mistral benchmark as it is not working for
Torch compile and throwing an error.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Error: 
File
"/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/_inductor/triton_heuristics.py",
line 556, in run
    return launcher(
  File "<string>", line 8, in launcher
RuntimeError: Triton Error [CUDA]: invalid device context
2023-12-08 23:12:48 -08:00
petermcaughan
871c52977a
Mistral Optimization & Benchmarking Support (#18225)
### Description
As a prerequisite for this model running correctly, two PRs need to be
merged:

- GQA Sliding Window Attention:
https://github.com/microsoft/onnxruntime/tree/aciddelgado/gqa_local
- MHA Fusion:
https://github.com/frankdongms/onnxruntime/tree/frdong/llama_70b

This PR adds optimization, quantization, and benchmarking support for
Mistral. The README included describes steps to export, optimize, and
benchmark Mistral models, but won't function correctly without the two
above branches being merged first.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: Abhishek Jindal <abjindal@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
2023-12-05 15:39:17 -08:00
Frank Dong
a46c79d211
fix llama2-70b bug, add document (#18398)
1. fix dist setting bug for LLaMA2-70b distributed convert and benchmark
2. Add instruction in README for how to benchmark LLaMA2-70b distribute
inference
2023-11-10 21:59:23 -08:00
kunal-vaishnavi
c8def0cc51
Add LLaMA GQA ragged batching (#18337)
This PR updates replacing MHA with GQA and updates the LLaMA scripts for
the modified GQA op. It is related to the changes in [this
PR](https://github.com/microsoft/onnxruntime/pull/18283).

### Motivation and Context
This PR allows us to run LLaMA with the GQA op end-to-end using ragged
batching (i.e. batched inputs of different lengths).
2023-11-08 09:36:28 -08:00
Frank Dong
dabd395fdf
llama 70b model fusion and shardding (#18175)
### Description
Support llama-70b model fusion and shardding



### Motivation and Context
This change enables shard and export llama-70b model into Onnx as this
model is too large for single GPU.
This change also fuses llama-70b model with repeat_kv pattern different
with llama-7b and llama-13b.
2023-11-02 06:03:59 -07:00
kunal-vaishnavi
b79ea74819
Add updates to LLaMA scripts (#18076)
### Description
This PR adds a few updates to scripts in the LLaMA folder:
- Fixes the precision re-naming in the LLaMA export
- Adds a "prerequisites" section in the README
- Adds IO binding synchronizations during benchmarking for other EPs



### Motivation and Context
- With precision re-naming, the LLaMA parity check does not produce
errors when creating the FP32 CPU model
- The "prerequisites" section shows that there are specific package
versions needed
- This allows for benchmarking with other EPs besides CPU and CUDA
2023-10-26 21:54:23 -07:00
kunal-vaishnavi
2a17d5cf32
LLaMA Model Optimization (#18021)
### Description
This PR contains fusion-level and kernel-level optimizations for [Meta's
LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:

- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
- Reduced from `(max_sequence_length, head_size)` to
`(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
- Removes need for `attention_mask` input as it is supported in the
kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft
version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and
run with Optimum)

Supported variants of LLaMA-2 include:
- [ORT
version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
- Produces one ONNX file that is already optimized (and quantized if
requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
- Some older and current versions of
[`transformers`](https://github.com/huggingface/transformers) and
[`optimum`](https://github.com/huggingface/optimum) that export the
model to ONNX differently
- Note that while some older versions are supported, it is recommended
to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please
note the various `requirements.txt` files for the package versions
recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/14997
- https://github.com/microsoft/onnxruntime/issues/16254
- https://github.com/microsoft/onnxruntime/issues/17681
- https://github.com/microsoft/onnxruntime/issues/17925
- https://github.com/microsoft/onnxruntime-inference-examples/issues/320

This PR uses changes from the following PRs:
- https://github.com/pytorch/pytorch/pull/104468
- https://github.com/pytorch/pytorch/pull/109759
- https://github.com/microsoft/onnxruntime/pull/17020
- https://github.com/microsoft/onnxruntime/pull/17674
- https://github.com/microsoft/onnxruntime/pull/17890
- https://github.com/microsoft/onnxruntime/pull/17920
- https://github.com/huggingface/transformers/pull/26162
- https://github.com/huggingface/optimum/pull/1257
- https://github.com/huggingface/optimum/pull/1289
- https://github.com/huggingface/optimum/pull/1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin
supporting the [new TorchDynamo
exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
- https://github.com/huggingface/transformers/pull/26307
- https://github.com/pytorch/pytorch/issues/104903
- https://github.com/pytorch/pytorch/pull/105040
- https://github.com/microsoft/onnxscript/pull/847
- https://github.com/microsoft/onnxscript/pull/862
- https://github.com/microsoft/onnxscript/issues/493
2023-10-23 13:00:56 -07:00
kunal-vaishnavi
edac3ef150
Add LLaMA scripts (#17020)
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)



### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
2023-08-22 18:05:11 -07:00