### Description

Support INT4 weight-only quantization (WOQ) via Intel Neural Compressor, with two algorithms: RTN and GPTQ.

**Note:** Please install `neural-compressor==2.3` for weight-only quantization.

### Motivation and Context

As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that meet the computational demands of these architectures while maintaining accuracy. Compared to conventional quantization such as W8A8, weight-only quantization is probably a better trade-off between performance and accuracy.

RTN (round-to-nearest) is the most straightforward way to quantize weights. The GPTQ algorithm provides more accurate quantization but requires more computational resources.

### Evaluation results

The following table shows the accuracy of Llama-2 models evaluated on the [lambada_openai](https://huggingface.co/datasets/lambada) task. `GPTQ W4G32Asym` in the Configuration column means that the GPTQ algorithm is used for 4-bit weight-only quantization with `group_size=32` and `scheme=asym`.
<table class="tg">
<thead>
  <tr>
    <th rowspan="2">Model name</th>
    <th rowspan="2">Configuration</th>
    <th colspan="2">Lambada_openai</th>
    <th rowspan="2">Accuracy Ratio<br>[WOQ/FP32]</th>
  </tr>
  <tr>
    <th>Accuracy</th>
    <th>Perplexity</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-7b-chat-hf</td>
    <td>FP32</td>
    <td>0.7058</td>
    <td>3.2788</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G32Asym</td>
    <td>0.7025</td>
    <td>3.4489</td>
    <td>99.53%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-7b-hf</td>
    <td>FP32</td>
    <td>0.7392</td>
    <td>3.3950</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G32Asym</td>
    <td>0.7326</td>
    <td>3.5286</td>
    <td>99.11%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-13b-chat-hf</td>
    <td>FP32</td>
    <td>0.7312</td>
    <td>2.9163</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G128Asym</td>
    <td>0.7289</td>
    <td>3.0061</td>
    <td>99.56%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-13b-hf</td>
    <td>FP32</td>
    <td>0.7677</td>
    <td>3.0438</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G32Asym</td>
    <td>0.7607</td>
    <td>3.1562</td>
    <td>99.09%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-70b-chat-hf</td>
    <td>FP32</td>
    <td>0.7543</td>
    <td>2.6181</td>
    <td>/</td>
  </tr>
  <tr>
    <td>RTN<br>W4G32Sym</td>
    <td>0.7489</td>
    <td>2.6850</td>
    <td>99.28%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-70b-hf</td>
    <td>FP32</td>
    <td>0.7964</td>
    <td>2.6612</td>
    <td>/</td>
  </tr>
  <tr>
    <td>RTN<br>W4G32Sym</td>
    <td>0.7896</td>
    <td>2.7546</td>
    <td>99.15%</td>
  </tr>
</tbody>
</table>

---------

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
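
To make configurations such as `W4G32Sym` and `W4G32Asym` in the table more concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest (RTN) quantization. It is an illustration only, not Intel Neural Compressor's implementation: the function names (`rtn_quantize_group`, `dequantize_group`, `rtn_quantize`) are hypothetical, the asymmetric branch assumes the common convention of extending the range to include zero, and real implementations work on tensors and pack two 4-bit values per byte.

```python
def rtn_quantize_group(weights, bits=4, scheme="asym"):
    """Quantize one group of floats to `bits`-bit integers (hypothetical helper).

    Returns (quantized_ints, scale, zero_point).
    """
    if scheme == "asym":
        # Unsigned range [0, 2^bits - 1] with a per-group zero point.
        qmin, qmax = 0, 2**bits - 1
        # Assumption: widen the range so that 0.0 is always representable.
        w_min = min(min(weights), 0.0)
        w_max = max(max(weights), 0.0)
        scale = (w_max - w_min) / (qmax - qmin)
        if scale == 0.0:  # all-zero group
            scale = 1.0
        zero_point = round(-w_min / scale)
    elif scheme == "sym":
        # Signed range [-2^(bits-1), 2^(bits-1) - 1], zero point fixed at 0.
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        scale = max(abs(w) for w in weights) / qmax
        if scale == 0.0:  # all-zero group
            scale = 1.0
        zero_point = 0
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")

    # Round to nearest, shift by the zero point, and clamp to the integer range.
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Map quantized integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def rtn_quantize(weights, group_size=32, bits=4, scheme="asym"):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        rtn_quantize_group(weights[start:start + group_size], bits, scheme)
        for start in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    # Toy weights; W4G32Asym corresponds to bits=4, group_size=32, scheme="asym".
    toy = [(i - 20) / 10 for i in range(64)]
    for q, scale, zp in rtn_quantize(toy, group_size=32, bits=4, scheme="asym"):
        print(f"scale={scale:.4f} zero_point={zp} first_values={q[:4]}")
```

Smaller groups give each scale fewer weights to cover, which generally reduces rounding error at the cost of storing more scales and zero points. GPTQ keeps the same storage format but chooses the rounded values using calibration data, which this sketch does not attempt.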

ONNX Runtime is a cross-platform inference and training machine-learning accelerator.
ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms.
ONNX Runtime training can reduce model training time on multi-node NVIDIA GPUs for transformer models with a one-line addition to existing PyTorch training scripts.
## Get Started & Resources

- General Information: onnxruntime.ai
- Usage documentation and tutorials: onnxruntime.ai/docs
- YouTube video tutorials: youtube.com/@ONNXRuntime
- Companion sample repositories:
  - ONNX Runtime Inferencing: microsoft/onnxruntime-inference-examples
  - ONNX Runtime Training: microsoft/onnxruntime-training-examples

## Data/Telemetry

Windows distributions of this project may collect usage data and send it to Microsoft to help improve our products and services. See the privacy statement for more details.

## Contributions and Feedback

We welcome contributions! Please see the contribution guidelines.

For feature requests or bug reports, please file a GitHub Issue.

For general discussion or questions, please use GitHub Discussions.

## Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

## License

This project is licensed under the MIT License.