onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-20 02:07:56 +00:00

Author	SHA1	Message	Date
Patrice Vignola	3e4a471f36	Cherry-pick 7 changes into 1.14.1 (#14762 ) This cherry picks the following 7 changes into 1.14.1: `1b7f65437e` `b539c364ee` `12d91173c4` `ff3aed8540` `3d79b1f06e` `c0d2472ede` `e9ec4c098b` --------- Signed-off-by: Cliff Woolley <jwoolley@nvidia.com> Co-authored-by: Sheil Kumar <smk2007@gmail.com> Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com> Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com> Co-authored-by: Misha Chornyi <99709299+mc-nv@users.noreply.github.com> Co-authored-by: Cliff Woolley <jwoolley@nvidia.com> Co-authored-by: cao lei <jslhcl@gmail.com> Co-authored-by: Lei Cao <leca@microsoft.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Vincent Wang <wangwchpku@outlook.com> Co-authored-by: Jian Chen <cjian@microsoft.com>	2023-02-23 11:00:29 -08:00
Rui Ren	6ccaeddefa	ORT 1.14.0 release -- cherry pick round3 (#14617)
### Description
This is the final cherry-pick; no more PRs will be accepted. Third round of cherry-picks, 10 PRs in total, listed below. See [here](https://github.com/microsoft/onnxruntime/issues?q=label%3Arelease%3A1.14+sort%3Aupdated-asc+is%3Aclosed+label%3Atriage%3Aapproved) for the full list.
| No. | PR | PR # | Commit # | Short # |
| -- | -- | -- | -- | -- |
| 1 | remove 'module' field from package.json | 14532 | `cfb6e528c8` | `cfb6e52` |
| 2 | Fix CI failure: temporarily disable real model tests from onnx repo | 14606 | `cf8bad7f19` | `cf8bad7` |
| 3 | Stable Diffusion CUDA optimizations Part 2 | 14597 | `742658d171` | `742658d` |
| 4 | reduce cuda library binary size | 14555 | `8de885fdb1` | `8de885f` |
| 5 | Remove Identical Children Consolidation from default transformer uitil. | 14602 | `585f43e31d` | `585f43e` |
| 6 | Revert mimalloc from v2.0.9 to v2.0.3 | 14603 | `b6bec54341` | `b6bec54` |
| 7 | Adding RunOptions synchronization behaviour to C/C++ API | 14088 | `e9ab56fa64` | `e9ab56f` |
| 8 | Move TRT include_directories to outside scope | 14622 | `0a6b22018f` | `0a6b220` |
| 9 | Remove torch package from requirements.txt of stable diffusion models | 14630 | `cfda876a3f` | `cfda876` |
| 10 | Test and fix optimizers LayerNormFusion, BiasSoftmaxFusion, Transpose for opset 18 | 14542 | `30ec8b038f` | `30ec8b0` |
### Motivation and Context
Last round cherry-pick for ORT 1.14.0 release.

---------
Signed-off-by: Kevin Chen <kevinch@nvidia.com> Signed-off-by: xadupre <xadupre@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Chun-Wei Chen <jacky82226@gmail.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Yufeng Li <liyufeng1987@gmail.com> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com> Co-authored-by: Chi Lo <chi.lo@microsoft.com> Co-authored-by: Kevin Chen <45886021+kevinch-nv@users.noreply.github.com> Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>	2023-02-09 10:08:02 -08:00
Rui Ren	5ae597da6a	ORT 1.14.0 release -- cherry pick round2 (#14573)
### Description
Second round of cherry-picks, 13 PRs in total, listed below. See [here](https://github.com/microsoft/onnxruntime/issues?q=label%3Arelease%3A1.14+sort%3Aupdated-asc+is%3Aclosed+label%3Atriage%3Aapproved) for the full list.
| No. | PR | PR # | Commit # | Short # |
| -- | -- | -- | -- | -- |
| 1 | Fix unused variable for CUDA EP builds with USE_FLASH_ATTENTION off | 14404 | `85d7e9c596` | `85d7e9c` |
| 2 | UNet fusion and fp16 conversion for stable diffusion | 14248 | `a95fcb4345` | `a95fcb4` |
| 3 | upgrade protobuf to 3.20.2 and onnx to 1.13 | 14279 | `80f807c03d` | `80f807c` |
| 4 | Include python training apis when enable_training is enabled | 14485 | `d06ad9462b` | `d06ad94` |
| 5 | Including support for Deepspeed 0.8.0 | 14506 | `6fa4555a06` | `6fa4555` |
| 6 | change deepspeed version in warning from 0.7.3 to 0.8.0 | 14527 | `3d388a1aea` | `3d388a1` |
| 7 | Do not fuse DQ+Node+Q if DQ produces graph output | 14509 | `d9e675a2af` | `d9e675a` |
| 8 | upgrade EsrpCodeSigning from v1 to v2 | 14531 | `0578eeff91` | `0578eef` |
| 9 | Fix python packaging pipeline | 14533 | `7954976e0a` | `7954976` |
| 10 | Specify deps in deps.txt and manifest | 14530 | `01cafe89f0` | `01cafe8` |
| 11 | Fix Gather to Split optimizer | 14478 | `0bcca7ad45` | `0bcca7a` |
| 12 | Stable Diffusion CUDA Optimizations | 14428 | `a6c5ba0185` | `a6c5ba0` |
| 13 | Fix sharing scalar bug | 14544 | `c6c11039d7` | `c6c1103` |
### Motivation and Context
Second round cherry-pick for ORT 1.14.0 release.

---------
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Yi Zhang <zhanyi@microsoft.com> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: Abhishek Jindal <abjindal@microsoft.com> Co-authored-by: Yufeng Li <liyufeng1987@gmail.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com> Co-authored-by: pengwa <pengwa@microsoft.com>	2023-02-06 17:00:22 -08:00
Rui Ren	1a48099eea	ORT 1.14.0 release -- cherry pick round1 (#14456)
### Description
First round of cherry-picks, `19` PRs in total, listed below. See [here](https://github.com/microsoft/onnxruntime/issues?q=label%3Arelease%3A1.14+sort%3Aupdated-asc) for the full list.
| No. | PR | PR # | Commit # | Short # |
| -- | -- | -- | -- | -- |
| 0 | fix headers for training apis | 14350 | `ea7bbd667d` | `ea7bbd6` |
| 1 | Fix post merge jobs pipeline build issues | 14346 | `ae0e090c7b` | `ae0e090` |
| 2 | support ScatterND(18) and ScatterElement(18) | 14224 | `5d6a049141` | `5d6a049` |
| 3 | Exclude a multi-stream case from reduced ops build | 14351 | `36ba3d8d21` | `36ba3d8` |
| 4 | Support muP in Attention | 14348 | `668586e8f8` | `668586e` |
| 5 | Add memory efficient attention from CUTLASS | 14343 | `414b012f42` | `414b012` |
| 6 | Add PyTorch 2.0 to ORT transformer benchmarking | 14300 | `72821a6113` | `72821a6` |
| 7 | Misc transformer fixes - 3 | 14320 | `2d8ee5251c` | `2d8ee52` |
| 8 | Update quantization_defs.cc | 14380 | `de7a868d5f` | `de7a868` |
| 9 | Revert "Allow PostAnalysis@2 task to continue on error for Windows_Pa… | 14375 | `cf3661ff6d` | `cf3661f` |
| 10 | Fix fuzz test | 14385 | `f03c507cf0` | `f03c507` |
| 11 | support Pad(18) | 14219 | `05915d8393` | `05915d8` |
| 12 | Ort openvino 4.3 cli | 14341 | `77b455b969` | `77b455b` |
| 13 | cpu to support bitwise ops | 14197 | `7b6d880b28` | `7b6d880` |
| 14 | Update ORT format v5 change docs to cover limited backwards compatibility in 1.14. | 14413 | `3bc092b1ea` | `3bc092b` |
| 15 | Upgrade CUTLASS to v2.11 and add sequence length threshold for cutlass FMHA | 14401 | `94b1791974` | `94b1791` |
| 16 | Add Col2Im CPU op | 12311 | `32c05fcdd1` | `32c05fc` |
| 17 | [DML EP] Upgrade DML to 1.10.1 | 14433 | `edb377f2cb` | `edb377f` |
| 18 | cpu support of LpPool(18) | 14205 | `2b1a59f01a` | `2b1a59f` |
### Motivation and Context
First round cherry-pick for ORT 1.14.0 release.

---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: liqun Fu <liqfu@microsoft.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com> Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com> Co-authored-by: Yi Zhang <zhanyi@microsoft.com> Co-authored-by: Hector Li <hecli@microsoft.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: Preetha <preetha.veeramalai@intel.com> Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com> Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>	2023-01-31 14:35:34 -08:00
Adrian Lizarraga	de17d53c50	Custom Op runtime wrapper (#13427)
### Description
Adds the below C APIs to support custom ops that wrap an entire model to be inferenced with an external runtime. The current SNPE EP is an example of an EP that could be ported to use a custom op wrapper. Ex: The custom op stores the serialized SNPE DLC binary as a string attribute. The SNPE model is built when the kernel is created. The model is inferenced with SNPE APIs on call to the kernel's compute method.
#### C APIs
| API | Description | Why |
| --- | --- | --- |
| `KernelInfo_GetInputCount` | Gets number of inputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputCount` | Gets number of outputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputName` | Gets an input's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputName` | Gets an output's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputTypeInfo` | Gets the type/shape information for an input. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputTypeInfo` | Gets the type/shape information for an output. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfoGetAttribute_tensor` | Get an OrtValue tensor stored as an attribute in the graph node | Extract serialized models, weights, etc. |
| `GetSessionConfigEntry` | Get a session configuration value | Need to be able to get session-time configurations from within custom op |
| `HasSessionConfigEntry` | Check if session configuration entry exists. | Need to be able to get session-time configurations from within custom op |
#### Why so many KernelInfo APIs?<sup>1</sup>
Similar APIs currently exist for `OrtKernelContext`, but not `OrtKernelInfo`. Note that `OrtKernelContext` is passed to the custom op on call to its kernel's compute() function. However, `OrtKernelInfo` is available on kernel creation, which occurs when the session is created. Having these APIs available from `OrtKernelInfo` allows an operator to trade off computation time for session-creation time, and vice versa. Operators that must build expensive state may prefer to do it during session creation time instead of compute-time.

SNPE is an example of an EP that needs to be able to query `KernelInfo` for the name, type, and shape of inputs and outputs in order to build the model from the serialized DLC data. This is an expensive operation. Other providers (e.g., OpenVINO) are able to query I/O info from the serialized model, so they do not strictly need these APIs. However, the APIs can still be used to validate the expected I/O characteristics.

Additionally, several of our CPU contrib ops currently use the same internal version of these KernelInfo APIs (Ex: [qlinear_softmax](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc#L71)). If custom ops are also meant to be a test bed for future ops, then all custom ops (not just runtime wrappers) would benefit from the addition of these public KernelInfo APIs (IMO).
#### Example of usage in a custom OP
From `onnxruntime/test/testdata/custom_op_openvino_wrapper_library/openvino_wrapper.h`
```c++
struct CustomOpOpenVINO : Ort::CustomOpBase<CustomOpOpenVINO, KernelOpenVINO> {
  explicit CustomOpOpenVINO(Ort::ConstSessionOptions session_options);

  CustomOpOpenVINO(const CustomOpOpenVINO&) = delete;
  CustomOpOpenVINO& operator=(const CustomOpOpenVINO&) = delete;

  void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const;

  constexpr const char* GetName() const noexcept {
    return "OpenVINO_Wrapper";
  }

  constexpr const char* GetExecutionProviderType() const noexcept {
    return "CPUExecutionProvider";
  }

  // IMPORTANT: In order to wrap a generic runtime-specific model, the custom operator
  // must have a non-homogeneous variadic input and output.

  constexpr size_t GetInputTypeCount() const noexcept {
    return 1;
  }

  constexpr size_t GetOutputTypeCount() const noexcept {
    return 1;
  }

  constexpr ONNXTensorElementDataType GetInputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr ONNXTensorElementDataType GetOutputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr bool GetVariadicInputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  constexpr bool GetVariadicOutputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  std::vector<std::string> GetSessionConfigKeys() const {
    return {"device_type"};
  }

 private:
  std::unordered_map<std::string, std::string> session_configs_;
};
```
#### How to create a session:
```c++
Ort::Env env;
Ort::SessionOptions session_opts;
Ort::CustomOpConfigs custom_op_configs;

// Create local session config entries for the custom op.
custom_op_configs.AddConfig("OpenVINO_Wrapper", "device_type", "CPU");

// Register custom op library and pass in the custom op configs (optional).
session_opts.RegisterCustomOpsLibrary(lib_name, custom_op_configs);

Ort::Session session(env, model_path.data(), session_opts);
```
### Motivation and Context
Allows creation of simple "wrapper" EPs outside of the main ORT code base.	2023-01-18 09:09:32 -08:00
Guenther Schmuelling	60290393f3	enable ort-extensions in wasm release builds (#14239 ) enable ort-extensions in wasm release builds. sentence piece, gpt2, bert and word piece tokenizers for now. wasm size will grow from 8.4MB to 8.9MB.	2023-01-17 12:39:13 -08:00
Scott McKay	7f374f4012	Fix build error on Windows if Python debug libraries are installed (#14308 ) ### Description If a user installs the debug libraries from Python on Windows the ORT python project file attempts to use the debug python lib, which conflicts with a pragma in pyconfig.h that wants the release lib (due to pybind11 undefining _DEBUG). Explicitly use the release lib instead of Python::Module so the build doesn't break. ### Motivation and Context Fix obtuse build break.	2023-01-17 09:48:26 +10:00
Jeff Daily	fe052e603b	ROCm header path updates (#14170 ) ROCm reorganized header file locations. Use the new locations to avoid warnings.	2023-01-16 10:28:13 +08:00
Patrice Vignola	99a4036c80	[DML EP] Add FusedMatMul (#14196 ) ### Description Add FusedMatMul ### Motivation and Context - Add the FusedMatMul fusion for DML - Fix the FusedMatMul logic and tests when transposed batches are involved	2023-01-12 02:17:04 -08:00
cloudhan	712f781702	Make CK an optional dependency and only build with CK if ROCm >= 5.3 (#14232 ) Recently, CK dropped ROCm 5.2 support, which is causing packaging pipeline failures. This PR works around it.	2023-01-12 17:09:40 +08:00
Scott McKay	b9ecd428c1	Add ability to register custom ops by specifying a function name (#14177 ) ### Description Use dlsym/GetProcAddress to lookup a custom ops registration function by name and call it. This will be better on mobile platforms where the custom ops library is linked against, and there isn't necessarily a filesystem that a library path can be loaded from. Alternative is to wire up passing in the address of the function, but that has multiple complications which differ by platform. ### Motivation and Context Enable using ort and ort-ext packages on mobile platforms. Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-01-12 15:11:34 +10:00
sfatimar	7654cd50e8	Openvino ep 2022.3 v4.3 (#14210 ) ### Description Changes to incorporate OpenVINO EP 2022.3 ### Motivation and Context This change is required to incorporate OpenVINO EP 2022.3 Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Aravind <aravindx.gunda@intel.com> Co-authored-by: mayavijx <mayax.vijayan@intel.com> Co-authored-by: flexci <mohsinmx>	2023-01-11 16:31:26 -08:00
RandySheriffH	83ad562826	Rename CloudEP to AzureEP (#14175 ) Rename CloudEP to AzureEP. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-01-11 12:25:04 -08:00
RandySheriffH	ecd5ce0b33	Use json format to save and load partition config (#14169 ) Use JSON format to save and load the partition config. Previously it was CSV, which caused issues between Windows and POSIX because of their different line breaks. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-01-11 10:03:14 -08:00
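The line-break problem mentioned in the commit above can be sketched in a few lines of Python. This is only an illustration: the config layout (`streams` holding lists of node names) and the node names are hypothetical and do not reflect ONNX Runtime's actual partition-config schema or its reader.

```python
import json

# Hypothetical partition config; field and node names are illustrative only.
config = {"streams": [["Conv_0", "Relu_1"], ["MatMul_2"]]}

# A CSV-like file written on Windows uses "\r\n" line breaks.
csv_text = "Conv_0,Relu_1\r\nMatMul_2\r\n"

# A naive reader that splits on "\n" keeps a stray "\r" on the last
# field of every row.
naive_rows = [line.split(",") for line in csv_text.split("\n") if line]

# JSON treats "\r" and "\n" between tokens as insignificant whitespace,
# so the same document parses identically with either line-break style.
json_posix = json.dumps(config, indent=2)
json_windows = json_posix.replace("\n", "\r\n")
loaded_posix = json.loads(json_posix)
loaded_windows = json.loads(json_windows)

print(naive_rows)
print(loaded_posix == loaded_windows == config)
```

A CSV reader can of course be written to tolerate both line-break styles; the point of the sketch is only that JSON parsing does not need that care.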
Ashwini Khade	d92c663f28	Create dedicated build for training api (#14136 ) ### Description Enable creating a dedicated build for on-device training. With this PR we can build a lean binary for on-device training using the flag --enable_training_apis. This binary includes only the essentials, such as training ops and optimizers, and NOT features like ATen fallback, strided tensors, gradient builders, etc. This binary also removes all the deprecated components, such as training::TrainingSession and OrtTrainer. ### Motivation and Context This enables our partners to create a lean binary for on-device training.	2023-01-10 20:58:04 -08:00
Chen Fu	90142899bd	Supporting Intel AMX instructions in quantized GEMM (#14042)
### Description
Using Intel AMX int8 instructions to accelerate quantized GEMM
### Motivation and Context
AMX instructions accelerate quantized GEMM significantly:
Prepacked B perf numbers (latency in ns)
| GEMM Config | AVX512Vnni | AMX |
| -- | --: | --: |
| M:384/N:1024/K:1024/Batch:1/Threads:4 | 1057511 | 285393 |
| M:384/N:1024/K:3072/Batch:1/Threads:4 | 2643929 | 700397 |
| M:384/N:1024/K:4096/Batch:1/Threads:4 | 3784750 | 890701 |
| M:384/N:4096/K:1024/Batch:1/Threads:4 | 2378139 | 887251 |
| M:384/N:1024/K:1024/Batch:1/Threads:16 | 307137 | 138481 |
| M:384/N:1024/K:3072/Batch:1/Threads:16 | 855730 | 295027 |
| M:384/N:1024/K:4096/Batch:1/Threads:16 | 1126878 | 317395 |
| M:384/N:4096/K:1024/Batch:1/Threads:16 | 781963 | 237014 |
| M:1536/N:1024/K:1024/Batch:1/Threads:16 | 538864 | 181459 |
| M:1536/N:1024/K:3072/Batch:1/Threads:16 | 1681002 | 561600 |
| M:1536/N:1024/K:4096/Batch:1/Threads:16 | 2158127 | 717470 |
| M:1536/N:4096/K:1024/Batch:1/Threads:16 | 2428622 | 896140 |
| M:3072/N:1024/K:1024/Batch:1/Threads:16 | 1058029 | 357031 |
| M:3072/N:1024/K:3072/Batch:1/Threads:16 | 3138504 | 1095857 |
| M:3072/N:1024/K:4096/Batch:1/Threads:16 | 4155640 | 1386183 |
| M:3072/N:4096/K:1024/Batch:1/Threads:16 | 4679030 | 1778624 |

Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-01-10 12:16:27 -08:00
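To read the latency table in the AMX commit above as speedups, one can divide the AVX512Vnni latency by the AMX latency for each configuration. The sketch below does that for a subset of rows copied from the table; the helper name `speedup` and the tuple layout of the keys are illustrative choices, not part of the PR.

```python
# Latencies in ns, copied from the "Prepacked B perf numbers" table:
# (M, N, K, threads) -> (AVX512Vnni, AMX). Only a subset of rows is used.
latency_ns = {
    (384, 1024, 1024, 4): (1057511, 285393),
    (384, 1024, 4096, 4): (3784750, 890701),
    (384, 1024, 1024, 16): (307137, 138481),
    (3072, 4096, 1024, 16): (4679030, 1778624),
}


def speedup(vnni_ns: int, amx_ns: int) -> float:
    """Return how many times faster the AMX kernel is than AVX512Vnni."""
    return vnni_ns / amx_ns


speedups = {cfg: speedup(*pair) for cfg, pair in latency_ns.items()}
for (m, n, k, threads), s in speedups.items():
    print(f"M={m} N={n} K={k} threads={threads}: {s:.2f}x")
```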
Ye Wang	a01bf8dbb1	rename CrossAttention to MultiHeadAttention (#14201 ) ### Description Rename CrossAttention to MultiHeadAttention, since this op can also be used for self-attention. ### Motivation and Context Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-10 10:18:39 -08:00
Guenther Schmuelling	6b8c72cfa6	pin ort-ext to 81e7799c69044c745239202085eb0a98f102937b (#14044 ) Pin onnxruntime-extensions to 81e7799c69044c745239202085eb0a98f102937b in preparation for enabling extensions in the wasm build.	2023-01-10 10:10:17 -08:00
liqun Fu	1be36913cc	to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765 )	2023-01-09 10:26:16 -08:00
cloudhan	be879c11ee	Add batched and strided batched gemm as TunableOp (#13841 )	2023-01-07 19:11:40 +08:00
Tianlei Wu	2cacb24cb0	Add CrossAttention operator (#14146) Move separated Q, K and V (without input projection) from Attention to a new operator, CrossAttention. The Attention operator is hard to maintain when one class needs to support both with and without input projection, so a new operator is added based on feedback. Some changes might be needed in the future, but are not in this PR:
(1) bias could be optional (we will not take that route unless experiments show that fusing Add bias with MatMul, instead of with this op, improves performance).
(2) support packed KV. There are two ways to support it: treat key and value as packed when they are the same tensor; or make value optional, and use packed mode when value is empty and the key has packed K/V.
(3) support cached key and value, and others (like relative position bias), or more attention mask formats. They can be added easily without breaking backward compatibility.
(4) ROCm/CPU implementation of this op.	2023-01-06 14:27:40 -08:00
Yi Zhang	2ce7b1c1dc	Enable cache for msbuild (#14085 ) ### Description Enable ccache in windows CPU compilation. The windows compilation in CI could be reduced to 1 more minute at most. ![image](https://user-images.githubusercontent.com/16190118/210294061-86742cf4-65c7-4cc2-9725-e102c3c64abd.png)	2023-01-06 11:19:57 +08:00
PeixuanZuo	4eac0db3af	[ROCm] Add GemmFastGelu CK implementation (#13759 ) ### Description Add GemmFastGelu CK implementation. TODO 1. The performance of CK GemmFastGelu in ORT is not as good as using CK directly; the reason still needs to be investigated and the CK integration in ORT improved. `GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89 tflops` `withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8, Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152 n=3072 k=768 2401.9799 us 96.56 tflops` ### Motivation and Context Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-01-05 17:53:30 +08:00
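The two benchmark lines in the GemmFastGelu commit above report both a latency and a throughput. A minimal sketch of how the two relate, assuming the benchmark counts 2·m·n·k floating-point operations for the GEMM and ignores the bias and GELU work (an assumption, not something the commit states); the function name `gemm_tflops` is illustrative:

```python
def gemm_tflops(m: int, n: int, k: int, latency_us: float) -> float:
    """Throughput of an m x n x k GEMM, counting 2*m*n*k floating-point ops."""
    flops = 2 * m * n * k
    seconds = latency_us * 1e-6
    return flops / seconds / 1e12


# Shapes and latencies copied from the two benchmark lines in the commit.
unfused = gemm_tflops(49152, 3072, 768, 2298.8064)
ck_with_bias = gemm_tflops(49152, 3072, 768, 2401.9799)
print(f"unfused: {unfused:.2f} TFLOPS, CK with bias: {ck_with_bias:.2f} TFLOPS")
```

Under that assumption the computed values agree with the reported 100.89 and 96.56 tflops to two decimal places, which suggests the figures are derived from latency in this way.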
Adrian Lizarraga	68794d0ac1	Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages the lifetime of dynamic library handles (i.e., calls `dlclose` or `FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`
### Motivation and Context
The current API for registering custom op libraries loads dynamic libraries but requires users to handle the release of the corresponding library handles. Additionally, the user has to make sure to release the library handle _after_ the session has been destroyed (or the program segfaults). The new API automatically cleans up the library and allows the user to write more straightforward code.	2023-01-04 17:56:29 -08:00
cao lei	b29a1c7348	Address follow-up comments on multistream pr #13495 (#13992)
### Description
This PR addresses follow-up comments for the multi-stream PR https://github.com/microsoft/onnxruntime/pull/13495
Changes include:
- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr
### Motivation and Context
This PR addresses follow-up comments for the multi-stream PR https://github.com/microsoft/onnxruntime/pull/13495

Co-authored-by: Lei Cao <leca@microsoft.com>	2023-01-03 16:33:36 -08:00
Ashwini Khade	68b5b2d7d3	Refactor training build options (#13964)
### Description
1. Renames all references of on device training to training apis. This is to keep the naming general. Nothing really prevents us from using the same apis on servers/non-edge devices.
2. Update ENABLE_TRAINING option: with this PR, when this option is enabled, training apis and torch interop are also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
   - Removed user facing option
   - Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.

Once this PR is merged, when --enable_training is selected we will do a "FULL Build" for training (with all the training entry points and features).

Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All Training OPs, including communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (Front end tools for training artifacts prep when using training apis)
### Motivation and Context
The intention is to simplify the options for building training-enabled builds. This is part of the larger work item to create a dedicated build for learning-on-the-edge scenarios with just training apis enabled.	2023-01-03 13:28:16 -08:00
RandySheriffH	587e891cae	CloudEP (#13855) Implement CloudEP for hybrid inferencing. The PR introduces no new APIs; customers can configure session and run options to do inferencing with an Azure [triton endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint) A sample configuration in Python looks like:
```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton')
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com')
sess_opt.add_session_config_entry('cloud.model_name', 'detection2')
sess_opt.add_session_config_entry('cloud.model_version', '7')  # optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1')  # optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1')  # 0 for local inferencing, 1 for cloud endpoint.
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input':input_}, run_opt)
```
Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-01-03 10:03:15 -08:00
Yi Zhang	52e3fe961d	add dnnl dependency in unittest.cmake (#14104)
### Description
This is a follow-up to PR #14085. When multiple msbuild instances run at the same time, the build throws the following error:
```
22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal "C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo if %errorlevel% neq 0 goto :cmEnd :cmEnd endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone :cmErrorLevel exit /b %1 :cmDone if %errorlevel% neq 0 goto :VCEnd :VCEnd" exited with code 1.
```
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f
The build should make sure that the onnxruntime_test_all project depends on the dnnl project.	2023-01-03 11:24:06 +08:00
Tianlei Wu	6a9dc6c993	[CUDA] Update fused MHA to support flash attention and causal mask (#13953 ) ### Description Update fused attention kernels to support flash attention and causal mask (GPT-2 initial decoder run). Note: Causal kernels are from FasterTransformer 5.2. Flash attention kernels that are not causal are from TensorRT 8.5.1. #### Performance Test of bert-base model Test like the following: ``` python -m onnxruntime.transformers.benchmark -m bert-base-cased -b 1 4 8 16 32 64 -s 512 -t 1000 -o by_script -g -p fp16 -i 3 --use_mask_index ``` Original Flash Attention is from https://github.com/HazyResearch/flash-attention. RemovePadding and RestorePadding are added before/after the original flash attention but not for this PR, so the result is not an apples-to-apples comparison. It is added for reference only. Average latency (ms) of float16 bert-base-cased model: * A100 Kernel \| b1_s512 \| b4_s512 \| b8_s512 \| b16_s512 \| b32_s512 \| b64_s512 \| b128_s512 -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Unfused \| 1.83 \| 5.00 \| 9.31 \| 17.76 \| 34.47 \| 67.43 \| 133.38 TRT Fused \| 2.05 \| 3.58 \| 5.70 \| 10.96 \| 21.22 \| 41.23 \| 80.56 Flash Attention (from FT) \| 1.43 \| 3.20 \| 5.71 \| 10.95 \| 22.19 \| 42.96 \| 84.54 Flash Attention (from TRT) \| 1.44 \| 3.28 \| 5.70 \| 10.86 \| 21.00 \| 40.56 \| 79.53 Original Flash Attention \| 1.81 \| 4.04 \| 6.82 \| 13.06 \| 24.62 \| 46.58 \| 91.10 * T4 \| b1_s512 \| b4_s512 \| b8_s512 \| b16_s512 \| b32_s512 \| b64_s512 -- \| -- \| -- \| -- \| -- \| -- \| -- Unfused \| 8.17 \| 29.86 \| 59.56 \| 115.77 \| 236.66 \| 461.43 Flash Attention (from FT) \| 5.65 \| 21.12 \| 44.94 \| 86.83 \| 174.16 \| 351.38 Flash Attention (from TRT) \| 5.73 \| 21.49 \| 45.49 \| 89.15 \| 174.37 \| 352.08 Original Flash Attention \| 6.22 \| 22.16 \| 43.39 \| 83.8 \| 168.77 \| 337.04 * V100 Kernel \| b1_s512 \| b4_s512 \| b8_s512 \| b16_s512 \| b32_s512 \| b64_s512 -- \| -- \| -- \| -- \| -- \| -- \| -- Unfused \| 3.77 \| 10.48 \| 19.53 \| 37.63 \| 
73.68 \| 145.58 Flash Attention (from FT) \| 3.21 \| 8.25 \| 14.95 \| 28.83 \| 56.28 \| 111.15 #### Performance Test of GPT-2 model Test like the following: ` python benchmark_gpt2.py -m distilgpt2 -o --stage 1 --use_gpu -p fp16 -b 1 4 8 16 32 64 128 -s 0 --sequence_lengths 8 16 32 64 128 256 512 ` * A100 Note that flash attention is used as fused attention when sequence_length > 128. batch_size \| sequence_length \| with Fused Attention \| without Fused Attention \| A100 Gain -- \| -- \| -- \| -- \| -- 1 \| 8 \| 0.93 \| 1 \| 7.0% 4 \| 8 \| 0.82 \| 0.88 \| 6.8% 8 \| 8 \| 0.84 \| 0.88 \| 4.5% 16 \| 8 \| 0.92 \| 0.97 \| 5.2% 32 \| 8 \| 1.15 \| 1.17 \| 1.7% 64 \| 8 \| 1.68 \| 1.72 \| 2.3% 128 \| 8 \| 2.76 \| 2.78 \| 0.7% 1 \| 16 \| 0.95 \| 0.95 \| 0.0% 4 \| 16 \| 0.83 \| 0.88 \| 5.7% 8 \| 16 \| 0.91 \| 0.97 \| 6.2% 16 \| 16 \| 1.12 \| 1.17 \| 4.3% 32 \| 16 \| 1.67 \| 1.72 \| 2.9% 64 \| 16 \| 2.73 \| 2.76 \| 1.1% 128 \| 16 \| 4.96 \| 4.95 \| -0.2% 1 \| 32 \| 0.94 \| 0.88 \| -6.8% 4 \| 32 \| 0.91 \| 0.97 \| 6.2% 8 \| 32 \| 1.12 \| 1.17 \| 4.3% 16 \| 32 \| 1.65 \| 1.71 \| 3.5% 32 \| 32 \| 2.69 \| 2.76 \| 2.5% 64 \| 32 \| 4.86 \| 4.94 \| 1.6% 128 \| 32 \| 9.35 \| 9.38 \| 0.3% 1 \| 64 \| 0.84 \| 0.88 \| 4.5% 4 \| 64 \| 1.1 \| 1.17 \| 6.0% 8 \| 64 \| 1.64 \| 1.73 \| 5.2% 16 \| 64 \| 2.66 \| 2.77 \| 4.0% 32 \| 64 \| 4.82 \| 4.97 \| 3.0% 64 \| 64 \| 9.23 \| 9.4 \| 1.8% 128 \| 64 \| 18.54 \| 19.12 \| 3.0% 1 \| 128 \| 0.91 \| 0.98 \| 7.1% 4 \| 128 \| 1.68 \| 1.74 \| 3.4% 8 \| 128 \| 2.71 \| 2.83 \| 4.2% 16 \| 128 \| 4.85 \| 5.09 \| 4.7% 32 \| 128 \| 9.32 \| 9.69 \| 3.8% 64 \| 128 \| 18.54 \| 19.44 \| 4.6% 128 \| 128 \| 36.86 \| 38.47 \| 4.2% 1 \| 256 \| 1.15 \| 1.23 \| 6.5% 4 \| 256 \| 2.71 \| 2.95 \| 8.1% 8 \| 256 \| 4.87 \| 5.3 \| 8.1% 16 \| 256 \| 9.32 \| 10.23 \| 8.9% 32 \| 256 \| 18.6 \| 20.53 \| 9.4% 64 \| 256 \| 36.93 \| 40.41 \| 8.6% 128 \| 256 \| 72.84 \| 80.14 \| 9.1% 1 \| 512 \| 1.68 \| 1.96 \| 14.3% 4 \| 512 \| 4.9 \| 6.02 \| 18.6% 8 \| 512 \| 9.4 \| 11.59 \| 18.9% 
16 \| 512 \| 18.71 \| 23.05 \| 18.8% 32 \| 512 \| 37.13 \| 45.46 \| 18.3% 64 \| 512 \| 74.04 \| 89.88 \| 17.6% 128 \| 512 \| NA \| NA \| NA * T4: batch_size \| sequence_length \| with Fused Attention \| with Unfused Attention \| T4 Gain -- \| -- \| -- \| -- \| -- 1 \| 8 \| 1.97 \| 2.11 \| 6.6% 4 \| 8 \| 2.2 \| 2.25 \| 2.2% 8 \| 8 \| 2.77 \| 3.1 \| 10.6% 16 \| 8 \| 4.17 \| 4.2 \| 0.7% 32 \| 8 \| 6.86 \| 6.82 \| -0.6% 64 \| 8 \| 14.88 \| 14.92 \| 0.3% 128 \| 8 \| 31.4 \| 31.29 \| -0.4% 1 \| 16 \| 1.61 \| 1.71 \| 5.8% 4 \| 16 \| 2.13 \| 2.31 \| 7.8% 8 \| 16 \| 3.38 \| 3.67 \| 7.9% 16 \| 16 \| 6.16 \| 6.54 \| 5.8% 32 \| 16 \| 14.16 \| 14.76 \| 4.1% 64 \| 16 \| 30.36 \| 30.57 \| 0.7% 128 \| 16 \| 63.14 \| 63.57 \| 0.7% 1 \| 32 \| 1.53 \| 1.69 \| 9.5% 4 \| 32 \| 3.34 \| 3.66 \| 8.7% 8 \| 32 \| 6.25 \| 6.64 \| 5.9% 16 \| 32 \| 14.12 \| 14.9 \| 5.2% 32 \| 32 \| 28.96 \| 29.82 \| 2.9% 64 \| 32 \| 61.07 \| 61.77 \| 1.1% 128 \| 32 \| 116.38 \| 117.98 \| 1.4% 1 \| 64 \| 2.01 \| 2.21 \| 9.0% 4 \| 64 \| 6.18 \| 6.67 \| 7.3% 8 \| 64 \| 13.72 \| 14.49 \| 5.3% 16 \| 64 \| 28.71 \| 29.83 \| 3.8% 32 \| 64 \| 58.65 \| 60.68 \| 3.3% 64 \| 64 \| 113.09 \| 113.17 \| 0.1% 128 \| 64 \| 205.21 \| 209.4 \| 2.0% 1 \| 128 \| 3.37 \| 3.76 \| 10.4% 4 \| 128 \| 13.54 \| 14.85 \| 8.8% 8 \| 128 \| 28.32 \| 30.22 \| 6.3% 16 \| 128 \| 58.16 \| 62.09 \| 6.3% 32 \| 128 \| 109.17 \| 113.99 \| 4.2% 64 \| 128 \| 198.9 \| 207.1 \| 4.0% 128 \| 128 \| 413.25 \| 421.82 \| 2.0% 1 \| 256 \| 6.33 \| 7.05 \| 10.2% 4 \| 256 \| 28.09 \| 31.49 \| 10.8% 8 \| 256 \| 57.47 \| 62.76 \| 8.4% 16 \| 256 \| 106.77 \| 117.95 \| 9.5% 32 \| 256 \| 197.02 \| 208.58 \| 5.5% 64 \| 256 \| 406.81 \| 431.36 \| 5.7% 128 \| 256 \| NA \| NA \| NA 1 \| 512 \| 13.84 \| 16.32 \| 15.2% 4 \| 512 \| NA \| NA \| NA 8 \| 512 \| NA \| NA \| NA 16 \| 512 \| NA \| NA \| NA 32 \| 512 \| NA \| NA \| NA 64 \| 512 \| NA \| NA \| NA 128 \| 512 \| NA \| NA \| NA * V100: batch_size \| sequence_length \| with Fused Attention \| with Unfused Attention \| 
V100 Gain -- \| -- \| -- \| -- \| -- 1 \| 8 \| 1.31 \| 1.6 \| 18.1% 4 \| 8 \| 1.17 \| 1.26 \| 7.1% 8 \| 8 \| 1.43 \| 1.79 \| 20.1% 16 \| 8 \| 2.14 \| 1.96 \| -9.2% 32 \| 8 \| 2.91 \| 3.08 \| 5.5% 64 \| 8 \| 5.32 \| 5.27 \| -0.9% 128 \| 8 \| 9.34 \| 8.97 \| -4.1% 1 \| 16 \| 1.41 \| 1.58 \| 10.8% 4 \| 16 \| 1.38 \| 1.49 \| 7.4% 8 \| 16 \| 1.81 \| 2.2 \| 17.7% 16 \| 16 \| 2.8 \| 2.83 \| 1.1% 32 \| 16 \| 4.94 \| 4.99 \| 1.0% 64 \| 16 \| 8.88 \| 8.84 \| -0.5% 128 \| 16 \| 17.35 \| 17.2 \| -0.9% 1 \| 32 \| 1.38 \| 1.77 \| 22.0% 4 \| 32 \| 1.77 \| 1.93 \| 8.3% 8 \| 32 \| 2.71 \| 2.86 \| 5.2% 16 \| 32 \| 5.03 \| 4.92 \| -2.2% 32 \| 32 \| 8.8 \| 8.79 \| -0.1% 64 \| 32 \| 17.29 \| 17.23 \| -0.3% 128 \| 32 \| 33.27 \| 33.1 \| -0.5% 1 \| 64 \| 1.67 \| 1.87 \| 10.7% 4 \| 64 \| 2.69 \| 2.76 \| 2.5% 8 \| 64 \| 4.87 \| 4.94 \| 1.4% 16 \| 64 \| 8.73 \| 8.81 \| 0.9% 32 \| 64 \| 16.92 \| 17.24 \| 1.9% 64 \| 64 \| 33 \| 33.38 \| 1.1% 128 \| 64 \| 65.33 \| 65.86 \| 0.8% 1 \| 128 \| 2.03 \| 2.22 \| 8.6% 4 \| 128 \| 4.9 \| 5.04 \| 2.8% 8 \| 128 \| 8.76 \| 8.81 \| 0.6% 16 \| 128 \| 17.06 \| 17.29 \| 1.3% 32 \| 128 \| 33.25 \| 33.56 \| 0.9% 64 \| 128 \| 65.54 \| 66.5 \| 1.4% 128 \| 128 \| 130.44 \| 131.44 \| 0.8% 1 \| 256 \| 2.78 \| 2.86 \| 2.8% 4 \| 256 \| 8.75 \| 9.04 \| 3.2% 8 \| 256 \| 17 \| 17.68 \| 3.8% 16 \| 256 \| 33.19 \| 34.32 \| 3.3% 32 \| 256 \| 65.43 \| 67.86 \| 3.6% 64 \| 256 \| 129.92 \| 134.68 \| 3.5% 128 \| 256 \| NA \| NA \| NA 1 \| 512 \| 4.95 \| 5.32 \| 7.0% 4 \| 512 \| NA \| NA \| NA 8 \| 512 \| NA \| NA \| NA 16 \| 512 \| NA \| NA \| NA 32 \| 512 \| NA \| NA \| NA 64 \| 512 \| NA \| NA \| NA 128 \| 512 \| NA \| NA \| NA	2022-12-31 10:33:54 -08:00
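The Gain columns in the GPT-2 tables above are not defined in the commit message. The spot-checked rows are consistent with gain being the latency saved relative to the unfused latency; the sketch below assumes that formula, and the function name `gain_percent` is hypothetical.

```python
def gain_percent(fused_ms, unfused_ms):
    """Latency reduction of the fused kernel relative to the unfused one, in percent.

    Assumed formula: (unfused - fused) / unfused * 100. It is inferred from the
    table values and is not stated in the commit message.
    """
    return (unfused_ms - fused_ms) / unfused_ms * 100.0


# (batch_size, sequence_length, with fused, without fused) rows copied from the A100 table.
rows = [
    (1, 8, 0.93, 1.0),
    (1, 512, 1.68, 1.96),
    (128, 16, 4.96, 4.95),
]
for batch, seq, fused, unfused in rows:
    print(f"b{batch}_s{seq}: {gain_percent(fused, unfused):.1f}%")
```

For these three rows the sketch prints 7.0%, 14.3% and -0.2%, matching the A100 table. A negative value means the fused path was slower in that configuration.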
Dmitri Smirnov	d762aa2a4c	Let Cmake decide where to place abseil (#14057 ) ### Description Remove Abseil module placement specifications ### Motivation and Context Allow the CMake defaults to take effect, and allow possible redirection of all submodules so they can be shared between local builds.	2022-12-23 12:08:13 -08:00
Ye Wang	68518a1b72	Sampling op (#13426 ) ### Description Sampling op for CPU and CUDA, supporting the Hugging Face case and the custom case. Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2022-12-22 17:34:12 -08:00
pengwa	2f5bf75e51	Optimize computation orders (#13672 ) ### Optimize computation orders In `Roberta/Electra`, when `ClassificationHead` is used, there is a slicing operation on the features along the sequence_length dimension, and the loss calculation depends only on this sliced data. This is a slice at axis 1. Before slicing the shape is [batch, sequence_length, hidden]; after slicing, it becomes [batch , hidden_stage]. We have an opportunity to move this slicing as early as possible, by passing it through simple elementwise ops (like Add/Div), Layernorm/Softmax (if their reduce axis is after the slicing axis), or even MatMul's left operand (as long as the slice does not affect the last dims). Operators like Reshape/Transpose are special, since they either have shape data specified (which needs updating after slicing) or have perm specified, which requires the input rank to remain unchanged. So for those kinds of operators, we can keep the original rank and just leave the sliced dim as 1; after the compute completes, we do a Squeeze. ``` class RobertaClassificationHead(nn.Module): """Head for sentence-level classification tasks.""" def __init__(self, config): super().__init__() self.dense = nn.Linear(config.hidden_size, config.hidden_size) classifier_dropout = ( config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob ) self.dropout = nn.Dropout(classifier_dropout) self.out_proj = nn.Linear(config.hidden_size, config.num_labels) def forward(self, features, **kwargs): x = features[:, 0, :] # take <s> token (equiv. to [CLS]) x = self.dropout(x) x = self.dense(x) x = torch.tanh(x) x = self.dropout(x) x = self.out_proj(x) return x ``` src\transformers\models\roberta\modeling_roberta.py src\transformers\models\electra\modeling_electra.py #### Benchmark A simple benchmark shows Roberta training latency dropped from 208ms to 199ms, a 4.5+% reduction. More comprehensive tests are on the way. 
	2022-12-22 15:12:52 +08:00
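The reason the slice can be moved ahead of elementwise ops is that slicing at axis 1 commutes with them: slicing first and then adding gives the same result as adding first and then slicing, while touching far fewer elements. A minimal pure-Python sketch of that property follows; the helpers `slice_axis1` and `add_bias` are hypothetical stand-ins for Slice/Gather and Add, not ONNX Runtime code.

```python
def slice_axis1(batch, index):
    """Pick one position along axis 1: [batch, seq, hidden] -> [batch, hidden]."""
    return [sequence[index] for sequence in batch]


def add_bias(tensor, bias):
    """Elementwise Add of a [hidden] bias, broadcast over all leading dimensions."""
    if isinstance(tensor[0], list):
        return [add_bias(item, bias) for item in tensor]
    return [value + b for value, b in zip(tensor, bias)]


# features has shape [batch=2, sequence_length=3, hidden=2].
features = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
bias = [0.5, -0.5]

late = slice_axis1(add_bias(features, bias), 0)   # Add on 12 values, then slice
early = add_bias(slice_axis1(features, 0), bias)  # slice, then Add on 4 values
assert late == early
print(early)
```

In this toy case the early slice performs the Add on a third of the data, which is the kind of saving the reordering targets.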
Changming Sun	05137e6ec4	Use target name for flatbuffers (#13991 ) ### Description Use target name for flatbuffers. Add version range for flatbuffers. It is similar to #13870 ### Motivation and Context To fix a build error: ``` CMake Error at onnxruntime_graph.cmake:88 (add_dependencies): The dependency target "flatbuffers" of target "onnxruntime_graph" does not exist. Call Stack (most recent call first): CMakeLists.txt:1490 (include) ``` It happens when flatbuffers library is already installed. For example, on Ubuntu people may get it from apt-get. But, the one provided by Ubuntu 20.04 is not compatible with our code. The one in Ubuntu 22.04 works fine.	2022-12-20 11:44:02 -08:00
Changming Sun	fc2a6db573	Update absl to the latest release (#13990 ) ### Description Update absl to a new version ### Motivation and Context The new version contains fixes that are needed for Nvidia GPU build. Once we update it to that version, we don't need to maintain our private patches for Nvidia GPU build.	2022-12-19 14:25:13 -08:00
cloudhan	2df046fc67	Fix deprecated-builtins (#14001 ) Fix error: builtin __has_trivial_destructor is deprecated; use __is_trivially_destructible instead [-Werror,-Wdeprecated-builtins] This is not a clean fix as in 13783, users will need to manually set `CMAKE_HIP_FLAGS="-Wno-deprecated-builtins"` if they want to use self-built hipclang combining with ROCm 5.3.* or older.	2022-12-17 18:17:05 +08:00
FFFrog	6705915af8	[CANN] Add the ability to run graph (#13728 ) ### Description Add the ability to run graph ### Motivation and Context A brief description is as follows: 1) If the whole graph is supported, it is processed by the graph engine directly. 2) If the whole graph is not supported, it is divided into subgraphs and single operators; the subgraphs run on the graph engine, and the single operators fall back to the traditional mode.	2022-12-16 06:57:40 -08:00
Tang, Cheng	a81faee41e	Multi-stream execution support (#13495 ) Description: This PR includes the following work: 1. Provide stream and related synchronization abstractions in onnxruntime. 2. Enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel. 3. Deprecate the parallel executor for CPU. 4. Deprecate the Fence mechanism. 5. Update the CUDA / TensorRT EP to support the stream mechanism, and support running different requests in different CUDA streams. Motivation and Context - Why is this change required? Currently, the execution plan is just a linear list of those primitives, and ORT executes them step by step. For any given graph, ORT serializes it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations: 1. It is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPU. 2. The fence mechanism can work for the single GPU stream + CPU thread case, but when extended to multiple streams, it is difficult to manage the cross-GPU-stream synchronization. 3. Our CUDA EP relies on the BFCArena to make memory management work with the GPU async kernels, but the current BFCArena is not aware of streams, so it doesn't behave correctly when run with multiple streams. This PR enhances our existing execution plan and executor to support multiple-stream execution. We use a unified algorithm to manage both single-stream and multiple-stream scenarios. This PR mainly focuses on the infrastructure support for multiple-stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will be addressed in a future PR. 
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: cao lei <jslhcl@gmail.com> Co-authored-by: Lei Cao <leca@microsoft.com>	2022-12-15 07:39:29 -08:00
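To make the phrase "given a valid stream assignment" concrete, the toy sketch below finds the graph edges that cross stream boundaries, since those are the places where one stream would have to wait for another. This is an illustration only: the function `cross_stream_waits`, the node names and the data layout are all hypothetical and do not reflect ONNX Runtime's actual planner or executor.

```python
def cross_stream_waits(edges, stream_of):
    """Return the (producer, consumer) edges whose endpoints run on different streams.

    Edges inside a single stream are ordered by the stream itself; only
    cross-stream edges need an explicit synchronization point.
    """
    return [(src, dst) for src, dst in edges if stream_of[src] != stream_of[dst]]


# A diamond-shaped graph: A feeds B and C, which both feed D.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
# Hypothetical assignment: C runs on stream 1, everything else on stream 0.
stream_of = {"A": 0, "B": 0, "C": 1, "D": 0}

waits = cross_stream_waits(edges, stream_of)
print(waits)
```

Here C must wait for A, and D must wait for C, while the A to B to D chain is ordered by stream 0 alone.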
Chi Lo	5b492cbae3	[TensorRT EP] support TensorRT 8.5 (#13867 ) Integrate TensorRT 8.5 - Update TensorRT EP to support TensorRT 8.5 - Update relevant CI pipelines - Disable known non-supported ops for TensorRT - Make timeout configurable. We observed more than [20 hours](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=256729&view=logs&j=71ce39d8-054f-502a-dcd0-e89fa9931f40) of running unit tests with TensorRT 8.5 in package pipelines. Because we can't use placeholder to significantly reduce testing time (c-api application test will deadlock) in package pipelines, we only run subsets of model tests and unit tests that are related to TRT (add new build flag --test_all_timeout and set it to 72000 seconds by package pipelines). Note that we still run all the tests in TensorRT CI pipelines to have full test coverage. - Include https://github.com/microsoft/onnxruntime/pull/13918 to fix onnx-tensorrt compile error. Co-authored-by: George Wu <jywu@microsoft.com>	2022-12-14 13:06:03 -08:00
Ashwini Khade	6090d8cd6e	Fix usage of enable_training_ops and reduce ifdef complexity for training builds (#13888 ) ### Description Fix usage of enable_training_ops and reduce ifdef complexity for training builds. ### Motivation and Context This is the second refactoring PR towards creating a dedicated build for on device training. This PR aims to reduce some complexity. We can set ENABLE_TRAINING_OPS in cmake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected; this way we don't have to use if defined(ENABLE_TRAINING) \|\| defined(ENABLE_TRAINING_ON_DEVICE ) everywhere in the code.	2022-12-14 08:32:46 -08:00
Changming Sun	070769d61d	Use onnxruntime_fetchcontent_makeavailable cmake function for TRT (#13918 ) ### Description Use onnxruntime_fetchcontent_makeavailable cmake function for TRT. See the comment for the reason. ### Motivation and Context To support a newer TRT version. Previously they had a "BUILD_EXE" build option that allowed us to exclude such things from the build. But in https://github.com/onnx/onnx-tensorrt/pull/879 they deleted the build option. It wouldn't be a problem if we continued to use git submodules as before, because cmake's add_subdirectory function has an "EXCLUDE_FROM_ALL" keyword. However, cmake's FetchContent module doesn't. That's why I needed to create our own version of the macro.	2022-12-12 11:27:46 -08:00
RandySheriffH	75584c5fa8	Enabling thread pool to be numa-aware (#13778 ) The PR enables ort thread pool to be numa-aware, so that threads could be evenly created and distributed among numa nodes. In addition, to facilitate performance tuning, the PR opens a new API allowing customers to attach threads to certain logical processors. Please check the API [definition](https://github.com/microsoft/onnxruntime/pull/13778/files#diff-5845a5c76fb64abdc8f0cffe21b37f8da1712674eb3abc4cd87190891be1bd48) for details. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2022-12-12 10:33:55 -08:00
Abhishek Udupa	83c59d2594	Session-aware and thread-safe CUDA profiler (#13706 ) ### Description The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both. ### Motivation and Context [PR 13549](https://github.com/microsoft/onnxruntime/pull/13549) brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well. Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed. Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing. [sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz) [sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz) [check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip) Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>	2022-12-09 13:22:12 -08:00
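The check script attached to the profiler PR is not reproduced in the message. As a rough idea of what such a well-formedness check could look like, the sketch below assumes the trace is a JSON array of event objects carrying a numeric `ts` field; that layout, the function names, and the choice to accept equal timestamps are assumptions, not details taken from the attachment.

```python
import json


def timestamps_monotonic(events):
    """True if 'ts' values never decrease in file order (events lacking 'ts' are skipped)."""
    stamps = [event["ts"] for event in events if "ts" in event]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


def check_trace_text(text):
    """Parse a trace and report whether its timestamps are in order.

    json.loads raises ValueError on malformed JSON, which doubles as the
    basic well-formedness check.
    """
    events = json.loads(text)
    if not isinstance(events, list):
        raise ValueError("expected a JSON array of events")
    return timestamps_monotonic(events)


# A made-up three-event trace whose last timestamp goes backwards.
sample = '[{"name": "a", "ts": 10}, {"name": "b", "ts": 12}, {"name": "c", "ts": 11}]'
print(check_trace_text(sample))
```

The sample prints `False` because the third event's timestamp is earlier than the second's, the kind of out-of-range timestamp the 11/29 update describes.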
Changming Sun	d5b45226be	Improve the handling of /external:I (#13904 ) ### Description Improve the handling of "/external:I". The "onnxruntime_external_lib_include_dir" variable may be: 1. A simple file path 2. A cmake generator expression like "$<INSTALL_INTERFACE:include>", "$<TARGET_PROPERTY:onnx_proto,INTERFACE_INCLUDE_DIRECTORIES>", "$<BUILD_INTERFACE:xxxx>". It seems that we can't simply put them into the "target_compile_options" line, so this PR tries to parse the expression and extract the part we need. ### Motivation and Context Resolve the Github issue: https://github.com/microsoft/onnxruntime/issues/13893	2022-12-09 11:44:32 -08:00
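The parsing idea in the "/external:I" message can be illustrated outside CMake. The sketch below is written in Python for illustration only and is not the PR's CMake code. It assumes that the "part we need" is the body of a `$<BUILD_INTERFACE:...>` expression and that other generator expressions are skipped; the message does not say which parts are kept, and the function name `extract_include_dir` is hypothetical.

```python
import re

# Matches a single top-level generator expression such as $<BUILD_INTERFACE:xxxx>.
_GENEX = re.compile(r"^\$<(?P<kind>[A-Z_]+):(?P<body>.*)>$")


def extract_include_dir(value):
    """Return a plain directory string, or None if the value cannot be used directly.

    Assumed policy: plain paths pass through, BUILD_INTERFACE bodies are
    unwrapped, and any other generator expression is skipped.
    """
    match = _GENEX.match(value)
    if match is None:
        return value  # case 1: a simple file path
    if match.group("kind") == "BUILD_INTERFACE":
        return match.group("body")
    return None


for candidate in ["/usr/include",
                  "$<BUILD_INTERFACE:xxxx>",
                  "$<INSTALL_INTERFACE:include>",
                  "$<TARGET_PROPERTY:onnx_proto,INTERFACE_INCLUDE_DIRECTORIES>"]:
    print(candidate, "->", extract_include_dir(candidate))
```

Nested generator expressions are not handled by this sketch; a real implementation would need to balance the angle brackets.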
Changming Sun	05dc1165a5	Add protobuf version constraint (#13870 ) To fix a build error: /home/xxxxxxxxxxxxx/onnxruntime/build/Linux/Debug/tensorboard/compat/proto/cost_graph.pb.cc:17:8: error: ‘PROTOBUF_INTERNAL_EXPORT_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto’ does not name a type 17 \| extern PROTOBUF_INTERNAL_EXPORT_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto ::PROTOBUF_NAMESPACE_ID::internal::SCCInfo<1> scc_info_TensorShapeProto_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto;	2022-12-08 16:14:16 -08:00
Yulong Wang	dbf47284d1	[wasm] disable closure compiler in debug build (#13865 ) ### Description disable closure compiler in debug build. after this change, emscripten will only run closure compiler in release build.	2022-12-08 13:18:19 -08:00
Changming Sun	81c2defd3b	Remove unused git submodules (#13830 )	2022-12-07 21:59:16 -08:00
Ashwini Khade	983877c712	Decouple strided tensor support from ENABLE_TRAINING (#13829 ) ### Description Decouple strided tensor support from ENABLE_TRAINING ### Motivation and Context This is step 1 for creating a dedicated build for on device training. The intention is: 1. We can set ENABLE_STRIDED_TENSORS in cmake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected; this way we don't have to use if defined(ENABLE_TRAINING) \|\| defined(ENABLE_TRAINING_ON_DEVICE ) everywhere in the code. 2. This also paves the way to easily enable strided tensor support for inference in the future (if required).	2022-12-07 09:22:21 -08:00
cloudhan	f79d38181b	Fix hipify to avoid nccl_service.h: No such file or directory (#13852 ) Fix various flaky build errors due to onnxruntime_session missing dependencies on hipify generated files.	2022-12-07 09:10:37 +08:00
Changming Sun	d12521d7b2	Upgrade pybind11 (#13853 ) Upgrade pybind11 to include the fix for #9735	2022-12-06 15:39:23 -08:00
Ashwini Khade	65201e47bf	Enable nuget packages for on device training (#13637 ) ### Description This PR enables building nuget packages locally for on device training using the --build_nuget arg. This PR also enables the C# bindings by default in the managed package. If a user triggers any training APIs when the native binary is not built for training, an exception with the message "Training is disabled in the current build. Please build ONNXRuntime from source with the build flags enable_training and enable_training_on_device. " is thrown. Build command for creating nuget packages for on device training: build.bat --enable_training --enable_training_on_device --build_nuget Two nuget packages are built: 1. Microsoft.ML.OnnxRuntime.Managed 2. Microsoft.ML.OnnxRuntime.Training OR Microsoft.ML.OnnxRuntime.Training.Gpu	2022-12-05 14:54:09 -08:00


1244 commits