onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-05 04:17:53 +00:00

Author	SHA1	Message	Date
Ye Wang	df796bbb62	cast logits to half when T=MLFloat16 (#13454 )	2022-11-03 16:40:19 -07:00
Edward Chen	b4a1ae8350	Use narrow instead of gsl::narrow. (#13555 )	2022-11-03 16:24:11 -07:00
cloudhan	2de883c592	Update CK and fix performance issue on dev machine (#13531 ) 1. Update CK to its latest develop branch 2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's performance; ensure it is properly configured. - The flags are propagated from target `hip-lang::device`'s `INTERFACE_COMPILE_OPTIONS`, so we must not manually add the flags. - Instead, we must ensure this target is properly configured by checking that `_CMAKE_HIP_DEVICE_RUNTIME_TARGET` is set. TL;DR: `hip-lang::device` will sometimes not be properly configured if our `CMAKE_PREFIX_PATH` is not configured carefully. In the CI docker, the configuration is in a good state, but on a dev machine it is not, which then silently results in poor kernel performance. We fixed it in this PR and added a guard to avoid unsuccessful future editing and to prevent a convoluted debugging process. `_CMAKE_HIP_DEVICE_RUNTIME_TARGET` is shared in `/opt/rocm/lib/cmake/hip-lang/hip-lang-config.cmake` and it is internal to [CMake](https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6121/diffs); the variable name will not be changed in the foreseeable future.	2022-11-03 19:32:30 +08:00
Yi Zhang	7c3a23c186	extend some timeout value (#13552 ) ### Motivation and Context These workflows are prone to timeouts.	2022-11-03 15:11:41 +08:00
pengwa	a3e7da60e7	Trade subgraph recompute for memory (#12852 ) Description: Subgraph-level recompute. This PR adds an optional capability that trades additional re-computation for better memory efficiency. Specifically, a pre-defined operator list is used to iterate over the graph to find subgraphs to recompute, reducing the stashed activations whose lifetime spans the forward and backward passes. When training with ORTModule, by default, the graph transformer will scan the execution graph to find all subgraphs eligible for recompute, along with the sizes that can be saved. An example is shown below. To enable recompute for some of them, define the environment variable this way: `export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"` ``` [1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary] [1,0]<stderr>:MemoryAlleviation Summary: [1,0]<stderr>: User config: [1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1 [1,0]<stderr>: ================================= [1,0]<stderr>: Subgraph: BitmaskDropout+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 1,024 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: BiasGelu+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Reshape[1,0]<stderr>:+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: 
Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+Add+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+Sub+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Cast+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:1,024 x 1,024 x Frequency:97 [1,0]<stderr>: PatternShape:3 x 1,024 x Frequency:1 [1,0]<stderr>: PatternShape:8 x 64 x Frequency:24 [1,0]<stderr>: PatternShape:1,024 x 4,096 x Frequency:24 [1,0]<stderr>: PatternShape:4,096 x Frequency:24 [1,0]<stderr>: PatternShape:4,096 x 1,024 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: FusedMatMul+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x 
Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: ================================= ``` "Type config" indicates whether recompute is enabled by users: 0 = disabled, 1 = enabled. "Subgraph" names the kind of subgraph that will be recomputed; in this case, it is a single node "Gelu", and it will be "Recompute". "Shape && Frequency" means that, for this recompute, one tensor of size (batch size, 500) will be saved because it will be recomputed. Baseline: On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100 GPUs. With the latest main branch, we can run batch size 16, and the maximum batch size is < 32, so 16 is usually chosen by data scientists. 65% of the 40GB memory is used during training. SamplesPerSec=479.2543353561354. ![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png) With this PR: Gelu is recomputed to reduce the memory peak, so batch size 32 can be run. 97% of the 40GB A100 memory is used, and SamplesPerSec=562.041593991271 (1.17X of baseline). ![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)	2022-11-03 13:49:41 +08:00
George Nash	77be22f379	[oneDNN ep] Update from oneDNN v2.7.0 to oneDNN v2.7.1 (#13536 ) The oneDNN 2.7.1 release includes multiple functional and performance improvements. Signed-off-by: George Nash <george.nash@intel.com> ### Description Update the oneDNN library from 2.7.0 to 2.7.1. This contains multiple functional and performance improvements. ### Motivation and Context This is a minor point release from the oneDNN library that gives performance and functional fixes that were found in the oneDNN 2.7 library shortly after release.	2022-11-02 15:57:49 -07:00
Changming Sun	b1e1b25e04	Delete CUB (#13534 ) ### Description Delete CUB ### Motivation and Context Because it is already in CUDA SDK.	2022-11-02 13:06:22 -07:00
Changming Sun	5914a7e0ae	Fix an error in the python packaging pipeline (#13538 ) ### Description It missed a space there. ### Motivation and Context Right now the pipeline is failing because GSL was just converted from a submodule to a cmake external project.	2022-11-02 07:55:20 -07:00
Wei-Sheng Chin	b5904c40dd	Enable ORT in TorchDynamo (#13259 ) This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.	2022-11-01 11:19:29 -07:00
PeixuanZuo	6740528b98	[ROCm] Fix bug for rocm ep build using MS GSL 4.0.0 (#13525 )	2022-11-01 13:05:55 +08:00
PeixuanZuo	c8886c5b4c	Revert "Update CK and fix performance due to lacking -amdgpu-early-inline-all=true (#13493 )" This reverts commit `4dd053cc15`.	2022-11-01 13:05:55 +08:00
Baiju Meswani	c557a55816	Fix on-device training ExportModelForInferencing api (#13510 )	2022-10-31 21:29:06 -07:00
Vincent Wang	17f0ffd1c8	Support More Cases in NoOpElimination (#13460 ) The current NoOpElimination supports only the Add node. This PR adds support for x-0, x*1, 1*x and x/1 besides x+0 and 0+x. With this PR, all Div(x,1) nodes and their gradients (also Div(x,1)) in Huggingface's diffusers model can be removed; these previously took ~1% of total compute time.	2022-11-01 10:39:52 +08:00
Patrice Vignola	3d0db47c17	[DML EP] Fix variable shadowing in EinSum (#13520 ) ### Description Fix variable shadowing in the DML EP's implementation of EinSum ### Motivation and Context An SDL bug was opened because of shadowing of the variable `i` in a nested loop of the EinSum operator.	2022-10-31 19:27:43 -07:00
Patrice Vignola	74f905b237	DML EP enable the provider in the op tests (#13441 ) ### Description Enables the DML provider in the op tests to allow for better CI coverage. ### Motivation and Context Some of the CI tests for DML were actually running on the CPU because there was no default DML provider, so it was returning a `nullptr`. This should add better coverage, and it already uncovered some failures and asserts hitting in a few tests, which need to be investigated separately.	2022-10-31 15:49:03 -07:00
Adrian Lizarraga	9d867a07c0	Fix regression in CustomOpApi::GetTensorData (#13450 ) - Reverts change to CustomOpApi::GetTensorData introduced by commit `5dae0c477d`, which causes infinite recursion. - Moves EndsProfilingAllocated to non-const session implementation (C++ API header).	2022-10-31 12:20:49 -07:00
Edward Chen	2ecd1d6622	Switch GSL to MS GSL 4.0.0 (#13416 )	2022-10-29 04:15:20 -07:00
Edward Chen	7fbfbf789f	Increase timeout for binary-size-checks-pipeline. (#13498 )	2022-10-28 23:15:56 -07:00
zhangyaobit	33b8778a46	Minor improvement for the documentation of kernel explorer (#13490 ) ### Description Fix the input shape of FastGelu. Minor improvement for the documentation of kernel explorer.	2022-10-28 22:57:53 -07:00
Fei Hu	943e156f4c	Allow custom ops to set input memory type (#10879 )	2022-10-28 21:45:26 -07:00
Hector Li	1b494daffa	Add yml file for Snpe EP build (#13494 )	2022-10-28 19:47:50 -07:00
Changming Sun	689e524c58	Move DML packaging pipelines to aiinfra-dml-winbuild machine pool (#13487 ) 1. Move DML packaging pipelines to aiinfra-dml-winbuild machine pool 2. Delete tools/ci_build/github/azure-pipelines/templates/windowsai-nuget-build.yml because the pipeline has been migrated to Onebranch. I monitored it for months, it worked well.	2022-10-28 10:30:16 -07:00
Numfor Tiapo	49e5a11ccd	Fix SDL and Prefast Errors (#13465 ) Fixes Errors 1978844, 1978870, 1978850, 1978855, and 9245 Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-10-28 09:41:18 -07:00
zhangyaobit	0a524cfe1c	Fix the input shape of FastGelu (#13488 ) ### Description Fix the input shape of FastGelu	2022-10-28 09:36:31 -07:00
cloudhan	4dd053cc15	Update CK and fix performance due to lacking -amdgpu-early-inline-all=true (#13493 ) 1. Update CK to its latest develop branch 2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's performance, add it.	2022-10-28 09:36:00 -07:00
Vincent Wang	8b0669bf63	QuickGelu Fusion (#12417 ) Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for forward and 5 Ops for backward. This PR fuses these into a single Op named QuickGelu and its gradient QuickGeluGrad. For CUDA, tested on a V100 using an input tensor with shape [64,128,2048] and float16 type: Before, FW takes 335us, BW takes 614us ![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png) After, FW takes 115us, BW takes 139us, which is much faster. ![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png) For the CPU kernel, using the same shape and float type: Before, FW takes 10ms, BW takes 49ms. Mul: 3480[µs] Sigmoid: 1996[µs] Mul: 4789[µs] Mul: 4642[µs] Mul: 4195[µs] SigmoidGrad: 18328[µs] Mul: 2988[µs] Sum: 18576[µs] After, FW takes 4ms, BW takes 5ms, which is also much faster. QuickGelu: 3939[µs] QuickGeluGrad: 5089[µs] Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2022-10-28 18:12:07 +08:00
JiCheng	20c3c35c33	[XNNPACK] support building xnnpack EP for IOS (#13461 ) ### Description Support building XNNPACK for iOS.	2022-10-28 15:03:04 +08:00
Changming Sun	07271b6c8a	Update docs/OperatorKernels.md (#13485 )	2022-10-27 20:11:49 -07:00
Jian Chen	f9378c5cca	Cjian/c4244 round 2 (#13473 ) ### Description Round 2 of fixing C4244	2022-10-27 18:50:26 -04:00
Changming Sun	4a20c0d98b	Delete zlib.cmake (#13467 ) Delete the file because it is not included by any other file.	2022-10-27 15:36:04 -07:00
Yi Zhang	67074851a3	Skip failed models on training ci and openvino ci (#13477 )	2022-10-27 15:22:47 -07:00
Changming Sun	35659d9021	Increase the timeout value for linux-gpu-tensorrt-ci-pipeline.yml (#13481 ) Now it takes about 55-60 minutes. It is on the edge so it often fails.	2022-10-27 14:26:22 -07:00
Scott McKay	ab71c4bbc0	Document generation CI is broken (#13308 ) ### Description Fix document generation CI. It's not currently updating the docs as we're skipping the tests, which is the invocation of build.py that would have generated the documentation. Set up a specific task to generate documentation for greater clarity. ### Motivation and Context Operator kernel documentation is not getting updated and is now out of date.	2022-10-28 07:20:48 +10:00
Patrice Vignola	0b29f64dba	[DML EP] Enable all datatypes for Abs and Sign (#13470 ) ### Description Enables all datatypes supported for DML for `Abs` and `Sign`. ### Motivation and Context `Abs` and `Sign` haven't been updated since DML started to support all datatypes for them. These ops are used in some transformer models and were forcing unnecessary copies between the CPU and the GPU.	2022-10-27 11:36:11 -07:00
Dmitri Smirnov	0e2087acff	Add extension method to compensate for Contains() absence (#13466 ) ### Description The targeted framework does not contain `Contains(string, ordinal)`. Add an extension method to compensate, following the suggestion [here](https://learn.microsoft.com/en-us/dotnet/api/system.string.contains?view=net-7.0). ### Motivation and Context Packaging pipeline fails.	2022-10-27 10:00:47 -07:00
Baiju Meswani	a46c599a40	Training API to export the eval model to an inference model (#13345 )	2022-10-27 09:34:01 -07:00
Jian Chen	8827c4bdbc	First round of fixes. (#13452 ) ### Description First round of fixes for C4244 error.	2022-10-26 23:05:45 -04:00
Edward Chen	601b74b904	Add '$schema' entry to cgmanifest.json files. (#13444 )	2022-10-26 16:15:05 -07:00
Changming Sun	7d58332298	Update tsaoptions.json: update the email alias (#13448 )	2022-10-26 15:56:16 -07:00
Vincent Wang	805ec459a0	Fix a PoliCheck finding in _hierarchical_ortmodule.py (#13462 )	2022-10-26 15:45:18 -07:00
sumitsays	490e4ddea5	[DML EP] Don't fuse a capability outside the compile call (#13468 ) ### Description DML EP was a special EP w.r.t. capability fusion. It used to fuse a capability outside the IExecutionProvider::Compile() call. But after the recent re-architecture in #13131, it is no longer a special case. ### Motivation and Context To make DML EP consistent with the ORT design. Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>	2022-10-26 15:21:33 -07:00
Dmitri Smirnov	1c8a22ec68	Improve logging and default affinity mask generation (#13338 ) ### Description Fix logging for affinity failures on Linux. Make `GetCpuCores()` consistently return the number of physical cores. Use `CpuInfo` library to correctly set affinities for Linux where supported. Make Windows generate affinity masks as ordinals and convert them to masks at the setting site. Allow setting multiple logical processor affinity masks per thread. We continue to set all logical processors as thread affinity per physical core. ### Motivation and Context Error logging on Linux uses `pthread_self()`, which does not return the thread ID. Fix default affinity mask generation on Windows. The following are the issues with Windows: - `GetThreadAffinityMasks()` returns bitmasks, but on other platforms it returns ordinals generated for the hardware concurrency - The maximum number of processors supported requires a 64-bit mask, but the `size_t` type used is not always 64-bit - The masks returned per physical core may have multiple bits set, because the mask applies to several logical cores hosted by the physical core. In the past, customers complained that their threads jump from one core to another, which adversely affects performance. The decision was made to stay this way. - 64-bit masks do not allow for logical processors with IDs that are outside of the 0-63 range.	2022-10-26 13:30:27 -07:00
Rui Ren	136e15bfaf	revert cmake external file (#13459 )	2022-10-26 11:38:15 -07:00
Adrian Lizarraga	8770201e96	[EP-Perf-Dashboard] Decouple docker image name from branch name (#13449 ) ### Description Updates naming scheme for docker images built by the EP Perf pipeline. Specifically, the docker image name is no longer based on the branch name. ### Motivation and Context The docker image name used by EP Perf pipeline is built from the branch name. This makes the pipeline fail for branches with uppercase letters because docker image names can only contain lower-case letters.	2022-10-26 10:27:22 -07:00
Juan Villamizar	48b2ec944c	Fix warnings preventing Onnx build (#13447 )	2022-10-26 07:53:55 -07:00
Abhishek Udupa	8fbdc6cc46	Add a script for quick profile analysis (#13423 ) ### Description Implements a Python script for quick analysis of a generated JSON profile from ORT. ### Motivation and Context This PR implements a script that lists kernels that take up the most time in a JSON profile, from both the CPU and GPU points-of-view. The script also supports various options for CSV output, grouping of kernels wrt shape of input tensors and wrt kernel dimensions. Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>	2022-10-26 07:43:03 -07:00
PeixuanZuo	a0cc289be6	Update SkipLayerNorm fusion rules (#13350 ) ### Description The subgraph below meets the SkipLayerNorm fusion pattern, but the fusion rules also required every input dimension to have a concrete value, so the subgraph below could not be fused to SkipLayerNorm. subgraph we want to fuse ![image](https://user-images.githubusercontent.com/94887879/196386821-3e678a4c-83e4-4bca-8900-5ef4ea996868.png) fusion pattern 3 [Sub1] [Sub2] \ / \ / \ / Add1 \| LayerNormalization This change allows the inputs of the first Add operator to have dimensions that only have dim_param. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-10-26 16:15:27 +08:00
Patrice Vignola	ac48bdec89	DML EP add einsum MatMul NHCW ops (#13440 ) ### Description This adds the "NHCW" format support for einsum MatMul. The logic is basically a merge of the existing Transpose and MatMul Einsum implementations. ### Motivation and Context Some transformer models that I'm tracking use Einsum quite often during a single inference, and about half of those were "NHCW" MatMul Einsums. Supporting them will reduce the number of copies to the CPU.	2022-10-25 23:09:07 -07:00
Patrice Vignola	d5e8d59243	DML EP register all data types for Where operator (#13443 ) ### Description Register all datatypes for DML's `Where` operator since DML now supports everything. ### Motivation and Context Some transformer models use the `Where` operator on int64 data, but since DML wasn't supporting it, it needed to fall back to the CPU.	2022-10-25 22:47:55 -07:00
PeixuanZuo	70b73afd36	[ADD] fuse Matmul + fastgalu -> gemmfastgelu (#11699 ) Description: Fuse MatMul + FastGelu -> GemmFastGelu to prepare for the AMD optimized fused operator GemmFastGelu. Usage: `python benchmark.py -g -m bert-base-cased --sequence_length 384 --batch_sizes 128 --provider=rocm -p fp16 --disable_embed_layer_norm --enable_gemm_fast_gelu`	2022-10-26 09:33:58 +08:00
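
Note on the QuickGelu Fusion entry (8b0669bf63) above: the message defines QuickGelu(x)=x*sigmoid(1.702x) and says the fused Op is paired with a QuickGeluGrad Op. The following is a minimal scalar sketch of that math in plain Python, not the ONNX Runtime kernel. The function names `quick_gelu` and `quick_gelu_grad`, the `ALPHA` constant name, and the `(dy, x)` argument order are assumptions made for illustration only.

```python
import math

# Scaling constant taken from the formula quoted in the commit message.
ALPHA = 1.702


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def quick_gelu(x: float) -> float:
    """Forward pass: x * sigmoid(ALPHA * x), as one fused expression."""
    return x * sigmoid(ALPHA * x)


def quick_gelu_grad(dy: float, x: float) -> float:
    """Backward pass (hypothetical signature): dy * d/dx [x * sigmoid(ALPHA * x)].

    With s = sigmoid(ALPHA * x), the derivative is
    s + ALPHA * x * s * (1 - s).
    """
    s = sigmoid(ALPHA * x)
    return dy * (s + ALPHA * x * s * (1.0 - s))


def numeric_grad(x: float, h: float = 1e-6) -> float:
    """Central finite difference, used only to sanity-check the analytic gradient."""
    return (quick_gelu(x + h) - quick_gelu(x - h)) / (2.0 * h)
```

Computing the sigmoid once and reusing it in both terms of the gradient is the kind of saving a fused kernel can exploit, compared with the separate Mul, Sigmoid, and SigmoidGrad nodes listed in the message.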


7650 commits