onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 17:15:29 +00:00

Author	SHA1	Message	Date
Changming Sun	5b826b1bc3	Update cmake version in Linux build (#15707 ) ### Description All our Windows build pipelines already use cmake 3.26 except one pipeline: QNN ARM64. This PR does the same for Linux build pipelines. ### Motivation and Context This change is related to #15704 .	2023-04-27 20:02:33 -07:00
Edward Chen	9db24f8fec	Update kernel registration validation to allow kernel registrations to appear in arbitrary order. (#15705 ) The validation script will now sort them by increasing opset order before processing them.	2023-04-27 18:49:31 -07:00
kunal-vaishnavi	39d6d7050d	Change EmbedLayerNormalization mask index output to optional (#15526 ) ### Description This PR changes an EmbedLayerNormalization node's mask index output to be an optional output if a mask input is not provided. ### Motivation and Context The documentation for EmbedLayerNormalization states ``` The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated. ``` However, if the mask input is not provided, the mask index output is still calculated and required.	2023-04-27 16:32:42 -07:00
Yulong Wang	d471432e10	[js/webgpu] fix attribute cache key for 2 operators (#15710 ) ### Description fix attribute cache key for LeakyRelu and ThresholdedRelu	2023-04-27 15:04:33 -07:00
Yulong Wang	c0116af619	[js/webgpu] operator Exp (#15713 ) ### Description operator Exp	2023-04-27 15:04:09 -07:00
Tang, Cheng	627f5c9767	support allgather on different axis (#15610 ) ### Description Extend the AllGather op to support performing allgather on a different axis, and provide the implementation in the NCCL kernels. ### Motivation and Context We hit scenarios in distributed inference where we need to gather on a non-first axis. --------- Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>	2023-04-27 14:47:28 -07:00
Sheil Kumar	5bde1e8e37	Add Bluestein Z-Chirp Algorithm to DirectML DFT implementation (#15686 ) This will enable STFT and DFT on signals whose lengths are not powers of 2.	2023-04-27 14:03:40 -07:00
Adrian Lizarraga	be5c582e65	[QNN EP] Update to QNN SDK 2.9.0 (#15709 ) ### Description - Update to QNN SDK 2.9.0 for QNN pipelines - Temporarily disable warnings as errors for QNN Windows x64 pipeline - Note that this pipeline did not previously run to completion. It also currently does not run for pull requests. ### Motivation and Context Need to update and test the latest available version of the QNN SDK.	2023-04-27 13:44:09 -07:00
RandySheriffH	9773e76c44	Single-schema-multi-kernel (#15184 ) This PR allows custom ops with different input types to share the same op name in a graph. The idea is to go over all ops with the same name and merge their input/output types into a type-inference function. With this enhancement, custom op nodes inside a graph can have the same op type as long as their input/output types differ. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-04-27 13:39:59 -07:00
Changming Sun	d3d232b047	Rename onnxruntime-Linux-CPU-2019 machine pool (#15691 ) Rename onnxruntime-Linux-CPU-2019 machine pool to "onnxruntime-Ubuntu2004-AMD-CPU". The old one has an internal error and is stuck. I cannot make any change to it, and it has been like this for more than 1 week. So I created a new pool with the same settings under a different name. Also, move some android pipelines to "onnxruntime-Linux-CPU-For-Android-CI", which uses a standard image from https://github.com/actions/runner-images	2023-04-27 12:46:18 -07:00
Chi Lo	a957a872d3	Patch fix for the newly added TRT EP provider options (#15687 ) We missed some code change with recently added TRT EP provider options	2023-04-27 10:36:01 -07:00
Changming Sun	d3e8d7a70d	Better support for cmake 3.26 and Windows ARM64 (#15704 ) ### Description In #8953 I introduced a change in our onnxruntime_mlas.cmake that enables the "ASM_MASM" cmake language for all Windows builds. ```cmake enable_language(ASM_MASM) ``` Before that change, it was only enabled when onnxruntime_target_platform equals x64. However, cmake 3.26 added a new language: ASM_MARMASM. According to cmake's manual, ASM_MASM is for Microsoft Assembler and ASM_MARMASM is for Microsoft ARM Assembler; the latter is new in cmake 3.26. We should choose the right one according to ${onnxruntime_target_platform}.	2023-04-27 10:25:45 -07:00
yf711	2e1f92a986	Fix EP Perf pipeline (#15507 ) ### Description * Update TensorRT 8.6 lib dependencies in dockerfile of TRT EP Perf pipeline * Avoid using `--allow_running_as_root` and build ORT with non-root user ### Motivation and Context To fix the build issue on EP perf pipeline Fixed [AB#14615]	2023-04-27 10:09:14 -07:00
Yi Zhang	8cda1ffa28	Fix error in post-merge pipeline (#15717 ) ### Description Get the right drive letter on Windows ### Motivation and Context Build Directory might be in drive C	2023-04-27 10:05:15 -07:00
cloudhan	a952419674	[ROCm] Fix FusedConv to stop caching fusion args (#15671 ) The following code shows the ROCm EP FusedConv producing incorrect results:

```py
import numpy as np
import onnx
import onnxruntime as ort

X = onnx.helper.make_tensor_value_info("input", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
a = onnx.helper.make_tensor_value_info("tmp", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
Y = onnx.helper.make_tensor_value_info("output", onnx.TensorProto.FLOAT, [1, 64, 55, 55])

weight_data = np.random.random([64, 64, 1, 1]).astype(np.float32)
weight1 = onnx.helper.make_tensor("weight1", onnx.TensorProto.FLOAT, [64, 64, 1, 1], weight_data)
bias_data = np.random.random(64).astype(np.float32)
bias1 = onnx.helper.make_tensor("bias1", onnx.TensorProto.FLOAT, [64], bias_data)

weight_data = np.random.random([64, 64, 1, 1]).astype(np.float32)  # <------ comment out
weight2 = onnx.helper.make_tensor("weight2", onnx.TensorProto.FLOAT, [64, 64, 1, 1], weight_data)
bias_data = np.random.random(64).astype(np.float32)  # <------ comment out
bias2 = onnx.helper.make_tensor("bias2", onnx.TensorProto.FLOAT, [64], bias_data)

node1 = onnx.helper.make_node("FusedConv", inputs=[X.name, weight1.name, bias1.name], outputs=[a.name], domain="com.microsoft", kernel_shape = [1,1], activation="Relu")
node2 = onnx.helper.make_node("FusedConv", inputs=[a.name, weight2.name, bias2.name], outputs=[Y.name], domain="com.microsoft", kernel_shape = [1,1], activation="Relu")
graph = onnx.helper.make_graph([node1, node2], "Graph", [X], [Y], initializer=[weight1, bias1, weight2, bias2])
model = onnx.helper.make_model(graph, producer_name="tmp", opset_imports=[
    onnx.helper.make_opsetid('com.microsoft', 1),
    onnx.helper.make_opsetid('ai.onnx.ml', 1),
    onnx.helper.make_opsetid('', 14),
])

sess0 = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
sess1 = ort.InferenceSession(model.SerializeToString(), providers=["ROCMExecutionProvider"])

ref = sess0.run(["output"], {"input" : 0.05 * np.ones([1, 64, 55, 55], dtype=np.float32)})[0]
our = sess1.run(["output"], {"input" : 0.05 * np.ones([1, 64, 55, 55], dtype=np.float32)})[0]
print(ref - our)
```

The root cause is that the fusion args are cached together with the fusion plan. It seems that, internal to MIOpen, the `miopenOperatorArgs_t` handle is copied directly to the execution engine, instead of the content of the `miopenOperatorArgs_t`. If two ORT `OpKernel`s have the same conv kernel spatial dimensions, strides, etc., we get the same hash for the fusion plan, and thus also the same fusion args handle. The second `FusedConv` node may then modify the fusion args on the fly while the first `FusedConv` node is still pending execution inside MIOpen. This PR moves the fusion args out of the fusion plan cache to avoid the problem.	2023-04-27 23:20:25 +08:00
pengwa	2efb75bfe9	Fold shape related operation (#14936 )

### Fold shape-related operations on a best-effort basis

This is a follow-up to PR https://github.com/microsoft/onnxruntime/pull/12561. It creates a specialized shape_optimizer to constant fold shape-related operations. ShapeOptimizer makes a best effort to constant fold the dim values that are known from shape inferencing. This helps simplify the graph, which in turn helps other graph transformers do more.

The transformer traverses the graph top-down and performs shape optimizations. It makes a best effort to constant fold the shapes related to Shape node outputs:

1. Shape generates 1D tensor [12, 128, 512] (all dimensions have concrete dim values), which can be constant folded to an initializer holding the 1D tensor values [12, 128, 512]. (Some logic in ConstantFolding does the same thing.)
2. Shape generates 1D tensor [batch_size, 128, 512] -> Slice(start=1,end=3): we can constant fold the Shape->Slice to an initializer holding the 1D tensor values [128, 512].
3. Shape generates 1D tensor [batch_size, 128, 512] -> Gather(axes=[0], index=[2]): we can constant fold the Shape->Gather to an initializer holding the 1D tensor values [512].
4. Shape 15 takes an input of shape [batch_size, 128, 512], slicing from 1 to 2 (exclusive): we can constant fold the Shape15(start=1,end=2) to an initializer holding the 1D tensor values [128].

This helps clean up the graph; combined with ConstantFolding, the graph becomes much simpler.

### Motivation and Context

One direct motivation is that we have a model subgraph like this: ![image](https://user-images.githubusercontent.com/10530022/223390243-47b13922-4340-4999-9637-f52a33f69a2d.png)

The subgraph in the green rectangle is trying to get the value `30522`. With the changes in this PR, the subgraph will be constant folded. The ConstantFolding optimizer will then further optimize out the subsequent `Squeeze`/`Unsqueeze`/`ConcatTraining`, leaving a very clean Reshape node whose shape input is a constant `[-1, 20522]`. With this simplified graph, our other compute optimizer can further optimize the graph by re-ordering gather/reshape nodes.	2023-04-27 18:59:28 +08:00
Yi Zhang	53ff50d19a	make nuget workflow easy to debug. (#15693 ) ### Description Add parameters so that some stages can use another run's intermediate output. ### Motivation and Context The nuget workflow has 38 stages in 4 layers. We had to run the whole workflow from the beginning to test one stage. Running only one stage for testing would make life easier, like ![image](https://user-images.githubusercontent.com/16190118/234453721-e6e9a4bd-5e0b-4101-a18e-d5cf60615c9f.png) ### N.B. In this PR, Nuget_Test_Linux_CPU, Nuget_Test_LinuxGPU and Jar_Packaging_GPU are enabled as the first step, so I can start to move tests from the Linux host to a container.	2023-04-27 14:54:14 +08:00
Ted Themistokleous	926ae7d786	Add updated skipped test for multiheadattention Packed KV & QKV (#15587 ) Adds skip for MIGraphX EP builds for Packed KV and QKV tests in Multi Head attention. As it is not supported and causes CI failures when building and testing EPs --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2023-04-27 10:31:53 +08:00
Changming Sun	e63bb5acef	Fix a memory leak in QGemm (#15703 ) ### Description The BufferUniquePtrs in the old code don't know which allocator the memory was allocated from, so they cannot free the memory.	2023-04-26 18:48:00 -07:00
Rachel Guo	740d553c42	[rn] Reland support loading model from buffer for Android (#14514 ) ### Description Reland previously reverted changes for loading model from buffer - Android ### Motivation and Context #13903 --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2023-04-26 16:53:17 -07:00
Yulong Wang	a02c885f86	[js/webgpu] add implementation of Relu, LeakyRelu and ThresholdedRelu (#15668 ) ### Description add implementation of Relu, LeakyRelu and ThresholdedRelu	2023-04-26 15:11:01 -07:00
Justin Chu	76ddc92fbd	Enable RUFF as a formatter (#15699 ) ### Description RUFF can now format since lintrunner-adapters v0.8. Removed the RUFF-FIX linter. ### Motivation and Context Better engineering	2023-04-26 14:04:07 -07:00
Yufeng Li	d7ba9814cf	[prefast:Warning]: C26409 ('PackedAttention<onnxruntime::MLFloat16>::TryGettingFusedRunner') (#15663 )	2023-04-26 14:03:36 -07:00
Patrice Vignola	97c4cab6b7	[DML EP] Massage SkipLayerNorm axes to better target metacommands (#15676 ) DML's MVN metacommand needs all axes except for batch and channel to be reduced. By adding trailing dimensions of 1's and their corresponding axes, the operation stays the same but we are now able to call metacommands.	2023-04-26 14:00:36 -07:00
Hector Li	4c7b5032da	[QNN EP]Support unpack initializer from external data source (#15694 ) ### Description Support unpack initializer from external data source ### Motivation and Context Support unpack initializer from external data source	2023-04-26 13:39:40 -07:00
yf711	28985c47b7	[TensorRT EP] Unleash opset16-17 onnx model tests (#15657 ) ### Description In 2021 we restricted onnx node test CI execution for ORT-TRT to opsets 14-15, which were the latest opsets that TRT EP could support. Update this range to opsets 14-17 to improve the ORT-TRT unit test coverage, as [Nvidia announced that TRT 8.6 supported opset17](https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md)	2023-04-26 11:44:19 -07:00
kunal-vaishnavi	cfb8c0e2ca	Add Whisper custom export to wheel (#15685 ) ### Description This PR adds the Whisper custom export scripts to the wheel. ### Motivation and Context This enables access to the custom export scripts in the wheel.	2023-04-26 10:45:52 -07:00
yf711	d701dcd027	Fix Linux MultiGPU TensorRT CI (#15697 ) ### Description * Reverting default TensorRT version to 8.5 as temporary fix * Apart from that, this PR temporarily leaves this CI as a place to validate user behavior that uses TRT 8.5 with latest ORT ### Context * This CI pool equips 2xTesla M60 GPUs, which are no longer supported by TensorRT 8.6. * Currently, other CIs are using single-T4 VM but there's no VM with 2xT4 or other suitable dualGPU in the range. * Once we decide which VM instance for this CI to migrate to, TRT8.6 can be enabled on this CI * According to [Nvidia](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html): * TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x) and NVIDIA Maxwell (SM 5.x) devices. These devices are no longer supported in TensorRT 8.6. NVIDIA Pascal (SM 6.x) devices are deprecated in TensorRT 8.6.	2023-04-26 10:01:33 -07:00
PeixuanZuo	0ecfe83932	[ROCm] add beam search support (#15625 ) add beam search support for ROCm EP.	2023-04-26 17:53:33 +08:00
Xavier Dupré	699c9a520b	Fix TVM pipelines (#15653 ) ### Description Fix TVM pipelines by adding a missing dependency of TVM (attrs).	2023-04-26 09:55:05 +02:00
Yulong Wang	b98317b907	[js/webgpu] following up for JSEP/WebGPU code cleanup (#15666 ) ### Description This PR resolves a part of non-critical comments from code review comments in #14579. - use `USE_JSEP` instead of `USE_JS` in build definition to make it less ambiguous - remove unused util functions from util.ts - fix transpose.h - other misc fixes	2023-04-25 21:20:03 -07:00
sfatimar	ebaafac3f5	Openvino ep ort 5.0 (#15626 ) ### Description The PR adds VPU support to the OpenVINO Execution Provider. Bug fixes for GPU, CPU. Changes to the OpenVINO Backend in the Serialized Model API for faster First Inference Latency. Deprecation of HDDL-VADM and MYRIAD, with the related code removed. Support for OpenVINO 2023.0. Dynamic Shapes Support for iGPU. ### Motivation and Context - VPU is an upcoming hardware that can provide AI Acceleration for Client Systems through OpenVINO --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-04-25 20:59:42 -07:00
Changming Sun	b1b6e5522e	Update cuda 11.6 to 11.8 for Windows pipelines (#15684 ) ### Description Update cuda 11.6 to 11.8 for Windows pipelines. This PR is just for Windows CUDA pipelines. It does not include any change for Linux pipelines or TensorRT pipelines. ### Motivation and Context It is a planned feature for the upcoming ONNX Runtime release.	2023-04-25 20:23:57 -07:00
Rui Ren	db6a9bc033	support latest deepspeed version for optim (#15682 ) ### Description Support the latest deepspeed 0.9.1 for the next release. ### Motivation and Context This will avoid the warning message `Skip modifying optimizer because of unsupported DeepSpeed version` --------- Co-authored-by: ruiren <ruiren@microsoft.com>	2023-04-25 20:12:23 -07:00
Hector Li	3dc9720cfc	[QNN EP] Enable Qnn EP op support Elu, HardSwish, Atan (#15681 ) ### Description Enable some Ops for QNN EP: Elu, HardSwish, Atan ### Motivation and Context unblock more models	2023-04-25 20:11:06 -07:00
Wei-Sheng Chin	1524f73a09	Implement two easier random tensor generator (RTG) for flaky tests (#15517 ) Some math ops have very bad numerical stability and essential randomness (e.g., exp/log with reduction on large elements). To maintain the same test coverage with a lower CI failure rate, we can gradually replace flaky tests' RTG with the ones implemented in this PR. Try `Discrete` first; if the test is still unstable, use `Circular`. Overall recommended strategy to handle a flaky test: - Find if it uses `Uniform` in `onnxruntime/test/common/tensor_op_test_utils.h`. If yes, replace `Uniform` with `Discrete` implemented in this PR. For `candidate_values`, we can try `[-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]`, `[-2, -1, 0, 1, 2]`, `[-1, 0, 1]`, and `[0, 1]` and choose the most difficult one among those passing 100 runs. - If `Discrete` fails to meet the stability requirement, switch to `Circular` and repeat the `candidate_values` selection process. Let's keep an eye on the two bugs mentioned in https://github.com/microsoft/onnxruntime/pull/15515. If the related unit tests fail again, we can replace the underlying `RandomValueGenerator::Uniform` with `FixedPatternValueGenerator::Discrete` or `FixedPatternValueGenerator::Circular` implemented in this PR.	2023-04-25 17:52:44 -07:00
Numfor Tiapo	f44f6c5b2e	Fix Prefast Errors (#15651 ) This PR adds fixes for prefast errors with the following codes: - C26814 - C26451 - C26400	2023-04-25 16:41:39 -07:00
Rui Ren	4c3e350a6a	fix ORTModuleONNXModelException fallback OOM (#15523 )

### Description

### Error

```
RuntimeError: There was an error while exporting the PyTorch model to ONNX:-

Traceback (most recent call last):
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
    raise exception
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 385, in _get_exported_model
    torch.onnx.export(self._flattened_module,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/__init__.py", line 305, in export
    return utils.export(model, args, f, export_params, verbose, training,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 118, in export
    _export(model, args, f, export_params, verbose, training, input_names, output_names,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 743, in _export
    proto, export_map, val_use_external_data_format = graph._export_onnx(
RuntimeError: ONNX export failed: Couldn't export Python operator XDropout
```

The error leads to an Out of Memory issue, because the log.txt file is 26 GB.

### Motivation and Context

The root cause is that in each `_forward`

```
if log_level <= _logger.LogLevel.WARNING and not self._raised_ORTModuleONNXModelException:
    warnings.warn(
        (
            f"Fallback to PyTorch due to exception {type(self._exception)} was triggered. "
            "Report this issue with a minimal repro at https://www.github.com/microsoft/onnxruntime. "
            f"See details below:\n\n{_utils.get_exception_as_string(self._exception)}"
        ),
        UserWarning,
    )
```

the above code is called and logs the `exception` through `get_exception_as_string`. In my training case, this leads to 40k `Traceback` printouts to stdout and 110 million lines of `onnx graph` output, and the run hits OOM.

### Validation

After the above fixes, the log.txt file is only 2.4 MB.

--------- Co-authored-by: ruiren <ruiren@microsoft.com>	2023-04-25 15:10:31 -07:00
Yulong Wang	d30831d829	[js/webgpu] make `RunFunction` return `void` (#15669 ) ### Description make `RunFunction` return `void`. the return value is meaningless in the OpResolveRule context. Allows any JavaScript error to be caught and returns non-zero return value from `computeKernel()`	2023-04-25 14:14:26 -07:00
Chen Fu	2fa10fb803	Fp16 onnx pool operators, relu, leakyrelu (#15498 ) ### Description Adding the fp16 onnx operator implementations: maxpool, averagepool, global average pool, relu, leaky relu ### Motivation and Context Continue with support for fp16. Standard onnx operator implementations are needed as a basis for the graph optimizers to work.	2023-04-25 14:01:47 -07:00
Changming Sun	9bf08bdb52	Fix iconv link issue (#15592 ) ### Description Fix iconv link issue. The library is used in string_normalizer.cc. ### Motivation and Context Though iconv is part of POSIX standard, some systems may have additional iconv providers, for example GNU iconv, that is not in the standard c runtime library. In these cases we may need to link to additional libraries. However, this change has two caveats: 1. It may silently pull in GNU libraries into libonnxruntime.so, and make the shared library not distributable. 2. The detection of iconv library runs before we add additional include folders to ORT. So the detection may be inaccurate.	2023-04-25 13:28:36 -07:00
Ye Wang	d05777ddb6	stabilize fusion script with a separate create_attention_node() (#15670 ) ### Description Previously it used create_attention_node() from the base class in fusion_attention.py. Changes in that file may sometimes silently lead to generating a bad model. --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-04-25 13:07:58 -07:00
Baiju Meswani	5885abfb35	Training Documentation (#15612 )	2023-04-25 11:44:12 -07:00
Ye Wang	d00197aaa7	initialize cache_indir explicitly in beamsearch with encoder decoder model (#15667 )	2023-04-25 11:05:21 -07:00
Chi Lo	e1755541cc	Fix TRT timing cache test (#15588 ) TRT EP test for timing cache has wrong logic where it enables timing cache for both sessions to compare the trt engine build time, that's why CI got some intermittent failures. This PR disabled the timing cache test for comparing the engine build time between enabling/disabling timing cache until we find a model that can benefit from timing cache.	2023-04-25 10:20:26 -07:00
Wei-Sheng Chin	d0c3f92ec6	[DORT] Fix fake tensor problem caused by PyTorch change (#15664 ) This should make `Orttraining Linux Lazy Tensor CI Pipeline` green again.	2023-04-25 19:56:42 +08:00
Yulong Wang	3440d3a08e	remove 'lib/' from .gitignore (#15613 ) The 'lib/' entry causes the source folder /js/web/lib/ to be ignored.	2023-04-24 18:43:32 -07:00
Ashwini Khade	124ea0a801	remove compute optimizer from lte (learning on the edge) builds (#15637 ) ### Description Removing compute optimizer from on device training builds. ### Motivation and Context 1. mitigate android build failures 2. reduce binary size Since only CPU EP is enabled for LTE builds, we can optimize the models offline.	2023-04-24 15:57:15 -07:00
Yulong Wang	14cc02c65c	[js/web] WebGPU backend via JSEP (#14579 )

### Description

This change introduces the following new components into ONNX Runtime Web:

- JavaScript Execution Provider (JSEP)
- Asynchronous inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
- initial implementation of kernels:
  - elementwise operators (22)
  - binary operators (5)
  - tensor: Shape, Reshape, Transpose, Gemm
  - nn: Conv, {Global}Maxpool, {Global}AveragePool

The code still needs polishing; work on it is ongoing.

## Q&A

What is JSEP?

> JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime execution provider that specifically works in the Web environment (browsers). JSEP allows JavaScript code to kick in from various places when ONNX Runtime inferences a model.

Why JSEP?

> JSEP is a hybrid mode EP that contains both C/C++ and TypeScript/JavaScript implementations. There are 2 strong reasons why we introduce JSEP:
> 1. The C/C++ part helps JSEP leverage ONNX Runtime's capabilities as much as possible, including the graph transformers, the optimizers and the capability to fall back to the CPU EP. TypeScript/JavaScript makes the kernel implementation much easier to develop and debug in the browser.
> 2. The JavaScript API requires asynchronous execution (e.g. `buffer.mapAsync()`), which makes it impossible to run `OrtRun()` in a synchronous context (see the "async problem" section below). This is solved by using Emscripten's Asyncify.

What is WebGPU?

> WebGPU is the new GPU API that is available in the browser. It is one of the only 2 APIs currently available to access the GPU from the browser (the other is WebGL).
> WebGPU is designed with more advanced and stronger features compared to WebGL and is potentially the solution that offers the best GPU performance for model inferencing currently available.

What is the async problem and why do we have it?

> The "async problem" is that you cannot call an async function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (ON_COMPLETE)(PVOID state, DATA data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
>
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
>   // how to implement?
> }
> ```
> The answer is that it's impossible to implement this function. Usually we try to find a sync version of the API, or launch a thread to call the async function and sync-wait on the main thread. Unfortunately, in the browser environment, neither is possible.
>
> WebGPU does not offer any synchronous API for data downloading (GPU to CPU). This is the only operation that MUST be async. As `OrtRun()` will eventually call into DataTransfer to copy data from GPU to CPU, and `OrtRun()` is a synchronous function, this cannot be done in the normal way.

What is Emscripten? How does the Asyncify feature resolve the problem?

> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to compile ORT and generate the WebAssembly artifacts which run in browsers.
>
> Asyncify is a [compiler feature](https://emscripten.org/docs/porting/asyncify.html) that allows calling async functions from a synchronous context. In short, it generates code to unwind and rewind the call stack to emulate async execution. With this feature, we are able to call the async function inside the `OrtRun()` call.

## Design Overview

### Inter-op

JSEP does pretty much the same thing as any other EP. It exposes an interface for inter-op with JavaScript, which is defined in onnxruntime/wasm/js_internal_api.js:

```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
  Module.jsepBackend = backend;
  Module.jsepAlloc = alloc;
  Module.jsepFree = free;
  Module.jsepCopy = copy;
  Module.jsepCopyAsync = copyAsync;
  Module.jsepCreateKernel = createKernel;
  Module.jsepReleaseKernel = releaseKernel;
  Module.jsepRun = run;
};
```

This simple JavaScript snippet defines all the language-barrier-level functions that JSEP requires to implement kernels and data transfers using JavaScript inside ONNX Runtime:

- `jsepBackend`: assign the singleton object to the webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc() and Free()
- `jsepCopy`: synchronous copy (GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronous copy (GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object maintained in JS to match the lifecycle of a Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this

The abstraction above keeps the connections and dependencies between C/C++ and TypeScript/JavaScript to a minimum.

### Resource Management

The lifecycle of tensor data and kernels is managed by ORT (C/C++), but the implementation is left to JavaScript. JavaScript code is responsible for implementing the callbacks correctly.

For WebGPU, the GPU data is managed by JavaScript using a singleton map (tensor_data_id => GPUBuffer). The GPU pipeline is managed as a singleton. Shaders are managed using a singleton map (shader_key => gpu_program), where shader_key is generated from cache_key (OP specific, including attributes) and the input shapes.

### about data transfer

`js::DataTransfer::CopyTensor` is implemented to call either the synchronous or the asynchronous copy callback, depending on whether the destination is the GPU or not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function so that it can be called in the synchronous context.

### run kernel in JS

The Kernel class constructor calls `jsepCreateKernel()` once, with an optional per-kernel serialization to pass attributes into JavaScript.

`Compute()` is implemented so that a metadata serialization is performed in a base class and JavaScript code can access the data using the Emscripten-specific builtin macro `EM_ASM_`.

### disabled features

- Memory pattern is force disabled, because WebGPU data is not represented by a general memory model (one in which a buffer can be represented by offset + size).
- Concurrent run support is disabled. WebGPU is stateful and also has async function calls. Supporting concurrent runs would significantly increase the complexity, and we would not get any real benefit from it.

### prefer channels last

JSEP prefers channels last and returns `DataLayout::NHWC` in method `GetPreferredLayout()`. This lets the graph transformers preprocess the graph into a channels-last form so that a more optimized WebGPU shader can be used.

### Testing code

It's impossible to test JSEP directly because JSEP itself does not contain any kernel implementation. However, it has the kernel registration, which needs to work together with the corresponding JavaScript code. There are unit tests that run onnx models from the JavaScript API.

--------- Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-04-24 15:21:18 -07:00
George Wu	8dd32fed47	[TensorRT EP] avoid excessive library load/unload overhead when running unit tests. (#15639 ) TensorRT loads and unloads libraries as builder objects are created and torn down. This happens for every single unit test, which leads to excessive test execution time. The overhead has steadily increased over the past few TensorRT versions as the library objects got bigger, to the point where running all the unit tests takes 8 hours. Nvidia suggests keeping a placeholder builder object around to avoid this.	2023-04-24 14:43:13 -07:00
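
Note on 5bde1e8e37 (Bluestein Z-Chirp): the commit message only states that the algorithm enables DFT/STFT on lengths that are not powers of 2. The sketch below is not the DirectML implementation; it is a plain-Python illustration of the general technique, assuming a textbook radix-2 FFT as the only building block. The function names `fft_pow2`, `bluestein_dft` and `naive_dft` are made up for this example.

```python
import cmath


def fft_pow2(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft_pow2(a[0::2], invert)
    odd = fft_pow2(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def bluestein_dft(x):
    """DFT of any length n >= 1, using only power-of-two FFTs."""
    n = len(x)
    # chirp[k] = exp(-i*pi*k^2/n); relies on j*k = (j^2 + k^2 - (k-j)^2) / 2.
    chirp = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    # Smallest power of two m >= 2n - 1, so the circular convolution does not alias.
    m = 1 << (2 * n - 2).bit_length()
    a = [x[k] * chirp[k] for k in range(n)] + [0j] * (m - n)
    b = [0j] * m
    for k in range(n):
        c = chirp[k].conjugate()
        b[k] = c
        if k:
            b[m - k] = c  # negative lags wrap around to the end of the buffer
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], invert=True)
    return [chirp[k] * conv[k] / m for k in range(n)]


def naive_dft(x):
    """O(n^2) reference DFT used only to check the result."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]
```

Comparing `bluestein_dft(x)` against `naive_dft(x)` for a length such as 7 or 12 is a quick way to check the sketch; the design point is that the length-n transform is rewritten as a convolution, which can be zero-padded to any convenient power of two.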


8673 commits
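
Note on 4c3e350a6a (ORTModuleONNXModelException fallback OOM): the commit message says the fallback warning, including the full exception string, was produced on every `_forward` call. The actual fix is not shown in the log, so the following is only a hedged sketch of one way to warn once and skip the expensive formatting afterwards. `FallbackNotifier`, `_warned` and `format_calls` are hypothetical names, not part of ORTModule.

```python
import traceback
import warnings


class FallbackNotifier:
    """Hypothetical stand-in for the fallback path; not ORTModule's actual class."""

    def __init__(self, exception):
        self._exception = exception
        self._warned = False   # flipped after the first warning is emitted
        self.format_calls = 0  # counts how often the exception text is rendered

    def _exception_as_string(self):
        # Rendering the exception is the expensive part, so track how often it runs.
        self.format_calls += 1
        return "".join(
            traceback.format_exception_only(type(self._exception), self._exception)
        )

    def forward(self):
        if not self._warned:
            warnings.warn(
                f"Fallback to PyTorch due to exception {type(self._exception)} was triggered. "
                f"See details below:\n\n{self._exception_as_string()}",
                UserWarning,
            )
            self._warned = True
        return "fallback-result"  # placeholder for the real fallback computation
```

With this shape, calling `forward()` thousands of times emits the warning and renders the exception text once, which is the behaviour the commit's validation section (log shrinking from 26 GB to 2.4 MB) is consistent with.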