onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-26 03:00:54 +00:00

Author	SHA1	Message	Date
Patrice Vignola	4880f1da46	Fix attention fusion for UNet onnx model export when using LoRA weights (#17249 ) ### Description Tested with stable diffusion unet models exported by both pytorch 2.1.0 (nightly) and pytorch 1.13.1, with and without LoRA weights. ### Motivation and Context LoRA weights modify the unet model by adding matmul and scale operations to every q/k/v/out tensor, which breaks the current MHA pattern recognition.	2023-08-29 11:59:30 -07:00
Hector Li	761c4333b5	[QNN EP] GridSample op support (#17317 ) ### Description QNN EP GridSample op support	2023-08-29 11:41:59 -07:00
Hector Li	742b192a34	[QNN EP] Enable GlobalMaxPool op (#17304 ) ### Description [QNN EP] Enable GlobalMaxPool op	2023-08-29 11:25:34 -07:00
Artem Shilkin	6e60dba726	Fix compilation with newer flatbuffers (#17164 ) flatbuffers@v23.5.9 broke the forward declaration of FlatBufferBuilder. Trying to compile onnxruntime fails with the following error: ``` flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder') typedef FlatBufferBuilderImpl<false> FlatBufferBuilder; ^ onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here class FlatBufferBuilder; ``` This PR removes these declarations and uses includes instead	2023-08-29 10:28:26 -07:00
Yi Zhang	0e9e9b2a67	Fix one exception in post merge (#17327 )	2023-08-29 19:24:50 +08:00
Baiju Meswani	5d2c57363f	Sign CUDA Kernel (#17293 )	2023-08-28 21:03:58 -07:00
Baiju Meswani	38ea8c3931	Increase max error tolerance for ConvTransposeGrad test (#17315 )	2023-08-28 17:05:40 -07:00
Tianlei Wu	ee9d046112	Fix model serialization with external data in current directory (#17311 ) When the original model has external data in the current directory, saving the optimized model raises a File not found exception while looking for the external data file under the root directory "/". This fix looks under the current directory in this case. I manually tested an extra case and it works: original model with external data in the root directory ("/"), with the optimized model saved to the current directory. BTW, another bug was found: when "session.optimized_model_external_initializers_min_size_in_bytes" is set to a large value, some tensors still point to the original external data file. Added a TODO in the unit test for this bug. Possible solution: load external data into memory before saving the model.	2023-08-28 16:06:04 -07:00
Caroline	228db24317	Add training API functions to WASM API (#16521 ) ### Description * Created `wasm/training_api` source and header files & modified WebAssembly CMake to include training flags * The `wasm/training_api` files use an `OrtTrainingManager` handle which is a struct of an OrtCheckpointState and an OrtTrainingSession, rather than creating a CheckpointState handle & a separate TrainingSession handle. * This is so that the TypeScript side only has to manage one handle that will be passed between TrainingSession & CheckpointState representations, rather than the TypeScript side managing separate CheckpointStateHandle and TrainingSessionHandle. ### Motivation and Context WASM API needs to be updated with ORT training API function calls so that ORT training web bindings can be added for on-device training. --------- Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: carzh <carolinezhu@microsoft.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-08-28 11:05:02 -07:00
Hariharan Seshadri	cbd97515cd	[JS/WebGPU] Support GatherElements kernel (#17243 ) ### Description As title ### Motivation and Context Improve WebGPU kernel coverage	2023-08-28 09:55:25 -07:00
mindest	53169f59e5	[ROCm] Sort candidate solutions in rocBLAS/hipBLASLt for deterministic offline tuning (#17297 ) ### Description Sort the candidates in rocBLAS/hipBLASLt to make sure that they are properly ordered and can be correctly fetched by saved indices in offline tuning cases.	2023-08-28 16:34:21 +08:00
cloudhan	bf8b1681f9	Build nuget pkg for ROCm (#16791 ) Add nuget pkg building and publishing for ROCm EP --------- Co-authored-by: Yi Zhang <zhanyi@microsoft.com>	2023-08-28 13:35:08 +08:00
Yulong Wang	bb1871332f	[js/webgpu] add kernel Not and Equal (#17306 ) ### Description This PR adds kernel implementations for the operators "Not" and "Equal". It also removes the download cache in the GPU data manager. Why remove the download cache: the following test case failed ("Or" is on CPU, "Greater" and "Equal" are on JSEP). ![image](https://github.com/microsoft/onnxruntime/assets/7679871/8d9798ad-2703-4fb9-907e-ff716c67d0b2) After debugging, I found that both "Equal" and "Greater" use the same output GPU Data ID. This is because when ORT executes the graph, it first runs "Equal", allowing its shader to write into GPU Data ID 2; then a Gpu2Cpu copy for it is issued (because "Or" is currently on the CPU EP); at this point, ORT considers GPU Data ID 2 free to use, so it reuses it as the output for "Greater". This means there is no new allocation for the output of the "Greater" kernel, and both kernels write to GPU Data ID 2. For the GPU data manager, there will be 2 downloads from the same GPU buffer. Previously I thought this was a waste of resources, so I cached the data. But now it shows that we need to perform 2 downloads because the GPU data is already different. The download data cache should be removed.	2023-08-27 19:50:17 -07:00
simonjub	4eedd3bb46	[TRT EP] Fix logic to reach cache encryption code. (#17111 ) ### Description This is a followup to PR #15519 that is closed in favor of this one. ### Motivation and Context The current implementation of TRT cache has no code execution path possible so that an encrypted TRT engine cache could be created when flags engine_cache_enable and engine_decryption_enable are true. This was originally raised in issue #12551.	2023-08-26 20:09:03 -07:00
Scott McKay	ca0159b45d	Various test infra updates from testing Azure ops with MAUI test app (#17262 ) ### Description - fix issue with handling string input - set minSdkVersion - otherwise defaults to 19 which we don't support and the build breaks - comment out the debug logging hook - enabling it breaks the Android native logging - can be enabled if you need to debug C# code - update test data tools to allow creating input data for raw file contents (e.g. audio) and from strings (e.g. auth token value) - fix some warnings ### Motivation and Context Improve test setup	2023-08-27 09:35:00 +10:00
Yulong Wang	ddcd46174e	[js/webgpu] fix jsepOnRunEnd (#17300 ) ### Description fix jsepOnRunEnd: jsepOnRunEnd() needs to run after runPromise is resolved.	2023-08-26 00:30:28 -07:00
Yifan Li	808215366d	Fix Multi GPU TensorRT tests (#17269 ) ### Description * Integrate `trt_multi_gpu` test stage in ORT post merge CI (Win-2xA10 vm) * Deprecate Linux MultiGPU TRT CI (This vm will be deprecated soon) * Add multi gpu support to existing C# test cases * Deprecate unfunctional flag `--enable_multi_device_tests` ### Motivation and Context * Two contexts of replacing Linux MultiGPU TRT CI: * Flag `--enable_multi_device_tests` is not functional, which cannot detect issues like #17036 * The Linux-2xM60 VM of this CI pool is about to be deprecated 9/6/23. Need to enable this test in other dualGPU vm pool.	2023-08-25 20:30:45 -07:00
Arthur Islamov	c262879214	Added DML and CUDA provider support in onnxruntime-node (#16050 ) ### Description I've added changes to support CUDA and DML (only on Windows, on other platforms it will throw an error) ### Motivation and Context It fixes this feature request https://github.com/microsoft/onnxruntime/issues/14127 which is tracked here https://github.com/microsoft/onnxruntime/issues/14529 I was working on StableDiffusion implementation for node.js and it is very slow on CPU, so GPU support is essential. Here is a working demo with a patched and precompiled version https://github.com/dakenf/stable-diffusion-nodejs ---------	2023-08-25 16:57:06 -07:00
Jiajia Qin	873ef8b8f0	[js/webgpu] add label for some webgpu APIs (#17291 ) ### Description With the label, it's easier to identify which op causes the error. Without the label, the error message is like below: ``` Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32' return W[i2o_W(indices)]; ^^^^^^ - While validating [ShaderModuleDescriptor] - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]). ``` With the label, the error message is like below: ``` Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32' return W[i2o_W(indices)]; ^^^^^^ - While validating [ShaderModuleDescriptor "ConvTranspose2D"] - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "ConvTranspose2D"]). ``` ### Motivation and Context This change is mainly for debugging. With this change, we can easily tell from the message above that `ConvTranspose2D`'s shader has a problem.	2023-08-25 12:12:56 -07:00
xhcao	5e8d94cec8	[js/webgpu] support Greater and Less operators (#17296 )	2023-08-25 12:11:25 -07:00
Adrian Lizarraga	5a83a67f32	Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127 ) ### Description - Enables int32 support for com.microsoft.DequantizeLinear (contrib op) - Makes the `zero_point` input optional for Quantize/Dequantize contrib ops - Enables QDQ transformations with the Quantize/Dequantize contrib ops - Update tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests, TransposeOptimizerTests ### Testing List of tested graph transformations: - [x] QDQSelectorActionTransformer - qdq_transformer_test.cc - [x] QDQS8ToU8Transformer - qdq_transformer_test.cc - [x] DoubleQDQPairsRemover - qdq_transformer_test.cc - [x] IdenticalChildrenConsolidation - qdq_transformer_test.cc - [x] QDQPropagation - qdq_transformer_test.cc - [x] QDQFinalCleanup - qdq_transformer_test.cc - [x] ClipQuantFusion - qdq_transformer_test.cc - [x] ReluQuantFusion - qdq_transformer_test.cc - [x] EnsureUniqueDQForNodeUnit - ensure_unique_dq_for_node_unit_test.cc - [x] TransposeOptimizer - transpose_optimizer_test.cc - [x] CommonSubexpressionElimination - graph_transform_test.cc - [x] ConstantFolding - graph_transform_test.cc ### Motivation and Context We need to [support mixed 16-bit/8-bit precision QDQ models](https://github.com/microsoft/onnxruntime/pull/17015). This PR is the first step in achieving this goal: we need to make QDQ contrib ops work with our optimizations/transformations. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-08-25 09:57:51 -07:00
Yulong Wang	79c4ed9a45	[js/webgpu] support error pop and kernel name (#17260 ) ### Description This PR contains changes to support error pop and kernel name. - Add a function `JsepGetNodeName` to allow reading kernel name from JS to C++ - When in debug mode ( `env.debug = true;` ) or in profiling mode ( `env.webgpu.profilingMode = 'default';` ), kernel name will be read from ORT; otherwise use the kernel pointer ( a number ) as kernel name to save calls from JS to C++. - When in debug mode, WebGPU validation errors will be recorded and if any error occurs, `inferenceSession.run()` will fail (Promise get rejected). Behavior when not in debug mode is not changed. This is because recording errors are not zero-overhead, and GPU validation errors should occur consistently in and not in debug mode. - Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to: - allow implementation of the features mentioned above. - pass session ID to backend.	2023-08-25 08:08:15 -07:00
satyajandhyala	da180b20fa	[JS/Web] Fix ConvTranspose shader code compilation errors. (#17232 ) ### Description Fix JSEP ConvTranspose shader code errors.	2023-08-25 06:25:54 -07:00
guyang3532	401129d484	Add support for more ops for padding elimination (#17217 ) Add support for Gelu/ReduceMean/SimplifiedLayerNormalization for padding elimination	2023-08-25 18:02:15 +08:00
mindest	735cc8e6c8	[ROCm] enable If op for ROCm EP. (#17279 ) ### Description Enable If op for ROCm EP.	2023-08-25 17:49:49 +08:00
Yi Zhang	9cd33e07b4	Re-add tests in Windows GPU Reduced Ops workflow (#17294 ) ### Description Add a single test step in the Windows GPU Reduced Ops workflow ### Motivation and Context The old workflow ran building and testing in one command. In PR #17263, the test step was removed by mistake, so this re-adds it. How to consolidate the test step is still under consideration.	2023-08-25 15:56:59 +08:00
Yi Zhang	4a0f8f6672	Skip one flaky Test (#17290 ) ### Motivation and Context It's skipped in the PR ``` 2023-08-25T02:37:48.7772670Z 1: [ RUN ] ModelTests/ModelTest.Run/cuda__models_opset9_Candy_candy 2023-08-25T02:37:48.7824755Z 1: D:\a\_work\1\s\onnxruntime\test\providers\cpu\model_tests.cc(91): Skipped 2023-08-25T02:37:48.7825343Z 1: Skipping single test It's in broken_tests ```	2023-08-25 14:48:41 +08:00
Changming Sun	3e934030f4	nodejs: Release Ort Env before main function returns (#17288 ) ### Description Release OrtEnv before main function returns. Before this change, OrtEnv is deleted when C/C++ runtime destructs all global variables in ONNX Runtime's core framework. The callstack is like this: ``` * frame #0: 0x00007fffee39f5a6 libonnxruntime.so.1.16.0`onnxruntime::Environment::~Environment(this=0x00007fffee39fbf2) at environment.h:20:7 frame #1: 0x00007fffee39f614 libonnxruntime.so.1.16.0`std::default_delete<onnxruntime::Environment>::operator()(this=0x00007ffff4c30e50, __ptr=0x0000000005404b00) const at unique_ptr.h:85:2 frame #2: 0x00007fffee39edca libonnxruntime.so.1.16.0`std::unique_ptr<onnxruntime::Environment, std::default_delete<onnxruntime::Environment>>::~unique_ptr(this=0x5404b00) at unique_ptr.h:361:17 frame #3: 0x00007fffee39e2ab libonnxruntime.so.1.16.0`OrtEnv::~OrtEnv(this=0x00007ffff4c30e50) at ort_env.cc:43:1 frame #4: 0x00007fffee39fa96 libonnxruntime.so.1.16.0`std::default_delete<OrtEnv>::operator()(this=0x00007fffefff8f78, __ptr=0x00007ffff4c30e50) const at unique_ptr.h:85:2 frame #5: 0x00007fffee39f394 libonnxruntime.so.1.16.0`std::unique_ptr<OrtEnv, std::default_delete<OrtEnv>>::~unique_ptr(this=0x7ffff4c30e50) at unique_ptr.h:361:17 frame #6: 0x00007ffff78574b5 libc.so.6`__run_exit_handlers + 261 frame #7: 0x00007ffff7857630 libc.so.6`exit + 32 frame #8: 0x00007ffff783feb7 libc.so.6`__libc_start_call_main + 135 frame #9: 0x00007ffff783ff60 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128 frame #10: 0x0000000000abbdee node`_start + 46 ``` After this change, OrtEnv will be deleted before the main function returns and nodejs is still alive.	2023-08-24 23:07:02 -07:00
mindest	93ae17d1bb	[ROCm] Add hipBLASLt workspace support (#17096 ) ### Description * hipBLASLt extra workspace for split-k * type update (due to extra support for fp8 in hipBLASLt) * minor changes	2023-08-25 13:08:57 +08:00
pengwa	7c98f45928	Fix layernorm and softmax axis after upstream (#17255 ) ### Fix layernorm and softmax axis after upstream For Gather (when the slicing index is a scalar), the output rank is smaller than the input rank. When we upstream this kind of Gather before softmax or layernorm, we should also update the axis attribute. Otherwise, the axis might be out-of-date and incorrect for the updated rank. ``` File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 157, in handle_exception raise exception File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 280, in forward self._build_graph(graph_transformer_config) File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 158, in wrapper result = func(graph_execution_manager, args, kwargs) File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 273, in wrapper result = func(graph_execution_manager, args, *kwargs) File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 361, in _build_graph super()._build_graph(graph_transformer_config) File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 184, in _build_graph self._graph_builder.build(config) RuntimeError: /onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc:823 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder, const onnxruntime::training::TrainingGraphTransformerConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Node (Softmax_2904) Op (Softmax) [ShapeInferenceError] 'axis' must be in [-3 , 2]. Its actual value is: 3 ```	2023-08-25 12:26:22 +08:00
Faith Xu	86238fb507	[Docs] Auto generate JS API (#17271 ) ### Description Adds new workflow to generate js docs with latest changes so the API page can stay up to date [Test page of latest js docs](https://faxu.github.io/onnxruntime/docs/api/js/modules/InferenceSession.html)	2023-08-24 17:35:37 -07:00
Yi Zhang	756eda2cc4	Windows CI build steps template (#17263 ) ### Description 1. New windows ci build steps template. 2. Remove useless variables. ### Motivation and Context 1. Make it easier to apply build cache to all windows CIs. 2. Other teams' devs only need to take care of build options ### Comparison Before: `9f21f694cf/tools/ci_build/github/azure-pipelines/win-gpu-tensorrt-ci-pipeline.yml (L19-L82)` After: `b4c1f2261b/tools/ci_build/github/azure-pipelines/win-gpu-tensorrt-ci-pipeline.yml (L35-L54)`	2023-08-25 05:58:49 +08:00
Hector Li	680fac64ed	[QNN EP] Support non-quantized Op on HTP (#17194 ) ### Description [QNN EP] Support non-quantized Op on HTP 1. Remove the limitation in GetCapability that always requires a QDQ node unit group to partition the node on the NPU backend, so that we can support a non-quantized Slice op with int32 data input on HTP. 2. Enable Where QDQ node unit 3. Separate out the flags is_npu_backend & is_quantized_node to make it clear 4. Separate out QuantizeLinear, DequantizeLinear to QdqOpBuilder to better identify quantized/un-quantized input/output tensors 5. Separate out a TransposeOpBuilder to simplify Transpose node processing, especially for a single Transpose node in a QDQ model. For a case like Q->Transpose->DQ, Transpose is not a QDQ node unit group, it's a single node, but we should treat it as a quantized node. The output should have the same data type and quantization parameters as the input. Another case is to support non-quantized data for Transpose in a QDQ model. 6. Remove is_npu_backend flag from OpBuilder interface. Set the backend type in QnnBackendManager, QnnModel & QnnModelWrapper, so that OpBuilders can always get it from QnnModelWrapper. 7. Add unit tests for quantized/non-quantized Transpose (int32, float32) on HTP backend	2023-08-24 14:57:16 -07:00
pengwa	18d5cfdb85	Fix build - redefinition of default argument for ‘long unsigned int Extent’ (#17281 ) ### Fix build - redefinition of default argument for ‘long unsigned int Extent’ Building ORT in one training customer's environment produced this build error. The GCC versions are ``` aiscuser@node-0:/tmp/onnxruntime$ gcc --version gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 aiscuser@node-0:/tmp/onnxruntime$ g++ --version g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 ``` But on our dev node using the same GCC/G++, we don't have the build issue. Not sure what the difference is, but giving an explicit type when creating `gsl::span` fixed the problem. ``` /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:394:7: error: redefinition of default argument for ‘long unsigned int Extent’ 394 \| class span \| ^~~~ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span_ext:46:51: note: original definition appeared here 46 \| template <class ElementType, std::size_t Extent = dynamic_extent> \| ^~~~~~~~~~~~~~~ /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:82:93: error: return type ‘class gsl::span<const std::byte>’ is incomplete 82 \| [[nodiscard]] inline gsl::span<const std::byte> AsByteSpan(const void* data, size_t length) { \| ^ /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h: In function ‘void onnxruntime::AsByteSpan(const void, size_t)’: /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: class template argument deduction failed: 83 \| return gsl::span(reinterpret_cast<const std::byte>(data), length); \| ^ /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: no matching function for call to ‘span(const std::byte, size_t&)’ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: candidate: ‘template<class Type, long unsigned int Extent> gsl::span(Type (&)[Extent])-> gsl::span<ElementType, FirstExtent>’ 740 \| span(Type (&)[Extent]) -> span<Type, Extent>; \| ^~~~ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: template argument deduction/substitution failed: /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘Type [Extent]’ and ‘const std::byte’ 83 \| return gsl::span(reinterpret_cast<const std::byte>(data), length); \| ^ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: candidate: ‘template<class Type, long unsigned int Size> gsl::span(std::array<_Tp, _Nm>&)-> gsl::span<ElementType, FirstExtent>’ 743 \| span(std::array<Type, Size>&) -> span<Type, Size>; \| ^~~~ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: template argument deduction/substitution failed: /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘std::array<_Tp, _Nm>’ and ‘const std::byte’ 83 \| return gsl::span(reinterpret_cast<const std::byte*>(data), length); \| ^ ```	2023-08-25 00:40:40 +08:00
pengwa	d90afc697b	Introduce ZeROOffloadSubscriber for ORTModule (#17006 ) ### Introduce ZeROOffloadSubscriber for ORTModule As part of the work to integrate ORTModule with DeepSpeed stage3, this PR mainly focuses on moving the original PyTorch-based (leveraging hooks) param partition/offload implementation to an ORTModule-compatible implementation. Changes include: 1. Refactor `SubscriberBase`/`SubscriberManager` to support pre-forward/post_forward hooks. 2. Implement new `ZeROOffloadSubscriber` by re-using DeepSpeed hook functions as much as possible. All hook functions are defined in `DeepSpeedZeRoOffload._register_hooks_recursively` and `DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and the closure is not complex: all hooks reference the owning `DeepSpeedZeRoOffload` instance. So we can create a new hook function with `FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance, then call the newly created function in the subscriber's `pre_forward_module_apply_impl` and `post_forward_module_apply_impl` interfaces. 3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to register the `ZeROOffloadSubscriber` for the model, so we don't need to change any code in the DeepSpeed repo (at least so far). 4. Fix the ATen embedding custom symbolic exporter function by tolerating a weight size of (0) (changed by DeepSpeed zero stage 3). UT will be added once stage3 is fully supported.	2023-08-25 00:15:22 +08:00
Baiju Meswani	fca81cc5d5	ConvTransposeGrad CUDA Kernel (#17201 )	2023-08-24 09:08:06 -07:00
Jian Chen	33415b9da4	Removing 10.14 suffix from osx nuget package (#17277 )	2023-08-24 08:51:54 -07:00
cloudhan	94f23882f7	Colorize terminal log output (#17196 ) Make eyeball log parsing a little bit easier.	2023-08-24 17:38:21 +08:00
Baiju Meswani	34d18ee076	Build gradient graph starting at the loss alone (#17240 )	2023-08-23 23:54:45 -07:00
Yulong Wang	fb51faea64	[js/webgpu] fix 2 build breaks introduced in merge (#17273 ) ### Description fix 2 build breaks introduced in merge. Fixes web build	2023-08-23 18:09:50 -07:00
cloudhan	87bef1f3f2	Move composable_kernel to deps.txt (#17245 )	2023-08-23 17:39:16 -07:00
Dmitri Smirnov	33c87f6283	ORT_ENFORCE on the iterator must come before iterator is dereferenced. (#17265 ) ### Description Move `ORT_ENFORCE` on the iterator before iterator is used for the first time.	2023-08-23 17:20:01 -07:00
Baiju Meswani	6c95d959f3	Make batchnorm training mode available in inference only package (#17270 )	2023-08-23 15:19:11 -07:00
Dmitri Smirnov	fdc3bcae20	Disable local symbol table for function shape inferencing. (#17267 ) ### Description Temporarily disable symbol tables. ### Motivation and Context Local symbol tables mark unrelated shapes re-use and cause inference to error out. https://github.com/microsoft/onnxruntime/issues/17061	2023-08-23 14:46:21 -07:00
Yulong Wang	8b18d48c7c	[js/webgpu] make IndicesHelper implementation implicit (#17193 ) ### Description This change makes it no longer required to call indicesHelper.impl() in shader code.	2023-08-23 14:41:35 -07:00
Rachel Guo	aed7c6ffc7	Exclude fp16 support flag definition from minimal build (#17259 ) ### Description As title. ### Motivation and Context Reduce minimal build binary size for mobile to meet office team requirement. cc @chenfucn Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2023-08-23 10:13:19 -07:00
Scott McKay	b3cb775cf9	Two fixes involving minimal builds (#17000 ) ### Description - allocation planner was breaking if graph had no nodes - in this particular model a branch of an If node returned an outer scope value directly. - if the model used non-tensor types and sparse tensors are disabled, the call to IsSparseTensor throws an exception which prematurely terminates the code. - it's perfectly fine to check if a value is a sparse tensor when support for them is disabled. we just can't do anything with that OrtValue, which is what the current ifdefs after the call to IsSparseTensor handle. ### Motivation and Context Fix model execution failure for partner with model that uses sequences in a minimal build with sparse tensors disabled.	2023-08-23 16:01:22 +10:00
BoarQing	d21a2f064b	[VITISAI] fix compile error for onnxruntime (#17252 ) ### Description Updated the code to pass in the missing parameter ### Motivation and Context Compile error. See https://github.com/microsoft/onnxruntime/issues/17139 Co-authored-by: Yueqing Zhang <yueqingz@amd.com>	2023-08-22 22:40:39 -07:00
Ashwini Khade	56102ecbdd	On-Device Training - Enable loading from buffer (#16417 )	2023-08-22 19:59:32 -07:00
Edward Chen	ae62d752d6	Prevent GSL_SUPPRESS arguments from being modified by clang-format (#17242 ) Prevent `GSL_SUPPRESS` arguments from being modified by clang-format and update existing usages. clang-format was changing something like `GSL_SUPPRESS(r.11)` to `GSL_SUPPRESS(r .11)`. For some compilers (e.g., clang), the `gsl::suppress` attribute takes a quoted string argument. We don't want to insert spaces there.	2023-08-22 18:26:53 -07:00
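
As an illustration of the axis fix described in 7c98f45928 (#17255) above, here is a minimal Python sketch of how an `axis` attribute could be re-derived once a scalar-index Gather has been moved in front of Softmax. The function name `adjust_axis_after_gather` and its parameters are hypothetical and are not taken from the ONNX Runtime source; the sketch only shows the rank arithmetic, under the assumption that the Gather removes exactly one dimension.

```python
def adjust_axis_after_gather(axis: int, input_rank: int, gather_axis: int) -> int:
    """Hypothetical helper: return the non-negative axis to use after a
    scalar-index Gather has removed dimension `gather_axis` from a tensor
    that originally had `input_rank` dimensions."""
    for name, value in (("axis", axis), ("gather_axis", gather_axis)):
        if not -input_rank <= value < input_rank:
            raise ValueError(f"{name}={value} is out of range for rank {input_rank}")

    # Normalize negative axes against the original rank.
    norm_axis = axis % input_rank
    norm_gather = gather_axis % input_rank

    if norm_axis == norm_gather:
        # The dimension being normalized is the one the Gather removes, so the
        # Gather cannot be moved in front of the op in this case.
        raise ValueError("Gather removes the axis the op operates on")

    # Dimensions after the removed one shift left by one; earlier ones stay put.
    return norm_axis - 1 if norm_axis > norm_gather else norm_axis


# Rank-4 input, Softmax on the last axis (3), Gather removes axis 1:
# the rank-3 result must use axis 2, which lies inside the valid range [-3, 2].
print(adjust_axis_after_gather(3, 4, 1))
```

This matches the shape-inference failure quoted in that commit, where a stale `axis` of 3 was rejected for a rank-3 tensor. For LayerNormalization, which normalizes over all dimensions from `axis` onward, a real implementation would need an additional check that the gathered dimension is not one of the normalized dimensions; the sketch leaves that out.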


9495 commits