onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-13 18:08:13 +00:00

Author	SHA1	Message	Date
Yi-Hong Lyu	33e883fbc4	Fix the doxygen error (#20515 ) Fix onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637: error: argument 'session' of command @param is not found in the argument list of ``` OrtApi::AddExternalInitializersFromFilesInMemory(OrtSessionOptions *options, const char *const *external_initializer_file_names, char *const *external_initializer_file_buffer_array, const size_t *external_initializer_file_lengths, size_t num_external_initializer_files) ```	2024-04-30 11:45:03 -07:00
Tianlei Wu	9f0fae29e8	[CUDA] Add SparseAttention operator for Phi-3-small (#20216 ) ### Description Add CUDA implementation for block sparse attention for Phi-3-small. Block sparse attention was proposed in [Sparse Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with a different sparse layout. In Phi-3-small, the sparse layout is static, and works with unidirectional (causal) attention. Compared to dense attention, the benefit of block sparse attention is to speed up both training and inference. It can also save memory and thus support longer context lengths. - [x] Add operator spec and shape inference - [x] Symbolic shape inference - [x] Refactor GroupQueryAttention to expose common kernels for kv cache concatenation, q/k/v transpose etc. - [x] Add cuda kernel to convert block mask to CSR format - [x] Add cuda kernel to generate position ids - [x] Add compile script and template files to convert triton kernel to cubin and dispatcher. - [x] Add triton kernel v1 for prompt - [x] Add triton kernel v2 for token generation and support padding - [x] Update IO Binding Helper to allow buffer sharing. - [x] Test relevance - [x] Test performance ### Performance Tested on A100-SXM4-80GB with `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8` We compare sparse attention to the corresponding GQA with a local attention window size of 1024, or GQA with dense causal attention. 
Average latency in milliseconds (for fused attention kernel used in prompt prefilling):

seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748

Average latency in milliseconds (for fused attention kernel used in token generation):

past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530 | 0.0592 | 0.0988
512 | 0.0445 | 0.0447 | 0.1150
1024 | 0.0634 | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146

We can see that the kernel for token generation still has room to improve.

#### Limitations

Only right-side padding and unidirectional attention are supported. The following are not supported in the first version: (1) Packed mode like PackedMultiHeadAttention, where padding has been removed from the input. (2) Paged attention. (3) Bidirectional attention. (4) GPU compute capability other than 8.0, 8.6 and 8.9. (5) Left-side padding. Some of these limitations will be removed in the future (maybe in a new operator).	2024-04-30 09:06:29 -07:00
Yi-Hong Lyu	b2481e3602	Bump up version in main from 1.18.0 to 1.19.0 (#20489 ) Bump up version in main from 1.18.0 to 1.19.0 since the release branch has been cut. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-04-29 20:21:41 -07:00
Yulong Wang	b1085b51ca	[js/web] update README (#20492 ) ### Description Update README.md in /js/web/ - update compatibility table - update links to onnxruntime.ai	2024-04-29 17:56:23 -07:00
Chi Lo	a1558fe117	[TensorRT EP] Make TRT EP use priority-based topo sort (#20512 ) This PR is needed for https://github.com/microsoft/onnxruntime/pull/20411 to make sure TRT EP use priority-based topo sort for consistency across TRT EP.	2024-04-29 16:00:43 -07:00
Rachel Guo	8c31f27dd1	Catalyst nuget package .NET changes only (#20424 ) ### Description https://github.com/microsoft/onnxruntime/pull/20418 Add back Catalyst changes only for now. Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2024-04-29 15:39:48 -07:00
Satya Kumar Jandhyala	99b0e19f11	[JS/WebGPU] MatMulNBits remove unnecessary condition (#20396 ) Distribute writing-to-output work over all threads in MatMulNBits.	2024-04-29 14:27:21 -07:00
winskuo-quic	509cbcae6e	[QNN EP] Support prelu fp16 (#20428 ) ### Description Originally, Prelu in QNN failed when the input was fp16 and alpha was fp32. QNN requires alpha to be fp16 when the input is fp16. This can be resolved by casting alpha to fp16 and passing it to QNN. ### Motivation and Context Makes QNN Prelu support the fp16 case. --------- Co-authored-by: Hector Li <hecli@microsoft.com>	2024-04-29 13:26:51 -07:00
Edward Chen	358f5bb022	Update regex to match correct pattern. (#20483 ) In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.	2024-04-29 10:43:31 -07:00
George Wu	49b2bebe85	[qnn ep] include qnn sdk in onnxruntime-qnn python whl (#20485 ) script changes to include qnn sdk libs with onnxruntime-qnn python package.	2024-04-29 09:44:54 -07:00
guyang3532	3e4db2c686	Fuse Cast + SoftmaxCrossEntropyLossInternal (#20334 ) ### Description Fuse Cast + SoftmaxCrossEntropyLossInternal to SoftmaxCrossEntropyLossInternal.	2024-04-29 14:12:10 +08:00
Scott McKay	923b0ef323	Run fuzz testing before the CG task cleans up the build directory (#20500 ) ### Description Update order of steps ### Motivation and Context Fix CI	2024-04-29 16:02:53 +10:00
zz002	50e41984b0	[VitisAI] Solve the problem that gsl cannot be found when compiling under linux (#20466 ) ### Description [VitisAI] Solve the problem that gsl cannot be found when compiling under linux Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2024-04-28 20:56:16 -07:00
pengwa	f31486c8b7	Disable test_aten_conv_bf16 to unblock amd ci (#20499 )	2024-04-29 11:38:40 +08:00
pengwa	b789677e65	Fix Onnxruntime-TVM CI (#20498 ) ### Description ``` tvm_execution_provider.cc denormal.cc D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): error C2660: 'onnxruntime::GraphViewerToProto': function does not take 4 arguments [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_providers_tvm.vcxproj] D:\a\onnxruntime\onnxruntime\onnxruntime\core\graph\graph_proto_serializer.h(10,6): see declaration of 'onnxruntime::GraphViewerToProto' D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): while trying to match the argument list '(const onnxruntime::GraphViewer, onnx::GraphProto, bool, bool)' cpuid_uarch.cc get_execution_providers.cc abi_session_options.cc bias_dropout_fusion.cc if.cc ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-04-29 11:03:57 +08:00
Satya Kumar Jandhyala	736cbb3925	[JS/WebGPU] Support fp16 in Attention by performing the computation in fp32. (#20486 ) ### Description Perform the computation in fp32 and convert the result to fp16 at the end.	2024-04-27 08:30:26 -07:00
Rachel Guo	ff505b9f44	Follow up fix for #20472 (#20484 ) ### Description Error: Artifact name input: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss) ##[error]Artifact name is not valid: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss). It cannot contain '\', /', "', ':', '<', '>', '|', '*', and '?' The date does not show up correctly in the artifact name. Use the predefined pipeline variable BuildNumber instead, which also serves as a timestamp. ### Motivation and Context RN CI failure --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2024-04-27 13:42:24 +10:00
Scott McKay	321c1e5730	Use flatbuffers::String::str instead of c_str. (#20487 ) ### Description flatbuffers::String::c_str returns a pointer that may not be null terminated. This causes a warning when building on an A100 with gcc 11. Not clear why other builds with gcc 11 (e.g. Ubuntu 22.04 WSL) don't generate a warning. Either way it's safer to use str() as that constructs a std::string with data() and size(). Unclear if this is an issue in reality as it's reading from the flatbuffer and most likely didn't write out an empty string in order to save space. There's no perf need to use c_str instead of str, and in LOAD_STR_FROM_ORT_FORMAT we need to convert the return value to a std::string anyway. ```c++ struct String : public Vector<char> { const char *c_str() const { return reinterpret_cast<const char *>(Data()); } std::string str() const { return std::string(c_str(), size()); } ``` ``` inlined from ‘onnxruntime::common::Status onnxruntime::fbs::utils::LoadAttributeOrtFormat(const onnxruntime::fbs::Attribute&, onnx::AttributeProto&, std::unique_ptr<onnxruntime::Graph>&, onnxruntime::Graph&, onnxruntime::Node&, const onnxruntime::OrtFormatLoadOptions&, const onnxruntime::logging::Logger&)’ at /frdong_data/onnxruntime/onnxruntime/core/graph/graph_flatbuffers_utils.cc:385:3: /usr/include/c++/11/bits/char_traits.h:399:32: error: ‘long unsigned int __builtin_strlen(const char*)’ reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread] ``` ### Motivation and Context Fix build error on A100	2024-04-27 13:41:38 +10:00
Maximilian Müller	a0775d74a1	Fix: Shared lib tests fail during build for CUDA,TRT,DML (#20453 ) The defines for these tests have to be in the same order. If we check for TRT -> CUDA -> DML, we cannot reverse that order in later defines, as we might want to build for multiple EPs. +@PatriceVignola	2024-04-26 20:25:24 -07:00
liqun Fu	2f5fe4500d	mlas nbit matmul requires packed_b (#20482 ) ### Description The MLAS MatMulNBits implementation requires packed B, so add a condition for this. This logic needs to be updated if that requirement changes. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2024-04-26 17:18:53 -07:00
Johan MEJIA	619ceeed9e	[python] MinMax calibration per channel (#19285 ) ### Description Following the issue #19223, introduce `per_channel` attribute in `MinMaxCalibrater` to develop per-channel calibration. If required, this new functionality should be implemented in the other _Calibraters_ (`HistogramCalibrater`, `EntropyCalibrater`, ...). ### Motivation and Context - This is the first part to solve #19223's proposal. - If per channel calibration was allowed, the quantization algorithm could be updated to improve quantization performance, i.e. weights quantization per channel and not per tensor. That is why it would be interesting to have a 'per_channel' option in any 'Calibrater' class to produce a set of calibration vectors instead of a single scalar.	2024-04-26 12:40:49 -07:00
Wanming Lin	ddd4e8c3e3	[WebNN EP] Improve activation fusion (#20320 ) - Create a common util to get supported activation set - Fuse activation to BatchNormalization if possible	2024-04-26 08:16:55 -07:00
Rachel Guo	88904b9220	Add unique identifier to e2e_test_logs artifacts in react-native-ci.yml (#20472 ) ### Description As title.	2024-04-26 22:20:10 +10:00
Scott McKay	b842effa29	Fix some x86 build warnings in training code (#20451 ) ### Description Fix some misc build warnings from x86 Windows build	2024-04-26 20:29:21 +10:00
Scott McKay	aa27dadd1c	Use download.onnxruntime.ai in podspec (#20474 ) ### Description Update to more generic url	2024-04-26 20:28:54 +10:00
Hector Li	d2d4639ddb	fix the build issue for Win Arm64 Release build (#20475 ) ### Description Fix the build error for Win ARM64 Release build. graph_transform_test.cc(1,1): error C1128: number of sections exceeded object file format limit: compile with /bigobj [D:\build\Windows\Release\onnxruntime_test_all.vcxproj] ### Motivation and Context Fix issue: https://github.com/microsoft/onnxruntime/issues/20406	2024-04-25 22:08:19 -07:00
liqun Fu	cc26b2dac2	Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels (#20163 ) ### Description
```
Avx2:
Int8        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   90.96       25.15         -72%                   7.65          11.71           53%
Blklen32:   90.73       48.55         -46%                   7.86          14.28           81%
Blklen64:   89.49       68.84         -23%                   8.30          15.78           90%
Blklen128:  87.38       78.37         -10%                   7.90          16.05           103%
Blklen256:  89.45       82.36         -7%                    8.30          16.56           99%

Fp32        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   91.36       105.18        15%                    7.57          9.52            25%
Blklen32:   89.30       105.99        18%                    7.65          9.68            26%
Blklen64:   89.53       101.41        13%                    7.97          9.84            23%
Blklen128:  85.23       99.71         16%                    7.86          10.39           32%
Blklen256:  88.46       97.94         10%                    8.32          10.23           22%

Avx512vnni:
Int8        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   132.18      21.56         -83%                   10.34         11.48           11%
Blklen32:   168.28      43.69         -74%                   11.85         14.73           24%
Blklen64:   201.81      60.29         -70%                   12.36         15.47           25%
Blklen128:  194.92      57.04         -71%                   13.03         14.67           12%
Blklen256:  218.76      70.20         -68%                   13.33         16.31           22%

Fp32        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   102.81      92.74         -9%                    8.41          9.18            9%
Blklen32:   109.49      97.08         -11%                   8.83          11.51           30%
Blklen64:   104.13      101.57        -2%                    9.32          12.00           28%
Blklen128:  108.45      103.69        -4%                    9.58          12.45           29%
Blklen256:  109.43      106.43        -2%                    9.19          12.2            32%
```
--------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: liqunfu <liqun.fu@microsoft.com> Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>	2024-04-25 21:30:50 -07:00
Chi Lo	bbc30feb63	Make execution order an option for GraphViewerToProto() (#20411 ) Current issue: Once ORT gets the capability from EP's GetCapability(), it creates a graph viewer based on the capability as below: `viewers.push_back(std::make_unique<GraphViewer>(graph, cur_capability.sub_graph));` or see the code [here](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/graph_partitioner.cc#L458). At this point, the graph viewer may generate the wrong order of `nodes_in_topological_order_` when calling [Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107), so that during EP Compile(), the EP might create a model proto with the wrong node ordering from the graph viewer when calling [GraphViewerToProto()](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_proto_serializer.cc#L37), because of `nodes_in_topological_order_`. This is a problem for TRT EP when refitting weights to the "weightless" engine: the engine is built from the model proto provided by TRT EP, while the weights are in the original onnx model. The model proto and the original onnx model are not the same in terms of node ordering, which makes TRT complain when refitting. The original model (subgraph of ResNet50): [image] The serialized model proto generated by TRT EP (the highlighted part has the wrong node order compared to the original model): [image]
Solution 1: Change the default comparator to `NodeCompare::operator() {return n1->Index() > n2->Index();}` The root cause of the different node order between the original model and the EP-generated model is the graph viewer [generating](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107) a different `nodes_in_topological_order_`. Modifying `NodeCompare::operator()` for sorting can fix the problem. `NodeCompare::operator()` is used in [Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1760), where the input nodes of the current node are [sorted](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1802) based on node index. Because the sorted nodes are pushed onto a stack, which later determines the final topological node order in a "first in, last out" manner, the node with the larger index should be pushed onto the stack first, so that we get a topological node order in which nodes with smaller indices come first. Solution 2 (this PR uses this solution): Use priority-based BFS for topological sort in GraphViewerToProto().	2024-04-25 16:07:36 -07:00
Satya Kumar Jandhyala	21b3cbc3af	[WIP][JS/WebGPU] Inputs Key and Value could be 4-dims. (#20470 ) ### Description The Key and Value inputs could be 4-dims	2024-04-25 13:33:46 -07:00
Edward Chen	2c19db0af1	Put x64 specific benchmark code into ifdefs. (#20456 )	2024-04-25 12:33:12 -07:00
Frank Dong	227c4419fc	add bf16 support for few ops (#20385 ) ### Description Add bf16 support for the following ops: ConstantOfShape Exp Erf convolution PythonOp ### Motivation and Context The phimm model runs in bf16, so ORT needs to support bf16 for the ops above to work with phimm in bf16	2024-04-25 11:28:34 -07:00
Yi Zhang	464f199b95	Extend mac package jobs time out limit (#20459 )	2024-04-25 10:13:13 -07:00
Yi-Hong Lyu	edffa2a180	Optimize MlasComputeSoftmax with prefetch (#20393 ) The prefetch instruction (_mm_prefetch) is used to anticipate memory accesses by prefetching the next row of the input buffer. This optimization is designed to reduce the impact of memory latency, thereby enhancing the performance of the MlasComputeSoftmax function. As a result, the worst-case performance of the OCR model has improved by approximately 50ms, which equates to a 3% improvement.	2024-04-25 08:28:59 -07:00
Chi Lo	a077330c3e	[TensorRT] adapt for TRT lib name change after TRT 10 GA (#20445 ) For TensorRT 10 GA onwards, the TensorRT libraries will have major version appended to the end on Windows, for example, nvinfer_10.dll, nvinfer_plugin_10.dll, nvonnxparser_10.dll ... Change cmake file accordingly.	2024-04-24 21:46:54 -07:00
Yi Zhang	e5947f5729	Two improvements in pipelines (#20449 ) ### Description 1. Update the image name so that the docker image is not overwritten. There was a mistake: variables.CUDA_VERSION_MAJOR is always empty `14fcf0a52d/tools/ci_build/github/azure-pipelines/stages/nuget-linux-cuda-packaging-stage.yml (L120)` 2. Set one artifact name as a variable to make the job rerunnable	2024-04-25 10:15:40 +08:00
Xavier Dupré	218b6b0a73	Fix missing argument when calling _get_quantize_input_nodes (#20245 ) ### Description The current code is calling one method with a missing argument. ### Motivation and Context It breaks Olive's unittests. --------- Co-authored-by: Xavier Dupré <xavier.dupre@gmail.com>	2024-04-25 00:46:48 +02:00
Yulong Wang	a5182a2ef3	[js/web] update test condition for '--force-localhost' (#20450 ) ### Description Fixes the NPM packaging pipeline failure.	2024-04-24 12:14:03 -07:00
Edward Chen	9cc5badc49	Fix Objective-C static analysis warnings. (#20417 ) Replace most usages of [NSString stringWithUTF8String:] with checked helper function. The issue is that the former can return nil.	2024-04-24 11:48:29 -07:00
maggie1059	dfd4bce36e	Use compute queues by default in DML EP (#20438 ) ### Description We originally used compute queues only for compute-only devices; this change sets the default for DX12 devices to use compute queues as well. ### Motivation and Context There have been issues with TDRs occurring when using the current default queues, which don't occur on compute queues.	2024-04-24 10:44:16 -07:00
Xavier Dupré	f78215adad	Fix quantization tools for issue #19529 (#19591 ) ### Description Fix issue #19529, the code was using a variable loop outside a loop.	2024-04-24 19:16:27 +02:00
Scott McKay	a46bab6364	Update podspec url to use AFD hostname (#20452 ) Update to use AFD url when generating podspec	2024-04-24 09:37:24 -07:00
Satya Kumar Jandhyala	ae78cdb5d7	[JS/WebGPU] MultiheadAttention bugfix (#20447 ) ### Description Fixed pastkey, key and pastvalue, value concatenation condition and fixed index error. Added new test cases.	2024-04-24 08:43:14 -07:00
Guenther Schmuelling	33d5ea39b3	[js/webgpu] fixes for fp16 attention (#20440 )	2024-04-24 08:01:28 -07:00
Xavier Dupré	80213a9e66	Add implementation for ScatterND (#19540 ) ### Description onnxruntime switches to CPU for ScatterND after opset 13. This extends the implementation of higher opsets.	2024-04-24 14:08:50 +02:00
Rachel Guo	14fcf0a52d	Support visionos build (#20365 ) ### Description This PR supports a build of onnxruntime.xcframework for xros/xrsimulator for visionos via the build command `python3 tools/ci_build/github/apple/build_apple_framework.py --config Release/Debug tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`. Officially including visionos in the ios cocoapods package and testing it in CI would require separate work to upgrade the Xcode version and to upgrade the macOS CI agent to macos-13-arm64 or higher. ### Motivation and Context visionos support: https://github.com/microsoft/onnxruntime/discussions/19313 --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2024-04-23 18:15:07 -07:00
Adam Louly	4ce7bbf6f1	Add LayerSpec Support to ORTPipelineModule (#20410 ) ### Description In Deepspeed's Pipeline Parallel Implementation, there is a class used to instantiate the object after it's moved to the device and assigned in a stage. This approach helps reduce peak memory usage. In this PR, we're adding support to ORT for wrapping this LayerSpec.	2024-04-23 17:57:08 -07:00
Yulong Wang	5055dc0aa8	[js/web] add diagnose log for chrome (#20439 ) ### Description Add logs to further diagnose the pipeline issue.	2024-04-23 17:18:54 -07:00
Maximilian Müller	b4e50758c0	Fix shape conv fuse opt (#20282 ) Fix: - Multiple Convs into an Add+Relu will fuse the op although intermediates are needed ![image](https://github.com/microsoft/onnxruntime/assets/44298237/0c85a30c-5f41-4e62-ae2e-f41eada6c2c3) - Also fixes an issue with Shape Initializers Merge as input, which occurs when the input initializer is the same across multiple nodes but not all nodes are Shape nodes.	2024-04-23 16:19:57 -07:00
Yulong Wang	8f53957bcf	[js/web] add "browser" field to support parcel v2 (#20422 ) ### Description As described in the latest discussion in #19915, parcel v2 without the [new resolver](https://parceljs.org/blog/v2-9-0/#new-resolver) will not work correctly with onnxruntime-web. There are still users who use parcel with the default resolver, so add the deprecated field "browser" back for backward compatibility. This PR also corrects the "main" field, which is for the old resolver for Node.js.	2024-04-23 13:10:11 -07:00
Yulong Wang	13bda11583	[Node.js binding] Fix install script (#20416 ) ### Description Fix a few bugs of the install script of onnxruntime-node package. This change is integrated from branch `rel-1.17.3` (#20397)	2024-04-23 13:01:16 -07:00
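
Commit bbc30feb63 in the list above describes using a priority-based topological sort in GraphViewerToProto() so that the serialized node order matches the original model. The following is a minimal sketch of that general idea, not the onnxruntime implementation: Kahn's algorithm with a min-heap as the ready set, under the assumption that a node's priority is simply its index (smaller first). The function name `PriorityTopoSort` and the edge-list input format are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper, not part of onnxruntime.
// Kahn's algorithm where the set of ready nodes is a min-heap keyed on node
// index, so whenever several nodes are ready the smallest index is emitted
// first. `edges` holds (producer, consumer) pairs over nodes 0..num_nodes-1.
std::vector<int> PriorityTopoSort(int num_nodes,
                                  const std::vector<std::pair<int, int>>& edges) {
  std::vector<std::vector<int>> consumers(static_cast<std::size_t>(num_nodes));
  std::vector<int> in_degree(static_cast<std::size_t>(num_nodes), 0);
  for (const auto& edge : edges) {
    consumers[edge.first].push_back(edge.second);
    ++in_degree[edge.second];
  }

  // Min-heap: top() is the ready node with the smallest index.
  std::priority_queue<int, std::vector<int>, std::greater<int>> ready;
  for (int n = 0; n < num_nodes; ++n) {
    if (in_degree[n] == 0) ready.push(n);
  }

  std::vector<int> order;
  order.reserve(static_cast<std::size_t>(num_nodes));
  while (!ready.empty()) {
    const int n = ready.top();
    ready.pop();
    order.push_back(n);
    // A consumer becomes ready once all of its producers have been emitted.
    for (int c : consumers[n]) {
      if (--in_degree[c] == 0) ready.push(c);
    }
  }

  // Nodes left with a non-zero in-degree mean the input was not a DAG.
  if (order.size() != static_cast<std::size_t>(num_nodes)) {
    throw std::invalid_argument("graph contains a cycle");
  }
  return order;
}
```

For a graph with edges 0→2, 1→2, 0→3, 2→4 and 3→4, this sketch returns the order 0, 1, 2, 3, 4. The heap breaks ties deterministically, which is the property the commit message relies on to keep the node order aligned with node indices.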

11003 commits