onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-24 02:47:54 +00:00

Author	SHA1	Message	Date
Ryan Hill	310273cbe4	BeamScorer to use contiguous arrays for BeamHypotheses (#15923 ) ### Description Change BeamHypotheses to not use a std::priority_queue; instead, all BeamHypotheses share a single buffer, and each gets a small slice of it. As the beam count is really small (typically 4 or 8, max of 32) and the array size is fixed, BeamHypotheses just does a sorted insert into an array. This also allows the BeamHypotheses inside the BeamSearchScorer to be a single fixed allocation instead of an onnxruntime::FastAllocVector. ### Motivation and Context The goal is to simplify the memory usage and make the code more easily ported to CUDA.	2023-05-13 14:17:45 -07:00
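The sorted insert into a fixed-size array described in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the C++ code from the PR: the class name `FixedBeamHypotheses` and its methods are hypothetical, and a preallocated list stands in for the slice of the shared buffer.

```python
class FixedBeamHypotheses:
    """Keeps the best `capacity` hypotheses, highest score first, in a fixed array."""

    def __init__(self, capacity):
        # Preallocated slots; stands in for a slice of one shared buffer (assumption).
        self.items = [None] * capacity
        self.used = 0  # number of occupied slots

    def add(self, score, tokens):
        # When full, reject candidates that are no better than the current worst.
        if self.used == len(self.items) and score <= self.items[-1][0]:
            return False
        # Start at the first free slot, or overwrite the worst entry when full.
        pos = self.used if self.used < len(self.items) else self.used - 1
        # Shift lower-scoring entries one slot toward the end.
        while pos > 0 and self.items[pos - 1][0] < score:
            self.items[pos] = self.items[pos - 1]
            pos -= 1
        self.items[pos] = (score, tokens)
        self.used = min(self.used + 1, len(self.items))
        return True

    def scores(self):
        return [item[0] for item in self.items[:self.used]]
```

With a beam count this small, the linear shift is cheap, and no allocation happens after construction.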
Dmitri Smirnov	896a963492	Adjust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921 ) ### Description This PR partially reverts changes introduced in https://github.com/microsoft/onnxruntime/pull/15643 We make the two APIs always return std::string in UTF-8. We also move the entry points from OrtApiBase to OrtApi to make them versioned. ### Motivation and Context `GetVersionString` always returns x.y.z numbers that are not subject to internationalization. `GetBuildInfoString` can hold international chars, but UTF-8 should be fine to contain those. We prefix them with u8"" in case the compiler default charset is not UTF-8. Furthermore, creating platform dependent APIs is discouraged. `ORTCHAR_T` is platform dependent and was created for paths only. On non-unix platforms it would still produce a `std::string` that can only contain UTF-8. The API was introduced after the latest release, and can still be adjusted.	2023-05-13 13:45:07 -07:00
RandySheriffH	9fe6d58857	Separate execution plan serialization for a new PR. (#15916 ) Remove serialization for execution plans, will follow up with another PR along with proper unit tests. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-13 10:39:54 -07:00
Chester Liu	984dd02df3	Update optimize_pipeline.py to use __name__ detection (#15866 ) ### Description Use `__name__` detection in `optimize_pipeline.py`. ### Motivation and Context It prevents unwanted execution of `main` when importing the file.	2023-05-12 20:43:29 -07:00
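The `__name__` detection mentioned in the commit above is the standard Python entry-point guard. A minimal sketch follows; the body of `main` is a placeholder and is not taken from `optimize_pipeline.py`.

```python
def main():
    # Placeholder for the real pipeline work (assumption); returns a value
    # so that importing code can call it explicitly and inspect the result.
    return "pipeline ran"


if __name__ == "__main__":
    # True only when the file is executed as a script. On import, __name__
    # is the module name instead, so main() is not triggered.
    print(main())
```

Importing the module now only defines `main`; nothing runs until a caller invokes it.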
Yulong Wang	9328a0f955	[js/webgpu] run test on chrome instead of chrome canary for webgpu (#15902 ) ### Description WebGPU is released in Chrome v113, so the test CLI no longer needs to use Chrome Canary.	2023-05-12 15:47:59 -07:00
Maximilian Müller	143551092f	fix: setting builder optimization level to TRT 8.6 default (#15897 ) The actual released default level is 3 and not the previously used 2. Just a small sample of the effects: ![Screenshot 2023-05-10 at 15 49 55](https://github.com/microsoft/onnxruntime/assets/44298237/5a694446-22c0-4943-9ddf-80670781878f)	2023-05-12 13:29:30 -07:00
Numfor Tiapo	b473d3eee5	Reenable ConstantOfShape TypeTests (#15910 ) ConstantOfShape TypeTests were previously broken due to a bug where the case for the uint64 test was being passed an int64_data_size. Changing the data type to uint64_data_size fixes the bug. TensorProto Int8 and Int16 tests are reenabled since they are now passing.	2023-05-12 11:28:57 -07:00
petermcaughan	e5189330d5	Address OOM Issue when exporting Whisper (#15880 ) ### Description Remove attention_mask from unnecessary code paths in the whisper export process. ### Motivation and Context The current export script frequently hits an OOM error when exporting whisper-large. Memory profiling shows that this is a result of generating dummy inputs for the `encoder_attention_mask` input for a model pass during exporting - in whisper-large, this dummy tensor can be around 20GB in size. `encoder_attention_mask` is ultimately a dummy input - it's just there to satisfy certain BeamSearch requirements. Thus, we're currently creating a 20GB tensor and passing it to the model, which then discards the input anyway. By removing the code path to generate a dummy encoder_mask tensor, we can reduce the memory requirements to export whisper substantially, while keeping the BeamSearch checks satisfied. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-05-12 11:23:07 -07:00
Numfor Tiapo	000a600080	DML EP Mark ImageScalar Test As 'won't fix' (#15894 ) ImageScalar is an experimental operator added in ONNX 1.2.1 and removed in ONNX 1.5 so it's no longer in use. Changing the comment to won't fix. --------- Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2023-05-12 10:11:51 -07:00
Hector Li	1bebc88069	[SNPE EP] Add option to enable SNPE init caching feature (#15917 ) ### Description [SNPE EP] Add option to enable SNPE init caching feature ### Motivation and Context To save model initialization time	2023-05-12 07:57:11 -07:00
RandySheriffH	7c4e8267e7	Implement openAI endpoint invoker for nuget (#15797 ) Implement openAI audio endpoint, and enable nuget packaging. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-11 22:04:02 -07:00
Yi Zhang	0e7ae13e74	Run Linux GPU tests in docker container (#15872 ) ### Description Run Linux GPU tests in docker container ### Motivation and Context	2023-05-12 06:29:22 +08:00
Yufeng Li	902c5f53ae	add cutlass fmha support in PackedAttention (#15838 ) ### Description Support cutlass fMHA in PackedAttention. Though we have an fMHA trt kernel, it doesn't support relative bias position. Cutlass fmha supports RBP and also supports lower-end GPUs (5.3, 6.x). ### Motivation and Context	2023-05-11 13:47:15 -07:00
Nat Kershaw (MSFT)	27b2815d42	Update publish-csharp-apidocs.yml .NET 5->6 (#15854 )	2023-05-11 13:35:53 -07:00
Jian Chen	1a73d61829	Update eigen to 3.4 and remove the eigen from git submodule (#15875 ) ### Description Update eigen to 3.4 and remove the eigen from git submodule ### Motivation and Context We need to have eigen 3.4 for c++20	2023-05-11 11:56:59 -07:00
Changming Sun	7c58d013aa	Remove Ubuntu 18.04 usages (#15781 ) ### Description Remove Ubuntu 18.04 usages because it will be EOL this month. ### Motivation and Context	2023-05-11 11:44:00 -07:00
sdegrande	cf062dbdb1	FlatBuffers fails to compile with gcc13. (#15787 ) When building the FlatBuffers dependency, gcc13 emits a stringop-overflow warning. Because all warnings are turned into errors, this fails the compilation of FlatBuffers and, as a consequence, the build of onnxruntime. This commit applies a patch to FlatBuffers' CMakeLists.txt that adds -Wno-error=stringop-overflow to CMAKE_CXX_FLAGS.	2023-05-11 11:20:19 -07:00
Yulong Wang	756cf3a76f	increase web CI timeout (#15876 ) ### Description The CI is extremely slow at downloading source code (~1MB/sec), so the web CI timed out. This is blocking the PR/checks. Increase the timeout temporarily.	2023-05-11 11:17:46 -07:00
Ryan Hill	e15ab78052	Ryanunderhill/beamsearch simplify (#15883 ) ### Description Simplify some sections of code by removing some extra gsl::span conversions and passing parameter packs by an existing structure vs directly. ### Motivation and Context While stepping through the code, I noticed parts that could be simplified. Simplifying then helped me understand it further.	2023-05-11 09:50:14 -07:00
RandySheriffH	657ab2f43c	Sync between parent node and subgraph (#15757 ) In https://github.com/microsoft/onnxruntime/issues/14691, we found a mis-reuse of GPU memory between NonZero(GPU) and Identity(GPU), which is a subgraph node in If(CPU). NonZero gives a GPU output consumed by Transpose(GPU), after which that GPU output is marked as free in BFCArena and is soon reused by Identity(GPU) in a subgraph of If(CPU). However, NonZero(GPU) and Identity(GPU) run on separate cuda streams, and there is no synchronization because the Identity node is in a subgraph of If(CPU). This means Identity(GPU) can write to the memory while Transpose(GPU) is reading from it. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-11 09:28:04 -07:00
pengwa	fed52053a7	Refine a bit (on device training) (#15803 ) ### Few minor refinements: - Simplify ParameterOptimizerState a bit - Use inlined containers - Remove GetStateDict APIs - Re-enable cuda test for lr scheduler	2023-05-10 20:36:13 -07:00
pengwa	346ec12377	Fix no contrib tests in TVM CI (#15895 ) ### Fix no contrib tests in TVM CI Linux_CI / Onnxruntime-TVM Windows_CI / Onnxruntime-TVM ``` [----------] Global test environment tear-down [==========] 2850 tests from 186 test suites ran. (51340 ms total) [ PASSED ] 2820 tests. [ FAILED ] 30 tests, listed below: [ FAILED ] GraphTransformationTests.LayerNormFusionTest [ FAILED ] GraphTransformationTests.LayerNormWithCastFusionTest_2 [ FAILED ] GraphTransformationTests.LayerNormWithCastFusionTest_3 [ FAILED ] GraphTransformationTests.LayerNormWithCastFusionTest_4 [ FAILED ] GraphTransformationTests.LayerNormWithCastFusionTest_5 [ FAILED ] GraphTransformationTests.SimplifiedLayerNormFusionTest [ FAILED ] GraphTransformationTests.SimplifiedLayerNormWithCastsFusionTestCudaEp [ FAILED ] GraphTransformationTests.SkipLayerNormFusionTest [ FAILED ] GraphTransformationTests.SkipLayerNormFusionWithCastTest [ FAILED ] GraphTransformationTests.SkipLayerNormFusion_Input_Output_Check [ FAILED ] GraphTransformationTests.SkipLayerNormFusion_NoBeta [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat1 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat2 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat3 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat3_OpSet13 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat3NoCast [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat3NoCast_OpSet13 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat4 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat5 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat5_OpSet13 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat6 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat6_OpSet13 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat7 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat7_OpSet13 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat8 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat8_OpSet13 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat9 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionFormat9_OpSet13 [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionMultiple [ FAILED ] GraphTransformationTests.EmbedLayerNormFusionMultiple_OpSet13 ``` Looks related to https://github.com/microsoft/onnxruntime/pull/15844. ### Motivation and Context	2023-05-11 09:14:13 +08:00
George Nash	19f2cc6fb6	Check the AMX tile configuration is unchanged (#15387 ) Don't assume the AMX tile configuration will always remain unchanged. It is possible that other code will change the AMX tile configuration. This change will read the current tile configuration: - if the tile is un-configured it will be configured - if the tile is configured but does not match the expected configuration it will be configured for the expected configuration This resolves issues seen in unit tests when building OneDNN ep. ### Description ### Motivation and Context --------- Signed-off-by: George Nash <george.nash@intel.com>	2023-05-10 15:50:59 -07:00
liqun Fu	ac9ae9f7c5	update onnx release 1.14 for docker files (#15680 ) ### Description this is for ort 1.15 release to work with onnx 1.14 It shall be merged after onnx 1.14 release and before ort 1.15 release. ### Motivation and Context --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-05-10 13:15:56 -07:00
Jian Chen	bd58109678	relax threshold of InternalNumericalCheck on cpu from 0.0001 to 0.0002 (#15879 ) ### Description relax threshold of InternalNumericalCheck on cpu from 0.0001 to 0.0002 ### Motivation and Context To fix a failed Unit Test.	2023-05-10 10:56:12 -07:00
Nat Kershaw (MSFT)	36c9ae0f58	Fix release version suffix for RC builds (#15865 )	2023-05-09 23:06:08 -07:00
Linnea May	95a4607dcf	User/linneamay/roi align 16 (#15812 ) ### Description Add registration for DML RoiAlign-16 and tests for new coordinate_transform_mode attribute. PR [7354](https://github.com/microsoft/onnxruntime/pull/7354) is still open to fix the CPU EP version, which is why there are skipped tests right now. That will be completed separately so that, for now, we can officially support opset16 with the next release. ### Motivation and Context --------- Co-authored-by: Linnea May <linneamay@microsoft.com> Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2023-05-09 21:56:41 -07:00
Patrice Vignola	c7b27f4486	[DML EP] Add Outer Product Einsum (#15850 ) This operator is used in some LLMs to compute the rotary embeddings.	2023-05-09 15:51:55 -07:00
Tianlei Wu	e0c1fa35a8	update stable diffusion script and doc (#15846 ) ### Description Update script: (1) change some float16 verbose logging to debug level. (2) Let requirements-cuda.txt include requirements.txt (3) Use an environment variable ORT_DISABLE_TRT_FLASH_ATTENTION=1 to avoid black image in 2.1 model. Update benchmark and doc. (4) Update document to include command lines to build ORT rocm from source. (5) Update optimize_pipeline.py so that user can disable packed qkv/kv from command line options. (6) Update document to use torch < 2.0 for onnx export. ### Motivation and Context	2023-05-09 15:29:13 -07:00
Sumit Agarwal	b473e3f3c6	[DML EP] Update DirectML version to 1.11.0 (#15858 ) ### Description - Update DML version to 1.11.0 - Disable Gemm+Softmax fusion ### Motivation and Context	2023-05-09 12:48:15 -07:00
Jian Chen	c2e631d356	Update Conv-Add-Relu Fusion Transformation (#15834 ) …n name to make it more intuitive. ### Description Update Conv-Add-Relu Fusion Transformation to handle additional case where NhwcFusedConv is present. ### Motivation and Context Handle additional case where NhwcFusedConv is present.	2023-05-09 11:12:49 -07:00
Yulong Wang	02d94bcc8e	[js/web] fix terser reserved symbols for worker (#15864 ) ### Description Due to the change in `3935cdcc57`, our minimizer needs to be updated to add "startWorker" to the reserved symbols.	2023-05-09 11:11:26 -07:00
Yulong Wang	357e6289be	[wasm] allow pull debug artifacts from script (#15859 ) ### Description allow pull debug artifacts from script `npm run pull:wasm` - to pull release artifacts `npm run pull:wasm:debug` - to pull debug artifacts	2023-05-09 11:00:08 -07:00
Ye Wang	475f661acd	use __hmul2 instead of __hmul2_rn (#15852 ) ### Description ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/15840	2023-05-09 10:54:00 -07:00
Jian Chen	34cb293c6b	Remove unused ADO YML pipeline template (#15857 ) ### Description Remove unused ADO YML pipeline template ### Motivation and Context Clean up and reduce our codebase.	2023-05-09 09:15:04 -07:00
JiCheng	686d42e6c8	layer_norm_fix (#15844 ) ### Description Fix bugs in LayerNorm fusion. Add more checks on ReduceMean axes and separate out the layernorm transform_test. ### Motivation and Context Our layernorm fusion pattern works only for axis=-1 currently. - For training scenario: The pattern produced erroneous results directly as it didn't handle "axes" and only assumed it's the default value. - For Inference: ~~We lost some opportunities to fuse layernorm.~~ ReduceMean has default axes 0 which means reduce on all dimensions	2023-05-09 22:09:00 +08:00
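The extra ReduceMean check described in the commit above can be pictured as a small predicate. This is a hypothetical Python sketch, not the fusion code from the PR: `reduces_last_axis_only` and its parameters are illustrative names, and it assumes that an absent `axes` attribute means reduction over all dimensions, as the commit message states.

```python
def reduces_last_axis_only(axes, rank):
    """Return True when a ReduceMean over `axes` reduces only the last dimension."""
    if axes is None:
        # No axes given: every dimension is reduced, which equals
        # "last axis only" solely for a rank-1 input.
        return rank == 1
    # Normalize negative axes so that -1 and rank - 1 compare equal.
    normalized = {a + rank if a < 0 else a for a in axes}
    return normalized == {rank - 1}
```

A fusion pass built on this idea would skip the pattern whenever the predicate is false instead of assuming the default.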
Wanming Lin	00b1e79e04	Support WebNN EP (#15698 ) Description: This PR intends to enable WebNN EP in ONNX Runtime Web. It translates the ONNX nodes by [WebNN API](https://webmachinelearning.github.io/webnn/), which is implemented in C++ and uses Emscripten [Embind API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#). Temporarily using preferred layout NHWC for WebNN graph partitions because of the restriction in the WebNN XNNPack backend implementation and the ongoing [discussion](https://github.com/webmachinelearning/webnn/issues/324) in the WebNN spec on whether WebNN should support both 'NHWC' and 'NCHW' layouts. No WebNN native EP, only for Web. Motivation and Context: Allow ONNXRuntime Web developers to access WebNN API to benefit from hardware acceleration. WebNN API Implementation Status in Chromium: - Tracked in Chromium issue: [#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291) - CPU device: based on XNNPack backend, and has been available on Chrome Canary M112 behind the "#enable-experimental-web-platform-features" flag for Windows and Linux platforms. Further implementation for more ops is ongoing. - GPU device: based on DML, implementation is ongoing. Open: - GitHub CI: WebNN currently is only available on Chrome Canary/Dev with XNNPack backend for Linux and Windows. This is an open question for reviewers: please help identify which GitHub CI should involve the WebNN EP and guide me in enabling it. Thanks!	2023-05-08 21:25:10 -07:00
Chen Fu	685e5b00f6	NhwcFusedConv: Add before Activation (#15837 ) ### Description Fp16 FusedConv and NhwcFusedConv. Fused Add operator should be performed BEFORE the activation operator. ### Motivation and Context Previous understanding of fused conv is incorrect.	2023-05-08 21:02:35 -07:00
pengwa	003c7d3e4d	Add CPU allocation test for multiple GPU distributed run (#15829 ) ### Add CPU allocation test for non-CPU devices distributed run When CUDA EP is enabled in distributed training, CPU memory is still used for some node outputs. Earlier we had distributed run test coverage, but it didn't cover the case where some of the nodes use CPU devices for storing tensor output. As a result, I recall we hit regressions twice in the past months: - https://github.com/microsoft/onnxruntime/pull/14050 - https://github.com/microsoft/onnxruntime/pull/15823 So adding this test to avoid future regressions. The test graph looks like this: ![image](https://user-images.githubusercontent.com/10530022/236594940-70c68a55-18bf-4e09-bbf5-8a64895d3045.png) ### Motivation and Context	2023-05-09 10:27:19 +08:00
Rachel Guo	817d70a63b	[js/rn] Fix extensions header include issue (#15800 ) ### Description Identified the cause of a `redefinition compilation error` that happened in a react native expo app with ort-extensions enabled when running the ios side. Fix the include path now, so we can remove the temporary forward declaration in OnnxruntimeModule.mm file. ### Motivation and Context Fix implementation detail. --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2023-05-08 17:12:10 -07:00
Yulong Wang	0457fd0b40	upgrade emsdk to 3.1.37 (#15817 ) ### Description upgrade emsdk to 3.1.37 WIP branch to debug the mystery memory issue in web assembly multi-thread build.	2023-05-08 16:49:47 -07:00
Tianlei Wu	191ee1d3c0	Fix symbolic shape infer empty value_info (#15842 ) ### Description When node output is optional, symbolic shape infer might add an empty value_info item. Add some checking to avoid this. ### Motivation and Context - Stable diffusion optimized model reported invalid data type 0 during inference.	2023-05-08 16:18:35 -07:00
Yi Zhang	045c623415	Make Nuget workflow easy to debug (#15808 ) ### Description Fix the bug in #15693 ### Motivation and Context	2023-05-08 20:53:08 +08:00
Nat Kershaw (MSFT)	5e9b42326c	Fix packaging pipeline for nightly builds (#15839 )	2023-05-07 20:42:38 -07:00
PeixuanZuo	2735e0d031	[ROCm] simplify ck data type Adaptor (#15734 ) DataTypeAdaptor is defined many times, once in every file that integrates CK. This PR refactors the code to put DataTypeAdaptor in a header file.	2023-05-06 17:39:52 +08:00
Ted Themistokleous	42d62b8f2b	Fixes to get stable diffusion benchmark running (#15755 ) ### Description Added changes to MIGraphX EP to support stable diffusion 1. Added parameterized input dimensions to not trigger a precompile to set input parameters in the EP 2. Removed input checking for Resize operator in EP as MIGraphX already performs these checks 3. Add support to benchmark script to use the MIGraphX execution provider 4. Add support for an odd valued batch size (3) that was seen on other benchmarks we were performing comparison on. ### Motivation and Context These changes are required to get stable diffusion models to run on MIGraphX through the EP. Without these changes we see the following incorrect behavior. 1. Resize operators are pushed onto the CPU EP instead of MIGraphX, causing a significant slowdown during runs 2. Precompile operations incorrectly parse input_ids parameter for our text model, with a 1, which breaks during MIGraphX Compile of onnx. This in turn throws an error and stops any setup before inference. 3. Selecting the correct EP in the benchmark script which was previously missing the MIGraphX option 5. Suppressed an error we keep seeing with pthread_set_affinity - this is a quality of life change when using the MIGraphX EP This was tested with the benchmark.py script using stable diffusion v2 located in onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/ --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2023-05-06 17:35:21 +08:00
PeixuanZuo	41457885e0	[ROCm] add rocm5.5 to python package pipeline (#15820 ) add rocm5.5 to python packaging pipeline. https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=306082&view=results TODO: Remove version 5.2.3, 5.3.2 and 5.4 in the next PR.	2023-05-06 10:21:15 +08:00
Nat Kershaw (MSFT)	ed31e4b737	Add nuget release version suffix to support publishing rcs to nuget.org (#15791 )	2023-05-05 18:18:24 -07:00
pengwa	dfac096501	Fix segfault for multiple GPU run (regression) (#15823 ) ### Fix segfault for multiple GPU run https://github.com/microsoft/onnxruntime/pull/15618 introduced `GetOrtDeviceByMemType`. The intention should be to handle the CPU device differently in the if branch, but it may mistakenly pass the unique default non-CPU device id. ``` OrtDevice CUDAExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const { if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) { return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, default_device_.Id()); } return default_device_; } ``` We observed a segmentation fault thrown when running multiple GPU training ` CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=2 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path distilbert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --overwrite_output_dir --output_dir ./outputs222/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 400 --logging_steps 1 ` We found that GPU0 works fine while GPU1 throws a segmentation fault. Looking further, a Shape node trying to allocate its output tensor tries to fetch the corresponding allocator with ORTDevice(Device:[DeviceType:0 MemoryType:1 DeviceId:1]), but the CPU device does not have device id = 1, so no allocator is returned. When we then call `AsStreamBasedAllocator` on that allocator, the segmentation fault happens because no null check is done there. ### Motivation and Context	2023-05-06 08:48:53 +08:00
Sheil Kumar	2b7f26af7c	Add GridSample implementation to DirectML (#15788 ) Add GridSample implementation to DirectML EP. Temporarily add an HLSL shader in the DirectML EP to handle GridSample until it is officially added to DirectML.	2023-05-05 15:59:33 -07:00

8793 commits