onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 17:15:29 +00:00

Author	SHA1	Message	Date
Edward Chen	71e7c2b423	Cache build docker images in container registry. (#5811 ) This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry. Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date and building images every time wastes build agent resources. With this change, a given build image can be looked up in a cache container registry and if present, pulled, and otherwise, built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository. The cache container registry will need to be cleaned up periodically. This is not automated yet.	2020-11-17 17:02:24 -08:00
zhijxu	89e5b3a24f	resolve review comments	2020-11-16 11:23:01 +08:00
zhijxu	89902c2519	Fix frontend bug: an old ORT session may already exist when a new ORT session is created, which may cause an OOM error	2020-11-16 11:23:01 +08:00
Jesse Benson	ced5b66306	Re-enable multi-tensor-apply for LAMB optimizer	2020-11-15 09:35:00 -08:00
Weixing Zhang	fc614ad050	revert the code change which was based on `b4869926` The change `b4869926` which was to remove per-thread allocator would cause seg fault for distributed training. In addition, add dockerfile for ROCm3.9	2020-11-15 00:24:32 -08:00
Vincent Wang	0c8902cbbe	Update Gradient Builder of Some Ops for OpSet13 (#5748 ) * gradient builder for opset13 * code clean. * resolve comments * stop grad for axes input * add split to stop grad list. Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-13 16:20:34 +08:00
Alberto Magni	88c3704257	Add shape inference for additional ops This commit adds shape inference support for the following ops: SoftmaxCrossEntropy, SoftmaxCrossEntropyLossGrad, SoftmaxCrossEntropyGrad, LayerNormalizationGrad.	2020-11-12 20:18:54 +00:00
pengwa	49288de17c	Fix memory planning issues (#5752 ) * Fix memory planning issues * fix build * fix the wrong line...	2020-11-13 03:07:59 +08:00
Vincent Wang	2a87108431	SoftmaxCrossEntropyLoss OpSet13. (#5777 ) Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-11-12 15:50:34 +08:00
Sherlock	07dc25e939	Compute global gradient norm according to 'enable_grad_norm_clip' (#5728 ) * Introduce PassThrough op to wait for all gradient ready before weight update * Compute gradient norm for fp32 runs * Update FE UT expected value * Respect enable_grad_norm_clip	2020-11-11 21:10:34 -08:00
ashbhandare	5aec34500d	Add megatron transforms for BART (#5521) * Large model export and run ORT Python support * Megatron change refine a bit workaround self attention issue use partitioned name for weights when megatron model parallel is enabled Fix Megatron Transformer Issue (caused by the renaming) Add UTs for T5 model parallel Fix megatron seed issue fix log a bit checkpointing changes + rebase Unintended reshape transform change t5 layer norm changes add t5 layer norm kernel use template for t5 layer norm template definition changes no build error add CPU cuda kernel first unit test other forward unit tests add T5LayerNormGrad Add c++ transform and test for T5 LN minor fix BART MLP Megatron transform Add concat slice transform + test Cosmetic improvements in concat slice transform Constant folding bug fix + megatron attention transform for BART Undo unnecessary changes * Cleanup * Remove unnecessary changes * Cleanup megatron * Windows build * Add self attention test graph * Correcting transforms + cleanup * review comments * review comments * fix build and test failures * Fix CI * fix windows CI Co-authored-by: Peng Wang <pengwa@microsoft.com> Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-11 16:21:36 -08:00
Xueyun Zhu	d8ace07ad7	Add CPU send/recv for pipeline (#5315 ) * cpu send/recv * clean up send/recv * remove unused code * assert and nccl option for mnist * add build option to enable build with only cpu. Without this, nccl is always enabled which will break build on machine that only contains cpu * Add USE_MPI distinct from USE_NCCL/USE_HOROVOD * fix * fix * exclude cpu send/recv for machines without mpi Co-authored-by: Tim Harris <tiharr@microsoft.com>	2020-11-11 12:41:39 -08:00
Derek Murray	bc1768c7f1	Stop gradient flowing to the `k` input of TopK (#5762 )	2020-11-11 10:24:44 -08:00
liqunfu	1416d12f0b	Liqun/merge e2e pipelines (#5702) * Create an Azure Pipeline to merge cpp and python e2e pipelines into one. Still keep cpp e2e pipeline until this new pipeline is stable. Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-11 09:42:08 -08:00
edgchen1	2acdc3cd82	Move GetUseDeterministicCompute() to OpKernelContext to avoid need to downcast to OpKernelContextInternal. (#5729 )	2020-11-09 11:37:06 -08:00
Weixing Zhang	bb1af718b5	fix build failures due to recent change(`858040fa`) in CUDA EP (#5736 ) Some part of code for reduction kernels has been changed in `858040fa`, which cause failures in rocm build since ROCm EP shares some code with CUDA EP. This PR is to quick fix this failure by not sharing two files for now to unblock CI enabling on ROCm EP. Another PR for leveraging `858040fa` for ROCm EP will be done later.	2020-11-09 08:41:30 -08:00
Weixing Zhang	fff85a6a35	Add GPU kernels for ROCm EP (#5655) * Add kernels for AMD GPU. This PR is mostly about GPU kernels for ROCm EP. Because the GPU programming languages (CUDA and HIP) and the math library calls are similar, one principle in the ROCm EP design is to share CUDA kernels with ROCm as much as possible. Thus, the script amd_hipify.py has been created to convert CUDA kernels to ROCm HIP kernels automatically during the compilation phase. However, for reasons such as perf issues and syntax differences, some converted kernels need manual intervention. These kernels will be checked into the repo for now. To avoid manual intervention, the plan is to refactor CUDA kernels to make them portable between CUDA EP and ROCm EP as much as possible. Please refer to "HIP Porting Guide" for details. * Like LAMB, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2; the current AMD GPU compiler has a perf issue with a kernel parameter that is a structure passed by value. * Use hipMemsetAsync and add checks on HIP calls. * Move the generated files to build folder. Co-authored-by: Jesse Benson <jesseb@microsoft.com>	2020-11-06 16:11:06 -08:00
edgchen1	858040faaa	Implement reduce_matrix_columns() to optimize ReduceSum (#5639 ) Implement reduce_matrix_columns() to optimize ReduceSum.	2020-11-05 10:25:00 -08:00
ashbhandare	6d8e81cb08	Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5691) * Split change * ReduceSum and Split change * Other op changes, Grad builder, tests, registering required opset 13 ops * Rebase fixes * Fix tests, add some more * Review changes, rebase * Fix windows build * Disable new tests for TensorRT EP * Disable unsupported for OpenVINO Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-04 20:00:27 -08:00
wezuo	62a99824cb	Wezuo/priority in nodedef (#5692 ) * set the priority in nodedef * remove debugging stmts * revoke zero builder * remove unnecessary namespace comment Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com> Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-04 12:40:37 -08:00
edgchen1	28f1e32898	Loosen tolerance of CudaKernelTest.ReduceSum_MidTensor, allow test random seed to be regenerated within a test run. (#5675 )	2020-11-03 10:37:00 -08:00
Changming Sun	87e1063e19	Revert "Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5488 )" (#5668 ) This reverts commit `db63c5d10f`.	2020-11-02 16:09:22 -08:00
Jesse Benson	1495f737ca	Use cudaMemsetAsync and add checks on CUDA calls.	2020-11-02 11:25:13 -08:00
ashbhandare	db63c5d10f	Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5488 ) * Split change * ReduceSum and Split change * Other op changes, Grad builder, tests, registering required opset 13 ops * Rebase fixes * Fix tests, add some more * Review changes, rebase * Fix windows build Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-02 10:51:48 -08:00
M. Zeeshan Siddiqui	f2168cef29	Misc. cleanup. (#5659 ) Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-02 07:05:28 -08:00
M. Zeeshan Siddiqui	9af0d48524	Memory planner and pattern generation enhancements. (#4443) * static allocation. * changes. * contiguous dynamic allocation. * contiguous dynamic allocation. * fix bugs. * fix bug. * build errors. * PR feedback. * PR feedback. * Update Graph builder for nccl_allreduce, mps. * misc. * fix windows build break. * changes. * fine-grained memory-time scheduling. * merge. * fix misc stuff. * fix windows build. * fix windows build. * fix merge bug. * merge conflicts. * revert onnx-tensorrt submodule commit. * fix submodule commit. * misc. * merge conflicts. * Revert "merge conflicts." This reverts commit `319a071a6e`. * merge conflict. * merge conflict. * merge conflicts. * fixes. * PR feedback. * build break. * build break. * Add asserts. * Add asserts. * asserts. * asserts. * asserts. * asserts. * asserts. * fixes. * fixes. Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> Co-authored-by: root <root@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-01 23:05:46 -08:00
Zhang Lei	17bce6f07e	Implement Im2colNd NHWC and related qlinearconv logic for u8s8. (#5612 ) Implement Im2colNd NHWC and related qlinearconv logic for u8s8, and training.	2020-10-30 15:28:30 -07:00
Weixing Zhang	aec4cb489e	ROCm EP for AMD GPU (#5480) The ROCm EP is designed and implemented on top of the AMD GPU software stack named ROCm. Details about ROCm: https://rocmdocs.amd.com/en/latest/ ROCm EP was created based on the following: 1. AMD GPU programming language: HIP 2. AMD GPU HIP language runtime: amdhip64 3. BLAS: rocBLAS, hipBLAS 4. DNN: miOpen 5. Collective Communication library: RCCL 6. cub: hipCub 7. … Current status: BERT-L and GPT2 training can be run on AMD GPU with data parallelism. Next: 1. Make more GPU code sharable between ROCm EP and CUDA EP, since the HIP language and HIP runtime API are very close to CUDA. 2. Continue improving the implementation. 3. Continue GPU kernel optimization. 4. Support model parallelism on ROCm EP. …… The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big (~180 files), it was suggested to split the PR into two parts: one for rocm kernels, the other for non-rocm kernels. Co-authored-by: Weixing Zhang <wezhan@microsoft.com> Co-authored-by: sabreshao <sabre.shao@amd.com> Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com> Co-authored-by: Suffian Khan <sukha@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2020-10-29 17:13:04 -07:00
Vincent Wang	1fa1c51544	bug fix for name of gradient constant (#5626 ) Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>	2020-10-30 07:08:19 +08:00
Sergii Dymchenko	2e1fa3ccb7	Fix GeluRecompute for 2 inputs case. (#5573 ) * Add test for FastGelu + GeluRecompute. * Fix GeluRecompute for 2 inputs case. * Fix test for BiasGelu + GeluRecompute. * Copy all inputs to Gelu, not just 2. * Move GeluRecompute test to training-specific file.	2020-10-29 00:07:13 -07:00
liqunfu	5129b4d5bc	batch size tests (#5508 ) * batch size tests Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-28 15:55:40 -07:00
Tim Harris	5e8952ef89	ThreadPool clean up : mm_pause in loops, correctly spin-then-wait, and adopt static methods consistently in the API (#5590 ) Description: This change makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses mm_pause intrinsics in spin loops, helping avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution to start out spinning as intended, rather than to start out trying to steal. (3) It updates the ThreadPool class's API to be consistent in the use of static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases. Motivation and Context The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.	2020-10-28 09:49:18 +00:00
liqunfu	92662659ba	Liqun/remove number matching (#5606 ) replace number matching with relaxed comparison in frontend tests Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-27 21:27:37 -07:00
Ryan Hill	e90b6f06d1	Factor out IAllocator so that it can be shared with shared providers (#5567 ) * Factor out IAllocator so shared providers can use it directly.	2020-10-27 17:28:17 -07:00
Weixing Zhang	b851973f22	pipeline_worker_pool_.JoinAll() should be called in pipeline code path (#5604 ) Co-authored-by: Weixing Zhang <wezhan@microsoft.com>	2020-10-27 11:57:46 -07:00
ytaous	6f824c25e5	Dropout op elimination - enable for ORT training (#5588 ) * dropout elimination * per comments * fix build * fix build Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-27 11:51:23 -07:00
Dmitri Smirnov	3433576fd3	Support for Sparse Initializers (#5540 ) Introduce sparse_initializers support. Convert them to dense on model load and prune graph_proto_ so they don't consume space. Convert back to sparse on ORT Format model save. Implement serializing sparse initializers to OrtFormat. Fix Model::ToProto() to return original sparse initializers Set a flag that graph_sync is needed when loading a simple ORT Format model. otherwise nothing is resolved. Add ORT Format history to README.md ifdef MINIMAL build for DenseToSparseTensorInitializer Allow duplicate initializers to support existing models. Issue a warning instead of aborting. * Revert "Remove SparseTensor support from minimal build. (#5114)" This reverts commit `59ee8ffb17`. Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>	2020-10-27 10:32:06 -07:00
Sherlock	694a4d6413	Add more loggings for GradientBuilder (#5556 ) * Add more loggings for GradientBuilder Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-26 15:15:52 -07:00
edgchen1	68fe722691	GatherGrad optimization (#5524 ) The existing implementation of the GatherGrad CUDA kernel does not do work in a very parallel manner for certain inputs which can lead to poor performance. The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output. Previously, each sum was computed by a single thread. If there is an instance of a summation of a large number of values, it can significantly impact the overall kernel execution time. The updated version has an alternate implementation which splits the sums into partial sums which get accumulated together later. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU and GPU synchronization because intermediate GPU results are required by the CPU computation. The original implementation outperformed the alternate for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. The input analysis has some overhead. The implementation was adapted from PyTorch (`b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu`).	2020-10-26 12:53:53 -07:00
Sergii Dymchenko	8224718f8f	Enable CommonSubexpressionElimination in training. (#5504) * Add test for CommonSubexpressionElimination in training. * Enable CommonSubexpressionElimination in training. * Add CommonSubexpressionEliminationApplyOnce for training.	2020-10-26 11:25:15 -07:00
ashbhandare	0a9b83a313	Add zero test (#5476 )	2020-10-21 17:12:00 -07:00
Vincent Wang	b48f596a91	GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel (#5511 ) * TopKGrad CPU kernel * use Scatter for GatherElementsGrad and TopKGrad. * rollback convgrad change. Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-10-21 09:29:29 +08:00
Xavier Dupré	66c8a441e0	Improves ReduceSum performance by removing transposition. (#5370 ) * Improves ReduceSum performance * Add min, max, L1, L2, logsum, sumsquare * remove all reduce implementation including transpose	2020-10-20 10:36:31 +02:00
Juliana Franco	0298b9734e	Save in EndTraining only if in last rank (#5500 ) * Only save partition of graph with loss (during EndTraining) * fix comments Co-authored-by: Juliana <jufranc@microsoft.com>	2020-10-19 14:16:48 -07:00
Derek Murray	0b59004666	Add fallback function implementation for DivGrad (#5518 ) * Add fallback function implementation for DivGrad. * Add shape inference for DivGrad. * Add missing argument. Co-authored-by: Derek Murray <demurra@microsoft.com>	2020-10-19 10:47:47 -07:00
Derek Murray	6f65e2ad2c	Mark the dX and dB outputs of ConvGrad as OpSchema::Optional. (#5462 ) * Mark the dB output of ConvGrad as OpSchema::Optional. * Also mark dX as optional Co-authored-by: Derek Murray <demurra@microsoft.com>	2020-10-15 16:54:17 -07:00
Derek Murray	64f6d856e4	Add FlattenGrad and test. (#5461 ) Co-authored-by: Derek Murray <demurra@microsoft.com>	2020-10-15 16:11:57 -07:00
Derek Murray	88f6523baf	Add type inference for BroadcastGradientArgs (#5501 ) * Add type inference for BroadcastGradientArgs This change enables the ONNX shape and type inference to work on a function body containing a BroadcastGradientArgs op. Without this change, the dummy inference function is used, and no types are inferred for the output here: `531e6dd459/onnx/shape_inference/implementation.cc (L467-L469)` * Handle optional outputs.	2020-10-15 16:11:24 -07:00
Scott McKay	7da7e07909	Cleanup some test infrastructure (#5484 ) * Created shared version of InferenceSession wrapper class and update relevant tests to use it. Include domain in the ops counting helper so it's more general and we don't need to duplicate it in the nchwc tests. Update tests to include domain in key being checked. * Fix some training tests * Fix prefixing of contrib op names in test	2020-10-16 06:44:01 +10:00
KeDengMS	c444b9d76a	Add CUDA option to run copy in default stream (#5445) * Add CUDA option to run copy in default stream This change fixes #4829. Thanks @maherzog for providing the repro! The bug is caused by memory reuse in the BFC arena, where the copy and compute streams in CUDA have a race condition. The BFC arena is an arena allocator on top of cudaMalloc/Free that reduces the cost of syncing CPU and GPU on alloc/free. This means that when the CPU allocs/frees memory, the GPU might not have finished previous work on that memory, so that CPU and GPU can run asynchronously. This is OK if there is only one stream, where the execution order on CPU and GPU is consistent. For example, if we have two kernels A and B and the CPU runs allocA->computeA->freeA->allocB->computeB->freeB, A and B can share the same memory, since computeA and computeB will not race as long as they run in the same GPU compute stream. However, if the CPU runs allocA->copyA->freeA->allocB->computeB->freeB, the order of execution on the GPU could have copyA happen after computeB if copy and compute happen in different GPU streams. This change makes copy run in the default compute stream, while adding an option to fall back to the previous behavior if there is a perf hit. This is a short-term fix until the BFC arena can support multiple streams. Users may use the following options to revert to the previous behavior: C API: struct OrtCUDAProviderOptions cudaProviderOpt; cudaProviderOpt.do_copy_in_default_stream = false; C++ API: CUDAExecutionProviderInfo cudaEPInfo; cudaEPInfo.do_copy_in_default_stream = false; C# API: pending... Python: import onnxruntime onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False) * Confirmed the test fails in CI when doing copy in a separate stream Revert the test to get CI to pass for now * Fix Windows test * Address CR	2020-10-12 22:12:05 -07:00
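
The ordering hazard described in commit c444b9d76a (copy and compute on different GPU streams while the arena reuses a freed block) can be illustrated with a small toy model. This is only a sketch in plain Python, not ONNX Runtime code: the names `interleavings` and `has_hazard` are hypothetical, each stream is modeled as a FIFO list of operation names, and the only assumption is that operations in one stream run in order while operations in different streams may interleave freely.

```python
def interleavings(streams):
    """Yield every global execution order that keeps FIFO order inside each stream."""
    if not any(streams):
        yield []
        return
    for i, queue in enumerate(streams):
        if queue:
            # Run the head of stream i next, then recurse on what is left.
            remaining = streams[:i] + [queue[1:]] + streams[i + 1:]
            for tail in interleavings(remaining):
                yield [queue[0]] + tail


def has_hazard(streams, writer="copyA", reuser="computeB"):
    """True if some legal order runs `reuser` before `writer` on the shared block."""
    return any(order.index(reuser) < order.index(writer)
               for order in interleavings(streams))


# CPU issue order is copyA, then (after freeA/allocB reuse the block) computeB.
one_stream = [["copyA", "computeB"]]       # copy runs in the compute stream
two_streams = [["computeB"], ["copyA"]]    # separate compute and copy streams

print(has_hazard(one_stream))   # False
print(has_hazard(two_streams))  # True
```

With a single stream the GPU can only execute in the order the CPU issued the work, so reusing A's block for B is safe; with two streams at least one legal interleaving lets computeB touch the block before copyA has run, which matches the race the commit message describes.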


345 commits