onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-20 02:07:56 +00:00

Author	SHA1	Message	Date
Thiago Crepaldi	77cefcd6c2	Perform forward pass using training graph with intermediate outputs	2020-12-15 09:03:07 -08:00
Thiago Crepaldi	11b69f141e	Forward pass using InferenceSession on exported ONNX Although forward pass works, this has the limitation of not working for backward pass due to the lack of intermediate tensors needed for gradient. Next step is to export a training graph and split it manually	2020-12-15 09:03:07 -08:00
Edward Chen	9810b9e02b	Reduce amount of compiled CUDA device code (#6118 ) Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight. Make corresponding changes for ROCM execution provider code. Other minor cleanup.	2020-12-14 15:27:40 -08:00
liqunfu	cde723a136	Liqun/move nightly pl to linux multi gpu v100 (#6024 ) * move e2e nightly pipeline to azure devop Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-12-14 12:43:41 -08:00
baijumeswani	dd2e5a1a05	state_dict and load_state_dict for ORTTrainer (#6095 ) * add functions state_dict and load_state_dict to ORTTrainer * unit tests for state_dict and load_state_dict for ORTTrainer	2020-12-14 11:55:52 -08:00
Suffian Khan	6cb5d3ac09	Fix multi-tensor LAMB reduction to be deterministic (#6028) * define ordering of reduction across blocks * save state * remove debug code * remove debug code * review comments * significant correction for reduction only over blocks on same tensor * addressing comments * update rocm/lamb.cc to build as well * remove times 2048size in multitensor test until threshold error in rocm resolved convert tuple => struct as per recommendation * update comment * apply perfect forwarding for launch_multitensor to permit passing ref rather than pointer * remove excess template arguments from rocm lamb.cc launch_multitensor as well * fixes for AMD build * pr comments * run formatter from vscode * formatter on cuda files	2020-12-11 13:13:05 -08:00
Sherlock	a53f4dd379	Introduce VariadicAlias, remove hardcoded alias limits (#6106 ) * Introduce VariadicAlias, remove hardcoded alias limits * Include optional-lite in winml build Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-12-11 10:47:08 -08:00
Jesse Benson	38c49c2483	Make ROCM and CUDA reduction_all code more similar.	2020-12-11 09:35:07 -08:00
Vincent Wang	7ddeafdfcc	Add ReduceL2Grad and ClipGrad (#5970 ) * ReduceL2Grad and ClipGrad. * fix win build and amd ci pipeline * resolve comments. Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>	2020-12-10 11:03:26 +08:00
Sergii Dymchenko	9e26e59a37	Deprecate opsets <12 for training. (#6027 )	2020-12-09 00:15:27 -08:00
Weixing Zhang	d95fc5e849	clean un-used code. (#6059 ) Co-authored-by: Weixing Zhang <wezhan@microsoft.com>	2020-12-08 23:15:30 -08:00
Weixing Zhang	2705115732	add dockerfile for ROCm3.10 and update BUILD.md for ROCm EP (#5821 ) * add HSA_NO_SCRATCH_RECLAIM=1 to dockerfile It is to work around an issue in AMD compiler which generates poor GPU ISA when the type of kernel parameter is a structure and “pass-by-value” is used * update BUILD.md * add dockerfile for rocm3.10	2020-12-08 23:14:56 -08:00
ashbhandare	b1a75d0e98	Enable passing initial optimizer state while creating training session (#5869 ) * Support to pass initial optimizer states to optimizer graph builder * Changes for passing init optim state to training session config * Pass optimizer state through cpp and python frontend * Cleanup * Review comments * Fix windows and mac CI * Review comments * review comments * Review comments * Frontend review changes * Fix CI	2020-12-08 21:20:51 -05:00
Sherlock	7a43fa0028	Fix AllReduce kernel for contiguous buffer (#6064 )	2020-12-08 15:55:13 -08:00
baijumeswani	523d187193	save data to and load data from an hdf5 file for checkpointing (#5975 ) * save python dictionary to hdf5 representation and load an hdf5 file into a python dictionary * unit tests for saving data to and loading data from hdf5 file	2020-12-08 11:40:57 -08:00
ashbhandare	7cebf76a46	Improve checkpointing for Zero stage 1 (#5478 ) * Initial running changes * Checkpointing aggregation changes * compare with older version * initial cleanup * Add zero test, minor fix * Fix zero test, transform, formatting * Review comments * add more unit tests * review comments * Try fix CI * Add additional check on just aggregation code * Try fix ckpt gen * Add pregenerated ckpt for CI, enable zero test in e2e * Moving test to nightly, removing ckpt files * Add tests to dist GPU CI * Fix dist test * Review comments * Fix test	2020-12-07 09:16:01 -08:00
Jesse Benson	14f6eb14b1	Use __launch_bounds__ workaround, rather than limiting threads to 256 on AMD.	2020-12-03 13:06:34 -08:00
Jesse Benson	245d43615d	Fix AMD multi-tensor implementation.	2020-12-03 13:06:34 -08:00
Sherlock	c86a1e5c13	Fix Flaky orttraining tests (#5977) * Fix Flaky orttraining tests	2020-12-03 10:24:25 -08:00
Alberto Magni	fb310fba0c	Avoid adding non-existent inputs to new Event nodes (#5915 ) During graph resolve non-existent nodes cause shape-inference failures.	2020-12-01 08:21:05 -08:00
Jesse Benson	45966d878a	Code review feedback	2020-11-30 09:24:22 -08:00
Jesse Benson	86e30a2db6	Update CUDA IsAllFinite kernel	2020-11-30 09:24:22 -08:00
Jesse Benson	bd96f60888	Use CUDA's IsAllFinite kernel for ROCm	2020-11-30 09:24:22 -08:00
baijumeswani	69b9368c93	Add unit tests to identify configuration migration scenarios for checkpointing (#5678 )	2020-11-25 09:40:26 -08:00
baijumeswani	208f4c1d3c	Azure ci pipeline for distributed environment tests (#5881 )	2020-11-23 14:01:00 -08:00
Vincent Wang	47185b9513	ReduceAllL2 CPU kernel (#5833) Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>	2020-11-19 10:20:05 +08:00
Tracy Sharpe	f964bb94ba	Add QLinearConv NHWC transformer (#5824 ) The implementation of QLinearConv internally does a transpose(NHWC)->im2col+GEMM->transpose(NCHW). This adds a graph transformer to change a model to use a com.microsoft.QLinearConv that supports NHWC natively to avoid unnecessary transposes.	2020-11-17 20:51:02 -08:00
Edward Chen	71e7c2b423	Cache build docker images in container registry. (#5811 ) This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry. Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date and building images every time wastes build agent resources. With this change, a given build image can be looked up in a cache container registry and if present, pulled, and otherwise, built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository. The cache container registry will need to be cleaned up periodically. This is not automated yet.	2020-11-17 17:02:24 -08:00
zhijxu	89e5b3a24f	resolve review comments	2020-11-16 11:23:01 +08:00
zhijxu	89902c2519	fix frontend bug. old ort session may already exist when creating new ort session, which may cause OOM error	2020-11-16 11:23:01 +08:00
Jesse Benson	ced5b66306	Re-enable multi-tensor-apply for LAMB optimizer	2020-11-15 09:35:00 -08:00
Weixing Zhang	fc614ad050	revert the code change which was based on `b4869926` The change `b4869926` which was to remove per-thread allocator would cause seg fault for distributed training. In addition, add dockerfile for ROCm3.9	2020-11-15 00:24:32 -08:00
Vincent Wang	0c8902cbbe	Update Gradient Builder of Some Ops for OpSet13 (#5748 ) * gradient builder for opset13 * code clean. * resolve comments * stop grad for axes input * add split to stop grad list. Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-13 16:20:34 +08:00
Alberto Magni	88c3704257	Add shape inference for additional ops This commit adds shape inference support for the following ops: SoftmaxCrossEntropy, SoftmaxCrossEntropyLossGrad, SoftmaxCrossEntropyGrad, LayerNormalizationGrad	2020-11-12 20:18:54 +00:00
pengwa	49288de17c	Fix memory planning issues (#5752 ) * Fix memory planning issues * fix build * fix the wrong line...	2020-11-13 03:07:59 +08:00
Vincent Wang	2a87108431	SoftmaxCrossEntropyLoss OpSet13. (#5777 ) Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-11-12 15:50:34 +08:00
Sherlock	07dc25e939	Compute global gradient norm according to 'enable_grad_norm_clip' (#5728 ) * Introduce PassThrough op to wait for all gradient ready before weight update * Compute gradient norm for fp32 runs * Update FE UT expected value * Respect enable_grad_norm_clip	2020-11-11 21:10:34 -08:00
ashbhandare	5aec34500d	Add megatron transforms for BART (#5521) * Large model export and run ORT Python support * Megatron change refine a bit workaround self attention issue use partitioned name for weights when megatron model parallel is enabled Fix Megatron Transformer Issue (caused by the renaming) Add UTs for T5 model parallel Fix megatron seed issue fix log a bit checkpointing changes + rebase Unintended reshape transform change t5 layer norm changes add t5 layer norm kernel use template for t5 layer norm template definition changes no build error add CPU cuda kernel first unit test other forward unit tests add T5LayerNormGrad Add c++ transform and test for T5 LN minor fix BART MLP Megatron transform Add concat slice transform + test Cosmetic improvements in concat slice transform Constant folding bug fix + megatron attention transform for BART Undo unnecessary changes * Cleanup * Remove unnecessary changes * Cleanup megatron * Windows build * Add self attention test graph * Correcting transforms + cleanup * review comments * review comments * fix build and test failures * Fix CI * fix windows CI Co-authored-by: Peng Wang <pengwa@microsoft.com> Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-11 16:21:36 -08:00
Xueyun Zhu	d8ace07ad7	Add CPU send/recv for pipeline (#5315 ) * cpu send/recv * clean up send/recv * remove unused code * assert and nccl option for mnist * add build option to enable build with only cpu. Without this, nccl is always enabled which will break build on machine that only contains cpu * Add USE_MPI distinct from USE_NCCL/USE_HOROVOD * fix * fix * exclude cpu send/recv for machines without mpi Co-authored-by: Tim Harris <tiharr@microsoft.com>	2020-11-11 12:41:39 -08:00
Derek Murray	bc1768c7f1	Stop gradient flowing to the `k` input of TopK (#5762 )	2020-11-11 10:24:44 -08:00
liqunfu	1416d12f0b	Liqun/merge e2e pipelines (#5702) * Create an Azure Pipeline to merge cpp and python e2e pipelines into one. Still keep cpp e2e pipeline until this new pipeline is stable. Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-11 09:42:08 -08:00
edgchen1	2acdc3cd82	Move GetUseDeterministicCompute() to OpKernelContext to avoid need to downcast to OpKernelContextInternal. (#5729 )	2020-11-09 11:37:06 -08:00
Weixing Zhang	bb1af718b5	fix build failures due to recent change(`858040fa`) in CUDA EP (#5736 ) Some part of code for reduction kernels has been changed in `858040fa`, which cause failures in rocm build since ROCm EP shares some code with CUDA EP. This PR is to quick fix this failure by not sharing two files for now to unblock CI enabling on ROCm EP. Another PR for leveraging `858040fa` for ROCm EP will be done later.	2020-11-09 08:41:30 -08:00
Weixing Zhang	fff85a6a35	Add GPU kernels for ROCm EP (#5655) * Add kernels for AMD GPU. This PR is mostly about GPU kernels for ROCm EP. Due to similar GPU programming language (CUDA and HIP) and similar math library calls, one principle in ROCM EP design is to share CUDA kernels as much as possible for ROCm. Thus, the script amd_hipify.py has been created for converting CUDA kernels to ROCm HIP kernels automatically during compilation phase. But, for some reasons such as perf issue, syntax difference..., some converted kernels need some manual intervention. These kernels will be checked in the repo physically for now. In order to avoid manual intervention, the plan is to refactor CUDA kernels to make them portable between CUDA EP and ROCm EP as much as possible. Please refer to "HIP Porting Guide" for details. * like lamb, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2, current AMD GPU compiler has perf issue for kernel parameter which is a structure with "pass by value". * Use hipMemsetAsync and add checks on HIP calls. * move the generated files to build folder. Co-authored-by: Jesse Benson <jesseb@microsoft.com>	2020-11-06 16:11:06 -08:00
edgchen1	858040faaa	Implement reduce_matrix_columns() to optimize ReduceSum (#5639 ) Implement reduce_matrix_columns() to optimize ReduceSum.	2020-11-05 10:25:00 -08:00
ashbhandare	6d8e81cb08	Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5691) * Split change * ReduceSum and Split change * Other op changes, Grad builder, tests, registering required opset 13 ops * Rebase fixes * Fix tests, add some more * Review changes, rebase * Fix windows build * Disable new tests for TensorRT EP * Disable unsupported for OpenVINO Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-04 20:00:27 -08:00
wezuo	62a99824cb	Wezuo/priority in nodedef (#5692 ) * set the priority in nodedef * remove debugging stmts * revoke zero builder * remove unnecessary namespace comment Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com> Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-04 12:40:37 -08:00
edgchen1	28f1e32898	Loosen tolerance of CudaKernelTest.ReduceSum_MidTensor, allow test random seed to be regenerated within a test run. (#5675 )	2020-11-03 10:37:00 -08:00
Changming Sun	87e1063e19	Revert "Update Squeeze, Unsqueeze, Split and ReduceSum kernel for Opset13 (#5488 )" (#5668 ) This reverts commit `db63c5d10f`.	2020-11-02 16:09:22 -08:00
Jesse Benson	1495f737ca	Use cudaMemsetAsync and add checks on CUDA calls.	2020-11-02 11:25:13 -08:00

372 commits