onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-28 03:20:58 +00:00

Author	SHA1	Message	Date
Wei-Sheng Chin	4ccca20def	Replace MPI Send and Recv with NCCL Send and Recv (#5054 ) * Prototype NCCL P2P * Clean code * Fix NCCL path and some minor bugs * Add path * Fix path * Try fix path * Add missed files * Address some comments * Clean code * Rename files * Add MPI path back and fix a path * Put MPI path under USE_NCCL flag * not to build Send and Recv when MPI is not installed	2020-09-09 09:39:56 -07:00
Vincent Wang	07bf8b968e	Register BiasGelu and BiasDropout for CUDA only. (#5060 ) Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-09-09 11:46:55 +08:00
Sherlock	38453acae3	Further populate Stop Gradient list (#5021 ) * Add to Stop Gradient list * Improve Stop gradient	2020-09-08 12:49:09 -07:00
liqunfu	de58720a97	Liqun/transformer test and e2e golden numbers (#5064 ) * match new/old api numbers * new golden numbers for Roberta and MC Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-09-04 18:11:37 -07:00
Vincent Wang	84de14a833	Register OpSet13 CUDA Kernels for BERT/UniLMv2 (#4856 ) * opset13 cuda kernels for BERT. * add opset13 SoftmaxCrossEntropyLoss. * opset13 size. * fix argmax/min for ut. * fix ut failure for argmax/min. * OrtMemTypeCPUInput Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-09-05 08:09:52 +08:00
Bowen Bao	6dd4af3936	Fix initializer name only when wrapper is applied (#4920 ) * Fix initializer name only when wrapper is applied * fix inspect import	2020-09-04 12:08:07 -07:00
Thiago Crepaldi	0fc9c504fe	Re-enable CI tests for the new PyTorch frontend (#5017 ) This PR includes: * Re-enable CI tests for new PyTorch frontend * Re-enable fp16 and adjust tolerances for number matching	2020-09-04 09:36:24 -07:00
liqunfu	bb13b52291	to allow parallel training with mpi4py (#4942 ) to allow parallel training with mpi4py Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-09-03 12:47:12 -07:00
Thiago Crepaldi	9388d49c0d	Add warning to non-picklable models (#5037 )	2020-09-03 11:53:56 -07:00
Thiago Crepaldi	9d1bdef195	Update CODEOWNERS and minor docstring fix (#5002 ) This PR includes: * Previous CODEOWNERS was encompassing more files than just training files * Polynomial optimizer config is missing part of its docstring	2020-09-03 11:52:38 -07:00
Suffian Khan	546965c2da	Add deterministic path for AllReduceL2 (used to compute gradient norm) (#5027 ) * add deterministic path for reduce l2 * add unit tests * memset zero size off by one * eliminate windows warning as error Co-authored-by: suffian khan <sukha@OrtTrainingDev1.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-09-03 10:02:41 -07:00
Bowen Bao	22ba266bd6	Add flag to _internal_use to control export of contrib ops in ort trainer (#4968 )	2020-09-03 09:11:47 -07:00
Scott McKay	28445c88f9	Changes to enable saving and loading an ORT format model (#4995 ) * Changes to enable saving and loading an ORT format model via the public APIs. Cleanup session.py to try and make slightly more understandable. More refactoring is needed here. Couple of bug fixes * Fix bug in handling NodeArg serialization for optional inputs which has a name and no type info. * Address PR comments - tweak SessionOptions config to avoid double lookup - merge duplicated functionality in python binding around registering an EP with optional options Fix a couple of build issues. * Update C API to be consistent with python API - only load model in InferenceSession ctor if required - support loading ORT model in minimal build * Fix nodejs test. We get an invalid path error from LoadInterOp first now * Another attempt at fixing nodejs test. Error message depends on whether ENABLE_LANGUAGE_INTEROP_OPS is defined. Make the output consistent. The interop implementation looks suspicious given it appears to be internal code that is going via the public api. TBD if that should be fixed. * Fix couple of build issues. * Disable test temporarily so PR can be checked in. Will fix in separate PR that adds final pieces for minimal build as the test is required there. * Give up on nodejs test and make the match simpler. Fix init call in TrainingSession python to not pass through sess. it wasn't being used in Session anyway so passing it through just adds confusion. * Fix call to Session.__init__ in TrainingSession. Session now initializes Session._sess to None to make it clearer where the 'ownership' of that member is, and that needs to happen before TrainingSession sets it.	2020-09-03 09:10:48 -07:00
Sherlock	a935731bd3	Neg Gradient (#5022 ) Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-09-02 15:54:17 -07:00
Thiago Crepaldi	aabed34d5c	Fix checkpoint API and improve loss scaler handling (#4950 ) This PR also includes: * More LossScaler tests * Minor LossScaler improvement * Check model after extra post processing * Improve basic training tests to include all optimizers * Set rtol=1e-7 tolerance for Legacy vs Experimental frontend API tests * Increase number of training tests for Legacy vs Experimental tests * Minor refactoring on existing tests * Fix Checkpoint API for Gradient Accumulation / fp16 scenarios	2020-09-02 09:38:02 -07:00
Thiago Crepaldi	eebc2cccce	Fix fetches when eval_step's input is a subset of train_step's input (#4966 ) This PR also includes MNIST sample using the new frontend as a sample	2020-09-02 08:57:44 -07:00
Thiago Crepaldi	f38f2d5b54	Port #4920 into the new pytorch frontend (#4965 )	2020-09-01 19:00:49 -07:00
Hariharan Seshadri	d30dd41c0e	Remove public default ctor in PyInferenceSession and replace it with a protected ctor (#4990 )	2020-09-01 17:10:36 -07:00
liqunfu	d79af260bb	Liqun/new api orttraining test transformers (#4982 ) * matching transformer model test with Lamb * increase epochs * use atol 1e-6 to pass full precision test	2020-09-01 13:11:06 -07:00
Xueyun Zhu	1e1f5a9c79	support data parallel + pipeline parallel (#4648 ) * enable data + pipeline parallel * distributed group calculation * fix typo * fix test and minor changes	2020-08-31 17:32:03 -07:00
Thiago Crepaldi	9817b8c8a7	Fix state_dict/checkpoint issue introduced by #4639 (#4984 ) https://github.com/microsoft/onnxruntime/pull/4639 changed the default behavior by removing optimizer state from state_dict/checkpoint APIs. The reason for the previous change was to allow models trained on ORT to be used for inference on PyTorch, which is an important feature. Due to the aforementioned change, when resuming training from a checkpoint, the optimizer would start with random weights, leading to bad performance. This behavior would also cause reproducibility issues, as the optimizer wouldn't be able to resume from its previous state. This PR adds a boolean flag to the state_dict/save_checkpoint API: when True (default), both model and optimizer state are saved. When False, only the model state is kept.	2020-08-31 17:00:14 -07:00
Sherlock	50c610e70a	Stop Gradient at Shape op (#4983 )	2020-08-31 13:13:17 -07:00
M. Zeeshan Siddiqui	6d9d252bc3	Disable NegativeLogLikelihoodLoss_LargeSizeTensor test (#4979 ) Disabling this test until its intermittent failure is root-caused. This is a function and does not have a dedicated op by itself. To the best of my knowledge, this op is not used in any known model, so disabling this test for the sanity of CI until the investigation is over is probably reasonable.	2020-08-31 11:02:07 -07:00
Sherlock	98f7fdd7da	Handle MatmulGradient with 2D Weight at B (#4977 )	2020-08-30 22:56:33 -07:00
Hariharan Seshadri	7045910d10	Support RegisterCustomOpsLibrary via the Python API (#4764 )	2020-08-28 13:24:29 -07:00
Wei-Sheng Chin	1281ff6462	Put operators in-between Wait and Record (#4916 )	2020-08-28 11:44:54 -07:00
Tang, Cheng	efdd96595f	bfloat16 and opset13 related fix (#4913 ) * register part of opset13 cpu kernels; fix a bug in func impl; adjust reshapefusion order * remove useless function Co-authored-by: Cheng Tang <chenta@microsoft.com>	2020-08-27 16:10:53 -07:00
Sherlock	9f5d4918dc	MatMul Gradient optimization for dB when B's is 2D tensor (#4899 ) * Optimized MatMulGrad for dB when B's shape is 2D * Refactor for ConstantScalarNode Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-08-27 11:33:20 -07:00
harshithapv	00fe718264	Fix divide-by-zero for SSCE kernel when normalize factor is zero. (#4911 ) * Changes in SSCE for all tokens ignored case.	2020-08-26 17:12:17 -07:00
Thiago Crepaldi	cac25751bd	Fix mnist example (#4926 )	2020-08-26 15:28:39 -07:00
liqunfu	b3783a9f85	matching multiple choice between new and old apis (#4918 ) * matching multiple choice between new and old apis * update according to reviewer's comments Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-08-26 12:36:10 -07:00
Bowen Bao	db6a821869	Enable example transformer test with dynamic size inputs (#4888 ) Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>	2020-08-24 14:31:08 -07:00
Rayan-Krishnan	eb05db5a2a	Fix OptimizerConfig params groups (#4877 ) * Copy samples to build folder and load models from there. Fix CI * This PR also includes a fix to path validation for save_as_onnx API * Add torchtext to CI for GPU training * Remove new frontend tests from CI Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>	2020-08-22 22:04:17 -07:00
Pranav Sharma	29dcfb24ab	Allow multiple sessions to share an allocator, optimize constant folding memory usage, expose arena configs. (#4813 ) * Add support for sharing allocators * Incremental update * Address some PR comments, add unit tests, add documentation. * Address PR comments, add tests and some documentation. * Fix build and test issues * Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this. Also fixed the orttraining segfault. The segfault was because in the case of training session, the CPU exec prov is not available at the time the transformers are applied. Changed it to create a new one.	2020-08-22 10:03:17 -07:00
jingyanwangms	fa68bbc82e	Relu grad kernel (#4864 ) * create branch for debug * move unit test * more changes * move relu to activations_grad* * Fix ReluGrad Domain and opset version * added unit test, CudaKernelTest.Relu_basic doesn't work yet * remove CudaKernelTest.Relu_basic * PR comment * add unit test ReluGradTest_Basic Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-08-22 01:03:44 -07:00
Thiago Crepaldi	dce2ce7a4f	Fix checkpoint API and copy samples into build dir (#4887 ) * Fix state_dict APIs * Copy samples to build folder and fix CI	2020-08-22 00:09:48 -07:00
liqunfu	6260d073b3	Glue parallel training (#4550 ) add mpi size, rank python API add single node parallel training example	2020-08-21 21:24:27 -07:00
Thiago Crepaldi	acbf6d15c6	Improve LRScheduler tests (#4885 ) * LRScheduler tests added to the Transformer model * Refactored LRScheduler tests for the BERT Toy onnx example * Removed dead code	2020-08-21 16:18:30 -07:00
Thiago Crepaldi	5427a7e9af	Update LRScheduler to use scheduling similar to HuggingFace (#4880 )	2020-08-21 10:24:04 -07:00
Rayan-Krishnan	7589445e6e	Add ONNX BERT Frozen Weights and Save as ONNX Tests (#4859 ) Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>	2020-08-19 21:31:38 -07:00
liqunfu	25cc6158a8	update golden numbers (#4865 ) Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-08-19 20:52:10 -07:00
liqunfu	d7233c7c97	Fix training for models with dict input (#4842 ) This PR also includes: * Remove defaults from named tuples to support python 3.6 * Allows model which takes dicts as input * Adapts BERT finetuning example to run on the new frontend * Match numbers for BERT fine tuning model Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>	2020-08-19 18:36:36 -07:00
Thiago Crepaldi	7cc88ef7ed	Port legacy checkpoint API into new front-end (#4855 ) * Port legacy checkpoint API into new front-end This PR also fixes: * Warnings on ORTTrainer for improper tensor copies * Inaccurate LRScheduler tests using wrong LR * Stale DeepSpeed documentation * Minor code refactoring for Toy BERT tests * Move experimental state_dict() and load_state_dict() into checkpoint ns	2020-08-19 14:27:28 -07:00
Vincent Wang	5eaac31faa	support opset13 on transformers. (#4837 ) Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-08-19 11:13:37 +08:00
gwang-msft	dee7596724	Add a generic collection of session configurations to the SessionOptions (#4718 ) * adding generic configurations for session options * fix a build break on linux * fix training ci build break * fix training ci build break * addressed CR comments * fix training ci build break * move config_key from enum to string * add c# api * add python api * fix build break * move prepacking from 2 new api entries to session options configs * fix training ci build break * add python test, update some comments, move const key definition to avoid build break * addressed comments * move definitions of keys to common.h * move api to version 5 * remove accidental change in build.py * remove pragma to avoid build break * addressed CR comments * fix the python build break, and move location of config keys definition * small typo changes	2020-08-18 13:40:40 -07:00
ytaous	2605af9a0b	Fix for mainz model (#4744 ) * fix for mainz model * fix build * on comments * revert the extra check * on comments Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-08-18 11:47:19 -07:00
Thiago Crepaldi	f3b0c93a45	Fix issue preventing loss scaler to run due (#4833 ) `LossScaler.update()` was not being properly called due to the incorrect TrainStepInfo.all_finite assignment. In addition to this fix, _ORTTrainerModelDesc.is_finite was renamed to _ORTTrainerModelDesc.all_finite to make it more uniform with TrainStepInfo	2020-08-18 10:03:02 -07:00
Hariharan Seshadri	a3c95374c3	Support asymmetric paddings in CUDA Conv kernel (#4627 )	2020-08-18 02:09:30 -07:00
Rayan-Krishnan	24d9f4e0c3	Add More Extensive ONNX BERT Tests (#4827 ) Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>	2020-08-17 19:54:22 -07:00
Thiago Crepaldi	f933910ea3	Update LambConfig defaults to match backend (#4826 )	2020-08-17 16:58:14 -07:00


256 commits