onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-04 23:59:56 +00:00

Author	SHA1	Message	Date
edgchen1	2cb8cb816f	Disable or update flaky tests, improve test random seed accessibility. (#3495) - Add output of test random seed - Allow setting of test random seed with environment variable - Disable / relax tolerance for flaky tests	2020-04-17 15:57:32 -07:00
manashgoswami	9fc2b6482b	Ort training README (#3404) Added README for ORT Training	2020-04-16 14:51:33 -07:00
M. Zeeshan Siddiqui	6c1ccb659f	SoftmaxCrossEntropyLoss-12 forward and backward kernel implementation. (#3465) * Update ONNX submodule commit to the latest. * build break. * SoftmaxCrossEntropyLoss: Forward and backward kernel implementation. * Revert "build break." This reverts commit 847cb50d294efbe6c09fa760e7cacf25bfb6146d. * Add more tests and misc clean up. * revert unintended changes. * PR feedback. * cleanup. * PR feedback.	2020-04-16 12:27:07 -07:00
Jesse Benson	644bc05830	Add Python API to set random seed: onnxruntime.seed(<seed>)	2020-04-15 09:44:48 -07:00
pengwa	2c7c45076b	MaxBatchSize E2E Test (#3454) * max batch size e2e test * update test data snapshot	2020-04-15 09:50:44 +08:00
edgchen1	4fa88a0a23	Remove cast to OpKernelContextInternal to get threadpool and directly use OpKernelContext. (#3523)	2020-04-14 14:30:26 -07:00
Tixxx	06b63975c0	Fix fp16 type mismatch when graph output is an fp32-only node (#3411) * verify output node before changing its type in mixed precision mode	2020-04-14 09:35:19 -07:00
edgchen1	ba7225f986	Update Graph SetInputs and SetOutputs for training (#3446) Fix training modification of Graph SetInputs() and SetOutputs(). Originally there were distinct code paths in Graph based on whether the graph was loaded from a GraphProto or created from scratch. The training modifications made that distinction a bit ambiguous - i.e., even though the Graph is loaded from a GraphProto for training, sometimes we rely on the other code path, e.g., to deduce the graph inputs after modifying it. Consequently, there was some odd behavior when using SetInputs(). For correctness, this change separates the cases where the graph is loaded from a GraphProto and where it is created from scratch.	2020-04-13 19:10:44 -07:00
M. Zeeshan Siddiqui	5d99f179b9	Merge pull request #3486 from microsoft/sedymche/merge_master_ort_training Merge from master into ort_training	2020-04-13 10:55:36 -07:00
Tixxx	f5ba9c922d	fix internal loss scale (#3483) * Changed internal loss scale to 1-D * added test Co-authored-by: root <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>	2020-04-10 14:13:48 -07:00
edgchen1	20c7dd9f5c	Remove orttraining/docker directory. (#3476) The docker images are not publicly available yet. Addressing PR comment: https://github.com/microsoft/onnxruntime/pull/3174#discussion_r390761308	2020-04-10 09:41:22 -07:00
Vincent Wang	03996c7c08	Fixes for Where, ConcatGrad and ReduceSumGrad (#3415) * Fixes for Expand, Where, ConcatGrad ReduceSumGrad. * Roll back expand, fix, add tests for reduce grad. * Roll back CPU Expand change. * Fix after merge. Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2020-04-10 19:35:32 +08:00
Sergii Dymchenko	84773c61c6	Rename ONNX OPTIONAL to OPTIONAL_VALUE.	2020-04-09 16:22:30 -07:00
liqunfu	1ddfe1249b	frontend test to use random seed (#3209) frontend test to use random seed	2020-04-08 10:03:07 -07:00
ytaous	b35468289a	View Op - new unit tests and add support for tensor memcpy by offset/size (#3439) * view ops UTs * update per comments * PR comments - code clean up * code clean up per comments Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-04-07 13:07:11 -07:00
Thiago Crepaldi	15e32b44fd	Merge pull request #3383 Merge from master into ort_training	2020-04-06 19:05:01 -07:00
Edward Chen	95707d22a5	Disable gradient clipping for E2E test.	2020-04-06 23:07:28 +00:00
Sherlock	a3ab2ba036	Reapply commit 131c65d; Fix memory regression issue. (#3423) * Reapply commit `131c65d` * fix merge error	2020-04-06 10:29:31 -07:00
edgchen1	82c1e1b3db	Enable loss scale input from Python frontend (#3327) Made some fixes to enable loss scale to be wired up to ORT from the Python frontend. In particular, now addition of loss scaling is done unconditionally if mixed precision is enabled. The generated loss scale input name is passed back to the frontend. Also fixed how inputs were added during the training graph configuration. Graph::SetInputs() was causing some issues - it seems to not be working correctly. Also added some mixed precision Python frontend tests. Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-04-03 16:02:14 -07:00
Sherlock	f437665360	Revert "Addressing PR comments (#3334)" (#3412) This reverts commit `131c65d23d`.	2020-04-03 11:59:47 -07:00
Thiago Crepaldi	d89e5d91a6	Disable GradientCheckerTest tests for GPU/Debug build (#3407)	2020-04-03 01:01:58 +00:00
Thiago Crepaldi	675035b1a8	Disable GradientCheckerTest tests for GPU/Debug build (#3407)	2020-04-02 18:00:54 -07:00
Sherlock	614eb438ae	Update Op's Domain and Version (#3356) * Update Nccl ops domain opset * Update ZeroGradient Domain OpSet * Update InPlaceAccumulator Domain OpSet * Update SoftmaxGrad Domain and OpSet * Update LayerNormalizationGrad Domain and OpSet * Update BatchNormGrad Domain and Opset * Update IsAllFinite Domain and Opset * Update DivGrad Domain and Opset * Update GatherGrad Domain and Opset * Update IsFinite Domain and OpSet * Update ReduceAllL2 Domain and Opset * Update MixedPrecisionScale Domain and Opset * Update AllOp Domain and Opset * Update GroupOp Domain and OpSet * Update ViewOp Domain and OpSet	2020-04-01 10:10:38 -07:00
Xueyun Zhu	efc8bd738f	add pipeline graph split script (#3275) * pipeline graph cut * add element type * add input wait event and shape info * shape inference * support multiple cuts * format script * address feedback * address feedback	2020-03-31 19:30:18 -07:00
Thiago Crepaldi	83c3da3fc0	Fix code-base after breaking API changes	2020-03-31 17:59:20 -07:00
Weixing Zhang	1bbc421884	Don't cast to fp16 in LayernormGrad (#3328) Co-authored-by: Weixing Zhang <wezhan@microsoft.com>	2020-03-28 19:07:32 -07:00
Sherlock	ffb2a3359e	Implement WhereGrad (#3343)	2020-03-27 19:10:40 -07:00
Tixxx	49e6043d07	support Huggingface's adamw (#3318) * add weight decay mode to support both pytorch and huggingface's adamw	2020-03-27 08:04:27 -07:00
ytaous	131c65d23d	Addressing PR comments (#3334) * PR comments * PR comments * PR comments * error out bad shape Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-03-26 18:43:30 -07:00
Xueyun Zhu	0a6ec0df56	Merge pull request #3285 from microsoft/xuzhu/merge_from_master Merge from master to ort_training	2020-03-26 12:10:13 -07:00
Sherlock	d143b41b81	Expose frozen_weights in PyTorch Frontend (#3317)	2020-03-26 11:26:54 -07:00
Wei-Sheng Chin	b38fc0d541	Add bias correction in Adam & Lamb for C++ frontend & python frontend (#3301)	2020-03-25 09:46:44 -07:00
Xueyun Zhu	e9877850a4	fix python error	2020-03-25 01:59:37 +00:00
Bowen Bao	6474801ceb	Update ort_trainer.py with lazy onnx export (#3244) * Delay onnx export to avoid extra info * handle cases where onnx model is provided at initialization * address comments * fix rebase error	2020-03-24 13:34:15 -07:00
Li-Wen Chang	98c28060b0	Aggregated Send/Recv (#3232) * Aggregated Send/Recv * fix typos * CR refine * CR refine * CR refine * Add scalar check. * typo * reformat * CR refine * Forgot to swap order in the implementation after spec changed * CR refine * Cr refine * add Send's input type checking	2020-03-24 10:20:11 -07:00
KeDengMS	d15c74e713	Implement pipeline event generator (#3206) Implement pipeline event generator with OneFWOneBW schedule in timeline. Each stage of pipeline contains FW and BW of a subset of the model and are scheduled in one worker thread for each microbatch.	2020-03-23 17:32:54 -07:00
Xueyun Zhu	8f7bd51f7a	fix pybind issue introduced by merge	2020-03-23 23:23:34 +00:00
Tixxx	7f610caca0	Make gradient clipping configurable. (#3243) * Make gradient clipping configurable. add control flag to c++ and python frontend	2020-03-23 12:21:48 -07:00
Xueyun Zhu	9dbc50c438	fix build break	2020-03-21 02:16:00 +00:00
liqunfu	d521efd904	refactor frontend (#3235) * refactor frontend * remove training python files from inferencing build * update according to reviewer's comments * merge pybind_state.cc * refactor pybind_state.cc * code clean up * missed a forward declaration in ort_pybind_state.cc * passed pytest * move training_session.py into a subfolder per reviewer's comment * add copyright Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-03-19 20:59:41 -07:00
edgchen1	d9f628cb1d	Remove orttraining/tools/scripts/profile directory. (#3268)	2020-03-19 14:13:05 -07:00
edgchen1	61e8a24340	Address PR comments (#3255) * Added comment for ntfw_remove(). * Rewrite WindowsEnv::DeleteFolder(), some other clean up.	2020-03-18 17:57:57 -07:00
edgchen1	c5576d70a6	Fix build issues (#3214) * Fixed issues with Python and inference-only build. * Handle ImportError for training imports. * fix windows build * fix compile error * fix centos build * fix windows build * fix compile error * Use SafeInt for allocation calculation, fix typo. Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-03-17 16:10:23 -07:00
Sherlock	4b2c8e884e	Update License Header (#3212)	2020-03-16 10:24:31 -07:00
Jesse Benson	3a7539e071	Update bert-base convergence values	2020-03-13 23:03:34 -07:00
Jesse Benson	dc11b82956	Tweak the dropout calculation.	2020-03-13 23:03:34 -07:00
Ethan Tao	2f1e997e5b	Merged PR 5686: fix P100/fp16 issues 1. misaligned address in atomic_add() 2. GatherNDGradKernel to use atomic_add 3. enable/add UTs for GatherNDGrad and reduction_ops using half - __CUDA_ARCH__ won't take effect on .cc code, leverage HasCudaEnvironment() instead 4. verified convergence graph and perf test - p100 is much slower than v100 on fp16 - fp16/128 need to reduce batch size from 66 to 64 to avoid OOM issue 5. verify convergence test on Dev3/v100 TBD - broken UTs related to MatmulIntegerOpTest (works on v100/windows, though)	2020-03-12 16:51:45 -07:00
Ke Deng	75025461e2	Initial implementation of graph cut and pipeline This is a draft of graph cut and wait/record to demonstrate cut and Wait/Record design. You may find sub models and profiling json under onnxruntime/test if you run "onnxruntime_test_all --gtest_filter=GradientGraphBuilderTest.TrainingSession_WithPipeline"	2020-03-12 16:51:45 -07:00
Edward Chen	80dd62a240	Enable CI for training.	2020-03-11 14:41:32 -07:00
Edward Chen	e542cfd0e0	Introduce training changes.	2020-03-11 14:39:03 -07:00

50 commits