Implement a pipeline event generator with a OneFWOneBW schedule in the timeline. Each pipeline stage contains the FW and BW passes of a subset of the model, and both are scheduled on one worker thread for each microbatch.
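As a rough illustration of what a one-forward-one-backward schedule looks like, here is a minimal Python sketch. It assumes the common formulation (a few warm-up forwards per stage, then alternating FW/BW, then draining the remaining backwards); the function name and the event tuples are hypothetical and are not taken from the actual event generator in this change.

```python
def one_fw_one_bw_schedule(num_stages, num_microbatches):
    """Return, per stage, an ordered list of ("FW" | "BW", microbatch) events.

    Hypothetical sketch of a 1F1B-style schedule, not the real generator.
    """
    schedule = []
    for stage in range(num_stages):
        # Earlier stages run more forwards before their first backward arrives.
        warmup = min(num_stages - stage - 1, num_microbatches)
        events = [("FW", mb) for mb in range(warmup)]
        next_fw, next_bw = warmup, 0
        # Steady state: one forward followed by one backward.
        while next_fw < num_microbatches:
            events.append(("FW", next_fw))
            next_fw += 1
            events.append(("BW", next_bw))
            next_bw += 1
        # Cooldown: drain the backward passes that are still pending.
        while next_bw < num_microbatches:
            events.append(("BW", next_bw))
            next_bw += 1
        schedule.append(events)
    return schedule
```

For example, `one_fw_one_bw_schedule(2, 3)` lets the first stage run two forwards before its first backward, while the last stage strictly alternates FW and BW.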
We want to implement the SoftmaxCrossEntropyLoss and NegativeLogLikelihoodLoss forward training ops for opset-12, but that requires the ONNX submodule to point to the latest commit so that it picks up the latest ONNX spec.
- Reverse-integrate changes from the *.in.proto files in the GitHub ONNX repo.
- Regenerate csharp/test/Microsoft.ML.OnnxRuntime.Tests/OnnxMl.cs
- Disable ONNX tests for ops that have no implementation for the latest opset.
1. fix misaligned address in atomic_add()
2. change GatherNDGradKernel to use atomic_add
3. enable/add UTs for GatherNDGrad and reduction_ops using half
- __CUDA_ARCH__ has no effect in .cc code; use HasCudaEnvironment() instead
4. verified convergence graph and perf test
- p100 is much slower than v100 on fp16
- fp16/128 needs the batch size reduced from 66 to 64 to avoid OOM
5. verify convergence test on Dev3/v100
TBD - broken UTs related to MatmulIntegerOpTest (works on v100/windows, though)
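To illustrate why a gather gradient presumably needs an atomic add, here is a minimal Python sketch reduced to the 1-D case. The function name and signature are hypothetical and this is not the CUDA kernel itself: when the same index appears more than once, its gradient contributions must be accumulated, and on a GPU each contribution would come from a different thread.

```python
def gather_grad_1d(input_size, indices, grad_output):
    """Scatter-add grad_output back to the gathered positions (1-D sketch).

    Hypothetical helper: each loop iteration stands in for one GPU thread.
    """
    grad_input = [0.0] * input_size
    for idx, g in zip(indices, grad_output):
        # Duplicate indices hit the same slot, so this must be an
        # accumulation; in a parallel kernel it has to be atomic.
        grad_input[idx] += g
    return grad_input
```

With `indices = [0, 2, 0]`, both the first and third gradient values land on slot 0, which is the case a plain (non-atomic) store would get wrong.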
This is a draft of graph cut and Wait/Record that demonstrates the cut and Wait/Record design. You can find the sub-models and profiling JSON under onnxruntime/test after running "onnxruntime_test_all --gtest_filter=GradientGraphBuilderTest.TrainingSession_WithPipeline"
Discussed with Faith: because the data size is very small and changes are gradual, there is no need to delete the old data. We want to keep all the history.
* Update GeluFusion to support the pattern from PyTorch 1.4
* Fix a bug where the check for an edge between mul2 and root was missing
* Update script to fuse gelu from PyTorch 1.4
* Add test for python optimizer
This change fixes #3129. When running onnxruntime as a DLL on Windows, CUDA does some internal cleanup when the process exits. After this, any call to CUDA causes a crash. Delay loading makes the thread_local destructor run after the CUDA cleanup, hence the crash.
Override the native package name. Keep the managed package name the same.
Specify package name for validation purposes.
Fix up validation package name parameter.
(1) Add performance test tool for bert model.
(2) Add accuracy test tool to compare inference results of original and optimized bert models.
(3) Add test data generator tool to create test data for onnxruntime_perf_test.exe
(4) Improve bert optimization script: Verify model producer for model_type; Add warning if model is not fully optimized.
(5) Add shape optimizer tool to assist in developing the optimization script.
(6) Update readme.
Previously, we put the "bin" folders of all the CUDA versions in the system PATH, with 10.2 at the front. It was a mess.
So I've removed all of them from the system PATH env, but I need to add one of them back through the build scripts.
(The problem only affects the C# tests, not the C/C++ tests that are forked from build.py.)
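A minimal Python sketch of adding a single CUDA "bin" folder back for a child process, assuming the build script passes an explicit environment to the test launcher. The helper name and the example path are hypothetical and are not taken from build.py.

```python
import os


def prepend_to_path(env, directory):
    """Return a copy of env with `directory` placed first in PATH."""
    new_env = dict(env)
    old_path = new_env.get("PATH", "")
    # Put the chosen folder first so its DLLs win over anything else on PATH.
    new_env["PATH"] = directory + (os.pathsep + old_path if old_path else "")
    return new_env


# Hypothetical usage: the CUDA folder below is a placeholder, not a real setting.
# env = prepend_to_path(os.environ, r"C:\path\to\cuda\bin")
# subprocess.run(test_command, env=env, check=True)
```

Returning a copy rather than mutating `os.environ` keeps the chosen CUDA version scoped to the process that needs it.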
* add dml gpu pipelines
* add x86 to the gpu dml dev build pipeline
* Enable DML x86 builds
* Fix uint64_t -> size_t warning
* fix warnings
* enable dml on x86 ci builds
* operatorHelper 773 error uint32_t vs uint64_t
* make x86 pipeline use the gpu pool
* more warnings
* fix x86 directml path
* make dml nuget package
* disable tf_pnasnet_large
* disable zfnet512
* make validation use wildcards
* disable x86 dml gpu tests
* add args.
* update gpu.yml
* change nupkg wildcard
* add debug statements
* package x86 dml nupkg
* don't drop managed nuget again from dml pipeline build
* Add DML EULA
* directml license should be renamed to not clobber the existing license
* casing on dml package....
* {} to ()
* fix license name
* disable dml from x86 ci
* typo and cr feedback
* remove featurizers
* ship the dml pdb as well