* first attempt rocm training wheel
* modifications needed to python packaging pipeline for Rocm 4.1
* changes to not conflict with cuda
missed stage1 changes
remove package push
add option r to getopt
try again without python install
try again without python install
try again without python install
split pipelines and add back push to remote storage
try on cuda gpu pool
try again
try again
try running without az subscription set
try again on original pipeline
change pool
passing AMD Rocm whl on AMD-GPU pool
split rocm pipeline from cuda pipeline
remove comments
* try adding Rocm tests as well
* try with tests in place
* fix trailing ws
* add training data
* try again as root for tests
* use python3
* typo
* try to map video, render group into container
* try again
* try again
* try to avoid yum error code
* make UID 1001
* try without yum downgrade
* define rocm_version=None
* remove CUDA related comments for Rocm Dockerfile
* Don't pin nightly torch, torchvision, torchtext versions as they expire (for now nightly is required for Rocm 4.1)
* missed requirements-rocm.txt from last commit
* fix whitespace
* working on re-organizing js code for ortweb
* remove dup files
* move folder
* fix common references
* fix common es5
* add webpack to common
* split interface/impl
* use cjs for node
* add npmignore for common
* update sourcemap config for common
* update node
* adjust folder/path in CI and build
* update folder
* nit: readme
* add bundle for dev
* correct nodejs paths
* enable ORT_API_MANUAL_INIT
* set name for umd library
* correct name for commonjs export
* add priority into registerBackend()
* fix npm ci pwd
* update eslintrc
* revise code
* revert package-lock lockfileVersion 2->1
* update prebuild
* resolve comments
* update document
* revise eslint config
* update eslint for typescript rules
* revert changes by mistake in backend.ts
* add env
* resolve comments
* Simplified version of WebAssembly support to keep most of existing data structures and add cmake using Ninja and emcmake
* Clean up CMakeLists.txt and add an example to create and compute a kernel
* Load a model from bytes and remove graph building steps
* Add all cpu and contrib ops with mlas library
* WebAssembly build with Onnxruntime C/CXX API
* Use protobuf cmakefile directory instead of adding every necessary source file
* Fix invalid output at example
* add missing files
* Change an example to use Teams model and support ort mobile format
* add API for javascript
* fix input releasing in _ort_run()
* update API
* Let onnxruntime cmake build WebAssembly with option '--wasm'
* allow one-step building for wasm
* Make build script work on Linux and macOS
* Fix broken build from Windows command
* Enable unit test on building WebAssembly
* Resolve comments
* update build flags
* wasm conv improvement from: 1) GemmV; 2) Depthwise direct convolution 3x3; 3) Direct convolution 3x3
* Cleaned mlas unittest.
* use glob
* update comments
* Update baseline due to loss scale fix (#6948)
* fix stream sync issue (#6954)
* Enable type reduction in EyeLike, Mod, random.cc CPU kernels. (#6960)
* Update EyeLike CPU kernel.
* Update Mod CPU kernel.
* Update Multinomial CPU kernel.
* Slight improvement to Pad CPU kernel binary size.
* Update RandomNormal[Like], RandomUniform[Like] CPU kernels.
* Fix warning from setting multiple MSVC warning level options. (#6917)
Fix warning from setting multiple MSVC warning level options. Replace an existing /Wn flag instead of always appending a new one.
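The "replace instead of append" idea in the entry above can be sketched in Python. This is only an illustration of the string handling, not the actual build change; `set_warning_level` is a hypothetical helper name, and the sketch assumes the flags are a single space-separated string.

```python
import re

# Matches an MSVC warning-level option (/W0 .. /W4) as a whole token.
# Options such as /WX or /Wall are deliberately not matched.
_WARNING_LEVEL = re.compile(r"(?<!\S)/W[0-4](?!\S)")


def set_warning_level(flags: str, level: int) -> str:
    """Return flags with exactly one /W<level> option (hypothetical helper)."""
    new_flag = f"/W{level}"
    if _WARNING_LEVEL.search(flags):
        # A level is already present: swap it in place so the compiler
        # never sees two conflicting warning levels.
        return _WARNING_LEVEL.sub(new_flag, flags)
    # No level present yet: appending is safe.
    return f"{flags} {new_flag}".strip()
```

Called as `set_warning_level("/DWIN32 /W3 /EHsc", 4)`, the existing `/W3` is replaced rather than joined by a second level option.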
* MLAS: quantized GEMM update (#6916)
Various updates to the int8_t GEMMs:
1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before.
2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size.
3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator.
* Implement QLinearAveragePool with unit tests. (#6896)
Implement QLinearAveragePool with unit tests.
* Attention fusion detect num_heads and hidden_size automatically (#6920)
* fixed type to experimental session constructor (#6950)
* fixed type to experimental session constructor
Co-authored-by: David Medine <david.medine@brainproducts.com>
* Update onnxruntime_perf_test.exe to accept free dimension overrides (#6962)
Co-authored-by: Ori Levari <orlevari@microsoft.com>
* Fix possible fd leak in NNAPI (#6966)
* Release buffers for prepacked tensors (#6820)
Unsolved problems:
1. One test failure was caused by a bug in the cuDNN RNN kernels: they can allocate a buffer and only partially initialize it, and the garbage data near the tail of the buffer caused problems on some hardware. To attack this problem in a broader sense, should we add code to our allocators so that, during a memory fuzzing test, an allocated buffer is filled with garbage before it is returned to the caller?
2. Prepacking is used more widely than we know. For instance, the cuDNN RNN kernels also cache their weights. They mix several weight tensors together into a single buffer and never touch the original weight tensors again. This is the same idea as pre-packing, but they didn't override the virtual function and never tried to release those weight tensors, which wastes memory. It also seems that some other kernels behave similarly. It is worth finding out how much memory we can save by cleaning those up too.
3. Turning off memory pattern planning does increase memory fragmentation, leading to out-of-memory errors in some training test cases. Perhaps we can revisit the idea of pushing the kernel-creation stage earlier, so that during initializer deserialization we only avoid tracing those that will be prepacked.
* Enable type reduction for Range, ReverseSequence, ScatterND, Split, and Unique CPU kernels. (#6963)
* add CI
* fix test in ci
* fix flags for nsync in wasm build
* add copyright banner
* fix wasm source glob
* add missing exports
* resolve comments
* Perf gain by changing the packb width from 16 to 4 in GEMM for WASM.
Remove the unneeded direct conv from previous perf tuning.
* fix buildbreak introduced from latest master merge
* fix buildbreak in mlasi.h
* resolve all comments except MLAS
* rewrite the 3 packb-related functions for WASM_SCALAR separately rather than using #ifdef in each,
and other changes according to PR feedback in mlas.
* More complete scalar path in sgemm from Tracy.
* Fix edge case handling in depthwise conv2d kernel 3x3. where:
*) support input W==1 and H==1
*) recalc in accurate pad_right and pad_bottom
*) support hidden pad_right == 2 or pad_bottom == 2 when W == 1 or H==1 and no pad left/top
* Add more test coverage for conv depthwise from Tracy.
Fix one typo according to PR.
* resolve comments
* replace typedef by using
* do not use throw in OrtRun()
* output error message
Co-authored-by: Sunghoon <35605090+hanbitmyths@users.noreply.github.com>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Tracy Sharpe <42477615+tracysh@users.noreply.github.com>
Co-authored-by: David Medine <david.eric.medine@gmail.com>
Co-authored-by: David Medine <david.medine@brainproducts.com>
Co-authored-by: Ori Levari <ori.levari@microsoft.com>
Co-authored-by: Ori Levari <orlevari@microsoft.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
Co-authored-by: Chen Fu <chenfucs@gmail.com>
* Add robust dependency check for Python package
* Add version_info.py to .gitignore
* Fix Linux build
* Fix Windows CPU build
* Fix Windows 32-bit build
* Minor tweak
* Generate version_info.py earlier in onnxruntime_python.cmake
* Print a user-friendly message if cuDNN is not found in
* Relax version requirements for CUDA 11 - only the major version has to match
* Fix PATH environment variable to include CUDA 11 in 'Python packaging pipeline' (Windows/GPU)
* Fix the build with cuDNN 7
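The relaxed CUDA 11 rule mentioned above ("only the major version has to match") can be sketched as follows. This is a minimal illustration, assuming versions are dotted strings such as "11.0"; `cuda_major_matches` is a hypothetical name and not the function used in the package, and how older CUDA versions are checked is not shown here.

```python
def cuda_major_matches(built_with: str, found: str) -> bool:
    """Compare only the major component of two dotted CUDA version strings.

    Hypothetical helper: "11.0" and "11.2" are considered compatible,
    "11.0" and "10.2" are not.
    """
    return built_with.split(".")[0] == found.split(".")[0]
```

A dependency check could call this with the version the package was built against and the version found on the machine, and print a user-friendly message when it returns False.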
* Remove support for custom ops from the base minimal build as they contribute too much binary growth to an Android build.
Add ability to explicitly enable custom op support in a minimal build.
Change one minimal build CI to test adding custom op support (unit tests are run in that build to validate)
* model building
* fix build
* winml adapter model building api
* model building
* make build
* make build again
* add model building with audio op
* inplace and inorder fft
* add ifft
* works!
* cleanup
* add comments
* switch to iterative rather than recursive and use parallelization
* batched parallelization
* fft->dft
* cleanup
* window functions
* add melweightmatrix op
* updates to make spectrogram test work
* push latest
* add onesided
* cleanup
* Clean up building apis and fix mel
* cleanup
* cleanup
* naive stft
* fix test output
* middle c complete
* 3 tones
* cleanup
* signal def new line
* Add save functionality
* Perf improvements, 10x improvement
* cleanup
* use bitreverse lookup table for performance
* implement constant initializers for tensors
* small changes
* add matmul tests
* merge issues
* support add attribute
* add tests for double data type windowfunctions and minor cleanup
* stft onesided/and not tests
* cleanup
* cleanup
* clean up
* cleanup
* remove threading attribute
* forward declare orttypeinfo
* warnings
* fwd declare
* fix warnings
* 1 more warning
* remove saving to e drive...
* cleanup and fix stft test
* add opset picker
* small additions
* add onnxruntime tests
* add signed/unsigned
* fix warning
* fix warning
* finish onnxruntime tests
* make windows namespace build succeed
* add experimental flag
* add experimental api into nuget package
* add experimental api build flag and add to windows ai nuget package
* turn experimental for tests
* add minimum opset version to new experimental domain
* api cleanup
* disable ms experimental ops test when --ms_experimental is not enabled
* add macro behind flag
* remove unused x
* pr feedback
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
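The "switch to iterative rather than recursive" and "use bitreverse lookup table for performance" entries above describe a standard radix-2 technique. The following is a minimal Python sketch of that technique under stated assumptions (power-of-two length, forward transform only); it is not the operator's actual implementation, and the function names are illustrative.

```python
import cmath


def bit_reverse_table(n):
    """Lookup table mapping each index in [0, n) to its bit-reversed index."""
    bits = n.bit_length() - 1
    table = [0] * n
    for i in range(1, n):
        # Reuse the entry for i >> 1, then place the low bit of i on top.
        table[i] = (table[i >> 1] >> 1) | ((i & 1) << (bits - 1))
    return table


def fft(values):
    """Iterative radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    table = bit_reverse_table(n)
    # Reorder the input once so the butterflies can run in place, bottom-up.
    out = [complex(values[table[i]]) for i in range(n)]
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                even = out[start + k]
                odd = out[start + k + half] * w
                out[start + k] = even + odd
                out[start + k + half] = even - odd
                w *= step
        size *= 2
    return out
```

The table is computed once per length, so repeated transforms of the same size only pay for the butterfly passes. The independent `start` blocks in each pass are also what makes a batched or parallel version straightforward.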
* minimal_build with training ops
* Removing redundant comment from an earlier attempt at a fix
* Fixing a bad merge conflict resolution
* Responding to PR feedback
* tweaking the makefiles based on feedback
* combining two enable_training blocks in CMakeLists.txt
* Add ability to generate configuration that includes required types for individual operators, to allow build size reduction based on that.
- Add python bindings for ORT format models
- Add script to update bindings and help info
- Add parsing of ORT format models
- Add ability to enable type reduction to config generation
- Update build.py to only allow operator/type reduction via config
- simpler to require config to be generated first
- can't mix a type aware (ORT format model only) and non-type aware config as that may result in insufficient types being enabled
- Add script to create reduced build config
- Update CIs
* merged alloc_plan
* pass compilation
* Start running, incorrect allocation memory info
* add in comments
* fix a bug of recording pattern too early.
* debugging lifetime
* fix lifetime
* passed mnist
* in process of visualization
* Add code to generate chrome trace for allocations.
* in process of collecting fragmentation
* before rebuild
* passed mnist
* passed bert tiny
* fix the inplace reuse
* fix the exception of weight in pinned memory
* add guards to ensure the tensor is in AllocPlan
* add customized profiling
* debugging
* debugging
* fix the reuse of different location type
* add rank
* add the rank
* add fragmentation
* add time_step_trace
* Add summary for each execution step (total bytes, used/free bytes).
* add top k
* change type of top k parameter
* remove prints
* change heap to set
* add the name pattern
* add the usage for pattern
* add partition
* change to static class
* add custom group
* remove const
* update memory_info
* in process of adding it as runtime config
* change the memory profiling to be an argument
* add some comments
* add checks to record memory_info in training session
* set the "local rank setting" to correct argument.
* addressing comments
* format adjustment
* formatting
* remove alloc_interval
* update memory_info.cc to skip session when there is no tensor for a particular memory type
* fix memory_info multiple iteration seg-fault
* consolidate mainz changes
* fixed some minor errors
* guard by ORT_MINIMAL_BUILD
* add ORT_MEMORY_PROFILE flag
* added compiler flag to turn on/off memory profiling related code
* clean up the code regarding comments
* add comments
* revoke the onnx version
* clean up the code to match master
* clean up the code to match master
* clean up the code to match master
Co-authored-by: Jesse Benson <benson.jesse@gmail.com>
Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-yclzsf.eastus.cloudapp.azure.com>
* Use readelf for minimal build binary size checks.
The on-disk size grows in 4KB chunks which makes it hard to see how much growth an individual checkin causes.
The only downside is that the sum of the sections is larger than the on-disk size (presumably things get packed smaller on disk and some of the section alignment constraints can be ignored).
* Remove unused function
1. Update the ProtoSrc path. The old one is not used anymore.
2. Regenerate OnnxMl.cs
3. Delete some unused code in tools/ci_build/build.py
4. Avoid setting intra_op_param.thread_pool_size in ModelTests in OpenMP build.
5. Fix a typo in the C API pipeline.
* save python dictionary to hdf5 representation and load an hdf5 file into a python dictionary
* unit tests for saving data to and loading data from hdf5 file
* Added Onnxruntime_GCOV_COVERAGE flag for Android.
* Set CMAKE_SYSTEM_NAME explicitly for Android.
* Added GCOV_PREFIX option to collect code coverage data.
Added a new python script to generate code coverage info.
Modified build pipeline to generate Android code coverage info.
* Added build command line option --android_coverage
* Added a comment describing the GCOV environment variables
* Fixed PEP8 issues.
* Added --android_coverage option to the build command.
* Increased Android emulator memory from 3K to 8K.
* Increased Android partition-size from 2GB to 4GB to overcome no-space-left-on-device error
* Removed source_dir from command line args.
* Use cwd absolute path to run tests.
* Added commands to output the contents of /data/local/tmp on the emulator.
* Added run_adb_shell function.
* Format changes.
* Removed keywd argument cwd.
* Removed Android in the --build_dir path.
* Removed commands added for debugging.
* Removed extra new-lines.
* Fix MacOs build pipeline failures by uninstalling openssl before running build script.
* Revert "Fix MacOs build pipeline failures by uninstalling openssl before running build script."
This reverts commit 90d0568fe533e9456c20d061a2d435c8fea48266.
* Change dir to the build directory where the tar file is copied.
* Changed the option from --android_coverage to --code_coverage
* Moved steps to generate Android code coverage to run_nnapi_code_coverage.sh
* Require --android option if --code_coverage is specified.
* No code coverage needed for onnx_test_runner.
* Expect that the emulator is running when the script is executed.
* Fixed the title in the buildpipeline step.
* Fixed the formatting issue.
* Added a command line argument, ORT_ROOT, to run_nnapi_code_coverage.sh script
Co-authored-by: Satya Jandhyala <satyajandhyala@Satyas-Mac-mini.local>