- Only set them as targets for the ORT nuget package
- Use OrtPackageId as the condition for inclusion, if installed
- need to do the nuget restore via msbuild so that this property is set correctly
- Add desktop-only version of the C# sln as there is no way to exclude the mobile specific csproj's from an sln
- use this when applicable if someone is running build.py with the `--build_nuget` flag
Other
- remove attempt to include symbols in the nuget package as nuget doesn't support symbols in native packages
- update build.py to use `nuget` and not a windows specific path and filename for a linux build with `--build_nuget`
* add use_tensorrt build option
* Add use_tensorrt to running tests
* add use_tensorrt for Windows
* make trt ep to skip backend test
* make trt ep to skip backend test
* Fix bug
* Add/Modify description
* modify for debug
* swtich pool to test
* modify to debug
* modify to debug
* add vobersity
* refine the code
* refine the code
* refine the code
* fix flake8 warning
* refine the code
* add pre_load check for trt as well as add cupti lib to cuda depedencies
* modify script to make trt build path the same as cuda
* show error message when user wants to run TensorRT but TensorRT is not installed in the env
* fix bug
* fix bug
* add trt lib for manylinux
* include cuda_dependencies for trt
* rewrite the condition to throw exception
* make code more compact
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erronous couts added
* erronous entry rectified
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erronous couts added
* erronous entry rectified
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Please commit error message and rectification of param.context
* Alignment fixed
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Changed the string to OpenVINO_GPU
* hanged OpenVINO to to OpenVINO_CPU
* Onnxruntime updated API for memory location
* Removing Duplicate LOG Error
* Tensor.h removed DeviceType function. Updated comment
* API Comments updated
* Removing changes to Provider Indo
* Erronous commit
* Removing Extra logs
* Merge CMAKE
* Not copy from a local location
* Duplicate Entry
* Remove extra line
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
* Only serialize runtime optimization records container if non-empty.
* Remove runtime optimizations from onnxruntime/core/flatbuffers/schema/README.md as it's not completely implemented yet.
* Disable partial runtime optimization implementation by default.
ORT format model runtime optimization implementation is in progress.
This change adds a build.py option to disable the partial runtime optimization implementation, adds CI builds to test it, and disables runtime optimizations in mobile package builds.
* removing warnings which are causing errors from torch and changing flags for Windows
* adding MKL library resolution and comments
* cleaning up the code
* fixing onnxruntime_python file for windows build
* fix the include order to aovid the python_d.lib issue on win debug build
* changes for warnings, typos and other comments
* merge conflict
* adding fix for mkl library error
* Revert "adding fix for mkl library error"
This reverts commit 73b87c73c2.
* fix for dll path for windows
* typo for dll path
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* model caching changes for 2021.4
Signed-off-by: Your Name <you@example.com>
* changed the ov version check
* Minor changes added
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added support for external data format
Starting from OpenVINO 2021.4 version, OpenVINO-EP
will support onnx models with Weights saved in external
file location.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Introduced Hetero/Multi options for perf_test
Enabled to use HETERO/MULTI device feature from
OpenVINO-EP using the onnxruntime_perf_test tool.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* cleaned up CMake code for older OV version support
OV 2020.3 is now longer supported by OpenVINO-EP.
This check is not required now.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Add option to disable graph partitioning
Added a option to diable graph partitioning
during build time for OpenVINO-EP.
with this option, when the model is not fully
supported on OpenVINO-EP, the model fully fall
backs to default CPU EP (MLAS).
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Changed the flag for diabling graph partitioning
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixes the flake8 check error
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added changes for disable graph partition option
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixed flake8 indentation error
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: Your Name <you@example.com>
* Include pytorch_export_contrib_ops in inference builds
Rename / move it from tools/python/register_custom_ops_pytorch_exporter
to onnxruntime/python/tools/pytorch_export_contrib_ops.
Rationale for inclusion in inference builds:
This code is potentially useful for anyone using ORT, not just training.
Rationale for new name:
"Contrib op" is the nomenclature used within ORT to refer to the set of
ops that are not in the standard op set but are included by default with
ORT. This is more specific than "custom op", which is what the PyTorch
exporter uses to refer to any non-standard op.
Step 1 of addressing #8818. After this is merged I will update the docs.
* Enable test_pytorch_export_contrib_ops.py in CI
Fixes AB#1342330
* Add command to skip tests
* Remove support for OV_2021.3_LTS and ov_2021.1
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Removed request_id parameter from all references
request_id parameter was being used with ov_2020.3
release. Starting from 2020.4 OV release, input_name
paramater is being used instead to get the
KernelContext_GetInput.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Enabling CI Logs in the branch
* CI Commits to enable logs
* Enable CI Print
* Added Imagescaler op to the supported op's list
Fixes test_tiny_yolo_V2 opset 8 model to support
fully on OV-EP. This model is the older variation
of tiny_yolo_v2 model which has Imagescaler op.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added ops to fully support yolov3 model
-Added changes to support yolov3 opset 10 model
fully on CPU_FP32.
-This also increases the operator coverage for GPU
hardware. There by enabling yolov3 model on GPU
with fewer subgraphs.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Enabling tiny_yolov3 model fully on CPU
->Enabled tiny_yolov3 model fully on CPU.
-> Also reduces the number of subgraphs
to infer this model on GPU
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Adding GatherND op support for CPU and GPU
->This enables yolov3_pytorch model to work
with fewer subgraphs on CPU and GPU Devices.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixes Albert model for ISV customer
ConvTranspose op was getting rejected
due to a condition. Fixed it.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Disabling this 4 cpp tests for openvino-ep
These unit tests are failing with special conditions
for conv_transpose op with output_shape attribute.
so disabling them for now.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Docker file changes for 2021.4-v3.1
* Remvoing duplicate code
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* ReduceMax No dimension supported
* Fixes failing protobuf issue for docker
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Excluding openvinoep type for convtranpose test
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Disabled 2 Failing convtranspose tests with TensorRT EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
Co-authored-by: Aravind Gunda <aravindx.gunda@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>
* Expose symbols in onnx and protobuf namespaces in python when building with --enable_external_custom_op_schemas
* Add external onnx and protobuf files to wheel
* Added an example to demonstrate external custom ops use-case
* Added a Linux build pipeline to test external custom ops
* Enable selecting custom ops in onnxruntime-extensions.
* Move cmake_helper.py.
* Remove over-indented spaces.
* Add doc.
* Remove onnxruntime-extensions from git submodules, and user should pass path of onnxruntime-extensions for build.
* Modify doc.
* Remove argument --enable_onnxruntime_extensions and use --onnxruntime_extensions_path.
* Fix build error.
* Fix build error.
* Use onnxruntime_extensions_path.
* support both submodule and external source folders
* refinement
* Update cgmanifest.json
* Support building onnxruntime-extensions from either git submodule or pre-pulled path.
* Update doc.
* more standard name
* update docs
* add the copyright header
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
* Ported changes / bug fixes from torch/ort.
* Fixed formatting
* Renamed function
* Renamed module_ to module.
* Revert "Renamed module_ to module."
This reverts commit b17fc114b3db20d174283811d90592b5b8154c19.
* Include pybind common header to fix linker errors on windows debug.
* Fix to generation of > 1 custom op.
Co-authored-by: Ashwin Hari <ashari@microsoft.com>
* integrate eager mode source codde; build with cmake and integrate the python test
* Adding the python path for importing libraries in the Eager mode
* fix clang break;check if training and python enabled
* handling the linking of torch libraries across multiple platforms
* merge and fix the naming
* add build instruction
Co-authored-by: Abhishek Jindal <abjindal@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: ajindal1 <abjindal@microsoft.com>
1. Update SDLNativeRules from v2 to v3. The new one allows us setting excluded paths.
2. Update TSAUpload from v1 to v2. And add a config file ".gdn/.gdntsa" for it.
3. Fix some parentheses warnings
4. Update cmake to the latest.
5. Remove "--x86" build option from pipeline yaml files. Now we can auto-detect cpu architecture from python. So we don't need to ask user to specify it.
SparseTensor support
Implement Builder pattern
Fix support for 1-D and 2-D COO indices
Implement and test CSR support.
Handle shape inference for SparseTensors
Implement conversion for COO, CSR and tests.
Address the case where constant sparse initializer is the output.
Implement test infra for SparseTensors
Implement SparseDenseMatMul for Csr and COO and tested it.
Add hash for SparseToDenseMatMul
Finish shared provider refactor
Refactor GetOrCreate to Create
Working on py interface
Expose OrtDevice and use it in allocate_numpy
Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
Add and test to_cuda()
Add accessors to format specific indices
Test values and indices views, read-only flag, after GC access
Add sparse related methods to OrtValue
Re-work SparseTensor wrapper, add OrtValue methods
Rework numpy_array_to_cuda/to_cpu
Add run_with_ort_values
Add models and test sparse_mat_mul with run_with_ort_values
Refactor sparse tensor to use a single buffer
Ifdef x86 Eigen CSR sparse matmul implementation
Exclude broken test, check for string type when copying cross device
Split pybind schema, regenerate docs, add exclusion
Conditionally exclude schema module
Update docs fix cuda build
Add test to a filter and renerate JS docs
Add conversion and test string support for sparse tensors
Exclude conversion utils from minimal build
Add CUDA Memcpy and adjust provider interfaces
* Add ability to generate ios static framework
* Fix typos
* Add pod cache clean, update some comments of previous commit
* Fix CI failure with newly added cpuinfo library
* Update test model (CoreML requires node has a name)
* Addressed CR comments
Switched the code to C++17. To build ONNX Runtime on old distros like CentOS 7, you need to install a newer GCC from additionary repos. If you build onnxruntime with the newer GCC, typically the result binary can't be distributed to other places because it depends on the new GCC's runtime libraries, something that the stock OS doesn't have. But on RHEL/CentOS, it can be better. We use Red Hat devtoolset 8/9/10 with CentOS7 building our code. The new library features(like std::filesystem) that not exists in the old C++ runtime will be statically linked into the applications with some restrictions:
1. GCC has dual ABI, but we can only use the old one. It means std::string is still copy-on-write and std::list::size() is still O(n). Also, if you build onnxruntime on CentOS 7 and link it with some binaries that were built on CentOS 8 or Ubuntu with the new ABI and export C++ symbols directly(instead of using a C API), the it won't work.
2. We still can't use std::optional. It is a limitation coming from macOS. We will solve it when we got macOS 11 build machines. It won't be too long.
3. Please avoid to use C++17 in CUDA files(*.cu). Also, the *.h files that they include(like core/framework/float16.h). This is Because CUDA 10.2 doesn't support C++17. You are welcome to use the new features in any *.cc files.
* checkin transformers pipeline
* add docker requirements
* only trigger linux cpu
* temp remove tf instalation due to numpy version conflicts
* test numpy>=1.7
* revert numpy and disable transformers
* add coloredlogs
* enable shape_infer_helper and install transformers when needed
* pip3?
* testtest
* enable more tets
* line too long
* remove pytorch1.4 test and added back some onnx files
* add tests
* copy dir
* disable 2 teests
* trim lines
* add missing onnx
* fix type
* fix version conflicts
* install psutil
* change file path
* mfix path
* remove cached files
* add back attention fusion test
* labeled the shape infer test as slow
* fix
* enable tf2onnx test and enable pytest
* refactor path
* fix typo
* add cwd
1. Fix training e2e pipeline. The failure was caused by my recent change #7632. The fix is adding "--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=70" to the build parameters because the machines are with V100 GPUs.
2. Simplify Nuphar pipeline. It doesn't need to install a separated ONNX version(1.5.0)
3. Fix a problem that run_dockerbuild.sh ignored OS version parameter. Now because it starts to take effect, I also set python version to the system default one(3.8 for ubuntu 20.04)
1. Update manylinux build scripts. This will add [PEP600](https://www.python.org/dev/peps/pep-0600/)(manylinux2 tags) support. numpy has adopted this new feature, we should do the same. The old build script files were copied from https://github.com/pypa/manylinux, but they has been deleted and replaced in the upstream repo. The manylinux repo doesn't have a manylinux2014 branch anymore. So I'm removing the obsolete code, sync the files with the latest master.
2. Update GPU CUDA version from 11.0 to 11.1(after a discussion with PMs).
3. Delete tools/ci_build/github/linux/docker/Dockerfile.manylinux2014_cuda10_2. (Merged the content to tools/ci_build/github/linux/docker/Dockerfile.manylinux2014_cuda11)
4. Modernize the cmake code of how to locate python devel files. It was suggested in https://github.com/onnx/onnx/pull/1631 .
5. Remove `onnxruntime_MSVC_STATIC_RUNTIME` and `onnxruntime_GCC_STATIC_CPP_RUNTIME` build options. Now cmake has builtin support for it. Starting from cmake 3.15, we can use `CMAKE_MSVC_RUNTIME_LIBRARY` cmake variable to choose which MSVC runtime library we want to use.
6. Update Ubuntu docker images that used in our CI build from Ubuntu 18.04 to Ubuntu 20.04.
7. Update GCC version in CUDA 11.1 pipelines from 8.x to 9.3.1
8. Split Linux GPU CI pipeline to two jobs: build the code on a CPU machine then run the tests on another GPU machines. In the past we didn't test our python packages. We only tested the pre-packed files. So we didn't catch the rpath issue in CI build.
9. Add a CentOS machine pool and test our Linux GPU build on real CentOS machines.
10. Rework ARM64 Linux GPU python packaging pipeline. Previously it uses cross-compiling therefore we must static link to C Runtime. But now have pluggable EP API and it doesn't support static link. So I changed to use qemu emulation instead. Now the build is 10x slower than before. But it is more extensible.
* Update the operator documentation generation
- Make layout a little nicer
- Update to latest supported operators including training
- Fix some links that are broken when the docs content is copied to github-pages
- Fix incorrect usage of 'onnx.ai.ml' as the default domain
- ML ops are now separated from the real default domain of 'onnx.ai'
- Include CPU, CUDA and training kernels
- exclude DNNL as it's not an EP we own
* There are separate paths for CUDA and CUDNN as they are not guaranteed to be in the same location on a Windows machine. Use the CUDNN path when looking for the CUDNN library.
* Enable validation of both contrib ops and operator kernels in build
Filter generation so it's deterministic
Add ability for CI to publish the md files as build artifacts if they differ so a developer can download and add to their PR to resolve any diffs.
Remove workarounds for github-pages as that will now link to the github docs which display correctly
I saw a test timeout in our nodejs packaging pipeline. I'm not sure if it is because it ran slower than before or it's a deadlock issue. Increasing the timeout will be helpful for investigating such issues.
* test
* [gwang] make cmake compile work
* [gwang] enble build apks
* some build update
* add simple sigmoid test android project and cmake
* add build.py
* refine and remove unused import lib
* address CR comments
* remove unnecessary files
* add README.md
* minor update
* remove
* minor change
* fix ci failure and minor update
* fix typo in project folder
* remove
* remove and minor update
* refine
* minor fix
* fix
* fix typo
* add gradle spotlessApply task to fix CI failure
* fix
* enable spotlessApply in build gradle
* revert some changes
* minor fix
* run spotless apply for format
* address CR comments and fix CI version and format
* refine
* Refine
* address comments
* refine
* refine
* modify
* reformat
* resolve version conflicts
* minor update
* minor update
* address comments
* minor update
Co-authored-by: Guoyu Wang <wanggy@outlook.com>
* initial draft for kernel invoke api
* initial implementation of kernel invoker
* [eager] fix build on Mac
* [eager] increment input name in kernel invoker
* temp fix for type in eager mode
* use global default log manager
* rollback the previous commit since it break linux build
* Revert "rollback the previous commit since it break linux build"
This reverts commit 58c2c3423a.
* Eager Mode: fix linking on macOS
* optimizer_execution_frame: ignore unused lambda capture (model_path)
* fix link issue
* ORTInvoker: set correct input argument tensor element proto types
Do not set a type proto on output arguments to allow ORT to deduce them
* ORTInvoker: create only one logging manager
* Minor fix to set execution provider type correctly. (#7000)
Co-authored-by: Chandru Ramakrishnan <chandru-r@github.com>
* training fix
* support config output ml values in frame, so we can use it to implement inplace update
* Fix range loop error while building. (#7087)
Co-authored-by: Chandru Ramakrishnan <chandru-r@github.com>
* Conditionally link with nsync_cpp if not windows. (#7151)
Co-authored-by: Chandru Ramakrishnan <chandru-r@github.com>
* Fixed initialization order in ORT kernel invoker (#7342)
* Updated constructor of ort_kernel_invoker to take a logger.
* Changed linking order.
* Updated test.
* add inplace ut
* add build option
* Update include/onnxruntime/core/eager/ort_kernel_invoker.h
Co-authored-by: Derek Murray <Derek.Murray@microsoft.com>
* resolve comments in pr
* fix build break;merge from master
* fix build break
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: Aaron Bockover <abock@microsoft.com>
Co-authored-by: Chandru Ramakrishnan <41447659+chandru-r@users.noreply.github.com>
Co-authored-by: Chandru Ramakrishnan <chandru-r@github.com>
Co-authored-by: Derek Murray <Derek.Murray@microsoft.com>
* first attempt rocm training wheel
* modifications needed to python packaging pipeline for Rocm 4.1
* changges to not conflict with cuda
missed stage1 changes
remove package push
add option r to getopt
try again without python install
try again without python install
try again without python install
split pipelines and add back push to remote storage
try on cuda gpu pool
try again
try again
try running without az subscription set
try again on original pipeline
change pool
passing AMD Rocm whl on AMD-GPU pool
split rocm pipeline from cuda pipeline
remove comments
* try adding Rocm tests as well
* try with tests in place
* fix trailing ws
* add training data
* try again as root for tests
* use python3
* typo
* try to map video, render group into container
* try again
* try again
* try to avoid yum error code
* make UID 1001
* try without yum downgrade
* define rocm_version=None
* remove CUDA related comments for Rocm Dockerfile
* Dont pin nightly torch torchvision torchtext versions as they expire (for now nightly is required for Rocm 4.1)
* missed requirements-rocm.txt from last commit
* fix whitespace
* working on re-organizing js code for ortweb
* remove dup files
* move folder
* fix common references
* fix common es5
* add webpack to common
* split interfact/impl
* use cjs for node
* add npmignore for common
* update sourcemap config for common
* update node
* adjust folder/path in CI and build
* update folder
* nit: readme
* add bundle for dev
* correct nodejs paths
* enable ORT_API_MANUAL_INIT
* set name for umd library
* correct name for commonjs export
* add priority into registerBackend()
* fix npm ci pwd
* update eslintrc
* revise code
* revert package-lock lockfileVersion 2->1
* update prebuild
* resolve comments
* update document
* revise eslint config
* update eslint for typescript rules
* revert changes by mistake in backend.ts
* add env
* resolve comments
* Simplified version of WebAssembly support to keep most of existing data structures and add cmake using Ninja and emcmake
* Clean up CMakeLists.txt and add an example to create and compute a kernel
* Load a model from bytes and remove graph building steps
* Add all cpu and contrib ops with mlas library
* WebAssembly build with Onnxruntime C/CXX API
* Use protobuf cmakefile directory instead of adding every necessary source file
* Fix invalid output at example
* add missing files
* Change an example to use Teams model and support ort mobile format
* add API for javascript
* fix input releasing in _ort_run()
* update API
* Let onnxruntime cmake build WebAssembly with option '--wasm'
* allow one-step building for wasm
* Make build script working on Linux and MacOS
* Fix broken build from Windows command
* Enable unit test on building WebAssembly
* Resolve comments
* update build flags
* wasm conv improvement from: 1) GemmV; 2) Depthwise direct convolution 3x3; 3) Direct convolution 3x3
* Cleaned mlas unittest.
* use glob
* update comments
* Update baseline due to loss scale fix (#6948)
* fix stream sync issue (#6954)
* Enable type reduction in EyeLike, Mod, random.cc CPU kernels. (#6960)
* Update EyeLike CPU kernel.
* Update Mod CPU kernel.
* Update Multinomial CPU kernel.
* Slight improvement to Pad CPU kernel binary size.
* Update RandomNormal[Like], RandomUniform[Like] CPU kernels.
* Fix warning from setting multiple MSVC warning level options. (#6917)
Fix warning from setting multiple MSVC warning level options. Replace an existing /Wn flag instead of always appending a new one.
* MLAS: quantized GEMM update (#6916)
Various updates to the int8_t GEMMs:
1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before.
2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size.
3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator.
* Implement QLinearAveragePool with unit tests. (#6896)
Implement QLinearAveragePool with unit tests.
* Attention fusion detect num_heads and hidden_size automatically (#6920)
* fixed type to experimental session constructor (#6950)
* fixed type to experimental session constructor
Co-authored-by: David Medine <david.medine@brainproducts.com>
* Update onnxruntime_perf_test.exe to accept free dimension overrides (#6962)
Co-authored-by: Ori Levari <orlevari@microsoft.com>
* Fix possible fd leak in NNAPI (#6966)
* Release buffers for prepacked tensors (#6820)
Unsolved problems:
1. One test failure was caused by a bug in Cudnn rnn kernels, when they can allocate a buffer and partially initialize it, the garbage data near tail of the buffer caused problem in some of the hardware. To attack this problem in a broader sense, should we add code in our allocators, and during a memory fuzzing test, fill an allocated buffer with garbage before returning to the caller?
2. Prepacking is used more widely than we know. For instance, Cudnn rnn kernels also cache their weights. They mix several weight tensors together into a single buffer, and never touch the original weight tensor anymore. This is the same idea with pre-pack, but they didn't override the virtual function, and they never tried to release those weight tensors, leading to memory waste. It also seems to me that there are some other kernels have similar behavior. Wonder how much memory we can save if we try to cleanup those too.
3. Turning off memory pattern planning does increase memory fragmentation, leading to out of memory error in some training test cases. Perhaps we can revisit the idea of pushing kernels-creation stage earlier, and then during initializer deserialization, we only avoid tracing those that will be prepacked.
* Enable type reduction for Range, ReverseSequence, ScatterND, Split, and Unique CPU kernels. (#6963)
* add CI
* fix test in ci
* fix flags for nsync in wasm build
* add copyright banner
* fix wasm source glob
* add missing exports
* resolve comments
* Perf gain by make packb wide to 4 from 16 on GEMM for WASM.
Remove no need direct conv in previous perf tuning.
* fix buildbreak introduced from latest master merge
* fix buildbreak in mlasi.h
* resolve all comments except MLAS
* rewrite packb related 3 functions for WASM_SCALAR seperately rather than using #ifdef in each.
and other changes according to PR feedback in mlas.
* More complete scalar path in sgemm from Tracy.
* Fix edge case handling in depthwise conv2d kernel 3x3. where:
*) support input W==1 and H==1
*) recalc in accurate pad_right and pad_bottom
*) support hidden pad_right == 2 or pad_bottom == 2 when W == 1 or H==1 and no pad left/top
* Add more test coverage for conv depthwise from Tracy.
Fix one typo according to PR.
* resolve comments
* replace typedef by using
* do not use throw in OrtRun()
* output error message
Co-authored-by: Sunghoon <35605090+hanbitmyths@users.noreply.github.com>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Tracy Sharpe <42477615+tracysh@users.noreply.github.com>
Co-authored-by: David Medine <david.eric.medine@gmail.com>
Co-authored-by: David Medine <david.medine@brainproducts.com>
Co-authored-by: Ori Levari <ori.levari@microsoft.com>
Co-authored-by: Ori Levari <orlevari@microsoft.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
Co-authored-by: Chen Fu <chenfucs@gmail.com>
* Add robust dependency check for Python package
* Add version_info.py to .gitignore
* Fix Linux build
* Fix Windows CPU build
* Fix Windows 32-bit build
* Minor tweak
* Generate version_info.py earlier in onnxruntime_python.cmake
* Print a user-friendly message if cuDNN is not found in
* Relax version requirements for CUDA 11 - only the major version has to match
* Fix PATH environment variable to include CUDA 11 in 'Python packaging pipeline' (Windows/GPU)
* Fix the build with cuDNN 7
* Remove support from custom ops from the base minimal build as they contribute too much binary growth to an Android build.
Add ability to explicitly enable custom op support in a minimal build.
Change one minimal build CI to test adding custom op support (unit tests are run in that build to validate)
* model building
* fix build
* winml adapter model building api
* model building
* make build
* make build again
* add model building with audio op
* inplace and inorder fft
* add ifft
* works!
* cleanup
* add comments
* switch to iterative rather than recursive and use parallelization
* batched parallelization
* fft->dft
* cleanup
* window functions
* add melweightmatrix op
* updates to make spectrogram test work
* push latest
* add onesided
* cleanup
* Clean up building apis and fix mel
* cleanup
* cleanup
* naive stft
* fix test output
* middle c complete
* 3 tones
* cleanup
* signal def new line
* Add save functionality
* Perf improvements, 10x improvement
* cleanup
* use bitreverse lookup table for performance
* implement constant initializers for tensors
* small changes
* add matmul tests
* merge issues
* support add attribute
* add tests for double data type windowfunctions and minor cleanup
* stft onesided/and not tests
* cleanup
* cleanup
* clean up
* cleanup
* remove threading attribute
* forward declare orttypeinfo
* warnings
* fwd declare
* fix warnings
* 1 more warning
* remove saving to e drive...
* cleanup and fix stft test
* add opset picker
* small additions
* add onnxruntime tests
* add signed/unsigned
* fix warning
* fix warning
* finish onnxruntime tests
* make windows namespace build succeed
* add experimental flag
* add experimental api into nuget package
* add experimental api build flag and add to windows ai nuget package
* turn experimental for tests
* add minimum opset version to new experimental domain
* api cleanup
* disable ms experimental ops test when --ms_experimental is not enabled
* add macro behind flag
* remove unused x
* pr feedback
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
* minimal_build with training ops
* Removing redundant comment from an earlier attempt at a fix
* Fixing a bad merge conflict resolution
* Responding to PR feedback
* tweaking the makefiles based on feedback
* combining two enable_training blocks in CMakeLists.txt
* Add ability to generate configuration that includes required types for individual operators, to allow build size reduction based on that.
- Add python bindings for ORT format models
- Add script to update bindings and help info
- Add parsing of ORT format models
- Add ability to enable type reduction to config generation
- Update build.py to only allow operator/type reduction via config
- simpler to require config to be generated first
- can't mix a type aware (ORT format model only) and non-type aware config as that may result in insufficient types being enabled
- Add script to create reduced build config
- Update CIs
* merged alloc_plan
* pass compilation
* Start running, incorrect allocation memory info
* add in comments
* fix a bug of recording pattern too early.
* debugging lifetime
* fix lifetime
* passed mnist
* in process of visualization
* Add code to generate chrome trace for allocations.
* in process of collecting fragmentation
* before rebuild
* passed mnist
* passed bert tiny
* fix the inplace reuse
* fix the exception of weight in pinned memory
* add guards to ensure the tensor is in AllocPlan
* add customized profiling
* debugging
* debugging
* fix the reuse of differnt location type
* add rank
* add the rank
* add fragmentation
* add time_step_trace
* Add summary for each execution step (total bytes, used/free bytes).
* add top k
* change type of top k parameter
* remove prints
* change heap to set{
* add the name pattern
* add the useage for pattern
* add partition
* change to static class
* add custom group
* remove const
* update memory_info
* in process of adding it as runtime config
* change the memory profiling to be an argument
* add some comments
* add checks to recored meomry_info in traaining session
* set the "local rank setting" to correct argument.
* addressing comments
* format adjustment
* formatting
* remove alloc_interval
* update memory_info.cc to skip session when there is no tensor for a particular memory type
* fix memory_info multiple iteration seg-fault
* consolidate mainz changes
* fixed some minor errors
* guard by ORT_MINIMAL_BUILD
* add ORT_MEMORY_PROFILE flag
* added compiler flag to turn on/off memory profiling related code
* clean up the code regarding comments
* add comments
* revoke the onnx version
* clean up the code to match master
* clean up the code to match master
* clean up the code to match master
Co-authored-by: Jesse Benson <benson.jesse@gmail.com>
Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-yclzsf.eastus.cloudapp.azure.com>
* Use readelf for minimal build binary size checks.
The on-disk size grows in 4KB chunks which makes it hard to see how much growth an individual checkin causes.
Only downside is that the sum of the sections is larger than the on-disk size (assumably things get packed smaller on disk and some of the section alignment constraints can be ignored)
* Remove unused function
1. Update the ProtoSrc path. The old one is not used anymore.
2. Regenerate OnnxMl.cs
3. Delete some unused code in tools/ci_build/build.py
4. Avoid set intra_op_param.thread_pool_size in ModelTests in OpenMP build.
5. Fix a typo in the C API pipeline.
* save python dictionary to hdf5 representation and load an hdf5 file into a python dictionary
* unit tests for saving data to and loading data from hdf5 file
* Added Onnxruntime_GCOV_COVERAGE flag for Android.
* Set CMAKE_SYSTEM_NAME explicityly for Android.
* Added GCOV_PREFIX option to collect code coverage data.
Added a new python script to generate code coverage info.
Modified build pipeline to geneate Android code coverage info
* Added build command line option --android_coverage
* Added a comment describing the GCOV environment variables
* Fixed PEP8 issues.
* Added --android_coverage option to the build command.
* Increased Android emulator memory from 3K to 8K.
* Increased Android partition-size from 2GB to 4GB to overcome no-space-left-on-device error
* Removed source_dir from command line args.
* Use cwd absolute path to run tests.
* Added commands to output the contents of /data/local/tmp on the emulator.
* Added run_adb_shell function.
* Format changes.
* Removed keywd argument cwd.
* Removed Android in the --build_dir path.
* Removed commands added for debugging.
* Removed exxtra new-lines.
* Fix MacOs build pipeline failures by uninstalling openssl before running build script.
* Revert "Fix MacOs build pipeline failures by uninstalling openssl before running build script."
This reverts commit 90d0568fe533e9456c20d061a2d435c8fea48266.
* Change dir to the build directory where the tar file is copied.
* Changed the option from --android_coverage to --code_coverage
* Moved steps to generate Android code coverage to run_nnap_code_coverage.sh
* Require --android option if --code_coverage is specified.
* No code coverage needed for onnx_test_runner.
* Expect that the emulator is running when the script is executed.
* Fixed the title in the buildpipeline step.
* Fixed the formatting issue.
* Added a command line argument, ORT_ROOT, to run_nnapi_code_coverage.sh script
Co-authored-by: Satya Jandhyala <satyajandhyala@Satyas-Mac-mini.local>
Follow up to #5811 to automate cleanup of the build docker image cache.
Added a script and build definition to clean up docker images that haven't been accessed recently.
* Remove nGraph Execution Provider
Pursuant to nGraph deprecation notice: https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/nGraph-ExecutionProvider.md#deprecation-notice
**Deprecation Notice**
| | |
| --- | --- |
| Deprecation Begins | June 1, 2020 |
| Removal Date | December 1, 2020 |
Starting with the OpenVINO™ toolkit 2020.2 release, all of the features
previously available through nGraph have been merged into the OpenVINO™
toolkit. As a result, all the features previously available through
ONNX RT Execution Provider for nGraph have been merged with ONNX RT
Execution Provider for OpenVINO™ toolkit.
Therefore, ONNX RT Execution Provider for **nGraph** will be deprecated
starting June 1, 2020 and will be completely removed on December 1,
2020. Users are recommended to migrate to the ONNX RT Execution Provider
for OpenVINO™ toolkit as the unified solution for all AI inferencing on
Intel® hardware.
* Remove nGraph Licence info from ThirdPartyNotices.txt
* Use simple Test.Run() for tests without EP exclusions
To be consistent with rest of test code.
* Remove nGraph EP functions from Java code
This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry.
Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date and building images every time wastes build agent resources.
With this change, a given build image can be looked up in a cache container registry and if present, pulled, and otherwise, built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository.
The cache container registry will need to be cleaned up periodically. This is not automated yet.
* Make NNAPI EP build on non-Android Platform
* minor updates
* Adress CR comments
* Fix build issue using Windows, address CR comments
* Fix linux build warnings
* Fix for test failure
* Fix for test failure
* Fix model_tests failure
* Enabling Multi Device support for UEP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor fix added
*Added a simple fix to determine OpenVINO
version for Arm build as well
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* cpu send/recv
* clean up send/recv
* remove unused code
* assert and nccl option for mnist
* add build option to enable build with only cpu. Without this, nccl is always enabled which will break build on machine that only contains cpu
* Add USE_MPI distinct from USE_NCCL/USE_HOROVOD
* fix
* fix
* exclude cpu send/recv for machines without mpi
Co-authored-by: Tim Harris <tiharr@microsoft.com>
* Create an Azure Pipeline to merge cpp and python e2e pipelines into one. Still keep cpp 2e2 pipeline until this new pipeline is stable.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Add kernels for AMD GPU.
This PR is mostly about GPU kernels for ROCm EP. Due to similar GPU programming language (CUDA and HIP and similar math library calls, one principle in ROCM EP design is to share CUDA kernels as much as possible for ROCm. Thus, the script amd_hipify.py has been created for converting CUDA kernels to ROCm HIP kernels automatically during compilation phase. But, for some reasons such as perf issue, syntax difference..., some converted kernels need some manual intervention. These kernels will be checked in the repo physically for now. In order to avoid manual intervention, the plan is to refactor CUDA kernels to make them portable between CUDA EP and ROCm EP as much as possible.
Please refer to "HIP Porting Guide" for details.
* like lamb, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2, current AMD GPU compiler has perf issue for kernel parameter which is a structure with "pass by value".
* Use hipMemsetAsync and add checks on HIP calls.
* move the generated files to build folder.
Co-authored-by: Jesse Benson <jesseb@microsoft.com>
* Implement Hetero in UEP
* Added security checks to take valid Hetero combinations
as device type
* Integrating Hetero features
* Get the statistics Report in Debug Mode
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Passing right device type for vadm_baackend
Added simple fix to pick the right device type
when using vadm_backend with Hetero as well.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixed batching logic for 2020.4 and above
* Fixed flake8 PEP8 errors
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor Fixes Added
*Added security checks for device_type passed
in for Hetero build during run time
*code cleanup
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor changes Added
*Fixed batch_size bug in vadm_backend
*code cleanup
*Documentation updated for Hetero
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/
ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …
Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.
Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……
The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
replace number matching with relaxed comparison in frontend tests
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Introduce sparse_initializers support.
Convert them to dense on model load and prune graph_proto_
so they don't consume space. Convert back to sparse on ORT Format model save.
Implement serializing sparse initializers to OrtFormat.
Fix Model::ToProto() to return original sparse initializers
Set a flag that graph_sync is needed when loading a simple ORT Format model.
otherwise nothing is resolved.
Add ORT Format history to README.md
ifdef MINIMAL build for DenseToSparseTensorInitializer
Allow duplicate initializers to support existing models.
Issue a warning instead of aborting.
* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.
Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
* Build ACL and ArmNN with custom library path
* Define import to tensor as a separate function for maintenance and readability
* Enabled optimized depthwise convolution for ACL v20.02
* Check operation status for ACL and ArmNN Execution Providers
* Enabled fused operation for convolution-activation
Co-authored-by: Andrei-Alexandru <andrei-alexandru.avram@nxp.com>
* use run_orttraining_test_orttrainer_frontend_separately to work around a sporadic segfault.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* add tensor board, remove torch.distributed.lanuch because ort nccl depends on MPI. Use MPI to launch parallel training.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* add the ios ci build.
* no dependency on mac ci pipeline.
* fix the command line.
* keep sync
* automatically retrieve sdpath
* fix the case errors and warnings
* fix the vlog switch issue.
* add parallel flag for build.
* update the display name of the pipeline.