* Revert "Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)"
This reverts commit 47888392ab.
* Add BatchNorm kernel for ROCm (#9014)
* Add BatchNorm kernel for ROCm, update BN test
* correct epsilon_ setting; limit min epsilon
* Upgrade ROCm CI pipeline for ROCm 4.3.1 and permit run inside container (#9070)
* try to run inside 4.3.1 container
* no \ in container run command
* remove networking options
* try with adding video render groups
* add job to build docker image
* try without 1st stage
* change alpha, beta to float
* try adding service connection
* retain huggingface directory
* static video and render gid
* use runtime expression for variables
* install torch-ort
* pin sacrebleu==1.5.1
* update curves for rocm 4.3.1
* try again
* disable determinism and only check the tail of the loss curve, with a much larger threshold of 0.05
* disable RoBERTa due to high run variability on ROCm 4.3.1
* put reduction unit tests back in
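The "only check the tail of the loss curve" item above can be illustrated with a minimal sketch. The function name, the `tail` length, and the use of an absolute (rather than relative) difference are assumptions made for illustration; only the 0.05 threshold comes from the commit message, and this is not the actual CI script.

```python
def tail_within_threshold(actual, expected, tail=10, threshold=0.05):
    """Return True if the last `tail` points of two loss curves agree.

    Hypothetical helper: compares only the end of the curves, using an
    absolute per-step difference (an assumption, not the CI's definition).
    """
    if len(actual) != len(expected):
        raise ValueError("loss curves must have the same number of steps")
    # Keep only the final `tail` (actual, expected) pairs.
    pairs = list(zip(actual, expected))[-tail:]
    return all(abs(a - e) <= threshold for a, e in pairs)
```

Checking only the tail tolerates early-step divergence from non-deterministic kernels while still catching runs that fail to converge to a similar loss.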
* Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)
* make work for both rocm 4.2 and rocm 4.3.1
* fix rocm 4.3.1 docker image reference
* fix CUDA_VERSION to ROCM_VERSION
* fix ReduceConsts conflict def
* add ifdef to miopen_common.h as well
* trailing ws
Co-authored-by: wangye <wangye@microsoft.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
* Fixing MORE mlas unittest failures in POWER (#8673)
* Ensure ms-experimental domain Audio Ops build in mac pipeline (#8857)
* Globally enable ms-experimental ops
* change meaning of ms_experimental to mean *all* ms_experimental ops. Some experimental ops, such as audio ops, will still be enabled globally without this flag.
* add cmath
* add cmath to signal_defs.cc
* move audio back into experimental, verify on mac
* remove experimental from mac builds
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
* Remove cpuinfo from WCOS builds (#9076)
* Fix a bug for Openvino Python binding (#9130)
* Fix default initialization value in C API header (#9126)
* fix default initialization value in C API header
* Fix conflicts
* Nits
* Do not generate nuget symbol packages on Linux
* fix name conflict in 1.9 for Fix default initialization value in C API header
* Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)
* make work for both rocm 4.2 and rocm 4.3.1
* fix rocm 4.3.1 docker image reference
* fix CUDA_VERSION to ROCM_VERSION
* fix ReduceConsts conflict def
* add ifdef to miopen_common.h as well
* trailing ws
* remove OrtCUDAProviderOptions() and simply set value
* revert to use custom ctor and fix tests
Co-authored-by: austinpagan <fossum@us.ibm.com>
Co-authored-by: Sheil Kumar <smk2007@gmail.com>
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Tiago Koji Castro Shibata <ticastro@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
* Adding async fetching for webgl backend (#8951)
* Adding async fetching for webgl backend
* fix PR comments and CI failure.
* fixing a bug
* adding a flag
* Enable linking in exception throwing support library when building onnxruntime wasm. (#8973)
* Enable linking in exception throwing support library when building onnxruntime webassembly containing onnxruntime-extensions.
* Add flag in build.py to enable linking exceptions throwing library.
* Update onnxruntime-extensions document and bind custom_ops build flag with use_extensions.
* Update doc.
* Update cgmanifest.json.
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
* Remove document text from error message in a couple of ops (#9003)
* do not add pkg wheel entry to the index html file if it already exists (#9004)
* do not add pkg wheel entry to the index html file if it already exists
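A minimal sketch of the idempotent update described above, assuming the index is a plain HTML file with one anchor line per wheel. The function name and the entry format are hypothetical and are not taken from the actual pipeline script.

```python
import html


def add_wheel_entry(index_html: str, wheel_name: str) -> str:
    """Append a link for `wheel_name` unless an identical line already exists.

    Hypothetical helper: assumes one `<a href=...>` line per wheel.
    """
    safe = html.escape(wheel_name, quote=True)
    entry = f'<a href="{safe}">{safe}</a><br>'
    if entry in index_html.splitlines():
        return index_html  # already listed, leave the index untouched
    # Make sure the new entry starts on its own line.
    if index_html and not index_html.endswith("\n"):
        index_html += "\n"
    return index_html + entry + "\n"
```

Running the function twice with the same wheel name returns the same text, so re-running a nightly job does not produce duplicate links.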
* [js/web] fix ort web e2e test (#9025)
* Fix cmake POWER10 detection
Recent commit 60c98a8 changed variable mlas_common_srcs which affects
POWER10 detection.
* Fix Where op type reduction processing (#9033)
* Update type reduction script to track Where Op's second input type.
* Clean up op_kernel_type_control.h includes.
* Use more maintainable include.
* Fix ROCm wheels CI pipeline break by installing latest protobuf from source (#9047)
* install protobuf from source
* fix rm command in Dockerfile
* fix options on rm command
* fix cd into protobuf source directory
* try again
* remove strip step
* debug list the files
* ls on /usr
* more debug
* more debug
* adjust LD_LIBRARY_PATH
* try remove protobuf before ORT build
* [js/web] a bugfix and add tests for wasm proxy worker (#9048)
* [js/web] add tests for wasm proxy worker
* fix script src override
* Set onnxruntime_DISABLE_RTTI to default OFF (#9049)
Co-authored-by: Du Li <duli1@microsoft.com>
Co-authored-by: Zuwei Zhao <4123666+Zuwei-Zhao@users.noreply.github.com>
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: liqun Fu <liqfu@microsoft.com>
Co-authored-by: Yulong Wang <yulongw@microsoft.com>
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
* fast reduction for reducemean (#8976)
* Adding preprocessor checks for torch version during torch cpp extensions compilation (#8989)
* custom autograd func memory refinement (#8993)
* Release torch tensor referenced by torch gradient graph (created in PythonOp)
* Update orttraining/orttraining/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc
* refine with comments
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
* Fix issues in TensorRT EP (#8996)
* fix big engine load issue and add cuda_cpu_alloc
* remove redundancy
* fix minor issues
* [js/web] fix karma launch with chrome headless (#8998)
* Update Nuget Package Pipeline to CUDA11.4 and TensorRT8 on Windows (#9000)
* Update to CUDA11.4 and TensorRT-8.0.3.4
* update trt pool, remove cudnn from setup_env_gpu.bat
* revert pool
* test gpu package pipeline on t4
* back out changes
* back out changes
Co-authored-by: George Wu <jywu@microsoft.com>
* Fix fuzz testing build blocking release. (#9008)
* add model local function support (#8540)
* updates for picking onnx commit
* add tests filter to c# tests
* plus test fixes
* fix versioning for contrib ops
* fix tests
* test filter for optional ops
* more versioning related updates
* fix test
* fix layernorm spec
* more updates
* update docs
* add more test filters
* more filters
* update binary size threshold
* update docs
* draft - enable model local function
* enable model local functions in ORT
* update to latest rel onnx commit
* plus tests
* plus more updates
* plus updates
* test updates
* Fix for nested functions + shape inference
* plus bug fix and updates per review
* plus fixes per review
* plus test updates
* plus updates per review
* plus fixes
* fix a test
Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: baijumeswani <bmeswani@microsoft.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: stevenlix <38092805+stevenlix@users.noreply.github.com>
Co-authored-by: Yulong Wang <yulongw@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
* Add netstandard2.0 to nuget managed package.
Re-does PR that was backed out due to packaging pipeline changes.
Allows deprecation of netstandard1.1 in the following release as netstandard2 is the preferred lowest level framework.
* copy changes from trt_and_mem
* second edits
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* change to cuda 11.4
* build with cuda 11.4
* Update Dockerfile.ubuntu_cuda11_1_tensorrt7_2
* add cmake extra defines
* cmake architectures
* fix cmake arch
* Delete ubuntu-18.04.Dockerfile
* Rename Dockerfile.ubuntu_cuda11_1_tensorrt7_2 to Dockerfile.ubuntu_cuda11_4_tensorrt7_2
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* removing previous ort args
* rename to cuda 11.4
* remove cuda 10_2
* delete trt 7.1
* remove 7.1
* Passing in cuda architecture to reduce build time
* always add submodule sync due to recursive cloning
* fix run command
* add and
* take away unused arms and share python installation script
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml
* Update Dockerfile.tensorrt
* cleanup file
* install python directly on dockerfile - move to scripts in future
* Update Dockerfile.custom-trt-perf
* adding cuda 11.1 for missing libnvrtc.so.11.1
* Delete install_python.sh
* Include pytorch_export_contrib_ops in inference builds
Rename / move it from tools/python/register_custom_ops_pytorch_exporter
to onnxruntime/python/tools/pytorch_export_contrib_ops.
Rationale for inclusion in inference builds:
This code is potentially useful for anyone using ORT, not just training.
Rationale for new name:
"Contrib op" is the nomenclature used within ORT to refer to the set of
ops that are not in the standard op set but are included by default with
ORT. This is more specific than "custom op", which is what the PyTorch
exporter uses to refer to any non-standard op.
Step 1 of addressing #8818. After this is merged I will update the docs.
* Enable test_pytorch_export_contrib_ops.py in CI
Fixes AB#1342330
* Use PROTOBUF_LIB instead of protobuf::libprotobuf
* Moved setdlopenflags to _pybind_state.py
* Copy the generated _pybind_state.py to required location for Windows.
* Move flatbuffers SessionState access code into helper functions instead of duplicating them between InferenceSession and SessionState.
* Trim VerifyEachNodeIsAssignedToAnEp(), e.g., disable verbose log output in a minimal build.
* special case concat and split when sizes are equal
* add tests for 16 and 32 inputs with same dim
* add tests for 16/64 inputs on concat or 16/64 outputs on split
* try eliminate windows warning
* outter => outer
* Change the strided copy to switch on data size not data type.
Move to header so we can reduce on the enabled types.
Setup type reduction for Concat now that it's using this implementation.
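The idea of switching the strided copy on data size rather than data type can be sketched as follows. This is an illustration in Python with hypothetical names, not the ORT C++ implementation: the copy only needs the element width in bytes, so every type of the same width (for example float32 and int32) shares one code path.

```python
import struct


def strided_copy(src: bytes, element_size: int, count: int, stride: int) -> bytes:
    """Gather `count` elements of `element_size` bytes, `stride` elements apart.

    Hypothetical helper: the element type is never inspected, only its size.
    """
    out = bytearray()
    step = stride * element_size
    for i in range(count):
        start = i * step
        if start + element_size > len(src):
            raise IndexError("strided read runs past the end of the source")
        out += src[start:start + element_size]
    return bytes(out)


# float32 and int32 are both 4 bytes wide, so they use the same call.
floats = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
ints = struct.pack("<4i", 1, 2, 3, 4)
every_other_float = strided_copy(floats, 4, 2, 2)
every_other_int = strided_copy(ints, 4, 2, 2)
```

In a templated C++ kernel the same choice would mean instantiating once per element width instead of once per type, which is presumably what lets the enabled types be reduced.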
* test running hf bert-large
* try again
* try again
* include other models
* correct names
* disable deberta-v2-xxlarge
* avoid torch.distributed
* add compare json loss and perf for bert-large to test
* fix sed expression
* remove pytest
* add more models
* move unit tests u
* display samples/sec