Commit graph

23 commits

Author SHA1 Message Date
Dmitri Smirnov
d9de054eb5
Multi-Lora support (#22046)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-09-30 15:59:07 -07:00
Tianlei Wu
23996bbbbe
[CUDA][ROCm][Training] Fix cuda/rocm provider info hash (#19398)
When I test a new provider option, the training pipeline failed. I found
that training uses hash code of provider info to try get provider
instance. If a provider option is not used in hashing, the provider
instance fetched from cache might have different configuration for that
option.

Here I fix the hashing to use all provider options (except the default
Arena config that cannot be set from python API since training is used
with PyTorch in most cases).

Fixed a few obvious typo in the touched files.

Add regression test cases.
2024-02-05 14:35:57 -08:00
Changming Sun
23a91c8ba8
Fix warning C4003 in ORT python binding code (#18612)
### Description
Fix warning C4003 in ORT python binding code.

### Motivation and Context
It's better to fix the warning instead of suppressing it.
2023-11-30 08:07:47 -08:00
Ashwini Khade
02333293de
Removed all the deprecated python training code and related tests and utils (#18333)
### Description
Motivation for this PR is code cleanup.

1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
2023-11-17 18:19:21 -08:00
shaahji
5a623dca01
Python API to check whether collective ops are available or not (#17730)
Python API to check whether collective ops are available or not

### Description
<!-- Describe your changes. -->

Adding an API to check whether collective ops are available or not.
Since there is no independent MPI enabled build, this flag can be used
on Python front for branching. Specifically, to conditionally enable
tests.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Flag to be used in Python to check whether onnxruntime supports
collective ops or not. Handy for conditionally enabling/disabling tests
and for other branching decisions.
2023-09-29 14:11:05 -07:00
cao lei
dd72192cf4
ExecutionProvider API refactor - move allocator from EP level to SessionState level and indexed by OrtDevice (#15833)
### Description
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice. By this change, EP level will shift the burden of
maintaining allocators, which will be user friendly for EP developers

---------

Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-06-19 17:44:45 -07:00
Yuhong Guo
41dcf0d32e
Expose build information in dynamic lib (#15643)
### Description
<!-- Describe your changes. -->
1. Add Build Info API to onnx.
2. Fix compile error while building onnxruntime_benchmark in MacOs.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. When Onnxruntime lib is serving online, we need a way to detect how
this lib is built. This PR helps the developer to get the build
information using `strings` such as git branch, git commit id, build
type and cmake cxx flags, which is showed as follows.


![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png)


![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png)

If the build env has no git, there will be no git related infor:


![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png)


3. Fix the following compile error while building benchmark in MacOs.

![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png)

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-04-28 21:57:31 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or is data heavy


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
Changming Sun
d175e87a1f
Delete eager mode code and increase minimal required python version to 3.8 (#15450)
### Description
1. Delete eager mode code.
2. Increase the minimal required python version to 3.8.
2023-04-10 16:00:04 -07:00
Dmitri Smirnov
0d7855ea5a
Re-work global objects dependancies in pybind layer. (#14941)
### Description
Re-work handling of static objects in pybind.
Make sure we ref-count Environment from Sessions.

The following has been done:

- Make global objects function static. This ensures that the objects are
constructed on demand. The first object constructed is destructed last.
This is platform independent.
- Make global objects ownership shared as suggested by pybind since they
are not surfaced at Python level, and they cannot be referred to by
dependent python objects. Verified that all python objects are GCed
before globals are destroyed. This takes care of inference session
dependency on environment and its default logger and this is also
platform independent.
- Utilize pybind atexit mechanism to clear execution providers and
unload CUDA libraries (as suggested by
https://github.com/microsoft/onnxruntime/pull/14903) . Since this is
registered for module exit, it takes place before any other global are
destroyed and clears shared objects state or even unloads the libraries.
This should also work in a platform independent way.

### Motivation and Context

- Global object destruction order is managed manually and that becomes
source of trouble. We want to make it deterministic and platform
independent.
- Frequent hangs in Python layer due to the static object's destruction
order. Some of the Python session objects are being garbage collected
after main exits and they require ORT environment to be alive. (Use
after free)
2023-03-10 13:55:31 -08:00
Dmitri Smirnov
8d87fdcfa1
Add GetVersionSting API for C++, C# and Python (#14873)
### Description
Added APIs.

### Motivation and Context
Addresses https://github.com/microsoft/onnxruntime/issues/14584

Cc: @Craigacp cp
2023-03-02 17:11:07 -08:00
Dmitri Smirnov
5d729839b5
Support loading widechar paths on windows (#14066)
### Description
Make GetRuntimePath() and LoadDynamicLibrary() operate on platform
specific paths

### Motivation and Context
This addresses https://github.com/microsoft/onnxruntime/issues/14063
2022-12-30 16:30:11 -08:00
cloudhan
9e649d1ac4
Allow CUDA EP enable or disable TunableOp via session options and environment variable (#13601)
This ports #13116 from ROCm EP to CUDA EP
2022-11-15 14:43:54 +08:00
cloudhan
fc12abf6b1
Enable/Disbale tunable GEMM by using tunable switch in provider options and env var (#13116)
Related PRs #12853

This allows the user enable/disbale tunable GEMM on demand.
2022-10-19 22:35:08 -07:00
Pranav Sharma
a8b0f57d1a
Fix eager mode pipeline to accommodate recent allocator change. (#13000) 2022-09-20 12:53:46 +08:00
Wei-Sheng Chin
dc486d146b
Make ORT callable from various Pytorch compilers (LazyTensor, TorchDynamo, etc) (#10460)
* Make ORT as Pytorch JIT backend

LORT likely doesn't work with aten fallback so we only test LORT in its own CI.

* Revert changes to enable external CUDA allocator. Will add it later.

Revert "Revert changes to enable external CUDA allocator. Will add it later."

This reverts commit d5487f2e193014c805505afae8fb577c53667658.

Fix external allocator

* Relax tolerance and remove commented code

* Print more information in CI

* Fix pointer

* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.

* Use Pytorch master branch as all PRs are merged

Fix

* Refine based on cpplint feedbacks

* Revert changes to allow custom CUDA allocator in public APIs

* Use torch.testing.assert_close

* Use unittest framework

* Switch docker repo

* Rename *.cpp to *.cc

* Address comments

* Add comment

* Use same pipeline file for eager and lort pipelines

* Address comments

* Add yaml comment

* Fix cmake files

* Address comments

* Rename flags, remove printing code, remove dead comment
2022-08-22 09:40:40 -07:00
Xavier Dupré
a805a49363
Move OrtValueVector from onnxruntime-training to onnxruntime (#11176)
* Move OrtValueVector from onnxruntime-training to onnxruntime

* disable dlpack on onnxruntime

* disable dlpack

* dlpack

* opaque inlcuded in any cc file of the python binding

* fix type issue

* fix incomplete name

* remove len()

* remove unused parameter

* black

* black

* black

* remove unused import

* add unit test to check the output type

* black

* lint

* lint

* lint

* fix method name

* Update onnxruntime/python/onnxruntime_pybind_ortvalue.cc

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Update onnxruntime/python/onnxruntime_pybind_ortvalue.cc

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Update onnxruntime/python/onnxruntime_pybind_ortvalue.cc

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Update onnxruntime/python/onnxruntime_pybind_ortvalue.cc

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Update onnxruntime/python/onnxruntime_pybind_ortvalue.cc

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Update onnxruntime/test/python/onnxruntime_test_python_sparse_matmul.py

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Update onnxruntime/test/python/onnxruntime_test_python_sparse_matmul.py

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* check return type of C API

* lint

* lint

* fix missing ;

* fix type issue

* fix merge issue

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2022-06-15 09:36:28 +02:00
Changming Sun
ec4362f8f3
Enable more static analysis warnings and enable the analyzer for training cpu (#10176) 2022-01-27 11:17:20 -08:00
Abhishek Jindal
4aa7cee0d8
Abjindal/clean eager backend (#10055)
* clearing map for eager mode backends

* clearing map for eager mode backends manager

* making OrtBackendsManager an extern variable and trying to delete it

* cleaning backends manager when the python interpret exits

* adding ifdef for eager mode code

* disabling warning for pybind state file

* disabling warning for python module file

* running clang auto format and reducing redundancy

* remove new line

* moving declaration to a new header file

* adding the header file for eager mode for python module

* removing source files for eager mode

* add source file for python module in eager mode

* Update orttraining/orttraining/python/orttraining_python_module_eager.h

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2022-01-19 14:20:09 -08:00
Jeff Daily
c8789d3047
[ROCm] static re-hipify of CUDA EP to ROCm EP, now a shared provider (#8877)
* re-hipify all rocm EP sources

* fix all other files affected by re-hipify

* add cuda_provider_factory.h to amd_hipify.py

* do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration

* Fix ReduceConsts template specialization introduced in #9101.

Fixes the error when building for ROCm 4.3.1:

error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)

* fix flake8 error in amd_hipify.py

* speed up hipify with concurrent.futures

* flake8 fix in amd_hipify.py
2021-10-14 15:15:51 -07:00
Tang, Cheng
48737091c0
resolve the provider options before create training session in orttrainer (#9199)
* resolve the provider options before create training session in orttrainer

* Update orttraining/orttraining/python/orttraining_pybind_common.h

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* support clear the training ep instance pool

* fix status error

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2021-10-12 09:30:45 -07:00
Tang, Cheng
4dc0ddf606
support register external ep lib information (#8897)
* support register external ep lib inforation; make eager mode share the same ep pools with training workloads

* fix inference code

* fix build break

* fix the message
2021-08-31 20:51:22 -07:00
Tang, Cheng
ae7f2d824d
Share the execution provider instance for training (#8719)
* seperate the training python module; share the execution proivder instance

* fix build break

* fix cuda test crash; reorg the python module code base

* se correct env

* use provider customized hash func

* fixbuild break

* fix rocm break

* use const ref in argument

* rename the file

* move hash func to trainiing module
2021-08-27 16:23:35 -07:00