onnxruntime/tools/ci_build/build.py
Thiago Crepaldi 8a890ddfd7
Sync ORTModule branch with master and fix tests (#6526)
* Deprecate Python global configuration functions [Part 1] (#5923)

Enable options to be set via execution provider (EP)-specific options and log deprecation warning from current global configuration functions.

* remove dnnl_dll_path from post build copy (#6142)

* Model Fusion For Bart (#6105)

Fusion fix for Bart models

* Unify IExecutionProvider and IExecutionProviderFactory interfaces (#6108)

* Remove Provider_IExecutionProvider and make the internal IExecutionProvider usable by shared providers
* Change Provider_IExecutionProviderFactory to be the core version.

* Enable running the mnist_training sample without cuda (#6085)

Signed-off-by: George Nash <george.nash@intel.com>

* nnapi add min max support (#6117)

* Fix CUDA test hang: (#6138)

- Make condition check in `CUDAAllocatorTest` to ensure CUDA device is present.

* Fix TensorRT kernel conflict issue for subgraphs of control flow operators (#6115)

* add static subgraph kernel index

* change kernel naming to avoid conflicts

* Add gradient registration for Abs. (#6139)

* Partition initial optimizer state for Zero-1 (#6093)

* Initial changes

* Working changes

* Working changes

* Cleanup

* fix windows CI

* Review comments

* review comments

* Fix edge case in BFCArena where allocation failures could lead to an infinite loop. (#6145)

#4656

* Revert "work around of the build break in mac (#6069)" (#6150)

This reverts commit 3cae28699b.

* Fix clean_docker_image_cache.py detection of image pushes. (#6151)

Fix clean_docker_image_cache.py detection of image pushes. They were being ignored because the expected HTTP status code was wrong. For pushes, it's 201 instead of 200.

* MLAS: add NEON version of int8 depthwise convolution (#6152)

* Using a map of of ops to stages as input of partition function. (#5940)

* New partition algorithm running before AD

* Convert cut_group_info into device map. Work in progress -- works for  bert-tiny with pp=2

* Removing code for partition of bwd graphs

* Remove old code

* Adding some verification code

* Handle Shared Initializer

* Renaming rank with stage

* Added first unit test

* new test

* redundant check

* undo change in bert

* Moved cut-based partition to testing utils file

Co-authored-by: xzhu1900
Co-authored-by: wschin

* New conversion function and tests

* minor

* remove test that is not needed2

* improve GetDeviceAssignment and PR comments

* minor changes

* PR comments

* improving documentation and variable naming

* add documentation

* Variable naming and docs

* more doc improvements

* more doc improvements

* missing static cast

* Fix test file for windows

* Fix test file for windows

* Fix test file for windows

* stage id is not the same as rank id

* PR comments

* PR comments

* More comments

* More comments

* Minor fix to satisfy c++14 (#6162)

* Deprecating Horovod and refactored Adasum computations (#5468)

deprecated horovod submodule
refactored adasum logic to be ort-native
added tests for native kernel and e2e tests

* Update TensorRT-ExecutionProvider.md (#6161)

* Bugfix for topk cuda kernel (#6164)

* fix the issue that std::numeric_limits cannot handle half type

* adding a test

Co-authored-by: Du Li <duli@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Revert "Fuse MatMulIntegerToFloat only when scales are scalar (#6008)" (#6169)

This reverts commit f2dcba7afe.

* Remove ignored build warnings for pybind on Mac (#6165)

* save_checkpoint, load_checkpoint and aggregate_checkpoints (#6136)

* save_checkpoint and load_checkpoint implementations

* checkpoint aggregation logic

* unit tests for save_checkpoint, load_checkpoint and aggregate_checkpoints

* Don't try to bind unused inputs in the Training frontend (#6166)

* Update documentation for contributing a PR and add deprecation notices for PyOp and ORT server. (#6172)

* aggregate model states only for the case when mixed precision was true (#6176)

* [NNAPI EP] Enable per-channel quantization for QlinearConv  (#6155)

* Enable qlinearconv per-channel quantization

* Fix the android CI test failure

* Add Android Version Check for Per-Channel Quant

* Address PR comments

* Fix some minor issues

* Add verification of per-channel zero points

* Make the error tolerance configurable

* Fix typo in BERT pretraining script (#6175)

A misplaced `}` meant that the `'enable_adasum'` option was interpreted incorrectly, causing the test to fail.

* Update get_docker_image.py to enable use without image cache container registry. (#6177)

Update get_docker_image.py to enable use without image cache container registry.

* Helper for compiling EP to generate deterministic unique ids for use in MetaDef names (#6156)

* Create a helper for generating unique ids that can be used by an EP that creates compiled nodes and needs ids to be deterministic for a model when used in multiple sessions.

Added to IExecutionProvider as this can potentially be used by all compiling EPs and is more robust than a simplistic counter (although EP implementer is free to choose either approach).

* Restructure the helper so it can be called across the EP bridge.
Add ability to call id generation helper from EP bridge
  - convert DNNL EP to use helper to validate
Address issue where a new Model may be loaded into the same address as a previous one.
  - hash the bytes in the Graph instance (1728 bytes currently) to use as the key to the full hash for the model
Add lock around id generation to ensure no issues if multiple sessions partitions graphs at exactly the same time.
  - Extremely unlikely but would be hard to debug and the locking cost is not an issue as it's only incurred during graph partitioning and not execution.

* Backend APIs for checkpointing (#5803)

* Add backend API GetOptimizerState and GetModelState

* add GetPartitionInfoMap

* Android coverage dashboard (#6163)

* Write the report to a file.

* Post code coverage to the Dashboard database.

* Add usage details of unified MCR container image (#6182)

Going forward, a single unifed docker image will be published in
MCR. The hardware accelerator target choice will have to be made
in the application using OpenVINO EP's runtime config options.

* improve perf for softmax (#6128)

* improve perf for both gathergrad and softmax

* revert the change in gathergrad and will be done in another PR.

* address comments from code review.

* Tune fast Gelu to use exp(x) instead of tanh(x) on Rocm platform (#6174)

* tune fast gelu to use exp(x) instead of tanh(x) on rocm

* update to use expression 2/(1+exp(-2x))-1 for stability

* Add Status.csv to EP Perf Tool (#6167)

* merge master, keep postprocess status commit

* download float16.py everytime

* removing hardcoded values

* Lochi/quantization tool for trt (#6103)

* Initial implementation of generating calibration dynamic range table

* Initialize validation support for Quantization

* Initialize validation support for Quantization (cont.)

* Improve validation support for Quantization

* Improve validation support for Quantization

* Rewrite/Refine for calibration and validation

* Rewrite/Refine for calibration and validation (cont.)

* Refine code

* Refine code

* Add data reader for BERT

* Add flatbuffers to serialize calibration table

* Refine code and add BERT evaluation

* Refine the code

* minor modification

* Add preprocess/postprocess of vision team yolov3 and refine the code

* Update annotation

* Make bbox cooridates more accurate

* Fix bug

* Add support of batch processing

* Batch processing for model zoo yolov3

* Add batch inference for evaluation

* Refine the code

* Add README

* Add comments

* Refine the code for PR

* Remove batch support checking in data_reader and refine the code

* Refine the code for PR

* Refine the code for PR review

Co-authored-by: Olivia Jain <oljain@microsoft.com>

* Implement ScatterND for CUDA EP (#6184)

* Condition fix in Resize operator (#6193)

* Clean up checkpoint tests to use the new checkpoint functions (#6188)

* add deprecation warning for old checkpoint functions

* update all the distributed checkpoint tests to use new checkpoint functions

* Implement comparing outputs that are sequence of maps of strings to floats (#6180)

* Implement conversion from ortvalue to Itensor for string tensors and comparing sequence of maps of strings to floats

* PR comments

* Dockerfile to build onnxruntime with ROCm 4.0

* Add ability to skip GPU tests based on GPU adapter name (#6198)

* Implement conversion from ortvalue to Itensor for string tensors and comparing sequence of maps of strings to floats

* PR comments

* Add ability to skip gpu tests according to adapter description

* spacing

* spacing

* spacing

* Openvino ep 2021.2 (#6196)

* Enabling fasterrcnn variant and vehicle detector

* changes for 2021_2 branch

* yolov3_pytorch commit

* fixed braces in basic_backend.cc

* ci information added

* faster rcnn variant and vehicle detector changes were made in 2021.1 and not in 2021.2

* some changes to support unit tests

* disable some tests which are failing

* fix myriad tests for vehicle detector

* Did some cleanup
*cleaned up comments
*Disabled Add_Broadcast_0x1 and Add_Broadcast_1x0
tests on MYRIAD_FP16 backend due to a bug
*cleaned up capability_2021_2.cc file
*Removed extra conditions which were added
for some validation in backend_utils

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* yolov3 pytorch workaround to ensure that the output names are matched

* gemmoptest fixed on myriad

* Fixed MYRIADX CPP Test Failures

*Expand,GatherND,Range,Round op's
are only supported in model

*where op with float input data
types are not supported and fixed

*Scatter and ScatterElements op's with
negative axis are fixed

*Reshape op with 0 dim value are not
supported and fixed

*Disabled InstanceNorm_2 test on MYRIADX

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* make changes to yolov3 pytorch

* Fixed python unit tests
*Fixed failing python tests on vpu,
GPU and CPU

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixes POW op failures on GPU_FP16

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Clean up capability_2021_2.cc

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Updated docx for MultiThreading option
*Added extra info on setting the num_of_threads
option using the API and it's actual usage

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* fixed slice and removed extra prints

* Disabled failing python tests

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor changes added in capabilty_2021_2

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* made changes to slice to avoid failures

* Disabling FP16 support for GPU_FP32
->Inferencing an FP16 model on GPU_FP32
leads to accuracy mismatches. so, we would
rather use GPU_FP16 to infer an FP16 model
on GPU Device

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Updated docx for Inferencing a FP16 Model

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* fix for mask rcnn

* Script for installing openvino from source

* Updated with openvino 2021.2 online installation

* code comment fixes
fixed accuracy mismatch for div

* Update OpenvinoEP-ExecutionProvider.md

updated for 2021.2 branch

* Update README.md

updated dockerfile documentation

* Update BUILD.md

build.md update documentation

* permissiong change of install_openvino.sh

* made changes to align with microsoft onnxruntime changes

* Updated with ov 2021.2.200

Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: mohdansx <mohdx.ansari@intel.com>

* Fix a memory leak in test_inference.cc (#6201)

* Fix a memory leak in test_inference.cc

* Use TArray in AMD element-wise kernels, rather than manually copying memory to device.

* Remove most ROCm-specific element-wise code and reuse CUDA element-wise code.

* Minor change to improve performance for operator Pad. (#5537)

* small improvment for pad

* Support double for operators Log, Reciprocal, Sum (CPU) (#6032)

* Support double for operators Log, Reciprocal, Sum
* remove tesdt erf_double

* Support double for operators Where, LpNormalisation (#6034)

* Support double for operators Relu, Tanh, Sigmoid (#6221)

* Fix ImportError in build.py (#6231)

There is a possible ImportError where build.py can import the wrong 'util' package if there are others present in `sys.path` already

* Removed executor todo that looks dead. (#6234)

* Remove MKLML/openblas/jemalloc build config (#6212)

* Remove python 3.5

* Update the readme file

* Upgrade build.py to assert for python 3.6+

Upgrade build.py to assert for python 3.6+
as python 3.5 cannot build anymore todays master.

* Support MLFloat16 type in Pow opset-12 CUDA kernel (#6233)

* MLAS: handle MlasGemm(M/N/K==0) cases (#6238)

* Support double for operator TopK + fix one bug in TopK implementation for GPU for double (#6220)

* Support double for operator TopK
* add static classes for topk/double
* fix cast issue in topk

* Support double for operator Gemm + fix bug in gemm implementation for cuda, rocm when sizeof(type) != sizeof(float) (#6223)

* Support double for operator Gemm
* fix type size while copying data in gemm operator for GPU
* fix type in gemm implementation for rocm

* Support double for operator ReduceMean, ReduceLogSumExp (#6217)

* Support double for operators ReduceMean, ReduceLogSumExp

* Support double for operator ArgMin (#6222)

* Support double for operator ArgMin
* add test specifically for double
* add new test on pai-excluded-tests.txt

* Update BUILD.md

* Update manylinux docker image to the latest (#6242)

* Fix allocator issue for TensorRT IOBinding (#6240)

* Fix issue: https://github.com/microsoft/onnxruntime/issues/6094

Root cause: we didn't expose the OrtMemoryInfo for TRT, so it will cause issue if user want use IObinding for Tensorrt.

Short term fix, add the OrtMemoryInfo for TRT. Long term should unify the allocator for CUDA and TRT

* Tune BiasGeluGradDx kernel in approximation mode to avoid tanh(...) on Rocm (#6239)

* bias gelu grad use exp(...) instead

* update cuda to rocm

* missing semicolon

* comment

* remove dockerfile

* missing factor of two

* Refactor EP Perf Tool  (#6202)

* merge master, keep postprocess status commit

* download float16.py everytime

* using variables to reference eps

* adding ACL EP to ep perf tool

* accuracy with absolute tolerance configurable

* add acl to dict + remove commented line

* Documentation for distributed CI tests pipeline (#6140)

* Remove a debug log in provider_test_utils.cc (#6200)

* Add the Concat Slice Elimination transform, fix constant_folding transform (#5457)

* Add concat slice transform + test

* Cosmetic improvements in concat slice transform

* Remove unrelated file, fix comment, fix constant folding bug

* Add test onnx graph

* fix windows build

* Review comments

* review comment

Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Add MakeStringLite which uses current locale, update some MakeString call sites to use it instead. (#6252)

* Add MakeStringLite which uses current locale, update macros to use that to generate messages.

* Convert calls to MakeStringLite().

* Liqun/speech model loop to scan (#6070)

Provide a tool to convert Loop to Scan for Nuphar performance
Fix Nuphar CI pipeline failures.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* model parallel refinement (#6244)

* Megatron Transformation as a seperate step

* remove useless header

* clang formating

* Re-Structure megatron transformer for subsquent changes

* fix  comments

* Allow querying a GraphProto's doc_string as part of ModelMetadata (#6248)

* Fix Linux/Mac error message on input type mismatch (#6256)

* add bfloat16 to gathergrad type constrains (#6267)

Co-authored-by: Cheng Tang <chenta@microsoft.com>

* Fix VS 2017 build break (#6276)

* Deprecate Python global configuration functions [Part 2] (#6171)

Update Python API to allow more flexibility for setting providers and provider options.

The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. Now it can be used as a starting point for the providers argument and maintain the default priority order.
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.

Update some EP-specific option parsing to fail on unknown options.

Other clean up.

* Add script to preprocess python documentation before publishing (#6129)

* add script to preprocessing python documentation before publishing

* rename past to past_key_values for GPT-2 (#6269)

rename past to past_key_values for transformers 4.*

* Rename MakeString and ParseString functions. (#6272)

Rename MakeString to MakeStringWithClassicLocale, MakeStringLite to MakeString, *ParseString to *ParseStringWithClassicLocale.
Add missing pass-through versions of MakeStringWithClassicLocale for string types.

* Increase timeout for Linux GPU CUDA11 build. (#6280)

* Add helper to compare model with different precision (#6270)

* add parity_check_helper.py

* add real example

* remove lines

* Fix Min/Max CPU kernels for float16 type (#6205)

* fix data_ptr assertion error for past_sequence_length=0 in GPT-2 (#6284)

 fix io binding crash for past_sequence_length=0

* A list of changes in transformers tool (#6224)

* longformer fp16 e2e

* add fp16/fp32 parity check helper file

* excludes nodes with subgraph in profiling

* use onnxconverter_common to do fp32->fp16

* add version check for onnxconverter_common

* remove helper file

* add pkg installation on notebooks and script

* Workaround for static_cast<double>(half)

* Add workaround to remove ROCm-specific binary-elementwise files.

* Update nuget build (#6297)

1. Update the ProtoSrc path. The old one is not used anymore.
2. Regenerate OnnxMl.cs
3. Delete some unused code in tools/ci_build/build.py
4. Avoid set intra_op_param.thread_pool_size in ModelTests in OpenMP build.
5. Fix a typo in the C API pipeline.

* Enable ONNX backend test of SequenceProto input/output  (#6043)

* assert sequence tensor and remove skips

* update testdata json

* use ONNX 1.8 in cgmanifest.json

* use previous commit to workaround

* update ONNX commit ID in docker

* skip test_maxpool_2d_dilations test for now

* update function name

* add --sequence_lengths option (#6285)

* more dtype for Equal CUDA kernel (#6288)

Co-authored-by: Vincent Wang <weicwang@microsoft.com>

* Force reinstall onnx python package on Windows (#6309)

* update transformers required package versions (#6315)

* Remove abs in LpPool (#6303)

* Support 1D input for Conv + Mul/Add fusion optimizer with test (#6295)

* Support 1D input (N C H) for Conv + Mul/Add fusion optimizer with test cases and test models.

* Add longformer to  python package (#6314)

* add longformer to python package
* move test related script and data to a new folder

* Avoid false sharing on thread pool data structures (#6298)

Description: This change adds alignment and padding to avoid false sharing on fields in the thread pool. It also adds a new microbenchmark to profile thread-pool performance over short loops.

Motivation and Context
MobileNet on a 2*12-core system showed a performance gap between the ORT thread pool and OpenMP. One cause appeared to be false sharing on fields in the thread pool: ThreadPoolParallelSection::tasks_finished (which the main thread spins on waiting for workers to complete a loop), and the RunQueue::front_ and back_ fields (used respectively by the worker thread and the main thread).

The additional micro-benchmark BM_ThreadPoolSimpleParallelFor tests performance of loops of different sizes at different thread counts. The results below are on a machine with 2*14-core processors (E5-2690 v4) running with 1, 14, 15, and 28 threads. For each test, the microbenchmark has N threads run a loop with N iterations; hence a perfect result is for the time taken to be constant as additional threads are added (although we will also see power management effects helping at very low thread counts). The loop durations (100000, 10000, 1000) correspond roughly to 200us, 20us, and 2us on this machine.

Before change:
BM_ThreadPoolSimpleParallelFor/1/1/100000/real_time 17153 us 17154 us 32
BM_ThreadPoolSimpleParallelFor/14/14/100000/real_time 22553 us 22553 us 30
BM_ThreadPoolSimpleParallelFor/15/15/100000/real_time 21521 us 21521 us 29
BM_ThreadPoolSimpleParallelFor/28/28/100000/real_time 24111 us 24111 us 24
BM_ThreadPoolSimpleParallelFor/1/1/10000/real_time 1719 us 1719 us 407
BM_ThreadPoolSimpleParallelFor/14/14/10000/real_time 3409 us 3409 us 200
BM_ThreadPoolSimpleParallelFor/15/15/10000/real_time 3541 us 3541 us 201
BM_ThreadPoolSimpleParallelFor/28/28/10000/real_time 4576 us 4576 us 151
BM_ThreadPoolSimpleParallelFor/1/1/1000/real_time 174 us 174 us 4017
BM_ThreadPoolSimpleParallelFor/14/14/1000/real_time 1586 us 1586 us 402
BM_ThreadPoolSimpleParallelFor/15/15/1000/real_time 1586 us 1586 us 397
BM_ThreadPoolSimpleParallelFor/28/28/1000/real_time 2864 us 2864 us 232

After change:
BM_ThreadPoolSimpleParallelFor/1/1/100000/real_time 17160 us 17160 us 33
BM_ThreadPoolSimpleParallelFor/14/14/100000/real_time 20989 us 20989 us 31
BM_ThreadPoolSimpleParallelFor/15/15/100000/real_time 22286 us 22286 us 31
BM_ThreadPoolSimpleParallelFor/28/28/100000/real_time 24631 us 24631 us 25
BM_ThreadPoolSimpleParallelFor/1/1/10000/real_time 1718 us 1718 us 407
BM_ThreadPoolSimpleParallelFor/14/14/10000/real_time 2868 us 2868 us 242
BM_ThreadPoolSimpleParallelFor/15/15/10000/real_time 2907 us 2907 us 240
BM_ThreadPoolSimpleParallelFor/28/28/10000/real_time 3872 us 3872 us 186
BM_ThreadPoolSimpleParallelFor/1/1/1000/real_time 175 us 175 us 3938
BM_ThreadPoolSimpleParallelFor/14/14/1000/real_time 933 us 933 us 659
BM_ThreadPoolSimpleParallelFor/15/15/1000/real_time 912 us 912 us 591
BM_ThreadPoolSimpleParallelFor/28/28/1000/real_time 1976 us 1976 us 317

* fix opset imports for function body  (#6287)

* fix function opsets

* add tests and update onnx

* changes per review comments

* add comments

* plus updates

* build fix

* Remove false positive prefast warning from threadpool (#6324)

* Java: add Semmle to Java publishing pipelines (#6326)

Add Semmle to Java API pipeline
  Add security results publishing and add Java GPU.

* Quantization support for split operator with its NHWC support (#6107)

* Make split working for quantization.

* NHWC transformer support for split operator

* Refactor some according to Feedback. Will add test cases soon.

* Fix build error on windows.

* Add test case for split op on uint8_t support

* Add nhwc_transformer_test for split uint8_t support

* Some change according to PR feedbacks.

* Liqun/enable pipeline parallel test (#6331)

enable pipeline parallel test
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Use onnxruntime_USE_FULL_PROTOBUF=OFF for the cuda execution provider (#6340)

This removes a special case of the cuda EP.

* MLAS: add fallback implementation for quantized GEMM (#6335)

Add a non-vectorized version of the kernel used for the quantized version of MlasGemm.

* Delete float16.py (#6336)

No longer needed. Also doesn't pass policheck.

* Enable add + softmax fusion for Rocm platform (#6259)

* add bias softmax; tests appear to pass

* check fusion occurs for rocm as well

* check for rocm provider compatible as well

* build for cpu scenario as well

* try again; broader cope

* proper scope on kGpuExecutionProvider

* been editing wrong file

* remove commented #include lines

* try again due to mac os ci error

* try again

* test fusion both cuda and rocm to avoid mac ci error

* add external data support to tensor proto utils (#6257)

* update unpack tensor utilities to support loading external data

* more updates

* fix test

* fix nuphar build

* minor build fix

* add tests

* fix Android CI

* fix warning

* fix DML build failure and some warnings

* more updates

* more updates

* plus few updates

* plus some refactoring

* changes per review

* plus some change

* remove temp code

* plus updates to safeint usage

* build fix

* fix for safeint

* changed wording. (#6337)

* Remove OpSchema dummy definition. Only needed for Function now, and we can just exclude the method in Function (#6321)

* remove gemmlowp submodule (#6341)

* [NNAPI] Add pow support (#6310)

* Add support for running Android emulator from build.py on Windows. (#6317)

* fix the pipeline failure (#6346)

* Train BERT Using BFloat16 on A100 (#6090)

* traing bert using bf16

* Adam support bf16

* bugfix

* add fusedmatmul support

* fix after merge from master.

* bugfix

* bugfix after merge from master

* fast reduction for bf16.

* resolve comments

* fix win build

* bugfix

* change header file.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>

* Fix DerefNullPtr issues raised by SDLNativeRules. (#6348)

* update quantize to support basic optimization and e2e example for image classification (#6313)

update the resnet50-v1 to standard one from onnx zoo.
add an example for mobilenet
run basic optimization before quantization
fix a bug in Clip

* Enable graph save for orttrainer (#6333)

* Enable graph save for orttrainer

* Fix CI

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Add PREfast to python packaging pipeline (#6343)

* Add PREfast to python packaging pipeline

* fix longformer benchmark io_binding output_buffers (#6345)

* fix longformer benchmark io_binding output_buffers

* format

* import benchmark_helper from parent directory.

* Use readelf for minimal build binary size checks. (#6338)

* Use readelf for minimal build binary size checks.
The on-disk size grows in 4KB chunks which makes it hard to see how much growth an individual checkin causes.
Only downside is that the sum of the sections is larger than the on-disk size (assumably things get packed smaller on disk and some of the section alignment constraints can be ignored)

* Remove unused function

* Java: Set C language warnings to W4 and adjust JNI code (#6347)

Set /W3 for C language and fix up JNI warnings.

* Pipeline Parallel Experimental Python API (#5815)

* Add create session to WinML telemetry to track WinML Usage (#6356)

* Fix one more SDL warning (#6359)

* fix -Wdangling-gsl (#6357)

* Add python example of TensorRT INT8 inference on ResNet model (#6255)

* add trt int8 example on resnet model

* Update e2e_tensorrt_resnet_example.py

* remove keras dependency and update class names

* move ImageNetDataReader and ImageClassificationEvaluator to tensorrt resnet example

* simplify e2e_tensorrt_resnet_example.py

* Update preprocessing.py

* merge tensorrt_calibrate

* Update calibrate.py

* Update calibrate.py

* generalize calibrate

* Update calibrate.py

* fix issues

* fix formating

* remove augment_all

* This added telemetry isn't needed (#6363)

* Wezuo/memory analysis (#5658)

* merged alloc_plan

* pass compilation

* Start running, incorrect allocation memory info

* add in comments

* fix a bug of recording pattern too early.

* debugging lifetime

* fix lifetime

* passed mnist

* in process of visualization

* Add code to generate chrome trace for allocations.

* in process of collecting fragmentation

* before rebuild

* passed mnist

* passed bert tiny

* fix the inplace reuse

* fix the exception of weight in pinned memory

* add guards to ensure the tensor is in AllocPlan

* add customized profiling

* debugging

* debugging

* fix the reuse of differnt location type

* add rank

* add the rank

* add fragmentation

* add time_step_trace

* Add summary for each execution step (total bytes, used/free bytes).

* add top k

* change type of top k parameter

* remove prints

* change heap to set{

* add the name pattern

* add the useage for pattern

* add partition

* change to static class

* add custom group

* remove const

* update memory_info

* in process of adding it as runtime config

* change the memory profiling to be an argument

* add some comments

* add checks to recored meomry_info in traaining session

* set the "local rank setting" to correct argument.

* addressing comments

* format adjustment

* formatting

* remove alloc_interval

* update memory_info.cc to skip session when there is no tensor for a particular memory type

* fix memory_info multiple iteration seg-fault

* consolidate mainz changes

* fixed some minor errors

* guard by ORT_MINIMAL_BUILD

* add ORT_MEMORY_PROFILE flag

* added compiler flag to turn on/off memory profiling related code

* clean up the code regarding comments

* add comments

* revoke the onnx version

* clean up the code to match master

* clean up the code to match master

* clean up the code to match master

Co-authored-by: Jesse Benson <benson.jesse@gmail.com>
Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-yclzsf.eastus.cloudapp.azure.com>

* Support MLFloat16 in CumSum Cuda op for Opset 14 (#6355)

* Add CumSum-14 for Cuda

* fix convert_common version retrival (#6382)

* Refine auto_pad based pad computation in ConvTranspose (#6305)

* Fix SDL warning (#6390)

* Add max_norm for gradient clipping. (#6289)

* add max_norm as user option for gradient clipping

* add adam and lamb test cases for clip norm

* add frontend tests

* Add the custom op project information (#6334)

* Dont use default string marshalling in C# (#6219)

* Fix Windows x86 compiler warnings in the optimizers project  (#6377)

* [Perf] Optimize Tile CPU and CUDA kernels for a corner case (#6376)

* Unblock Android CI code coverage failure (#6393)

* fix build on cuda11 (#6394)

Co-authored-by: Vincent Wang <weicwang@microsoft.com>

* Load the model path correctly (#6369)

* Fix some compile warnings (#6316)

* OpenVino docker file changes to bypass privileged mode

Description: Builds and installs libusb without UDEV support, which is used for communicating with the VPU device.

Motivation and Context

This enables the resulting docker container to be run without '--privileged' and '--network host' options which may not be suitable in deployment environments.

* Megatron checkpointing (#6293)

* Add bart fairseq run script

* Add frontend change to enable megatron

* Initial changes for checkpointing

* Megatron optim state loading, checkpoint aggregation, frontend distributed tests for H, D+H

* Add load_checkpoint changes

* Fix CI

* Cleanup

* Fix CI

* review comments

* review comments

* review comments:

* Fix generate_submodule_cgmanifest.py Windows issues. (#6404)

* Continue memory planning when unknown shape tensor is encountered. (#6413)

* Reintroduce experimental api changes and fix remote build break (#6385)

Co-authored-by: Ori Levari <orlevari@microsoft.com>

* Add support for custom ops to minimal build. (#6228)

* Add support for custom ops to minimal build.
Cost is only ~8KB so including in base minimal build.

* enable pipeline to run quantization tests (#6416)

* enable pipeline to run quantization tests
setup test pipeline for quantization

* Minor cmake change (#6431)

* Liqun/liqun/enable pipeline parallel test2 (#6399)

* enable data and pipeline parallism test

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Farewell TrainableDropout (#5793)

* Deprecate TrainableDropout kernel.

* Update bert_toy_postprocessed.onnx to opset 12.

* Add more dropout tests.

* Fix BiasDropout kernel.

Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>

* fix null dereference warning (#6437)

* Expose graph ModelPath to TensorRT shared library (#6353)

* Update graph_viewer.cc

* Update tensorrt_execution_provider.cc

* Update graph_viewer.h

* Update tensorrt_execution_provider.cc

* Update tensorrt_execution_provider.cc

* Update provider_api.h

* Update provider_bridge_ort.cc

* Update provider_interfaces.h

* Update provider_interfaces.h

* expose GraphViewer ModelPath API to TRT shared lib

* add modelpath to compile

* update

* add model_path to onnx tensorrt parser

* use GenerateMetaDefId to generate unique TRT kernel name

* use GenerateMetaDefId to generate unique TRT engine name

* fix issue

* Update tensorrt_execution_provider.cc

* remove GetVecHash

* Update tensorrt_execution_provider.h

* convert wchar_t to char for tensorrt parser

* update tensorrt parser to include latest changes

* fix issues

* Update tensorrt_execution_provider.cc

* merge trt parser latest change

* add PROVIDER_DISALLOW_ALL(Path)

* add tool for generating test data for longformer (#6415)

* only build experimental api in redist (#6465)

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>

* Add an option to save the training graph after optimization (#6410)

* expose optimized_model_filepath in SessionOptions as `debug.graph_save_paths.model_with_training_graph_after_optimization_path` in `ORTTrainerOptions`

* Share allocator between CUDA EP & TRT EP. (#6332)

* Share allocator between CUDA EP & TRT EP.
limitation:
1. Does not cover the per-thread allocator created by CUDA EP, still need to figure out the way to remove it
2. Need to have more identifiers to make it able to share CPU allocator across all EPs

* fix max norm clipping test in python packaging pipeline test (#6468)

* fix python packaging pipeline

* make clip norm test compatabile with both V100 and M60 GPUs

* Initial version of CoreML EP (#6392)

* Bug 31463811: Servicing: Redist (Nuget) conflicts with Microsoft.AI.MachineLearning starting 21H1+ (#6460)

* update load library code to have the fullly qualified path

* make it work for syswow32

* git Revert "make it work for syswow32"

This reverts commit b9f594341b7cf07241b18d0c376af905edcabae3.

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>

* dequantize 1st input of lstm back if it is quantized (#6444)

* [java] Adds support for OrtEnvironment thread pools (#6406)

* Updates for Gradle 7.

* Adding support for OrtThreadingOptions into the Java API.

* Fixing a typo in the JNI code.

* Adding a test for the environment's thread pool.

* Fix cuda test, add comment to failure.

* Updating build.gradle

* fix SDL native rule warning #6246 (#6461)

* fix SDL rule (#6464)

* use tickcount64 (#6447)

Co-authored-by: Ori Levari <orlevari@microsoft.com>

* Update pypi package metadata (#6354)

* Update setup file data

* add missing comma

* remove python 3.5

* fix typo bracket

* Delete nuget extra configs (#6477)

* Op kernel type reduction infrastructure. (#6466)

Add infrastructure to support type reduction in Op kernel implementations.
Update Cast and IsInf CPU kernels to use it.

* Fixing a leak in OnnxSequences with String keys or values. (#6473)

* Increase the distributes tests pipeline timeout to 120 minutes (#6479)

* [CoreML EP] Add CI for CoreML EP (macOS) and add coreml_flags for EP options (#6481)

* Add macos coreml CI and coreml_flags

* Move save debuggubg model to use environment var

* Move pipeline off from macos CI template

* Fix an issue building using unix make, add parallel to build script

* Fixed build break for shared_lib and cmpile warning

* Fix a compile warning

* test

* Revert the accidental push from another branch

This reverts commit 472029ba25d50f9508474c9eeceb3454cead7877.

* Add ability to track per operator types in reduced build config. (#6428)

* Add ability to generate configuration that includes required types for individual operators, to allow build size reduction based on that.
  - Add python bindings for ORT format models
    - Add script to update bindings and help info
  - Add parsing of ORT format models
  - Add ability to enable type reduction to config generation
  - Update build.py to only allow operator/type reduction via config
    - simpler to require config to be generated first
    - can't mix a type aware (ORT format model only) and non-type aware config as that may result in insufficient types being enabled
  - Add script to create reduced build config
  - Update CIs

* merge e2e with distributed pipeline (#6443)

merge e2e with distributed pipeline

* Fix test breaks in Windows ingestion pipeline (#6476)

* fix various build breaks with Windows build

* fix runtime errors loading libraries from system32

* add build_inbox check to winml_test_common

* use raw string

* cleanup

* fix dll load

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>

* Speed up the Mac CI runs (#6483)

* expose learningmodelpixelrange property (#5877)

* Fix of support api version bug for [de]quantize (#6492)

* SDL fixes: add proper casts/format specifiers (#6446)

* SDL annotation fixes (#6448)

Co-authored-by: Ori Levari <orlevari@microsoft.com>

* [OpenVINO-EP] Remove support for OpenVINO 2020.2 (#6493)

* Removed OpenVINO 2020.2 support

* Updated documentation and build.py

* Removed unnecessary libraries from setup.py

* Support pad operator in quantization and quantized nhwc transformer. Fix Pad operator bug. (#6325)

Support pad operator in quantization tool.
Support pad operator in quantized nhwc transformer.
Fix pad() operator bug when pad input's inner(right) most axis value is zero for Edge and Reflect mode, it copied wrong value to the cells to be padded. Note the Constant mode will not trigger this bug, as Edge/Reflect need copy value from the already copied array while Constant mode only fill specified value.
Add more test cases to cover pad() operator bug fixed here.
Fix quantization tools uint8/int8 value overflow issue when quantize weights in python.

* Improve work distribution for Expand operator, and sharded LoopCounter configuration (#6454)

Description: This PR makes two changes identified while looking at a PGAN model.

First, it uses ThreadPool::TryParallelFor for the main parallel loops in the Expand operator. This lets the thread pool decide on the granularity at which to distribute work (unlike TrySimpleParallelFor). Profiling showed high costs when running "simple" loops with 4M iterations each of which copied only 4 bytes.

Second, it updates the sharded loop counter in the thread pool so that the number of shards is capped by the number of threads. This helps make the performance of any other high-contention "simple" loops more robust at low thread counts by letting each thread work on its own "home" shard for longer.

Motivation and Context

Profiling showed a PGAN model taking 2x+ longer with the non-OpenMP build. The root cause was that the OpenMP build uses simple static scheduling of loop iterations, while the non-OpenMP build uses dynamic scheduling. The combination of large numbers of tiny iterations is less significant with static scheduling --- although still desirable to avoid, given that each iteration incurs a std::function invocation.

* Update document of transformer optimization (#6487)

* nuphar test to avoid test data download to improve passing rate (#6467)

nuphar test to avoid test data download to improve passing rate

* Fuse cuda conv with activation (#6351)

* optimize cuda conv by fused activation

* remove needless print out

* exclude test from cpu

* handle status error from cudnn 8.x

* add reference to base class

* add hipify

* [CoreML EP] Add support for some activations/Transpose, move some shared helpers from NNAPI to shared space (#6498)

* Init change

* Move some helper from nnapi ep to shared

* Add transpose support

* Fix trt ci build break

* Refine transformers profiler output (#6502)

* output nodes in the original order; grouped by node name
* add document for profiler

* Update to match new test setup. (#6496)

* Update to match new test setup.

* Add Gemm(7) manually for now.
Will fix properly on Monday. It's used by mnist.ort as that is created by optimizing mnist.onnx to level 1 causing 2 nodes to be replaced by a Gemm and the op to be missing from the required list as that is created using the original onnx model.

* Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504)

* Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD

* enable Equal tests

* enable fast_matrix_reduction test case

* Optimize GatherGrad for AMD GPU (#6381)

* optimize gathergrad

* address comments

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* add explicit barriers for buffer overread and overrwrite (#6484)

Co-authored-by: Ori Levari <orlevari@microsoft.com>

* fix sdl bugs for uninitialized variables and returns (#6450)

Co-authored-by: Ori Levari <orlevari@microsoft.com>

* handle hr error conditions (#6449)

Co-authored-by: Ori Levari <orlevari@microsoft.com>

* Dnnl training (#6045)

* Add ReluGrad and ConvGrad ops for the dnnl provider

* the mnist sample is updated to add the --use_dnnl option that
will cause the sample to use the dnnl execution provider for
nodes that exist in dnnl provider.

* Added the ability to find forward ops. Dnnl backward gradient
ops require the forward primitive description and workspace
from the forward operation.

* Enable specifying the execution provider for Gradient Checker Tests

* Prevent memory leak when running dnnl_provider in training mode

Prevent creating a SubgraphPrimitivePool when the code is built with the
ENABLE_TRAINING build flag. Instead create a SubgraphPrimitive directly.

The SubgraphPrimitivePool was causing a pool of SubgraphPrimitives to be
stashed in a map for reuse. Due to the way the Training Loop uses threads
the pool of SubgraphPrimitives were not being reuse instead a new pool of
SubgraphPrimitives being created each run. The old pool was not instantly
freed. This behavior could be a language error when using thread_local
memory.

Signed-off-by: George Nash <george.nash@intel.com>

* Added fixes to maxpoolgrad and memory leak.

Maxpoolgrad will now pass all unit tests.
With the conv and convgrad disabled for dnnl, mnist is able to train till 95%

Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>

* Fixed misc issues when testing training code with dnnl provider

* fix conv_grad dnnl tests with dilation to run dnnl execution provider

* update mnist training sample to accept convolution type models

  convolution models require the input shape to be {1, 28, 28}
  instead of the flat {728} image that is used for the gemm models

  this will enable models that require the different shape by adding
 `--model_type conv` to the command line when running the mnist sample.
 (while testing a workaround was used see #4762)

* Disable weight caching in dnnl conv operator when using training

  When training we can not use cached weights because the weight
  will be updated each run. This re-enables dnnl Conv and ConvGrad Ops.
  The weight caching was the source of the error from Conv when training.

* Fix issues found when building grad ops on Linux
  * The dnnl_convgrad code was over using the scope operator
    causing a compilation problem.
  * The dnnl_maxpoolgrad code had a logic error that is was
    comparing with the source description when it should have
    been comparing with the destination despription.

* Update BUILD.md so it shows DNNL for training
  * Updated the table of contents. Since the same providers
    are listed twice. Once for Infrance and again for Training
    an HTML anchor was added to distinguish the second header
    from the first for the TOC.

* Fix build failure when not using --enable-training build option

* reorganize the gradient operators so they are grouped together

* Fix issues found when running onnx_backend_test_series.py

* Pooling code only supports 2 outputs when built with --enable-training

* Address code review feedback
  * class member variables end in underscore_
  * use dst instead of dist to match pattern use elsewhere in DNNL code.

* Remove workaround that was introduced to handle problems running
  convolution based training models. See issue #4762

Signed-off-by: George Nash <george.nash@intel.com>

* Isolate training code and code cleanup

* Do not build if dnnl_gpu_runtime if enable_training is set training code
  does not support dnnl_gpu_runtime yet.
* Isolated Training code inside ifdefs so that they wont affect
  project if built without training enabled
* Inadvertant changes in whitespace were removed to make code review simpler
* Undid some code reordering that was not needed
* comments added to closing #endif statments to simplify reading complex ifdefs
* Modified the GetPrimitiveDesc functions to return shared_ptr instead of raw
  pointer. This matches what was done in Pool code and is safer memory code.

Signed-off-by: George Nash <george.nash@intel.com>

* Address code review issues

- whitespace changes caused by running clang-format on the code
- Several spelling errors fixed
- Removed/changed some ifdefs to improve readability
- other misc. changes in responce to code review.

Signed-off-by: George Nash <george.nash@intel.com>

* Code changes to address code review

- Simplify iteration code using `auto` keyword
- remove C style cast that was not needed
- remove instance variable that was not needed [relugrad.h]
- added the execution providers to `ComputeGradientErrorInternal()`
  and `ComputeTheoreticalJacobianTranspose()` instead of using
  a pointer to an instance varaible [gradient_checker.h/.cc]

Signed-off-by: George Nash <george.nash@intel.com>

* Combined the default gradient ops test and dnnl gradient ops test for ConvGrad and MaxPoolGrad into one function with the help of a helper function.
This will reduce repeated code.
Signed-off-by: Palangotu Keshava, Chethan's avatarChethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>

* Replaced the stack used by convgrad to vector so that the vector(used as stack) can be easily cleared everytime the graph is created.
This will prevent memory leak from convolution kernels being pushed constantly onto the stack.
Signed-off-by: chethan.palangotu.keshava@intel.com

* Code clean up and formating updates

 - Removed empty else statment
 - updated indentation of code that was causing double curly brackets to look unususal
 - Changed check for NumDimensions to Size in Relu and ReluGrad error checking code.
 - isolated training code

Signed-off-by: George Nash <george.nash@intel.com>

* Restore inadvertantly removed ConvGrad tests

When combining the DNNL and CPU version of the ConvGrad
tests two test were inadvertantly excluded.  This adds
back the Conv3d and Conv3d with strides test cases.

Signed-off-by: George Nash <george.nash@intel.com>

* Add validation to ConvGrad

This validates the dimensions of the ConvGrad match the
passed in Convolution forward primitive description.

The current code for DNNL ConvGrad makes the assumption that the ConvGrad
nodes will be visited in the reverse order from the corresponding Conv nodes

The added validation will return an error if this assumption is not true.

Signed-off-by: George Nash <george.nash@intel.com>

* Do not create new execution providers in provider_test_utils

This removes the code that generated new execution providers in the
OpTester::Run function. This was added because the std::move was
leaving the `entry` value empty so subsequent calls would cause a
segfault.

Problem is this potentially changed the execution_provider because it
would create the default provider dropping any custom arguments.

When the now removed code was originally added the std::move was causing
crashes when the GradientChecker unit tests were run.  However, it is no
longer causing problems even with the code removed.

Signed-off-by: George Nash <george.nash@intel.com>

* Change the forward conv stack to a forward conv map

This changes how the forward conv kernel is mapped to the bwd ConvGrad
kernel the problematic stack is no longer used.

The convolution stack made the assumption that the corresponding
ConvGrad operator would be visited in reverse order of the forward
Conv operators.  This was always problematic and was unlikely to
work for inception models.

Important changes:
- The weight_name is added to the ConvGrad dnnl_node making it
  possible to use the weight_name as a lookup key to find the
  Conv forward Kernel
- the `std::vector fwd_conv_stack_` has been replaced with a
  `std::map fwd_conv_kernel_map_`
- Although it is not needed lock_guards were added when writing
  to and reading from the fwd_conv_kernel_map_ as well as the
  fwd_kernel_map_. These should always be accessed by a single
  thread when preparing the dnnl subgraphs so the guard should not
  be needed but its added just in case.
- Updated the comments ConvGrad.h code to no longer mention the
  stack. The error check is not removed. It will be good to verify
  there are no errors as we continue to test against more models.

Signed-off-by: George Nash <george.nash@intel.com>

Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
Co-authored-by: unknown <63478620+jeyblu@users.noreply.github.com>

* Lochi/refactor yolov3 quantization (#6290)

* Refactor the code and move data reader, preprocessing, evaluation to
E2E_example_mode

* Refactor the code.

Move data reader, preprocessing, evaluation to model specific example
under E2E_example_mode

* refactor code

* Move yolov3 example to specific folder and add additional pre/post
processing

* Print a warning message for using newer c_api header on old binary (#6507)

* Fix issues with ArmNN build setup (#6495)

* ArmNN build fixes
* Update BUILD.md to document that the ACL paths must be specified to build ArmNN
* Fix CUDA build error. We don't setup the link libraries correctly/consistently so improve that.

* Fix Windows CI builds by updating test scripts to work with numpy 1.20. (#6518)

* Update onnxruntime_test_python.py to work with numpy 1.20.

Some aliases are deprecated in favor of the built-in python types. See https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

np.array with bytes for entries and dtype of np.void no longer automatically pads. Change a test to adjust for that.

* Fix another test script

* Fix ORTModule branch for orttraining-* pipelines

* Update pytorch nightly version dependency

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Cecilia Liu <ziyue.liu7@gmail.com>
Co-authored-by: Ryan Hill <38674843+RyanUnderhill@users.noreply.github.com>
Co-authored-by: George Nash <george.nash@intel.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
Co-authored-by: Yateng Hong <toothache9010@gmail.com>
Co-authored-by: stevenlix <38092805+stevenlix@users.noreply.github.com>
Co-authored-by: Derek Murray <Derek.Murray@microsoft.com>
Co-authored-by: ashbhandare <ash.bhandare@gmail.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Tracy Sharpe <42477615+tracysh@users.noreply.github.com>
Co-authored-by: Juliana Franco <jufranc@microsoft.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Tixxx <tix@microsoft.com>
Co-authored-by: Jay Rodge <jayrodge@live.com>
Co-authored-by: Du Li <duli1@microsoft.com>
Co-authored-by: Du Li <duli@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: baijumeswani <bmeswani@microsoft.com>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
Co-authored-by: jingyanwangms <47403504+jingyanwangms@users.noreply.github.com>
Co-authored-by: satyajandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: S. Manohar Karlapalem <manohar.karlapalem@intel.com>
Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Olivia Jain <oljain@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Ryan Lai <rylai@microsoft.com>
Co-authored-by: Jesse Benson <jesseb@microsoft.com>
Co-authored-by: sfatimar <64512376+sfatimar@users.noreply.github.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: mohdansx <mohdx.ansari@intel.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin@vols.utk.edu>
Co-authored-by: Michael Giba <michaelgiba@gmail.com>
Co-authored-by: William Tambellini <wtambellini@sdl.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: liqunfu <liqfu@microsoft.com>
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: Tang, Cheng <souptc@gmail.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Chun-Wei Chen <jacky82226@gmail.com>
Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: Luyao Ren <375833274@qq.com>
Co-authored-by: Zhang Lei <zhang.huanning@hotmail.com>
Co-authored-by: Tim Harris <tiharr@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Alberto Magni <49027342+alberto-magni@users.noreply.github.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: wezuo <49965641+wezuo@users.noreply.github.com>
Co-authored-by: Jesse Benson <benson.jesse@gmail.com>
Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-yclzsf.eastus.cloudapp.azure.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Martin Man <supermt@gmail.com>
Co-authored-by: M. Zeeshan Siddiqui <mzs@microsoft.com>
Co-authored-by: Ori Levari <ori.levari@microsoft.com>
Co-authored-by: Ori Levari <orlevari@microsoft.com>
Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sheil Kumar <smk2007@gmail.com>
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Ryota Tomioka <ryoto@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: Yulong Wang <f.s@qq.com>
Co-authored-by: Faith Xu <faxu@microsoft.com>
Co-authored-by: Xiang Zhang <xianz@microsoft.com>
Co-authored-by: suryasidd <48925384+suryasidd@users.noreply.github.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
Co-authored-by: unknown <63478620+jeyblu@users.noreply.github.com>
2021-02-02 08:59:56 -08:00

2005 lines
85 KiB
Python

#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import argparse
import contextlib
import glob
import os
import re
import shutil
import subprocess
import sys
import hashlib
import platform
from amd_hipify import amd_hipify
from distutils.version import LooseVersion
SCRIPT_DIR = os.path.dirname(os.path.realpath(__file__))
REPO_DIR = os.path.normpath(os.path.join(SCRIPT_DIR, "..", ".."))
sys.path.insert(0, os.path.join(REPO_DIR, "tools", "python"))
from util import ( # noqa: E402
run,
is_windows, is_macOS, is_linux,
get_logger)
import util.android as android # noqa: E402
log = get_logger("build")
class BaseError(Exception):
"""Base class for errors originating from build.py."""
pass
class BuildError(BaseError):
"""Error from running build steps."""
def __init__(self, *messages):
super().__init__("\n".join(messages))
class UsageError(BaseError):
"""Usage related error."""
def __init__(self, message):
super().__init__(message)
def _check_python_version():
# According to the BUILD.md, python 3.5+ is required:
# Python 2 is definitely not supported and it should be safer to consider
# it won't run with python 4:
if sys.version_info[0] != 3:
raise BuildError(
"Bad python major version: expecting python 3, found version "
"'{}'".format(sys.version))
if sys.version_info[1] < 6:
raise BuildError(
"Bad python minor version: expecting python 3.6+, found version "
"'{}'".format(sys.version))
def _str_to_bool(s):
"""Convert string to bool (in argparse context)."""
if s.lower() not in ['true', 'false']:
raise ValueError('Need bool; got %r' % s)
return {'true': True, 'false': False}[s.lower()]
_check_python_version()
def _openvino_verify_device_type(device_read):
choices = ["CPU_FP32", "GPU_FP32", "GPU_FP16", "VAD-M_FP16", "MYRIAD_FP16", "VAD-F_FP32"]
status_hetero = True
res = False
if (device_read in choices):
res = True
elif (device_read.startswith("HETERO:") or device_read.startswith("MULTI:")):
res = True
comma_separated_devices = device_read.split(":")
comma_separated_devices = comma_separated_devices[1].split(',')
if (len(comma_separated_devices) < 2):
print("Atleast two devices required in Hetero Mode")
status_hetero = False
dev_options = ["CPU", "GPU", "MYRIAD", "FPGA", "HDDL"]
for dev in comma_separated_devices:
if (dev not in dev_options):
status_hetero = False
break
def invalid_hetero_build():
print("\n" + "If trying to build Hetero or Multi, specifiy the supported devices along with it." + + "\n")
print("specify the keyword HETERO or MULTI followed by the devices ")
print("in the order of priority you want to build" + "\n")
print("The different hardware devices that can be added in HETERO or MULTI")
print("are ['CPU','GPU','MYRIAD','FPGA','HDDL']" + "\n")
print("An example of how to specify the hetero build type. Ex: HETERO:GPU,CPU" + "\n")
print("An example of how to specify the MULTI build type. Ex: MULTI:MYRIAD,CPU" + "\n")
sys.exit("Wrong Build Type selected")
if (res is False):
print("\n" + "You have selcted wrong configuration for the build.")
print("pick the build type for specific Hardware Device from following options: ", choices)
print("\n")
if not (device_read.startswith("HETERO:") or device_read.startswith("MULTI:")):
invalid_hetero_build()
sys.exit("Wrong Build Type selected")
if (status_hetero is False):
invalid_hetero_build()
return device_read
def parse_arguments():
parser = argparse.ArgumentParser(
description="ONNXRuntime CI build driver.",
usage=""" # noqa
Default behavior is --update --build --test for native architecture builds.
Default behavior is --update --build for cross-compiled builds.
The Update phase will update git submodules, and run cmake to generate makefiles.
The Build phase will build all projects.
The Test phase will run all unit tests, and optionally the ONNX tests.
Use the individual flags to only run the specified stages.
""")
# Main arguments
parser.add_argument(
"--build_dir", required=True, help="Path to the build directory.")
parser.add_argument(
"--config", nargs="+", default=["Debug"],
choices=["Debug", "MinSizeRel", "Release", "RelWithDebInfo"],
help="Configuration(s) to build.")
parser.add_argument(
"--update", action='store_true', help="Update makefiles.")
parser.add_argument("--build", action='store_true', help="Build.")
parser.add_argument(
"--clean", action='store_true',
help="Run 'cmake --build --target clean' for the selected config/s.")
parser.add_argument(
"--parallel", nargs='?', const='0', default='1', type=int,
help="Use parallel build. The optional value specifies the maximum number of parallel jobs. "
"If the optional value is 0 or unspecified, it is interpreted as the number of CPUs.")
parser.add_argument("--test", action='store_true', help="Run unit tests.")
parser.add_argument(
"--skip_tests", action='store_true', help="Skip all tests.")
# Training options
parser.add_argument(
"--enable_nvtx_profile", action='store_true', help="Enable NVTX profile in ORT.")
parser.add_argument(
"--enable_memory_profile", action='store_true', help="Enable memory profile in ORT.")
parser.add_argument(
"--enable_training", action='store_true', help="Enable training in ORT.")
parser.add_argument(
"--disable_nccl", action='store_true', help="Disable Nccl.")
parser.add_argument(
"--mpi_home", help="Path to MPI installation dir")
parser.add_argument(
"--nccl_home", help="Path to NCCL installation dir")
parser.add_argument(
"--use_mpi", nargs='?', default=True, const=True, type=_str_to_bool)
# enable ONNX tests
parser.add_argument(
"--enable_onnx_tests", action='store_true',
help="""When running the Test phase, run onnx_test_running against
available test data directories.""")
parser.add_argument("--path_to_protoc_exe", help="Path to protoc exe.")
parser.add_argument(
"--fuzz_testing", action='store_true', help="Enable Fuzz testing of the onnxruntime.")
parser.add_argument(
"--enable_symbolic_shape_infer_tests", action='store_true',
help="""When running the Test phase, run symbolic shape inference against
available test data directories.""")
# generate documentaiton
parser.add_argument(
"--gen_doc", action='store_true',
help="Generate documentation on contrib ops")
# CUDA related
parser.add_argument("--use_cuda", action='store_true', help="Enable CUDA.")
parser.add_argument(
"--cuda_version", help="The version of CUDA toolkit to use. "
"Auto-detect if not specified. e.g. 9.0")
parser.add_argument(
"--cuda_home", help="Path to CUDA home."
"Read from CUDA_HOME environment variable if --use_cuda is true and "
"--cuda_home is not specified.")
parser.add_argument(
"--cudnn_home", help="Path to CUDNN home. "
"Read from CUDNN_HOME environment variable if --use_cuda is true and "
"--cudnn_home is not specified.")
# Python bindings
parser.add_argument(
"--enable_pybind", action='store_true', help="Enable Python Bindings.")
parser.add_argument(
"--build_wheel", action='store_true', help="Build Python Wheel.")
parser.add_argument(
"--wheel_name_suffix", help="Suffix to append to created wheel names. "
"This value is currently only used for nightly builds.")
parser.add_argument(
"--numpy_version", help="Installs a specific version of numpy "
"before building the python binding.")
parser.add_argument(
"--skip-keras-test", action='store_true',
help="Skip tests with Keras if keras is installed")
# C-Sharp bindings
parser.add_argument(
"--build_csharp", action='store_true',
help="Build C#.Net DLL and NuGet package. This should be only used in CI pipelines. "
"For building C# bindings and packaging them into nuget package use --build_nuget arg.")
parser.add_argument(
"--build_nuget", action='store_true',
help="Build C#.Net DLL and NuGet package on the local machine. "
"Currently only Windows and Linux platforms are supported.")
# Java bindings
parser.add_argument(
"--build_java", action='store_true', help="Build Java bindings.")
# Node.js binding
parser.add_argument(
"--build_nodejs", action='store_true',
help="Build Node.js binding and NPM package.")
# Build a shared lib
parser.add_argument(
"--build_shared_lib", action='store_true',
help="Build a shared library for the ONNXRuntime.")
# Build options
parser.add_argument(
"--cmake_extra_defines", nargs="+",
help="Extra definitions to pass to CMake during build system "
"generation. These are just CMake -D options without the leading -D.")
parser.add_argument(
"--target",
help="Build a specific target, e.g. winml_dll")
parser.add_argument(
"--x86", action='store_true',
help="Create x86 makefiles. Requires --update and no existing cache "
"CMake setup. Delete CMakeCache.txt if needed")
parser.add_argument(
"--arm", action='store_true',
help="Create ARM makefiles. Requires --update and no existing cache "
"CMake setup. Delete CMakeCache.txt if needed")
parser.add_argument(
"--arm64", action='store_true',
help="Create ARM64 makefiles. Requires --update and no existing cache "
"CMake setup. Delete CMakeCache.txt if needed")
parser.add_argument(
"--msvc_toolset", help="MSVC toolset to use. e.g. 14.11")
parser.add_argument("--android", action='store_true', help='Build for Android')
parser.add_argument(
"--android_abi", default="arm64-v8a",
choices=["armeabi-v7a", "arm64-v8a", "x86", "x86_64"],
help="Specify the target Android Application Binary Interface (ABI)")
parser.add_argument("--android_api", type=int, default=27, help='Android API Level, e.g. 21')
parser.add_argument("--android_sdk_path", type=str, help='Path to the Android SDK')
parser.add_argument("--android_ndk_path", default="", help="Path to the Android NDK")
parser.add_argument("--android_cpp_shared", action="store_true",
help="Build with shared libc++ instead of the default static libc++.")
parser.add_argument("--android_run_emulator", action="store_true",
help="Start up an Android emulator if needed.")
parser.add_argument("--ios", action='store_true', help="build for ios")
parser.add_argument(
"--ios_sysroot", default="",
help="Specify the location name of the macOS platform SDK to be used")
parser.add_argument(
"--ios_toolchain_dir", default="",
help="Path to ios toolchain binaries")
parser.add_argument(
"--ios_toolchain_file", default="",
help="Path to ios toolchain file, "
"or cmake/onnxruntime_ios.toolchain.cmake will be used")
parser.add_argument(
"--xcode_code_signing_team_id", default="",
help="The development team ID used for code signing in Xcode")
parser.add_argument(
"--use_xcode", action='store_true',
help="Use Xcode as cmake generator, this is only supported on MacOS.")
parser.add_argument(
"--osx_arch",
default="arm64" if platform.machine() == "arm64" else "x86_64",
choices=["arm64", "x86_64"],
help="Specify the Target specific architectures for macOS and iOS, This is only supported on MacOS")
parser.add_argument(
"--apple_deploy_target", type=str,
help="Specify the minimum version of the target platform "
"(e.g. macOS or iOS)"
"This is only supported on MacOS")
# Arguments needed by CI
parser.add_argument(
"--cmake_path", default="cmake", help="Path to the CMake program.")
parser.add_argument(
"--ctest_path", default="ctest", help="Path to the CTest program.")
parser.add_argument(
"--skip_submodule_sync", action='store_true', help="Don't do a "
"'git submodule update'. Makes the Update phase faster.")
parser.add_argument(
"--use_vstest", action='store_true',
help="Use use_vstest for running unitests.")
parser.add_argument(
"--use_mimalloc", default=['none'],
choices=['none', 'stl', 'arena', 'all'], help="Use mimalloc.")
parser.add_argument(
"--use_dnnl", action='store_true', help="Build with DNNL.")
parser.add_argument(
"--dnnl_gpu_runtime", action='store', default='', type=str.lower,
help="e.g. --dnnl_gpu_runtime ocl")
parser.add_argument(
"--dnnl_opencl_root", action='store', default='',
help="Path to OpenCL SDK. "
"e.g. --dnnl_opencl_root \"C:/Program Files (x86)/IntelSWTools/sw_dev_tools/OpenCL/sdk\"")
parser.add_argument(
"--use_featurizers", action='store_true',
help="Build with ML Featurizer support.")
parser.add_argument(
"--use_openvino", nargs="?", const="CPU_FP32",
type=_openvino_verify_device_type,
help="Build with OpenVINO for specific hardware.")
parser.add_argument(
"--use_coreml", action='store_true', help="Build with CoreML support.")
parser.add_argument(
"--use_nnapi", action='store_true', help="Build with NNAPI support.")
parser.add_argument(
"--nnapi_min_api", type=int,
help="Minimum Android API level to enable NNAPI, should be no less than 27")
parser.add_argument(
"--use_rknpu", action='store_true', help="Build with RKNPU.")
parser.add_argument(
"--use_preinstalled_eigen", action='store_true',
help="Use pre-installed Eigen.")
parser.add_argument("--eigen_path", help="Path to pre-installed Eigen.")
parser.add_argument(
"--use_openmp", action='store_true', help="Build with OpenMP")
parser.add_argument(
"--enable_msinternal", action="store_true",
help="Enable for Microsoft internal builds only.")
parser.add_argument("--llvm_path", help="Path to llvm dir")
parser.add_argument(
"--use_vitisai", action='store_true', help="Build with Vitis-AI")
parser.add_argument(
"--use_nuphar", action='store_true', help="Build with nuphar")
parser.add_argument(
"--use_tensorrt", action='store_true', help="Build with TensorRT")
parser.add_argument(
"--tensorrt_home", help="Path to TensorRT installation dir")
parser.add_argument(
"--use_migraphx", action='store_true', help="Build with MIGraphX")
parser.add_argument(
"--migraphx_home", help="Path to MIGraphX installation dir")
parser.add_argument(
"--use_full_protobuf", action='store_true',
help="Use the full protobuf library")
parser.add_argument(
"--skip_onnx_tests", action='store_true', help="Explicitly disable "
"all onnx related tests. Note: Use --skip_tests to skip all tests.")
parser.add_argument(
"--skip_winml_tests", action='store_true',
help="Explicitly disable all WinML related tests")
parser.add_argument(
"--skip_nodejs_tests", action='store_true',
help="Explicitly disable all Node.js binding tests")
parser.add_argument(
"--enable_msvc_static_runtime", action='store_true',
help="Enable static linking of MSVC runtimes.")
parser.add_argument(
"--enable_language_interop_ops", action='store_true',
help="Enable operator implemented in language other than cpp")
parser.add_argument(
"--cmake_generator",
choices=['Visual Studio 15 2017', 'Visual Studio 16 2019', 'Ninja'],
default='Visual Studio 15 2017' if is_windows() else None,
help="Specify the generator that CMake invokes. "
"This is only supported on Windows")
parser.add_argument(
"--enable_multi_device_test", action='store_true',
help="Test with multi-device. Mostly used for multi-device GPU")
parser.add_argument(
"--use_dml", action='store_true', help="Build with DirectML.")
parser.add_argument(
"--use_winml", action='store_true', help="Build with WinML.")
parser.add_argument(
"--winml_root_namespace_override", type=str,
help="Specify the namespace that WinML builds into.")
parser.add_argument(
"--use_telemetry", action='store_true',
help="Only official builds can set this flag to enable telemetry.")
parser.add_argument(
"--enable_wcos", action='store_true',
help="Build for Windows Core OS.")
parser.add_argument(
"--enable_windows_store", action='store_true',
help="Build for Windows Store")
parser.add_argument(
"--enable_lto", action='store_true',
help="Enable Link Time Optimization")
parser.add_argument(
"--use_acl", nargs="?", const="ACL_1905",
choices=["ACL_1902", "ACL_1905", "ACL_1908", "ACL_2002"],
help="Build with ACL for ARM architectures.")
parser.add_argument(
"--acl_home", help="Path to ACL home dir")
parser.add_argument(
"--acl_libs", help="Path to ACL libraries")
parser.add_argument(
"--use_armnn", action='store_true',
help="Enable ArmNN Execution Provider.")
parser.add_argument(
"--armnn_relu", action='store_true',
help="Use the Relu operator implementation from the ArmNN EP.")
parser.add_argument(
"--armnn_bn", action='store_true',
help="Use the Batch Normalization operator implementation from the ArmNN EP.")
parser.add_argument(
"--armnn_home", help="Path to ArmNN home dir")
parser.add_argument(
"--armnn_libs", help="Path to ArmNN libraries")
parser.add_argument(
"--build_micro_benchmarks", action='store_true',
help="Build ONNXRuntime micro-benchmarks.")
# options to reduce binary size
parser.add_argument("--minimal_build", action='store',
const='on', default='off', nargs='?', type=str.lower,
help="Create a build that only supports ORT format models. "
"See /docs/ONNX_Runtime_Format_Model_Usage.md for more information. "
"RTTI is automatically disabled in a minimal build. "
"To enable execution providers that compile kernels at runtime (e.g. NNAPI) pass 'extended' "
"as a parameter. e.g. '--minimal_build extended'.")
parser.add_argument("--include_ops_by_config", type=str,
help="include ops from config file. "
"See /docs/Reduced_Operator_Kernel_build.md for more information.")
parser.add_argument("--enable_reduced_operator_type_support", action='store_true',
help='If --include_ops_by_config is specified, and the configuration file was created from ORT '
'format models with type reduction enabled, limit the types individual operators support '
'where possible to further reduce the build size. '
'See /docs/Reduced_Operator_Kernel_build.md for more information.')
parser.add_argument("--disable_contrib_ops", action='store_true',
help="Disable contrib ops (reduces binary size)")
parser.add_argument("--disable_ml_ops", action='store_true',
help="Disable traditional ML ops (reduces binary size)")
parser.add_argument("--disable_rtti", action='store_true', help="Disable RTTI (reduces binary size)")
parser.add_argument("--disable_exceptions", action='store_true',
help="Disable exceptions to reduce binary size. Requires --minimal_build.")
parser.add_argument("--disable_ort_format_load", action='store_true',
help='Disable support for loading ORT format models in a non-minimal build.')
parser.add_argument("--use_rocm", action='store_true', help="Build with ROCm")
parser.add_argument("--rocm_home", help="Path to ROCm installation dir")
# Code coverage
parser.add_argument("--code_coverage", action='store_true',
help="Generate code coverage when targetting Android (only).")
return parser.parse_args()
def resolve_executable_path(command_or_path):
"""Returns the absolute path of an executable."""
executable_path = shutil.which(command_or_path)
if executable_path is None:
raise BuildError("Failed to resolve executable path for "
"'{}'.".format(command_or_path))
return os.path.realpath(executable_path)
def get_linux_distro():
try:
with open('/etc/os-release', 'r') as f:
dist_info = dict(
line.strip().split('=', 1) for line in f.readlines())
return dist_info.get('NAME', '').strip('"'), dist_info.get(
'VERSION', '').strip('"')
except (IOError, ValueError):
return '', ''
def is_ubuntu_1604():
dist, ver = get_linux_distro()
return dist == 'Ubuntu' and ver.startswith('16.04')
def get_config_build_dir(build_dir, config):
# build directory per configuration
return os.path.join(build_dir, config)
def run_subprocess(args, cwd=None, capture_stdout=False, dll_path=None,
shell=False, env={}):
if isinstance(args, str):
raise ValueError("args should be a sequence of strings, not a string")
my_env = os.environ.copy()
if dll_path:
if is_windows():
my_env["PATH"] = dll_path + os.pathsep + my_env["PATH"]
else:
if "LD_LIBRARY_PATH" in my_env:
my_env["LD_LIBRARY_PATH"] += os.pathsep + dll_path
else:
my_env["LD_LIBRARY_PATH"] = dll_path
my_env.update(env)
return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
def update_submodules(source_dir):
run_subprocess(["git", "submodule", "sync", "--recursive"], cwd=source_dir)
run_subprocess(["git", "submodule", "update", "--init", "--recursive"],
cwd=source_dir)
def is_docker():
path = '/proc/self/cgroup'
return (
os.path.exists('/.dockerenv') or
os.path.isfile(path) and any('docker' in line for line in open(path))
)
def install_python_deps(numpy_version=""):
dep_packages = ['setuptools', 'wheel', 'pytest']
dep_packages.append('numpy=={}'.format(numpy_version) if numpy_version
else 'numpy>=1.16.6')
dep_packages.append('sympy>=1.1')
dep_packages.append('packaging')
dep_packages.append('cerberus')
run_subprocess([sys.executable, '-m', 'pip', 'install', '--trusted-host',
'files.pythonhosted.org'] + dep_packages)
# We need to install Torch to test certain functionalities of the ORT Python package
def install_torch():
# Command works for both Windows
run_subprocess([sys.executable, '-m', 'pip', 'install', '--trusted-host',
'files.pythonhosted.org', 'torch===1.5.1+cu101', 'torchvision===0.6.1+cu101',
'-f', 'https://download.pytorch.org/whl/torch_stable.html'])
def check_md5(filename, expected_md5):
if not os.path.exists(filename):
return False
hash_md5 = hashlib.md5()
BLOCKSIZE = 1024 * 64
with open(filename, "rb") as f:
buf = f.read(BLOCKSIZE)
while len(buf) > 0:
hash_md5.update(buf)
buf = f.read(BLOCKSIZE)
hex = hash_md5.hexdigest()
if hex != expected_md5:
log.info('md5 mismatch, expect %s, got %s' % (expected_md5, hex))
os.remove(filename)
return False
return True
def setup_test_data(build_dir, configs):
# create a shortcut for test models if there is a 'models'
# folder in build_dir
if is_windows():
src_model_dir = os.path.join(build_dir, 'models')
if os.path.exists('C:\\local\\models') and not os.path.exists(
src_model_dir):
log.debug("creating shortcut %s -> %s" % (
'C:\\local\\models', src_model_dir))
run_subprocess(['mklink', '/D', '/J', src_model_dir,
'C:\\local\\models'], shell=True)
for config in configs:
config_build_dir = get_config_build_dir(build_dir, config)
os.makedirs(config_build_dir, exist_ok=True)
dest_model_dir = os.path.join(config_build_dir, 'models')
if os.path.exists('C:\\local\\models') and not os.path.exists(
dest_model_dir):
log.debug("creating shortcut %s -> %s" % (
'C:\\local\\models', dest_model_dir))
run_subprocess(['mklink', '/D', '/J', dest_model_dir,
'C:\\local\\models'], shell=True)
elif os.path.exists(src_model_dir) and not os.path.exists(
dest_model_dir):
log.debug("creating shortcut %s -> %s" % (
src_model_dir, dest_model_dir))
run_subprocess(['mklink', '/D', '/J', dest_model_dir,
src_model_dir], shell=True)
def use_dev_mode(args):
if args.use_acl:
return 'OFF'
if args.use_armnn:
return 'OFF'
if args.ios and is_macOS():
return 'OFF'
return 'ON'
def generate_build_tree(cmake_path, source_dir, build_dir, cuda_home, cudnn_home, rocm_home,
mpi_home, nccl_home, tensorrt_home, migraphx_home, acl_home, acl_libs, armnn_home, armnn_libs,
path_to_protoc_exe, configs, cmake_extra_defines, args, cmake_extra_args):
log.info("Generating CMake build tree")
cmake_dir = os.path.join(source_dir, "cmake")
cmake_args = [
cmake_path, cmake_dir,
"-Donnxruntime_RUN_ONNX_TESTS=" + (
"ON" if args.enable_onnx_tests else "OFF"),
"-Donnxruntime_BUILD_WINML_TESTS=" + (
"OFF" if args.skip_winml_tests else "ON"),
"-Donnxruntime_GENERATE_TEST_REPORTS=ON",
"-Donnxruntime_DEV_MODE=" + use_dev_mode(args),
"-DPYTHON_EXECUTABLE=" + sys.executable,
"-Donnxruntime_USE_CUDA=" + ("ON" if args.use_cuda else "OFF"),
"-Donnxruntime_CUDNN_HOME=" + (cudnn_home if args.use_cuda else ""),
"-Donnxruntime_USE_FEATURIZERS=" + (
"ON" if args.use_featurizers else "OFF"),
"-Donnxruntime_CUDA_HOME=" + (cuda_home if args.use_cuda else ""),
"-Donnxruntime_USE_MIMALLOC_STL_ALLOCATOR=" + (
"ON" if args.use_mimalloc == "stl" or
args.use_mimalloc == "all" else "OFF"),
"-Donnxruntime_USE_MIMALLOC_ARENA_ALLOCATOR=" + (
"ON" if args.use_mimalloc == "arena" or
args.use_mimalloc == "all" else "OFF"),
"-Donnxruntime_ENABLE_PYTHON=" + (
"ON" if args.enable_pybind else "OFF"),
"-Donnxruntime_BUILD_CSHARP=" + ("ON" if args.build_csharp else "OFF"),
"-Donnxruntime_BUILD_JAVA=" + ("ON" if args.build_java else "OFF"),
"-Donnxruntime_BUILD_NODEJS=" + ("ON" if args.build_nodejs else "OFF"),
"-Donnxruntime_BUILD_SHARED_LIB=" + (
"ON" if args.build_shared_lib else "OFF"),
"-Donnxruntime_USE_DNNL=" + ("ON" if args.use_dnnl else "OFF"),
"-Donnxruntime_DNNL_GPU_RUNTIME=" + (args.dnnl_gpu_runtime if args.use_dnnl else ""),
"-Donnxruntime_DNNL_OPENCL_ROOT=" + (args.dnnl_opencl_root if args.use_dnnl else ""),
"-Donnxruntime_USE_NNAPI_BUILTIN=" + ("ON" if args.use_nnapi else "OFF"),
"-Donnxruntime_USE_RKNPU=" + ("ON" if args.use_rknpu else "OFF"),
"-Donnxruntime_USE_OPENMP=" + (
"ON" if args.use_openmp and not (
args.use_nnapi or
args.android or (args.ios and is_macOS())
or args.use_rknpu)
else "OFF"),
"-Donnxruntime_USE_TVM=" + ("ON" if args.use_nuphar else "OFF"),
"-Donnxruntime_USE_LLVM=" + ("ON" if args.use_nuphar else "OFF"),
"-Donnxruntime_ENABLE_MICROSOFT_INTERNAL=" + (
"ON" if args.enable_msinternal else "OFF"),
"-Donnxruntime_USE_VITISAI=" + ("ON" if args.use_vitisai else "OFF"),
"-Donnxruntime_USE_NUPHAR=" + ("ON" if args.use_nuphar else "OFF"),
"-Donnxruntime_USE_TENSORRT=" + ("ON" if args.use_tensorrt else "OFF"),
"-Donnxruntime_TENSORRT_HOME=" + (
tensorrt_home if args.use_tensorrt else ""),
# set vars for migraphx
"-Donnxruntime_USE_MIGRAPHX=" + ("ON" if args.use_migraphx else "OFF"),
"-Donnxruntime_MIGRAPHX_HOME=" + (migraphx_home if args.use_migraphx else ""),
# By default - we currently support only cross compiling for
# ARM/ARM64 (no native compilation supported through this
# script).
"-Donnxruntime_CROSS_COMPILING=" + (
"ON" if args.arm64 or args.arm else "OFF"),
"-Donnxruntime_DISABLE_CONTRIB_OPS=" + ("ON" if args.disable_contrib_ops else "OFF"),
"-Donnxruntime_DISABLE_ML_OPS=" + ("ON" if args.disable_ml_ops else "OFF"),
"-Donnxruntime_DISABLE_RTTI=" + ("ON" if args.disable_rtti else "OFF"),
"-Donnxruntime_DISABLE_EXCEPTIONS=" + ("ON" if args.disable_exceptions else "OFF"),
"-Donnxruntime_DISABLE_ORT_FORMAT_LOAD=" + ("ON" if args.disable_ort_format_load else "OFF"),
"-Donnxruntime_MINIMAL_BUILD=" + ("ON" if args.minimal_build != 'off' else "OFF"),
"-Donnxruntime_EXTENDED_MINIMAL_BUILD=" + ("ON" if args.minimal_build == 'extended' else "OFF"),
"-Donnxruntime_REDUCED_OPS_BUILD=" + ("ON" if args.include_ops_by_config else "OFF"),
"-Donnxruntime_MSVC_STATIC_RUNTIME=" + (
"ON" if args.enable_msvc_static_runtime else "OFF"),
# enable pyop if it is nightly build
"-Donnxruntime_ENABLE_LANGUAGE_INTEROP_OPS=" + (
"ON" if args.enable_language_interop_ops else "OFF"),
"-Donnxruntime_USE_DML=" + ("ON" if args.use_dml else "OFF"),
"-Donnxruntime_USE_WINML=" + ("ON" if args.use_winml else "OFF"),
"-Donnxruntime_USE_TELEMETRY=" + (
"ON" if args.use_telemetry else "OFF"),
"-Donnxruntime_ENABLE_LTO=" + ("ON" if args.enable_lto else "OFF"),
"-Donnxruntime_USE_ACL=" + ("ON" if args.use_acl else "OFF"),
"-Donnxruntime_USE_ACL_1902=" + (
"ON" if args.use_acl == "ACL_1902" else "OFF"),
"-Donnxruntime_USE_ACL_1905=" + (
"ON" if args.use_acl == "ACL_1905" else "OFF"),
"-Donnxruntime_USE_ACL_1908=" + (
"ON" if args.use_acl == "ACL_1908" else "OFF"),
"-Donnxruntime_USE_ACL_2002=" + (
"ON" if args.use_acl == "ACL_2002" else "OFF"),
"-Donnxruntime_USE_ARMNN=" + (
"ON" if args.use_armnn else "OFF"),
"-Donnxruntime_ARMNN_RELU_USE_CPU=" + (
"OFF" if args.armnn_relu else "ON"),
"-Donnxruntime_ARMNN_BN_USE_CPU=" + (
"OFF" if args.armnn_bn else "ON"),
# Training related flags
"-Donnxruntime_ENABLE_NVTX_PROFILE=" + (
"ON" if args.enable_nvtx_profile else "OFF"),
"-Donnxruntime_ENABLE_TRAINING=" + (
"ON" if args.enable_training else "OFF"),
# Enable advanced computations such as AVX for some traininig related ops.
"-Donnxruntime_ENABLE_CPU_FP16_OPS=" + (
"ON" if args.enable_training else "OFF"),
"-Donnxruntime_USE_NCCL=" + (
"OFF" if args.disable_nccl else "ON"),
"-Donnxruntime_BUILD_BENCHMARKS=" + (
"ON" if args.build_micro_benchmarks else "OFF"),
"-Donnxruntime_USE_ROCM=" + ("ON" if args.use_rocm else "OFF"),
"-Donnxruntime_ROCM_HOME=" + (rocm_home if args.use_rocm else ""),
"-DOnnxruntime_GCOV_COVERAGE=" + ("ON" if args.code_coverage else "OFF"),
"-Donnxruntime_USE_MPI=" + (
"ON" if args.use_mpi else "OFF"),
"-Donnxruntime_ENABLE_MEMORY_PROFILE=" + (
"ON" if args.enable_memory_profile else "OFF"),
]
if acl_home and os.path.exists(acl_home):
cmake_args += ["-Donnxruntime_ACL_HOME=" + acl_home]
if acl_libs and os.path.exists(acl_libs):
cmake_args += ["-Donnxruntime_ACL_LIBS=" + acl_libs]
if armnn_home and os.path.exists(armnn_home):
cmake_args += ["-Donnxruntime_ARMNN_HOME=" + armnn_home]
if armnn_libs and os.path.exists(armnn_libs):
cmake_args += ["-Donnxruntime_ARMNN_LIBS=" + armnn_libs]
if mpi_home and os.path.exists(mpi_home):
if args.use_mpi:
cmake_args += ["-Donnxruntime_MPI_HOME=" + mpi_home]
else:
log.warning("mpi_home is supplied but use_mpi is set to false."
" Build will continue without linking MPI libraries.")
if nccl_home and os.path.exists(nccl_home):
cmake_args += ["-Donnxruntime_NCCL_HOME=" + nccl_home]
if args.winml_root_namespace_override:
cmake_args += ["-Donnxruntime_WINML_NAMESPACE_OVERRIDE=" +
args.winml_root_namespace_override]
if args.use_openvino:
cmake_args += ["-Donnxruntime_USE_OPENVINO=ON",
"-Donnxruntime_USE_OPENVINO_MYRIAD=" + (
"ON" if args.use_openvino == "MYRIAD_FP16" else "OFF"),
"-Donnxruntime_USE_OPENVINO_GPU_FP32=" + (
"ON" if args.use_openvino == "GPU_FP32" else "OFF"),
"-Donnxruntime_USE_OPENVINO_GPU_FP16=" + (
"ON" if args.use_openvino == "GPU_FP16" else "OFF"),
"-Donnxruntime_USE_OPENVINO_CPU_FP32=" + (
"ON" if args.use_openvino == "CPU_FP32" else "OFF"),
"-Donnxruntime_USE_OPENVINO_VAD_M=" + (
"ON" if args.use_openvino == "VAD-M_FP16" else "OFF"),
"-Donnxruntime_USE_OPENVINO_VAD_F=" + (
"ON" if args.use_openvino == "VAD-F_FP32" else "OFF"),
"-Donnxruntime_USE_OPENVINO_HETERO=" + (
"ON" if args.use_openvino.startswith("HETERO") else "OFF"),
"-Donnxruntime_USE_OPENVINO_DEVICE=" + (args.use_openvino),
"-Donnxruntime_USE_OPENVINO_MULTI=" + (
"ON" if args.use_openvino.startswith("MULTI") else "OFF")]
# TensorRT and OpenVINO providers currently only supports
# full_protobuf option.
if (args.use_full_protobuf or args.use_tensorrt or
args.use_openvino or args.use_vitisai or args.gen_doc):
cmake_args += [
"-Donnxruntime_USE_FULL_PROTOBUF=ON",
"-DProtobuf_USE_STATIC_LIBS=ON"
]
if args.use_nuphar and args.llvm_path is not None:
cmake_args += ["-DLLVM_DIR=%s" % args.llvm_path]
if args.use_cuda and not is_windows():
nvml_stub_path = cuda_home + "/lib64/stubs"
cmake_args += ["-DCUDA_CUDA_LIBRARY=" + nvml_stub_path]
if args.use_preinstalled_eigen:
cmake_args += ["-Donnxruntime_USE_PREINSTALLED_EIGEN=ON",
"-Deigen_SOURCE_PATH=" + args.eigen_path]
if args.nnapi_min_api:
cmake_args += ["-Donnxruntime_NNAPI_MIN_API=" + str(args.nnapi_min_api)]
if args.android:
cmake_args += [
"-DCMAKE_TOOLCHAIN_FILE=" + args.android_ndk_path + "/build/cmake/android.toolchain.cmake",
"-DANDROID_PLATFORM=android-" + str(args.android_api),
"-DANDROID_ABI=" + str(args.android_abi)
]
if args.android_cpp_shared:
cmake_args += ["-DANDROID_STL=c++_shared"]
if is_macOS() and not args.android:
cmake_args += ["-DCMAKE_OSX_ARCHITECTURES=" + args.osx_arch]
# since cmake 3.19, it uses the xcode latest buildsystem, which is not supported by this project.
cmake_verstr = subprocess.check_output(['cmake', '--version']).decode('utf-8').split()[2]
if args.use_xcode and LooseVersion(cmake_verstr) >= LooseVersion('3.19.0'):
cmake_args += ["-T", "buildsystem=1"]
if args.apple_deploy_target:
cmake_args += ["-DCMAKE_OSX_DEPLOYMENT_TARGET=" + args.apple_deploy_target]
if args.use_coreml:
if not is_macOS():
raise BuildError("Build CoreML EP requires macOS")
cmake_args += ["-Donnxruntime_USE_COREML=ON"]
if args.ios:
if is_macOS():
needed_args = [
args.use_xcode,
args.ios_sysroot,
args.apple_deploy_target,
]
arg_names = [
"--use_xcode " +
"<need use xcode to cross build iOS on MacOS>",
"--ios_sysroot " +
"<the location or name of the macOS platform SDK>",
"--apple_deploy_target " +
"<the minimum version of the target platform>",
]
if not all(needed_args):
raise BuildError(
"iOS build on MacOS canceled due to missing arguments: " +
', '.join(
val for val, cond in zip(arg_names, needed_args)
if not cond))
cmake_args += [
"-DCMAKE_SYSTEM_NAME=iOS",
"-Donnxruntime_BUILD_SHARED_LIB=ON",
"-DCMAKE_OSX_SYSROOT=" + args.ios_sysroot,
"-DCMAKE_OSX_DEPLOYMENT_TARGET=" + args.apple_deploy_target,
# we do not need protoc binary for ios cross build
"-Dprotobuf_BUILD_PROTOC_BINARIES=OFF",
"-DCMAKE_TOOLCHAIN_FILE=" + (
args.ios_toolchain_file if args.ios_toolchain_file
else "../cmake/onnxruntime_ios.toolchain.cmake")
]
# Code sign the binaries, if the code signing development team id is provided
if args.xcode_code_signing_team_id:
cmake_args += ["-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=" + args.xcode_code_signing_team_id]
else:
# TODO: the cross compiling on Linux is not officially supported by Apple
# and is already broken with the latest codebase, so it should be removed.
# We are cross compiling on Linux
needed_args = [
args.ios_sysroot,
args.arm64 or args.arm,
args.ios_toolchain_dir
]
arg_names = [
"--ios_sysroot <path to sysroot>",
"--arm or --arm64",
"--ios_toolchain_dir <path to toolchain>"
]
if not all(needed_args):
raise BuildError(
"iOS build canceled due to missing arguments: " +
', '.join(
val for val, cond in zip(arg_names, needed_args)
if not cond))
compilers = sorted(
glob.glob(args.ios_toolchain_dir + "/bin/*-clang*"))
os.environ["PATH"] = os.path.join(
args.ios_toolchain_dir, "bin") + os.pathsep + os.environ.get(
"PATH", "")
os.environ["LD_LIBRARY_PATH"] = os.path.join(
args.ios_toolchain_dir, "/lib") + os.pathsep + os.environ.get(
"LD_LIBRARY_PATH", "")
if len(compilers) != 2:
raise BuildError(
"error identifying compilers in ios_toolchain_dir")
cmake_args += [
"-DCMAKE_OSX_ARCHITECTURES=" +
("arm64" if args.arm64 else "arm"),
"-DCMAKE_SYSTEM_NAME=iOSCross",
"-Donnxruntime_BUILD_UNIT_TESTS=OFF",
"-DCMAKE_OSX_SYSROOT=" + args.ios_sysroot,
"-DCMAKE_C_COMPILER=" + compilers[0],
"-DCMAKE_CXX_COMPILER=" + compilers[1]
]
if path_to_protoc_exe:
cmake_args += [
"-DONNX_CUSTOM_PROTOC_EXECUTABLE=%s" % path_to_protoc_exe]
if args.fuzz_testing:
if not (args.build_shared_lib and
is_windows() and
args.cmake_generator == 'Visual Studio 16 2019' and
args.use_full_protobuf):
raise BuildError(
"Fuzz test has only be tested with build shared libs option using MSVC on windows")
cmake_args += [
"-Donnxruntime_BUILD_UNIT_TESTS=ON",
"-Donnxruntime_FUZZ_TEST=ON",
"-Donnxruntime_USE_FULL_PROTOBUF=ON"]
if args.gen_doc:
cmake_args += ["-Donnxruntime_PYBIND_EXPORT_OPSCHEMA=ON"]
else:
cmake_args += ["-Donnxruntime_PYBIND_EXPORT_OPSCHEMA=OFF"]
cmake_args += ["-D{}".format(define) for define in cmake_extra_defines]
cmake_args += cmake_extra_args
# ADO pipelines will store the pipeline build number
# (e.g. 191101-2300.1.master) and source version in environment
# variables. If present, use these values to define the
# WinML/ORT DLL versions.
build_number = os.getenv('Build_BuildNumber')
source_version = os.getenv('Build_SourceVersion')
if build_number and source_version:
build_matches = re.fullmatch(
r"(\d\d)(\d\d)(\d\d)(\d\d)\.(\d+)", build_number)
if build_matches:
YY = build_matches.group(2)
MM = build_matches.group(3)
DD = build_matches.group(4)
# Get ORT major and minor number
with open(os.path.join(source_dir, 'VERSION_NUMBER')) as f:
first_line = f.readline()
ort_version_matches = re.match(r"(\d+).(\d+)", first_line)
if not ort_version_matches:
raise BuildError("Couldn't read version from VERSION_FILE")
ort_major = ort_version_matches.group(1)
ort_minor = ort_version_matches.group(2)
# Example (BuildNumber: 191101-2300.1.master,
# SourceVersion: 0bce7ae6755c792eda558e5d27ded701707dc404)
# MajorPart = 1
# MinorPart = 0
# BuildPart = 1911
# PrivatePart = 123
# String = 191101-2300.1.master.0bce7ae
cmake_args += [
"-DVERSION_MAJOR_PART={}".format(ort_major),
"-DVERSION_MINOR_PART={}".format(ort_minor),
"-DVERSION_BUILD_PART={}".format(YY),
"-DVERSION_PRIVATE_PART={}{}".format(MM, DD),
"-DVERSION_STRING={}.{}.{}.{}".format(
ort_major, ort_minor, build_number,
source_version[0:7])
]
for config in configs:
config_build_dir = get_config_build_dir(build_dir, config)
os.makedirs(config_build_dir, exist_ok=True)
if args.use_nuphar:
os.environ["PATH"] = os.path.join(
config_build_dir, "external", "tvm",
config) + os.pathsep + os.path.dirname(sys.executable) + os.pathsep + os.environ["PATH"]
run_subprocess(
cmake_args + [
"-Donnxruntime_ENABLE_MEMLEAK_CHECKER=" +
("ON" if config.lower() == 'debug' and not args.use_nuphar and not
args.use_openvino and not
args.enable_msvc_static_runtime
else "OFF"), "-DCMAKE_BUILD_TYPE={}".format(config)],
cwd=config_build_dir)
def clean_targets(cmake_path, build_dir, configs):
for config in configs:
log.info("Cleaning targets for %s configuration", config)
build_dir2 = get_config_build_dir(build_dir, config)
cmd_args = [cmake_path,
"--build", build_dir2,
"--config", config,
"--target", "clean"]
run_subprocess(cmd_args)
def build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, target=None):
for config in configs:
log.info("Building targets for %s configuration", config)
build_dir2 = get_config_build_dir(build_dir, config)
cmd_args = [cmake_path,
"--build", build_dir2,
"--config", config]
if target:
cmd_args.extend(['--target', target])
build_tool_args = []
if num_parallel_jobs != 1:
if is_windows() and args.cmake_generator != 'Ninja':
build_tool_args += [
"/maxcpucount:{}".format(num_parallel_jobs),
# if nodeReuse is true, msbuild processes will stay around for a bit after the build completes
"/nodeReuse:False",
]
elif (is_macOS() and args.use_xcode):
# CMake will generate correct build tool args for Xcode
cmd_args += ["--parallel", str(num_parallel_jobs)]
else:
build_tool_args += ["-j{}".format(num_parallel_jobs)]
if build_tool_args:
cmd_args += ["--"]
cmd_args += build_tool_args
env = {}
if args.android:
env['ANDROID_SDK_ROOT'] = args.android_sdk_path
run_subprocess(cmd_args, env=env)
def add_dir_if_exists(directory, dir_list):
if os.path.isdir(directory):
dir_list.append(directory)
def setup_cuda_vars(args):
cuda_home = ""
cudnn_home = ""
if args.use_cuda:
cuda_home = args.cuda_home if args.cuda_home else os.getenv(
"CUDA_HOME")
cudnn_home = args.cudnn_home if args.cudnn_home else os.getenv(
"CUDNN_HOME")
cuda_home_valid = (cuda_home is not None and os.path.exists(cuda_home))
cudnn_home_valid = (cudnn_home is not None and os.path.exists(
cudnn_home))
if not cuda_home_valid or not cudnn_home_valid:
raise BuildError(
"cuda_home and cudnn_home paths must be specified and valid.",
"cuda_home='{}' valid={}. cudnn_home='{}' valid={}"
.format(
cuda_home, cuda_home_valid, cudnn_home, cudnn_home_valid))
return cuda_home, cudnn_home
def setup_tensorrt_vars(args):
tensorrt_home = ""
if args.use_tensorrt:
tensorrt_home = (args.tensorrt_home if args.tensorrt_home
else os.getenv("TENSORRT_HOME"))
tensorrt_home_valid = (tensorrt_home is not None and
os.path.exists(tensorrt_home))
if not tensorrt_home_valid:
raise BuildError(
"tensorrt_home paths must be specified and valid.",
"tensorrt_home='{}' valid={}."
.format(tensorrt_home, tensorrt_home_valid))
# Set maximum workspace size in byte for
# TensorRT (1GB = 1073741824 bytes).
os.environ["ORT_TENSORRT_MAX_WORKSPACE_SIZE"] = "1073741824"
# Set maximum number of iterations to detect unsupported nodes
# and partition the models for TensorRT.
os.environ["ORT_TENSORRT_MAX_PARTITION_ITERATIONS"] = "1000"
# Set minimum subgraph node size in graph partitioning
# for TensorRT.
os.environ["ORT_TENSORRT_MIN_SUBGRAPH_SIZE"] = "1"
# Set FP16 flag
os.environ["ORT_TENSORRT_FP16_ENABLE"] = "0"
return tensorrt_home
def setup_migraphx_vars(args):
migraphx_home = None
if (args.use_migraphx):
print("migraphx_home = {}".format(args.migraphx_home))
migraphx_home = args.migraphx_home or os.getenv("MIGRAPHX_HOME") or None
migraphx_home_not_valid = (migraphx_home and not os.path.exists(migraphx_home))
if (migraphx_home_not_valid):
raise BuildError("migraphx_home paths must be specified and valid.",
"migraphx_home='{}' valid={}."
.format(migraphx_home, migraphx_home_not_valid))
return migraphx_home or ''
def setup_dml_build(args, cmake_path, build_dir, configs):
if args.use_dml:
for config in configs:
# Run the RESTORE_PACKAGES target to perform the initial
# NuGet setup.
cmd_args = [cmake_path,
"--build", get_config_build_dir(build_dir, config),
"--config", config,
"--target", "RESTORE_PACKAGES"]
run_subprocess(cmd_args)
def setup_rocm_build(args, configs):
rocm_home = None
if (args.use_rocm):
print("rocm_home = {}".format(args.rocm_home))
rocm_home = args.rocm_home or None
rocm_home_not_valid = (rocm_home and not os.path.exists(rocm_home))
if (rocm_home_not_valid):
raise BuildError("rocm_home paths must be specified and valid.",
"rocm_home='{}' valid={}."
.format(rocm_home, rocm_home_not_valid))
for config in configs:
amd_hipify(get_config_build_dir(args.build_dir, config))
return rocm_home or ''
def run_android_tests(args, source_dir, config, cwd):
sdk_tool_paths = android.get_sdk_tool_paths(args.android_sdk_path)
device_dir = '/data/local/tmp'
def adb_push(src, dest, **kwargs):
return run_subprocess([sdk_tool_paths.adb, 'push', src, dest], **kwargs)
def adb_shell(*args, **kwargs):
return run_subprocess([sdk_tool_paths.adb, 'shell', *args], **kwargs)
def run_adb_shell(cmd):
# GCOV_PREFIX_STRIP specifies the depth of the directory hierarchy to strip and
# GCOV_PREFIX specifies the root directory
# for creating the runtime code coverage files.
if args.code_coverage:
adb_shell(
'cd {0} && GCOV_PREFIX={0} GCOV_PREFIX_STRIP={1} {2}'.format(
device_dir, cwd.count(os.sep) + 1, cmd))
else:
adb_shell('cd {} && {}'.format(device_dir, cmd))
if args.android_abi == 'x86_64':
with contextlib.ExitStack() as context_stack:
if args.android_run_emulator:
avd_name = "ort_android"
system_image = "system-images;android-{};google_apis;{}".format(
args.android_api, args.android_abi)
android.create_virtual_device(sdk_tool_paths, system_image, avd_name)
emulator_proc = context_stack.enter_context(
android.start_emulator(
sdk_tool_paths=sdk_tool_paths,
avd_name=avd_name,
extra_args=[
"-partition-size", "2047",
"-wipe-data"]))
context_stack.callback(android.stop_emulator, emulator_proc)
adb_push('testdata', device_dir, cwd=cwd)
adb_push(
os.path.join(source_dir, 'cmake', 'external', 'onnx', 'onnx', 'backend', 'test'),
device_dir, cwd=cwd)
adb_push('onnxruntime_test_all', device_dir, cwd=cwd)
adb_shell('chmod +x {}/onnxruntime_test_all'.format(device_dir))
adb_push('onnx_test_runner', device_dir, cwd=cwd)
adb_shell('chmod +x {}/onnx_test_runner'.format(device_dir))
run_adb_shell('{0}/onnxruntime_test_all'.format(device_dir))
if args.use_nnapi:
adb_shell('cd {0} && {0}/onnx_test_runner -e nnapi {0}/test'.format(device_dir))
else:
adb_shell('cd {0} && {0}/onnx_test_runner {0}/test'.format(device_dir))
# run shared_lib_test if necessary
if args.build_shared_lib:
adb_push('libonnxruntime.so', device_dir, cwd=cwd)
adb_push('onnxruntime_shared_lib_test', device_dir, cwd=cwd)
adb_shell('chmod +x {}/onnxruntime_shared_lib_test'.format(device_dir))
run_adb_shell(
'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{0} && {0}/onnxruntime_shared_lib_test'.format(
device_dir))
def run_ios_tests(args, source_dir, config, cwd):
cpr = run_subprocess(["xcodebuild", "test", "-project", "./onnxruntime.xcodeproj",
"-configuration", config,
"-scheme", "onnxruntime_test_all_xc", "-destination",
"platform=iOS Simulator,OS=latest,name=iPhone SE (2nd generation)"], cwd=cwd)
if cpr.returncode == 0:
cpr = run_subprocess(["xcodebuild", "test", "-project", "./onnxruntime.xcodeproj",
"-configuration", config,
"-scheme", "onnxruntime_shared_lib_test_xc", "-destination",
"platform=iOS Simulator,OS=latest,name=iPhone SE (2nd generation)"], cwd=cwd)
cpr.check_returncode()
def run_orttraining_test_orttrainer_frontend_separately(cwd):
class TestNameCollecterPlugin:
def __init__(self):
self.collected = set()
def pytest_collection_modifyitems(self, items):
for item in items:
print('item.name: ', item.name)
test_name = item.name
start = test_name.find('[')
if start > 0:
test_name = test_name[:start]
self.collected.add(test_name)
import pytest
plugin = TestNameCollecterPlugin()
test_script_filename = os.path.join(cwd, "orttraining_test_orttrainer_frontend.py")
pytest.main(['--collect-only', test_script_filename], plugins=[plugin])
for test_name in plugin.collected:
run_subprocess([
sys.executable, '-m', 'pytest',
'orttraining_test_orttrainer_frontend.py', '-v', '-k', test_name], cwd=cwd)
def run_training_python_frontend_tests(cwd):
run_subprocess([sys.executable, 'onnxruntime_test_ort_trainer.py'], cwd=cwd)
run_subprocess([sys.executable, 'onnxruntime_test_training_unit_tests.py'], cwd=cwd)
run_subprocess([
sys.executable, 'orttraining_test_transformers.py',
'BertModelTest.test_for_pretraining_full_precision_list_input'], cwd=cwd)
run_subprocess([
sys.executable, 'orttraining_test_transformers.py',
'BertModelTest.test_for_pretraining_full_precision_dict_input'], cwd=cwd)
run_subprocess([
sys.executable, 'orttraining_test_transformers.py',
'BertModelTest.test_for_pretraining_full_precision_list_and_dict_input'], cwd=cwd)
# TODO: use run_orttraining_test_orttrainer_frontend_separately to work around a sporadic segfault.
# shall revert to run_subprocess call once the segfault issue is resolved.
run_orttraining_test_orttrainer_frontend_separately(cwd)
# run_subprocess([sys.executable, '-m', 'pytest', '-sv', 'orttraining_test_orttrainer_frontend.py'], cwd=cwd)
run_subprocess([sys.executable, '-m', 'pytest', '-sv', 'orttraining_test_orttrainer_bert_toy_onnx.py'], cwd=cwd)
run_subprocess([sys.executable, '-m', 'pytest', '-sv', 'orttraining_test_checkpoint_storage.py'], cwd=cwd)
run_subprocess([
sys.executable, '-m', 'pytest', '-sv', 'orttraining_test_orttrainer_checkpoint_functions.py'], cwd=cwd)
def run_training_python_frontend_e2e_tests(cwd):
# frontend tests are to be added here:
log.info("Running python frontend e2e tests.")
run_subprocess(
[sys.executable, 'orttraining_run_frontend_batch_size_test.py', '-v'],
cwd=cwd, env={'CUDA_VISIBLE_DEVICES': '0'})
import torch
ngpus = torch.cuda.device_count()
if ngpus > 1:
bert_pretrain_script = 'orttraining_run_bert_pretrain.py'
# TODO: this test will be replaced with convergence test ported from backend
log.debug('RUN: mpirun -n {} ''-x' 'NCCL_DEBUG=INFO'' {} {} {}'.format(
ngpus, sys.executable, bert_pretrain_script, 'ORTBertPretrainTest.test_pretrain_convergence'))
run_subprocess([
'mpirun', '-n', str(ngpus), '-x', 'NCCL_DEBUG=INFO', sys.executable,
bert_pretrain_script, 'ORTBertPretrainTest.test_pretrain_convergence'], cwd=cwd)
log.debug('RUN: mpirun -n {} {} orttraining_run_glue.py'.format(ngpus, sys.executable))
run_subprocess([
'mpirun', '-n', str(ngpus), '-x', 'NCCL_DEBUG=INFO', sys.executable, 'orttraining_run_glue.py'], cwd=cwd)
# with orttraining_run_glue.py.
# 1. we like to force to use single GPU (with CUDA_VISIBLE_DEVICES)
# for fine-tune tests.
# 2. need to run test separately (not to mix between fp16
# and full precision runs. this need to be investigated).
run_subprocess(
[sys.executable, 'orttraining_run_glue.py', 'ORTGlueTest.test_bert_with_mrpc', '-v'],
cwd=cwd, env={'CUDA_VISIBLE_DEVICES': '0'})
run_subprocess(
[sys.executable, 'orttraining_run_glue.py', 'ORTGlueTest.test_bert_fp16_with_mrpc', '-v'],
cwd=cwd, env={'CUDA_VISIBLE_DEVICES': '0'})
run_subprocess(
[sys.executable, 'orttraining_run_glue.py', 'ORTGlueTest.test_roberta_with_mrpc', '-v'],
cwd=cwd, env={'CUDA_VISIBLE_DEVICES': '0'})
run_subprocess(
[sys.executable, 'orttraining_run_glue.py', 'ORTGlueTest.test_roberta_fp16_with_mrpc', '-v'],
cwd=cwd, env={'CUDA_VISIBLE_DEVICES': '0'})
run_subprocess(
[sys.executable, 'orttraining_run_multiple_choice.py', 'ORTMultipleChoiceTest.test_bert_fp16_with_swag', '-v'],
cwd=cwd, env={'CUDA_VISIBLE_DEVICES': '0'})
run_subprocess([sys.executable, 'onnxruntime_test_ort_trainer_with_mixed_precision.py'], cwd=cwd)
run_subprocess([
sys.executable, 'orttraining_test_transformers.py',
'BertModelTest.test_for_pretraining_mixed_precision'], cwd=cwd)
def run_onnxruntime_tests(args, source_dir, ctest_path, build_dir, configs):
for config in configs:
log.info("Running tests for %s configuration", config)
cwd = get_config_build_dir(build_dir, config)
cwd = os.path.abspath(cwd)
if args.android:
run_android_tests(args, source_dir, config, cwd)
continue
elif args.ios:
run_ios_tests(args, source_dir, config, cwd)
continue
dll_path_list = []
if args.use_nuphar:
dll_path_list.append(os.path.join(
build_dir, config, "external", "tvm", config))
if args.use_tensorrt:
dll_path_list.append(os.path.join(args.tensorrt_home, 'lib'))
dll_path = None
if len(dll_path_list) > 0:
dll_path = os.pathsep.join(dll_path_list)
if ctest_path is None:
# Get the "Google Test Adapter" for vstest.
if not os.path.exists(os.path.join(cwd,
'googletestadapter.0.17.1')):
run_subprocess(
['nuget.exe', 'restore',
os.path.join(source_dir, 'packages.config'),
'-ConfigFile', os.path.join(source_dir, 'NuGet.config'),
'-PackagesDirectory', cwd])
cwd2 = os.path.join(cwd, config)
executables = ['onnxruntime_test_all.exe']
if args.build_shared_lib:
executables.append('onnxruntime_shared_lib_test.exe')
executables.append('onnxruntime_global_thread_pools_test.exe')
run_subprocess(
['vstest.console.exe', '--parallel',
'--TestAdapterPath:..\\googletestadapter.0.17.1\\build\\_common', # noqa
'/Logger:trx', '/Enablecodecoverage', '/Platform:x64',
"/Settings:%s" % os.path.join(
source_dir, 'cmake\\codeconv.runsettings')] + executables,
cwd=cwd2, dll_path=dll_path)
else:
ctest_cmd = [ctest_path, "--build-config", config, "--verbose", "--timeout", "3600"]
run_subprocess(ctest_cmd, cwd=cwd, dll_path=dll_path)
if args.enable_pybind:
# Disable python tests for TensorRT because many tests are
# not supported yet.
if args.use_tensorrt:
return
# Disable python tests in a reduced build as we don't know which ops have been included and which
# models can run
if args.include_ops_by_config or args.minimal_build != 'off':
return
if is_windows():
cwd = os.path.join(cwd, config)
run_subprocess([sys.executable, 'onnxruntime_test_python.py'], cwd=cwd, dll_path=dll_path)
if args.enable_symbolic_shape_infer_tests:
run_subprocess([sys.executable, 'onnxruntime_test_python_symbolic_shape_infer.py'],
cwd=cwd, dll_path=dll_path)
# For CUDA enabled builds test IOBinding feature
if args.use_cuda:
# We need to have Torch installed to test the IOBinding feature
# which currently uses Torch's allocator to allocate GPU memory for testing
log.info("Testing IOBinding feature")
run_subprocess([sys.executable, 'onnxruntime_test_python_iobinding.py'], cwd=cwd, dll_path=dll_path)
if not args.disable_ml_ops:
run_subprocess([sys.executable, 'onnxruntime_test_python_mlops.py'], cwd=cwd, dll_path=dll_path)
if args.enable_training and args.use_cuda:
# run basic frontend tests
run_training_python_frontend_tests(cwd=cwd)
try:
import onnx # noqa
onnx_test = True
except ImportError as error:
log.exception(error)
log.warning("onnx is not installed. The ONNX tests will be skipped.")
onnx_test = False
if onnx_test:
run_subprocess([sys.executable, 'onnxruntime_test_python_backend.py'], cwd=cwd, dll_path=dll_path)
run_subprocess([sys.executable, '-m', 'unittest', 'discover', '-s', 'quantization'],
cwd=cwd, dll_path=dll_path)
if not args.disable_ml_ops:
run_subprocess([sys.executable, 'onnxruntime_test_python_backend_mlops.py'],
cwd=cwd, dll_path=dll_path)
run_subprocess([sys.executable,
os.path.join(source_dir, 'onnxruntime', 'test', 'onnx', 'gen_test_models.py'),
'--output_dir', 'test_models'], cwd=cwd)
if not args.skip_onnx_tests:
run_subprocess([os.path.join(cwd, 'onnx_test_runner'), 'test_models'], cwd=cwd)
if config != 'Debug':
run_subprocess([sys.executable, 'onnx_backend_test_series.py'], cwd=cwd, dll_path=dll_path)
if not args.skip_keras_test:
try:
import onnxmltools # noqa
import keras # noqa
onnxml_test = True
except ImportError:
log.warning(
"onnxmltools and keras are not installed. "
"The keras tests will be skipped.")
onnxml_test = False
if onnxml_test:
run_subprocess(
[sys.executable, 'onnxruntime_test_python_keras.py'],
cwd=cwd, dll_path=dll_path)
def nuphar_run_python_tests(build_dir, configs):
"""nuphar temporary function for running python tests separately
as it requires ONNX 1.5.0
"""
for config in configs:
if config == 'Debug':
continue
cwd = get_config_build_dir(build_dir, config)
if is_windows():
cwd = os.path.join(cwd, config)
dll_path = os.path.join(build_dir, config, "external", "tvm", config)
# install onnx for shape inference in testing Nuphar scripts
# this needs to happen after onnx_test_data preparation which
# uses onnx 1.3.0
run_subprocess(
[sys.executable, '-m', 'pip', 'install', '--user', 'onnx==1.5.0'])
run_subprocess(
[sys.executable, 'onnxruntime_test_python_nuphar.py'],
cwd=cwd, dll_path=dll_path)
def run_nodejs_tests(nodejs_binding_dir):
args = ['npm', 'test', '--', '--timeout=2000']
if is_windows():
args = ['cmd', '/c'] + args
run_subprocess(args, cwd=nodejs_binding_dir)
def build_python_wheel(
source_dir, build_dir, configs, use_cuda, use_dnnl,
use_tensorrt, use_openvino, use_nuphar, use_vitisai, use_acl, use_armnn, use_dml,
wheel_name_suffix, enable_training, nightly_build=False, featurizers_build=False, use_ninja=False):
for config in configs:
cwd = get_config_build_dir(build_dir, config)
if is_windows() and not use_ninja:
cwd = os.path.join(cwd, config)
args = [sys.executable, os.path.join(source_dir, 'setup.py'),
'bdist_wheel']
# We explicitly override the platform tag in the name of the generated build wheel
# so that we can install the wheel on Mac OS X versions 10.12+.
# Without this explicit override, we will something like this while building on MacOS 10.14 -
# [WARNING] MACOSX_DEPLOYMENT_TARGET is set to a lower value (10.12)
# than the version on which the Python interpreter was compiled (10.14) and will be ignored.
# Since we need to support 10.12+, we explicitly override the platform tag.
# See PR #3626 for more details
if is_macOS():
args += ['-p', 'macosx_10_12_x86_64']
# Any combination of the following arguments can be applied
if nightly_build:
args.append('--nightly_build')
if featurizers_build:
args.append("--use_featurizers")
if wheel_name_suffix:
args.append('--wheel_name_suffix={}'.format(wheel_name_suffix))
if enable_training:
args.append("--enable_training")
# The following arguments are mutually exclusive
if use_tensorrt:
args.append('--use_tensorrt')
elif use_cuda:
args.append('--use_cuda')
elif use_openvino:
args.append('--use_openvino')
elif use_dnnl:
args.append('--use_dnnl')
elif use_nuphar:
args.append('--use_nuphar')
elif use_vitisai:
args.append('--use_vitisai')
elif use_acl:
args.append('--use_acl')
elif use_armnn:
args.append('--use_armnn')
elif use_dml:
args.append('--use_dml')
run_subprocess(args, cwd=cwd)
def derive_linux_build_property():
if is_windows():
return "/p:IsLinuxBuild=\"false\""
else:
return "/p:IsLinuxBuild=\"true\""
def build_nuget_package(source_dir, build_dir, configs, use_cuda, use_openvino, use_tensorrt, use_dnnl):
if not (is_windows() or is_linux()):
raise BuildError(
'Currently csharp builds and nuget package creation is only supportted '
'on Windows and Linux platforms.')
csharp_build_dir = os.path.join(source_dir, 'csharp')
is_linux_build = derive_linux_build_property()
# derive package name and execution provider based on the build args
execution_provider = "/p:ExecutionProvider=\"None\""
package_name = "/p:OrtPackageId=\"Microsoft.ML.OnnxRuntime\""
if use_openvino:
execution_provider = "/p:ExecutionProvider=\"openvino\""
package_name = "/p:OrtPackageId=\"Microsoft.ML.OnnxRuntime.OpenVino\""
elif use_tensorrt:
execution_provider = "/p:ExecutionProvider=\"tensorrt\""
package_name = "/p:OrtPackageId=\"Microsoft.ML.OnnxRuntime.TensorRT\""
elif use_dnnl:
execution_provider = "/p:ExecutionProvider=\"dnnl\""
package_name = "/p:OrtPackageId=\"Microsoft.ML.OnnxRuntime.DNNL\""
elif use_cuda:
package_name = "/p:OrtPackageId=\"Microsoft.ML.OnnxRuntime.Gpu\""
else:
pass
# set build directory based on build_dir arg
native_dir = os.path.normpath(os.path.join(source_dir, build_dir))
ort_build_dir = "/p:OnnxRuntimeBuildDirectory=\"" + native_dir + "\""
# dotnet restore
cmd_args = ["dotnet", "restore", "OnnxRuntime.CSharp.sln", "--configfile", "Nuget.CSharp.config"]
run_subprocess(cmd_args, cwd=csharp_build_dir)
# build csharp bindings and create nuget package for each config
for config in configs:
if is_linux():
native_build_dir = os.path.join(native_dir, config)
cmd_args = ["make", "install", "DESTDIR=.//nuget-staging"]
run_subprocess(cmd_args, cwd=native_build_dir)
configuration = "/p:Configuration=\"" + config + "\""
cmd_args = ["dotnet", "msbuild", "OnnxRuntime.CSharp.sln", configuration, package_name, is_linux_build,
ort_build_dir]
run_subprocess(cmd_args, cwd=csharp_build_dir)
cmd_args = [
"dotnet", "msbuild", "OnnxRuntime.CSharp.proj", "/t:CreatePackage",
package_name, configuration, execution_provider, is_linux_build, ort_build_dir]
run_subprocess(cmd_args, cwd=csharp_build_dir)
def run_csharp_tests(source_dir, build_dir, use_cuda, use_openvino, use_tensorrt, use_dnnl):
# Currently only running tests on windows.
if not is_windows():
return
csharp_source_dir = os.path.join(source_dir, 'csharp')
is_linux_build = derive_linux_build_property()
# define macros based on build args
macros = ""
if use_openvino:
macros += "USE_OPENVINO;"
if use_tensorrt:
macros += "USE_TENSORRT;"
if use_dnnl:
macros += "USE_DNNL;"
if use_cuda:
macros += "USE_CUDA;"
define_constants = ""
if macros != "":
define_constants = "/p:DefineConstants=\"" + macros + "\""
# set build directory based on build_dir arg
native_build_dir = os.path.normpath(os.path.join(source_dir, build_dir))
ort_build_dir = "/p:OnnxRuntimeBuildDirectory=\"" + native_build_dir + "\""
# Skip pretrained models test. Only run unit tests as part of the build
# add "--verbosity", "detailed" to this command if required
cmd_args = ["dotnet", "test", "test\\Microsoft.ML.OnnxRuntime.Tests\\Microsoft.ML.OnnxRuntime.Tests.csproj",
"--filter", "FullyQualifiedName!=Microsoft.ML.OnnxRuntime.Tests.InferenceTest.TestPreTrainedModels",
is_linux_build, define_constants, ort_build_dir]
run_subprocess(cmd_args, cwd=csharp_source_dir)
def is_cross_compiling_on_apple(args):
if not is_macOS():
return False
if args.ios:
return True
if args.osx_arch != platform.machine():
return True
return False
def build_protoc_for_host(cmake_path, source_dir, build_dir, args):
if (args.arm or args.arm64 or args.enable_windows_store) and \
not (is_windows() or is_cross_compiling_on_apple(args)):
raise BuildError(
'Currently only support building protoc for Windows host while '
'cross-compiling for ARM/ARM64/Store and linux cross-compiling iOS')
log.info(
"Building protoc for host to be used in cross-compiled build process")
protoc_build_dir = os.path.join(os.getcwd(), build_dir, 'host_protoc')
os.makedirs(protoc_build_dir, exist_ok=True)
# Generate step
cmd_args = [
cmake_path,
os.path.join(source_dir, 'cmake', 'external', 'protobuf', 'cmake'),
'-Dprotobuf_BUILD_TESTS=OFF',
'-Dprotobuf_WITH_ZLIB_DEFAULT=OFF',
'-Dprotobuf_BUILD_SHARED_LIBS=OFF'
]
is_ninja = args.cmake_generator == 'Ninja'
if args.cmake_generator is not None and not (is_macOS() and args.use_xcode):
cmd_args += ['-G', args.cmake_generator]
if is_windows():
if not is_ninja:
cmd_args += ['-T', 'host=x64']
elif is_macOS():
if args.use_xcode:
cmd_args += ['-G', 'Xcode']
# CMake < 3.18 has a bug setting system arch to arm64 (if not specified) for Xcode 12,
# protoc for host should be built using host architecture
# Explicitly specify the CMAKE_OSX_ARCHITECTURES for x86_64 Mac.
cmd_args += ["-DCMAKE_OSX_ARCHITECTURES={}".format(
'arm64' if platform.machine() == 'arm64' else 'x86_64')]
run_subprocess(cmd_args, cwd=protoc_build_dir)
# Build step
cmd_args = [cmake_path,
"--build", protoc_build_dir,
"--config", "Release",
"--target", "protoc"]
run_subprocess(cmd_args)
# Absolute protoc path is needed for cmake
config_dir = ''
suffix = ''
if (is_windows() and not is_ninja) or (is_macOS() and args.use_xcode):
config_dir = 'Release'
if is_windows():
suffix = '.exe'
expected_protoc_path = os.path.join(protoc_build_dir, config_dir, 'protoc' + suffix)
if not os.path.exists(expected_protoc_path):
raise BuildError("Couldn't find {}. Host build of protoc failed.".format(expected_protoc_path))
return expected_protoc_path
def generate_documentation(source_dir, build_dir, configs):
operator_doc_path = os.path.join(source_dir, 'docs', 'ContribOperators.md')
opkernel_doc_path = os.path.join(source_dir, 'docs', 'OperatorKernels.md')
for config in configs:
# Copy the gen_contrib_doc.py.
shutil.copy(
os.path.join(source_dir, 'tools', 'python', 'gen_contrib_doc.py'),
os.path.join(build_dir, config))
shutil.copy(
os.path.join(source_dir, 'tools', 'python', 'gen_opkernel_doc.py'),
os.path.join(build_dir, config))
run_subprocess(
[sys.executable,
'gen_contrib_doc.py',
'--output_path', operator_doc_path],
cwd=os.path.join(build_dir, config))
run_subprocess(
[sys.executable,
'gen_opkernel_doc.py',
'--output_path', opkernel_doc_path],
cwd=os.path.join(build_dir, config))
docdiff = ''
try:
docdiff = subprocess.check_output(['git', 'diff', opkernel_doc_path])
except subprocess.CalledProcessError:
print('git diff returned non-zero error code')
if len(docdiff) > 0:
# Show warning instead of throwing exception, because it is
# dependent on build configuration for including
# execution propviders
log.warning(
'The updated opkernel document file ' + str(opkernel_doc_path) +
' is different from the checked in version. Consider '
'regenerating the file with CPU, DNNL and CUDA providers enabled.')
log.debug('diff:\n' + str(docdiff))
docdiff = ''
try:
docdiff = subprocess.check_output(['git', 'diff', operator_doc_path])
except subprocess.CalledProcessError:
print('git diff returned non-zero error code')
if len(docdiff) > 0:
raise BuildError(
'The updated operator document file ' +
str(operator_doc_path) + ' must be checked in.\n diff:\n' +
str(docdiff))
def main():
args = parse_arguments()
cmake_extra_defines = (args.cmake_extra_defines
if args.cmake_extra_defines else [])
cross_compiling = args.arm or args.arm64 or args.android
# If there was no explicit argument saying what to do, default
# to update, build and test (for native builds).
if not (args.update or args.clean or args.build or args.test):
log.debug(
"Defaulting to running update, build "
"[and test for native builds].")
args.update = True
args.build = True
if cross_compiling:
args.test = args.android_abi == 'x86_64' or args.android_abi == 'arm64-v8a'
else:
args.test = True
if args.skip_tests:
args.test = False
if args.include_ops_by_config:
from exclude_unused_ops_and_types import exclude_unused_ops_and_types
exclude_unused_ops_and_types(args.include_ops_by_config,
args.enable_reduced_operator_type_support,
args.use_cuda)
if args.use_tensorrt:
args.use_cuda = True
if args.build_wheel or args.gen_doc:
args.enable_pybind = True
if args.build_csharp or args.build_nuget or args.build_java or args.build_nodejs:
args.build_shared_lib = True
if args.build_nuget and cross_compiling:
raise BuildError('Currently nuget package creation is not supported while cross-compiling')
if args.enable_pybind and args.disable_exceptions:
raise BuildError('Python bindings require exceptions to be enabled.')
if args.minimal_build and args.disable_ort_format_load:
raise BuildError('Minimal build requires loading ORT format models.')
if args.nnapi_min_api:
if not args.use_nnapi:
raise BuildError("Using --nnapi_min_api requires --use_nnapi")
if args.nnapi_min_api < 27:
raise BuildError("--nnapi_min_api should be 27+")
if args.code_coverage and not args.android:
raise BuildError("Using --code_coverage requires --android")
# Disabling unit tests for VAD-F as FPGA only supports
# models with NCHW layout
if args.use_openvino == "VAD-F_FP32":
args.test = False
configs = set(args.config)
# setup paths and directories
cmake_path = resolve_executable_path(args.cmake_path)
ctest_path = None if args.use_vstest else resolve_executable_path(
args.ctest_path)
build_dir = args.build_dir
script_dir = os.path.realpath(os.path.dirname(__file__))
source_dir = os.path.normpath(os.path.join(script_dir, "..", ".."))
# if using cuda, setup cuda paths and env vars
cuda_home, cudnn_home = setup_cuda_vars(args)
mpi_home = args.mpi_home
nccl_home = args.nccl_home
acl_home = args.acl_home
acl_libs = args.acl_libs
armnn_home = args.armnn_home
armnn_libs = args.armnn_libs
# if using tensorrt, setup tensorrt paths
tensorrt_home = setup_tensorrt_vars(args)
# if using migraphx, setup migraphx paths
migraphx_home = setup_migraphx_vars(args)
# if using rocm, setup rocm paths
rocm_home = setup_rocm_build(args, configs)
os.makedirs(build_dir, exist_ok=True)
log.info("Build started")
if args.update:
cmake_extra_args = []
path_to_protoc_exe = args.path_to_protoc_exe
if not args.skip_submodule_sync:
update_submodules(source_dir)
if is_windows():
if args.cmake_generator == 'Ninja':
if args.x86 or args.arm or args.arm64:
raise BuildError(
"To cross-compile with Ninja, load the toolset "
"environment for the target processor (e.g. Cross "
"Tools Command Prompt for VS)")
cmake_extra_args = ['-G', args.cmake_generator]
elif args.x86:
cmake_extra_args = [
'-A', 'Win32', '-T', 'host=x64', '-G', args.cmake_generator
]
elif args.arm or args.arm64:
# Cross-compiling for ARM(64) architecture
# First build protoc for host to use during cross-compilation
if path_to_protoc_exe is None:
path_to_protoc_exe = build_protoc_for_host(
cmake_path, source_dir, build_dir, args)
if args.arm:
cmake_extra_args = ['-A', 'ARM']
else:
cmake_extra_args = ['-A', 'ARM64']
cmake_extra_args += ['-G', args.cmake_generator]
# Cannot test on host build machine for cross-compiled
# builds (Override any user-defined behaviour for test if any)
if args.test:
log.warning(
"Cannot test on host build machine for cross-compiled "
"ARM(64) builds. Will skip test running after build.")
args.test = False
else:
if (args.msvc_toolset == '14.16' and
args.cmake_generator == 'Visual Studio 16 2019'):
# CUDA 10.0 requires _MSC_VER >= 1700 and
# _MSC_VER < 1920, aka Visual Studio version
# in [2012, 2019). In VS2019, we have to use
# Side-by-side minor version MSVC toolsets from
# Visual Studio 2017 14.16 is MSVC version
# 141 is MSVC Toolset Version
# Cuda VS extension should be installed to
# C:\Program Files (x86)\Microsoft Visual
# Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\BuildCustomizations # noqa
toolset = 'v141,host=x64,version=' + args.msvc_toolset
elif args.msvc_toolset:
toolset = 'host=x64,version=' + args.msvc_toolset
else:
toolset = 'host=x64'
if args.cuda_version:
toolset += ',cuda=' + args.cuda_version
cmake_extra_args = [
'-A', 'x64', '-T', toolset, '-G', args.cmake_generator
]
if args.enable_windows_store:
cmake_extra_args.append(
'-DCMAKE_TOOLCHAIN_FILE=' + os.path.join(
source_dir, 'cmake', 'store_toolchain.cmake'))
if args.enable_wcos:
cmake_extra_args.append('-DCMAKE_USER_MAKE_RULES_OVERRIDE=wcos_rules_override.cmake')
elif args.cmake_generator is not None and not (is_macOS() and args.use_xcode):
cmake_extra_args += ['-G', args.cmake_generator]
elif is_macOS():
if args.use_xcode:
cmake_extra_args += ['-G', 'Xcode']
if not args.ios and not args.android and \
args.osx_arch == 'arm64' and platform.machine() == 'x86_64':
if args.test:
log.warning(
"Cannot test ARM64 build on X86_64. Will skip test running after build.")
args.test = False
if (args.android or args.ios or args.enable_windows_store
or is_cross_compiling_on_apple(args)) and args.path_to_protoc_exe is None:
# Cross-compiling for Android and iOS
path_to_protoc_exe = build_protoc_for_host(
cmake_path, source_dir, build_dir, args)
if is_ubuntu_1604():
if (args.arm or args.arm64):
raise BuildError(
"Only Windows ARM(64) cross-compiled builds supported "
"currently through this script")
if not is_docker() and not args.use_acl and not args.use_armnn:
install_python_deps()
if args.enable_pybind and is_windows():
install_python_deps(args.numpy_version)
if args.enable_onnx_tests:
setup_test_data(build_dir, configs)
generate_build_tree(
cmake_path, source_dir, build_dir, cuda_home, cudnn_home, rocm_home, mpi_home, nccl_home,
tensorrt_home, migraphx_home, acl_home, acl_libs, armnn_home, armnn_libs,
path_to_protoc_exe, configs, cmake_extra_defines, args, cmake_extra_args)
if args.clean:
clean_targets(cmake_path, build_dir, configs)
# if using DML, perform initial nuget package restore
setup_dml_build(args, cmake_path, build_dir, configs)
if args.build:
if args.parallel < 0:
raise BuildError("Invalid parallel job count: {}".format(args.parallel))
num_parallel_jobs = os.cpu_count() if args.parallel == 0 else args.parallel
build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
if args.test:
run_onnxruntime_tests(args, source_dir, ctest_path, build_dir, configs)
# run nuphar python tests last, as it installs ONNX 1.5.0
if args.enable_pybind and not args.skip_onnx_tests and args.use_nuphar:
nuphar_run_python_tests(build_dir, configs)
# run node.js binding tests
if args.build_nodejs and not args.skip_nodejs_tests:
nodejs_binding_dir = os.path.normpath(os.path.join(source_dir, "nodejs"))
run_nodejs_tests(nodejs_binding_dir)
if args.build:
if args.build_wheel:
nightly_build = bool(os.getenv('NIGHTLY_BUILD') == '1')
build_python_wheel(
source_dir,
build_dir,
configs,
args.use_cuda,
args.use_dnnl,
args.use_tensorrt,
args.use_openvino,
args.use_nuphar,
args.use_vitisai,
args.use_acl,
args.use_armnn,
args.use_dml,
args.wheel_name_suffix,
args.enable_training,
nightly_build=nightly_build,
featurizers_build=args.use_featurizers,
use_ninja=(args.cmake_generator == 'Ninja')
)
if args.build_nuget:
build_nuget_package(
source_dir,
build_dir,
configs,
args.use_cuda,
args.use_openvino,
args.use_tensorrt,
args.use_dnnl
)
if args.test and args.build_nuget:
run_csharp_tests(
source_dir,
build_dir,
args.use_cuda,
args.use_openvino,
args.use_tensorrt,
args.use_dnnl)
if args.gen_doc and (args.build or args.test):
generate_documentation(source_dir, build_dir, configs)
log.info("Build complete")
if __name__ == "__main__":
try:
sys.exit(main())
except BaseError as e:
log.error(str(e))
sys.exit(1)