### Description
* Build cuda nhwc ops by default.
* Deprecate `--enable_cuda_nhwc_ops` in build.py and add
`--disable_cuda_nhwc_ops` option
Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops
will be disabled automatically.
### Motivation and Context
In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with
Tensor Cores, and this could improve performance for vision models.
This is the first step to prefer NHWC for CUDA in 1.21 release. Next
step is to do some tests on popular vision models. If it help in most
models and devices, set `prefer_nhwc=1` as default cuda provider option.
### Description
Refactor the cmake code that is related to delay loading. Provide a
cmake option to control if delay loading should be enabled or not.
Disabling the option when python is enabled, due to a known issue.
### Motivation and Context
ONNX Runtime's python package depends on DirectML.dll, but supposedly
the DLL should be delay loaded.
This PR only refactor the code. It doesn't change the behavior.
1. Add python 3.13 to our python packaging pipelines
2. Because numpy 2.0.0 doesn't support thread free python, this PR also
upgrades numpy to the latest
3. Delete some unused files.
### Description
Similar to #20786 . The last PR was able to update all pipelines and all
docker files. This is a follow-up to that PR.
### Motivation and Context
1. To extract the common part as a reusable build infra among different
ONNX Runtime projects.
2. Avoid hitting docker hub's limit: 429 Too Many Requests - Server
message: toomanyrequests: You have reached your pull rate limit. You may
increase the limit by authenticating and upgrading:
https://www.docker.com/increase-rate-limit