This is a temporary change to reduce intermittent tests failures. Jobs can be moved back once those machines get better runner isolation.
This also sneaks in a small fix to all the rocm job's build step to be run on Linux Foundation runners (the get-label-type dependency). The inductor-rocm-mi300 workflow already had it, but it was missing in the rocm-mi300 workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145790
Approved by: https://github.com/yangw-dev
Fixes the reason behind moving the tests to unstable initially. (https://github.com/pytorch/pytorch/pull/145790)
We ensure gpu isolation for each pod within kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the docker run in pytorch here.
Now we stick with the GPUs assigned to the pod in the first place and there is no overlap between the test runners.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145829
Approved by: https://github.com/jeffdaily
- Added another workflow to run the mi300 jobs post-merge.
- Updated rocm.yml to use mi200s instead of mi300s.
- Required to get an idea of how PRs are landing on our mi200s and mi300s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fix some stale hash updates https://github.com/pytorch/pytorch/pulls/pytorchupdatebot reported by @izaitsevfb
* XLA and ExecuTorch now wait for all jobs in pull instead of hardcoding the job names which are not correct anymore and the bot waits forever there
* Trion commit hash hasn't been updated automatically since 2023 and people have been updating the pin manually with their testings from time to time, so I doubt that it would be an useful thing to keep.
The vision update failures looks more complex though and I would need to take a closer look. So, I will keep it in another PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145314
Approved by: https://github.com/izaitsevfb
The benchmark is failing with the following error
```
File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 333, in <module>
main(output_file=args.output, only_model=args.only)
File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 308, in main
lst = func(device)
File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 66, in run_mlp_layer_norm_gelu
us_per_iter = benchmarker.benchmark(compiled_mod, (x,)) * 1000
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
return fn(self, *args, **kwargs)
TypeError: benchmark() missing 1 required positional argument: 'fn_kwargs'
```
An example error is https://github.com/pytorch/pytorch/actions/runs/12862761823/job/35858912555
I also assign `oncall: pt2` as the owner of this job going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145235
Approved by: https://github.com/nmacchioni
This is a 2nd regression caused by https://github.com/pytorch/pytorch/pull/144574
Test plan: `python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"`
Before it printed
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': ''}}
```
After
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': '#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.\n'}}
```
Fixes https://github.com/pytorch/pytorch/issues/144970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144987
Approved by: https://github.com/Skylion007, https://github.com/zou3519
This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214. After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time.
So, let's try @malfet suggestion by removing the prefix altogether to keep it simple. After this land, I will circle back on this to see if there is any improvements. Otherwise, it's still a simple BE change I guess.
Here is the query I'm using to gather build time data for reference:
```
with jobs as (
select
id,
name,
DATE_DIFF('minute', created_at, completed_at) as duration,
DATE_TRUNC('week', created_at) as bucket
from
workflow_job
where
name like '%/ build'
and html_url like concat('%', {repo: String }, '%')
and conclusion = 'success'
and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS)
),
aggregated_jobs_in_bucket as (
select
--groupArray(duration) as durations,
--quantiles(0.9)(duration),
avg(duration),
bucket
from
jobs
group by
bucket
)
select
*
from
aggregated_jobs_in_bucket
order by
bucket desc
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704
Approved by: https://github.com/clee2000
I.e. when `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to invoke wrap the code with `torch.mps.profiler.capture_metal` to generate gputrace for shaders invoked inside the context manager.
For example, code below:
```python
import torch
import os
def foo(x):
return x[:,::2].sin() + x[:, 1::2].cos()
if __name__ == "__main__":
os.environ["MTL_CAPTURE_ENABLED"] = "1"
x = torch.rand(32, 1024, device="mps")
with torch.mps.profiler.metal_capture("compiled_shader"):
torch.compile(foo)(x)
```
should capture the execution of a `torch.compile` generated shader
<img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561
Approved by: https://github.com/manuelcandales
ghstack dependencies: #144559, #144560
Periodically run testsuite for s390x
**Dependencies update**
Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected.
The reason for update is build on s390x. pypi doesn't provide binary build for z3-solver for versions 4.12.2.0 or 4.12.6.0 for s390x. Unfortunately, version 4.12.2.0 fails to build with newer gcc used on s390x builders, but those errors are fixed in version 4.12.6.0. Due to this minor version bump fixes build on s390x.
```
# pip3 install z3-solver==4.12.2.0
...
In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53:
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’:
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type
29 | #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \
| ^~~~~~~~~
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’
82 | m_curr_ptr = ALIGN(char *, new_curr_ptr);
| ^~~~~
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
56 | #include "util/page.h"
+++ |+#include <cstdint>
57 |
```
**Python paths update**
On AlmaLinux 8 s390x, old paths:
```
python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'
/usr/lib/python3.12/site-packages
```
Total result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages`
New paths:
```
python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))'
/usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch
```
```
# python -c 'import torch ; print(torch)'
<module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'>
```
`pip3 install dist/*.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, and later it's not found by cmake with old paths:
```
CMake Error at CMakeLists.txt:9 (find_package):
By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Torch", but
CMake did not find one.
```
https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401
**Builders availability**
Build took 60 minutes
Tests took: 150, 110, 65, 55, 115, 85, 50, 70, 105, 110 minutes (split into 10 shards)
60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double it. It would be 1950 minutes.
We have 20 machines * 24 hours = 20 * 24 * 60 = 20 * 1440 = 28800 minutes
We currently run 5 nightly binaries builds, each on average 90 minutes build, 15 minutes test, 5 minutes upload, 110 minutes total for each, 550 minutes total. Doubling would be 1100 minutes.
That leaves 28800 - 1100 = 27700 minutes total. Periodic tests would use will leave 25750 minutes.
Nightly binaries build + nightly tests = 3050 minutes.
25750 / 3050 = 8.44. So we could do both 8 more times for additional CI runs for any reason. And that is with pretty good safety margin.
**Skip test_tensorexpr**
On s390x, pytorch is built without llvm.
Even if it would be built with llvm, llvm currently doesn't support used features on s390x and test fails with errors like:
```
JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
unknown file: Failure
C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
```
**Disable cpp/static_runtime_test on s390x**
Quantization is not fully supported on s390x in pytorch yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Sometimes job is cancelled during nested docker container creation.
This leads to nested docker container not being stopped and worker hanging forever in the job.
Improve nested docker containers cleanup for these cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144149
Approved by: https://github.com/seemethere
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests, failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989
Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs))
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2
Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1
Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda
Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)
Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda
Features:
- inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows to enable Inductor FP8 tests on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>