Annotate linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`
After `convert_pt2e`, the pattern will be
```
x
|
linear <- to_fp32 <- to_fp16 <- w
```
**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141480
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Populate nn.module.stack in _fuse_conv_bn_qat for replacement nodes that correspond to a `get_attr` node in the original graph.
In new training ir , `get_attr` nodes don't have `nn_module_stack` in node meta anymore (because the get_attr nodes are de-duplicated, so one get_attr node can potential have users in different module stacks).
We populate it by checking if "conv_input" or "conv_weight" replacement node has nn_module_stack. If not, we copy it from the conv node.
Test Plan:
CI
```
buck run fbcode//caffe2/test:quantization_pt2e -- -r test_preserve_nn_module_stack
```
Differential Revision: D66393517
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141400
Approved by: https://github.com/angelayi
**About this PR**
This PR adds the following ops for `linear_dynamic_fp16` in onednn namespace. These ops are intended for PT2E quantization eager mode.
- `onednn::linear_prepack_fp16`: packs fp32 weight to an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and compute linear in fp32
- `onednn::linear_relu_dynamic_fp16`: similar as the former and apply relu on output.
**Test plan**
`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`
**Implementation**
These ops call oneDNN lib under the hood. It's worth noting that oneDNN does not support f32 * f16 -> f32 computation, so we have to convert fp16 weight to fp32 before computation. And weight is still in plain format after packing.
**Correctness and performance**
Correctness is guaranteed by UT.
Performance of the new ops may be better than the FBGEMM implementation when weight shape is small but worse when weight shape is large. It's because weight dtype conversion and computation are not fused.
For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different cores and shapes. When using 1 core per instance, the new implementation generally is faster for weight shape < 1024 * 1024. When using more cores, the threshold will increase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
This PR adds C shim for `QConvPointWisePT2E` and `QConvPointWiseBinaryPT2E` similar to https://github.com/pytorch/pytorch/pull/138439. Besides that, we aligned the implementation of `qconv_pointwise` with `qlinear_pointwise` in the following aspects:
1. The parameter order of `qconv_pointwise` and `qlinear_pointwise` are quite different, we aligned the schema of `qconv_pointwise` to have similar parameter order as `qlinear_pointwise` to make it more consistent.
2. We always converted `x_scale` and `x_zero_point` to Tensors, just like in the lowering of `qlinear_pointwise`. This avoids the need to create two separate C APIs (one for `double x_scale` and `int64_t x_zero_point`, and another for `Tensor` versions). Instead, we only need one API for `Tensor`-based `x_scale` and `x_zero_point`. If we later add dynamic quantization for qconv (which will use `Tensor` for `x_scale` and `x_zero_point`), we can reuse the code from this PR and don't need to change the C shim layer API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138540
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691, #138806
**MOTIVATION**
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
**CHANGES**
- Add support for HPU devices within the test_move_exported_model_bn using TEST_HPU flag
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137863
Approved by: https://github.com/jerryzh168
Summary:
as title
also change it in `prepare_pt2e()` docstring
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```
Differential Revision: D63345059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
Summary: Previously is_fbcode just checked whether the checkout was git or not. This is extremely error prone. Lets make it fool-proof.
Test Plan: unit tests
Reviewed By: masnesral
Differential Revision: D63545169
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.
Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.
Test Plan: Added a test for this case in `test_numeric_debugger`.
Reviewed By: jerryzh168
Differential Revision: D62898297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.
Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.
Test Plan: Added a test for this case in `test_numeric_debugger`.
Reviewed By: jerryzh168
Differential Revision: D62898297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
Summary:
Previously we only checked dtype and is_dynamic to decide if two quantization spec are equivalent
this may not work in some cases, e.g. when people use different qscheme or quant_min/quant_max
This PR added checks for other fields as well
Test Plan:
regression tests
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736
Approved by: https://github.com/sxu
Summary:
Fixes https://github.com/pytorch/pytorch/issues/134778
The previous D62304294 broke some executorch tests. It has already been reverted.
In this diff, `_collect_param_buffer_metadata()` is modified in a way that when a `call_function` node is encountered and its input nodes include `get_attr`. We skip the fields that have been collected previously and only collect rest of the fields. This prevents over-writing.
Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle
```
Differential Revision: D62514208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720
Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168
Summary:
Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training ir, since it's only for bn-getitem pattern, but the pattern doesn't exist in training ir.
Remove BLOCK_LIST since it's empty.
Now all internal unittests will use training ir.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
buck2 run 'fbcode//mode/dev-nosan' caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
```
Differential Revision: D62387987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729
Approved by: https://github.com/tugsbayasgalan
Summary: as title
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```
CI
Differential Revision: D62448302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```
Reviewed By: angelayi
Differential Revision: D62219974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi