[Qlinearsoftmax] contrib cpu (#12177)

* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contrib operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binarysize

* kernel hash update

* docs/ContribOperators.md
Cheng 2022-08-10 10:52:02 +08:00 committed by GitHub
parent 0c6037b5ab
commit 64e991a9fc
20 changed files with 1014 additions and 6 deletions

View file

@ -55,6 +55,7 @@ Do not modify directly.*
* <a href="#com.microsoft.QLinearMul">com.microsoft.QLinearMul</a>
* <a href="#com.microsoft.QLinearReduceMean">com.microsoft.QLinearReduceMean</a>
* <a href="#com.microsoft.QLinearSigmoid">com.microsoft.QLinearSigmoid</a>
* <a href="#com.microsoft.QLinearSoftmax">com.microsoft.QLinearSoftmax</a>
* <a href="#com.microsoft.QuantizeLinear">com.microsoft.QuantizeLinear</a>
* <a href="#com.microsoft.Range">com.microsoft.Range</a>
* <a href="#com.microsoft.ReduceSumInteger">com.microsoft.ReduceSumInteger</a>
@ -2771,7 +2772,7 @@ This version of the operator has been available since version 1 of the 'com.micr
### <a name="com.microsoft.QLinearSigmoid"></a><a name="com.microsoft.qlinearsigmoid">**com.microsoft.QLinearSigmoid**</a>
QLinearSigmoid takes quantized input data (Tensor), and quantize parameter for output, and produces one output data
(Tensor<T>) where the function `f(x) = quantize(Sigmoid(dequantize(x)))`, is applied to the data tensor elementwise.
where the function `Sigmoid(x) = 1 / (1 + exp(-x))`
@ -2809,6 +2810,58 @@ This version of the operator has been available since version 1 of the 'com.micr
</dl>
### <a name="com.microsoft.QLinearSoftmax"></a><a name="com.microsoft.qlinearsoftmax">**com.microsoft.QLinearSoftmax**</a>
QLinearSoftmax computes the normalized exponential values for the given input:
Softmax(input, axis) = Exp(input) / ReduceSum(Exp(input), axis=axis, keepdims=1)
The input does not need to explicitly be a 2D vector. For opset 13 and above, the "axis" attribute
indicates the dimension along which QLinearSoftmax is performed. For opset 12 and below, it
indicates the dimension at which the input is coerced into an NxD matrix.
The output tensor has the same shape as the input.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>axis</tt> : int</dt>
<dd>Apply softmax along dimension 'axis' (opset 13 and above), or across 'axis' and all following dimensions (opset 12 and below).</dd>
<dt><tt>opset</tt> : int (required)</dt>
<dd>Opset version of the corresponding Softmax operator.</dd>
</dl>
#### Inputs
<dl>
<dt><tt>X</tt> : T</dt>
<dd>The input tensor</dd>
<dt><tt>X_scale</tt> : tensor(float)</dt>
<dd>Scale of quantized input 'X'. It must be a scalar.</dd>
<dt><tt>x_zero_point</tt> (optional) : T</dt>
<dd>Zero point tensor for input 'X'. It must be a scalar.</dd>
<dt><tt>y_scale</tt> : tensor(float)</dt>
<dd>Scale of quantized output 'Y'. It must be a scalar.</dd>
<dt><tt>y_zero_point</tt> : T</dt>
<dd>Zero point tensor for output 'Y'. It must be a scalar.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>Y</tt> : T</dt>
<dd>Output data tensor. The output tensor has the same shape as the input.</dd>
</dl>
#### Type Constraints
<dl>
<dt><tt>T</tt> : tensor(uint8), tensor(int8)</dt>
<dd>Constrain input and output types to signed/unsigned int8 tensors.</dd>
</dl>
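As a rough illustration of the QLinearSoftmax semantics described above, the following Python sketch computes `quantize(Softmax(dequantize(x)))` on an N x D list and shows how the `opset` attribute changes which elements are normalized together. The helper names (`split_n_d`, `qlinear_softmax_reference`) are hypothetical and not part of ONNX Runtime, and rounding with Python's `round` is an assumption, not necessarily the kernel's exact behaviour.

```python
import math


def split_n_d(shape, axis, opset):
    """Return (N, D): softmax runs N times, each over a group of D elements."""
    rank = len(shape)
    axis = axis + rank if axis < 0 else axis
    total = math.prod(shape)
    if opset < 13:
        # opset 12 and below: coerce to 2D, normalizing over `axis` and all later dims
        d = math.prod(shape[axis:])
    else:
        # opset 13 and above: normalize over the single dimension `axis`
        d = shape[axis]
    return total // d, d


def qlinear_softmax_reference(x_q, x_scale, x_zp, y_scale, y_zp, qmin=0, qmax=255):
    """Float reference on an N x D list of quantized integers (illustrative only)."""
    result = []
    for row in x_q:
        # dequantize
        x_f = [(v - x_zp) * x_scale for v in row]
        # numerically stable softmax
        m = max(x_f)
        exps = [math.exp(v - m) for v in x_f]
        total = sum(exps)
        # quantize with saturation
        result.append([min(qmax, max(qmin, round(e / total / y_scale) + y_zp)) for e in exps])
    return result
```

For a tensor of shape `[2, 4, 5]` with `axis=-2`, `split_n_d` yields two groups of 20 elements for opset 12 but ten groups of 4 elements for opset 13, which is why the operator needs the opset of the original Softmax node.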
### <a name="com.microsoft.QuantizeLinear"></a><a name="com.microsoft.quantizelinear">**com.microsoft.QuantizeLinear**</a>
The linear quantization operator. It consumes a full precision data, a scale, a zero point to compute the low precision / quantized tensor.

View file

@ -430,6 +430,7 @@ Do not modify directly.*
|QLinearLeakyRelu|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearMul|*in* A:**T**<br> *in* A_scale:**tensor(float)**<br> *in* A_zero_point:**T**<br> *in* B:**T**<br> *in* B_scale:**tensor(float)**<br> *in* B_zero_point:**T**<br> *in* C_scale:**tensor(float)**<br> *in* C_zero_point:**T**<br> *out* C:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearSigmoid|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearSoftmax|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|Range|*in* start:**T**<br> *in* limit:**T**<br> *in* delta:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64)|
|SampleOp|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|

View file

@ -193,6 +193,9 @@ using BuildKernelCreateInfoFn = KernelCreateInfo (*)();
#define ONNX_CPU_OPERATOR_ML_KERNEL(name, ver, builder, ...) \
ONNX_OPERATOR_KERNEL_EX(name, kMLDomain, ver, kCpuExecutionProvider, builder, __VA_ARGS__)
#define ONNX_CPU_OPERATOR_MS_KERNEL(name, ver, builder, ...) \
ONNX_OPERATOR_KERNEL_EX(name, kMSDomain, ver, kCpuExecutionProvider, builder, __VA_ARGS__)
#define ONNX_OPERATOR_KERNEL_EX(name, domain, ver, provider, builder, ...) \
class ONNX_OPERATOR_KERNEL_CLASS_NAME(provider, domain, ver, name); \
template <> \

View file

@ -55,6 +55,7 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1,
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearLeakyRelu);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearSigmoid);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearSigmoid);
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearSoftmax);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearAdd);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearAdd);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearMul);
@ -151,6 +152,7 @@ Status RegisterQuantizationKernels(KernelRegistry& kernel_registry) {
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearLeakyRelu)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearSigmoid)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearSigmoid)>,
BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearSoftmax)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearAdd)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearAdd)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearMul)>,

View file

@ -0,0 +1,343 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
#include "contrib_ops/cpu/quantization/qlinear_softmax.h"
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <type_traits>
#include <utility>
#include "core/common/common.h"
#include "core/framework/tensorprotoutils.h"
#include "core/providers/common.h"
#include "core/providers/cpu/tensor/transpose.h"
#include "core/mlas/inc/mlas.h"
#include "core/platform/threadpool.h"
#include "gsl/gsl-lite.hpp"
namespace onnxruntime {
namespace contrib {
constexpr int OPSET13 = 13;
namespace {
void QlinearBuildLookupTableUint32(gsl::span<uint32_t> table,
const float x_scale,
size_t reduce_len, bool is_signed) {
const double qscale =
fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len), static_cast<double>(0x7fffff));
for (int32_t i = 0; i < 256; i++) {
double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
// we can't get the real max value of the input tensor here, so we just assume 255.
// in the function `QlinearSoftmaxCPU`,
// all numbers are shifted by (255 - max_value) if the max value is not 255
//
// if is_signed index = [0 1 2 ...... 126 127 -128 -127 ..... -3 -2 -1]
// else [0 1 2 3 4 ..... 255]
uint8_t index = static_cast<uint8_t>(is_signed ? i - 128 : i);
table[index] = static_cast<uint32_t>(lrint(scaled_exp_xi));
}
}
void BuildLookupTableIfFixed(const OpKernelInfo& info, std::vector<uint32_t>& fixed_lookup_table,
size_t reduce_len, bool is_signed) {
const Tensor* tensor_x_scale = nullptr;
bool get_x_scale = info.TryGetConstantInput(1, &tensor_x_scale);
ORT_ENFORCE(tensor_x_scale == nullptr || IsScalarOr1ElementVector(tensor_x_scale),
"QlinearBuildLookupTable : input X_scale must be a scalar or 1D tensor of size 1");
bool is_fixed_parameters = get_x_scale;
if (is_fixed_parameters) {
fixed_lookup_table.resize(256);
const float X_scale = *(tensor_x_scale->Data<float>());
QlinearBuildLookupTableUint32(fixed_lookup_table, X_scale, reduce_len, is_signed);
}
}
} // namespace
QLinearSoftmax::QLinearSoftmax(const OpKernelInfo& info)
: OpKernel(info) {
const auto& node = info.node();
auto input_defs = node.InputDefs();
auto input_type = input_defs[0]->TypeAsProto()->tensor_type().elem_type();
is_signed_ = (input_type == ONNX_NAMESPACE::TensorProto_DataType_INT8);
const auto* x_shape = input_defs[0]->Shape();
ORT_ENFORCE(x_shape != nullptr && x_shape->dim_size() > 0, "The input shape of QLinearSoftmax must be known and have at least one dimension");
int rank = x_shape->dim_size();
int64_t opset = -1;
Status status = info.GetAttr<int64_t>("opset", &opset);
ORT_ENFORCE(status.IsOK(), "The 'opset' attribute is required for QLinearSoftmax");
opset_ = gsl::narrow_cast<int>(opset);
int64_t axis = -1;
status = info.GetAttr<int64_t>("axis", &axis);
if (status.IsOK()) {
axis_ = gsl::narrow_cast<int>(axis);
} else {
// opset-12 and below, the default axis value is 1
// opset-13, the default axis value is -1
axis_ = opset_ < OPSET13 ? 1 : -1;
}
axis_ = static_cast<int>(HandleNegativeAxis(axis_, int64_t(rank)));
auto input_shape = utils::GetTensorShapeFromTensorShapeProto(*x_shape);
int64_t reduce_size = opset_ < OPSET13 ? input_shape.SizeFromDimension(axis_) : input_shape[axis_];
// reduce_size could be negative if input-shape has a dynamic axis
if (reduce_size > 0) {
BuildLookupTableIfFixed(info, fixed_lookup_table_, reduce_size, is_signed_);
}
}
// compute method of Softmax
Status QLinearSoftmax::Compute(OpKernelContext* ctx) const {
const auto* X = ctx->Input<Tensor>(0);
const auto& X_shape = X->Shape();
auto* Y = ctx->Output(0, X_shape);
// edge case. one or more dims with value of 0. nothing to do
if (X_shape.Size() == 0) {
return Status::OK();
}
concurrency::ThreadPool* thread_pool = ctx->GetOperatorThreadPool();
const size_t D = opset_ < OPSET13 ? X_shape.SizeFromDimension(axis_): X_shape[axis_];
uint32_t tmp_lookup_table[256];
gsl::span<const uint32_t> lookup_table = GetLookupTable(ctx, tmp_lookup_table, D);
if (opset_ < OPSET13) {
return ComputeInternal(ctx, *X, *Y, lookup_table, axis_, thread_pool);
} else {
return ComputeImplOpset13(ctx, *X, *Y, lookup_table, thread_pool);
}
}
template <typename T>
common::Status QlinearSoftmaxCPU(size_t N,
size_t D,
const T* x_data,
T* y_data,
const uint32_t* lookup_table,
uint32_t y_scale,
T yzp,
onnxruntime::concurrency::ThreadPool* thread_pool);
template <>
common::Status QlinearSoftmaxCPU<uint8_t>(size_t N,
size_t D,
const uint8_t* x_data,
uint8_t* y_data,
const uint32_t* lookup_table,
uint32_t y_scale,
uint8_t yzp,
onnxruntime::concurrency::ThreadPool* thread_pool) {
using onnxruntime::TensorOpCost;
using onnxruntime::concurrency::ThreadPool;
ThreadPool::TryParallelFor(
thread_pool, N,
// Read 3*D (max, sum, div), write D (div), computation = read
TensorOpCost{static_cast<double>(D * 3),
static_cast<double>(D),
static_cast<double>(D * 3)},
[x_data, y_data, D, y_scale, yzp, &lookup_table](std::ptrdiff_t first, std::ptrdiff_t last) {
const auto c_y_scale = y_scale;
const auto c_y_zp = yzp;
const uint8_t* x_t = x_data + first * D;
uint8_t* y_t = y_data + first * D;
for (; first < last; first++) {
// reduceMaxUint8
uint8_t xmax = *std::max_element(x_t, x_t + D);
// we want xmax to align with 255 for higher precision.
// the lookup table is built with X-255, so the adjustment here
// shifts all numbers within the lookup table.
// 1 2 3 4 5 ...........................254 255
// 1 3 5 ... 10
// after the shift --->
// 235 237 239 .. 255
const uint32_t* shifted_lookuptable = lookup_table + 255 - xmax;
size_t elements_n = D;
// reduceSumUint8ToUint32: need speedup
// vsum = \sum_i{e^x_i}
uint32_t vsum = 0;
const uint8_t* x_t_cur = x_t;
do {
const size_t vx = *x_t_cur++;
vsum += shifted_lookuptable[vx];
} while (--elements_n != 0);
if (vsum == 0) {
return;
}
elements_n = D;
x_t_cur = x_t;
// elementwise div, y_i=\frac{e^{x_i}}{vsum}
const uint32_t vrounding = (vsum >> 1);
do {
const size_t vx = *x_t_cur++;
const uint32_t vt = shifted_lookuptable[vx];
// simulate round function, and re-quant to uint8
const uint32_t vq = ((vt * c_y_scale) + vrounding) / vsum + c_y_zp;
const uint8_t vy = vq > 255 ? static_cast<uint8_t>(255) : static_cast<uint8_t>(vq);
*y_t++ = vy;
} while (--elements_n != 0);
x_t = x_t_cur;
}
});
return Status::OK();
}
template <>
common::Status QlinearSoftmaxCPU<int8_t>(size_t N,
size_t D,
const int8_t* x_data,
int8_t* y_data,
const uint32_t* lookup_table,
uint32_t y_scale,
int8_t yzp,
onnxruntime::concurrency::ThreadPool* thread_pool) {
using onnxruntime::TensorOpCost;
using onnxruntime::concurrency::ThreadPool;
ThreadPool::TryParallelFor(
thread_pool, N,
// Read 3*D (max, sum, div), write D (div), computation = read
TensorOpCost{static_cast<double>(D * 3),
static_cast<double>(D),
static_cast<double>(D * 3)},
[x_data, y_data, D, y_scale, yzp, &lookup_table](std::ptrdiff_t first, std::ptrdiff_t last) {
const auto c_y_scale = y_scale;
const auto c_y_zp = yzp;
const int8_t* x_t = x_data + first * D;
int8_t* y_t = y_data + first * D;
for (; first < last; first++) {
// reduceMaxInt8
int8_t xmax = *std::max_element(x_t, x_t + D);
const size_t adjustment = 127 - xmax;
const uint32_t* shifted_lookuptable = lookup_table;
size_t elements_n = D;
// reduceSumInt8ToUint32: need speedup
uint32_t vsum = 0;
const int8_t* x_t_cur = x_t;
do {
const size_t vx = uint8_t(adjustment + (*x_t_cur++));
vsum += shifted_lookuptable[vx];
} while (--elements_n != 0);
if (vsum == 0) {
return;
}
elements_n = D;
x_t_cur = x_t;
// elementwise div
const uint32_t vrounding = (vsum >> 1);
do {
const size_t vx = uint8_t(adjustment + (*x_t_cur++));
const uint32_t vt = shifted_lookuptable[vx];
// simulate round function, and re-quant to int8
const uint32_t vq = ((vt * c_y_scale) + vrounding) / vsum + c_y_zp;
// vq wraps around for negative zero points; reinterpret it as signed and saturate at the int8 maximum
const int8_t vy = static_cast<int32_t>(vq) > 127 ? static_cast<int8_t>(127) : static_cast<int8_t>(vq);
*y_t++ = vy;
} while (--elements_n != 0);
x_t = x_t_cur;
}
});
return Status::OK();
}
gsl::span<const uint32_t> QLinearSoftmax::GetLookupTable(OpKernelContext* context,
gsl::span<uint32_t> lookup_table_span,
size_t reduce_len) const {
gsl::span<const uint32_t> lookup_table = fixed_lookup_table_;
if (fixed_lookup_table_.size() == 0) {
lookup_table = lookup_table_span;
const float X_scale = *(context->Input<Tensor>(1)->Data<float>());
QlinearBuildLookupTableUint32(lookup_table_span, X_scale, reduce_len, is_signed_);
}
return lookup_table;
}
// opset-12 and below
Status QLinearSoftmax::ComputeInternal(OpKernelContext* context, const Tensor& input, Tensor& output,
gsl::span<const uint32_t> lookup_table, int axis,
concurrency::ThreadPool* thread_pool) const {
const auto* Y_scale_tensor = context->Input<Tensor>(3);
const auto* Y_zp_tensor = context->Input<Tensor>(4);
const auto Y_scale = gsl::narrow_cast<uint32_t>(1.0F / (*(Y_scale_tensor->Data<float>())));
const auto& X_shape = input.Shape();
const size_t N = X_shape.SizeToDimension(axis);
const size_t D = X_shape.SizeFromDimension(axis);
common::Status status;
if (is_signed_) {
using T = int8_t;
const T Y_zp = Y_zp_tensor ? *(Y_zp_tensor->Data<T>()) : 0;
status = QlinearSoftmaxCPU<T>(N, D, input.Data<T>(), output.MutableData<T>(),
lookup_table.data(), Y_scale, Y_zp, thread_pool);
} else {
using T = uint8_t;
const T Y_zp = Y_zp_tensor ? *(Y_zp_tensor->Data<T>()) : 0;
status = QlinearSoftmaxCPU<T>(N, D, input.Data<T>(), output.MutableData<T>(),
lookup_table.data(), Y_scale, Y_zp, thread_pool);
}
return status;
}
// opset-13 and above
Status QLinearSoftmax::ComputeImplOpset13(OpKernelContext* context,
const Tensor& input, Tensor& output,
gsl::span<const uint32_t> lookup_table,
concurrency::ThreadPool* thread_pool) const {
const auto& X_shape = input.Shape();
size_t rank = X_shape.NumDimensions();
bool is_transpose_required = (size_t(axis_) != (rank - 1));
Tensor transposed_input;
Tensor intermediate_output; // output that the softmax implementation will write into while using transposed input
std::vector<size_t> permutation(rank);
if (is_transpose_required) {
AllocatorPtr alloc;
ORT_RETURN_IF_ERROR(context->GetTempSpaceAllocator(&alloc));
std::iota(std::begin(permutation), std::end(permutation), 0);
// swap the innermost dim with the dim corresponding to axis
permutation[axis_] = rank - 1;
permutation[rank - 1] = axis_;
std::vector<int64_t> transposed_input_dims(rank);
std::transform(permutation.cbegin(), permutation.cend(),
transposed_input_dims.begin(), [&X_shape](size_t e) { return X_shape[e]; });
// Allocate a temporary tensor to hold transposed input
transposed_input = Tensor(input.DataType(), TensorShape(transposed_input_dims), alloc);
// Perform the transpose
ORT_RETURN_IF_ERROR(TransposeBase::DoTranspose(permutation, input, transposed_input));
// Allocate memory for the intermediate output
intermediate_output = Tensor(output.DataType(), TensorShape(transposed_input_dims), alloc);
}
common::Status status;
const auto& input_tensor = is_transpose_required ? transposed_input : input;
auto& output_tensor = is_transpose_required ? intermediate_output : output;
ORT_RETURN_IF_ERROR(ComputeInternal(context, input_tensor, output_tensor, lookup_table, int(rank - 1), thread_pool));
if (is_transpose_required) {
// Perform the transpose to get the axes back to the original ordering
status = (TransposeBase::DoTranspose(permutation, intermediate_output, output));
}
return status;
}
ONNX_CPU_OPERATOR_MS_KERNEL(
QLinearSoftmax,
1,
KernelDefBuilder().TypeConstraint(
"T",
{DataTypeImpl::GetTensorType<uint8_t>(),
DataTypeImpl::GetTensorType<int8_t>()}),
QLinearSoftmax)
} // namespace contrib
} // namespace onnxruntime
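The integer arithmetic of the uint8 kernel above can be sketched in a few lines of Python. This is only an approximation of the idea (a 256-entry table of scaled exponentials, shifted so that the row maximum reads the last entry, followed by a rounded integer division); the function names are hypothetical, `inv_y_scale` stands for the truncated `1 / y_scale`, and Python's `round` is assumed to play the role of `lrint` under round-to-nearest.

```python
import math


def build_lookup_table(x_scale, reduce_len):
    """table[i] is roughly qscale * exp((i - 255) * x_scale) for i in 0..255."""
    # keep the sum of reduce_len entries inside 32 bits, and each entry inside 23 bits
    qscale = min(0xFFFFFFFF / reduce_len, float(0x7FFFFF))
    return [round(qscale * math.exp((i - 255) * x_scale)) for i in range(256)]


def lut_softmax_row_uint8(row, table, inv_y_scale, y_zp):
    """Softmax over one row of uint8 values using only integer arithmetic."""
    shift = 255 - max(row)  # the largest element then reads table[255]
    vsum = sum(table[v + shift] for v in row)
    rounding = vsum >> 1  # adding half the divisor rounds the division
    out = []
    for v in row:
        vt = table[v + shift]
        vq = (vt * inv_y_scale + rounding) // vsum + y_zp
        out.append(min(vq, 255))  # saturate to the uint8 range
    return out


table = build_lookup_table(x_scale=0.166099221, reduce_len=20)
print(lut_softmax_row_uint8([0, 0], table, inv_y_scale=256, y_zp=0))
```

Shifting by `255 - max(row)` corresponds to subtracting the row maximum before exponentiation in a floating-point softmax, so the largest term always uses the full precision of the table.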

View file

@ -0,0 +1,34 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
#pragma once
#include <vector>
#include "core/framework/op_kernel.h"
namespace onnxruntime {
namespace contrib {
class QLinearSoftmax final : public OpKernel {
public:
QLinearSoftmax(const OpKernelInfo& info);
Status Compute(OpKernelContext* context) const override;
private:
gsl::span<const uint32_t> GetLookupTable(OpKernelContext* context, gsl::span<uint32_t> lookup_table_span, size_t reduce_len) const;
Status ComputeInternal(OpKernelContext* context, const Tensor& input, Tensor& output, gsl::span<const uint32_t> lookup_table, int axis, concurrency::ThreadPool* thread_pool) const;
Status ComputeImplOpset13(OpKernelContext* context, const Tensor& input, Tensor& output,
gsl::span<const uint32_t> lookup_table, concurrency::ThreadPool* thread_pool) const;
private:
std::vector<uint32_t> fixed_lookup_table_;
int axis_ = -1;
int opset_ = 1;
bool is_signed_{false};
};
} // namespace contrib
} // namespace onnxruntime
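`ComputeImplOpset13`, declared above, handles a softmax axis that is not the innermost one by swapping that axis with the last dimension, running the row-wise kernel, and applying the same swap again to restore the layout. A minimal Python sketch of that swap on a row-major flat list follows; `swap_axis_with_last` is a hypothetical helper written for illustration and does not mirror `TransposeBase::DoTranspose` in any detail.

```python
from itertools import product


def swap_axis_with_last(data, shape, axis):
    """Transpose a row-major flat list so that `axis` and the last axis trade places."""
    rank = len(shape)
    perm = list(range(rank))
    perm[axis], perm[rank - 1] = perm[rank - 1], perm[axis]
    new_shape = [shape[p] for p in perm]
    # row-major strides of the source tensor
    strides = [1] * rank
    for i in range(rank - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    out = []
    for idx in product(*(range(n) for n in new_shape)):
        # output axis k addresses source axis perm[k]
        src = sum(idx[k] * strides[perm[k]] for k in range(rank))
        out.append(data[src])
    return out, new_shape


flat = list(range(6))  # a 2 x 3 tensor
swapped, swapped_shape = swap_axis_with_last(flat, [2, 3], axis=0)
restored, restored_shape = swap_axis_with_last(swapped, swapped_shape, axis=0)
```

Because exchanging two axes is its own inverse, the same permutation can be reused for the second transpose, which is what the kernel does with `permutation`.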

View file

@ -29,6 +29,7 @@ class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearLeakyRelu);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearMul);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearReduceMean);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSigmoid);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSoftmax);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QuantizeLinear);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, ReduceSumInteger);
@ -98,6 +99,7 @@ class OpSet_Microsoft_ver1 {
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearMul)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearReduceMean)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSigmoid)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSoftmax)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QuantizeLinear)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, ReduceSumInteger)>());

View file

@ -559,7 +559,7 @@ and produces one output data (Tensor<T>) where the function `f(x) = quantize(alp
.TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
const char* QLinearSigmoidDoc_ver1 = R"DOC(
QLinearSigmoid takes quantized input data (Tensor), and quantize parameter for output, and produces one output data
(Tensor<T>) where the function `f(x) = quantize(Sigmoid(dequantize(x)))`, is applied to the data tensor elementwise.
where the function `Sigmoid(x) = 1 / (1 + exp(-x))` )DOC";
@ -585,6 +585,62 @@ Wwhere the function `Sigmoid(x) = 1 / (1 + exp(-x))` )DOC";
"Constrain input and output types to 8 bit tensors.")
.TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
ONNX_MS_OPERATOR_SET_SCHEMA(QLinearSoftmax, 1, OpSchema().SetDoc(R"DOC(
QLinearSoftmax computes the normalized exponential values for the given input:
Softmax(input, axis) = Exp(input) / ReduceSum(Exp(input), axis=axis, keepdims=1)
The input does not need to explicitly be a 2D vector. For opset 13 and above, the "axis" attribute
indicates the dimension along which QLinearSoftmax is performed. For opset 12 and below, it
indicates the dimension at which the input is coerced into an NxD matrix.
The output tensor has the same shape as the input.
)DOC")
.Attr("axis", "Apply softmax along dimension 'axis' (opset 13 and above), "
"or across 'axis' and all following dimensions (opset 12 and below).", AttributeProto::INT, static_cast<int64_t>(-1))
.Attr("opset", "Opset version of the corresponding Softmax operator.", AttributeProto::INT)
.Input(0, "X", "The input tensor", "T")
.Input(1, "X_scale", "Scale of quantized input 'X'. It must be a scalar.", "tensor(float)")
.Input(2, "x_zero_point",
"Zero point tensor for input 'X'. "
"It must be a scalar.",
"T", OpSchema::Optional)
.Input(3, "y_scale", "Scale of quantized output 'Y'. It must be a scalar.", "tensor(float)")
.Input(4, "y_zero_point",
"Zero point tensor for output 'Y'. "
"It must be a scalar.",
"T")
.Output(0, "Y",
"Output data tensor. "
"The output tensor has the same shape as the input.",
"T")
.TypeConstraint("T", {"tensor(uint8)", "tensor(int8)"},
"Constrain input and output types to signed/unsigned int8 tensors.")
.TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
// Type inference
propagateElemTypeFromInputToOutput(ctx, 0, 0);
// Shape inference starts
if (!hasNInputShapes(ctx, 1)) {
return;
}
// Validate the value of 'axis'
const ONNX_NAMESPACE::TensorShapeProto& input_shape =
ctx.getInputType(0)->tensor_type().shape();
int r = input_shape.dim_size();
int axis = static_cast<int>(getAttribute(ctx, "axis", -1));
if (axis < -r || axis >= r) {
fail_shape_inference(
"'axis' must be in [",
-r,
" , ",
(r - 1),
"]. Its actual value is: ",
axis);
}
// Shape inference
propagateShapeFromInputToOutput(ctx, 0, 0);
}));
ONNX_MS_OPERATOR_SET_SCHEMA(DynamicQuantizeLSTM, 1, OpSchema()
.Attr(
"direction",

View file

@ -4,7 +4,7 @@
#include "core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h"
#include "core/optimizer/qdq_transformer/qdq_util.h"
#include "core/graph/node_attr_utils.h"
namespace onnxruntime {
namespace QDQ {
@ -195,6 +195,15 @@ UnaryReplaceWithQLinear::UnaryReplaceWithQLinear(std::string domain)
: ReplaceWithQLinear(std::move(domain), UnaryMoves()) {
}
NodeAttributes UnaryReplaceWithQLinear::ExtraAttributes(const RuntimeState& state) const {
const auto& target = state.selected_nodes.Target();
NodeAttributes attr;
if (target.OpType() == "Softmax") {
attr["opset"] = utils::MakeAttribute(std::string("opset"), int64_t(target.SinceVersion()));
}
return attr;
}
BinaryReplaceWithQLinear::BinaryReplaceWithQLinear(std::string domain)
: ReplaceWithQLinear(std::move(domain), BinaryMoves()) {
}

View file

@ -43,6 +43,9 @@ struct ReplaceWithQLinear : public QDQReplaceWithNew {
struct UnaryReplaceWithQLinear : ReplaceWithQLinear {
UnaryReplaceWithQLinear(std::string domain);
private:
NodeAttributes ExtraAttributes(const RuntimeState& state) const override;
};
struct BinaryReplaceWithQLinear : ReplaceWithQLinear {

View file

@ -82,7 +82,8 @@ void UnaryOpQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
{{"AveragePool", {}},
{"LeakyRelu", {}},
{"GlobalAveragePool", {}},
{"Sigmoid", {}},
{"Softmax", {}}},
std::move(selector),
std::move(action));
#else

View file

@ -132,6 +132,8 @@ class ONNXQuantizer:
        # some outputs from nodes will be quantized, yet they should be treated as existing so
        # no dequantization will be applied when they are needed later
        self.generated_value_names = self.model.get_non_initializer_inputs()
        # stores a specified scale and zero point instead of the calculated values, tensor_name->(scale, zeropoint)
        self.used_scale_zp_map = {}
    # routines for subgraph support
    def quantize_subgraph(self, subgraph, graph_key):
@ -625,6 +627,18 @@ class ONNXQuantizer:
        self.quantized_value_map[input_name] = QuantizedValue(input_name, output_name, scale_name, zp_name, qType)
        return nodes + [qlinear_node]
    def set_quant_scale_zp(self, tensor_name, value):
        assert isinstance(value, tuple) and len(value) == 2, "value must be a (scale, zero_point) tuple"
        assert tensor_name not in self.used_scale_zp_map, f"{tensor_name} has already been set"
        self.used_scale_zp_map[tensor_name] = value
    def find_quant_scale_zp(self, input_name):
        if input_name in self.used_scale_zp_map:
            return self.used_scale_zp_map[input_name]
        if self.parent is not None:
            # ask the parent quantizer with the same method so the caller always gets a (scale, zero_point) tuple
            return self.parent.find_quant_scale_zp(input_name)
        return (None, None)
    def find_quantized_value(self, input_name):
        if input_name in self.quantized_value_map:
            return self.quantized_value_map[input_name]

View file

@ -0,0 +1,86 @@
import onnx
from ..quant_utils import QuantizedValue, QuantizedValueType, attribute_to_kwarg, ms_domain
from .base_operator import QuantOperatorBase
from .qdq_base_operator import QDQOperatorBase
class QLinearSoftmax(QuantOperatorBase):
    def quantize(self):
        node = self.node
        # fix the softmax output scale and zero point, because the output of softmax is always within 0-1
        if self.quantizer.input_qType == onnx.onnx_pb.TensorProto.UINT8:
            out_scale = 1 / 256.0
            out_zero_point = 0
        else:
            out_scale = 1 / 256.0
            out_zero_point = -128
        # only try to quantize when given quantization parameters for it
        (
            data_found,
            output_scale_name,
            output_zp_name,
            _,
            _,
        ) = self.quantizer._get_quantization_params(node.output[0], out_scale, out_zero_point)
        # get quantized input tensor names, quantize input if needed
        (
            quantized_input_names,
            input_zero_point_names,
            input_scale_names,
            nodes,
        ) = self.quantizer.quantize_inputs(node, [0])
        if not data_found or quantized_input_names is None:
            return super().quantize()
        # Create an entry for output quantized value.
        qlinear_output_name = node.output[0] + "_quantized"
        quantized_output_value = QuantizedValue(
            node.output[0],
            qlinear_output_name,
            output_scale_name,
            output_zp_name,
            QuantizedValueType.Input,
        )
        self.quantizer.quantized_value_map[node.output[0]] = quantized_output_value
        # Create qlinear softmax node for given type
        kwargs = {}
        for attribute in node.attribute:
            kwargs.update(attribute_to_kwarg(attribute))
        kwargs["domain"] = ms_domain
        # give QLinearSoftmax the real opset version; otherwise its SinceVersion would default to 1
        kwargs["opset"] = self.quantizer.opset_version
        qlinear_node_name = node.name + "_quant" if node.name != "" else ""
        qnode = onnx.helper.make_node(
            "QLinear" + node.op_type,
            [
                quantized_input_names[0],
                input_scale_names[0],
                input_zero_point_names[0],
                output_scale_name,
                output_zp_name,
            ],
            [qlinear_output_name],
            qlinear_node_name,
            **kwargs,
        )
        # add all newly created nodes
        nodes.append(qnode)
        self.quantizer.new_nodes += nodes
        return None
class QDQSoftmax(QDQOperatorBase):
    def quantize(self):
        super().quantize()
        if self.quantizer.input_qType == onnx.onnx_pb.TensorProto.UINT8:
            out_scale = 1 / 256.0
            out_zero_point = 0
        else:
            out_scale = 1 / 256.0
            out_zero_point = -128
        self.quantizer.set_quant_scale_zp(self.node.output[0], (out_scale, out_zero_point))
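The fixed output parameters used by both classes above (scale `1 / 256.0`, zero point `0` for uint8 or `-128` for int8) can be checked with a small, self-contained sketch. The helpers `quantize`, `dequantize` and `representable_range` are hypothetical and written only to illustrate the arithmetic; they are not part of the quantization tool.

```python
SCALE = 1 / 256.0


def quantize(p, zero_point, qmin, qmax):
    """Map a probability p to an 8-bit integer, saturating at the type limits."""
    return min(qmax, max(qmin, round(p / SCALE) + zero_point))


def dequantize(q, zero_point):
    """Map an 8-bit integer back to a real value."""
    return (q - zero_point) * SCALE


def representable_range(zero_point, qmin, qmax):
    """Smallest and largest real values the chosen parameters can represent."""
    return dequantize(qmin, zero_point), dequantize(qmax, zero_point)


uint8_range = representable_range(0, 0, 255)        # (0.0, 0.99609375)
int8_range = representable_range(-128, -128, 127)   # (0.0, 0.99609375)
```

Both variants cover the interval from 0 to 255/256 in steps of 1/256, which matches the softmax output range; a probability of exactly 1.0 saturates to the largest code (255 or 127).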

View file

@ -340,7 +340,10 @@ class QDQQuantizer(ONNXQuantizer):
if initializer:
    self._add_qdq_pair_for_weight(initializer, tensor_info.axis)
else:
    used_scale, used_zp = self.find_quant_scale_zp(tensor_name)
    data_found, scale_name, zp_name, _, _ = self._get_quantization_params(
        tensor_name, used_scale, used_zp
    )
    if not data_found:
        raise ValueError(

View file

@ -17,6 +17,7 @@ from .operators.pad import QPad
from .operators.pooling import QLinearPool
from .operators.qdq_base_operator import QDQOperatorBase
from .operators.resize import QDQResize, QResize
from .operators.softmax import QDQSoftmax, QLinearSoftmax
from .operators.split import QDQSplit, QSplit
from .quant_utils import QuantizationMode
@ -55,6 +56,7 @@ QLinearOpsRegistry = {
"Resize": QResize,
"AveragePool": QLinearPool,
"Concat": QLinearConcat,
"Softmax": QLinearSoftmax,
}
QLinearOpsRegistry.update(CommonOpsRegistry)
@ -73,6 +75,7 @@ QDQRegistry = {
"MatMul": QDQMatMul,
"Split": QDQSplit,
"Gather": QDQGather,
"Softmax": QDQSoftmax,
}

View file

@ -115,5 +115,153 @@ TEST(QLinearLookupTableBasedOperatorTests, QLinearSigmoid_UInt8_0_Y_ZP) {
run_test(true);
}
/*
\brief data is generated by pytorch script
\details model defines
```
input(int8/uint8)
x = self.dequant(x)
x = self.softmax(x)
x = self.quant2(x)
output(int8/uint8)
```
\see the PyTorch quantization [DOC](https://pytorch.org/docs/stable/quantization.html)
*/
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_UInt8_v12) {
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
test.AddAttribute<int64_t>("axis", -2);
test.AddAttribute<int64_t>("opset", 12);
float X_scale = 0.166099221f;
//
uint8_t X_zero_point = 128;
float Y_scale = 1.0f / 256.0f;
uint8_t Y_zero_point = 0;
//
std::vector<int64_t> dims = {2, 4, 5};
auto x_in = std::vector<uint8_t>{50, 67, 58, 68, 46, 69, 77, 91, 62, 74, 67, 72, 71, 70, 83, 88, 75, 54, 74, 88};
auto y_out = std::vector<uint8_t> { 0, 2, 0, 2, 0, 2, 8, 86, 1, 5, 2, 4, 3, 3, 23, 52, 6, 0, 5, 52 };
for (int64_t i = 1; i < dims[0]; i++) {
for (int64_t j = 0; j < dims[1] * dims[2]; j++) {
x_in.push_back(x_in[j]);
y_out.push_back(y_out[j]);
}
}
test.AddInput<uint8_t>("X", dims, x_in);
test.AddInput<float>("X_scale", {}, {X_scale});
test.AddInput<uint8_t>("X_zero_point", {}, {X_zero_point});
test.AddInput<float>("Y_scale", {}, {Y_scale});
test.AddInput<uint8_t>("Y_zero_point", {}, {Y_zero_point});
test.AddOutput<uint8_t>("Y", dims, y_out);
auto origin_round_mode = std::fegetround();
std::fesetround(FE_TONEAREST);
test.Run();
std::fesetround(origin_round_mode);
}
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_UInt8_v13) {
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
test.AddAttribute<int64_t>("axis", -2);
test.AddAttribute<int64_t>("opset", 13);
float X_scale = 0.0304f;
//
uint8_t X_zero_point = 128;
float Y_scale = 0.0059f;
uint8_t Y_zero_point = 0;
//
std::vector<int64_t> dims = {4, 4, 4};
auto x_in = std::vector<uint8_t> {
62, 50, 71, 37, 68, 88, 64, 51, 59, 95, 41, 54, 55, 20, 77, 32, 92,
63, 43, 13, 76, 82, 53, 43, 60, 18, 73, 74, 22, 89, 44, 106, 17,
95, 27, 35, 47, 57, 0, 78, 97, 66, 56, 28, 127, 33, 106, 71, 119,
64, 16, 0, 16, 79, 27, 89, 110, 126, 88, 90, 67, 11, 4, 90};
auto y_out = std::vector<uint8_t> {
43, 20, 50, 33, 52, 63, 40, 51, 39, 78,
20, 56, 35, 8, 59, 29, 80, 32, 29, 6, 49, 57, 39, 16, 30, 8, 72, 40,
10, 71, 30, 107, 4, 90, 11, 20, 10, 28, 5, 74, 45, 37, 27, 16, 111, 14,
125, 59, 84, 18, 14, 4, 4, 28, 20, 54, 64, 119, 126, 56, 17, 4, 10, 56};
test.AddInput<uint8_t>("X", dims, x_in);
test.AddInput<float>("X_scale", {}, {X_scale});
test.AddInput<uint8_t>("X_zero_point", {}, {X_zero_point});
test.AddInput<float>("Y_scale", {}, {Y_scale});
test.AddInput<uint8_t>("Y_zero_point", {}, {Y_zero_point});
test.AddOutput<uint8_t>("Y", dims, y_out);
auto origin_round_mode = std::fegetround();
std::fesetround(FE_TONEAREST);
test.Run();
std::fesetround(origin_round_mode);
}
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_Int8_v13) {
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
test.AddAttribute<int64_t>("axis", -2);
test.AddAttribute<int64_t>("opset", 13);
float X_scale = 0.0304F;
//
int8_t X_zero_point = 0;
float Y_scale = 0.0059F;
int8_t Y_zero_point = -128;
//
std::vector<int64_t> dims = {4, 4, 4};
auto x_in = std::vector<int8_t> {
-4, -16, 5, -29, 2, 22, -2, -15, -7, 29, -25, -12, -11, -46, 11, -34, 26,
-3, -23, -53, 10, 16, -13, -23, -6, -48, 7, 8, -44, 23, -22, 40, -49, 29, -39, -31, -19, -9,
-72, 12, 31, 0, -10, -38, 61, -33, 40, 5, 53, -2, -50, -66, -50, 13, -39, 23, 44, 60, 22, 24,
1, -55, -62, 24};
auto y_out = std::vector<int8_t> {
-85, -108, -78, -95, -76, -65, -88, -77, -89, -50, -108, -72, -93,
-120, -69, -99, -48, -96, -99, -122, -79, -71, -89, -112, -98, -120, -56, -88, -118, -57, -98,
-21, -124, -38, -117, -108, -118, -100, -124, -54, -83, -91, -100, -112, -17, -114, -2, -69, -44,
-110, -114, -124, -124, -100, -108, -74, -64, -9, -2, -72, -111, -124, -118, -72};
test.AddInput<int8_t>("X", dims, x_in);
test.AddInput<float>("X_scale", {}, {X_scale});
test.AddInput<int8_t>("X_zero_point", {}, {X_zero_point});
test.AddInput<float>("Y_scale", {}, {Y_scale});
test.AddInput<int8_t>("Y_zero_point", {}, {Y_zero_point});
test.AddOutput<int8_t>("Y", dims, y_out);
auto origin_round_mode = std::fegetround();
std::fesetround(FE_TONEAREST);
test.Run();
std::fesetround(origin_round_mode);
}
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_Int8_v12) {
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
test.AddAttribute<int64_t>("axis", -2);
test.AddAttribute<int64_t>("opset", 12);
float X_scale = 0.166099221f;
//
int8_t X_zero_point = 0;
float Y_scale = 1.0f / 128.0f;
int8_t Y_zero_point = 0;
//
std::vector<int64_t> dims = {2, 4, 5};
auto x_in = std::vector<int8_t>{-28, -4, -4, -7, 3, -26, 4, -16, 23, 14, -7, 26, -8, 19, -16, -13, 7, 17, 27, 5};
auto y_out = std::vector<int8_t>{0, 0, 0, 0, 1, 0, 1, 0, 22, 5, 0, 35, 0, 11, 0, 0, 2, 8, 42, 1};
for (int64_t i = 1; i < dims[0]; i++) {
for (int64_t j = 0; j < dims[1] * dims[2]; j++) {
x_in.push_back(x_in[j]);
y_out.push_back(y_out[j]);
}
}
test.AddInput<int8_t>("X", dims, x_in);
test.AddInput<float>("X_scale", {}, {X_scale});
test.AddInput<int8_t>("X_zero_point", {}, {X_zero_point});
test.AddInput<float>("Y_scale", {}, {Y_scale});
test.AddInput<int8_t>("Y_zero_point", {}, {Y_zero_point});
test.AddOutput<int8_t>("Y", dims, y_out);
auto origin_round_mode = std::fegetround();
std::fesetround(FE_TONEAREST);
test.Run();
std::fesetround(origin_round_mode);
}
} // namespace test
} // namespace onnxruntime
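
The four tests above exercise `f(x) = quantize(softmax(dequantize(x)))` with the `opset` attribute set to 12 or 13. The sketch below is a plain-Python reference for that formula, assuming the standard ONNX Softmax axis semantics (before opset 13 the input is coerced to 2-D at `axis`; from opset 13 on, normalization runs along `axis` only). The function name `qlinear_softmax_ref` and its parameters are illustrative. It is not the lookup-table kernel added in this commit, and it needs Python 3.8+ for `math.prod`.

```python
import math


def qlinear_softmax_ref(x, dims, axis, opset, x_scale, x_zp, y_scale, y_zp, qmin, qmax):
    """Reference sketch on a flat, row-major list of quantized values."""
    axis = axis % len(dims)  # normalize a negative axis
    outer = math.prod(dims[:axis])
    if opset < 13:
        # Older Softmax: coerce to 2-D [prod(dims[:axis]), prod(dims[axis:])].
        n, inner = math.prod(dims[axis:]), 1
    else:
        # Softmax-13: normalize along `axis` only.
        n, inner = dims[axis], math.prod(dims[axis + 1:])
    y = [0] * len(x)
    for o in range(outer):
        for i in range(inner):
            idx = [o * n * inner + k * inner + i for k in range(n)]
            real = [(x[j] - x_zp) * x_scale for j in idx]  # dequantize
            m = max(real)  # subtract the max for numerical stability
            e = [math.exp(v - m) for v in real]
            s = sum(e)
            for j, ev in zip(idx, e):
                q = round(ev / s / y_scale) + y_zp  # quantize (round half to even)
                y[j] = min(qmax, max(qmin, q))  # saturate to the integer range
    return y


# One batch of the QLinearSoftmax_UInt8_v12 data from the test above.
x_in = [50, 67, 58, 68, 46, 69, 77, 91, 62, 74, 67, 72, 71, 70, 83, 88, 75, 54, 74, 88]
expected = [0, 2, 0, 2, 0, 2, 8, 86, 1, 5, 2, 4, 3, 3, 23, 52, 6, 0, 5, 52]
y = qlinear_softmax_ref(x_in, [1, 4, 5], -2, 12, 0.166099221, 128, 1.0 / 256.0, 0, 0, 255)
assert y == expected
```

With `opset` 12 and `axis` -2 on a `{2, 4, 5}` tensor, each batch is normalized over all 20 trailing elements, which is why the expected values in the v12 tests sum to roughly 256 per batch rather than per column.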

@ -1857,6 +1857,67 @@ TEST(QDQTransformerTests, Concat) {
test_case({{1, 6, 36}, {1, 6, 8}, {1, 6, 2}}, 2, false, false, true);
}
template <typename InputType, typename OutputType>
void QDQTransformerSoftmaxTests() {
auto test_case = [&](const std::vector<int64_t>& input_shape, int64_t axis) {
auto build_test_case = [&](ModelTestBuilder& builder) {
auto* input_arg = builder.MakeInput<float>(input_shape, -5.f, 5.f);
auto* output_arg = builder.MakeOutput();
// add QDQ + Softmax
// Input zero point: 127 for uint8_t and 0 for int8_t (integer division: max() / 255 is 1 or 0).
auto* dq_output = AddQDQNodePair<InputType>(builder, input_arg, .105f,
(std::numeric_limits<OutputType>::max() / 255 * 255) / 2);
auto* softmax_output = builder.MakeIntermediate();
auto& softmax_node = builder.AddNode("Softmax", {dq_output}, {softmax_output});
softmax_node.AddAttribute("axis", axis);
// add QDQ output
auto* q_output = builder.MakeIntermediate();
builder.AddQuantizeLinearNode<OutputType>(softmax_output,
1.0f / (std::numeric_limits<OutputType>::max() + 1),
0,
q_output);
builder.AddDequantizeLinearNode<OutputType>(q_output,
1.0f / (std::numeric_limits<OutputType>::max() + 1),
0,
output_arg);
};
auto check_graph = [&](InferenceSessionWrapper& session) {
auto op_to_count = CountOpsInGraph(session.GetGraph());
if constexpr (std::is_same<InputType, OutputType>::value) {
EXPECT_EQ(op_to_count["com.microsoft.QLinearSoftmax"], 1);
EXPECT_EQ(op_to_count["Softmax"], 0);
EXPECT_EQ(op_to_count["QuantizeLinear"], 1);
EXPECT_EQ(op_to_count["DequantizeLinear"], 1);
} else {
EXPECT_EQ(op_to_count["com.microsoft.QLinearSoftmax"], 0);
EXPECT_EQ(op_to_count["Softmax"], 1);
EXPECT_EQ(op_to_count["QuantizeLinear"], 2);
EXPECT_EQ(op_to_count["DequantizeLinear"], 2);
}
};
TransformerTester(build_test_case,
check_graph,
TransformerLevel::Level1,
TransformerLevel::Level2,
12 /*opset_version*/,
0.01 /*per_sample_tolerance*/,
0.01 /*relative_per_sample_tolerance*/,
std::make_unique<QDQSelectorActionTransformer>(QDQIsInt8Allowed()));
};
test_case({1, 12, 37}, -1);
test_case({1, 23, 13, 13}, -2);
}
TEST(QDQTransformerTests, Softmax_S8S8) {
QDQTransformerSoftmaxTests<int8_t, int8_t>();
}
TEST(QDQTransformerTests, Softmax_U8U8) {
QDQTransformerSoftmaxTests<uint8_t, uint8_t>();
}
#endif // !defined(DISABLE_CONTRIB_OPS)
TEST(QDQTransformerTests, QDQPropagation_QBackward) {

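The Softmax QDQ test above quantizes the Softmax output with scale `1 / (max + 1)` and zero point 0, and derives the input zero point with integer division. The short sketch below reproduces that arithmetic for both 8-bit types. The helper name `softmax_test_qparams` is hypothetical and only mirrors the expressions in the C++ test.

```python
def softmax_test_qparams(qmax):
    """Mirror the quantization parameters used by QDQTransformerSoftmaxTests."""
    # C++ integer division: 255 / 255 * 255 / 2 == 127, while 127 / 255 == 0.
    input_zero_point = (qmax // 255 * 255) // 2
    # Softmax outputs lie in [0, 1], so one step is 1 / (max + 1).
    output_scale = 1.0 / (qmax + 1)
    return input_zero_point, output_scale


for name, qmax in (("uint8", 255), ("int8", 127)):
    zp, scale = softmax_test_qparams(qmax)
    # Largest real value representable with output zero point 0.
    print(name, zp, scale, qmax * scale)
```

For `uint8_t` this gives a scale of 1/256, and for `int8_t` a scale of 1/128 with only the non-negative half of the range in use, matching the `Y_scale` values in the v12 operator tests.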
@ -74,6 +74,8 @@ def check_model_correctness(testcase, model_path_origin, model_path_to_check, in
model_path_origin, sess_options=sess_options, providers=["CPUExecutionProvider"]
)
origin_results = origin_sess.run([], inputs)
# enable QDQ transformers
# sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
target_sess = onnxruntime.InferenceSession(
model_path_to_check,
sess_options=sess_options,

@ -0,0 +1,180 @@
#!/usr/bin/env python
# coding: utf-8
"""
Softmax quantization test case
"""
# -------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for
# license information.
# --------------------------------------------------------------------------
import unittest

import numpy as np
import onnx
from onnx import TensorProto, helper
from op_test_utils import TestDataFeeds, check_model_correctness, check_op_type_count, check_qtype_by_node_type

from onnxruntime.quantization import QuantFormat, QuantType, quantize_static


class TestOpSoftmax(unittest.TestCase):
    """Unit tests for Softmax quantization in QDQ and QOperator formats with u8 and s8 types."""

    def input_feeds(self, n_repeat, name2shape):
        input_data_list = []
        for _ in range(n_repeat):
            inputs = {}
            for name, shape in name2shape.items():
                inputs.update({name: np.random.randint(-1, 2, shape).astype(np.float32)})
            input_data_list.extend([inputs])
        data_r = TestDataFeeds(input_data_list)
        return data_r

    def construct_model_conv_softmax(
        self,
        output_model_path,
        conv_input_shape,
        conv_weight_shape,
        softmax_input_shape,
        softmax_attributes,
        output_shape,
    ):
        #       (input)
        #          \
        #          Conv
        #         /    \
        #   Identity   Softmax
        #     /            \
        # (identity_out)  (output)
        input_tensor = helper.make_tensor_value_info("input", TensorProto.FLOAT, conv_input_shape)

        conv_weight_arr = np.random.randint(-1, 2, conv_weight_shape).astype(np.float32)
        conv_weight_initializer = onnx.numpy_helper.from_array(conv_weight_arr, name="conv1_weight")
        conv_node = onnx.helper.make_node("Conv", ["input", "conv1_weight"], ["conv_output"], name="conv_node")

        identity_out = helper.make_tensor_value_info("identity_out", TensorProto.FLOAT, softmax_input_shape)
        identity_node = helper.make_node("Identity", ["conv_output"], ["identity_out"], name="IdentityNode")

        initializers = [conv_weight_initializer]

        output_tensor = helper.make_tensor_value_info("output", TensorProto.FLOAT, output_shape)
        softmax_node = helper.make_node(
            "Softmax", ["conv_output"], ["output"], name="softmax_node", **softmax_attributes
        )

        graph = helper.make_graph(
            [conv_node, identity_node, softmax_node],
            "TestOpQuantizersoftmax_test_model",
            [input_tensor],
            [identity_out, output_tensor],
            initializer=initializers,
        )
        model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
        model.ir_version = 7  # use stable onnx ir version
        onnx.save(model, output_model_path)

    def quantize_softmax_test(self, activation_type, weight_type, extra_options=None):
        # Avoid a mutable default argument; fall back to an empty options dict.
        extra_options = extra_options or {}
        np.random.seed(1)
        model_fp32_path = "softmax_fp32.onnx"
        self.construct_model_conv_softmax(
            model_fp32_path,
            [1, 2, 26, 42],
            [3, 2, 3, 3],
            [1, 3, 24, 40],
            {"axis": -2},
            [1, 3, 24, 40],
        )
        data_reader = self.input_feeds(1, {"input": [1, 2, 26, 42]})

        activation_proto_qtype = TensorProto.UINT8 if activation_type == QuantType.QUInt8 else TensorProto.INT8
        activation_type_str = "u8" if (activation_type == QuantType.QUInt8) else "s8"
        weight_type_str = "u8" if (weight_type == QuantType.QUInt8) else "s8"
        model_q8_path = f"softmax_{activation_type_str}{weight_type_str}.onnx"
        model_q8_qdq_path = f"softmax_qdq_{activation_type_str}{weight_type_str}.onnx"

        # Verify QOperator mode
        data_reader.rewind()
        quantize_static(
            model_fp32_path,
            model_q8_path,
            data_reader,
            quant_format=QuantFormat.QOperator,
            activation_type=activation_type,
            weight_type=weight_type,
            extra_options=extra_options,
        )

        qnode_counts = {
            "QLinearConv": 1,
            "QuantizeLinear": 1,
            "DequantizeLinear": 2,
            "QLinearSoftmax": 1,
            "Softmax": 0,
        }
        check_op_type_count(self, model_q8_path, **qnode_counts)
        qnode_io_qtypes = {
            "QuantizeLinear": [
                ["i", 2, activation_proto_qtype],
                ["o", 0, activation_proto_qtype],
            ]
        }
        qnode_io_qtypes.update(
            {
                "QLinearConv": [
                    ["i", 2, activation_proto_qtype],
                    ["i", 7, activation_proto_qtype],
                    ["o", 0, activation_proto_qtype],
                ]
            }
        )
        qnode_io_qtypes.update(
            {"QLinearSoftmax": [["i", 4, activation_proto_qtype]]}
        )  # shape info does not work on custom ops
        check_qtype_by_node_type(self, model_q8_path, qnode_io_qtypes)
        data_reader.rewind()
        check_model_correctness(self, model_fp32_path, model_q8_path, data_reader.get_next())

        # Verify QDQ mode
        data_reader.rewind()
        quantize_static(
            model_fp32_path,
            model_q8_qdq_path,
            data_reader,
            quant_format=QuantFormat.QDQ,
            activation_type=activation_type,
            weight_type=weight_type,
            extra_options=extra_options,
        )
        qdqnode_counts = {
            "Conv": 1,
            "QuantizeLinear": 3,
            "DequantizeLinear": 4,
            "Softmax": 1,
        }
        check_op_type_count(self, model_q8_qdq_path, **qdqnode_counts)
        qnode_io_qtypes = {
            "QuantizeLinear": [
                ["i", 2, activation_proto_qtype],
                ["o", 0, activation_proto_qtype],
            ]
        }
        check_qtype_by_node_type(self, model_q8_qdq_path, qnode_io_qtypes)
        data_reader.rewind()
        check_model_correctness(self, model_fp32_path, model_q8_qdq_path, data_reader.get_next())

    def test_quantize_softmax(self):
        self.quantize_softmax_test(QuantType.QUInt8, QuantType.QUInt8)

    def test_quantize_softmax_s8s8(self):
        self.quantize_softmax_test(
            QuantType.QInt8,
            QuantType.QInt8,
            extra_options={"ActivationSymmetric": True},
        )


if __name__ == "__main__":
    unittest.main()

@ -298,5 +298,9 @@
[
"QGemm com.microsoft CPUExecutionProvider",
13737193491843065240
],
[
"QLinearSoftmax com.microsoft CPUExecutionProvider",
10339195975968977840
]
]
]