mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-06-06 00:03:22 +00:00
[Qlinearsoftmax] contrib cpu (#12177)
* [Qlinearsoftmax] contrib cpu * int8 implementation * contrib operator md * qdq transformer test * new attribute: opset * doc * quantized tool * remove template to reduce Binary size * doc of contribe operators * enforce x_shape is valid * fix reduce_size if input-shape is dynamic * add UT * register one op for reducing binarysize * kernel hash update * docs/ContribOperators.md
This commit is contained in:
parent
0c6037b5ab
commit
64e991a9fc
20 changed files with 1014 additions and 6 deletions
|
|
@ -55,6 +55,7 @@ Do not modify directly.*
|
|||
* <a href="#com.microsoft.QLinearMul">com.microsoft.QLinearMul</a>
|
||||
* <a href="#com.microsoft.QLinearReduceMean">com.microsoft.QLinearReduceMean</a>
|
||||
* <a href="#com.microsoft.QLinearSigmoid">com.microsoft.QLinearSigmoid</a>
|
||||
* <a href="#com.microsoft.QLinearSoftmax">com.microsoft.QLinearSoftmax</a>
|
||||
* <a href="#com.microsoft.QuantizeLinear">com.microsoft.QuantizeLinear</a>
|
||||
* <a href="#com.microsoft.Range">com.microsoft.Range</a>
|
||||
* <a href="#com.microsoft.ReduceSumInteger">com.microsoft.ReduceSumInteger</a>
|
||||
|
|
@ -2771,7 +2772,7 @@ This version of the operator has been available since version 1 of the 'com.micr
|
|||
|
||||
### <a name="com.microsoft.QLinearSigmoid"></a><a name="com.microsoft.qlinearsigmoid">**com.microsoft.QLinearSigmoid**</a>
|
||||
|
||||
QLinearSigmoid takes quantized input data (Tensor), and quantize parameter for output, and produces one output data
|
||||
QLinearSigmoid takes quantized input data (Tensor), and quantize parameter for output, and produces one output data
|
||||
(Tensor<T>) where the function `f(x) = quantize(Sigmoid(dequantize(x)))`, is applied to the data tensor elementwise.
|
||||
Where the function `Sigmoid(x) = 1 / (1 + exp(-x))`
|
||||
|
||||
|
|
@ -2809,6 +2810,58 @@ This version of the operator has been available since version 1 of the 'com.micr
|
|||
</dl>
|
||||
|
||||
|
||||
### <a name="com.microsoft.QLinearSoftmax"></a><a name="com.microsoft.qlinearsoftmax">**com.microsoft.QLinearSoftmax**</a>
|
||||
|
||||
QLinearSoftmax computes the normalized exponential values for the given input:
|
||||
Softmax(input, axis) = Exp(input) / ReduceSum(Exp(input), axis=axis, keepdims=1)
|
||||
The input does not need to explicitly be a 2D vector. The "axis" attribute
|
||||
indicates the dimension along which QLinearSoftmax is performed for ONNX opset 13 and later,
|
||||
or the dimension from which the input is coerced to an NxD matrix for ONNX opset 12 and earlier.
|
||||
The output tensor has the same shape.
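As a rough illustration of the semantics described above, the following sketch computes one softmax row in floating point and requantizes it. It assumes QLinearSoftmax follows the same `quantize(f(dequantize(x)))` convention stated for QLinearSigmoid; the helper names (`dequantize`, `quantize`, `qlinear_softmax_row`) are hypothetical and not part of ONNX Runtime.

```python
import math


def dequantize(q, scale, zero_point):
    """Map quantized integers back to real values."""
    return [(v - zero_point) * scale for v in q]


def quantize(x, scale, zero_point, qmin, qmax):
    """Map real values to integers, saturating to [qmin, qmax]."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in x]


def softmax(x):
    """Numerically stable softmax over a 1-D list."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def qlinear_softmax_row(q, x_scale, x_zp, y_scale, y_zp, qmin=0, qmax=255):
    """Reference semantics for one row (assumed): quantize(Softmax(dequantize(q)))."""
    return quantize(softmax(dequantize(q, x_scale, x_zp)), y_scale, y_zp, qmin, qmax)


# Two equal uint8 inputs each receive probability 0.5, i.e. 128 at y_scale = 1/256.
print(qlinear_softmax_row([10, 10], 0.1, 0, 1 / 256.0, 0))
```

Because softmax is invariant to a constant shift of its input, the input zero point cancels out in this reference; only `x_scale` affects the result.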
|
||||
|
||||
#### Version
|
||||
|
||||
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
|
||||
|
||||
#### Attributes
|
||||
|
||||
<dl>
|
||||
<dt><tt>axis</tt> : int</dt>
|
||||
<dd>Apply softmax along dimension axis (opset 13 and later), or over all dimensions from axis onward after coercion to 2D (opset 12 and earlier), according to the opset attribute.</dd>
|
||||
<dt><tt>opset</tt> : int (required)</dt>
|
||||
<dd>opset version of corresponding SoftMax.</dd>
|
||||
</dl>
|
||||
|
||||
#### Inputs
|
||||
|
||||
<dl>
|
||||
<dt><tt>X</tt> : T</dt>
|
||||
<dd>The input tensor</dd>
|
||||
<dt><tt>X_scale</tt> : tensor(float)</dt>
|
||||
<dd>Scale of quantized input 'X'. It must be a scalar.</dd>
|
||||
<dt><tt>x_zero_point</tt> (optional) : T</dt>
|
||||
<dd>Zero point tensor for input 'X'. It must be a scalar.</dd>
|
||||
<dt><tt>y_scale</tt> : tensor(float)</dt>
|
||||
<dd>Scale of quantized output 'Y'. It must be a scalar.</dd>
|
||||
<dt><tt>y_zero_point</tt> : T</dt>
|
||||
<dd>Zero point tensor for output 'Y'. It must be a scalar.</dd>
|
||||
</dl>
|
||||
|
||||
#### Outputs
|
||||
|
||||
<dl>
|
||||
<dt><tt>Y</tt> : T</dt>
|
||||
<dd>Output data tensor with softmax applied to the input tensor. The output tensor has the same shape as the input.</dd>
|
||||
</dl>
|
||||
|
||||
#### Type Constraints
|
||||
|
||||
<dl>
|
||||
<dt><tt>T</tt> : tensor(uint8), tensor(int8)</dt>
|
||||
<dd>Constrain input and output types to signed/unsigned int8 tensors.</dd>
|
||||
</dl>
|
||||
|
||||
|
||||
### <a name="com.microsoft.QuantizeLinear"></a><a name="com.microsoft.quantizelinear">**com.microsoft.QuantizeLinear**</a>
|
||||
|
||||
The linear quantization operator. It consumes a full precision data, a scale, a zero point to compute the low precision / quantized tensor.
|
||||
|
|
|
|||
|
|
@ -430,6 +430,7 @@ Do not modify directly.*
|
|||
|QLinearLeakyRelu|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|
||||
|QLinearMul|*in* A:**T**<br> *in* A_scale:**tensor(float)**<br> *in* A_zero_point:**T**<br> *in* B:**T**<br> *in* B_scale:**tensor(float)**<br> *in* B_zero_point:**T**<br> *in* C_scale:**tensor(float)**<br> *in* C_zero_point:**T**<br> *out* C:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|
||||
|QLinearSigmoid|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|
||||
|QLinearSoftmax|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|
||||
|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|
||||
|Range|*in* start:**T**<br> *in* limit:**T**<br> *in* delta:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64)|
|
||||
|SampleOp|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|
||||
|
|
|
|||
|
|
@ -193,6 +193,9 @@ using BuildKernelCreateInfoFn = KernelCreateInfo (*)();
|
|||
#define ONNX_CPU_OPERATOR_ML_KERNEL(name, ver, builder, ...) \
|
||||
ONNX_OPERATOR_KERNEL_EX(name, kMLDomain, ver, kCpuExecutionProvider, builder, __VA_ARGS__)
|
||||
|
||||
#define ONNX_CPU_OPERATOR_MS_KERNEL(name, ver, builder, ...) \
|
||||
ONNX_OPERATOR_KERNEL_EX(name, kMSDomain, ver, kCpuExecutionProvider, builder, __VA_ARGS__)
|
||||
|
||||
#define ONNX_OPERATOR_KERNEL_EX(name, domain, ver, provider, builder, ...) \
|
||||
class ONNX_OPERATOR_KERNEL_CLASS_NAME(provider, domain, ver, name); \
|
||||
template <> \
|
||||
|
|
|
|||
|
|
@ -55,6 +55,7 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1,
|
|||
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearLeakyRelu);
|
||||
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearSigmoid);
|
||||
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearSigmoid);
|
||||
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearSoftmax);
|
||||
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearAdd);
|
||||
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearAdd);
|
||||
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearMul);
|
||||
|
|
@ -151,6 +152,7 @@ Status RegisterQuantizationKernels(KernelRegistry& kernel_registry) {
|
|||
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearLeakyRelu)>,
|
||||
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearSigmoid)>,
|
||||
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearSigmoid)>,
|
||||
BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, QLinearSoftmax)>,
|
||||
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearAdd)>,
|
||||
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QLinearAdd)>,
|
||||
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QLinearMul)>,
|
||||
|
|
|
|||
343
onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc
Normal file
|
|
@ -0,0 +1,343 @@
|
|||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT License.
|
||||
|
||||
#include "contrib_ops/cpu/quantization/qlinear_softmax.h"
|
||||
|
||||
#include <cstdint>
|
||||
#include <type_traits>
|
||||
#include <utility>
|
||||
|
||||
#include "core/common/common.h"
|
||||
#include "core/framework/tensorprotoutils.h"
|
||||
#include "core/providers/common.h"
|
||||
#include "core/providers/cpu/tensor/transpose.h"
|
||||
|
||||
#include "core/mlas/inc/mlas.h"
|
||||
#include "core/platform/threadpool.h"
|
||||
#include "gsl/gsl-lite.hpp"
|
||||
|
||||
namespace onnxruntime {
|
||||
namespace contrib {
|
||||
|
||||
constexpr int OPSET13 = 13;
|
||||
|
||||
namespace {
|
||||
|
||||
void QlinearBuildLookupTableUint32(gsl::span<uint32_t> table,
|
||||
const float x_scale,
|
||||
size_t reduce_len, bool is_signed) {
|
||||
const double qscale =
|
||||
fmin(static_cast<double>(UINT32_MAX) / static_cast<double>(reduce_len), static_cast<double>(0x7fffff));
|
||||
for (int32_t i = 0; i < 256; i++) {
|
||||
double scaled_exp_xi = qscale * exp(static_cast<double>(i - 255) * static_cast<double>(x_scale));
|
||||
// we can't get the real max value of input tensor here, so we just assume 255.
|
||||
// in the function of `QlinearSoftmaxCPU`,
|
||||
// all numbers will have a shift (255-max_value) if its max value is not 255
|
||||
//
|
||||
// if is_signed index = [0 1 2 3 ......126 127 -128 -127 ..... -3 -2 -1]
|
||||
// else [0 1 2 3 4 ..... 255]
|
||||
uint8_t index = static_cast<uint8_t>(is_signed ? i - 128 : i);
|
||||
table[index] = static_cast<uint32_t>(lrint(scaled_exp_xi));
|
||||
}
|
||||
}
|
||||
|
||||
void BuildLookupTableIfFixed(const OpKernelInfo& info, std::vector<uint32_t>& fixed_lookup_table,
|
||||
size_t reduce_len, bool is_signed) {
|
||||
const Tensor* tensor_x_scale = nullptr;
|
||||
|
||||
bool get_x_scale = info.TryGetConstantInput(1, &tensor_x_scale);
|
||||
ORT_ENFORCE(tensor_x_scale == nullptr || IsScalarOr1ElementVector(tensor_x_scale),
|
||||
"QlinearBuildLookupTable : input X_scale must be a scalar or 1D tensor of size 1");
|
||||
bool is_fixed_parameters = get_x_scale;
|
||||
|
||||
if (is_fixed_parameters) {
|
||||
fixed_lookup_table.resize(256);
|
||||
const float X_scale = *(tensor_x_scale->Data<float>());
|
||||
QlinearBuildLookupTableUint32(fixed_lookup_table, X_scale, reduce_len, is_signed);
|
||||
}
|
||||
}
|
||||
} // namespace
|
||||
|
||||
QLinearSoftmax::QLinearSoftmax(const OpKernelInfo& info)
|
||||
: OpKernel(info) {
|
||||
const auto& node = info.node();
|
||||
auto input_defs = node.InputDefs();
|
||||
auto input_type = input_defs[0]->TypeAsProto()->tensor_type().elem_type();
|
||||
is_signed_ = (input_type == ONNX_NAMESPACE::TensorProto_DataType_INT8);
|
||||
const auto* x_shape = input_defs[0]->Shape();
|
||||
ORT_ENFORCE(x_shape != nullptr && x_shape->dim_size() > 0, "input shape of QLinearSoftmax must exist");
|
||||
int rank = x_shape->dim_size();
|
||||
|
||||
int64_t opset = -1;
|
||||
Status status = info.GetAttr<int64_t>("opset", &opset);
|
||||
ORT_ENFORCE(status.IsOK(), "opset must exist in the attributes of QLinearSoftmax");
|
||||
opset_ = gsl::narrow_cast<int>(opset);
|
||||
|
||||
int64_t axis = -1;
|
||||
status = info.GetAttr<int64_t>("axis", &axis);
|
||||
if (status.IsOK()) {
|
||||
axis_ = gsl::narrow_cast<int>(axis);
|
||||
} else {
|
||||
// opset-12 and below, the default axis value is 1
|
||||
// opset-13, the default axis value is -1
|
||||
axis_ = opset_ < OPSET13 ? 1 : -1;
|
||||
}
|
||||
|
||||
axis_ = static_cast<int>(HandleNegativeAxis(axis_, int64_t(rank)));
|
||||
auto input_shape = utils::GetTensorShapeFromTensorShapeProto(*x_shape);
|
||||
int64_t reduce_size = opset_ < OPSET13 ? input_shape.SizeFromDimension(axis_) : input_shape[axis_];
|
||||
// reduce_size could be negative if input-shape has a dynamic axis
|
||||
if (reduce_size > 0) {
|
||||
BuildLookupTableIfFixed(info, fixed_lookup_table_, reduce_size, is_signed_);
|
||||
}
|
||||
}
|
||||
|
||||
// compute method of Softmax
|
||||
Status QLinearSoftmax::Compute(OpKernelContext* ctx) const {
|
||||
const auto* X = ctx->Input<Tensor>(0);
|
||||
const auto& X_shape = X->Shape();
|
||||
auto* Y = ctx->Output(0, X_shape);
|
||||
|
||||
// edge case. one or more dims with value of 0. nothing to do
|
||||
if (X_shape.Size() == 0) {
|
||||
return Status::OK();
|
||||
}
|
||||
concurrency::ThreadPool* thread_pool = ctx->GetOperatorThreadPool();
|
||||
const size_t D = opset_ < OPSET13 ? X_shape.SizeFromDimension(axis_): X_shape[axis_];
|
||||
uint32_t tmp_lookup_table[256];
|
||||
gsl::span<const uint32_t> lookup_table = GetLookupTable(ctx, tmp_lookup_table, D);
|
||||
|
||||
if (opset_ < OPSET13) {
|
||||
return ComputeInternal(ctx, *X, *Y, lookup_table, axis_, thread_pool);
|
||||
} else {
|
||||
return ComputeImplOpset13(ctx, *X, *Y, lookup_table, thread_pool);
|
||||
}
|
||||
}
|
||||
|
||||
template <typename T>
|
||||
common::Status QlinearSoftmaxCPU(size_t N,
|
||||
size_t D,
|
||||
const T* x_data,
|
||||
T* y_data,
|
||||
const uint32_t* lookup_table,
|
||||
uint32_t y_scale,
|
||||
T yzp,
|
||||
onnxruntime::concurrency::ThreadPool* thread_pool);
|
||||
|
||||
template <>
|
||||
common::Status QlinearSoftmaxCPU<uint8_t>(size_t N,
|
||||
size_t D,
|
||||
const uint8_t* x_data,
|
||||
uint8_t* y_data,
|
||||
const uint32_t* lookup_table,
|
||||
uint32_t y_scale,
|
||||
uint8_t yzp,
|
||||
onnxruntime::concurrency::ThreadPool* thread_pool) {
|
||||
using onnxruntime::TensorOpCost;
|
||||
using onnxruntime::concurrency::ThreadPool;
|
||||
ThreadPool::TryParallelFor(
|
||||
thread_pool, N,
|
||||
// Per row: read 3*D (max, sum, div), write D (div), computation=read
|
||||
TensorOpCost{static_cast<double>(D * 3),
|
||||
static_cast<double>(D),
|
||||
static_cast<double>(D * 3)},
|
||||
[x_data, y_data, D, y_scale, yzp, &lookup_table](std::ptrdiff_t first, std::ptrdiff_t last) {
|
||||
const auto c_y_scale = y_scale;
|
||||
const auto c_y_zp = yzp;
|
||||
const uint8_t* x_t = x_data + first * D;
|
||||
uint8_t* y_t = y_data + first * D;
|
||||
for (; first < last; first++) {
|
||||
// reduceMaxUint8
|
||||
uint8_t xmax = *std::max_element(x_t, x_t + D);
|
||||
// we want xmax to align with 255 for higher precision.
|
||||
// as we build a lookup table with X-255. So we could use the adjustment here
|
||||
// to let all numbers have a shift in the lookup table.
|
||||
// 1 2 3 4 5 ...........................254 255
|
||||
// 1 3 5 ... 10
|
||||
// after the shift --->
|
||||
// 235 237 239 .. 255
|
||||
const uint32_t* shifted_lookuptable = lookup_table + 255 - xmax;
|
||||
size_t elements_n = D;
|
||||
// reduceSumUint8ToUint32: need speedup
|
||||
// vsum = \sum_i{e^x_i}
|
||||
uint32_t vsum = 0;
|
||||
const uint8_t* x_t_cur = x_t;
|
||||
do {
|
||||
const size_t vx = *x_t_cur++;
|
||||
vsum += shifted_lookuptable[vx];
|
||||
} while (--elements_n != 0);
|
||||
if (vsum == 0) {
|
||||
return;
|
||||
}
|
||||
elements_n = D;
|
||||
x_t_cur = x_t;
|
||||
// elementwise div, y_i=\frac{e^{x_i}}{vsum}
|
||||
const uint32_t vrounding = (vsum >> 1);
|
||||
do {
|
||||
const size_t vx = *x_t_cur++;
|
||||
const uint32_t vt = shifted_lookuptable[vx];
|
||||
// simulate round function, and re-quant to uint8
|
||||
const uint32_t vq = ((vt * c_y_scale) + vrounding) / vsum + c_y_zp;
|
||||
const uint8_t vy = vq > 255 ? static_cast<uint8_t>(255) : static_cast<uint8_t>(vq);
|
||||
*y_t++ = vy;
|
||||
} while (--elements_n != 0);
|
||||
x_t = x_t_cur;
|
||||
}
|
||||
});
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
template <>
|
||||
common::Status QlinearSoftmaxCPU<int8_t>(size_t N,
|
||||
size_t D,
|
||||
const int8_t* x_data,
|
||||
int8_t* y_data,
|
||||
const uint32_t* lookup_table,
|
||||
uint32_t y_scale,
|
||||
int8_t yzp,
|
||||
onnxruntime::concurrency::ThreadPool* thread_pool) {
|
||||
using onnxruntime::TensorOpCost;
|
||||
using onnxruntime::concurrency::ThreadPool;
|
||||
ThreadPool::TryParallelFor(
|
||||
thread_pool, N,
|
||||
// Per row: read 3*D (max, sum, div), write D (div), computation=read
|
||||
TensorOpCost{static_cast<double>(D * 3),
|
||||
static_cast<double>(D),
|
||||
static_cast<double>(D * 3)},
|
||||
[x_data, y_data, D, y_scale, yzp, &lookup_table](std::ptrdiff_t first, std::ptrdiff_t last) {
|
||||
const auto c_y_scale = y_scale;
|
||||
const auto c_y_zp = yzp;
|
||||
|
||||
const int8_t* x_t = x_data + first * D;
|
||||
int8_t* y_t = y_data + first * D;
|
||||
for (; first < last; first++) {
|
||||
// reduceMaxInt8
|
||||
int8_t xmax = *std::max_element(x_t, x_t + D);
|
||||
const size_t adjustment = 127 - xmax;
|
||||
const uint32_t* shifted_lookuptable = lookup_table;
|
||||
size_t elements_n = D;
|
||||
// reduceSumInt8ToUint32: need speedup
|
||||
uint32_t vsum = 0;
|
||||
const int8_t* x_t_cur = x_t;
|
||||
do {
|
||||
const size_t vx = uint8_t(adjustment + (*x_t_cur++));
|
||||
vsum += shifted_lookuptable[vx];
|
||||
} while (--elements_n != 0);
|
||||
if (vsum == 0) {
|
||||
return;
|
||||
}
|
||||
elements_n = D;
|
||||
x_t_cur = x_t;
|
||||
// elementwise div
|
||||
const uint32_t vrounding = (vsum >> 1);
|
||||
do {
|
||||
const size_t vx = uint8_t(adjustment + (*x_t_cur++));
|
||||
const uint32_t vt = shifted_lookuptable[vx];
|
||||
// simulate round function, and re-quant to Int8
|
||||
const uint32_t vq = ((vt * c_y_scale) + vrounding) / vsum + c_y_zp;
|
||||
const int8_t vy = static_cast<int32_t>(vq) > 127 ? static_cast<int8_t>(127) : static_cast<int8_t>(vq);
|
||||
*y_t++ = vy;
|
||||
} while (--elements_n != 0);
|
||||
x_t = x_t_cur;
|
||||
}
|
||||
});
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
gsl::span<const uint32_t> QLinearSoftmax::GetLookupTable(OpKernelContext* context,
|
||||
gsl::span<uint32_t> lookup_table_span,
|
||||
size_t reduce_len) const {
|
||||
gsl::span<const uint32_t> lookup_table = fixed_lookup_table_;
|
||||
if (fixed_lookup_table_.size() == 0) {
|
||||
lookup_table = lookup_table_span;
|
||||
const float X_scale = *(context->Input<Tensor>(1)->Data<float>());
|
||||
QlinearBuildLookupTableUint32(lookup_table_span, X_scale, reduce_len, is_signed_);
|
||||
}
|
||||
return lookup_table;
|
||||
}
|
||||
|
||||
// opset-12 and below
|
||||
Status QLinearSoftmax::ComputeInternal(OpKernelContext* context, const Tensor& input, Tensor& output,
|
||||
gsl::span<const uint32_t> lookup_table, int axis,
|
||||
concurrency::ThreadPool* thread_pool) const {
|
||||
const auto* Y_scale_tensor = context->Input<Tensor>(3);
|
||||
const auto* Y_zp_tensor = context->Input<Tensor>(4);
|
||||
const auto Y_scale = gsl::narrow_cast<uint32_t>(1.0F / (*(Y_scale_tensor->Data<float>())));
|
||||
const auto& X_shape = input.Shape();
|
||||
const size_t N = X_shape.SizeToDimension(axis);
|
||||
const size_t D = X_shape.SizeFromDimension(axis);
|
||||
common::Status status;
|
||||
if (is_signed_) {
|
||||
using T = int8_t;
|
||||
const T Y_zp = Y_zp_tensor ? *(Y_zp_tensor->Data<T>()) : 0;
|
||||
status = QlinearSoftmaxCPU<T>(N, D, input.Data<T>(), output.MutableData<T>(),
|
||||
lookup_table.data(), Y_scale, Y_zp, thread_pool);
|
||||
} else {
|
||||
using T = uint8_t;
|
||||
const T Y_zp = Y_zp_tensor ? *(Y_zp_tensor->Data<T>()) : 0;
|
||||
status = QlinearSoftmaxCPU<T>(N, D, input.Data<T>(), output.MutableData<T>(),
|
||||
lookup_table.data(), Y_scale, Y_zp, thread_pool);
|
||||
}
|
||||
return status;
|
||||
}
|
||||
|
||||
// opset-13 and above
|
||||
Status QLinearSoftmax::ComputeImplOpset13(OpKernelContext* context,
|
||||
const Tensor& input, Tensor& output,
|
||||
gsl::span<const uint32_t> lookup_table,
|
||||
concurrency::ThreadPool* thread_pool) const {
|
||||
const auto& X_shape = input.Shape();
|
||||
size_t rank = X_shape.NumDimensions();
|
||||
|
||||
bool is_transpose_required = (size_t(axis_) != (rank - 1));
|
||||
Tensor transposed_input;
|
||||
Tensor intermediate_output; // output that the softmax implementation will write into while using transposed input
|
||||
std::vector<size_t> permutation(rank);
|
||||
|
||||
if (is_transpose_required) {
|
||||
AllocatorPtr alloc;
|
||||
ORT_RETURN_IF_ERROR(context->GetTempSpaceAllocator(&alloc));
|
||||
std::iota(std::begin(permutation), std::end(permutation), 0);
|
||||
|
||||
// swap the innermost dim with the dim corresponding to axis
|
||||
permutation[axis_] = rank - 1;
|
||||
permutation[rank - 1] = axis_;
|
||||
std::vector<int64_t> transposed_input_dims(rank);
|
||||
std::transform(permutation.cbegin(), permutation.cend(),
|
||||
transposed_input_dims.begin(), [&X_shape](size_t e) { return X_shape[e]; });
|
||||
|
||||
// Allocate a temporary tensor to hold transposed input
|
||||
transposed_input = Tensor(input.DataType(), TensorShape(transposed_input_dims), alloc);
|
||||
// Perform the transpose
|
||||
ORT_RETURN_IF_ERROR(TransposeBase::DoTranspose(permutation, input, transposed_input));
|
||||
// Allocate memory for the intermediate output
|
||||
intermediate_output = Tensor(output.DataType(), TensorShape(transposed_input_dims), alloc);
|
||||
}
|
||||
|
||||
common::Status status;
|
||||
|
||||
const auto& input_tensor = is_transpose_required ? transposed_input : input;
|
||||
auto& output_tensor = is_transpose_required ? intermediate_output : output;
|
||||
|
||||
ORT_RETURN_IF_ERROR(ComputeInternal(context, input_tensor, output_tensor, lookup_table, int(rank - 1), thread_pool));
|
||||
|
||||
if (is_transpose_required) {
|
||||
// Perform the transpose to get the axes back to the original ordering
|
||||
status = (TransposeBase::DoTranspose(permutation, intermediate_output, output));
|
||||
}
|
||||
return status;
|
||||
}
|
||||
|
||||
ONNX_CPU_OPERATOR_MS_KERNEL(
|
||||
QLinearSoftmax,
|
||||
1,
|
||||
KernelDefBuilder().TypeConstraint(
|
||||
"T",
|
||||
{DataTypeImpl::GetTensorType<uint8_t>(),
|
||||
DataTypeImpl::GetTensorType<int8_t>()}),
|
||||
QLinearSoftmax)
|
||||
|
||||
} // namespace contrib
|
||||
} // namespace onnxruntime
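The uint8 path of the kernel above can be mirrored in a few lines of Python to make the integer arithmetic easier to follow. This is only a sketch under stated assumptions: the function names are hypothetical, `inv_y_scale` stands for the integer `1 / y_scale` the kernel derives, and Python integers do not wrap the way `uint32_t` does in the C++ code.

```python
import math


def build_lookup_table(x_scale, reduce_len):
    """table[i] ~= qscale * exp((i - 255) * x_scale), for i in 0..255.

    qscale is capped so that summing reduce_len entries stays below 2**32.
    """
    qscale = min(float(2**32 - 1) / reduce_len, float(0x7FFFFF))
    return [round(qscale * math.exp((i - 255) * x_scale)) for i in range(256)]


def softmax_row_uint8(row, table, inv_y_scale, y_zp):
    """Integer softmax of one row of uint8 values using the lookup table."""
    # Shift every index so the row maximum lands on entry 255 (exp(0)).
    shift = 255 - max(row)
    exps = [table[v + shift] for v in row]
    vsum = sum(exps)
    # Sketch-only guard; the kernel returns early when the sum is zero.
    assert vsum > 0, "sum of exponentials must be positive"
    rounding = vsum >> 1  # adding half the divisor emulates round-to-nearest
    return [min(255, (e * inv_y_scale + rounding) // vsum + y_zp) for e in exps]


table = build_lookup_table(x_scale=0.1, reduce_len=2)
print(softmax_row_uint8([5, 5], table, inv_y_scale=256, y_zp=0))
```

Aligning the row maximum with entry 255 keeps the largest exponential at full table precision, which is the purpose of the shifted lookup-table pointer in the C++ code.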
|
||||
34
onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.h
Normal file
|
|
@ -0,0 +1,34 @@
|
|||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT License.
|
||||
|
||||
#pragma once
|
||||
|
||||
#include <vector>
|
||||
|
||||
#include "core/framework/op_kernel.h"
|
||||
|
||||
namespace onnxruntime {
|
||||
namespace contrib {
|
||||
|
||||
class QLinearSoftmax final : public OpKernel {
|
||||
public:
|
||||
QLinearSoftmax(const OpKernelInfo& info);
|
||||
Status Compute(OpKernelContext* context) const override;
|
||||
|
||||
private:
|
||||
gsl::span<const uint32_t> GetLookupTable(OpKernelContext* context, gsl::span<uint32_t> lookup_table_span, size_t reduce_len) const;
|
||||
|
||||
Status ComputeInternal(OpKernelContext* context, const Tensor& input, Tensor& output, gsl::span<const uint32_t> lookup_table, int axis, concurrency::ThreadPool* thread_pool) const;
|
||||
|
||||
Status ComputeImplOpset13(OpKernelContext* context, const Tensor& input, Tensor& output,
|
||||
gsl::span<const uint32_t> lookup_table, concurrency::ThreadPool* thread_pool) const;
|
||||
|
||||
private:
|
||||
std::vector<uint32_t> fixed_lookup_table_;
|
||||
int axis_ = -1;
|
||||
int opset_ = 1;
|
||||
bool is_signed_{false};
|
||||
};
|
||||
|
||||
} // namespace contrib
|
||||
} // namespace onnxruntime
|
||||
|
|
@ -29,6 +29,7 @@ class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearLeakyRelu);
|
|||
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearMul);
|
||||
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearReduceMean);
|
||||
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSigmoid);
|
||||
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSoftmax);
|
||||
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QuantizeLinear);
|
||||
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, ReduceSumInteger);
|
||||
|
||||
|
|
@ -98,6 +99,7 @@ class OpSet_Microsoft_ver1 {
|
|||
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearMul)>());
|
||||
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearReduceMean)>());
|
||||
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSigmoid)>());
|
||||
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QLinearSoftmax)>());
|
||||
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, QuantizeLinear)>());
|
||||
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Microsoft, 1, ReduceSumInteger)>());
|
||||
|
||||
|
|
|
|||
|
|
@ -559,7 +559,7 @@ and produces one output data (Tensor<T>) where the function `f(x) = quantize(alp
|
|||
.TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
|
||||
|
||||
const char* QLinearSigmoidDoc_ver1 = R"DOC(
|
||||
QLinearSigmoid takes quantized input data (Tensor), and quantize parameter for output, and produces one output data
|
||||
QLinearSigmoid takes quantized input data (Tensor), and quantize parameter for output, and produces one output data
|
||||
(Tensor<T>) where the function `f(x) = quantize(Sigmoid(dequantize(x)))`, is applied to the data tensor elementwise.
|
||||
Where the function `Sigmoid(x) = 1 / (1 + exp(-x))` )DOC";
|
||||
|
||||
|
|
@ -585,6 +585,62 @@ Wwhere the function `Sigmoid(x) = 1 / (1 + exp(-x))` )DOC";
|
|||
"Constrain input and output types to 8 bit tensors.")
|
||||
.TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
|
||||
|
||||
ONNX_MS_OPERATOR_SET_SCHEMA(QLinearSoftmax, 1, OpSchema().SetDoc(R"DOC(
|
||||
QLinearSoftmax computes the normalized exponential values for the given input:
|
||||
Softmax(input, axis) = Exp(input) / ReduceSum(Exp(input), axis=axis, keepdims=1)
|
||||
The input does not need to explicitly be a 2D vector. The "axis" attribute
|
||||
indicates the dimension along which QLinearSoftmax is performed for ONNX opset 13 and later,
|
||||
or the dimension from which the input is coerced to an NxD matrix for ONNX opset 12 and earlier.
|
||||
The output tensor has the same shape.
|
||||
)DOC")
|
||||
.Attr("axis", "apply softmax to elements for dimensions axis,"
|
||||
"or all dims along with axis according to op-version", AttributeProto::INT, static_cast<int64_t>(-1))
|
||||
.Attr("opset", "opset version of corresponding SoftMax.", AttributeProto::INT)
|
||||
.Input(0, "X", "The input tensor", "T")
|
||||
.Input(1, "X_scale", "Scale of quantized input 'X'. It must be a scalar.", "tensor(float)")
|
||||
.Input(2, "x_zero_point",
|
||||
"Zero point tensor for input 'X'. "
|
||||
"It must be a scalar.",
|
||||
"T", OpSchema::Optional)
|
||||
.Input(3, "y_scale", "Scale of quantized output 'Y'. It must be a scalar.", "tensor(float)")
|
||||
.Input(4, "y_zero_point",
|
||||
"Zero point tensor for output 'Y'. "
|
||||
"It must be a scalar.",
|
||||
"T")
|
||||
.Output(0, "Y",
|
||||
"Output data tensor with softmax applied to the input "
|
||||
"tensor. The output tensor has the same shape as the input. ",
|
||||
"T")
|
||||
.TypeConstraint("T", {"tensor(uint8)", "tensor(int8)"},
|
||||
"Constrain input and output types to signed/unsigned int8 tensors.")
|
||||
.TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
|
||||
// Type inference
|
||||
propagateElemTypeFromInputToOutput(ctx, 0, 0);
|
||||
|
||||
// Shape inference starts
|
||||
if (!hasNInputShapes(ctx, 1)) {
|
||||
return;
|
||||
}
|
||||
|
||||
// Validate the value of 'axis'
|
||||
const ONNX_NAMESPACE::TensorShapeProto& input_shape =
|
||||
ctx.getInputType(0)->tensor_type().shape();
|
||||
int r = input_shape.dim_size();
|
||||
int axis = static_cast<int>(getAttribute(ctx, "axis", -1));
|
||||
if (axis < -r || axis >= r) {
|
||||
fail_shape_inference(
|
||||
"'axis' must be in [",
|
||||
-r,
|
||||
" , ",
|
||||
(r - 1),
|
||||
"]. Its actual value is: ",
|
||||
axis);
|
||||
}
|
||||
|
||||
// Shape inference
|
||||
propagateShapeFromInputToOutput(ctx, 0, 0);
|
||||
}));
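The axis range check in the shape inference above, together with the two opset interpretations used by the CPU kernel, can be sketched as follows. The function name `softmax_reduce_dims` is hypothetical; it returns the number of independent rows `N` and the reduction length `D`, assuming a fully static, non-empty shape.

```python
from math import prod


def softmax_reduce_dims(shape, axis, opset):
    """Return (N, D): row count and reduction length for a static shape."""
    rank = len(shape)
    if axis < -rank or axis >= rank:
        raise ValueError(
            f"'axis' must be in [{-rank}, {rank - 1}]. Its actual value is: {axis}"
        )
    if axis < 0:
        axis += rank  # normalize a negative axis
    if opset < 13:
        # Opset 12 and earlier: coerce to 2D, reducing over all dims from axis on.
        return prod(shape[:axis]), prod(shape[axis:])
    # Opset 13 and later: reduce over the single dimension `axis`.
    return prod(shape[:axis]) * prod(shape[axis + 1:]), shape[axis]


print(softmax_reduce_dims((2, 3, 4), 1, 12))  # coerced 2D view
print(softmax_reduce_dims((2, 3, 4), 1, 13))  # single-axis view
```

This is why the operator needs the extra `opset` attribute: the same `axis` value selects a different reduction length depending on which Softmax version the node replaced.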
|
||||
|
||||
ONNX_MS_OPERATOR_SET_SCHEMA(DynamicQuantizeLSTM, 1, OpSchema()
|
||||
.Attr(
|
||||
"direction",
|
||||
|
|
|
|||
|
|
@ -4,7 +4,7 @@
|
|||
#include "core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h"
|
||||
|
||||
#include "core/optimizer/qdq_transformer/qdq_util.h"
|
||||
|
||||
#include "core/graph/node_attr_utils.h"
|
||||
namespace onnxruntime {
|
||||
namespace QDQ {
|
||||
|
||||
|
|
@ -195,6 +195,15 @@ UnaryReplaceWithQLinear::UnaryReplaceWithQLinear(std::string domain)
|
|||
: ReplaceWithQLinear(std::move(domain), UnaryMoves()) {
|
||||
}
|
||||
|
||||
NodeAttributes UnaryReplaceWithQLinear::ExtraAttributes(const RuntimeState& state) const {
|
||||
const auto& target = state.selected_nodes.Target();
|
||||
NodeAttributes attr;
|
||||
if (target.OpType() == "Softmax") {
|
||||
attr["opset"] = utils::MakeAttribute(std::string("opset"), int64_t(target.SinceVersion()));
|
||||
}
|
||||
return attr;
|
||||
}
|
||||
|
||||
BinaryReplaceWithQLinear::BinaryReplaceWithQLinear(std::string domain)
|
||||
: ReplaceWithQLinear(std::move(domain), BinaryMoves()) {
|
||||
}
|
||||
|
|
|
|||
|
|
@ -43,6 +43,9 @@ struct ReplaceWithQLinear : public QDQReplaceWithNew {
|
|||
|
||||
struct UnaryReplaceWithQLinear : ReplaceWithQLinear {
|
||||
UnaryReplaceWithQLinear(std::string domain);
|
||||
|
||||
private:
|
||||
NodeAttributes ExtraAttributes(const RuntimeState& state) const override;
|
||||
};
|
||||
|
||||
struct BinaryReplaceWithQLinear : ReplaceWithQLinear {
|
||||
|
|
|
|||
|
|
@ -82,7 +82,8 @@ void UnaryOpQDQRules(SelectorActionRegistry& qdq_selector_action_registry) {
|
|||
{{"AveragePool", {}},
|
||||
{"LeakyRelu", {}},
|
||||
{"GlobalAveragePool", {}},
|
||||
{"Sigmoid", {}}},
|
||||
{"Sigmoid", {}},
|
||||
{"Softmax", {}}},
|
||||
std::move(selector),
|
||||
std::move(action));
|
||||
#else
|
||||
|
|
|
|||
|
|
@ -132,6 +132,8 @@ class ONNXQuantizer:
|
|||
# some outputs from nodes will be quantized, yet they should be treated as existing so
|
||||
# no dequantize will be applied when needed later
|
||||
self.generated_value_names = self.model.get_non_initializer_inputs()
|
||||
# to store specified scale and zeropoint instead of calculated value, tensor_name->(scale, zeropoint)
|
||||
self.used_scale_zp_map = {}
|
||||
|
||||
# routines for subgraph support
|
||||
def quantize_subgraph(self, subgraph, graph_key):
|
||||
|
|
@ -625,6 +627,18 @@ class ONNXQuantizer:
|
|||
self.quantized_value_map[input_name] = QuantizedValue(input_name, output_name, scale_name, zp_name, qType)
|
||||
return nodes + [qlinear_node]
|
||||
|
||||
def set_quant_scale_zp(self, tensor_name, value):
|
||||
assert isinstance(value, tuple) and len(value) == 2, "value must be scale(float) and zeropoint"
|
||||
assert tensor_name not in self.used_scale_zp_map, f"{tensor_name} has already been set"
|
||||
self.used_scale_zp_map[tensor_name] = value
|
||||
|
||||
def find_quant_scale_zp(self, input_name):
|
||||
if input_name in self.used_scale_zp_map:
|
||||
return self.used_scale_zp_map[input_name]
|
||||
if self.parent is not None:
|
||||
return self.parent.find_quant_scale_zp(input_name)
|
||||
return (None, None)
|
||||
|
||||
def find_quantized_value(self, input_name):
|
||||
if input_name in self.quantized_value_map:
|
||||
return self.quantized_value_map[input_name]
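The override map added in this hunk can be exercised in isolation with a small stand-in class. `ScaleZpRegistry` is a hypothetical name, not the real `ONNXQuantizer`, and the sketch assumes the parent lookup is meant to return a `(scale, zero_point)` tuple as well, so it recurses into the parent's `find_quant_scale_zp`.

```python
class ScaleZpRegistry:
    """Hypothetical stand-in for the scale/zero-point override bookkeeping."""

    def __init__(self, parent=None):
        self.parent = parent
        # tensor_name -> (scale, zero_point), used instead of calculated values
        self.used_scale_zp_map = {}

    def set_quant_scale_zp(self, tensor_name, value):
        if not (isinstance(value, tuple) and len(value) == 2):
            raise ValueError("value must be a (scale, zero_point) tuple")
        if tensor_name in self.used_scale_zp_map:
            raise ValueError(f"{tensor_name} has already been set")
        self.used_scale_zp_map[tensor_name] = value

    def find_quant_scale_zp(self, input_name):
        if input_name in self.used_scale_zp_map:
            return self.used_scale_zp_map[input_name]
        if self.parent is not None:
            # Assumption: fall back to the enclosing graph's overrides.
            return self.parent.find_quant_scale_zp(input_name)
        return (None, None)


outer = ScaleZpRegistry()
outer.set_quant_scale_zp("softmax_out", (1 / 256.0, 0))
inner = ScaleZpRegistry(parent=outer)
print(inner.find_quant_scale_zp("softmax_out"))
```

A subgraph-level instance therefore sees overrides registered on its parent, and callers always receive a two-element tuple, with `(None, None)` signalling that no override exists.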
|
||||
|
|
|
|||
86
onnxruntime/python/tools/quantization/operators/softmax.py
Normal file
|
|
@ -0,0 +1,86 @@
|
|||
import onnx

from ..quant_utils import QuantizedValue, QuantizedValueType, attribute_to_kwarg, ms_domain
from .base_operator import QuantOperatorBase
from .qdq_base_operator import QDQOperatorBase


class QLinearSoftmax(QuantOperatorBase):
    def quantize(self):
        node = self.node
        # pin the softmax output scale and zero point to fixed values, because softmax output always lies in [0, 1]
        if self.quantizer.input_qType == onnx.onnx_pb.TensorProto.UINT8:
            out_scale = 1 / 256.0
            out_zero_point = 0
        else:
            out_scale = 1 / 256.0
            out_zero_point = -128
        # only try to quantize when given quantization parameters for it
        (
            data_found,
            output_scale_name,
            output_zp_name,
            _,
            _,
        ) = self.quantizer._get_quantization_params(node.output[0], out_scale, out_zero_point)

        # get quantized input tensor names, quantize input if needed
        (
            quantized_input_names,
            input_zero_point_names,
            input_scale_names,
            nodes,
        ) = self.quantizer.quantize_inputs(node, [0])

        if not data_found or quantized_input_names is None:
            return super().quantize()

        # Create an entry for output quantized value.
        qlinear_output_name = node.output[0] + "_quantized"
        quantized_output_value = QuantizedValue(
            node.output[0],
            qlinear_output_name,
            output_scale_name,
            output_zp_name,
            QuantizedValueType.Input,
        )
        self.quantizer.quantized_value_map[node.output[0]] = quantized_output_value

        # Create qlinear softmax node for given type
        kwargs = {}
        for attribute in node.attribute:
            kwargs.update(attribute_to_kwarg(attribute))
        kwargs["domain"] = ms_domain
        # pass the model's real opset version to QLinearSoftmax; the contrib op's own SinceVersion is 1
        kwargs["opset"] = self.quantizer.opset_version
        qlinear_node_name = node.name + "_quant" if node.name != "" else ""
        qnode = onnx.helper.make_node(
            "QLinear" + node.op_type,
            [
                quantized_input_names[0],
                input_scale_names[0],
                input_zero_point_names[0],
                output_scale_name,
                output_zp_name,
            ],
            [qlinear_output_name],
            qlinear_node_name,
            **kwargs,
        )

        # add all newly created nodes
        nodes.append(qnode)
        self.quantizer.new_nodes += nodes
        return None


class QDQSoftmax(QDQOperatorBase):
    def quantize(self):
        super().quantize()
        if self.quantizer.input_qType == onnx.onnx_pb.TensorProto.UINT8:
            out_scale = 1 / 256.0
            out_zero_point = 0
        else:
            out_scale = 1 / 256.0
            out_zero_point = -128
        self.quantizer.set_quant_scale_zp(self.node.output[0], (out_scale, out_zero_point))
|
|
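Both operator classes in `softmax.py` hard-code the output scale to `1 / 256.0` and the zero point to `0` (uint8) or `-128` (int8). A small sketch of why those values fit a softmax output, assuming plain linear quantization with round-half-to-even and saturation; `quantize` and `dequantize` here are illustrative helpers, not functions from the quantization tool.

```python
def quantize(value, scale, zero_point, qmin, qmax):
    """Linear quantization with saturation (Python's round() is half-to-even)."""
    q = round(value / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


# The fixed output parameters chosen in softmax.py for each activation type.
UINT8 = dict(scale=1 / 256.0, zero_point=0, qmin=0, qmax=255)
INT8 = dict(scale=1 / 256.0, zero_point=-128, qmin=-128, qmax=127)

probs = [0.0, 0.25, 0.5, 1.0]
u8 = [quantize(p, **UINT8) for p in probs]  # [0, 64, 128, 255]
s8 = [quantize(p, **INT8) for p in probs]   # [-128, -64, 0, 127]

# Largest representable probability: 255 / 256, so 1.0 saturates by one step.
top = dequantize(255, 1 / 256.0, 0)
```

With these parameters the full 8-bit range maps onto [0, 255/256], so no calibration is needed for the softmax output and only an exact 1.0 loses one quantization step.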
@ -340,7 +340,10 @@ class QDQQuantizer(ONNXQuantizer):
|
|||
if initializer:
|
||||
self._add_qdq_pair_for_weight(initializer, tensor_info.axis)
|
||||
else:
|
||||
- data_found, scale_name, zp_name, _, _ = self._get_quantization_params(tensor_name)
+ used_scale, used_zp = self.find_quant_scale_zp(tensor_name)
+ data_found, scale_name, zp_name, _, _ = self._get_quantization_params(
+     tensor_name, used_scale, used_zp
+ )
|
||||
|
||||
if not data_found:
|
||||
raise ValueError(
|
||||
|
|
|
|||
|
|
@ -17,6 +17,7 @@ from .operators.pad import QPad
|
|||
from .operators.pooling import QLinearPool
|
||||
from .operators.qdq_base_operator import QDQOperatorBase
|
||||
from .operators.resize import QDQResize, QResize
|
||||
from .operators.softmax import QDQSoftmax, QLinearSoftmax
|
||||
from .operators.split import QDQSplit, QSplit
|
||||
from .quant_utils import QuantizationMode
|
||||
|
||||
|
|
@ -55,6 +56,7 @@ QLinearOpsRegistry = {
|
|||
"Resize": QResize,
|
||||
"AveragePool": QLinearPool,
|
||||
"Concat": QLinearConcat,
|
||||
"Softmax": QLinearSoftmax,
|
||||
}
|
||||
QLinearOpsRegistry.update(CommonOpsRegistry)
|
||||
|
||||
|
|
@ -73,6 +75,7 @@ QDQRegistry = {
|
|||
"MatMul": QDQMatMul,
|
||||
"Split": QDQSplit,
|
||||
"Gather": QDQGather,
|
||||
"Softmax": QDQSoftmax,
|
||||
}
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -115,5 +115,153 @@ TEST(QLinearLookupTableBasedOperatorTests, QLinearSigmoid_UInt8_0_Y_ZP) {
|
|||
run_test(true);
|
||||
}
|
||||
|
||||
/*
|
||||
\brief Test data is generated by a PyTorch script.
\details The model is defined as
|
||||
```
|
||||
input(int8/uint8)
|
||||
x = self.dequant(x)
|
||||
x = self.softmax(x)
|
||||
x = self.quant2(x)
|
||||
output(int8/uint8)
|
||||
```
|
||||
\see the PyTorch quantization [DOC](https://pytorch.org/docs/stable/quantization.html)
|
||||
*/
|
||||
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_UInt8_v12) {
|
||||
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
|
||||
test.AddAttribute<int64_t>("axis", -2);
|
||||
test.AddAttribute<int64_t>("opset", 12);
|
||||
float X_scale = 0.166099221f;
|
||||
//
|
||||
uint8_t X_zero_point = 128;
|
||||
float Y_scale = 1.0f / 256.0f;
|
||||
uint8_t Y_zero_point = 0;
|
||||
//
|
||||
|
||||
std::vector<int64_t> dims = {2, 4, 5};
|
||||
auto x_in = std::vector<uint8_t>{50, 67, 58, 68, 46, 69, 77, 91, 62, 74, 67, 72, 71, 70, 83, 88, 75, 54, 74, 88};
|
||||
auto y_out = std::vector<uint8_t> { 0, 2, 0, 2, 0, 2, 8, 86, 1, 5, 2, 4, 3, 3, 23, 52, 6, 0, 5, 52 };
|
||||
for (int64_t i = 1; i < dims[0]; i++) {
|
||||
for (int64_t j = 0; j < dims[1] * dims[2]; j++) {
|
||||
x_in.push_back(x_in[j]);
|
||||
y_out.push_back(y_out[j]);
|
||||
}
|
||||
}
|
||||
|
||||
test.AddInput<uint8_t>("X", dims, x_in);
|
||||
test.AddInput<float>("X_scale", {}, {X_scale});
|
||||
test.AddInput<uint8_t>("X_zero_point", {}, {X_zero_point});
|
||||
test.AddInput<float>("Y_scale", {}, {Y_scale});
|
||||
test.AddInput<uint8_t>("Y_zero_point", {}, {Y_zero_point});
|
||||
test.AddOutput<uint8_t>("Y", dims, y_out);
|
||||
auto origin_round_mode = std::fegetround();
|
||||
std::fesetround(FE_TONEAREST);
|
||||
test.Run();
|
||||
std::fesetround(origin_round_mode);
|
||||
}
|
||||
|
||||
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_UInt8_v13) {
|
||||
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
|
||||
test.AddAttribute<int64_t>("axis", -2);
|
||||
test.AddAttribute<int64_t>("opset", 13);
|
||||
float X_scale = 0.0304f;
|
||||
//
|
||||
uint8_t X_zero_point = 128;
|
||||
float Y_scale = 0.0059f;
|
||||
uint8_t Y_zero_point = 0;
|
||||
//
|
||||
|
||||
std::vector<int64_t> dims = {4, 4, 4};
|
||||
auto x_in = std::vector<uint8_t> {
|
||||
62, 50, 71, 37, 68, 88, 64, 51, 59, 95, 41, 54, 55, 20, 77, 32, 92,
|
||||
63, 43, 13, 76, 82, 53, 43, 60, 18, 73, 74, 22, 89, 44, 106, 17,
|
||||
95, 27, 35, 47, 57, 0, 78, 97, 66, 56, 28, 127, 33, 106, 71, 119,
|
||||
64, 16, 0, 16, 79, 27, 89, 110, 126, 88, 90, 67, 11, 4, 90};
|
||||
auto y_out = std::vector<uint8_t> {
|
||||
43, 20, 50, 33, 52, 63, 40, 51, 39, 78,
|
||||
20, 56, 35, 8, 59, 29, 80, 32, 29, 6, 49, 57, 39, 16, 30, 8, 72, 40,
|
||||
10, 71, 30, 107, 4, 90, 11, 20, 10, 28, 5, 74, 45, 37, 27, 16, 111, 14,
|
||||
125, 59, 84, 18, 14, 4, 4, 28, 20, 54, 64, 119, 126, 56, 17, 4, 10, 56};
|
||||
|
||||
test.AddInput<uint8_t>("X", dims, x_in);
|
||||
test.AddInput<float>("X_scale", {}, {X_scale});
|
||||
test.AddInput<uint8_t>("X_zero_point", {}, {X_zero_point});
|
||||
test.AddInput<float>("Y_scale", {}, {Y_scale});
|
||||
test.AddInput<uint8_t>("Y_zero_point", {}, {Y_zero_point});
|
||||
test.AddOutput<uint8_t>("Y", dims, y_out);
|
||||
auto origin_round_mode = std::fegetround();
|
||||
std::fesetround(FE_TONEAREST);
|
||||
test.Run();
|
||||
std::fesetround(origin_round_mode);
|
||||
}
|
||||
|
||||
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_Int8_v13) {
|
||||
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
|
||||
test.AddAttribute<int64_t>("axis", -2);
|
||||
test.AddAttribute<int64_t>("opset", 13);
|
||||
float X_scale = 0.0304F;
|
||||
//
|
||||
int8_t X_zero_point = 0;
|
||||
float Y_scale = 0.0059F;
|
||||
int8_t Y_zero_point = -128;
|
||||
//
|
||||
|
||||
std::vector<int64_t> dims = {4, 4, 4};
|
||||
auto x_in = std::vector<int8_t> {
|
||||
-4, -16, 5, -29, 2, 22, -2, -15, -7, 29, -25, -12, -11, -46, 11, -34, 26,
|
||||
-3, -23, -53, 10, 16, -13, -23, -6, -48, 7, 8, -44, 23, -22, 40, -49, 29, -39, -31, -19, -9,
|
||||
-72, 12, 31, 0, -10, -38, 61, -33, 40, 5, 53, -2, -50, -66, -50, 13, -39, 23, 44, 60, 22, 24,
|
||||
1, -55, -62, 24};
|
||||
auto y_out = std::vector<int8_t> {
|
||||
-85, -108, -78, -95, -76, -65, -88, -77, -89, -50, -108, -72, -93,
|
||||
-120, -69, -99, -48, -96, -99, -122, -79, -71, -89, -112, -98, -120, -56, -88, -118, -57, -98,
|
||||
-21, -124, -38, -117, -108, -118, -100, -124, -54, -83, -91, -100, -112, -17, -114, -2, -69, -44,
|
||||
-110, -114, -124, -124, -100, -108, -74, -64, -9, -2, -72, -111, -124, -118, -72};
|
||||
|
||||
test.AddInput<int8_t>("X", dims, x_in);
|
||||
test.AddInput<float>("X_scale", {}, {X_scale});
|
||||
test.AddInput<int8_t>("X_zero_point", {}, {X_zero_point});
|
||||
test.AddInput<float>("Y_scale", {}, {Y_scale});
|
||||
test.AddInput<int8_t>("Y_zero_point", {}, {Y_zero_point});
|
||||
test.AddOutput<int8_t>("Y", dims, y_out);
|
||||
auto origin_round_mode = std::fegetround();
|
||||
std::fesetround(FE_TONEAREST);
|
||||
test.Run();
|
||||
std::fesetround(origin_round_mode);
|
||||
}
|
||||
|
||||
TEST(QLinearLookupTableBasedOperatorTests, QLinearSoftmax_Int8_v12) {
|
||||
OpTester test("QLinearSoftmax", 1, onnxruntime::kMSDomain);
|
||||
test.AddAttribute<int64_t>("axis", -2);
|
||||
test.AddAttribute<int64_t>("opset", 12);
|
||||
float X_scale = 0.166099221f;
|
||||
//
|
||||
int8_t X_zero_point = 0;
|
||||
float Y_scale = 1.0f / 128.0f;
|
||||
int8_t Y_zero_point = 0;
|
||||
//
|
||||
|
||||
std::vector<int64_t> dims = {2, 4, 5};
|
||||
auto x_in = std::vector<int8_t>{-28, -4, -4, -7, 3, -26, 4, -16, 23, 14, -7, 26, -8, 19, -16, -13, 7, 17, 27, 5};
|
||||
auto y_out = std::vector<int8_t>{0, 0, 0, 0, 1, 0, 1, 0, 22, 5, 0, 35, 0, 11, 0, 0, 2, 8, 42, 1};
|
||||
for (int64_t i = 1; i < dims[0]; i++) {
|
||||
for (int64_t j = 0; j < dims[1] * dims[2]; j++) {
|
||||
x_in.push_back(x_in[j]);
|
||||
y_out.push_back(y_out[j]);
|
||||
}
|
||||
}
|
||||
|
||||
test.AddInput<int8_t>("X", dims, x_in);
|
||||
test.AddInput<float>("X_scale", {}, {X_scale});
|
||||
test.AddInput<int8_t>("X_zero_point", {}, {X_zero_point});
|
||||
test.AddInput<float>("Y_scale", {}, {Y_scale});
|
||||
test.AddInput<int8_t>("Y_zero_point", {}, {Y_zero_point});
|
||||
test.AddOutput<int8_t>("Y", dims, y_out);
|
||||
auto origin_round_mode = std::fegetround();
|
||||
std::fesetround(FE_TONEAREST);
|
||||
test.Run();
|
||||
std::fesetround(origin_round_mode);
|
||||
}
|
||||
|
||||
} // namespace test
|
||||
} // namespace onnxruntime
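The tests above exercise `f(x) = quantize(Softmax(dequantize(x)))` with the `opset` attribute set to 12 and 13. A pure-Python reference sketch of that computation, assuming the standard ONNX semantics: before opset 13, Softmax flattens the input to 2-D at `axis` and normalizes each row; from opset 13 on, it normalizes along `axis` only. `qlinear_softmax_ref` is a hypothetical helper written for illustration, not the kernel's lookup-table implementation.

```python
import math


def qlinear_softmax_ref(x, shape, axis, opset, x_scale, x_zp, y_scale, y_zp, qmin, qmax):
    """Reference for quantize(softmax(dequantize(x))) on a flat, row-major list."""
    axis = axis % len(shape)
    outer = math.prod(shape[:axis])
    n = shape[axis]
    inner = math.prod(shape[axis + 1:])
    if opset < 13:
        # Flattened 2-D view: each run of n * inner elements is one softmax group.
        groups = [range(o * n * inner, (o + 1) * n * inner) for o in range(outer)]
    else:
        # Single-axis view: stride through `axis` for every (outer, inner) position.
        groups = [range(o * n * inner + i, (o + 1) * n * inner, inner)
                  for o in range(outer) for i in range(inner)]
    y = [0] * len(x)
    for group in groups:
        real = [(x[k] - x_zp) * x_scale for k in group]
        peak = max(real)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in real]
        total = sum(exps)
        for k, e in zip(group, exps):
            q = round(e / total / y_scale) + y_zp  # round half to even
            y[k] = max(qmin, min(qmax, q))
    return y


# A constant 2x2 input makes the opset difference visible.
common = dict(x=[128] * 4, shape=[2, 2], axis=-2, x_scale=0.1, x_zp=128,
              y_scale=1 / 256.0, y_zp=0, qmin=0, qmax=255)
y_v12 = qlinear_softmax_ref(opset=12, **common)  # one group of 4 -> [64, 64, 64, 64]
y_v13 = qlinear_softmax_ref(opset=13, **common)  # groups of 2   -> [128, 128, 128, 128]

# Data copied from the QLinearSoftmax_UInt8_v12 test above (dims {2, 4, 5}, axis -2).
x20 = [50, 67, 58, 68, 46, 69, 77, 91, 62, 74, 67, 72, 71, 70, 83, 88, 75, 54, 74, 88]
y20 = [0, 2, 0, 2, 0, 2, 8, 86, 1, 5, 2, 4, 3, 3, 23, 52, 6, 0, 5, 52]
y_test = qlinear_softmax_ref(x20 * 2, [2, 4, 5], -2, 12, 0.166099221, 128,
                             1 / 256.0, 0, 0, 255)
```

Under these assumptions `y_test` matches the expected `y_out` of the uint8 opset-12 test, which is a quick way to see why the kernel needs the original model's opset passed in as an attribute.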
|
||||
|
|
|
|||
|
|
@ -1857,6 +1857,67 @@ TEST(QDQTransformerTests, Concat) {
|
|||
test_case({{1, 6, 36}, {1, 6, 8}, {1, 6, 2}}, 2, false, false, true);
|
||||
}
|
||||
|
||||
template <typename InputType, typename OutputType>
|
||||
void QDQTransformerSoftmaxTests() {
|
||||
auto test_case = [&](const std::vector<int64_t>& input_shape, int64_t axis) {
|
||||
auto build_test_case = [&](ModelTestBuilder& builder) {
|
||||
auto* input_arg = builder.MakeInput<float>(input_shape, -5.f, 5.f);
|
||||
auto* output_arg = builder.MakeOutput();
|
||||
// add QDQ + Softmax
|
||||
auto* dq_output = AddQDQNodePair<InputType>(builder, input_arg, .105f,
|
||||
(std::numeric_limits<OutputType>::max() / 255 * 255) / 2);
|
||||
auto* softmax_output = builder.MakeIntermediate();
|
||||
auto& softmax_node = builder.AddNode("Softmax", {dq_output}, {softmax_output});
|
||||
softmax_node.AddAttribute("axis", axis);
|
||||
// add QDQ output
|
||||
auto* q_output = builder.MakeIntermediate();
|
||||
builder.AddQuantizeLinearNode<OutputType>(softmax_output,
|
||||
1.0f / (std::numeric_limits<OutputType>::max() + 1),
|
||||
0,
|
||||
q_output);
|
||||
builder.AddDequantizeLinearNode<OutputType>(q_output,
|
||||
1.0f / (std::numeric_limits<OutputType>::max() + 1),
|
||||
0,
|
||||
output_arg);
|
||||
};
|
||||
|
||||
auto check_graph = [&](InferenceSessionWrapper& session) {
|
||||
auto op_to_count = CountOpsInGraph(session.GetGraph());
|
||||
if constexpr (std::is_same<InputType, OutputType>::value) {
|
||||
EXPECT_EQ(op_to_count["com.microsoft.QLinearSoftmax"], 1);
|
||||
EXPECT_EQ(op_to_count["Softmax"], 0);
|
||||
EXPECT_EQ(op_to_count["QuantizeLinear"], 1);
|
||||
EXPECT_EQ(op_to_count["DequantizeLinear"], 1);
|
||||
} else {
|
||||
EXPECT_EQ(op_to_count["com.microsoft.QLinearSoftmax"], 0);
|
||||
EXPECT_EQ(op_to_count["Softmax"], 1);
|
||||
EXPECT_EQ(op_to_count["QuantizeLinear"], 2);
|
||||
EXPECT_EQ(op_to_count["DequantizeLinear"], 2);
|
||||
}
|
||||
};
|
||||
|
||||
TransformerTester(build_test_case,
|
||||
check_graph,
|
||||
TransformerLevel::Level1,
|
||||
TransformerLevel::Level2,
|
||||
12 /*opset_version*/,
|
||||
0.01 /*per_sample_tolerance*/,
|
||||
0.01 /*relative_per_sample_tolerance*/,
|
||||
std::make_unique<QDQSelectorActionTransformer>(QDQIsInt8Allowed()));
|
||||
};
|
||||
|
||||
test_case({1, 12, 37}, -1);
|
||||
test_case({1, 23, 13, 13}, -2);
|
||||
}
|
||||
|
||||
TEST(QDQTransformerTests, Softmax_S8S8) {
|
||||
QDQTransformerSoftmaxTests<int8_t, int8_t>();
|
||||
}
|
||||
|
||||
TEST(QDQTransformerTests, Softmax_U8U8) {
|
||||
QDQTransformerSoftmaxTests<uint8_t, uint8_t>();
|
||||
}
|
||||
|
||||
#endif // !defined(DISABLE_CONTRIB_OPS)
|
||||
|
||||
TEST(QDQTransformerTests, QDQPropagation_QBackward) {
|
||||
|
|
|
|||
|
|
@ -74,6 +74,8 @@ def check_model_correctness(testcase, model_path_origin, model_path_to_check, in
|
|||
model_path_origin, sess_options=sess_options, providers=["CPUExecutionProvider"]
|
||||
)
|
||||
origin_results = origin_sess.run([], inputs)
|
||||
# enable QDQ transformers
|
||||
# sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
|
||||
target_sess = onnxruntime.InferenceSession(
|
||||
model_path_to_check,
|
||||
sess_options=sess_options,
|
||||
|
|
|
|||
180
onnxruntime/test/python/quantization/test_op_softmax.py
Normal file
|
|
@ -0,0 +1,180 @@
|
|||
#!/usr/bin/env python
|
||||
"""
|
||||
Softmax quantization test case
|
||||
"""
|
||||
# coding: utf-8
|
||||
# -------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License. See License.txt in the project root for
|
||||
# license information.
|
||||
# --------------------------------------------------------------------------
|
||||
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
import onnx
|
||||
from onnx import TensorProto, helper
|
||||
from op_test_utils import TestDataFeeds, check_model_correctness, check_op_type_count, check_qtype_by_node_type
|
||||
|
||||
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static
|
||||
|
||||
|
||||
class TestOpSoftmax(unittest.TestCase):
|
||||
"""Unit tests for Softmax quantization in QDQ and QOperator formats with u8 and s8 types."""
|
||||
|
||||
def input_feeds(self, n_repeat, name2shape):
|
||||
input_data_list = []
|
||||
for _ in range(n_repeat):
|
||||
inputs = {}
|
||||
for name, shape in name2shape.items():
|
||||
inputs.update({name: np.random.randint(-1, 2, shape).astype(np.float32)})
|
||||
input_data_list.extend([inputs])
|
||||
data_r = TestDataFeeds(input_data_list)
|
||||
return data_r
|
||||
|
||||
def construct_model_conv_softmax(
|
||||
self,
|
||||
output_model_path,
|
||||
conv_input_shape,
|
||||
conv_weight_shape,
|
||||
softmax_input_shape,
|
||||
softmax_attributes,
|
||||
output_shape,
|
||||
):
|
||||
# (input)
|
||||
# \
|
||||
# Conv
|
||||
# / \
|
||||
# Identity Softmax
|
||||
# / \
|
||||
# (identity_out) (output)
|
||||
input_tensor = helper.make_tensor_value_info("input", TensorProto.FLOAT, conv_input_shape)
|
||||
|
||||
conv_weight_arr = np.random.randint(-1, 2, conv_weight_shape).astype(np.float32)
|
||||
conv_weight_initializer = onnx.numpy_helper.from_array(conv_weight_arr, name="conv1_weight")
|
||||
conv_node = onnx.helper.make_node("Conv", ["input", "conv1_weight"], ["conv_output"], name="conv_node")
|
||||
|
||||
identity_out = helper.make_tensor_value_info("identity_out", TensorProto.FLOAT, softmax_input_shape)
|
||||
identity_node = helper.make_node("Identity", ["conv_output"], ["identity_out"], name="IdentityNode")
|
||||
|
||||
initializers = [conv_weight_initializer]
|
||||
|
||||
output_tensor = helper.make_tensor_value_info("output", TensorProto.FLOAT, output_shape)
|
||||
softmax_node = helper.make_node(
|
||||
"Softmax", ["conv_output"], ["output"], name="softmax_node", **softmax_attributes
|
||||
)
|
||||
|
||||
graph = helper.make_graph(
|
||||
[conv_node, identity_node, softmax_node],
|
||||
"TestOpQuantizersoftmax_test_model",
|
||||
[input_tensor],
|
||||
[identity_out, output_tensor],
|
||||
initializer=initializers,
|
||||
)
|
||||
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
|
||||
model.ir_version = 7 # use stable onnx ir version
|
||||
onnx.save(model, output_model_path)
|
||||
|
||||
def quantize_softmax_test(self, activation_type, weight_type, extra_options=None):
|
||||
np.random.seed(1)
|
||||
model_fp32_path = "softmax_fp32.onnx"
|
||||
self.construct_model_conv_softmax(
|
||||
model_fp32_path,
|
||||
[1, 2, 26, 42],
|
||||
[3, 2, 3, 3],
|
||||
[1, 3, 24, 40],
|
||||
{"axis": -2},
|
||||
[1, 3, 24, 40],
|
||||
)
|
||||
data_reader = self.input_feeds(1, {"input": [1, 2, 26, 42]})
|
||||
|
||||
activation_proto_qtype = TensorProto.UINT8 if activation_type == QuantType.QUInt8 else TensorProto.INT8
|
||||
activation_type_str = "u8" if (activation_type == QuantType.QUInt8) else "s8"
|
||||
weight_type_str = "u8" if (weight_type == QuantType.QUInt8) else "s8"
|
||||
model_q8_path = f"softmax_{activation_type_str}{weight_type_str}.onnx"
|
||||
model_q8_qdq_path = f"softmax_qdq_{activation_type_str}{weight_type_str}.onnx"
|
||||
|
||||
# Verify QOperator mode
|
||||
data_reader.rewind()
|
||||
quantize_static(
|
||||
model_fp32_path,
|
||||
model_q8_path,
|
||||
data_reader,
|
||||
quant_format=QuantFormat.QOperator,
|
||||
activation_type=activation_type,
|
||||
weight_type=weight_type,
|
||||
extra_options=extra_options,
|
||||
)
|
||||
qnode_counts = {
|
||||
"QLinearConv": 1,
|
||||
"QuantizeLinear": 1,
|
||||
"DequantizeLinear": 2,
|
||||
"QLinearSoftmax": 1,
|
||||
"Softmax": 0,
|
||||
}
|
||||
check_op_type_count(self, model_q8_path, **qnode_counts)
|
||||
qnode_io_qtypes = {
|
||||
"QuantizeLinear": [
|
||||
["i", 2, activation_proto_qtype],
|
||||
["o", 0, activation_proto_qtype],
|
||||
]
|
||||
}
|
||||
qnode_io_qtypes.update(
|
||||
{
|
||||
"QLinearConv": [
|
||||
["i", 2, activation_proto_qtype],
|
||||
["i", 7, activation_proto_qtype],
|
||||
["o", 0, activation_proto_qtype],
|
||||
]
|
||||
}
|
||||
)
|
||||
qnode_io_qtypes.update(
|
||||
{"QLinearSoftmax": [["i", 4, activation_proto_qtype]]}
|
||||
)  # shape info does not work on custom ops
|
||||
check_qtype_by_node_type(self, model_q8_path, qnode_io_qtypes)
|
||||
data_reader.rewind()
|
||||
check_model_correctness(self, model_fp32_path, model_q8_path, data_reader.get_next())
|
||||
|
||||
# Verify QDQ mode
|
||||
data_reader.rewind()
|
||||
quantize_static(
|
||||
model_fp32_path,
|
||||
model_q8_qdq_path,
|
||||
data_reader,
|
||||
quant_format=QuantFormat.QDQ,
|
||||
activation_type=activation_type,
|
||||
weight_type=weight_type,
|
||||
extra_options=extra_options,
|
||||
)
|
||||
qdqnode_counts = {
|
||||
"Conv": 1,
|
||||
"QuantizeLinear": 3,
|
||||
"DequantizeLinear": 4,
|
||||
"Softmax": 1,
|
||||
}
|
||||
check_op_type_count(self, model_q8_qdq_path, **qdqnode_counts)
|
||||
qnode_io_qtypes = {
|
||||
"QuantizeLinear": [
|
||||
["i", 2, activation_proto_qtype],
|
||||
["o", 0, activation_proto_qtype],
|
||||
]
|
||||
}
|
||||
check_qtype_by_node_type(self, model_q8_qdq_path, qnode_io_qtypes)
|
||||
data_reader.rewind()
|
||||
check_model_correctness(self, model_fp32_path, model_q8_qdq_path, data_reader.get_next())
|
||||
|
||||
def test_quantize_softmax(self):
|
||||
self.quantize_softmax_test(QuantType.QUInt8, QuantType.QUInt8)
|
||||
|
||||
def test_quantize_softmax_s8s8(self):
|
||||
self.quantize_softmax_test(
|
||||
QuantType.QInt8,
|
||||
QuantType.QInt8,
|
||||
extra_options={"ActivationSymmetric": True},
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
|
|
@ -298,5 +298,9 @@
|
|||
[
|
||||
"QGemm com.microsoft CPUExecutionProvider",
|
||||
13737193491843065240
|
||||
],
|
||||
[
|
||||
"QLinearSoftmax com.microsoft CPUExecutionProvider",
|
||||
10339195975968977840
|
||||
]
|
||||
]
|
||||
]
|
||||
|
|
|
|||