//
// CommonOptFunction.cpp
// MNN
//
// Created by MNN on 2018/09/06.
// Copyright © 2018, Alibaba Group Holding Limited
//
#include "CommonOptFunction.h"
#include "ConvOpt.h"
#include "WinogradOptFunction.hpp"
#include "Int8FunctionsOpt.h"
#include "ImageProcessFunction.hpp"
#include <string.h>
#include <algorithm>
#include <cmath>
#include <math.h>
#include "math/Vec.hpp"
#include <vector>
#include "../CPURuntime.hpp"
#include "core/MemoryFormater.h"
// TODO: Find better way to optimize it
#include "../CPUBinary.hpp"
#include "../CPUUnary.hpp"
#include "../CPUPool.hpp"
#define PACK 4
#define FLOAT float
using Vec = MNN::Math::Vec<float, 4>;
#include "../GridSampler.hpp"
#ifdef MNN_LOW_MEMORY
#ifdef __aarch64__
#include "backend/cpu/arm/arm64/low_memory/MNNDynamicQuantFunctions.hpp"
#endif
#endif
#ifndef MNN_USE_SSE
void MNNInt8ToInt16(int16_t* dest, const int8_t* source, size_t count) {
// Should not be called
MNN_ASSERT(false);
}
#endif
#ifndef __aarch64__
#ifdef MNN_CPU_WEIGHT_DEQUANT_GEMM
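// Scalar reference GEMM for weight-dequantized int4 B: C = A * dequant(B),
// where every byte of B packs two 4-bit weights and dequant(w) = w * k + b
// per output channel. parameter[] layout, as read below: [0] aStride in
// bytes (Remain variants), [1] l, [2] h, [3] cStride in bytes, [4] hRemain,
// [5] bExtraStride in bytes, [6] blockId; for blockId > 0 the tile
// accumulates onto partial results already stored in C (block-wise quant).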
static void _MNNPackedMatMulRemain_int4(float* C, const float* A, const float* fB, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, int aStride, const float* k, const float* b) {
auto B = reinterpret_cast<const uint8_t*>(fB);
auto h = parameter[2];
auto l = parameter[1];
auto cStride = parameter[3] / sizeof(float);
auto hRemain = parameter[4];
float weightBytes = 0.5; // sizeof(int4_t)
auto bExtraStride = static_cast<int32_t>(parameter[5] / weightBytes);
auto bStride = bExtraStride + 4 * l;
auto hC4 = UP_DIV(h, 4);
float minValue = -std::numeric_limits<float>().max();
float maxValue = std::numeric_limits<float>().max();
if (nullptr != postParameters) {
minValue = postParameters[2];
maxValue = postParameters[3];
}
int blockId = parameter[6];
for (int x=0; x<eSize; ++x) {
auto dst = C + 4 * x;
auto src = A + x;
for (int y=0; y<hC4; ++y) {
auto dstY = dst + y * cStride;
auto weight = B + y * bStride / 2;
auto alpha = k + y * 4;
auto qbias = b + y * 4;
float summer[4] = {
0.0f,
0.0f,
0.0f,
0.0f,
};
if (blockId > 0) {
summer[0] = dstY[0];
summer[1] = dstY[1];
summer[2] = dstY[2];
summer[3] = dstY[3];
}
if (nullptr != bias && nullptr != postParameters) {
for (int v=0; v<4; ++v) {
summer[v] += bias[4 * y + v];
}
}
for (int z=0; z<l; ++z) {
auto aZ = src + z * aStride;
auto i4wZ = weight + z * 2;
float wZ[4];
{
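// Each byte holds two unsigned 4-bit weights: high nibble (w / 16) first,
// low nibble (w % 16) second; e.g. 0xA3 decodes to 10 and 3. Any signed
// zero point is assumed to be folded into the dequant bias `b` (qbias).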
auto w01 = i4wZ[0];
auto w23 = i4wZ[1];
int iw01 = w01;
int iw23 = w23;
int iw0 = iw01 / 16;
int iw1 = iw01 % 16;
int iw2 = iw23 / 16;
int iw3 = iw23 % 16;
wZ[0] = iw0 * alpha[0] + qbias[0];
wZ[1] = iw1 * alpha[1] + qbias[1];
wZ[2] = iw2 * alpha[2] + qbias[2];
wZ[3] = iw3 * alpha[3] + qbias[3];
}
summer[0] += wZ[0] * aZ[0];
summer[1] += wZ[1] * aZ[0];
summer[2] += wZ[2] * aZ[0];
summer[3] += wZ[3] * aZ[0];
}
for (int v=0; v<4; ++v) {
auto dstValue = std::min(summer[v], maxValue);
dstValue = std::max(dstValue, minValue);
dstY[v] = dstValue;
}
}
}
}
static void _MNNPackedMatMulRemain_int8(float* C, const float* A, const float* fB, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, int aStride, const float* k, const float* b) {
auto B = reinterpret_cast<const int8_t*>(fB);
auto h = parameter[2];
auto l = parameter[1];
auto cStride = parameter[3] / sizeof(float);
auto hRemain = parameter[4];
float weightBytes = 1; // sizeof(int8_t)
auto bExtraStride = static_cast<int32_t>(parameter[5] / weightBytes);
auto bStride = bExtraStride + 4 * l;
auto hC4 = UP_DIV(h, 4);
float minValue = -std::numeric_limits<float>().max();
float maxValue = std::numeric_limits<float>().max();
if (nullptr != postParameters) {
minValue = postParameters[2];
maxValue = postParameters[3];
}
int blockId = parameter[6];
for (int x=0; x<eSize; ++x) {
auto dst = C + 4 * x;
auto src = A + x;
for (int y=0; y<hC4; ++y) {
auto dstY = dst + y * cStride;
auto weight = B + y * bStride;
auto alpha = k + y * 4;
auto qbias = b + y * 4;
float summer[4] = {
0.0f,
0.0f,
0.0f,
0.0f,
};
if (blockId > 0) {
summer[0] = dstY[0];
summer[1] = dstY[1];
summer[2] = dstY[2];
summer[3] = dstY[3];
}
if (nullptr != bias && nullptr != postParameters) {
for (int v=0; v<4; ++v) {
summer[v] += bias[4 * y + v];
}
}
for (int z=0; z<l; ++z) {
auto aZ = src + z * aStride;
auto i8wZ = weight + z * 4;
float wZ[4];
{
wZ[0] = i8wZ[0] * alpha[0] + qbias[0];
wZ[1] = i8wZ[1] * alpha[1] + qbias[1];
wZ[2] = i8wZ[2] * alpha[2] + qbias[2];
wZ[3] = i8wZ[3] * alpha[3] + qbias[3];
}
summer[0] += wZ[0] * aZ[0];
summer[1] += wZ[1] * aZ[0];
summer[2] += wZ[2] * aZ[0];
summer[3] += wZ[3] * aZ[0];
}
for (int v=0; v<4; ++v) {
auto dstValue = std::min(summer[v], maxValue);
dstValue = std::max(dstValue, minValue);
dstY[v] = dstValue;
}
}
}
}
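// Exported wrappers: the full-tile variants hard-code eSize = aStride = 16
// (the eP reported by MNNGetMatMulPackMode below); the *Remain variants
// handle tail columns, taking aStride from parameter[0].
// Hypothetical call sketch (names and values illustrative only):
//   size_t parameter[7] = {aStrideBytes, l, h, cStrideBytes, hRemain, bExtraBytes, 0};
//   MNNPackedMatMul_int4(C, A, B, parameter, post, bias, alpha, qbias);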
void MNNPackedMatMul_int4(float* C, const float* A, const float* B, const size_t* parameter, const float* postParameters, const float* bias, const float* k, const float* b) {
_MNNPackedMatMulRemain_int4(C, A, B, 16, parameter, postParameters, bias, 16, k, b);
}
void MNNPackedMatMulRemain_int4(float* C, const float* A, const float* B, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, const float* k, const float* b) {
auto aStride = parameter[0] / sizeof(float);
_MNNPackedMatMulRemain_int4(C, A, B, eSize, parameter, postParameters, bias, aStride, k, b);
}
void MNNPackedMatMul_int8(float* C, const float* A, const float* B, const size_t* parameter, const float* postParameters, const float* bias, const float* k, const float* b) {
_MNNPackedMatMulRemain_int8(C, A, B, 16, parameter, postParameters, bias, 16, k, b);
}
void MNNPackedMatMulRemain_int8(float* C, const float* A, const float* B, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, const float* k, const float* b) {
auto aStride = parameter[0] / sizeof(float);
_MNNPackedMatMulRemain_int8(C, A, B, eSize, parameter, postParameters, bias, aStride, k, b);
}
#endif // MNN_CPU_WEIGHT_DEQUANT_GEMM
#ifdef MNN_LOW_MEMORY
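// Reduce per-thread absolute maxima (absmax laid out as [thread, batch])
// into one symmetric int8 scale pair per batch element:
// quant_scale = 127 / absmax and dequant_scale = absmax / 127, with a
// fallback of 1.0 when absmax is effectively zero.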
void MNNQuantScaleFP32(float* absmax, float* quant_scale, float* dequant_scale, size_t thread, size_t batch) {
for (int i = 0; i < batch; ++i) {
auto absmaxPtr = absmax + i;
float absVal = 0.f;
for (int t = 0; t < thread; ++t) {
absVal = std::max(absVal, absmaxPtr[t * batch]);
}
if (absVal < 1e-7) {
quant_scale[i] = 1.f;
dequant_scale[i] = 1.f;
} else {
quant_scale[i] = 127.0f / absVal;
dequant_scale[i] = absVal / 127.0f;
}
}
}
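// Fold the dynamic-quant input zero point into the convolution bias:
// newbias[c] = oldbias[c] + weightKernelSum[c] * inputBias[0] for each of
// the ocQuad * 4 output channels.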
void MNNDynamicUpdateConvBiasScale(float* newbias, float* oldbias, float* weightKernelSum, float* inputBias, size_t ocQuad) {
int ocUp4 = 4 * ocQuad;
int pack = 4;
for (int i = 0; i < ocUp4; ++i) {
newbias[i] = oldbias[i] + weightKernelSum[i] * inputBias[0];
}
}
#endif // MNN_LOW_MEMORY
#endif // not __aarch64__
static void MNNCountMaxMinValue(const float* source, float* minVal, float* maxVal, size_t size) {
float max_ = source[0], min_ = source[0];
for (int i = 1; i < size; ++i) {
if (max_ < source[i]) {
max_ = source[i];
}
if (min_ > source[i]) {
min_ = source[i];
}
}
*minVal = min_;
*maxVal = max_;
}
#ifdef MNN_LOW_MEMORY
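// Column-wise absmax of a packed [src_depth_quad, realSize, pack] tensor,
// used to derive the dynamic-quant scales; on aarch64 the pack == 4 / 8
// cases dispatch to assembly kernels and the scalar loop is the fallback.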
static void MNNAbsMaxFP32(const float* source, float* absmax, size_t src_depth_quad, size_t realSize, int pack) {
#ifdef __aarch64__
if (pack == 4) {
MNNAbsMaxFP32_Pack4(source, absmax, src_depth_quad, realSize, pack);
return;
}
if (pack == 8) {
MNNAbsMaxFP32_Pack8(source, absmax, src_depth_quad, realSize, pack);
return;
}
#endif
// source: (ic/4, N, 4)
auto srcStep = pack * realSize;
for (int i = 0; i < realSize; ++i) {
float absmaxVal = 0.f; // absmaxVal>=0
for (int c = 0; c < src_depth_quad; ++c) {
auto src = source + c * srcStep + i * pack;
for (int k = 0; k < pack; ++k) {
absmaxVal = std::max(absmaxVal, std::abs(src[k]));
}
}
absmax[i] = absmaxVal;
}
}
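// Per-column symmetric dynamic quantization: dst = round(src * scale[i]).
// Under MNN_USE_SSE the value is offset by +128 and stored as uint8_t so
// the unsigned x86 int8 kernels can consume it; on aarch64 the pack == 4 / 8
// cases dispatch to the Pack4/Pack8 assembly kernels.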
void MNNDynamicQuantFP32(const float* src, int8_t* dst, const float* scale, size_t src_depth_quad, size_t realSize, int pack, const float* bias = nullptr) {
#ifdef __aarch64__
if (pack == 4) {
MNNDynamicQuantFP32_Pack4(src, dst, scale, src_depth_quad, realSize, nullptr, pack);
return;
}
if (pack == 8) {
MNNDynamicQuantFP32_Pack8(src, dst, scale, src_depth_quad, realSize, nullptr, pack);
return;
}
#endif
#ifdef MNN_USE_SSE
uint8_t* dstPtr = reinterpret_cast<uint8_t*>(dst);
int offset = 128;
#else
int8_t* dstPtr = dst;
int offset = 0;
#endif
for (int i = 0; i < realSize; ++i) {
auto scaleVal = scale[i];
for (int c = 0; c < src_depth_quad; ++c) {
auto srcZ = src + c * pack * realSize + i * pack;
auto dstZ = dstPtr + c * pack * realSize + i * pack;
for (int k = 0; k < pack; ++k) {
int val = (int)roundf(srcZ[k] * scaleVal);
dstZ[k] = val + offset;
}
}
}
}
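// Asymmetric dynamic quantization over [kernelsize, blockNum, blockLU, EP, LP]:
// qval = round(src * qscale + qbias), clamped to [-128, 127] (or stored
// unsigned with a +128 offset under MNN_USE_SSE). On aarch64, LP == 4 / 8
// route to the Pack4/Pack8 assembly kernels.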
static void MNNAsyQuantFunc(int8_t* dst, const float* src, float* qscale, float* qbias, const size_t* info) {
// input shape: [kernelsize, blockNum, blockLU, EP, LP]
auto blockNum = info[0];
auto EP = info[1]; // real area for data
auto LP = info[2]; // Innermost data layout, may come from backend's pack or gemmint8 units' SRC_UNIT
auto DST_XUNIT = info[3]; // backend gemmint8 units
auto SRC_UNIT = info[4];
auto kernelsize = info[5];
auto blockLU = info[6];
auto stride0 = blockNum * blockLU * EP * LP;
auto stride1 = blockLU * EP * LP;
int int8Max = 127;
int int8Min = -128;
// qscale&qbias [blockNum, EP]
#ifdef __aarch64__
if (LP == 4 || LP == 8) {
for (int k = 0; k < kernelsize; ++k) {
for (int i = 0; i < blockNum; ++i) {
if (LP == 4) {
MNNDynamicQuantFP32_Pack4(src + k * stride0 + i * stride1, dst + k * stride0 + i * stride1, qscale + i * EP, blockLU, EP, qbias + i * EP, LP);
}
if (LP == 8) {
MNNDynamicQuantFP32_Pack8(src + k * stride0 + i * stride1, dst + k * stride0 + i * stride1, qscale + i * EP, blockLU, EP, qbias + i * EP, LP);
}
}
}
return;
}
#endif
for (int i = 0; i < EP; ++i) {
for (int bk = 0; bk < blockNum; ++bk) {
float quant_scale = qscale[i + bk * EP];
float quant_bias = qbias[i + bk * EP];
for (int n = 0; n < kernelsize; ++n) {
for (int k = 0; k < blockLU; ++k) {
for (int j = 0; j < LP; ++j) {
int dataIndx = n * stride0 + bk * stride1 + k * EP * LP + i * LP + j;
float data_ = src[dataIndx];
int qval = static_cast<int32_t>(roundf(data_ * quant_scale + quant_bias));
#ifdef MNN_USE_SSE
((uint8_t*)dst)[dataIndx] = qval + 128;
#else
dst[dataIndx] = ALIMIN(int8Max, ALIMAX(int8Min, qval));
#endif
}
}
}
}
}
}
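// Derive asymmetric quant parameters from min/max. With info[7] == 1 a
// single (scale, bias) pair covers the whole tensor; otherwise min/max are
// collected per plane and per block, then qscale = 255 / range and
// qbias = roundf(-min * 255 / range) - 128, with the dequant pair written
// in the [EU, blockNum, step] layout described below.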
static void MNNAsyQuantInfo_FP32(float* scale, float* bias, float* qscale, float* qbias, float* dstMin, float* dstMax, const float* src, const size_t* info) {
auto blockNum = info[0];
auto plane = info[1]; // real area for data
auto innerSide = info[2]; // Innermost data layout, may come from backend's pack or gemmint8 units' SRC_UNIT
auto DST_XUNIT = info[3];
auto kernelsize = info[5];
auto blockLU = info[6];
auto stride0 = blockNum * blockLU * plane * innerSide;
auto stride1 = blockLU * plane * innerSide;
if (info[7] == 1) { // scale&bias:[1]
float maxval, minval;
MNNCountMaxMinValue(src, &minval, &maxval, kernelsize * stride0);
if (info[8] == 1 && (maxval - minval) > 1e-7) {
if (minval > 0.f) {
minval = 0;
} else if (maxval < 0.f){
maxval = 0;
}
}
auto range = maxval - minval;
if (range <= 1e-7) {
scale[0] = 0.f;
qscale[0] = 0.f;
qbias[0] = 0.f;
bias[0] = maxval;
} else {
qscale[0] = 255.f / range;
scale[0] = range / 255.f;
qbias[0] = roundf(-minval * 255.f / range) - 128.f;
bias[0] = -qbias[0] * scale[0];
}
return;
}
// input : [kernelsize, blockNum, blockLU, plane, pack]
// dequant scale/bias : [EU, blockNum, step], where EP = DST_XUNIT, EU = UP_DIV(plane, EP), step = ALIMIN(EP, plane - eu * EP)
// quant scale/bias : [blockNum, plane]
#ifdef __aarch64__
if (DST_XUNIT == 12 && innerSide == 4) { // Arm82,fp32: SRC_UNIT=4, core->pack=4
// max,min shape: [blockNum, EP]
for (int i = 0; i < kernelsize; ++i) {
MNNLocalMinMaxFP32_Pack4(dstMin, dstMax, src + i * stride0, blockNum, blockLU, plane, innerSide, i);
}
// scale, bias
bool success = MNNAsyLocalQuantInfo_EP12_FP32(scale, bias, qscale, qbias, dstMin, dstMax, info);
if (!success) {
MNN_ERROR("Call error for:MNNAsyLocalQuantInfo_EP12\n");
return;
}
return;
}
if (DST_XUNIT == 10) { // Arm86,fp32: SRC_UNIT=8,core->pack=4
// max,min shape: [blockNum, EP]
if (innerSide == 4) {
for (int i = 0; i < kernelsize; ++i) {
MNNLocalMinMaxFP32_Pack4(dstMin, dstMax, src + i * stride0, blockNum, blockLU, plane, innerSide, i);
}
}
if (innerSide == 8) {
for (int i = 0; i < kernelsize; ++i) {
MNNLocalMinMaxFP32_Pack8(dstMin, dstMax, src + i * stride0, blockNum, blockLU, plane, innerSide, i);
}
}
// scale, bias
bool success = MNNAsyLocalQuantInfo_EP10_FP32(scale, bias, qscale, qbias, dstMin, dstMax, info);
if (!success) {
MNN_ERROR("Call error for:MNNAsyLocalQuantInfo_EP10\n");
return;
}
return;
}
#endif
// max,min shape: [blockNum, plane]
for (int i = 0; i < plane; ++i) {
for (int bk = 0; bk < blockNum; ++bk) {
auto idx0 = i * innerSide + bk * stride1;
float max_ = src[idx0];
float min_ = max_;
for (int n = 0; n < kernelsize; ++n) {
for (int k = 0; k < blockLU; ++k) {
for (int j = 0; j < innerSide; ++j) {
auto dataIndx = idx0 + n * stride0 + k * (plane * innerSide) + j;
float data_ = src[dataIndx];
max_ = ALIMAX(max_, data_);
min_ = ALIMIN(min_, data_);
}
}
}
auto sindx = i + bk * plane;
dstMin[sindx] = min_;
dstMax[sindx] = max_;
}
}
// scale, bias
for (int i = 0; i < plane; ++i) {
auto step = ALIMIN(DST_XUNIT, plane - (i / DST_XUNIT) * DST_XUNIT);
auto sind0 = (i / DST_XUNIT) * DST_XUNIT * blockNum + (i % DST_XUNIT);
for (int k = 0; k < blockNum; ++k) {
auto sind = sind0 + k * step;
auto qind = i + k * plane;
auto max_ = dstMax[qind];
auto min_ = dstMin[qind];
if (fabs(max_ - min_) < 1e-7) {
qscale[qind] = 0.f;
qbias[qind] = 0.f;
scale[sind] = 0.f;
bias[sind] = max_;
} else {
qscale[qind] = 255.f / (max_ - min_);
qbias[qind] = roundf(-min_ * 255.f / (max_ - min_)) - 128.0f;
scale[sind] = (max_ - min_) / 255.f;
#ifndef MNN_USE_SSE
bias[sind] = min_ + (128.f / 255.f) * (max_ - min_);
#else
bias[sind] = min_;
#endif
}
}
}
}
#endif // MNN_LOW_MEMORY
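// Repack int4 weights from [oc, ic] into [hu, blocknum, lu, hp, lp], then
// reorder the nibbles inside each [hp, lp] tile into the order the GEMM
// kernel reads; kernelsum receives the per-output-channel sum of the raw
// 4-bit values (used by the asymmetric-input compensation above).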
static void MNNReorderWeightInt4(uint8_t* dest, const uint8_t* source, int32_t* shape, size_t size, float* kernelsum) {
MNN_ASSERT(size > 4);
auto blocknum = shape[0];
auto hu = shape[1];
auto lu = shape[2];
auto hp = shape[3];
auto lp = shape[4];
auto ic = blocknum * lu * lp;
auto stride0 = blocknum * hp * lu * lp;
auto stride1 = lu * hp * lp;
auto stride2 = hp * lp;
// [oc,ic]->[hu,blocknum,lu,hp,lp]
for (int i = 0; i < hu; ++i) {
for (int k = 0; k < hp; ++k) {
for (int bl = 0; bl < blocknum; ++bl) {
for (int j = 0; j < lu; ++j) {
int srcindex = (i * hp + k) * ic + bl * (lu * lp) + j * lp;
int dstindex = i * stride0 + bl * stride1 + j * stride2 + k * lp;
memcpy(dest + dstindex, source + srcindex, lp);
}
}
}
}
// Within [hu, blocknum, lu, hp, lp]: repack the int4 nibbles inside each [hp, lp] tile and accumulate the per-channel sums
auto inside = lp * hp;
auto outside = blocknum * hu;
std::vector<uint8_t> buffer(inside);
for (int i = 0; i < outside; ++i) {
std::vector<float> accum(hp, 0);
for (int k = 0; k < lu; ++k) {
for (int j = 0; j < inside / 2; ++j) {
auto w0 = dest[j + (i * lu + k) * inside] >> 4;
auto w1 = dest[j + (i * lu + k) * inside] & 0x0f;
auto w2 = dest[(i * lu + k) * inside + j + inside / 2] >> 4;
auto w3 = dest[(i * lu + k) * inside + j + inside / 2] & 0x0f;
buffer[2 * j + 0] = w0 * 16 + w2;
buffer[2 * j + 1] = w1 * 16 + w3;
// sum
accum[j / lp] += ((float)w0 + (float)w1);
accum[(j + inside / 2) / lp] += ((float)w2 + (float)w3);
}
memcpy(dest + (i * lu + k) * inside, buffer.data(), inside);
}
memcpy(kernelsum + i * hp, accum.data(), hp * sizeof(float));
}
}
#ifdef __aarch64__
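// ARM i8mm (Arm86) variant of the reorder above: each lp-byte row is moved
// as a 32-bit NEON lane (vld1q plus per-lane stores), then
// MNNPermuteSumWeightInt4Arm86 permutes the nibbles and fills kernelsum.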
static void MNNReorderWeightInt4Arm86(uint8_t* dest, const uint8_t* source, int32_t* shape, size_t size, float* kernelsum) {
MNN_ASSERT(size > 4);
auto blocknum = shape[0];
auto hu = shape[1];
auto lu = shape[2];
auto hp = shape[3];
auto lp = shape[4];
auto ic = blocknum * lu * lp;
auto stride0 = blocknum * hp * lu * lp;
auto stride1 = lu * hp * lp;
auto stride2 = hp * lp;
auto dstPtr = (int32_t*)dest;
auto srcPtr = (int32_t*)source;
int unitpacksize = sizeof(int32_t) / sizeof(uint8_t);
for (int i = 0; i < hu; ++i) {
for (int k = 0; k < hp; ++k) {
for (int bl = 0; bl < blocknum; ++bl) {
int j = 0;
while (j + 7 < lu) {
auto srcindex0 = ((i * hp + k) * ic + bl * (lu * lp) + j * lp) / unitpacksize;
auto srcindex1 = ((i * hp + k) * ic + bl * (lu * lp) + (j + 4) * lp) / unitpacksize;
auto dstindex0 = (bl * stride1 + i * stride0 + j * stride2 + k * lp) / unitpacksize;
auto dstindex1 = (bl * stride1 + i * stride0 + (j + 1) * stride2 + k * lp) / unitpacksize;
auto dstindex2 = (bl * stride1 + i * stride0 + (j + 2) * stride2 + k * lp) / unitpacksize;
auto dstindex3 = (bl * stride1 + i * stride0 + (j + 3) * stride2 + k * lp) / unitpacksize;
auto dstindex4 = (bl * stride1 + i * stride0 + (j + 4) * stride2 + k * lp) / unitpacksize;
auto dstindex5 = (bl * stride1 + i * stride0 + (j + 5) * stride2 + k * lp) / unitpacksize;
auto dstindex6 = (bl * stride1 + i * stride0 + (j + 6) * stride2 + k * lp) / unitpacksize;
auto dstindex7 = (bl * stride1 + i * stride0 + (j + 7) * stride2 + k * lp) / unitpacksize;
j += 8;
auto srcdata0 = vld1q_s32(srcPtr + srcindex0);
auto srcdata1 = vld1q_s32(srcPtr + srcindex1);
vst1q_lane_s32(dstPtr + dstindex0, srcdata0, 0);
vst1q_lane_s32(dstPtr + dstindex1, srcdata0, 1);
vst1q_lane_s32(dstPtr + dstindex2, srcdata0, 2);
vst1q_lane_s32(dstPtr + dstindex3, srcdata0, 3);
vst1q_lane_s32(dstPtr + dstindex4, srcdata1, 0);
vst1q_lane_s32(dstPtr + dstindex5, srcdata1, 1);
vst1q_lane_s32(dstPtr + dstindex6, srcdata1, 2);
vst1q_lane_s32(dstPtr + dstindex7, srcdata1, 3);
}
while (j + 3 < lu) {
auto srcindex = ((i * hp + k) * ic + bl * (lu * lp) + j * lp) / unitpacksize;
auto dstindex0 = (bl * stride1 + i * stride0 + j * stride2 + k * lp) / unitpacksize;
auto dstindex1 = (bl * stride1 + i * stride0 + (j + 1) * stride2 + k * lp) / unitpacksize;
auto dstindex2 = (bl * stride1 + i * stride0 + (j + 2) * stride2 + k * lp) / unitpacksize;
auto dstindex3 = (bl * stride1 + i * stride0 + (j + 3) * stride2 + k * lp) / unitpacksize;
j += 4;
auto srcdata = vld1q_s32(srcPtr + srcindex);
vst1q_lane_s32(dstPtr + dstindex0, srcdata, 0);
vst1q_lane_s32(dstPtr + dstindex1, srcdata, 1);
vst1q_lane_s32(dstPtr + dstindex2, srcdata, 2);
vst1q_lane_s32(dstPtr + dstindex3, srcdata, 3);
}
while (j < lu) {
auto srcindex = ((i * hp + k) * ic + bl * (lu * lp) + j * lp) / unitpacksize;
auto dstindex = (bl * stride1 + i * stride0 + j * stride2 + k * lp) / unitpacksize;
dstPtr[dstindex] = srcPtr[srcindex];
j++;
}
}
}
}
MNNPermuteSumWeightInt4Arm86(dest, dest, blocknum * hu, lu, kernelsum);
}
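// Arm82 (sdot) variant: the same reorder, but rows are moved as 16-bit NEON
// lanes before MNNPermuteSumWeightInt4Arm82 finalizes the layout and kernelsum.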
static void MNNReorderWeightInt4Arm82(uint8_t* dest, const uint8_t* source, int32_t* shape, size_t size, float* kernelsum) {
MNN_ASSERT(size > 4);
// dst shape: [hu, blocknum, kernelCount, lu, hp, lp], kernelCount=1 in this case
auto blocknum = shape[0];
auto hu = shape[1];
auto lu = shape[2];
auto hp = shape[3];
auto lp = shape[4];
auto ic = blocknum * lu * lp;
auto stride0 = blocknum * hp * lu * lp;
auto stride1 = lu * hp * lp;
auto stride2 = hp * lp;
auto dstPtr = (int16_t*)dest;
auto srcPtr = (int16_t*)source;
int unitpacksize = sizeof(int16_t) / sizeof(uint8_t);
for (int i = 0; i < hu; ++i) {
for (int k = 0; k < hp; ++k) {
for (int bl = 0; bl < blocknum; ++bl) {
int j = 0;
while (j + 7 < lu) {
auto srcindex = ((i * hp + k) * ic + bl * (lu * lp) + j * lp) / unitpacksize;
auto dstindex0 = (bl * stride1 + i * stride0 + j * stride2 + k * lp) / unitpacksize;
auto dstindex1 = (bl * stride1 + i * stride0 + (j + 1) * stride2 + k * lp) / unitpacksize;
auto dstindex2 = (bl * stride1 + i * stride0 + (j + 2) * stride2 + k * lp) / unitpacksize;
auto dstindex3 = (bl * stride1 + i * stride0 + (j + 3) * stride2 + k * lp) / unitpacksize;
auto dstindex4 = (bl * stride1 + i * stride0 + (j + 4) * stride2 + k * lp) / unitpacksize;
auto dstindex5 = (bl * stride1 + i * stride0 + (j + 5) * stride2 + k * lp) / unitpacksize;
auto dstindex6 = (bl * stride1 + i * stride0 + (j + 6) * stride2 + k * lp) / unitpacksize;
auto dstindex7 = (bl * stride1 + i * stride0 + (j + 7) * stride2 + k * lp) / unitpacksize;
j += 8;
auto srcdata = vld1q_s16(srcPtr + srcindex);
vst1q_lane_s16(dstPtr + dstindex0, srcdata, 0);
vst1q_lane_s16(dstPtr + dstindex1, srcdata, 1);
vst1q_lane_s16(dstPtr + dstindex2, srcdata, 2);
vst1q_lane_s16(dstPtr + dstindex3, srcdata, 3);
vst1q_lane_s16(dstPtr + dstindex4, srcdata, 4);
vst1q_lane_s16(dstPtr + dstindex5, srcdata, 5);
vst1q_lane_s16(dstPtr + dstindex6, srcdata, 6);
vst1q_lane_s16(dstPtr + dstindex7, srcdata, 7);
}
while (j + 3 < lu) {
auto srcindex = ((i * hp + k) * ic + bl * (lu * lp) + j * lp) / unitpacksize;
auto dstindex0 = (bl * stride1 + i * stride0 + j * stride2 + k * lp) / unitpacksize;
auto dstindex1 = (bl * stride1 + i * stride0 + (j + 1) * stride2 + k * lp) / unitpacksize;
auto dstindex2 = (bl * stride1 + i * stride0 + (j + 2) * stride2 + k * lp) / unitpacksize;
auto dstindex3 = (bl * stride1 + i * stride0 + (j + 3) * stride2 + k * lp) / unitpacksize;
j += 4;
auto srcdata = vld1_s16(srcPtr + srcindex);
vst1_lane_s16(dstPtr + dstindex0, srcdata, 0);
vst1_lane_s16(dstPtr + dstindex1, srcdata, 1);
vst1_lane_s16(dstPtr + dstindex2, srcdata, 2);
vst1_lane_s16(dstPtr + dstindex3, srcdata, 3);
}
while (j < lu) {
auto srcindex = ((i * hp + k) * ic + bl * (lu * lp) + j * lp) / unitpacksize;
auto dstindex = (bl * stride1 + i * stride0 + j * stride2 + k * lp) / unitpacksize;
dstPtr[dstindex] = srcPtr[srcindex];
j++;
}
}
}
}
MNNPermuteSumWeightInt4Arm82(dest, dest, blocknum * hu, lu, kernelsum);
}
#endif // __aarch64__
static void MNNSumWeightInt8(float* kernelsum, int8_t* source, size_t outside, size_t reduceAxis, size_t hP, size_t lP) {
// weight shape: [outside, axis, hP, lP]
// outside = blocknum * hU
// reduceAxis = kernelCount * lU
auto inside = hP * lP;
auto stride0 = inside * reduceAxis;
std::vector<float> accum(hP);
for (int i = 0; i < outside; ++i) {
memset(accum.data(), 0, hP * sizeof(float));
for (int j = 0; j < reduceAxis; ++j) {
for (int k = 0; k < hP; ++k) {
for (int x = 0; x < lP; ++x) {
accum[k] += (float)source[x + k * lP + j * inside + i * stride0];
}
}
}
memcpy(kernelsum + i * hP, accum.data(), hP * sizeof(float));
}
}
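// For each tile column w, sum the quantized activations along L and scale
// by the per-column dequant scale: dest[w] = scale_w * sum_l(A_q[w, l]);
// the int8 GEMM uses this as the weight-zero-point correction term.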
static void MNNSumByAxisLForMatmul_A(float* dest, int8_t* source, const float* scale, ssize_t realDstCount, SumByAxisParams sumParams) {
#ifdef MNN_USE_SSE
uint8_t* srcInt8 = reinterpret_cast<uint8_t*>(source);
#else
int8_t* srcInt8 = source;
#endif
auto scalePtr = scale;
auto blockNum = sumParams.blockNum;
auto EP = sumParams.DST_XUNIT;
auto LP = sumParams.SRC_UNIT;
auto col_buffer_unit_size = sumParams.unitColBufferSize;
auto oneScale = sumParams.oneScale;
auto LU = sumParams.LU;
auto valid = sumParams.valid;
auto kernelxy = sumParams.kernelxy;
auto blockSizeQuad = LU / blockNum;
auto inputBlockQuant = sumParams.inputBlock;
auto lastL = LP;
if (valid) {
lastL = valid;
}
float singlescale = scale[0];
do {
int step = ALIMIN(EP, realDstCount);
int scaleOffset = inputBlockQuant ? (step * blockNum) : step;
for (int k = 0; k < blockNum; ++k) {
const auto src_x = srcInt8 + k * (step * LP * blockSizeQuad * kernelxy);
for (int w = 0; w < step; ++w) {
float dequantScale = singlescale;
if (oneScale == 0 && inputBlockQuant) {
dequantScale = scalePtr[w + k * step];
} else if (oneScale == 0) {
dequantScale = scalePtr[w];
}
int sumint32 = 0;
const auto src_y = src_x + w * LP;
for (int j = 0; j < kernelxy; ++j) {
for (int i = 0; i < blockSizeQuad; ++i) {
auto sumsize = i == (blockSizeQuad - 1) ? lastL : LP;
const auto src_z = src_y + j * (blockSizeQuad * step * LP) + i * step * LP;
for (int x = 0; x < sumsize; ++x) {
sumint32 += src_z[x];
}
}
}
dest[w + k * step] = dequantScale * static_cast<float>(sumint32);
}
}
scalePtr += scaleOffset;
dest += (step * blockNum);
realDstCount -= step;
srcInt8 += col_buffer_unit_size;
} while (realDstCount > 0);
}
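// NC4HW4 layout helpers: MNNPackC4Common interleaves groups of 4 channels
// per spatial element (zero-padding the channel tail), and MNNUnpackC4Common
// scatters them back; areaOffset[0] / areaOffset[1] are the source /
// destination plane strides.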
template<typename T>
void MNNPackC4Common(T* dst, const T* src, size_t area, size_t depth, int* areaOffset) {
int depthC4 = depth / 4;
int depthRemain = depthC4 * 4;
int remain = depth - depthRemain;
int z, x, y;
const T* srcChannel[4];
const T* srcOffset = src;
for(z = 0; z < depthC4; ++z) {
auto dstZ = dst + z * areaOffset[1] * 4;
for(y = 0; y < 4; ++y) {
srcChannel[y] = srcOffset + areaOffset[0] * y;
}
for(x = 0; x < area; ++x) {
for(y = 0; y < 4; ++y) {
dstZ[0] = srcChannel[y][x];
dstZ++;
}
}
srcOffset += areaOffset[0] * 4;
}
if(remain > 0){
auto dstZ = dst + depthC4 * areaOffset[1] * 4;
for(y = 0; y < remain; ++y) {
srcChannel[y] = srcOffset + areaOffset[0] * y;
}
for(x = 0; x < area; ++x) {
for(y = 0; y < remain; ++y) {
dstZ[0] = srcChannel[y][x];
dstZ++;
}
for(y = remain; y < 4; ++y) {
dstZ[0] = 0;
dstZ++;
}
}
}
}
template<typename T>
void MNNUnpackC4Common(T* dst, const T* src, size_t area, size_t depth, int* areaOffset) {
int depthC4 = depth / 4;
int depthRemain = depthC4 * 4;
int remain = depth - depthRemain;
int z, x, y;
const T* srcChannel[4];
const T* srcOffset = src;
for(z = 0; z < depthC4; ++z) {
for(y = 0; y < 4; ++y) {
auto dstZ = dst + (z * 4 + y) * areaOffset[1];
srcChannel[y] = srcOffset + y;
for(x = 0; x < area; ++x) {
dstZ[x] = srcChannel[y][0];
srcChannel[y] += 4;
}
}
srcOffset += areaOffset[0] * 4;
}
if(remain > 0){
auto dstZ = dst + depthC4 * areaOffset[1] * 4;
for(y = 0; y < remain; ++y) {
srcChannel[y] = srcOffset + y;
for(x = 0; x < area; ++x) {
dstZ[x] = srcChannel[y][0];
srcChannel[y] += 4;
}
dstZ += areaOffset[1];
}
}
}
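// Two-channel (C2) analogues of the pack/unpack helpers above;
// MNNUnpackC2Common additionally supports an inner pack factor.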
template<typename T>
void MNNPackC2Common(T* dst, const T* src, size_t area, size_t depth, int* areaOffset) {
int depthC2 = depth / 2;
int depthRemain = depthC2 * 2;
int remain = depth - depthRemain;
int z, x, y;
const T* srcChannel[2];
const T* srcOffset = src;
for(z = 0; z < depthC2; ++z) {
auto dstZ = dst + z * areaOffset[1] * 2;
for(y = 0; y < 2; ++y) {
srcChannel[y] = srcOffset + areaOffset[0] * y;
}
for(x = 0; x < area; ++x) {
for(y = 0; y < 2; ++y) {
dstZ[0] = srcChannel[y][x];
dstZ++;
}
}
srcOffset += areaOffset[0] * 2;
}
if(remain > 0){
auto dstZ = dst + depthC2 * areaOffset[1] * 2;
for(y = 0; y < remain; ++y) {
srcChannel[y] = srcOffset + areaOffset[0] * y;
}
for(x = 0; x < area; ++x) {
for(y = 0; y < remain; ++y) {
dstZ[0] = srcChannel[y][x];
dstZ++;
}
for(y = remain; y < 2; ++y) {
dstZ[0] = 0;
dstZ++;
}
}
}
}
template<typename T>
void MNNUnpackC2Common(T* dst, const T* src, size_t area, size_t depth, int* areaOffset, int pack = 1) {
int depthC2 = depth / 2;
int depthRemain = depthC2 * 2;
int remain = depth - depthRemain;
int z, x, y;
const T* srcChannel[2];
const T* srcOffset = src;
for(z = 0; z < depthC2; ++z) {
for(y = 0; y < 2; ++y) {
auto dstZ = dst + (z * 2 + y) * areaOffset[1] * pack;
srcChannel[y] = srcOffset + y * pack;
for(x = 0; x < area; ++x) {
for (int p = 0; p < pack; ++p) {
dstZ[x * pack + p] = srcChannel[y][p];
}
srcChannel[y] += (2 * pack);
}
}
srcOffset += areaOffset[0] * 2 * pack;
}
if(remain > 0){
auto dstZ = dst + depthC2 * areaOffset[1] * 2 * pack;
for(y = 0; y < remain; ++y) {
srcChannel[y] = srcOffset + y * pack;
for(x = 0; x < area; ++x) {
for (int p = 0; p < pack; ++p) {
dstZ[x * pack + p] = srcChannel[y][p];
}
srcChannel[y] += 2 * pack;
}
dstZ += areaOffset[1] * pack;
}
}
}
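// Strided element-copy helpers; despite the "Bit" in the names, the digit
// is the element size in bytes (MNN4Bitcopy* moves 4-byte units,
// MNN2Bitcopy* 2-byte, MNN1Bitcopy* 1-byte). The *Fast variants assume
// ds == 1 with stride 0 (broadcast) or 1 (contiguous copy), use NEON/SSE
// block moves, and finish with a scalar tail loop.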
void MNN4BitcopyWithStride (uint8_t* dstO, const uint8_t* srcO, int size, int stride, int ds) {
auto src = (uint32_t*)srcO;
auto dst = (uint32_t*)dstO;
for (int i = 0; i < size; ++i) {
dst[0] = *src;
dst += ds;
src += stride;
}
}
void MNN4BitcopyFast (uint8_t* dstO, const uint8_t* srcO, int size, int stride, int ds) {
// ds=1, stride=0||1
auto src = (float*)srcO;
auto dst = (float*)dstO;
int cnt = size;
if (stride == 1) { // stride=1
#ifdef MNN_USE_NEON
for (; cnt >= 8; cnt -= 8) {
auto v4 = vld1q_f32(src);
auto u4 = vld1q_f32(src + 4);
vst1q_f32(dst, v4);
vst1q_f32(dst + 4, u4);
dst += 8;
src += 8;
}
for (; cnt >= 4; cnt -= 4) {
auto v4 = vld1q_f32(src);
vst1q_f32(dst, v4);
dst += 4;
src += 4;
}
#elif defined(MNN_USE_SSE)
for (; cnt >= 8; cnt -= 8) {
__m128 v4 = _mm_loadu_ps(src);
__m128 u4 = _mm_loadu_ps(src + 4);
_mm_storeu_ps(dst, v4);
_mm_storeu_ps(dst + 4, u4);
dst += 8;
src += 8;
}
for (; cnt >= 4; cnt -= 4) {
__m128 v4 = _mm_loadu_ps(src);
_mm_storeu_ps(dst, v4);
dst += 4;
src += 4;
}
#endif
} else { // stride=0
int i = 0;
float val = *src;
#ifdef MNN_USE_NEON
auto val4 = vdupq_n_f32(val);
for (; cnt >= 8; cnt -= 8) {
vst1q_f32(dst, val4);
vst1q_f32(dst + 4, val4);
dst += 8;
}
for (; cnt >= 4; cnt -= 4) {
vst1q_f32(dst, val4);
dst += 4;
}
#elif defined(MNN_USE_SSE)
__m128 val4 = _mm_set_ps(val, val, val, val);
for (; cnt >= 8; cnt -= 8) {
_mm_storeu_ps(dst, val4);
_mm_storeu_ps((dst + 4), val4);
dst += 8;
}
for (; cnt >= 4; cnt -= 4) {
_mm_storeu_ps(dst, val4);
dst += 4;
}
#endif
}
for (; cnt > 0; --cnt) {
dst[0] = *src;
dst += ds;
src += stride;
}
}
void MNN2BitcopyWithStride(uint8_t* dstO, const uint8_t* srcO, int size, int stride, int ds) {
auto src = (uint16_t*)srcO;
auto dst = (uint16_t*)dstO;
for (int i=0; i<size; ++i) {
*dst = *src;
src+=stride;
dst+=ds;
}
}
void MNN2BitcopyFast(uint8_t* dstO, const uint8_t* srcO, int size, int stride, int ds) {
auto src = (uint16_t*)srcO;
auto dst = (uint16_t*)dstO;
int cnt = size;
uint16_t val = *src;
if (stride == 1) {
#ifdef MNN_USE_NEON
for (; cnt >= 8; cnt-=8) {
auto val8 = vld1q_u16(src);
vst1q_u16(dst, val8);
src += 8;
dst += 8;
}
for (; cnt >= 4; cnt-=4) {
auto val4 = vld1_u16(src);
vst1_u16(dst, val4);
src += 4;
dst += 4;
}
#elif defined(MNN_USE_SSE)
for (; cnt >= 8; cnt-=8) {
auto tmp = _mm_loadu_ps((float*)src);
_mm_storeu_ps((float*)dst, tmp);
src += 8;
dst += 8;
}
#endif
} else { // stride=0
#ifdef MNN_USE_NEON
auto val4 = vdup_n_u16(val);
auto val8 = vdupq_n_u16(val);
for (; cnt >= 8; cnt-=8) {
vst1q_u16(dst, val8);
dst += 8;
}
for (; cnt >= 4; cnt-=4) {
vst1_u16(dst, val4);
dst += 4;
}
#elif defined(MNN_USE_SSE)
uint16_t arr[8] = {val, val, val, val, val, val, val, val};
auto val8 = _mm_loadu_ps((float*)arr);
for (; cnt >= 8; cnt-=8) {
_mm_storeu_ps((float*)dst, val8);
dst += 8;
}
#endif
}
for (; cnt > 0; --cnt) {
*dst = *src;
src += stride;
dst += ds;
}
}
void MNN1BitcopyWithStride (uint8_t* dstO, const uint8_t* srcO, int size, int stride, int ds) {
for (int i = 0; i < size; ++i) {
dstO[0] = *srcO;
dstO += ds;
srcO += stride;
}
}
void MNN1BitCopyFast (uint8_t* dstO, const uint8_t* srcO, int size, int stride, int ds) {
int cnt = size;
uint8_t val = *srcO;
if (stride == 1) {
#ifdef MNN_USE_SSE
for (; cnt >= 16; cnt-=16) {
auto tmp = _mm_loadu_ps((float*)srcO);
_mm_storeu_ps((float*)dstO, tmp);
srcO += 16;
dstO += 16;
}
#elif defined(MNN_USE_NEON)
for (; cnt >= 16; cnt-=16) {
auto val16 = vld1q_u8(srcO);
vst1q_u8(dstO, val16);
srcO += 16;
dstO += 16;
}
for (; cnt >= 8; cnt-=8) {
auto val8 = vld1_u8(srcO);
vst1_u8(dstO, val8);
srcO += 8;
dstO += 8;
}
#endif
} else { // stride=0
#ifdef MNN_USE_SSE
std::vector<uint8_t> arr(16, val);
auto val16 = _mm_loadu_ps((float*)arr.data());
for (; cnt >= 16; cnt-=16) {
_mm_storeu_ps((float*)dstO, val16);
dstO += 16;
}
#elif defined(MNN_USE_NEON)
auto val16 = vdupq_n_u8(val);
auto val8 = vdup_n_u8(val);
for (; cnt >= 16; cnt-=16) {
vst1q_u8(dstO, val16);
dstO += 16;
}
for (; cnt >= 8; cnt-=8) {
vst1_u8(dstO, val8);
dstO += 8;
}
#endif
}
for (; cnt > 0; --cnt) {
dstO[0] = *srcO;
dstO += ds;
srcO += stride;
}
}
void MNNAccumulateSequenceNumber (float* dst, const float* src, int size) {
    // Accumulate the sum of `size` floats from src into *dst.
int size8 = (size / 8) * 8;
int i = 0;
float sum = 0.f;
float tmp[4];
#ifdef MNN_USE_NEON
if (size >= 8) {
auto sum4_1 = vdupq_n_f32(0.f);
auto sum4_2 = vdupq_n_f32(0.f);
for (; i < size8; i += 8) {
auto v4 = vld1q_f32(src);
auto u4 = vld1q_f32(src + 4);
sum4_1 = vaddq_f32(sum4_1, v4);
sum4_2 = vaddq_f32(sum4_2, u4);
src += 8;
}
sum4_1 = vaddq_f32(sum4_1, sum4_2);
sum = (sum4_1[0] + sum4_1[1]) + (sum4_1[2] + sum4_1[3]);
}
#elif defined(MNN_USE_SSE)
if (size >= 8) {
auto sum4_1 = _mm_set_ps1(0.f);
auto sum4_2 = _mm_set_ps1(0.f);
for (; i < size8; i += 8) {
auto v4 = _mm_loadu_ps(src);
auto u4 = _mm_loadu_ps(src + 4);
sum4_1 = _mm_add_ps(sum4_1, v4);
sum4_2 = _mm_add_ps(sum4_2, u4);
src += 8;
}
sum4_1 = _mm_add_ps(sum4_1, sum4_2);
_mm_storeu_ps(tmp, sum4_1);
sum += (tmp[0] + tmp[1] + tmp[2] + tmp[3]);
}
#endif
for (; i < size; ++i) {
sum += (*src);
src += 1;
}
*dst = sum;
}
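// Scalar reference for MNNAccumulateSequenceNumber (the SIMD paths above
// compute the same reduction):
//   float sum = 0.f;
//   for (int i = 0; i < size; ++i) { sum += src[i]; }
//   *dst = sum;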
#ifndef MNN_USE_NEON
void MNNGetMatMulPackMode(int* eP, int *lP, int* hP) {
*eP = 16;
*lP = 1;
*hP = 4;
}
void MNNGetSparseMatMulPackMode(int* eP, int *lP, int* hP) {
*eP = 16;
*lP = 1;
*hP = 4;
    // hP corresponds to the sparse block size along the right matrix column dimension; for random sparsity it is 1.
return;
}
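// eP / lP / hP describe the generic GEMM tiling: eP elements along A's e
// (row) dimension, lP along the shared l dimension, hP along B's h (column)
// dimension. A caller would size its packed buffers from them, e.g.
// (illustrative sketch):
//   int eP, lP, hP;
//   MNNGetMatMulPackMode(&eP, &lP, &hP);        // 16 / 1 / 4 in this path
//   // packed B needs UP_DIV(h, hP) * hP * l floats, see MNNPackForMatMul_B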
void MNNPackForMatMul_B(float* dest, const float* source, size_t h, size_t l, bool transpose) {
auto hP = h / 4;
auto hR = hP * 4;
if (hR != h) {
::memset(dest, 0, UP_DIV(h, 4)*4*l*sizeof(float));
}
if (!transpose) {
for (int y=0; y<hP; ++y) {
auto destY = dest + y * 4 * l;
auto sourceY = source + y * 4;
for (int x=0; x<l; ++x) {
::memcpy(destY + 4 * x, sourceY + x * h, 4 * sizeof(float));
}
}
auto hRemain = h - hR;
if (hRemain > 0) {
auto destY = dest + hP * 4 * l;
auto sourceY = source + hP * 4;
for (int x=0; x<l; ++x) {
::memcpy(destY + 4 * x, sourceY + x * h, hRemain * sizeof(float));
}
}
return;
}
int offset[] = {
(int)l,
(int)l
};
MNNPackC4(dest, source, l, h, offset);
}
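// Layout note for the generic MNNPackForMatMul_B: B is repacked into
// UP_DIV(h, 4) panels of shape l x 4, so for the non-transposed (l x h)
// source, element (x, y) lands at dest[(y / 4) * 4 * l + 4 * x + (y % 4)];
// the trailing h % 4 panel is zero-padded by the memset. The transposed
// (h x l) case delegates to MNNPackC4.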
static void _MNNPackedMatMulRemain(float* C, const float* A, const float* B, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, int aStride) {
auto h = parameter[2];
auto l = parameter[1];
auto cStride = parameter[3] / sizeof(float);
auto hRemain = parameter[4];
auto bExtraStride = parameter[5] / sizeof(float);
auto bStride = bExtraStride + l * 4;
auto hC4 = UP_DIV(h, 4);
for (int y=0; y<hC4; ++y) {
::memset(C + y * cStride, 0, eSize * 4 * sizeof(float));
}
float alpha = 1.0f;
float beta = 0.0f;
float minValue = -std::numeric_limits<float>().max();
float maxValue = std::numeric_limits<float>().max();
if (nullptr != postParameters) {
minValue = postParameters[2];
maxValue = postParameters[3];
alpha = postParameters[0];
beta = postParameters[1];
}
for (int x=0; x<eSize; ++x) {
auto dst = C + 4 * x;
auto src = A + x;
for (int y=0; y<hC4; ++y) {
auto dstY = dst + y * cStride;
auto weight = B + y * bStride;
float summer[4] = {
0.0f,
0.0f,
0.0f,
0.0f,
};
if (nullptr != bias) {
for (int v=0; v<4; ++v) {
summer[v] = bias[4 * y + v];
}
}
for (int z=0; z<l; ++z) {
auto aZ = src + z * aStride;
auto wZ = weight + z * 4;
summer[0] += wZ[0] * aZ[0];
summer[1] += wZ[1] * aZ[0];
summer[2] += wZ[2] * aZ[0];
summer[3] += wZ[3] * aZ[0];
}
for (int v=0; v<4; ++v) {
auto dstValue = std::min(summer[v], maxValue);
dstValue = std::max(dstValue, minValue);
dstY[v] = dstValue;
}
}
}
}
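// Reference formula for the scalar kernel above (C4 output layout): for each
// column x < eSize and output row 4 * y + v,
//   C[y * cStride + 4 * x + v] =
//       clamp(bias[4 * y + v] + sum_z B[y * bStride + 4 * z + v] * A[x + z * aStride],
//             minValue, maxValue)
// alpha / beta are read from postParameters, but this generic path applies
// only the min / max clamp.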
void MNNPackedMatMul(float* C, const float* A, const float* B, const size_t* parameter, const float* postParameters, const float* bias, const float* k, const float* b) {
return _MNNPackedMatMulRemain(C, A, B, 16, parameter, postParameters, bias, 16);
}
void MNNPackedMatMulRemain(float* C, const float* A, const float* B, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, const float* k, const float* b) {
auto aStride = parameter[0] / sizeof(float);
_MNNPackedMatMulRemain(C, A, B, eSize, parameter, postParameters, bias, aStride);
}
void MNNPackC4ForMatMul_A(float* destOrigin, float const** sourceGroup, const int32_t* info, const int32_t* el) {
int number = info[0];
int eReal = info[1];
int eDest = info[2];
int offset = info[3];
for (int n=0; n<number; ++n) {
int e = el[4 * n + 0];
int l = el[4 * n + 1];
int eOffset = el[4 * n + 2];
int lOffset = el[4 * n + 3];
auto dest = destOrigin + lOffset * eDest + eOffset;
auto source = sourceGroup[n];
for (int y=0; y<e; ++y) {
auto yR = y % eDest;
for (int x=0; x<l; ++x) {
auto xR = x % 4;
auto xC = x / 4;
dest[(x) * eDest + yR] = source[xC * eReal * 4 + y * 4 * offset + xR];
}
}
}
}
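// Parameter sketch for MNNPackC4ForMatMul_A: info = {number, eReal, eDest,
// offset}; el carries four ints per source region: {e, l, eOffset, lOffset}.
// Each region is gathered from a C4 source (channel-block stride eReal * 4,
// per-element stride 4 * offset) and written as an e x l panel at position
// (eOffset, lOffset) of the packed A buffer, whose row stride is eDest.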
void MNNPackedSparseMatMulEpx1(float* C, const float* A, const float* B, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, unsigned int* NNZMap, int* dataOffsetMap) {
auto eP = parameter[0] / sizeof(float);
    MNN_ASSERT((eP & 0x03) == 0); // In the sparse kernels, eP must be divisible by 4
auto h = parameter[2];
auto l = parameter[1];
auto cStride = parameter[3] / sizeof(float);
auto aStride = eP * l;
auto hRemain = parameter[4];
auto bExtraStride = parameter[5] / sizeof(float);
auto bStride = bExtraStride + l * 4;
auto hC4 = UP_DIV(h, 4);
float minValue = -std::numeric_limits<float>().max();
float maxValue = std::numeric_limits<float>().max();
if (nullptr != postParameters) {
minValue = postParameters[2];
maxValue = postParameters[3];
}
// MNN_PRINT("MNNPackedSparseMatMul eP:%lu, eSize:%lu, l:%lu, h:%lu, cStride:%lu, aStride:%lu\n", eP, eSize, l, h, cStride, aStride);
const float* a = A;
size_t ie = 0;
for (ie = 0; ie < eSize && eP <= eSize; ie += eP) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
for (auto ih = 0; ih < h; ih++) {
auto ihPack = ih >> 2;
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihPack * cStride + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
float acc2 = initValue;
float acc3 = initValue;
float acc4 = initValue;
float acc5 = initValue;
float acc6 = initValue;
float acc7 = initValue;
float acc8 = initValue;
float acc9 = initValue;
float acc10 = initValue;
float acc11 = initValue;
float acc12 = initValue;
float acc13 = initValue;
float acc14 = initValue;
float acc15 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float a4 = a[4];
const float a5 = a[5];
const float a6 = a[6];
const float a7 = a[7];
const float a8 = a[8];
const float a9 = a[9];
const float a10 = a[10];
const float a11 = a[11];
const float a12 = a[12];
const float a13 = a[13];
const float a14 = a[14];
const float a15 = a[15];
const float oneW = *w++;
// MNN_PRINT("16-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-15]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {16});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
acc2 += a2 * oneW;
acc3 += a3 * oneW;
acc4 += a4 * oneW;
acc5 += a5 * oneW;
acc6 += a6 * oneW;
acc7 += a7 * oneW;
acc8 += a8 * oneW;
acc9 += a9 * oneW;
acc10 += a10 * oneW;
acc11 += a11 * oneW;
acc12 += a12 * oneW;
acc13 += a13 * oneW;
acc14 += a14 * oneW;
acc15 += a15 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
acc2 = std::max(std::min(maxValue, acc2), minValue);
acc3 = std::max(std::min(maxValue, acc3), minValue);
acc4 = std::max(std::min(maxValue, acc4), minValue);
acc5 = std::max(std::min(maxValue, acc5), minValue);
acc6 = std::max(std::min(maxValue, acc6), minValue);
acc7 = std::max(std::min(maxValue, acc7), minValue);
acc8 = std::max(std::min(maxValue, acc8), minValue);
acc9 = std::max(std::min(maxValue, acc9), minValue);
acc10 = std::max(std::min(maxValue, acc10), minValue);
acc11 = std::max(std::min(maxValue, acc11), minValue);
acc12 = std::max(std::min(maxValue, acc12), minValue);
acc13 = std::max(std::min(maxValue, acc13), minValue);
acc14 = std::max(std::min(maxValue, acc14), minValue);
acc15 = std::max(std::min(maxValue, acc15), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
c[4 * 2] = acc2;
c[4 * 3] = acc3;
c[4 * 4] = acc4;
c[4 * 5] = acc5;
c[4 * 6] = acc6;
c[4 * 7] = acc7;
c[4 * 8] = acc8;
c[4 * 9] = acc9;
c[4 * 10] = acc10;
c[4 * 11] = acc11;
c[4 * 12] = acc12;
c[4 * 13] = acc13;
c[4 * 14] = acc14;
c[4 * 15] = acc15;
}
a += aStride;
}
// const float* blockA = A + ie * l;
if (eSize & 0x08) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
for (auto ih = 0; ih < h; ih++) {
auto ihPack = ih >> 2;
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihPack * cStride + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
float acc2 = initValue;
float acc3 = initValue;
float acc4 = initValue;
float acc5 = initValue;
float acc6 = initValue;
float acc7 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float a4 = a[4];
const float a5 = a[5];
const float a6 = a[6];
const float a7 = a[7];
const float oneW = *w++;
// MNN_PRINT("8-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-7]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {8});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
acc2 += a2 * oneW;
acc3 += a3 * oneW;
acc4 += a4 * oneW;
acc5 += a5 * oneW;
acc6 += a6 * oneW;
acc7 += a7 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
acc2 = std::max(std::min(maxValue, acc2), minValue);
acc3 = std::max(std::min(maxValue, acc3), minValue);
acc4 = std::max(std::min(maxValue, acc4), minValue);
acc5 = std::max(std::min(maxValue, acc5), minValue);
acc6 = std::max(std::min(maxValue, acc6), minValue);
acc7 = std::max(std::min(maxValue, acc7), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
c[4 * 2] = acc2;
c[4 * 3] = acc3;
c[4 * 4] = acc4;
c[4 * 5] = acc5;
c[4 * 6] = acc6;
c[4 * 7] = acc7;
}
ie += 8;
a += 8;
}
if (eSize & 0x04) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// const float* a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
for (auto ih = 0; ih < h; ih++) {
auto ihPack = ih >> 2;
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihPack * cStride + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
float acc2 = initValue;
float acc3 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float oneW = *w++;
// MNN_PRINT("4-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-3]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {4});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
acc2 += a2 * oneW;
acc3 += a3 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
acc2 = std::max(std::min(maxValue, acc2), minValue);
acc3 = std::max(std::min(maxValue, acc3), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
c[4 * 2] = acc2;
c[4 * 3] = acc3;
}
ie += 4;
a += 4;
}
if (eSize & 0x02) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// const float* a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
for (auto ih = 0; ih < h; ih++) {
auto ihPack = ih >> 2;
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihPack * cStride + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float oneW = *w++;
// MNN_PRINT("2-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-1]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {2});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
}
ie += 2;
a += 2;
}
if (eSize & 0x01) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// const float* a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
for (auto ih = 0; ih < h; ih++) {
auto ihPack = ih >> 2;
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihPack * cStride + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float oneW = *w++;
// MNN_PRINT("1-loop: ie:%zu, a offset:%ld, c offset:%ld, w offset:%ld, w value:%f, a value[0]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {1});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
}
ie += 1;
// a += 1;
}
return;
}
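// Sparse-format sketch for the Epx1 kernel above: the non-zero weights of B
// are stored as one contiguous stream `w`; NNZMap[ih] holds the non-zero
// count of output row ih; dataOffsetMap is a matching stream of pointer
// deltas (in floats) that walk `a` from one non-zero's l position to the
// next inside the packed A block, with a leading delta that positions the
// block start. A hypothetical 2-row example with l = 3:
//   B row 0 = {w00, 0, w02}, row 1 = {0, w11, 0}
//   weights  = {w00, w02, w11}
//   NNZMap   = {2, 1}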
void MNNPackedSparseMatMulEpx4(float* C, const float* A, const float* B, size_t eSize, const size_t* parameter, const float* postParameters, const float* bias, unsigned int* NNZMap, int* dataOffsetMap) {
auto eP = parameter[0] / sizeof(float);
    MNN_ASSERT((eP & 0x03) == 0); // In the sparse kernels, eP must be divisible by 4
auto h = parameter[2];
auto l = parameter[1];
auto cStride = parameter[3] / sizeof(float);
auto aStride = eP * l;
auto hRemain = parameter[4];
auto bExtraStride = parameter[5] / sizeof(float);
auto bStride = bExtraStride + l * 4;
auto hC4 = UP_DIV(h, 4);
float minValue = -std::numeric_limits<float>().max();
float maxValue = std::numeric_limits<float>().max();
if (nullptr != postParameters) {
minValue = postParameters[2];
maxValue = postParameters[3];
}
// MNN_PRINT("MNNPackedSparseMatMul 16x4 eP:%lu, eSize:%lu, l:%lu, h:%lu, cStride:%lu, aStride:%lu\n", eP, eSize, l, h, cStride, aStride);
const int sparseBlockOC = 4;
const float* a = A;
size_t ie = 0;
for (ie = 0; ie < eSize && eP <= eSize; ie += eP) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
size_t ih = 0;
for (; ih < (h & (~0x03)); ih += sparseBlockOC) {
auto ihPack = ih >> 2;
auto c = blockC + ihPack * cStride;
float initValue[4] = {0, 0, 0, 0};
if (nullptr != bias) {
memcpy(initValue, bias + ih, 4 * sizeof(float));
}
float acc0[4];
float acc1[4];
float acc2[4];
float acc3[4];
float acc4[4];
float acc5[4];
float acc6[4];
float acc7[4];
float acc8[4];
float acc9[4];
float acc10[4];
float acc11[4];
float acc12[4];
float acc13[4];
float acc14[4];
float acc15[4];
memcpy(acc0, initValue, 4 * sizeof(float));
memcpy(acc1, initValue, 4 * sizeof(float));
memcpy(acc2, initValue, 4 * sizeof(float));
memcpy(acc3, initValue, 4 * sizeof(float));
memcpy(acc4, initValue, 4 * sizeof(float));
memcpy(acc5, initValue, 4 * sizeof(float));
memcpy(acc6, initValue, 4 * sizeof(float));
memcpy(acc7, initValue, 4 * sizeof(float));
memcpy(acc8, initValue, 4 * sizeof(float));
memcpy(acc9, initValue, 4 * sizeof(float));
memcpy(acc10, initValue, 4 * sizeof(float));
memcpy(acc11, initValue, 4 * sizeof(float));
memcpy(acc12, initValue, 4 * sizeof(float));
memcpy(acc13, initValue, 4 * sizeof(float));
memcpy(acc14, initValue, 4 * sizeof(float));
memcpy(acc15, initValue, 4 * sizeof(float));
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float a4 = a[4];
const float a5 = a[5];
const float a6 = a[6];
const float a7 = a[7];
const float a8 = a[8];
const float a9 = a[9];
const float a10 = a[10];
const float a11 = a[11];
const float a12 = a[12];
const float a13 = a[13];
const float a14 = a[14];
const float a15 = a[15];
const float wv[4] = {*w++, *w++, *w++, *w++};
// MNN_PRINT("16-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-15]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {16});
// MNN_PRINT("\n");
a = a + diff;
for (int lane = 0; lane < 4; lane++) {
acc0[lane] += a0 * wv[lane];
acc1[lane] += a1 * wv[lane];
acc2[lane] += a2 * wv[lane];
acc3[lane] += a3 * wv[lane];
acc4[lane] += a4 * wv[lane];
acc5[lane] += a5 * wv[lane];
acc6[lane] += a6 * wv[lane];
acc7[lane] += a7 * wv[lane];
acc8[lane] += a8 * wv[lane];
acc9[lane] += a9 * wv[lane];
acc10[lane] += a10 * wv[lane];
acc11[lane] += a11 * wv[lane];
acc12[lane] += a12 * wv[lane];
acc13[lane] += a13 * wv[lane];
acc14[lane] += a14 * wv[lane];
acc15[lane] += a15 * wv[lane];
}
}
for (int lane = 0; lane < 4; lane++) {
acc0[lane] = std::max(std::min(maxValue, acc0[lane]), minValue);
acc1[lane] = std::max(std::min(maxValue, acc1[lane]), minValue);
acc2[lane] = std::max(std::min(maxValue, acc2[lane]), minValue);
acc3[lane] = std::max(std::min(maxValue, acc3[lane]), minValue);
acc4[lane] = std::max(std::min(maxValue, acc4[lane]), minValue);
acc5[lane] = std::max(std::min(maxValue, acc5[lane]), minValue);
acc6[lane] = std::max(std::min(maxValue, acc6[lane]), minValue);
acc7[lane] = std::max(std::min(maxValue, acc7[lane]), minValue);
acc8[lane] = std::max(std::min(maxValue, acc8[lane]), minValue);
acc9[lane] = std::max(std::min(maxValue, acc9[lane]), minValue);
acc10[lane] = std::max(std::min(maxValue, acc10[lane]), minValue);
acc11[lane] = std::max(std::min(maxValue, acc11[lane]), minValue);
acc12[lane] = std::max(std::min(maxValue, acc12[lane]), minValue);
acc13[lane] = std::max(std::min(maxValue, acc13[lane]), minValue);
acc14[lane] = std::max(std::min(maxValue, acc14[lane]), minValue);
acc15[lane] = std::max(std::min(maxValue, acc15[lane]), minValue);
}
memcpy(c, acc0, 4 * sizeof(float)); // store continuous c
memcpy(c + 4, acc1, 4 * sizeof(float));
memcpy(c + 4 * 2, acc2, 4 * sizeof(float));
memcpy(c + 4 * 3, acc3, 4 * sizeof(float));
memcpy(c + 4 * 4, acc4, 4 * sizeof(float));
memcpy(c + 4 * 5, acc5, 4 * sizeof(float));
memcpy(c + 4 * 6, acc6, 4 * sizeof(float));
memcpy(c + 4 * 7, acc7, 4 * sizeof(float));
memcpy(c + 4 * 8, acc8, 4 * sizeof(float));
memcpy(c + 4 * 9, acc9, 4 * sizeof(float));
memcpy(c + 4 * 10, acc10, 4 * sizeof(float));
memcpy(c + 4 * 11, acc11, 4 * sizeof(float));
memcpy(c + 4 * 12, acc12, 4 * sizeof(float));
memcpy(c + 4 * 13, acc13, 4 * sizeof(float));
memcpy(c + 4 * 14, acc14, 4 * sizeof(float));
memcpy(c + 4 * 15, acc15, 4 * sizeof(float));
}
blockC += (h >> 2) * cStride;
for (; ih < h; ih++) {
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
float acc2 = initValue;
float acc3 = initValue;
float acc4 = initValue;
float acc5 = initValue;
float acc6 = initValue;
float acc7 = initValue;
float acc8 = initValue;
float acc9 = initValue;
float acc10 = initValue;
float acc11 = initValue;
float acc12 = initValue;
float acc13 = initValue;
float acc14 = initValue;
float acc15 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float a4 = a[4];
const float a5 = a[5];
const float a6 = a[6];
const float a7 = a[7];
const float a8 = a[8];
const float a9 = a[9];
const float a10 = a[10];
const float a11 = a[11];
const float a12 = a[12];
const float a13 = a[13];
const float a14 = a[14];
const float a15 = a[15];
const float oneW = *w++;
// MNN_PRINT("16-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-15]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {16});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
acc2 += a2 * oneW;
acc3 += a3 * oneW;
acc4 += a4 * oneW;
acc5 += a5 * oneW;
acc6 += a6 * oneW;
acc7 += a7 * oneW;
acc8 += a8 * oneW;
acc9 += a9 * oneW;
acc10 += a10 * oneW;
acc11 += a11 * oneW;
acc12 += a12 * oneW;
acc13 += a13 * oneW;
acc14 += a14 * oneW;
acc15 += a15 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
acc2 = std::max(std::min(maxValue, acc2), minValue);
acc3 = std::max(std::min(maxValue, acc3), minValue);
acc4 = std::max(std::min(maxValue, acc4), minValue);
acc5 = std::max(std::min(maxValue, acc5), minValue);
acc6 = std::max(std::min(maxValue, acc6), minValue);
acc7 = std::max(std::min(maxValue, acc7), minValue);
acc8 = std::max(std::min(maxValue, acc8), minValue);
acc9 = std::max(std::min(maxValue, acc9), minValue);
acc10 = std::max(std::min(maxValue, acc10), minValue);
acc11 = std::max(std::min(maxValue, acc11), minValue);
acc12 = std::max(std::min(maxValue, acc12), minValue);
acc13 = std::max(std::min(maxValue, acc13), minValue);
acc14 = std::max(std::min(maxValue, acc14), minValue);
acc15 = std::max(std::min(maxValue, acc15), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
c[4 * 2] = acc2;
c[4 * 3] = acc3;
c[4 * 4] = acc4;
c[4 * 5] = acc5;
c[4 * 6] = acc6;
c[4 * 7] = acc7;
c[4 * 8] = acc8;
c[4 * 9] = acc9;
c[4 * 10] = acc10;
c[4 * 11] = acc11;
c[4 * 12] = acc12;
c[4 * 13] = acc13;
c[4 * 14] = acc14;
c[4 * 15] = acc15;
}
a += aStride;
}
// const float* blockA = A + ie * l;
if (eSize & 0x08) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
size_t ih = 0;
for (; ih < (h & (~0x03)); ih += sparseBlockOC) {
auto ihPack = ih >> 2;
auto c = blockC + ihPack * cStride;
float initValue[4] = {0, 0, 0, 0};
if (nullptr != bias) {
memcpy(initValue, bias + ih, 4 * sizeof(float));
}
float acc0[4];
float acc1[4];
float acc2[4];
float acc3[4];
float acc4[4];
float acc5[4];
float acc6[4];
float acc7[4];
memcpy(acc0, initValue, 4 * sizeof(float));
memcpy(acc1, initValue, 4 * sizeof(float));
memcpy(acc2, initValue, 4 * sizeof(float));
memcpy(acc3, initValue, 4 * sizeof(float));
memcpy(acc4, initValue, 4 * sizeof(float));
memcpy(acc5, initValue, 4 * sizeof(float));
memcpy(acc6, initValue, 4 * sizeof(float));
memcpy(acc7, initValue, 4 * sizeof(float));
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float a4 = a[4];
const float a5 = a[5];
const float a6 = a[6];
const float a7 = a[7];
const float wv[4] = {*w++, *w++, *w++, *w++};
// MNN_PRINT("16-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-15]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {16});
// MNN_PRINT("\n");
a = a + diff;
for (int lane = 0; lane < 4; lane++) {
acc0[lane] += a0 * wv[lane];
acc1[lane] += a1 * wv[lane];
acc2[lane] += a2 * wv[lane];
acc3[lane] += a3 * wv[lane];
acc4[lane] += a4 * wv[lane];
acc5[lane] += a5 * wv[lane];
acc6[lane] += a6 * wv[lane];
acc7[lane] += a7 * wv[lane];
}
}
for (int lane = 0; lane < 4; lane++) {
acc0[lane] = std::max(std::min(maxValue, acc0[lane]), minValue);
acc1[lane] = std::max(std::min(maxValue, acc1[lane]), minValue);
acc2[lane] = std::max(std::min(maxValue, acc2[lane]), minValue);
acc3[lane] = std::max(std::min(maxValue, acc3[lane]), minValue);
acc4[lane] = std::max(std::min(maxValue, acc4[lane]), minValue);
acc5[lane] = std::max(std::min(maxValue, acc5[lane]), minValue);
acc6[lane] = std::max(std::min(maxValue, acc6[lane]), minValue);
acc7[lane] = std::max(std::min(maxValue, acc7[lane]), minValue);
}
memcpy(c, acc0, 4 * sizeof(float)); // store continuous c
memcpy(c + 4, acc1, 4 * sizeof(float));
memcpy(c + 4 * 2, acc2, 4 * sizeof(float));
memcpy(c + 4 * 3, acc3, 4 * sizeof(float));
memcpy(c + 4 * 4, acc4, 4 * sizeof(float));
memcpy(c + 4 * 5, acc5, 4 * sizeof(float));
memcpy(c + 4 * 6, acc6, 4 * sizeof(float));
memcpy(c + 4 * 7, acc7, 4 * sizeof(float));
}
blockC += (ih >> 2) * cStride;
for (; ih < h; ih++) {
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
float acc2 = initValue;
float acc3 = initValue;
float acc4 = initValue;
float acc5 = initValue;
float acc6 = initValue;
float acc7 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float a4 = a[4];
const float a5 = a[5];
const float a6 = a[6];
const float a7 = a[7];
const float oneW = *w++;
// MNN_PRINT("8-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-7]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {8});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
acc2 += a2 * oneW;
acc3 += a3 * oneW;
acc4 += a4 * oneW;
acc5 += a5 * oneW;
acc6 += a6 * oneW;
acc7 += a7 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
acc2 = std::max(std::min(maxValue, acc2), minValue);
acc3 = std::max(std::min(maxValue, acc3), minValue);
acc4 = std::max(std::min(maxValue, acc4), minValue);
acc5 = std::max(std::min(maxValue, acc5), minValue);
acc6 = std::max(std::min(maxValue, acc6), minValue);
acc7 = std::max(std::min(maxValue, acc7), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
c[4 * 2] = acc2;
c[4 * 3] = acc3;
c[4 * 4] = acc4;
c[4 * 5] = acc5;
c[4 * 6] = acc6;
c[4 * 7] = acc7;
}
ie += 8;
a += 8;
}
if (eSize & 0x04) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// const float* a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
size_t ih = 0;
for (; ih < (h & (~0x03)); ih += sparseBlockOC) {
auto ihPack = ih >> 2;
auto c = blockC + ihPack * cStride;
float initValue[4] = {0, 0, 0, 0};
if (nullptr != bias) {
memcpy(initValue, bias + ih, 4 * sizeof(float));
}
float acc0[4];
float acc1[4];
float acc2[4];
float acc3[4];
memcpy(acc0, initValue, 4 * sizeof(float));
memcpy(acc1, initValue, 4 * sizeof(float));
memcpy(acc2, initValue, 4 * sizeof(float));
memcpy(acc3, initValue, 4 * sizeof(float));
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float wv[4] = {*w++, *w++, *w++, *w++};
// MNN_PRINT("16-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-15]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {16});
// MNN_PRINT("\n");
a = a + diff;
for (int lane = 0; lane < 4; lane++) {
acc0[lane] += a0 * wv[lane];
acc1[lane] += a1 * wv[lane];
acc2[lane] += a2 * wv[lane];
acc3[lane] += a3 * wv[lane];
}
}
for (int lane = 0; lane < 4; lane++) {
acc0[lane] = std::max(std::min(maxValue, acc0[lane]), minValue);
acc1[lane] = std::max(std::min(maxValue, acc1[lane]), minValue);
acc2[lane] = std::max(std::min(maxValue, acc2[lane]), minValue);
acc3[lane] = std::max(std::min(maxValue, acc3[lane]), minValue);
}
memcpy(c, acc0, 4 * sizeof(float)); // store continuous c
memcpy(c + 4, acc1, 4 * sizeof(float));
memcpy(c + 4 * 2, acc2, 4 * sizeof(float));
memcpy(c + 4 * 3, acc3, 4 * sizeof(float));
}
blockC += (ih >> 2) * cStride;
for (; ih < h; ih++) {
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
float acc2 = initValue;
float acc3 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float a2 = a[2];
const float a3 = a[3];
const float oneW = *w++;
// MNN_PRINT("4-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-3]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {4});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
acc2 += a2 * oneW;
acc3 += a3 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
acc2 = std::max(std::min(maxValue, acc2), minValue);
acc3 = std::max(std::min(maxValue, acc3), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
c[4 * 2] = acc2;
c[4 * 3] = acc3;
}
ie += 4;
a += 4;
}
if (eSize & 0x02) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// const float* a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
size_t ih = 0;
for (; ih < (h & (~0x03)); ih += sparseBlockOC) {
auto ihPack = ih >> 2;
auto c = blockC + ihPack * cStride;
float initValue[4] = {0, 0, 0, 0};
if (nullptr != bias) {
memcpy(initValue, bias + ih, 4 * sizeof(float));
}
float acc0[4];
float acc1[4];
memcpy(acc0, initValue, 4 * sizeof(float));
memcpy(acc1, initValue, 4 * sizeof(float));
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float wv[4] = {*w++, *w++, *w++, *w++};
// MNN_PRINT("16-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-15]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {16});
// MNN_PRINT("\n");
a = a + diff;
for (int lane = 0; lane < 4; lane++) {
acc0[lane] += a0 * wv[lane];
acc1[lane] += a1 * wv[lane];
}
}
for (int lane = 0; lane < 4; lane++) {
acc0[lane] = std::max(std::min(maxValue, acc0[lane]), minValue);
acc1[lane] = std::max(std::min(maxValue, acc1[lane]), minValue);
}
memcpy(c, acc0, 4 * sizeof(float)); // store continuous c
memcpy(c + 4, acc1, 4 * sizeof(float));
}
blockC += (ih >> 2) * cStride;
for (; ih < h; ih++) {
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
float acc1 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float a1 = a[1];
const float oneW = *w++;
// MNN_PRINT("2-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-1]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {2});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
acc1 += a1 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
acc1 = std::max(std::min(maxValue, acc1), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
c[4] = acc1;
}
ie += 2;
a += 2;
}
if (eSize & 0x01) {
const int* dataOffset = dataOffsetMap;
const int diff = *dataOffset++;
// const float* a = blockA + diff;
a += diff;
const float* w = B;
float* blockC = C + (ie << 2);
const unsigned int* nnz = NNZMap;
size_t ih = 0;
for (; ih < (h & (~0x03)); ih += sparseBlockOC) {
auto ihPack = ih >> 2;
auto c = blockC + ihPack * cStride;
float initValue[4] = {0, 0, 0, 0};
if (nullptr != bias) {
memcpy(initValue, bias + ih, 4 * sizeof(float));
}
float acc0[4];
memcpy(acc0, initValue, 4 * sizeof(float));
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float wv[4] = {*w++, *w++, *w++, *w++};
// MNN_PRINT("16-loop: ie:%zu, a offset:%ld, w offset:%ld, c offset:%ld, w value:%f, a value[0-15]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {16});
// MNN_PRINT("\n");
a = a + diff;
for (int lane = 0; lane < 4; lane++) {
acc0[lane] += a0 * wv[lane];
}
}
for (int lane = 0; lane < 4; lane++) {
acc0[lane] = std::max(std::min(maxValue, acc0[lane]), minValue);
}
memcpy(c, acc0, 4 * sizeof(float)); // store continuous c
}
blockC += (ih >> 2) * cStride;
for (; ih < h; ih++) {
auto ihSubIndex = ih & 0x03;
auto c = blockC + ihSubIndex;
const float initValue = nullptr != bias ? bias[ih] : 0;
float acc0 = initValue;
const int lElement = *nnz++;
for (auto il = 0; il < lElement; il++) {
const int diff = *dataOffset++;
const float a0 = a[0];
const float oneW = *w++;
// MNN_PRINT("1-loop: ie:%zu, a offset:%ld, c offset:%ld, w offset:%ld, w value:%f, a value[0]:", ie, a - A, w - B - 1, c - C, oneW);
// formatMatrix(a, {1});
// MNN_PRINT("\n");
a = a + diff;
acc0 += a0 * oneW;
}
acc0 = std::max(std::min(maxValue, acc0), minValue);
// how to store faster: st4 / transpose /
c[0] = acc0;
}
ie += 1;
// a += 1;
}
return;
}
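// Epx4 differs from Epx1 only in weight granularity: full groups of four
// output rows (sparseBlockOC == 4) share one NNZMap entry and consume four
// weights per non-zero, while the h % 4 tail falls back to the
// one-weight-per-row scheme of Epx1.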
#endif
#ifndef MNN_USE_SSE
#ifndef MNN_USE_NEON
void MNNTranspose32Bit(int32_t* dstO, const int32_t* srcO, int32_t* dim) {
int w = dim[0];
int h = dim[1];
int srcStride = dim[2];
int dstStride = dim[3];
for (int i=0; i<h; ++i) {
auto si = srcO + i;
auto di = dstO + i * dstStride;
for (int j=0; j<w; ++j) {
auto sj = si + j * srcStride;
auto dj = di + j;
*dj = *sj;
}
}
}
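// Usage sketch (illustrative): dim = {w, h, srcStride, dstStride} transposes
// a w x h source (row stride srcStride) into an h x w destination (row
// stride dstStride):
//   int32_t src[8 * 4], dst[4 * 8];
//   int32_t dim[4] = {8, 4, /*srcStride*/ 4, /*dstStride*/ 8};
//   MNNTranspose32Bit(dst, src, dim);  // dst[i][j] = src[j][i]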
void MNNTranspose16Bit(int16_t* dstO, const int16_t* srcO, int32_t* dim) {
int w = dim[0];
int h = dim[1];
int srcStride = dim[2];
int dstStride = dim[3];
for (int i=0; i<h; ++i) {
auto si = srcO + i;
auto di = dstO + i * dstStride;
for (int j=0; j<w; ++j) {
auto sj = si + j * srcStride;
auto dj = di + j;
*dj = *sj;
}
}
}
#endif
void MNNFunctionInit() {
// Do nothing
}
#endif
#ifdef MNN_USE_NEON
#include <arm_neon.h>
#endif
#define UNIT 4
using Vec4 = MNN::Math::Vec<float, 4>;
#ifndef MNN_USE_NEON
#ifndef MNN_USE_SSE
void MNNCopyC4WithStride(const float* source, float* dest, size_t srcStride, size_t dstStride, size_t count) {
for (int i = 0; i < count; ++i) {
auto s = source + i * srcStride;
auto d = dest + i * dstStride;
for (int j = 0; j < 4; ++j) {
d[j] = s[j];
}
}
}
void MNNAddC4WithStride(const float* source, float* dest, size_t srcStride, size_t dstStride, size_t count) {
for (int i = 0; i < count; ++i) {
auto s = source + i * srcStride;
auto d = dest + i * dstStride;
for (int j = 0; j < 4; ++j) {
d[j] += s[j];
}
}
}
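// Both helpers above move `count` groups of four floats; srcStride and
// dstStride are element strides between consecutive groups (Copy overwrites
// the destination, Add accumulates into it).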
void MNNReluWithSlopeChannel(float* dst, const float* src, const float* slope, size_t sizeQuad, size_t depthQuad) {
for (int j = 0; j < depthQuad; j++) {
const float* slopeZ = slope + 4 * j;
const float* srcZ = src + 4 * j * sizeQuad;
float* dstZ = dst + 4 * j * sizeQuad;
for (int i = 0; i < sizeQuad; i++) {
for (int c = 0; c < 4; c++) {
if (srcZ[4 * i + c] < 0) {
dstZ[4 * i + c] = srcZ[4 * i + c] * slopeZ[c];
} else {
dstZ[4 * i + c] = srcZ[4 * i + c];
}
}
}
}
}
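// Per-channel PReLU on C4 data: dst[c] = src[c] >= 0 ? src[c] : src[c] * slope[c],
// with one slope value per channel lane inside each 4-channel block.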
// Pack an area x depth plane (channel-major) into the NC4HW4 layout, grouping
// channels by 4; delegates to the shared template.
void MNNPackC4(float* dst, const float* src, size_t area, size_t depth, int* areaOffset) {
    MNNPackC4Common<float>(dst, src, area, depth, areaOffset);
}
// Unpack NC4HW4 back into the area x depth channel-major layout; delegates to
// the shared template.
void MNNUnpackC4(float* dst, const float* src, size_t area, size_t depth, int* areaOffset) {
    MNNUnpackC4Common<float>(dst, src, area, depth, areaOffset);
}
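// Illustrative usage sketch (not part of the library; the guard macro below is
// hypothetical). Round-trips a small depth-3 tensor through the NC4HW4 layout:
// the pack pads the channel group up to 4, and areaOffset is assumed to carry
// identical {src, dst} per-plane strides here.
#ifdef MNN_COMMONOPT_USAGE_SKETCHES
static void examplePackUnpackC4() {
    const int area = 2, depth = 3;
    float nchw[6]     = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f}; // 3 channels x 2 plane elements
    float nc4hw4[8]   = {0};                            // one group of 4 channels, padded
    float back[6]     = {0};
    int areaOffset[2] = {area, area};
    MNNPackC4(nc4hw4, nchw, area, depth, areaOffset);
    MNNUnpackC4(back, nc4hw4, area, depth, areaOffset);
    // back should now equal nchw; the padded 4th channel is ignored on unpack.
}
#endif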
void MNNExpC8(float* dest, const float* source, float* offset, const float* parameters, size_t countC8) {
    auto count = countC8 * 8;
    auto param = parameters[0];
    float xLimit = 87;
    float summer = offset[3];
    for (int i = 0; i < count; ++i) {
        auto x = source[i] * offset[0] + offset[2];
        x = ALIMAX(x, -xLimit);
        x = ALIMIN(x, xLimit);
        // Split x = div * ln2 + remainder; 2^div is built directly in the float exponent bits.
        int div = (x * parameters[1]);
        int div2 = (div + 127) << 23;
        auto xRemain = x - div * param;
        float expBasic = *(float*)(&div2);
        // exp(r) = (exp(r/4))^4: evaluate the polynomial on r/4, then square twice.
        auto t = xRemain * 0.25f;
        auto expRemain =
            ((((parameters[7] * t + parameters[6]) * t + parameters[5]) * t + parameters[4]) * t + 1.0f) * t +
            1.0f;
        expRemain = expRemain * expRemain;
        expRemain = expRemain * expRemain;
        dest[i] = expBasic * expRemain + offset[1];
        summer += dest[i];
    }
    offset[3] = summer;
}
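// Illustrative usage sketch (not part of the library; the guard macro is
// hypothetical). The parameter table is an assumption inferred from the math
// above -- parameters[0] = ln2, parameters[1] = 1/ln2, parameters[4..7] = Taylor
// coefficients of exp -- the real tables are built by MNNExpC8's callers.
#ifdef MNN_COMMONOPT_USAGE_SKETCHES
static void exampleExpC8Usage() {
    float src[8] = {-2.f, -1.f, -0.5f, 0.f, 0.5f, 1.f, 2.f, 3.f};
    float dst[8];
    const float parameters[8] = {0.6931472f, 1.4426950f, 0.f, 0.f,
                                 1.0f / 2, 1.0f / 6, 1.0f / 24, 1.0f / 120};
    // offset = {input scale, output bias, input bias, running sum (in/out)}.
    float offset[4] = {1.0f, 0.0f, 0.0f, 0.0f};
    MNNExpC8(dst, src, offset, parameters, 1); // countC8 = 1 -> 8 elements
    // dst[i] ~= expf(src[i]); offset[3] now holds the sum of the outputs.
}
#endif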
void MNNSoftmax(float* dest, const float* source, size_t size) {
    // Assumes size >= 2: the running max is seeded from the first two elements.
    float maxValue = ALIMAX(source[0], source[1]);
for (int i = 2; i < size; ++i) {
maxValue = ALIMAX(maxValue, source[i]);
}
float xLimit = 87, param = 0.6931471805599453, sumValue = 0.f;
for (int i = 0; i < size; ++i) {
auto x = source[i] - maxValue;
x = x > -xLimit ? x : -xLimit;
x = x < xLimit ? x : xLimit;
int div = (x / param);
int div2 = (div + 127) << 23;
        auto xRemain = x - div * param;
        float expBasic = *(float*)(&div2);
        auto t = xRemain;
auto expRemain = ((((1.0f / 120 * t + 1.0f / 24) * t + 1.0f / 6) * t + 0.5f) * t + 1.0f) * t + 1.0f;
dest[i] = expBasic * expRemain;
sumValue += dest[i];
}
sumValue = 1.f / sumValue;
for (int i = 0; i < size; ++i) {
dest[i] *= sumValue;
}
}
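// Illustrative usage sketch (not part of the library; hypothetical guard macro):
// MNNSoftmax writes normalized probabilities over one contiguous float vector.
#ifdef MNN_COMMONOPT_USAGE_SKETCHES
static void exampleSoftmaxUsage() {
    float logits[4] = {1.f, 2.f, 3.f, 4.f};
    float probs[4];
    MNNSoftmax(probs, logits, 4);
    // probs sums to ~1.0f and preserves the ordering of the logits.
}
#endif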
void MNNReluInt8(int8_t* dst, const int8_t* src, size_t size, ssize_t zeroPoint) {
    for (int i = 0; i < size; ++i) {
        if (src[i] < zeroPoint) {
            dst[i] = zeroPoint;
        } else {
            dst[i] = src[i];
        }
    }
}
#endif // no MNN_USE_SSE
void MNNMaxFloat(float* input, float* maxBuffer, int32_t inputCountUnit) {
for (int i = 0; i < inputCountUnit; i++) {
for (int j = 0; j < UNIT; j++) {
for (int m = 0; m < 2; m++) {
maxBuffer[j] = std::max(input[i * UNIT * 2 + j * 2 + m], maxBuffer[j]);
}
}
}
}
void MNNMinFloat(float* input, float* minBuffer, int32_t inputCountUnit) {
for (int i = 0; i < inputCountUnit; i++) {
for (int j = 0; j < UNIT; j++) {
for (int m = 0; m < 2; m++) {
minBuffer[j] = std::min(input[i * UNIT * 2 + j * 2 + m], minBuffer[j]);
}
}
}
}
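// Illustrative usage sketch (not part of the library; hypothetical guard macro).
// MNNMaxFloat / MNNMinFloat fold interleaved pairs into per-lane running buffers;
// UNIT is assumed to be 4 here, and the buffers must be seeded before the call.
#ifdef MNN_COMMONOPT_USAGE_SKETCHES
static void exampleMinMaxUsage() {
    float input[8] = {1.f, -2.f, 3.f, -4.f, 5.f, -6.f, 7.f, -8.f}; // one unit: UNIT * 2 floats
    float maxBuffer[4] = {input[0], input[0], input[0], input[0]};
    float minBuffer[4] = {input[0], input[0], input[0], input[0]};
    MNNMaxFloat(input, maxBuffer, 1); // inputCountUnit = 1
    MNNMinFloat(input, minBuffer, 1);
    // maxBuffer[j] / minBuffer[j] hold the max/min of the seed and the pair
    // input[j * 2], input[j * 2 + 1].
}
#endif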
// Per-channel-group (C4) affine transform: dst = src * alpha + bias over each plane.
void MNNScaleAndAddBias(float* dst, const float* src, const float* bias, const float* alpha, size_t planeNumber,
                        size_t biasNumber) {
    for (int z = 0; z < biasNumber; ++z) {
        float* dstZ = dst + planeNumber * 4 * z;
        const float* srcZ = src + planeNumber * 4 * z;
        auto biasZ = Vec4::load(bias + 4 * z);
        auto alphaZ = Vec4::load(alpha + 4 * z);
        for (int p = 0; p < planeNumber; ++p) {
            float* dstX = dstZ + 4 * p;
            const float* srcX = srcZ + 4 * p;
            Vec4::save(dstX, (Vec4::load(srcX) * alphaZ) + biasZ);
        }
    }
}
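// Illustrative usage sketch (not part of the library; hypothetical guard macro):
// one C4 channel group (biasNumber = 1) scaled and biased over two plane positions.
#ifdef MNN_COMMONOPT_USAGE_SKETCHES
static void exampleScaleAndAddBiasUsage() {
    float src[8]   = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f, 8.f};
    float dst[8];
    float alpha[4] = {2.f, 2.f, 2.f, 2.f};
    float bias[4]  = {1.f, 1.f, 1.f, 1.f};
    MNNScaleAndAddBias(dst, src, bias, alpha, 2, 1);
    // dst = {3, 5, 7, 9, 11, 13, 15, 17}: each Vec4 is scaled, then biased.
}
#endif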
void MNNUInt8ToInt16WithOffsetC4Common(int16_t* dst, const uint8_t* src, size_t zeroPoint, size_t sizeQuad,
size_t dstStride, size_t srcStride) {
dstStride /= sizeof(int16_t);
srcStride /= sizeof(uint8_t);
for (int z = 0; z < sizeQuad; ++z) {
auto dstZ = dst + dstStride * z;
auto srcZ = src + srcStride * z;
for (int j = 0; j < 4; ++j) {
dstZ[j] = (int16_t)((int32_t)srcZ[j] - (int32_t)zeroPoint);
}
}
}
void MNNUInt8ToInt16WithOffsetC4Fast(int16_t* colAddr, const uint8_t* srcStart, size_t zeroPoint, size_t sizeQuad,
size_t depthQuad, size_t dstZStep, size_t srcZStep) {
dstZStep /= sizeof(int16_t);
srcZStep /= sizeof(uint8_t);
for (int sz = 0; sz < depthQuad; ++sz) {
auto dstZ = colAddr + sz * dstZStep;
auto srcZ = srcStart + sz * srcZStep;
MNNUInt8ToInt16WithOffsetC4Common(dstZ, srcZ, zeroPoint, sizeQuad, 4 * sizeof(int16_t), 4 * sizeof(uint8_t));
}
}
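// Illustrative usage sketch (not part of the library; hypothetical guard macro):
// strides are passed in bytes and the zero point is subtracted element-wise.
#ifdef MNN_COMMONOPT_USAGE_SKETCHES
static void exampleUInt8ToInt16Usage() {
    uint8_t src[4] = {0, 1, 128, 255};
    int16_t dst[4];
    MNNUInt8ToInt16WithOffsetC4Common(dst, src, 128, 1, 4 * sizeof(int16_t), 4 * sizeof(uint8_t));
    // dst = {-128, -127, 0, 127}
}
#endif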
void MNNPowC8(float* dest, const float* source, const float* powfParam, size_t betaInt, size_t countC8) {
    const int count = countC8 * 8;
    const float powfConstant = powfParam[6];
    for (int i = 0; i < count; ++i) {
        // Integer part of the exponent: multiply by 1/x betaInt times.
        float result = 1, x, xInv = 1 / source[i];
        for (int j = 0; j < betaInt; result *= xInv, ++j)
            ;
        // Range-reduce x below 1.25 by repeated division by 1.5, folding the
        // caller-supplied constant (powfParam[6]) into the result at each step.
        for (x = source[i]; x >= 1.25; x /= 1.5, result *= powfConstant)
            ;
        // Caller-supplied polynomial in (x - 1) finishes the fractional power.
        float t = x - 1;
        float powRemain =
            powfParam[0] +
            t * (powfParam[1] + t * (powfParam[2] + t * (powfParam[3] + t * (powfParam[4] + t * powfParam[5]))));
        result *= powRemain;
        dest[i] = result;
    }
}
#endif // no MNN_USE_NEON
void MNNGridSampleComputeCord(float* dst, const float* src, size_t inH, size_t inW, size_t outH, size_t outW, bool alignCorners) {
    // Map normalized coords in [-1, 1] to pixel coords:
    // alignCorners: out = (x + 1) * (size - 1) / 2; otherwise: out = ((x + 1) * size - 1) / 2.
    float a = alignCorners ? 1.0f : 0.0f;
    float b = alignCorners ? 0.0f : 1.0f;
    int area = outH * outW;
    float kx = 0.5f * ((float)inW - a);
    float bx = 0.5f * ((float)inW - a - b);
    float ky = 0.5f * ((float)inH - a);
    float by = 0.5f * ((float)inH - a - b);
    for (int w = 0; w < area; ++w) {
        auto x = src[2 * w + 0];
        auto y = src[2 * w + 1];
        dst[2 * w + 0] = kx * x + bx;
        dst[2 * w + 1] = ky * y + by;
    }
}
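// Illustrative usage sketch (not part of the library; hypothetical guard macro):
// maps the normalized corner coordinates of a 4x4 input to pixel coordinates.
#ifdef MNN_COMMONOPT_USAGE_SKETCHES
static void exampleGridSampleCordUsage() {
    float grid[8] = {-1.f, -1.f, 1.f, -1.f, -1.f, 1.f, 1.f, 1.f}; // four (x, y) points
    float cords[8];
    MNNGridSampleComputeCord(cords, grid, 4, 4, 1, 4, true);
    // With alignCorners = true: (-1, -1) -> (0, 0) and (1, 1) -> (3, 3).
}
#endif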
void MNNGridSampleComputeCord3D(float* dst, const float* src, size_t inD, size_t inH, size_t inW, size_t outD, size_t outH, size_t outW, bool alignCorners) {
    int strideD = outH * outW * 3;
    int strideH = outW * 3;
    float a = alignCorners ? 1.0f : 0.0f;
    float b = alignCorners ? 0.0f : 1.0f;
    int area = outD * outH * outW;
    float kx = 0.5f * ((float)inW - a);
    float bx = 0.5f * ((float)inW - a - b);
    float ky = 0.5f * ((float)inH - a);
    float by = 0.5f * ((float)inH - a - b);
    float kz = 0.5f * ((float)inD - a);
    float bz = 0.5f * ((float)inD - a - b);
    for (int w = 0; w < area; ++w) {
        auto x = src[3 * w + 0];
        auto y = src[3 * w + 1];
        auto z = src[3 * w + 2];
        dst[3 * w + 0] = kx * x + bx;
        dst[3 * w + 1] = ky * y + by;
        dst[3 * w + 2] = kz * z + bz;
    }
}
#ifndef MNN_USE_SSE
void MNNNorm(float *dst, const float *src, const float *gamma, const float *beta, float epsilon, size_t size, bool RMSNorm) {
    float mean = 0;
    // RMSNorm skips mean subtraction; LayerNorm centers the data first.
    if (false == RMSNorm) {
        float sum = 0.f;
        for (int j = 0; j < size; ++j) {
            sum += src[j];
        }
        mean = sum / size;
}
float square_sum = 0.f;
for (int j = 0; j < size; ++j) {
square_sum += (src[j] - mean) * (src[j] - mean);
}
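// Derive the normalization scale 1.f / sqrt(square_sum / size + epsilon).
// The aarch64 path evaluates it with NEON intrinsics on a two-lane vector
// (only lane 0 is consumed); other targets use the scalar fallback.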
#ifdef __aarch64__
auto vs = vadd_f32(vdiv_f32(vdup_n_f32(square_sum), vdup_n_f32((float)size)), vdup_n_f32(epsilon));
auto vecs = vdiv_f32(vdup_n_f32(1.0f), vsqrt_f32(vs));
float vars[2];
vst1_f32(vars, vecs);
float variable = vars[0];
#else
float variable = square_sum / size;
variable = 1.f / std::sqrt(variable + epsilon);
#endif
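// Scale each element by the inverse standard deviation; apply the affine
// transform with gamma/beta only when both are provided.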
if (gamma && beta) {
for (int j = 0; j < size; ++j) {
dst[j] = (src[j] - mean) * variable * gamma[j] + beta[j];
}
} else {
for (int j = 0; j < size; ++j) {
dst[j] = (src[j] - mean) * variable;
}
}
}
#endif
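// ROI max pooling for one C4-packed bin: scan an hLen x wLen window of the
// source feature map (iw is the source row stride in pixels, UNIT channels
// per pixel) and write the per-channel maximum to dst.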
void MNNRoiPoolingMax(float* dst, const float* src, int hLen, int wLen, int iw) {
Vec4 max = Vec4(-FLT_MAX);
for (int h = 0; h < hLen; h++, src += iw * UNIT) {
for (int w = 0; w < wLen; w++) {
Vec4 in = Vec4::load(src + w * UNIT);
max = Vec4::max(max, in);
}
}
Vec4::save(dst, max);
}
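// ROI align (max): each output cell is the maximum over samplingRatioArea
// bilinear samples. For every sample, vecPos stores the four neighbour pixel
// offsets and vecArea the matching bilinear weights, so the sample value is
// the weighted sum accumulated via fused multiply-add.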
void MNNRoiAlignMax(float* dst, const float* src, const std::vector<std::vector<int>> &vecPos, const std::vector<std::vector<float>> &vecArea, int samplingRatioArea, int pooledHeight, int pooledWidth) {
for (int h = 0; h < pooledHeight; ++h, dst += pooledWidth * UNIT) {
int preCalcIdx = h * pooledWidth * samplingRatioArea;
for (int w = 0; w < pooledWidth; ++w) {
Vec4 res = Vec4(-FLT_MAX);
for (int i = 0; i < samplingRatioArea; ++i) {
const std::vector<int>& pos = vecPos[preCalcIdx];
const std::vector<float>& area = vecArea[preCalcIdx];
Vec4 val0 = Vec4::load(src + pos[0] * UNIT);
Vec4 val1 = Vec4::load(src + pos[1] * UNIT);
Vec4 val2 = Vec4::load(src + pos[2] * UNIT);
Vec4 val3 = Vec4::load(src + pos[3] * UNIT);
Vec4 mla = val0 * area[0];
mla = Vec4::fma(mla, val1, area[1]);
mla = Vec4::fma(mla, val2, area[2]);
mla = Vec4::fma(mla, val3, area[3]);
res = Vec4::max(res, mla);
preCalcIdx++;
}
Vec4::save(dst + w * UNIT, res);
}
}
}
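// ROI align (average): same bilinear sampling as MNNRoiAlignMax, but the
// samples are summed and scaled by 1 / samplingRatioArea.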
void MNNRoiAlignAvg(float* dst, const float* src, const std::vector<std::vector<int>> &vecPos, const std::vector<std::vector<float>> &vecArea, int samplingRatioArea, int pooledHeight, int pooledWidth) {
float invSamplingCnt = 1.f / samplingRatioArea;
for (int h = 0; h < pooledHeight; ++h, dst += pooledWidth * UNIT) {
int preCalcIdx = h * pooledWidth * samplingRatioArea;
for (int w = 0; w < pooledWidth; ++w) {
Vec4 res = Vec4(0.f);
for (int i = 0; i < samplingRatioArea; ++i) {
const std::vector<int>& pos = vecPos[preCalcIdx];
const std::vector<float>& area = vecArea[preCalcIdx];
Vec4 val0 = Vec4::load(src + pos[0] * UNIT);
Vec4 val1 = Vec4::load(src + pos[1] * UNIT);
Vec4 val2 = Vec4::load(src + pos[2] * UNIT);
Vec4 val3 = Vec4::load(src + pos[3] * UNIT);
Vec4 mla = val0 * area[0];
mla = Vec4::fma(mla, val1, area[1]);
mla = Vec4::fma(mla, val2, area[2]);
mla = Vec4::fma(mla, val3, area[3]);
res += mla;
preCalcIdx++;
}
res = res * invSamplingCnt;
Vec4::save(dst + w * UNIT, res);
}
}
}
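// C4 (un)packing for uint8 tensors: both functions delegate to the templated
// MNNPackC4Common / MNNUnpackC4Common helpers, which convert between planar
// layout and the 4-channel-interleaved C4 layout used by the CPU backend.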
void MNNPackC4Uint8(uint8_t* dst, const uint8_t* src, size_t area, size_t depth, int* areaOffset) {
MNNPackC4Common(dst, src, area, depth, areaOffset);
}
void MNNUnpackC4Uint8(uint8_t* dst, const uint8_t* src, size_t area, size_t depth, int* areaOffset) {
MNNUnpackC4Common(dst, src, area, depth, areaOffset);
}
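// A minimal round-trip sketch (hypothetical values; both areaOffset entries
// are plane strides, set equal to `area` here so source and destination are
// densely packed):
//   uint8_t planar[3 * 4];                 // depth = 3, area = 4, planar layout
//   uint8_t packed[4 * 4];                 // depth rounded up to one C4 block
//   uint8_t restored[3 * 4];
//   int offsets[2] = {4, 4};
//   MNNPackC4Uint8(packed, planar, 4, 3, offsets);
//   MNNUnpackC4Uint8(restored, packed, 4, 3, offsets);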
void MNNUnpackTransposeUint8(uint8_t* dst, const uint8_t* src, size_t area,size_t depth, int* areaOffset) {
if (depth == 4) {
::memcpy(dst, src, area * depth * sizeof(uint8_t));
return;
}
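#ifdef MNN_USE_NEON
    // NEON fast paths: widen 3-channel (RGB -> RGBA with zero alpha) and 1-channel
    // data 16 pixels at a time; the scalar loops below each path handle area % 16.
#endif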
#ifdef MNN_USE_NEON
if (depth == 3) {
uint8x16x4_t rgba;
rgba.val[3] = vdupq_n_u8(0);
int sta = 0;
int staC16 = (int)area / 16;
for (int i = 0; i < staC16; sta += 16, ++i) {
auto rgb = vld3q_u8(src + sta * 3);
rgba.val[0] = rgb.val[0];
rgba.val[1] = rgb.val[1];
rgba.val[2] = rgb.val[2];
vst4q_u8(dst + 4 * sta, rgba);
}
sta = staC16 * 16;
for (; sta < area; ++sta) {
auto s = src + sta * 3;
auto d = dst + sta * 4;
d[0] = s[0];
d[1] = s[1];
d[2] = s[2];
d[3] = 0;
}
return;
}
if (depth == 1) {
uint8x16x4_t rgba;
rgba.val[1] = vdupq_n_u8(0);
rgba.val[2] = vdupq_n_u8(0);
rgba.val[3] = vdupq_n_u8(0);
        int sta = 0;
        int staC16 = (int)area / 16;
        // Process 16 pixels per iteration; the scalar loop below handles the tail.
        for (int i = 0; i < staC16; sta += 16, ++i) {
rgba.val[0] = vld1q_u8(src + sta);
vst4q_u8(dst + 4 * sta, rgba);
}
for (; sta < area; ++sta) {
auto s = src + sta;
auto d = dst + sta * 4;
d[0] = s[0];
d[1] = 0;
d[2] = 0;
d[3] = 0;
}
return;
}
#endif
int c = (int)depth;
int cDiv4 = c / 4;
int cAlign = cDiv4 * 4;
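    // Generic path: whole C4 blocks are copied as single 32-bit words when depth is a
    // multiple of 4, otherwise byte by byte; leftover channels are zero-padded below.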
if (cAlign == c) {
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = reinterpret_cast<const int32_t*>(src + hi * c);
auto dstHeight = reinterpret_cast<int32_t*>(dst + hi * 4);
for (int ci = 0; ci < cDiv4; ++ci) {
dstHeight[ci * areaOffset[1]] = srcHeight[ci];
}
}
return;
} else {
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = src + hi * c;
auto dstHeight = dst + hi * 4;
for (int ci = 0; ci < cDiv4; ++ci) {
dstHeight[ci * areaOffset[1] * 4 + 0] = srcHeight[ci * 4 + 0];
dstHeight[ci * areaOffset[1] * 4 + 1] = srcHeight[ci * 4 + 1];
dstHeight[ci * areaOffset[1] * 4 + 2] = srcHeight[ci * 4 + 2];
dstHeight[ci * areaOffset[1] * 4 + 3] = srcHeight[ci * 4 + 3];
}
}
}
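    // Handle the trailing depth % 4 channels: zero the last C4 block, then copy the
    // leftover channels into its low lanes.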
    int cRemain = c - cAlign;
auto srcAlign = src + cAlign;
auto dstAlign = dst + areaOffset[1] * cAlign;
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = srcAlign + hi * c;
auto dstHeight = dstAlign + hi * 4;
for (int i = 0; i < 4; ++i) {
dstHeight[i] = 0;
}
        for (int ci = 0; ci < cRemain; ++ci) {
dstHeight[ci] = srcHeight[ci];
}
}
}
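// Float variant: converts channel-last (NHWC) data into NC4HW4; areaOffset[0] and
// areaOffset[1] are the source and destination plane strides, respectively.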
[MNN:Sync] Sync internal github Commits: 8148ae75c 弗人 bugfix 14cb8ec7f 弗人 [Converter:Bugfix] bugfix for onnx depthwise convtranspose 476fbcd90 雁行 [MNN:Feature] Open AVX cast and bugfix for contentCFG. 5e26b9fd3 雁行 [Test:Feature] Add android test. 37e147b25 雁行 [MNN:Bugfix] Bugfix for floordiv. 144c185f5 tianbu.xsw hangxing fix hiai b4fd429d6 tianbu.xsw updateCacheFile bugfix -- update cache size d4ba572a8 雁行 [MNN:Bugfix] Support int8 in AVX2 and some Bugfix. 43061f07e xiaying [MNN:Bugfix] Fix bug for module mode run part of model 398cc5ab6 tianhang.yth refactor demo 736380600 xiaying [Express:Bugfix] Fix memory leak for copy branch b8dab0a27 tianhang.yth MNNFloat2Int8 sizeQuad=0 crash fix 94b95bfed ghz [BugFix]1.Better method for fast pack valid check 6a921f85e xiaying [Converter:Bugfix] Fix bug for Fuseconsttosubgraph 5f77ae889 tianhang.yth numThread bugfix a807ef879 tianhang.yth add createSession(configs, runtimeinfo) API, add pymnn demo, pymnn logcat bugfix ad05409d3 xiaying [MNN:Bugfix] Fix bug for StaticModule's sizecompute overflow, add error print for module mode 9d81b8299 xiaying [MNN:Bugfix] Fix bug for Unique op for output size = 1 03b15e9af xiaying [Test:Feature] Add MatMulBConst Test, Fix bug for single Convert c944a76ee tianhang.yth add auto backend and getSessionInfo @tianbu 91fa7267b ghz [BugFix]1.fix the error in eP check bf0041f77 ghz [BugFix]1.Fix the logic error in eP check. 2.Fix the sp align error 693871672 雁行 [CPU:Bugfix] rm adrp instruction for clang compiler bug. 1b8f6b3d8 ghz 1.Fix the wronly use of r13 in arm32 version. 2.Fix the missing callee register save and restore process. feb7ecc4c 弗人 modify log of python offline quant 040c04811 ghz [BufFix]1.replace platform-related regs. 2.fix the same problem in arm32 version 609f37db8 弗人 add log for python quant, python convert 5511dd30a ghz [BugFix]1.Add testcases in SparseConv to check all functional code branch. 2. Fix the bug in "MNNPackC4ForMatMul_A.S" in arm64, which is caused by the missing check of eReal parameter. a93ff9280 tianhang.yth add tf.Unique op support 9729ff773 allen.lk [Bugfix] Fix one arm32 instruction syntax that clang works but gcc DOES NOT work. use index instruction instead. 297c1ad14 雁行 [Expr:Bugfix] bugfix for tensor content used by shape compute. ef8c369e3 弗人 catch exception 07c2dd670 弗人 add dependence to setup, base64 encode url, add time log 177e590c1 弗人 [Python:Feature] add aliyun log for python quant tool 40a7928cf allen.lk [Debug:Sparse] 1.Add group parameter in torchscript converter. 2. Stop split running to avoid memory corruption when check failed in TransformGroupConvolution 3. fix Op split issue in TransformGroupConvolution 3bdea84a1 allen.lk [Debug:Sparse] Fix and warning one kind of segmentfault cause by memory corruption when resize ConvolutionWinograd. Avoid to use some registers as arm restriction. c3c6fbdbd allen.lk [Debug:Sparse] Fix and warning one kind of segmentfault cause by memory corruption when resize ConvolutionWinograd. Avoid to use some registers as arm restriction. bc590eee4 雁行 [Converter:Bugfix] bugfix for onnx instancenormalization convert. d8918593f tianhang.yth add auto backend and getSessionInfo @tianbu 83a198ed7 杭行 update d0dd3e09b 杭行 update 99540202e xiaying [Converter:Optimize] Opt the tensor convert insert 333d8db82 allen.lk [Debug:Sparse] Fix All platform-register r9 / x18 issue on arm32 and arm64. 
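// MNNUnpackTranspose: convert an [area x depth] channel-contiguous float buffer
// (NHWC-style) into the 4-channel-packed NC4HW4-style layout: C4 block ci of
// pixel hi is written at dst + 4 * (ci * areaOffset[1] + hi), and the trailing
// depth % 4 channels are zero-padded to a full block of four. areaOffset[0]
// (the source plane stride) is read below for interface symmetry, but this
// float path reads the source densely.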
void MNNUnpackTranspose(float* dst, const float* src, size_t area, size_t depth, int* areaOffset) {
    int srcAreaOffset = areaOffset[0];
    int dstAreaOffset = areaOffset[1];
#ifdef MNN_USE_NEON
    // Fast paths for depth 1 and depth 3 (the common image cases). Note that they
    // assume densely packed buffers: areaOffset is not applied here.
    if (1 == depth) {
        auto zeroValue = vmovq_n_f32(0.0f);
        int areaC4 = (int)area / 4;
        int remain = areaC4 * 4; // remainder starts here: area rounded down to a multiple of 4
        for (int i = 0; i < areaC4; ++i) {
            auto srcCur = src + 4 * i;
            auto dstCur = dst + 16 * i;
            auto srcValue = vld1q_f32(srcCur);
            float32x4x4_t dstValue;
            dstValue.val[0] = srcValue;
            dstValue.val[1] = zeroValue;
            dstValue.val[2] = zeroValue;
            dstValue.val[3] = zeroValue;
            vst4q_f32(dstCur, dstValue);
        }
        for (int i = remain; i < area; ++i) {
            dst[4 * i + 0] = src[i];
            dst[4 * i + 1] = 0.0f;
            dst[4 * i + 2] = 0.0f;
            dst[4 * i + 3] = 0.0f;
        }
        return;
    }
    if (3 == depth) {
        auto zeroValue = vmovq_n_f32(0.0f);
        int areaC4 = (int)area / 4;
        int remain = areaC4 * 4;
        for (int i = 0; i < areaC4; ++i) {
            auto srcCur = src + 12 * i;
            auto dstCur = dst + 16 * i;
            auto srcValue = vld3q_f32(srcCur);
            float32x4x4_t dstValue;
            dstValue.val[0] = srcValue.val[0];
            dstValue.val[1] = srcValue.val[1];
            dstValue.val[2] = srcValue.val[2];
            dstValue.val[3] = zeroValue;
            vst4q_f32(dstCur, dstValue);
        }
        for (int i = remain; i < area; ++i) {
            dst[4 * i + 0] = src[3 * i + 0];
            dst[4 * i + 1] = src[3 * i + 1];
            dst[4 * i + 2] = src[3 * i + 2];
            dst[4 * i + 3] = 0.0f;
        }
        return;
    }
#endif
    int c = (int)depth;
    int cDiv4 = c / 4;
    int cAlign = cDiv4 * 4;
    for (int hi = 0; hi < area; ++hi) {
        const float* srcHeight = src + hi * c;
        float* dstHeight = dst + hi * 4;
        for (int ci = 0; ci < cDiv4; ++ci) {
            Vec4::save(dstHeight + 4 * ci * dstAreaOffset, Vec4::load(srcHeight + 4 * ci));
        }
    }
    if (cAlign == c) {
        return;
    }
    int cRemain = c - cAlign;
    auto srcAlign = src + cAlign;
    auto dstAlign = dst + dstAreaOffset * cAlign;
#ifdef MNN_USE_NEON
    auto zeroVector = vdupq_n_f32(0.0f);
#endif
    // Tail: zero a full block of four, then copy the remaining c - cAlign channels.
    for (int hi = 0; hi < area; ++hi) {
        const float* srcHeight = srcAlign + hi * c;
        float* dstHeight = dstAlign + hi * 4;
#ifdef MNN_USE_NEON
        vst1q_f32(dstHeight, zeroVector);
#else
        for (int i = 0; i < 4; ++i) {
            dstHeight[i] = 0;
        }
#endif
        for (int ci = 0; ci < cRemain; ++ci) {
            dstHeight[ci] = srcHeight[ci];
        }
    }
}
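
// Illustrative example (hypothetical values, not from the original source):
// unpacking a dense buffer with area = 2, depth = 5 and areaOffset = {2, 2}.
// Pixel-major source:
//   src = { p0c0 p0c1 p0c2 p0c3 p0c4 | p1c0 p1c1 p1c2 p1c3 p1c4 }
// C4-packed destination (plane stride 4 * areaOffset[1] = 8 elements):
//   block 0: { p0c0 p0c1 p0c2 p0c3 | p1c0 p1c1 p1c2 p1c3 }
//   block 1: { p0c4 0 0 0 | p1c4 0 0 0 }   // remainder, zero-padded
//
//   float src[10] = { /* pixel-major data */ };
//   float dst[16];
//   int areaOffset[2] = {2, 2}; // dense layout: plane stride == area
//   MNNUnpackTranspose(dst, src, 2, 5, areaOffset);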
void MNNPackTransposeUint8(uint8_t* dst, const uint8_t* src, size_t area, size_t depth, int* areaOffset) {
    int c = (int)depth;
    int cDiv4 = c / 4;
    int cAlign = cDiv4 * 4;
    if (cAlign == c) {
        // Aligned case: each group of four uint8 channels moves as one 32-bit word.
        int32_t* dst32 = (int32_t*)dst;
        const int32_t* src32 = (const int32_t*)src;
        for (int hi = 0; hi < area; ++hi) {
            auto srcHeight = src32 + hi;
            auto dstHeight = dst32 + hi * cDiv4;
            for (int ci = 0; ci < cDiv4; ++ci) {
                dstHeight[ci] = srcHeight[ci * areaOffset[0]];
            }
        }
        return;
    }
    for (int hi = 0; hi < area; ++hi) {
        auto srcHeight = src + hi * 4;
        auto dstHeight = dst + hi * c;
        for (int ci = 0; ci < cDiv4; ++ci) {
            for (int i = 0; i < 4; ++i) {
dstHeight[ci * 4 + i] = srcHeight[4 * ci * areaOffset[0] + i];
}
}
}
int cReamin = c - cAlign;
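    // scalar tail: copy the remaining c - cAlign channels that do not fill a full C4 block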
auto srcAlign = src + areaOffset[0] * cAlign;
auto dstAlign = dst + cAlign;
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = srcAlign + hi * 4;
auto dstHeight = dstAlign + hi * c;
for (int ci = 0; ci < cReamin; ++ci) {
dstHeight[ci] = srcHeight[ci];
}
}
}
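
// Repack a C4-packed source (NC4HW4 layout, source plane stride areaOffset[0])
// into a channel-dense destination holding `depth` contiguous floats per pixel.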
void MNNPackTranspose(float* dst, const float* src, size_t area, size_t depth, int* areaOffset) {
#if defined(MNN_USE_NEON)
if (3 == depth) {
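        // fast path for depth == 3: vld4q_f32 de-interleaves 4 pixels of C4 data,
        // vst3q_f32 stores the 3 real channels and drops the padding lane.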
int areaC4 = (int)area / 4;
int remain = areaC4 * 4;
for (int i = 0; i < areaC4; ++i) {
auto srcCur = src + 16 * i;
auto dstCur = dst + 12 * i;
auto srcValue = vld4q_f32(srcCur);
float32x4x3_t dstValue;
dstValue.val[0] = srcValue.val[0];
dstValue.val[1] = srcValue.val[1];
dstValue.val[2] = srcValue.val[2];
vst3q_f32(dstCur, dstValue);
}
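        // scalar tail for the last area % 4 pixels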
for (int i = remain; i < area; ++i) {
dst[3 * i + 0] = src[4 * i + 0];
dst[3 * i + 1] = src[4 * i + 1];
dst[3 * i + 2] = src[4 * i + 2];
}
return;
}
#elif defined(MNN_USE_SSE)
if (3 == depth) {
if (area < 1) return;
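        // each 4-float store advances dst by only 3 floats, so stores overlap by one
        // float; the last pixel is copied scalar to avoid writing past the end of dst.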
for (int i = 0; i < area - 1; ++i) {
auto srcValue = Vec4::load(src + 4 * i);
Vec4::save(dst + 3 * i, srcValue);
}
for (int i = 0; i < 3; ++i) {
dst[3 * (area - 1) + i] = src[4 * (area - 1) + i];
}
return;
}
#endif
int c = (int)depth;
int cDiv4 = c / 4;
int cAlign = cDiv4 * 4;
auto srcArea = areaOffset[0];
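    // move one full C4 block (4 channels) per pixel with a single vector load/store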
for (int hi = 0; hi < area; ++hi) {
const float* srcHeight = src + hi * 4;
float* dstHeight = dst + hi * c;
for (int ci = 0; ci < cDiv4; ++ci) {
Vec4::save(dstHeight + 4 * ci, Vec4::load(srcHeight + 4 * ci * srcArea));
}
}
if (cAlign == c) {
return;
}
int cRemain = c - cAlign;
auto srcAlign = src + srcArea * cAlign;
auto dstAlign = dst + cAlign;
for (int hi = 0; hi < area; ++hi) {
const float* srcHeight = srcAlign + hi * 4;
float* dstHeight = dstAlign + hi * c;
for (int ci = 0; ci < cRemain; ++ci) {
dstHeight[ci] = srcHeight[ci];
}
}
}
void MNNExp(float* dst, const float* src, float* offset, size_t dataSize) {
int countC8 = static_cast<int32_t>(dataSize) / 8;
int remain = static_cast<int32_t>(dataSize) % 8;
static const float parameters[] = {
(float)logf(2.0f), 1.0f / (float)logf(2.0f), 0.25f, 1.0f, 0.5f, 1.0f / 6.0f, 1.0f / 24.0f, 1.0f / 120.0f};
if (countC8 > 0) {
// Align to eight so asm is easier to write
MNNExpC8(dst, src, offset, parameters, countC8);
}
if (remain > 0) {
auto param = parameters[0]; // ln(2)
float xLimit = 87;
float summer = offset[3];
auto source = src + countC8 * 8;
auto dest = dst + countC8 * 8;
for (int i = 0; i < remain; ++i) {
auto x = source[i] * offset[0] + offset[2];
x = ALIMAX(x, -xLimit);
x = ALIMIN(x, xLimit);
int div = (x * parameters[1]); // div = trunc(x / ln(2))
int div2 = (div + 127) << 23; // IEEE-754 bit pattern of 2^div (biased exponent field)
auto xRemain = x - div * param; // r = x - div * ln(2), so e^x = 2^div * e^r
float expBasic = *(float*)(&div2);
auto t = xRemain * 0.25f; // evaluate e^(r/4), then square twice below
auto expRemain =
((((parameters[7] * t + parameters[6]) * t + parameters[5]) * t + parameters[4]) * t + 1.0f) * t +
1.0f;
expRemain = expRemain * expRemain;
expRemain = expRemain * expRemain;
dest[i] = expBasic * expRemain + offset[1];
summer += dest[i];
}
offset[3] = summer;
}
}
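/* Reference notes (sketch, not in source): per element this computes
       x = clamp(src[i] * offset[0] + offset[2], -87, 87),
       dst[i] = e^x + offset[1],
   and accumulates the sum of all dst[i] into offset[3].
   Assumed usage for a plain exp over a buffer:
       float offset[4] = {1.0f, 0.0f, 0.0f, 0.0f};
       MNNExp(dst, src, offset, dataSize); // dst[i] ~= expf(src[i]); offset[3] holds the sum
   The scalar tail mirrors what MNNExpC8 does in assembly: split x = div * ln(2) + r,
   build 2^div by writing the biased exponent (div + 127) into an IEEE-754 float,
   evaluate a 5th-order Taylor polynomial for e^(r/4), and square twice to get e^r. */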
// Lambert's series with 7 divisions
// reference from
// https://varietyofsound.wordpress.com/2011/02/14/efficient-tanh-computation-using-lamberts-continued-fraction/
inline float tanhf_poly(float value) {
if (value > 5.0f) {
return 1.0f;
} else if (value <= -5.0f) {
return -1.0f;
} else {
float x2 = value * value;
float a = value * (135135.0f + x2 * (17325.0f + x2 * (378.0f + x2)));
float b = 135135.0f + x2 * (62370.0f + x2 * (3150.0f + x2 * 28.0f));
return a / b;
}
}
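// The rational function above is Lambert's continued fraction
//   tanh(x) = x / (1 + x^2 / (3 + x^2 / (5 + x^2 / ...)))
// truncated after seven levels and multiplied out; for |x| > 5 the result is
// saturated to +/-1, where the approximation error is already negligible.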
void MNNTanh(float* dst, const float* src, size_t dataSize) {
/* Origin Code
for (int i = 0; i < dataSize; i++) {
// outputData[i] = 1 - 2 / (expf(2 * inputData[i]) + 1);
dst[i] = tanhf_poly(src[i]);
}
*/
float offset[4] = {
-2.0f, // input scale: MNNExp receives -2 * x
0.0f,  // post-add on each exp result
0.0f,  // pre-add bias on the input
0.0f   // running-sum slot, unused here
};
MNNExp(dst, src, offset, dataSize);
for (int i = 0; i < dataSize; i++) {
// MNNExp stored dst[i] = e^(-2 * src[i]); apply
// tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x))
auto expX2 = dst[i];
dst[i] = (1.0f - expX2) / (1.0f + expX2);
}
}
void MNNReluWithSlope(float* dst, const float* src, size_t sizeQuad, float slope) {
float slopeValue[4];
for (int i=0; i<4; ++i) {
slopeValue[i] = slope;
}
MNNReluWithSlopeChannel(dst, src, slopeValue, sizeQuad, 1);
}
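/* Usage sketch (assumption, not in source): a leaky ReLU over n floats with
   n a multiple of 4, i.e. dst[i] = src[i] > 0 ? src[i] : src[i] * slope under
   the usual PReLU semantics of MNNReluWithSlopeChannel:
       MNNReluWithSlope(dst, src, n / 4, 0.1f);
   The slope is broadcast to all four lanes and depth is fixed at 1. */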
void MNNReluWithSlopeCommon(float* dst, const float* src, size_t size, float slope) {
int sizeQuad = static_cast<int32_t>(size) / 4;
int remain = static_cast<int32_t>(size) % 4;
if (sizeQuad > 0) {
MNNReluWithSlope(dst, src, sizeQuad, slope);
}
if (remain > 0) {
float intmp[4] = {0}, outmp[4] = {0};
::memcpy(intmp, src + sizeQuad * 4, remain * sizeof(float));
MNNReluWithSlope(outmp, intmp, 1, slope);
::memcpy(dst + sizeQuad * 4, outmp, remain * sizeof(float));
}
}
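// Tail-handling idiom used throughout this file: the last (size % 4) floats
// are copied into a zero-padded scratch quad, processed by the vector kernel
// as one unit, then copied back, so the kernel never reads or writes memory
// outside the caller's buffers.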
void MNNHardSwishCommon(float* dst, const float* src, size_t size) {
int sizeQuad = static_cast<int32_t>(size / 4);
int remain = static_cast<int32_t>(size) % 4;
#ifdef MNN_USE_SSE
if (sizeQuad > 0) {
MNNHardSwish(dst, src, sizeQuad);
}
if (remain > 0) {
float intmp[4] = {0}, outmp[4] = {0};
::memcpy(intmp, src + sizeQuad * 4, remain * sizeof(float));
MNNHardSwish(outmp, intmp, 1);
::memcpy(dst + sizeQuad * 4, outmp, remain * sizeof(float));
}
#else
#ifdef MNN_USE_NEON
float32x4_t zero = vdupq_n_f32(0.f);
float32x4_t three = vdupq_n_f32(3.f);
float32x4_t six = vdupq_n_f32(6.f);
float32x4_t divsix = vdupq_n_f32(1.0f/6.f);
for (int i = 0; i < sizeQuad; i++) {
auto x = vld1q_f32(src + 4 * i);
auto y = vmulq_f32(vmulq_f32(x, vminq_f32(vmaxq_f32(vaddq_f32(x, three), zero), six)), divsix);
vst1q_f32(dst + 4 * i, y);
}
if (remain > 0) {
float intmp[4] = {0}, outmp[4] = {0};
::memcpy(intmp, src + sizeQuad * 4, remain * sizeof(float));
auto x = vld1q_f32(intmp);
auto y = vmulq_f32(vmulq_f32(x, vminq_f32(vmaxq_f32(vaddq_f32(x, three), zero), six)), divsix);
vst1q_f32(outmp, y);
::memcpy(dst + sizeQuad * 4, outmp, remain * sizeof(float));
}
#else
for (int j = 0; j < size; j++) {
2021-04-08 15:34:23 +08:00
if (src[j] <= -3) {
dst[j] = 0;
} else if (src[j] >= 3){
dst[j] = src[j];
} else {
dst[j] = src[j] * (src[j] + 3) / 6.f;
}
}
#endif
#endif
}
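/* Scalar reference: every branch above implements
       hardswish(x) = x * min(max(x + 3, 0), 6) / 6,
   i.e. 0 for x <= -3, x for x >= 3, and x * (x + 3) / 6 in between. */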
void MNNGeluStandardCommon(float* dst, const float* src, size_t size) {
for (int i = 0; i < size; i++) {
dst[i] = (erf(src[i] * 0.7071067932881648) + 1) * src[i] * 0.5;
}
}
void MNNGeluCommon(float* dst, const float* src, size_t size) {
int sizeQuad = static_cast<int32_t>(size / 8);
int remain = static_cast<int32_t>(size) % 8;
#if defined(MNN_USE_SSE) || defined(MNN_USE_NEON)
float parameters[8] = {0.044715f, 0.79788458f, 378.f, 17325.f, 135135.f, 28.f, 3150.f, 62370.f};
if (sizeQuad > 0) {
MNNGelu(dst, src, sizeQuad, parameters);
}
if (remain > 0) {
float intmp[8] = {0};
float outmp[8] = {0};
::memcpy(intmp, src + 8 * sizeQuad, remain * sizeof(float));
MNNGelu(outmp, intmp, 1, parameters);
::memcpy(dst + 8 * sizeQuad, outmp, remain * sizeof(float));
}
#else
auto tanhf_poly = [](float value) -> float {
if (value > 5.0f) {
return 1.0f;
} else if (value <= -5.0f) {
return -1.0f;
} else {
float x2 = value * value;
float a = value * (135135.0f + x2 * (17325.0f + x2 * (378.0f + x2)));
float b = 135135.0f + x2 * (62370.0f + x2 * (3150.0f + x2 * 28.0f));
return a / b;
}
};
for (int i = 0; i < size; i++) {
float temp = 0.044715f * src[i] * src[i] * src[i];
temp = 0.79788458f * (temp + src[i]);
dst[i] = (1.0f + tanhf_poly(temp)) * src[i] * 0.5f;
}
#endif
}
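/* Both paths above evaluate the tanh approximation of GELU:
       gelu(x) ~= 0.5 * x * (1 + tanh(0.79788458f * (x + 0.044715f * x^3)))
   with 0.79788458 ~= sqrt(2 / pi). The vectorized MNNGelu additionally takes
   the tanhf_poly rational coefficients (378, 17325, 135135, ...) through
   parameters[], so both paths share the same tanh approximation. */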
void MNNScaleAndAddBiasScalar(float* dst, const float* src, float bias, float alpha, size_t number) {
int numberC4 = (int)number / 4;
int start = 0;
if (numberC4 > 0) {
float biasC4[4] = {
bias,
bias,
bias,
bias
};
float alphaC4[4] = {
alpha,
alpha,
alpha,
alpha
};
MNNScaleAndAddBias(dst, src, biasC4, alphaC4, numberC4, 1);
start = numberC4 * 4;
}
for (int i=start; i<number; ++i) {
dst[i] = src[i] * alpha + bias;
}
}
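// Computes dst[i] = src[i] * alpha + bias for all i: the 4-aligned prefix is
// handled by MNNScaleAndAddBias with the scalars broadcast to quads, and the
// remaining 0-3 elements are finished in the scalar loop above.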
#ifndef MNN_USE_NEON
void MNNAxByClampBroadcastUnit(float* C, const float* A, const float* B, size_t width, size_t cStride, size_t aStride, size_t height, const float* parameters) {
auto minF = Vec4(parameters[2]);
auto maxF = Vec4(parameters[3]);
auto beta = Vec4(parameters[1]);
for (int y = 0; y < height; ++y) {
auto a = A + aStride * y;
auto b = B + 4 * y;
auto bv = Vec4::load(b);
auto c = C + cStride * y;
for (int x = 0; x < width; ++x) {
auto av = Vec4::load(a + 4 * x);
auto cv = av + bv * beta;
cv = Vec4::min(cv, maxF);
cv = Vec4::max(cv, minF);
Vec4::save(c + 4 * x, cv);
}
}
}
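// Per row y: C[x] = clamp(A[x] + B[y] * beta, minF, maxF) on 4-wide vectors,
// with a single Vec4 of B broadcast across the whole row; beta, minF and
// maxF come from parameters[1], parameters[2] and parameters[3].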
void MNNVectorTop1Float(float* input, float* maxValue, int32_t* maxIndex, size_t inputCountUnit) {
float maxV = input[0];
int maxIdx = 0;
for (int i = 0; i < inputCountUnit; i++) {
int offset = i * UNIT;
for (int j = 0; j < UNIT; j++) {
if (input[offset + j] > maxV) {
maxV = input[offset + j];
maxIdx = offset + j;
}
}
}
maxValue[0] = maxV;
maxIndex[0] = maxIdx;
}
void MNNVectorTop1Int32(int32_t* input, int32_t* maxValue, int32_t* maxIndex, size_t inputCountUnit) {
int32_t maxV = input[0];
int maxIdx = 0;
for (int i = 0; i < inputCountUnit; i++) {
int offset = i * UNIT;
for (int j = 0; j < UNIT; j++) {
if (input[offset + j] > maxV) {
maxV = input[offset + j];
maxIdx = offset + j;
}
}
}
maxValue[0] = maxV;
maxIndex[0] = maxIdx;
}
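/* Usage sketch (assumption): argmax over n elements when n is a multiple of
   UNIT (the backend's vector-width macro):
       float maxV; int32_t maxIdx;
       MNNVectorTop1Float(input, &maxV, &maxIdx, n / UNIT);
   Both variants scan inputCountUnit * UNIT elements linearly and report the
   first maximum's value and flat index. */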
#endif
void MNNComputeMatMulForE_1(const float* A, const float* B, float* C, const float* biasPtr, const MatMulParam* param, size_t tId) {
auto l = param->l;
auto h = param->h;
auto numberThread = param->numberThread;
auto lC4 = l / 4;
auto lR = lC4 * 4;
if (param->BTranspose) {
for (int y=tId; y<h; y+=numberThread) {
Vec4 sumValue = Vec4(0.0f);
auto by = B + y * l;
for (int x=0; x<lC4; ++x) {
sumValue = Vec4::fma(sumValue, Vec4::load(A + x * 4), Vec4::load(by + x * 4));
}
float sumRemain = 0.0f;
for (int x=lR; x<l; ++x) {
sumRemain = sumRemain + A[x] * by[x];
}
if (nullptr != biasPtr) {
sumRemain += biasPtr[y];
}
C[y] = sumRemain + sumValue[0] + sumValue[1] + sumValue[2] + sumValue[3];
}
} else {
auto hC4 = h / 16; // blocks of 16 output channels: four Vec4 per iteration
auto hR = hC4 * 16;
for (int y=tId; y<hC4; y+=numberThread) {
auto bs = B + 16 * y;
Vec4 sumValue0 = Vec4(0.0f);
Vec4 sumValue1 = Vec4(0.0f);
Vec4 sumValue2 = Vec4(0.0f);
Vec4 sumValue3 = Vec4(0.0f);
if (biasPtr != nullptr) {
sumValue0 = Vec4::load(biasPtr + 16 * y + 0);
sumValue1 = Vec4::load(biasPtr + 16 * y + 4);
sumValue2 = Vec4::load(biasPtr + 16 * y + 8);
sumValue3 = Vec4::load(biasPtr + 16 * y + 12);
}
for (int x=0; x<l; ++x) {
auto a = Vec4(A[x]);
sumValue0 = Vec4::fma(sumValue0, a, Vec4::load(bs + h * x));
sumValue1 = Vec4::fma(sumValue1, a, Vec4::load(bs + h * x + 4));
sumValue2 = Vec4::fma(sumValue2, a, Vec4::load(bs + h * x + 8));
sumValue3 = Vec4::fma(sumValue3, a, Vec4::load(bs + h * x + 12));
}
Vec4::save(C + 16 * y, sumValue0);
Vec4::save(C + 16 * y + 4, sumValue1);
Vec4::save(C + 16 * y + 8, sumValue2);
Vec4::save(C + 16 * y + 12, sumValue3);
}
for (int y=hR + tId; y<h; y+=numberThread) {
auto bs = B + y;
float sumValue = 0.0f;
if (biasPtr != nullptr) {
sumValue = biasPtr[y];
}
for (int x=0; x<l; ++x) {
sumValue = sumValue + A[x] * bs[h * x];
}
C[y] = sumValue;
}
}
}
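/* e == 1 case: A is a single row of length l, so each C[y] is a dot product
   of A with column y of B (or row y when BTranspose is set), plus bias.
   Threads interleave over y with stride numberThread; the non-transposed
   branch accumulates 16 output channels per step in four Vec4 registers and
   finishes the h % 16 tail with scalar code. */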
void MNNComputeMatMulForH_1(const float* A, const float* B, float* C, const float* biasPtr, const MatMulParam* param, size_t tId) {
int e = param->e;
int l = param->l;
int numberThread = param->numberThread;
if (param->ATranspose) {
float biasValue = 0.0f;
if (nullptr != biasPtr) {
biasValue = *biasPtr;
}
auto eC4 = e / 4;
auto eR = eC4 * 4;
for (int y=tId; y<eC4; y+=numberThread) {
Vec4 sumValue = Vec4(biasValue);
auto srcY = A + y * 4;
for (int x=0; x<l; ++x) {
sumValue = sumValue + Vec4::load(srcY + x * e) * Vec4(B[x]);
}
Vec4::save(C + 4 * y, sumValue);
}
if (0 == tId) {
for (int y=eR; y<e; ++y) {
float sumValue = biasValue;
auto srcY = A + y;
for (int x=0; x<l; ++x) {
sumValue = sumValue + srcY[x * e] * B[x];
}
C[y] = sumValue;
}
}
return;
}
float biasValue = 0.0f;
if (nullptr != biasPtr) {
biasValue = *biasPtr;
}
auto lC4 = l / 4;
auto lR = lC4 * 4;
for (int y=tId; y<e; y+=numberThread) {
Vec4 sumValue = Vec4(biasValue);
auto srcY = A + y * l;
for (int x=0; x<lC4; ++x) {
sumValue = sumValue + Vec4::load(srcY + 4 * x) * Vec4::load(B + 4 * x);
}
float sumSingle = sumValue[0] + sumValue[1] + sumValue[2] + sumValue[3];
for (int x=lR; x<l; ++x) {
sumSingle += srcY[x] * B[x];
}
C[y] = sumSingle;
}
}
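/* h == 1 case: B is a single column of length l, so each C[y] is a dot
   product of row y of A with B, plus the scalar bias. When ATranspose is set
   A is walked with stride e and four outputs are produced per Vec4; the
   leftover rows eR..e-1 are handled by thread 0 alone so they are not
   computed twice. */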
void MNNPackC4Int16(int16_t* dst, const int16_t* src, size_t area, size_t depth, int* areaOffset) {
MNNPackC4Common(dst, src, area, depth, areaOffset);
}
void MNNUnpackC4Int16(int16_t* dst, const int16_t* src, size_t area, size_t depth, int* areaOffset) {
MNNUnpackC4Common(dst, src, area, depth, areaOffset);
}
void MNNUnpackTransposeInt16(int16_t* dst, const int16_t* src, size_t area, size_t depth, int* areaOffset) {
if (depth == 4) {
::memcpy(dst, src, area * depth * sizeof(int16_t));
return;
}
int c = (int)depth;
int cDiv4 = c / 4;
int cAlign = cDiv4 * 4;
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = (src + hi * c);
auto dstHeight = (dst + hi * 4);
for (int ci = 0; ci < cDiv4; ++ci) {
for (int i = 0; i < 4; ++i) {
dstHeight[ci * areaOffset[1] * 4 + i] = srcHeight[4 * ci + i];
}
}
}
if (cAlign == c) {
return;
}
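    // Channel remainder: zero the padding lanes of the last C4 block, then copy
    // the remaining (c - cAlign) channels.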
    int cRemain = c - cAlign;
auto srcAlign = src + cAlign;
auto dstAlign = dst + areaOffset[1] * cAlign;
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = srcAlign + hi * c;
auto dstHeight = dstAlign + hi * 4;
for (int i = 0; i < 4; ++i) {
dstHeight[i] = 0;
}
        for (int ci = 0; ci < cRemain; ++ci) {
dstHeight[ci] = srcHeight[ci];
}
}
}
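
// NC4HW4 (int16) -> NHWC (int16): the inverse of MNNUnpackTransposeInt16.
// areaOffset[0] is the area stride of the source NC4HW4 buffer, so the C4 blocks
// of one pixel sit areaOffset[0] * 4 values apart; when the channel count is a
// multiple of 4, each block moves as a single 64-bit word.
//
// Minimal usage sketch (illustrative only; `area`, `depth` and the buffers are
// hypothetical, and a dense tensor is assumed so both area strides equal area):
//
//   const int area = 16, depth = 8;            // depth % 4 == 0 -> fast path
//   std::vector<int16_t> nc4hw4(depth / 4 * 4 * area);
//   std::vector<int16_t> nhwc(area * depth);
//   int areaOffset[2] = {area, area};          // {src stride, dst stride}
//   MNNPackTransposeInt16(nhwc.data(), nc4hw4.data(), area, depth, areaOffset);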
void MNNPackTransposeInt16(int16_t* dst, const int16_t* src, size_t area, size_t depth, int* areaOffset) {
int c = (int)depth;
int cDiv4 = c / 4;
int cAlign = cDiv4 * 4;
if (cAlign == c) {
        // Channel count is a multiple of 4: move each pixel's C4 block as one
        // 64-bit word (4 x int16).
        int64_t* dst64       = (int64_t*)dst;
        const int64_t* src64 = (const int64_t*)src;
        for (int hi = 0; hi < area; ++hi) {
            auto srcHeight = src64 + hi;
            auto dstHeight = dst64 + hi * cDiv4;
for (int ci = 0; ci < cDiv4; ++ci) {
dstHeight[ci] = srcHeight[ci * areaOffset[0]];
}
}
return;
}
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = src + hi * 4;
auto dstHeight = dst + hi * c;
for (int ci = 0; ci < cDiv4; ++ci) {
for (int i = 0; i < 4; ++i) {
dstHeight[ci * 4 + i] = srcHeight[4 * ci * areaOffset[0] + i];
}
}
}
int cRemain = c - cAlign;
auto srcAlign = src + areaOffset[0] * cAlign;
auto dstAlign = dst + cAlign;
for (int hi = 0; hi < area; ++hi) {
auto srcHeight = srcAlign + hi * 4;
auto dstHeight = dstAlign + hi * c;
for (int ci = 0; ci < cRemain; ++ci) {
dstHeight[ci] = srcHeight[ci];
}
}
}
void MNNCopyC4Int16WithStride(const float* sourceF, float* destF, size_t srcStride, size_t dstStride, size_t count) {
auto source = (int16_t*)sourceF;
auto dest = (int16_t*)destF;
for (int i = 0; i < count; ++i) {
auto s = source + i * srcStride;
auto d = dest + i * dstStride;
*(int64_t*)(d) = *((int64_t*)s);
}
}
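// Note: one C4 unit of int16 data is 4 * sizeof(int16_t) == 8 bytes, so each
// unit above is copied with a single 64-bit load/store; the strides index the
// casted int16_t pointers. A plain scalar equivalent (illustrative sketch):
//
//   for (size_t i = 0; i < count; ++i) {
//       ::memcpy(dest + i * dstStride, source + i * srcStride, 4 * sizeof(int16_t));
//   }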
void MNNSin(float* dst, const float* src, size_t dataSize) {
for (int i = 0; i < dataSize; i++) {
dst[i] = sinf(src[i]);
}
}
void MNNSigmoid(float* dst, const float* src, size_t dataSize) {
float offset[4] = {
-1.0f,
0.0f,
0.0f,
0.0f
};
MNNExp(dst, src, offset, dataSize);
for (int i = 0; i < dataSize; ++i) {
dst[i] = 1.0f / (1.0f + dst[i]);
}
}
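// Judging by the offset values used here (offset[0] == -1.0f), MNNExp appears
// to compute dst[i] = exp(offset[0] * src[i] + offset[1]) = exp(-src[i]), so
// the loop above completes sigmoid(x) = 1 / (1 + exp(-x)). A scalar reference
// to test against (illustrative sketch):
//
//   void SigmoidRef(float* dst, const float* src, size_t n) {
//       for (size_t i = 0; i < n; ++i) {
//           dst[i] = 1.0f / (1.0f + expf(-src[i]));
//       }
//   }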
void MNNSiLu(float* dst, const float* src, size_t dataSize) {
float offset[4] = {
-1.0f,
0.0f,
0.0f,
0.0f
};
MNNExp(dst, src, offset, dataSize);
for (int i = 0; i < dataSize; ++i) {
dst[i] = src[i] / (1.0f + dst[i]);
}
}
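// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)), so the exp(-x) buffer from
// MNNExp is reused with src[i] in the numerator. Scalar reference
// (illustrative): dst[i] = src[i] / (1.0f + expf(-src[i]));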
/**
Modified from https://github.com/alibaba/MNN/pull/1359
Thanks to https://github.com/hroken
*/
void MNNSigmoidLowp(float* dst, const float* src, size_t dataSize) {
float offset[4] = {
-1.0f,
0.0f,
0.0f,
0.0f
};
MNNExp(dst, src, offset, dataSize);
#ifdef MNN_USE_NEON
int dataC4 = static_cast<int32_t>(dataSize) / 4;
int remain = static_cast<int32_t>(dataSize) % 4;
float32x4_t value = vdupq_n_f32(1.0f);
if(dataC4 > 0) {
float32x4_t out = vld1q_f32(dst);
// NEON optimization for sigmoid: vrecpeq_f32 is only an approximate reciprocal, which is the "Lowp" trade-off
for (int i = 1; i < dataC4; ++i) {
out = vrecpeq_f32(vaddq_f32(value,out));
vst1q_f32(dst, out);
dst += 4;
out = vld1q_f32(dst);
}
out = vrecpeq_f32(vaddq_f32(value,out));
vst1q_f32(dst, out);
dst += 4;
}
if (remain > 0) {
float intmp[4] = {0};
::memcpy(intmp, dst, remain * sizeof(float));
float32x4_t out = vld1q_f32(intmp);
out = vrecpeq_f32(vaddq_f32(value,out));
vst1q_f32(intmp, out);
::memcpy(dst, intmp, remain * sizeof(float));
}
#else
for (int i = 0; i < dataSize; ++i) {
dst[i] = 1.0f / (1.0f + dst[i]);
}
#endif
}
void MNNSiLuLowp(float* dst, const float* src, size_t dataSize) {
float offset[4] = {
-1.0f,
0.0f,
0.0f,
0.0f
};
MNNExp(dst, src, offset, dataSize);
#ifdef __aarch64__
int dataC4 = static_cast<int32_t>(dataSize) / 4;
int remain = static_cast<int32_t>(dataSize) % 4;
float32x4_t one = vdupq_n_f32(1.0f);
if(dataC4 > 0) {
float32x4_t out = vld1q_f32(dst);
float32x4_t in = vld1q_f32(src);
// NEON optimization for SiLU: vdivq_f32 is an exact divide but AArch64-only, hence the __aarch64__ guard
for (int i = 1; i < dataC4; ++i) {
out = vdivq_f32(in, vaddq_f32(one,out));
vst1q_f32(dst, out);
dst += 4;
src += 4;
out = vld1q_f32(dst);
in = vld1q_f32(src);
}
out = vdivq_f32(in, vaddq_f32(one,out));
vst1q_f32(dst, out);
dst += 4;
src += 4;
}
if (remain > 0) {
float intmp[4] = {0};
float atmp[4] = {0};
::memcpy(intmp, dst, remain * sizeof(float));
::memcpy(atmp, src, remain * sizeof(float));
float32x4_t out = vld1q_f32(intmp);
float32x4_t in = vld1q_f32(atmp);
out = vdivq_f32(in, vaddq_f32(one, out));
vst1q_f32(intmp, out);
::memcpy(dst, intmp, remain * sizeof(float));
}
#else
for (int i = 0; i < dataSize; ++i) {
dst[i] = src[i] / (1.0f + dst[i]);
}
#endif
}
static void _MNNAdjustOptimalSparseKernel(int& sparseBlockOC, MNN::CoreFunctions::MNNPackedSparseMatMul& packedSparseMatMul) {
if(sparseBlockOC == 4) {
packedSparseMatMul = MNNPackedSparseMatMulEpx4;
return;
} else if(sparseBlockOC % 4 == 0) {
sparseBlockOC = 4;
packedSparseMatMul = MNNPackedSparseMatMulEpx4;
// MNN_PRINT("common downgrade sparse to:%d\n",sparseBlockOC);
return;
} else {
sparseBlockOC = 1;
packedSparseMatMul = MNNPackedSparseMatMulEpx1;
return;
}
}
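// Selection rule: an exact block of 4 output channels uses the 4-wide kernel;
// any larger multiple of 4 is downgraded to 4; everything else falls back to
// the 1-wide kernel. Hypothetical caller, via the core-function table
// (illustrative sketch only):
//
//   int blockOC = 8;                                  // hypothetical value
//   MNN::CoreFunctions::MNNPackedSparseMatMul kernel = nullptr;
//   MNN::MNNGetCoreFunctions()->MNNAdjustOptimalSparseKernel(blockOC, kernel);
//   // blockOC is now 4 and kernel points to MNNPackedSparseMatMulEpx4.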
// fp32 <--> fp8
static const int FP32_EXP_BIAS = 127;
static const int FP8_EXP_BIAS = 24; // [0, 31] --> [-24, 7] --> [1 / 2^24, 2^7]
void MNNFp32ToFp8(uint8_t* dst, const float* src, size_t size) {
for (int i = 0; i < size; i++) {
uint32_t rawData = *((uint32_t *)(&src[i]));
uint32_t sign = (rawData >> 31) & 1U;
uint32_t exp = (int)((rawData >> 23) & 0x0ffU);
uint32_t mant = (rawData >> 21) & 3U;
int realExp = (int)exp - FP32_EXP_BIAS;
realExp = ALIMAX(realExp, 0 - FP8_EXP_BIAS);
realExp = ALIMIN(realExp, 31 - FP8_EXP_BIAS);
exp = (uint32_t)(realExp + FP8_EXP_BIAS);
dst[i] = (uint8_t)((sign << 7) | (exp << 2) | mant);
}
}
void MNNFp8ToFp32(float* dst, const uint8_t* src, size_t size) {
for (int i = 0; i < size; i++) {
uint32_t sign = (src[i] >> 7) & 1U;
uint32_t exp = (int)((src[i] >> 2) & 0x1fU);
uint32_t mant = (src[i] & 3U) << 21;
int realExp = (int)exp - FP8_EXP_BIAS;
exp = (uint32_t)(realExp + FP32_EXP_BIAS);
uint32_t rawData = (sign << 31) | (exp << 23) | mant;
dst[i] = *((float *)(&rawData));
}
}
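// Layout of this fp8 format: 1 sign bit, 5 exponent bits (bias 24, so stored
// [0, 31] maps to real exponents [-24, 7]) and 2 mantissa bits. Worked
// example: 1.0f has raw exponent 127, hence realExp = 0, stored exp = 24
// (0b11000) and mantissa 0, giving the byte 0b0'11000'00 = 0x60;
// MNNFp8ToFp32 restores exponent 127 and returns exactly 1.0f. Out-of-range
// exponents are clamped (not flushed to zero or saturated to infinity), and
// the low 21 mantissa bits are truncated without rounding.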
// fp16 <--> fp8
void MNNFp16ToFp8(uint8_t* dst, const uint16_t* src, size_t size) {
#ifdef MNN_USE_NEON
#ifdef __aarch64__
int loopN = size / 16;
for (int i = 0; i < loopN; i++) {
uint8x16_t v1 = vld1q_u8((uint8_t*)(src + i * 16));
uint8x16_t v2 = vld1q_u8((uint8_t*)(src + i * 16 + 8));
uint8x16_t res = vuzp2q_u8(v1, v2);
vst1q_u8(dst + i * 16, res);
}
for (int i = loopN * 16; i < size; i++) {
dst[i] = static_cast<uint8_t>(src[i] >> 8);
}
#else
int loopN = size / 8;
for (int i = 0; i < loopN; i++) {
uint16x8_t vec = vld1q_u16(src + i * 8);
uint8x8_t res = vshrn_n_u16(vec, 8);
vst1_u8(dst + i * 8, res);
}
for (int i = loopN * 8; i < size; i++) {
dst[i] = static_cast<uint8_t>(src[i] >> 8);
}
#endif // ARM64
#else
for (int i = 0; i < size; i++) {
dst[i] = static_cast<uint8_t>(src[i] >> 8);
}
#endif // USE_NEON
}
void MNNFp8ToFp16(uint16_t* dst, const uint8_t* src, size_t size) {
#ifdef MNN_USE_NEON
int loopN = size / 8;
for (int i = 0; i < loopN; i++) {
uint8x8_t vec8x8 = vld1_u8(src + i * 8);
uint16x8_t vec16x8 = vshll_n_u8(vec8x8, 8);
vst1q_u16(dst + i * 8, vec16x8);
}
for (int i = loopN * 8; i < size; i++) {
dst[i] = static_cast<uint16_t>(src[i]) << 8;
}
#else
for (int i = 0; i < size; i++) {
dst[i] = static_cast<uint16_t>(src[i]) << 8;
}
#endif // USE_NEON
}
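// fp16 is 1-5-10 (sign-exponent-mantissa), so truncating to the high byte
// keeps the sign, all 5 exponent bits and the top 2 mantissa bits -- exactly
// the 1-5-2 fp8 layout. On little-endian ARM the high bytes of a uint16x8_t
// are the odd-indexed bytes, which is what vuzp2q_u8 / vshrn_n_u16 extract.
// Scalar equivalents of the two conversions:
//
//   fp8  = static_cast<uint8_t>(fp16 >> 8);   // truncate
//   fp16 = static_cast<uint16_t>(fp8) << 8;   // widen back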
#ifdef MNN_LOW_MEMORY
static void generalIm2col(float* destOrigin, float const** sourceGroup, const int32_t* info, const int32_t* el, int LP, int pack) {
// LP >= pack
int number = info[0];
int eReal = info[1];
int eDest = info[2];
int offset = info[3];
for (int n=0; n<number; ++n) {
int e = el[4 * n + 0];
int l = el[4 * n + 1];
int eOffset = el[4 * n + 2];
int lOffset = el[4 * n + 3];
int lC = lOffset / LP;
int lR = lOffset % LP;
auto dest = destOrigin + eOffset * LP + lC * eDest * LP + lR;
auto source = sourceGroup[n];
for (int y=0; y<e; ++y) {
auto yR = y % eDest;
for (int x=0; x<l; ++x) {
auto xR = x % pack;
auto xC = x / pack;
auto xOut = x / LP;
auto xIn = x % LP;
dest[xOut * eDest * LP + yR * LP + xIn] = source[xC * eReal * pack + y * pack * offset + xR];
}
}
}
}
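// Inferred from the indexing above: el[] carries four ints per region --
// (e, l, eOffset, lOffset) -- for `number` regions. The loop re-tiles
// pack-sized channel groups of the source (plane stride eReal * pack, row
// stride pack * offset) into the [lU][eDest][LP] tiling that the packed
// matmul kernels consume. The `LP >= pack` precondition (presumably with LP a
// multiple of pack) keeps one pack-sized source group from straddling two
// LP-sized destination chunks.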
#endif // MNN_LOW_MEMORY
namespace MNN {
static CoreFunctions* gCoreFunction = nullptr;
void MNNCoreFunctionInit() {
gCoreFunction = new CoreFunctions;
// fp8
gCoreFunction->MNNFp32ToFp8 = MNNFp32ToFp8;
gCoreFunction->MNNFp16ToFp8 = MNNFp16ToFp8;
gCoreFunction->MNNFp8ToFp32 = MNNFp8ToFp32;
gCoreFunction->MNNFp8ToFp16 = MNNFp8ToFp16;
// MatMul
gCoreFunction->MNNGetMatMulPackMode = MNNGetMatMulPackMode;
gCoreFunction->MNNPackC4ForMatMul_A = MNNPackC4ForMatMul_A;
gCoreFunction->MNNPackForMatMul_B = MNNPackForMatMul_B;
gCoreFunction->MNNPackedMatMul = MNNPackedMatMul;
gCoreFunction->MNNPackedMatMulRemain = MNNPackedMatMulRemain;
gCoreFunction->MNNCountMaxMinValue = MNNCountMaxMinValue;
gCoreFunction->MNNGetSparseMatMulPackMode = MNNGetSparseMatMulPackMode;
gCoreFunction->MNNAdjustOptimalSparseKernel = _MNNAdjustOptimalSparseKernel;
gCoreFunction->MNNComputeMatMulForE_1 = MNNComputeMatMulForE_1;
gCoreFunction->MNNComputeMatMulForH_1 = MNNComputeMatMulForH_1;
// Lowp
gCoreFunction->MNNFp32ToLowp = nullptr;
gCoreFunction->MNNLowpToFp32 = nullptr;
gCoreFunction->bytes = 4;// sizeof(float)
// Packed Function
gCoreFunction->pack = 4;
// FIXME: MNNPackTranspose and MNNUnpackTranspose are swapped (note the crossed assignments below)
gCoreFunction->MNNPackCUnit = MNNPackC4;
gCoreFunction->MNNUnpackCUnit = MNNUnpackC4;
gCoreFunction->MNNUnpackCUnitTranspose = MNNPackTranspose;
gCoreFunction->MNNPackCUnitTranspose = MNNUnpackTranspose;
gCoreFunction->MNNPackCUnitInt8 = decltype(gCoreFunction->MNNPackCUnitInt8)(MNNPackC4Uint8);
gCoreFunction->MNNUnpackCUnitInt8 = decltype(gCoreFunction->MNNUnpackCUnitInt8)(MNNUnpackC4Uint8);
gCoreFunction->MNNPackCUnitTransposeInt8 = decltype(gCoreFunction->MNNPackCUnitTransposeInt8)(MNNUnpackTransposeUint8);
gCoreFunction->MNNUnpackCUnitTransposeInt8 = decltype(gCoreFunction->MNNUnpackCUnitTransposeInt8)(MNNPackTransposeUint8);
gCoreFunction->MNNPackCUnitInt16 = MNNPackC4Int16;
gCoreFunction->MNNUnpackCUnitInt16 = MNNUnpackC4Int16;
gCoreFunction->MNNPackCUnitTransposeInt16 = MNNUnpackTransposeInt16;
gCoreFunction->MNNUnpackCUnitTransposeInt16 = MNNPackTransposeInt16;
gCoreFunction->MNNAxByClampBroadcastUnit = MNNAxByClampBroadcastUnit;
gCoreFunction->MNNConvRunForLineDepthwise = MNNConvRunForLineDepthwise;
gCoreFunction->MNNMatrixAdd = MNNMatrixAdd;
gCoreFunction->MNNMatrixSub = MNNMatrixSub;
gCoreFunction->MNNStrassenMergeCFunction = MNNStrassenMergeCFunction;
gCoreFunction->penalty = 1.5f;
gCoreFunction->MNNScaleAndAddBias = MNNScaleAndAddBias;
gCoreFunction->MNNGridSampleComputeCord = MNNGridSampleComputeCord;
gCoreFunction->MNNGridSampleInterp = MNNGridSampleInterp;
#ifndef MNN_REDUCE_SIZE
gCoreFunction->MNNGridSampleInterpGrad = MNNGridSampleInterpGrad;
#endif
gCoreFunction->MNNGridSampleComputeCord3D = MNNGridSampleComputeCord3D;
gCoreFunction->MNNGridSampleInterp3D = MNNGridSampleInterp3D;
gCoreFunction->MNNRoiPoolingMax = MNNRoiPoolingMax;
gCoreFunction->MNNRoiAlignMax = MNNRoiAlignMax;
gCoreFunction->MNNRoiAlignAvg = MNNRoiAlignAvg;
2021-04-08 15:34:23 +08:00
gCoreFunction->MNNAddC4WithStride = MNNAddC4WithStride;
gCoreFunction->MNNCopyC4WithStride = MNNCopyC4WithStride;
gCoreFunction->chooseWinoSourceTransformPack = WinogradFunction::chooseWinoSourceTransformPack;
gCoreFunction->chooseWinoSourceUnrollTransform = WinogradFunction::chooseSourceUnrollTransform;
gCoreFunction->chooseWinoDestUnrollTransform = WinogradFunction::chooseWinoDestUnrollTransform;
gCoreFunction->MNNDeconvRunForLineDepthwise = MNNDeconvRunForLineDepthwise;
gCoreFunction->MNNDeconvRunForUnitDepthWise = MNNDeconvRunForUnitDepthWise;
#ifdef MNN_USE_NEON
gCoreFunction->MNNDepthwiseConvFastKernel = MNNDepthwiseConvFastKernel;
#endif
gCoreFunction->MNNSelectBinaryFunctionForFloat = CPUBinary::selectForFloat;
gCoreFunction->MNNSelectUnaryFunctionForFloat = CPUUnary::selectForFloat;
#ifdef MNN_SUPPORT_QUANT_EXTEND
gCoreFunction->MNNSelectUnaryFunctionForInt8 = CPUUnary::selectForInt8;
#endif
gCoreFunction->MNNReluWithSlopeChannel = MNNReluWithSlopeChannel;
gCoreFunction->MNNPoolingAvg = (decltype(gCoreFunction->MNNPoolingAvg))(poolingAvg<float, Vec4, 4>);
// Initialize the max-pool accumulator with the minimum value -(1 << 24)
gCoreFunction->MNNPoolingMax = (decltype(gCoreFunction->MNNPoolingMax))(poolingMax<float, Vec4, 4, -16777216>);
gCoreFunction->MNNPoolingMaxWithRedice = (decltype(gCoreFunction->MNNPoolingMaxWithRedice))(poolingMaxWithRedice<float, -16777216>);
// ImageProcess Functions
gCoreFunction->MNNRGBAToBGRA = MNNRGBAToBGRA;
gCoreFunction->MNNNV21ToRGBA = MNNNV21ToRGBA;
gCoreFunction->MNNNV21ToRGB = MNNNV21ToRGB;
gCoreFunction->MNNNV21ToBGRA = MNNNV21ToBGRA;
gCoreFunction->MNNNV21ToBGR = MNNNV21ToBGR;
gCoreFunction->MNNC1ToFloatC1 = MNNC1ToFloatC1;
gCoreFunction->MNNC3ToFloatC3 = MNNC3ToFloatC3;
gCoreFunction->MNNC3ToFloatRGBA = MNNC3ToFloatRGBA;
gCoreFunction->MNNSamplerC4Nearest = MNNSamplerC4Nearest;
gCoreFunction->MNNSamplerC4Bilinear = MNNSamplerC4Bilinear;
gCoreFunction->MNN4BitcopyWithStride = MNN4BitcopyWithStride;
gCoreFunction->MNN1BitcopyWithStride = MNN1BitcopyWithStride;
gCoreFunction->MNN2BitcopyWithStride = MNN2BitcopyWithStride;
gCoreFunction->MNN4BitcopyFast = MNN4BitcopyFast;
gCoreFunction->MNN2BitcopyFast = MNN2BitcopyFast;
gCoreFunction->MNN1BitcopyFast = MNN1BitCopyFast;
gCoreFunction->MNNAccumulateSequenceNumber = MNNAccumulateSequenceNumber;
const MNNCPUInfo& gCPUInfo = *MNNGetCPUInfo();
gCoreFunction->supportFp16arith = gCPUInfo.fp16arith;
gCoreFunction->supportSDot = gCPUInfo.dot;
gCoreFunction->supportI8mm = gCPUInfo.i8mm;
gCoreFunction->MNNSumByAxisLForMatmul_A = MNNSumByAxisLForMatmul_A;
gCoreFunction->MNNReorderWeightInt4 = MNNReorderWeightInt4;
gCoreFunction->MNNSumWeightInt8 = MNNSumWeightInt8;
#ifdef __aarch64__
if (gCoreFunction->supportSDot) {
gCoreFunction->MNNReorderWeightInt4 = MNNReorderWeightInt4Arm82;
gCoreFunction->MNNSumWeightInt8 = MNNSumWeightInt8Arm82;
}
if (gCoreFunction->supportI8mm) {
gCoreFunction->MNNReorderWeightInt4 = MNNReorderWeightInt4Arm86;
gCoreFunction->MNNSumWeightInt8 = MNNSumWeightInt8Arm86;
}
#endif
#ifdef MNN_CPU_WEIGHT_DEQUANT_GEMM
// Weight Dequant Gemm Kernels
gCoreFunction->MNNPackedMatMul_int8 = MNNPackedMatMul_int8;
gCoreFunction->MNNPackedMatMulRemain_int8 = MNNPackedMatMulRemain_int8;
#endif
#ifdef MNN_LOW_MEMORY
gCoreFunction->MNNAbsMax = MNNAbsMaxFP32; // abs max value for [icDiv4,plane,4] -> abs max:[plane]
gCoreFunction->MNNDynamicQuant = MNNDynamicQuantFP32; // symmetric 'batch' quant for [icDiv4,plane,4]
gCoreFunction->MNNAsyQuantFunc = MNNAsyQuantFunc; // asymmetric 'batch' quant for [icDiv4,plane,4]
gCoreFunction->MNNAsyQuantInfo = MNNAsyQuantInfo_FP32; // asymmetric quant/dequant scale&bias for [icDiv4,plane,4] -> scale&bias:[blockNum,plane]
gCoreFunction->MNNQuantScale = MNNQuantScaleFP32; // symmetric quant/dequant scale&bias for [icDiv4,plane,4] -> scale&bias:[plane]
gCoreFunction->MNNGeneralIm2Col = generalIm2col; // Im2Col based on float data -> output:[eU,kernelsize,lU,ep,lp]
gCoreFunction->MNNDynamicUpdateConvBiasScale = MNNDynamicUpdateConvBiasScale;
#ifdef __aarch64__
if (gCoreFunction->supportSDot) {
gCoreFunction->MNNGeneralIm2Col = MNNGeneralIm2col_Fp32Arm82;
}
if (gCoreFunction->supportI8mm) {
gCoreFunction->MNNGeneralIm2Col = MNNGeneralIm2col_Fp32Arm86;
}
#endif
#endif
MNNCoreInt8FunctionInit();
MNNFunctionInit();
}
CoreFunctions* MNNGetCoreFunctions() {
return gCoreFunction;
}
};
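// Typical access pattern (illustrative sketch; the real call sites live in
// the CPU backend, and dst/src/area/depth/offset here are hypothetical):
//
//   MNN::MNNCoreFunctionInit();                  // once, at startup
//   auto core = MNN::MNNGetCoreFunctions();
//   core->MNNPackCUnit(dst, src, area, depth, offset);
//
// ISA-specific backends (e.g. ARM82, AVX) install their own CoreFunctions
// tables rather than patching individual entries in this one.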
void MNNUnpackC4Origin(float* dst, const float* src, size_t area, size_t depth, int areaOffset) {
int offset[] = {
areaOffset,
areaOffset,
};
MNNUnpackC4(dst, src, area, depth, offset);
}
void MNNPackC4Origin(float* dst, const float* src, size_t area, size_t depth, int areaOffset) {
int offset[] = {
areaOffset,
areaOffset,
};
MNNPackC4(dst, src, area, depth, offset);
}
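// The two-entry offset[] holds separate source and destination area strides
// (judging by the unpack code earlier in this file, offset[0] strides the
// source and offset[1] the destination); these *Origin wrappers are legacy
// entry points where both strides equal the visible area.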
void MNNPackC2(double* dst, const double* src, size_t area, size_t depth, int* areaOffset) {
MNNPackC2Common<double>(dst, src, area, depth, areaOffset);
}
void MNNUnpackC2(double* dst, const double* src, size_t area, size_t depth, int* areaOffset) {
MNNUnpackC2Common<double>(dst, src, area, depth, areaOffset);
}
void MNNUnpackC2Float(float* dst, const float* src, size_t area, size_t depth, int* areaOffset, int pack) {
MNNUnpackC2Common<float>(dst, src, area, depth, areaOffset, pack);
}
#ifndef __aarch64__
void MNNPackInt8C2(float* dst, const float* src, size_t area, size_t depth, int* areaOffset) {
MNNPackC2Common<float>(dst, src, area, depth, areaOffset);
}
#endif
void MNNUnpackInt8C2(float* dst, const float* src, size_t area, size_t depth, int* areaOffset) {
MNNUnpackC2Common<float>(dst, src, area, depth, areaOffset);
}
void MNNUnpackC2Origin(double* dst, const double* src, size_t area, size_t depth, int areaOffset) {
int offset[] = {
areaOffset,
areaOffset,
};
MNNUnpackC2(dst, src, area, depth, offset);
}
void MNNPackC2Origin(double* dst, const double* src, size_t area, size_t depth, int areaOffset) {
int offset[] = {
areaOffset,
areaOffset,
};
MNNPackC2(dst, src, area, depth, offset);
}
void MNNUnpackInt8C2Origin(float* dst, const float* src, size_t area, size_t depth, int areaOffset) {
int offset[] = {
areaOffset,
areaOffset,
};
MNNUnpackInt8C2(dst, src, area, depth, offset);
}
void MNNPackInt8C2Origin(float* dst, const float* src, size_t area, size_t depth, int areaOffset) {
int offset[] = {
areaOffset,
areaOffset,
};
MNNPackInt8C2(dst, src, area, depth, offset);
}