Integrate two kernels matmul_clamp_f16_qsi8d32p_qai4c32p
and matmul_clamp_f32_qsi8d32p_qai4c32p. Now KleidiAI support
asymmetric & block wise quantified & f32/f16activation.
Remove .tar.gz package from source code. Will download from
https://gitlab.arm.com/kleidi/kleidiai/-/releases when cmake.
Refactor MNN_KleidiAI interface to support more model types,
and facilitate subsequent KleidiAI ukernels' integration.
Re-abstract information stored in class KleidiAI:
1) static info: not related to loaded model, initialized when
interface is constructed and never changed.
2) status: will change while pipeline is running.
Let interface and loaded model decouple for more complex mix
of multiple types of models. Add mAccelType in MNN data structure,
kleidiAI interface will rely on this type to decide which branch
to go.
Move some pack functions to mnn_kleidiai_util.cpp.
Add CPU feature detection in source/backend/cpu/CPURuntime.hpp.
Subsequent ukernels need SME information.