# Export the model, tokenizer, embedding, and the corresponding MNN model
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--export mnn
```
3. Exported Artifacts
The exported files include:
1. `config.json`: Configuration file for runtime, which can be manually modified.
2. `embeddings_bf16.bin`: Binary file containing the embedding weights, used during inference.
3. `llm.mnn`: The MNN model file, used during inference.
4. `llm.mnn.json`: JSON file corresponding to the MNN model, used for applying LoRA or GPTQ quantized weights.
5. `llm.mnn.weight`: MNN model weights, used during inference.
6. `llm.onnx`: ONNX model file without weights, not used during inference.
7. `llm_config.json`: Model configuration file, used during inference.
The directory structure is as follows:
```
.
└── model
    ├── config.json
    ├── embeddings_bf16.bin
    ├── llm.mnn
    ├── llm.mnn.json
    ├── llm.mnn.weight
    ├── onnx/
    │   ├── llm.onnx
    │   └── llm.onnx.data
    ├── llm_config.json
    └── tokenizer.txt
```
### Features
+ Direct Conversion to MNN Model
Use `--export mnn` to convert directly to an MNN model. You must either install `pymnn` or pass the path of the `MNNConvert` tool via `--mnnconvert`; at least one of the two is required. If `pymnn` is not installed and no path is given via `--mnnconvert`, `llmexport.py` falls back to looking for `MNNConvert` under `../../../build/`, so make sure the binary exists there. This method currently supports exporting 4-bit and 8-bit models.
+ If direct conversion to an MNN model fails, or you need a different quantization bit depth (e.g., 5-bit/6-bit), first export an ONNX model with `--export onnx`, then convert the ONNX model to an MNN model with the `MNNConvert` tool (a hedged example command is sketched after this list).
+ Supports dialogue testing with the model. Use `--test $query` to return the LLM's response.
+ Supports exporting with merged LoRA weights by specifying the directory of the LoRA weights using `--lora_path`.
+ Specify the quantization bit depth using `--quant_bit` and the quantization block size using `--quant_block`.
+ Use `--lm_quant_bit` to specify the quantization bit depth for the lm_head layer's weights. If not specified, the bit depth defined by `--quant_bit` will be used.
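For the ONNX-then-convert path mentioned above, a rough sketch of the two steps is shown below. The `MNNConvert` flags reflect its common weight-quantization options and should be verified against your build of the tool (for example via its help output); the paths and the 6-bit width are illustrative.
```
# Step 1: export an ONNX model (written to the onnx/ subdirectory)
python llmexport.py \
    --path /path/to/Qwen2-0.5B-Instruct \
    --export onnx

# Step 2: convert the ONNX model to MNN with weight quantization
./MNNConvert -f ONNX \
    --modelFile ./model/onnx/llm.onnx \
    --MNNModel ./model/llm.mnn \
    --weightQuantBits 6 \
    --weightQuantBlock 128
```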
Place all the exported files required for model inference into the same folder. Add a `config.json` file to describe the model name and inference parameters. The directory structure should look as follows:
```
.
└── model_dir
├── config.json
├── embeddings_bf16.bin
├── llm_config.json
├── llm.mnn
├── llm.mnn.weight
└── tokenizer.txt
```
##### Configuration Options
The configuration file supports the following options:
- Model File Information
- `base_dir`: Directory from which model files are loaded. Defaults to the directory of `config.json` or the model directory.
- `llm_config`: Path to `llm_config.json`, resolved as `base_dir + llm_config`. Defaults to `base_dir + 'config.json'`.
- `llm_model`: Path to `llm.mnn`, resolved as `base_dir + llm_model`. Defaults to `base_dir + 'llm.mnn'`.
- `llm_weight`: Path to `llm.mnn.weight`, resolved as `base_dir + llm_weight`. Defaults to `base_dir + 'llm.mnn.weight'`.
- `block_model`: For segmented models, the path to `block_{idx}.mnn`, resolved as `base_dir + block_model`. Defaults to `base_dir + 'block_{idx}.mnn'`.
- `lm_model`: For segmented models, the path to `lm.mnn`, resolved as `base_dir + lm_model`. Defaults to `base_dir + 'lm.mnn'`.
- `embedding_model`: If the embedding uses a model, its path is `base_dir + embedding_model`. Defaults to `base_dir + 'embedding.mnn'`.
- `embedding_file`: If the embedding uses a binary file, its path is `base_dir + embedding_file`. Defaults to `base_dir + 'embeddings_bf16.bin'`.
- `tokenizer_file`: Path to `tokenizer.txt`, resolved as `base_dir + tokenizer_file`. Defaults to `base_dir + 'tokenizer.txt'`.
- `visual_model`: For VL models, the path to the visual model is `base_dir + visual_model`. Defaults to `base_dir + 'visual.mnn'`.
- Inference Configuration
- max_new_tokens: Maximum number of tokens to generate. Defaults to `512`
- reuse_kv: Whether to reuse the `kv cache` in multi-turn dialogues. Defaults to `false`
- quant_qkv: Whether to quantize query, key, and value in the CPU attention operator. Options: `0`, `1`, `2`, `3`, `4`. Defaults to `0`:
    - 0: Neither `key` nor `value` is quantized.
    - 1: Use asymmetric 8-bit quantization for `key`.
    - 2: Use the `fp8` format to quantize `value`.
    - 3: Use asymmetric 8-bit quantization for `key` and `fp8` for `value`.
    - 4: Quantize both `key` and `value`, and additionally use asymmetric 8-bit quantization for `query` and `int8` matrix multiplication for `Q*K`.
- use_mmap: Whether to use `mmap` to write weights to disk when memory is insufficient, avoiding out-of-memory failures. Defaults to `false`. Setting this to `true` is recommended on mobile devices.
- kvcache_mmap: Whether to use `mmap` to write the KV cache to disk when memory is insufficient, avoiding out-of-memory failures. Defaults to `false`.
- tmp_path: Directory used for disk caching when the `mmap`-related features are enabled. On iOS, point this to a temporary directory (for example, the path returned by `NSTemporaryDirectory()`).
- backend_type: Hardware backend used for inference. Defaults to `"cpu"`. `"opencl"` is supported for Android GPUs, and `"metal"` is supported for macOS and iOS GPUs.
- sampler_type: sets the sampler type. Currently supports the 8 basic sampler types `greedy`, `temperature`, `topK`, `topP`, `minP`, `tfs`, `typical`, and `penalty`, plus `mixed` (when set to `mixed`, the samplers listed in mixed_samplers are executed one after another). Defaults to `greedy`; `mixed` or `temperature` is suggested for output diversity, and `penalty` is suggested to reduce repetition.
- mixed_samplers: takes effect when `sampler_type` is `mixed`. Defaults to `["topK", "tfs", "typical", "topP", "min_p", "temperature"]`, meaning the logits are sampled by these strategies sequentially, one after another.
- temperature: temperature value for the `temperature`, `topP`, `minP`, `tfsZ`, and `typical` strategies. Defaults to 1.0
- topK: number of top K tokens selected for `topK` sampler type. Defaults to 40
- topP: top P value for `topP` sampler type. Defaults to 0.9
- minP: min P value for `minP` sampler type. Defaults to 0.1
- tfsZ: Z value for `tfs` sampler type. Defaults to 1.0
- typical: p value for `typical` sampler type. Defaults to 1.0
- penalty: penalty factor applied to repeated tokens, used in the `penalty` sampler type. Defaults to 0.0 (no penalty)
- n_gram: maximum n-gram length that receives a penalty; repeated n-grams of length >= n_gram are blocked from being generated. Used in the `penalty` sampler type. Defaults to 8
- ngram_factor: extra penalty for repeated n-grams when n > 1, used in the `penalty` sampler type. Defaults to 1.0 (no extra penalty)
- penalty_sampler: sampling strategy applied after the penalty, used in the `penalty` sampler type; can be `"greedy"` or `"temperature"`. Defaults to `"greedy"`.
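Putting the options above together, a minimal `config.json` might look like the following. The file names match the export layout shown earlier; the sampling values are illustrative rather than recommendations.
```
{
    "llm_model": "llm.mnn",
    "llm_weight": "llm.mnn.weight",
    "embedding_file": "embeddings_bf16.bin",
    "tokenizer_file": "tokenizer.txt",
    "max_new_tokens": 512,
    "reuse_kv": false,
    "backend_type": "cpu",
    "sampler_type": "mixed",
    "temperature": 1.0,
    "topK": 40,
    "topP": 0.9
}
```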
Run `llm_demo` with the exported model:
```
# Without config.json, using default configuration
## Interactive Chat
./llm_demo model_dir/llm.mnn
## Replying to each line in the prompt
./llm_demo model_dir/llm.mnn prompt.txt
```
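When a `config.json` is present, the demo is typically pointed at the configuration file instead of the model file. A hedged sketch (verify against your build of `llm_demo`):
```
# With config.json
## Interactive chat
./llm_demo model_dir/config.json
## Replying to each line in the prompt
./llm_demo model_dir/config.json prompt.txt
```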
- For Visual Models
Embed image input in the prompt as follows:
```
<img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Describe the content of the image.
```
Specify the image size:
```
<img><hw>280, 420</hw>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Describe the content of the image.
```
- For Audio Models
Embed audio input in the prompt as follows:
```
<audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav</audio>Describe the content of the audio.
```
#### GPTQ Weights
To use `GPTQ` weights, you can specify the path to the `Qwen2.5-0.5B-Instruct-GPTQ-Int4` model using the `--gptq_path PATH` option when exporting the `Qwen2.5-0.5B-Instruct` model. Use the following command:
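A hedged example of such an export command, combining the flags mentioned above (paths are placeholders):
```
python llmexport.py \
    --path /path/to/Qwen2.5-0.5B-Instruct \
    --gptq_path /path/to/Qwen2.5-0.5B-Instruct-GPTQ-Int4 \
    --export mnn
```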
#### LoRA Weights
There are two ways to use LoRA weights: merge them into the base model at export time, or export them as a separate LoRA model.
The first approach is faster and simpler, but does not support switching LoRA weights at runtime.
The second approach adds a small memory and compute overhead but is more flexible: LoRA weights can be switched at runtime, which suits multi-LoRA scenarios.
##### Merging LoRA
###### Export
To merge LoRA weights into the original model, specify the `--lora_path PATH` parameter during model export. By default, the model is exported with the merged weights. Use the following command:
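A hedged example, mirroring the export command shown at the top of this document (the LoRA directory is a placeholder):
```
python llmexport.py \
    --path /path/to/Qwen2-0.5B-Instruct \
    --lora_path /path/to/lora_weights \
    --export mnn
```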
Using the merged LoRA model is exactly the same as using the original model.
##### Separating LoRA
###### Export
To export LoRA as a separate model that supports runtime switching, specify the `--lora_path PATH` parameter and include the `--lora_split` flag during model export. Use the following command:
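A hedged example, identical to the merged-LoRA export above except for the additional `--lora_split` flag:
```
python llmexport.py \
    --path /path/to/Qwen2-0.5B-Instruct \
    --lora_path /path/to/lora_weights \
    --lora_split \
    --export mnn
```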