I came across this paper, https://arxiv.org/pdf/2505.02819, which proposes layer pruning & linear approximation to reduce model size, increase inference speed & decrease memory bandwidth usage.
Healed model: https://huggingface.co/MTSAIR/Llama3.1-6B-ReplaceMe-Healed
Pruned model (pre-healing): https://huggingface.co/MTSAIR/Llama3.1-6B-ReplaceMe/tree/main
I went ahead and converted it with RKLLM v1.2.1b1.
28 GB+ of RAM is recommended for the conversion.
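For reference, here's a minimal sketch of the conversion script, assuming the rkllm-toolkit Python API (load_huggingface / build / export_rkllm); the paths and the rk3588 target are placeholders for my setup, the w8a8 / optimization level 1 settings match the exported filename, and remaining build parameters are left at their defaults:

```python
# Minimal RKLLM conversion sketch; assumes rkllm-toolkit is installed.
# MODEL_PATH / EXPORT_PATH / target_platform are placeholders.
from rkllm.api import RKLLM

MODEL_PATH = './Llama3.1-6B-ReplaceMe-Healed'   # local Hugging Face checkout
EXPORT_PATH = './Llama3.1-6B-ReplaceMe-Healed-w8a8-opt1.rkllm'

llm = RKLLM()

# Load the Hugging Face model from disk
if llm.load_huggingface(model=MODEL_PATH) != 0:
    raise SystemExit('load failed')

# Quantize to w8a8 at optimization level 1 (matches the exported filename)
if llm.build(do_quantization=True,
             optimization_level=1,
             quantized_dtype='w8a8',
             target_platform='rk3588') != 0:
    raise SystemExit('build failed')

# Export the .rkllm artifact
if llm.export_rkllm(EXPORT_PATH) != 0:
    raise SystemExit('export failed')
```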
Run #1 failed for some reason…
Reading https://github.com/c0zaut/ez-er-rkllm-toolkit shows that your RAM should be 2-4x the model size. Enabling swap helps, but it makes the conversion slow; you can leave it running overnight (see the snippet below).
Refer to "Enable Swap memory in Ubuntu" for details if you need to add swap.
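If you do need swap, the standard Ubuntu swapfile steps look like this; the 32G size and the convert_rkllm.py script name are my assumptions, so size the file to your model:

```bash
# Create and enable a 32G swapfile (size is an assumption; adjust as needed)
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show    # verify the swapfile is active

# With swap the conversion is slow, so leave it running overnight,
# e.g. (convert_rkllm.py is a hypothetical name for the script above):
nohup python convert_rkllm.py > convert.log 2>&1 &
```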
Second run worked!
Model uploaded here: https://huggingface.co/ThomasTheMaker/Llama3.1-6B-ReplaceMe-Healed-rkllm-v1.2.1b1/blob/main/Llama3.1-6B-ReplaceMe-Healed-w8a8-opt1.rkllm