I came across this paper, https://arxiv.org/pdf/2505.02819, which proposes layer pruning plus a linear approximation of the removed layers to reduce model size, increase inference speed, and decrease memory bandwidth.
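The core idea, as I read it: drop a contiguous block of transformer layers and fit a single linear map that mimics what the block did to the hidden states. Here is a toy sketch of that fit (my own illustration with made-up shapes and random data, not the paper's code):

```python
import torch

# Toy illustration: approximate a pruned block of decoder layers with one
# linear transform T, fitted by least squares on calibration hidden states.
hidden = 4096        # Llama 3.1 hidden size
n_tokens = 8192      # number of calibration tokens (assumed)

# H_in: hidden states entering the pruned block; H_out: states leaving it.
# In practice both come from running calibration text through the full model.
H_in = torch.randn(n_tokens, hidden)
H_out = torch.randn(n_tokens, hidden)

# Closed-form least-squares fit: T minimizes ||H_in @ T - H_out||_F.
T = torch.linalg.lstsq(H_in, H_out).solution   # shape (hidden, hidden)

# At inference the pruned layers are gone and their effect is approximated
# by a single matmul, which can be folded into an adjacent weight matrix.
approx_out = H_in @ T
```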

https://huggingface.co/MTSAIR/Llama3.1-6B-ReplaceMe-Healed

https://huggingface.co/MTSAIR/Llama3.1-6B-ReplaceMe/tree/main

I went ahead and converted it to RKLLM v1.2.1b1.
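For reference, the conversion boils down to a short script like this; a minimal sketch assuming the rkllm-toolkit v1.2.1b1 Python API, with placeholder paths, and w8a8/opt1 settings matching the filename of the model I uploaded below:

```python
from rkllm.api import RKLLM

MODEL_PATH = "./Llama3.1-6B-ReplaceMe-Healed"  # local Hugging Face checkout

llm = RKLLM()

# Load the Hugging Face checkpoint (returns 0 on success).
ret = llm.load_huggingface(model=MODEL_PATH)
assert ret == 0, "model load failed"

# Quantize to w8a8 at optimization level 1 (hence the -w8a8-opt1 suffix).
ret = llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8",
    target_platform="rk3588",  # assumption: set to your board's SoC
)
assert ret == 0, "build failed"

ret = llm.export_rkllm("./Llama3.1-6B-ReplaceMe-Healed-w8a8-opt1.rkllm")
assert ret == 0, "export failed"
```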

28 GB+ of RAM is recommended.

Run #1 failed for some reason…

Reading https://github.com/c0zaut/ez-er-rkllm-toolkit shows that your RAM should be 2-4x the model size. Enabling swap helps; it just makes the run slow. (You can automate it to run overnight; see the script after the swap note below.)
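The arithmetic lines up: a ~6B model in fp16 is roughly 12 GB of weights, so 2-4x puts you at 24-48 GB, consistent with the 28 GB+ note above. A quick pre-flight check before kicking off a run (the model size here is a rough estimate, not measured):

```python
import os

model_size_gb = 12              # ~6B params in fp16, roughly
needed_gb = 2 * model_size_gb   # lower bound of the 2-4x guidance

# Physical RAM from POSIX sysconf (Linux).
phys_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9

if phys_gb < needed_gb:
    print(f"Only {phys_gb:.0f} GB RAM; ~{needed_gb} GB+ recommended. Add swap first.")
```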

Refer to "Enable Swap memory in Ubuntu" for details if you need to add swap.
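If you'd rather script it than follow the guide by hand, here are the standard Ubuntu swap-file steps (fallocate, chmod, mkswap, swapon) wrapped in Python so the whole overnight run can live in one script; the size is just an example, pick whatever reaches the 2-4x target:

```python
import subprocess

def enable_swap(path="/swapfile", size_gb=32):
    """Create and activate a swap file (standard Ubuntu steps; needs sudo)."""
    subprocess.run(["sudo", "fallocate", "-l", f"{size_gb}G", path], check=True)
    subprocess.run(["sudo", "chmod", "600", path], check=True)  # root-only access
    subprocess.run(["sudo", "mkswap", path], check=True)        # format as swap
    subprocess.run(["sudo", "swapon", path], check=True)        # activate now

enable_swap()
```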

Second run worked!

Model uploaded here: https://huggingface.co/ThomasTheMaker/Llama3.1-6B-ReplaceMe-Healed-rkllm-v1.2.1b1/blob/main/Llama3.1-6B-ReplaceMe-Healed-w8a8-opt1.rkllm
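If you want to pull the converted file programmatically, huggingface_hub handles the download and caching:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ThomasTheMaker/Llama3.1-6B-ReplaceMe-Healed-rkllm-v1.2.1b1",
    filename="Llama3.1-6B-ReplaceMe-Healed-w8a8-opt1.rkllm",
)
print(path)  # local cache path to the .rkllm file
```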