So this was cool.
I came across a newly released language model, fine-tuned on medical data from a Llama 2 base model, and wanted to try it out.
model: epfl-llm/meditron-7b
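If you want to grab it yourself, the Hugging Face CLI is one way to pull the full checkpoint down (the local directory name here is just an example):

    # install the Hugging Face CLI and download the PyTorch checkpoint
    pip install -U "huggingface_hub[cli]"
    huggingface-cli download epfl-llm/meditron-7b --local-dir meditron-7b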
Once I got it downloaded I tried to use the oobabooga webui, but ran into issues, so I decided to convert it to GGUF format and use it with GPT4All.
I found some good instructions:
https://www.secondstate.io/articles/convert-pytorch-to-gguf/
I converted the PyTorch model to GGUF with FP16 weights.
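The conversion step from those instructions boils down to llama.cpp's convert script; it looks roughly like this (the directory and output names are just placeholders):

    # convert the downloaded PyTorch/HF checkpoint to a GGUF file with FP16 weights
    python llama.cpp/convert.py meditron-7b --outtype f16 --outfile meditron-7b-f16.gguf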
Then when I got around to quantizing it (without quantization it runs pretty slowly and the file is much larger), I found that CMake wasn’t available on my Windows system.
There’s no way I wanted to deal with compiling a program on Windows, so I switched over to WSL to set things up and compile the needed binary.
I had to install a few packages on my Ubuntu WSL setup, which I figured out through searching, but things went smoothly after I updated and installed them.
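I didn't keep a record of the exact package list, but the usual build prerequisites cover it, something like:

    # update, then pull in the compiler toolchain, CMake, and Python bits
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y build-essential cmake git python3 python3-pip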
Then I compiled the binary.
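For reference, the standard CMake flow for llama.cpp builds the quantize tool under build/bin; this is the typical sequence rather than my exact shell history:

    # clone llama.cpp and build its tools, including the quantize binary
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    mkdir build && cd build
    cmake ..
    cmake --build . --config Release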
Once I figured out that I could access the Windows file system from WSL easily, I ran the quantization, going with Q4.
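WSL exposes the Windows drives under /mnt, so the quantize binary can read and write the GGUF files right where they live on the Windows side. Something along these lines (exact paths and the specific Q4 variant will depend on your setup):

    # the Windows C: drive is visible as /mnt/c inside WSL
    ./build/bin/quantize /mnt/c/models/meditron-7b-f16.gguf /mnt/c/models/meditron-7b-q4_0.gguf q4_0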
Now the model loaded into GPT4All works great at over 7 tokens per sec 🙂