Fine Tuning LLaMA Models With QLoRA and Axolotl

(23-06-20 Edit: Adjusted text regarding the need for quantization to reflect the possibility of loading an adapter in tools.)
[record scratch] … Wait, what?
The LLaMA model released by Meta Research has ignited the imaginations of many people. Its performance on consumer-level hardware has lowered the barrier of entry for people to use large language models. In a short amount of time, many models have been derived from the original LLaMA model, fine-tuned to better suit specific tasks.
How is this done? Aemon Algiz’s YouTube channel has several videos (such as this one) discussing technical aspects of this while keeping the topic accessible. Why QLoRA instead of just LoRA? Tim Dettmers (and others) have a paper which describes the ‘why’ part well in the summary. QLoRA uses 4-bit quantized data in the process, which reduces time and GPU VRAM requirements. In short, it makes it even easier for consumer hardware to perform finetuning.
So if we can take a base model, trained on a trillion tokens of text, and fine tune it with datasets of our choosing, how can we accomplish this? One tool is the axolotl project. It provides an easy-to-use set of scripts to run the fine tuning and create the LoRA files.
Having separate LoRA files can be convenient, but in order to use them in the usual tools like text-generation-webui, you’ll need to load the base model’s original float16 file, which is usually around four times larger than the equivalent 4-bit quantized file. It is possible to load this float16 version of the model using the Transformers library in the UI and select 8-bit or 4-bit to save on GPU VRAM, but my tests show that loading the model in 4-bit mode and applying the QLoRA separately yields only one quarter the speed of a merged quantized model. So once we finish training, we will need to merge the adapter back into the base non-quantized model and then quantize that merged model for maximum performance.
Here’s the whole script to accomplish this. It’s written based on an
Ubuntu Linux 23.04 install with miniconda
already installed, as well as NVIDIA proprietary drivers and CUDA 11.8 packages.
It’s also assumed that some basic tools like git and the build-essential package
(or equivalent) are installed. I’ll explain what is happening after the script
in more detail.
# Setup a new conda environment pinned to python 3.9
conda create -n axolotl python=3.9
conda activate axolotl
# Install pytorch for cuda 11.8
pip3 install torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/cu118
# Clone the github and switch directories to it
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
# As of the time of this writing, 0.2.1 is the latest release
git checkout tags/v0.2.1
# Install the dependencies
pip3 install -e .
pip3 install -U git+https://github.com/huggingface/peft.git
# I have problems with the current bitsandbytes unless I force
# the cuda 11.8 version onto the cpu version ...
cd ~/miniconda3/envs/axolotl/lib/python3.9/site-packages/bitsandbytes
mv libbitsandbytes_cpu.so backup_libbitsandbytes_cpu.so
cp libbitsandbytes_cuda118.so libbitsandbytes_cpu.so
cd ~/axolotl
accelerate config  # selected no distributed training and defaults
# Copy the 3B qlora example for open-llama into a new directory
mkdir examples/openllama-7b
cp examples/openllama-3b/qlora.yml \
    examples/openllama-7b/qlora.yml
vim examples/openllama-7b/qlora.yml
## EDIT this qlora.yml to change these keys to target the 7B model
#    base_model: openlm-research/open_llama_7b
#    base_model_config: openlm-research/open_llama_7b
# This will take some time. Output will be in `./qlora-out`
accelerate launch scripts/finetune.py \
    examples/openllama-7b/qlora.yml
# When training finishes, you can test inference with this:
accelerate launch scripts/finetune.py \
    examples/openllama-7b/qlora.yml \
    --inference --lora_model_dir="./qlora-out"
# Merge the lora weights into one file
accelerate launch scripts/finetune.py \
    examples/openllama-7b/qlora.yml \
    --merge_lora --lora_model_dir="./qlora-out" \
    --load_in_8bit=False --load_in_4bit=False
# Now we have a merged model in ./qlora-out/merged
# We need to copy the tokenizer.model back into this directory
cd qlora-out/merged
wget https://huggingface.co/openlm-research/open_llama_7b/resolve/main/tokenizer.model
# Setup llama.cpp for quantization and inference 
# (steps shown for linux; ymmv)
cd $HOME
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUBLAS=1
# We need to convert the pytorch model into ggml for quantization
# It creates 'ggml-model-f16.bin' in the 'merged' directory.
python convert.py --outtype f16 \
    ~/axolotl/qlora-out/merged/pytorch_model-00001-of-00002.bin 
# Start off by making a basic q4_0 4-bit quantization.
# It's important to have 'ggml' in the name of the quant for some
# software to recognize its file format.
./quantize ~/axolotl/qlora-out/merged/ggml-model-f16.bin \
    ~/axolotl/qlora-out/merged/openllama-7b-GPT4-ggml-q4_0.bin q4_0
# There we go! Now we have a quantized fine-tuned model! 
# You can test it out with llama.cpp
./main -n 128 --color -i -r "User:" -f prompts/chat-with-bob.txt \
    -m ~/axolotl/qlora-out/merged/openllama-7b-GPT4-ggml-q4_0.bin
Okay, that’s a lot of steps! If you run through this, you’ll see that about 16 GB of VRAM in your GPU is required and it’ll take about 4.5 hours on a 4090.
Firstly, let me address why I’m using open-llama-7b instead of the 3b variant as provided in the example. At the time of writing, llama.cpp doesn’t support quantizing the 3b variant without hacks: the 3b model is considered a new LLaMA variant, and llama.cpp will need further updates to accommodate it.
With that out of the way, let’s walk through the lines above.
conda create -n axolotl python=3.9
conda activate axolotl
Here, I’m using miniconda to create a new virtual environment
with Python pinned to version 3.9, which is the version recommended
in the axolotl README. After creating it, we activate it so that
all subsequent pip installs go into this environment and don’t
affect other projects. You need to run conda activate axolotl
any time you start a shell to interact with the tools.
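Since a forgotten activation sends later pip installs into the wrong environment, a small guard can catch it early. This is only a sketch and not part of the original script: the function name `require_conda_env` is made up for illustration, and it assumes that `conda activate` has set the `CONDA_DEFAULT_ENV` variable in the current shell.

```shell
# Hypothetical helper: succeed only when the named conda environment is active.
# Assumes conda's activation exported CONDA_DEFAULT_ENV.
require_conda_env() {
    wanted="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$wanted" ]; then
        echo "Run 'conda activate $wanted' first (active: ${CONDA_DEFAULT_ENV:-none})" >&2
        return 1
    fi
}

# Example use before any pip3 install:
#   require_conda_env axolotl || exit 1
```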
pip3 install torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/cu118
After the conda environment is activated, we install the PyTorch 2.0.1 libraries targeted towards CUDA 11.8. This command was taken from the PyTorch page.
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
git checkout tags/v0.2.1  #optional
pip3 install -e .
pip3 install -U git+https://github.com/huggingface/peft.git
Here we clone the axolotl project, change into the project’s directory, and switch to the current tagged version as of this writing. Checking out the tag just keeps things a little more reproducible; if you want to skip it and build from main, feel free to try. After that, we install the project’s dependencies, plus the latest peft from GitHub.
The file paths for all these commands are built around cloning axolotl into your home directory, but of course that’s not necessary. Feel free to update your paths to where you clone the project.
cd ~/miniconda3/envs/axolotl/lib/python3.9/site-packages/bitsandbytes
mv libbitsandbytes_cpu.so backup_libbitsandbytes_cpu.so
cp libbitsandbytes_cuda118.so libbitsandbytes_cpu.so
When I first ran through the steps to fine-tune, I got an error with the bitsandbytes library not having a certain function exported on its library file. This hack right here is the fix: copy the CUDA version over the CPU version. Perhaps this won’t be necessary in the future.
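If your miniconda lives somewhere other than `~/miniconda3`, the same hack can be written against a directory you pass in. The sketch below is an illustration only: `patch_bitsandbytes` is a made-up name, and the example path assumes that conda's activation has set `CONDA_PREFIX` to the environment's root and that the environment uses Python 3.9 as above. It also keeps the first backup instead of overwriting it on a second run.

```shell
# Hypothetical helper: copy the CUDA 11.8 library over the CPU one,
# keeping a one-time backup of the original CPU library.
patch_bitsandbytes() {
    bnb_dir="$1"
    cuda_lib="$bnb_dir/libbitsandbytes_cuda118.so"
    cpu_lib="$bnb_dir/libbitsandbytes_cpu.so"
    backup="$bnb_dir/backup_libbitsandbytes_cpu.so"

    # Refuse to touch anything if the CUDA build is not there.
    [ -f "$cuda_lib" ] || { echo "missing $cuda_lib" >&2; return 1; }

    # Only back up the first time, so a re-run does not clobber the original.
    if [ -f "$cpu_lib" ] && [ ! -e "$backup" ]; then
        mv "$cpu_lib" "$backup"
    fi
    cp "$cuda_lib" "$cpu_lib"
}

# Example use (path layout assumed, see above):
#   patch_bitsandbytes "$CONDA_PREFIX/lib/python3.9/site-packages/bitsandbytes"
```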
cd ~/axolotl
accelerate config  # selected no distributed training and defaults
This will allow you to select some configuration options for the project. I didn’t enable any distributed training options because I don’t have the hardware for it. For the rest of the options, I selected whatever the default option was.
mkdir examples/openllama-7b
cp examples/openllama-3b/qlora.yml \
    examples/openllama-7b/qlora.yml
vim examples/openllama-7b/qlora.yml
## EDIT this qlora.yml to change these keys to target the 7B model
#    base_model: openlm-research/open_llama_7b
#    base_model_config: openlm-research/open_llama_7b
As mentioned above, we’re going to convert the open-llama-3b example into an open-llama-7b example.
We copy the qlora.yml from the 3b directory into a new directory we make for the 7b. After that,
open it up and change the top two lines to refer to the 7b model.
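If you would rather not open an editor, the same two-key change can be scripted. This is a sketch under a few assumptions: it relies on GNU sed (as shipped with Ubuntu), it assumes both keys start at the beginning of a line, and `retarget_base_model` is a name invented here for illustration.

```shell
# Hypothetical helper: point base_model and base_model_config at another model.
# Requires GNU sed for the in-place (-i) and extended regex (-E) options.
retarget_base_model() {
    config_file="$1"
    model_id="$2"
    # Matches "base_model:" and "base_model_config:" only; other keys are untouched.
    sed -i -E "s|^(base_model(_config)?:).*|\1 ${model_id}|" "$config_file"
}

# Example use:
#   retarget_base_model examples/openllama-7b/qlora.yml openlm-research/open_llama_7b
```

It is worth opening the file afterwards anyway to confirm that both keys now read as expected.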
accelerate launch scripts/finetune.py \
    examples/openllama-7b/qlora.yml
This line is the meat of the project. On a 4090 it took about 4.5 hours to complete and used about 16 GB of VRAM.
This VRAM usage can be reduced by further changing the qlora.yml settings, but for this blog article we
will keep all the settings stock.
Finishing this step leaves you with a qlora-out folder with checkpointed saves of the QLoRA adapter. At
this point, you do have a finetune of the open-llama-7b model.
accelerate launch scripts/finetune.py \
    examples/openllama-7b/qlora.yml \
    --inference --lora_model_dir="./qlora-out"
This line is optional, but it serves as a test to see if the QLoRA finetuning is working. One way to test this is to ask a question based on the dataset you trained on. This example uses the GPT4-LLM-Cleaned dataset which is ~56k rows of questions and answers.
accelerate launch scripts/finetune.py \
    examples/openllama-7b/qlora.yml \
    --merge_lora --lora_model_dir="./qlora-out" \
    --load_in_8bit=False --load_in_4bit=False
As of the time of this writing, the QLoRA adapter alone can only be used if you keep the original float16
model around. But even if you merge the QLoRA at runtime, you lose performance on top of the already
slower performance of the source float16 model. For better speed we need to quantize the model.
There are two ways to do this right now: GGML or GPTQ. For this tutorial we will be performing
quantization with llama.cpp to produce a GGML file.
To do that, we need to merge
the QLoRA adaptations back into the base model. This line does exactly that and puts the resulting
pytorch_model files in the qlora-out/merged folder. The merging process has to happen at the source
floating point resolution, so the base model cannot be loaded in 4-bit or 8-bit quantization mode.
cd qlora-out/merged
wget https://huggingface.co/openlm-research/open_llama_7b/resolve/main/tokenizer.model
Once merged, we need to copy at least the tokenizer.model from the source model back into
the qlora-out/merged folder for the next step of the process, which is the quantization step. Now,
you should have a float16 resolution pytorch model, which is not an optimal use of resources
because VRAM requirements are high and the speed of inference is slow, relative to quantized models.
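To put rough numbers on the float16 versus 4-bit difference, here is a back-of-the-envelope calculation. It is only an estimate under stated assumptions: it takes 7 billion as a round parameter count, counts the weights alone, and ignores any per-block scale data a real quantized format stores, so actual file sizes will differ somewhat. The function name `model_size_mb` is made up for this example.

```shell
# Hypothetical helper: approximate weight storage in megabytes
# for a given parameter count and number of bits per weight.
model_size_mb() {
    echo $(( $1 * $2 / 8 / 1000000 ))
}

model_size_mb 7000000000 16   # float16: prints 14000 (about 14 GB)
model_size_mb 7000000000 4    # 4-bit:   prints 3500  (about 3.5 GB)
```

The ratio of the two results is the "around four times larger" figure mentioned earlier.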
cd $HOME
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUBLAS=1
This part is just for building llama.cpp. It turns out, there’s more to this project than just bare-bones text inference. It also houses the code needed to do the GGML file format conversions and quantizations. There’s nothing special about these steps, so if you already have it compiled somewhere else, feel free to use that version of the project.
python convert.py --outtype f16 \
    ~/axolotl/qlora-out/merged/pytorch_model-00001-of-00002.bin 
Before we can quantize the model, reducing its bit-depth and performing optimizations for the GGML file
format, we have to convert it into a GGML file at its current float16 resolution. This command does that.
It will leave a ggml-model-f16.bin file next to the pytorch_model files in the qlora-out/merged folder.
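Before moving on to quantization, it can save a confusing error to confirm that the conversion really produced its output. The check below is a sketch, not part of the original script; `require_files` is a name invented for illustration, and the file names in the example are the ones used in this article.

```shell
# Hypothetical helper: verify that each named file exists and is non-empty
# inside the given directory; report every missing file, then fail.
require_files() {
    dir="$1"
    shift
    missing=0
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use:
#   require_files ~/axolotl/qlora-out/merged tokenizer.model ggml-model-f16.bin || exit 1
```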
./quantize ~/axolotl/qlora-out/merged/ggml-model-f16.bin \
    ~/axolotl/qlora-out/merged/openllama-7b-GPT4-ggml-q4_0.bin q4_0
This step is where the quantization is performed. There are many types of quantizations available now, but the
q4_0 type is pretty standard and one of the most common around. You can change it up by altering the last parameter
to the program. One thing to consider is making sure that ggml is included in the filename of the output file
because some tools will decide how they load the file based on it being present in the name.
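If you want to compare several quantization types, the same command can be looped while keeping ggml in every output name. This is a sketch only: the function names and the `QUANTIZE_BIN` variable are made up for illustration, and any type name other than q4_0 is an assumption that should be checked against what your build of llama.cpp actually accepts.

```shell
# Hypothetical helper: build an output path that keeps 'ggml' in the filename.
quant_output_name() {
    printf '%s/openllama-7b-GPT4-ggml-%s.bin\n' "$1" "$2"
}

# Hypothetical helper: run the quantize tool once per requested type.
# QUANTIZE_BIN is an illustrative override; it defaults to ./quantize.
quantize_variants() {
    merged_dir="$1"
    shift
    for qtype in "$@"; do
        "${QUANTIZE_BIN:-./quantize}" "$merged_dir/ggml-model-f16.bin" \
            "$(quant_output_name "$merged_dir" "$qtype")" "$qtype" || return 1
    done
}

# Example use, run from the llama.cpp folder (q8_0 is an assumed type name):
#   quantize_variants ~/axolotl/qlora-out/merged q4_0 q8_0
```

Setting `QUANTIZE_BIN=echo` turns the loop into a dry run that just prints the arguments it would pass.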
And there you have it! Your very own custom finetuned large language model! You should be able to take it anywhere that your quantization is supported. Since you’re already in the llama.cpp project folder, you can quickly test it out using their command-line tool for text inference with the final line of the script.
./main -n 128 --color -i -r "User:" -f prompts/chat-with-bob.txt \
    -m ~/axolotl/qlora-out/merged/openllama-7b-GPT4-ggml-q4_0.bin
I hope this was useful to you, dear reader. It all seems pretty obvious and straightforward when laid out in this series of steps, but there’s not much out there in the way of organized tutorials showing how you can start with a base model, like open-llama-7b, finetune it using QLoRA, and then make it a quantized model that can be used in the common tools. It took a lot of watching YouTube content, fiddling around with the tools, and a minor amount of begging for help in the right places to piece it all together.
Now the challenge won’t be the technical process of creating the finetune … it will be what data should we train it on and how should it be organized!