Reproducing other finetunes with QLoRA


This is still an in-progress journey

Since I wrote the first blog post that detailed how to use axolotl, I have dabbled here and there trying to do qlora finetuning with my own data and have been largely unsuccessful. It’s a frustrating process because it’s not very clear where things are going wrong. Two things I have learned about the LLM space are: 1) assumptions need to be checked regularly because things change fast, and 2) sometimes things aren’t as baked and finished as I might think they are … because things change fast.

My initial passes with both axolotl and qlora were able to produce useful adapters: when I ran text inference on material from the dataset, I could see more detailed responses from the model with the adapter loaded.

So I had verified that was working.

And that’s where the results stopped for a while. Frankly, I don’t feel like I have a strong enough grasp to explain why it wasn’t working. So what I’m going to do instead is write down what I tried, what didn’t work, and what did. This is my journey to reproduce Eric Hartford’s Based model through qlora finetuning.


Datasets

One of my interests throughout my exploration of LLMs is chatbots. Neither axolotl nor qlora has sample scripts that show conversation datasets in use. Qlora’s scripts all point to the guanaco dataset, and most of axolotl’s examples point to the GPT4-LLM or alpaca-gpt4 datasets. None of these are in the ‘sharegpt’ format.

If you look into qlora’s qlora.py script around line 514, you’ll see that ‘sharegpt’ isn’t a dataset format it supports.

For axolotl, the story is more complicated. When I added a dataset in the sharegpt format, it only seemed to work when pulled from Huggingface. For example, putting this ‘datasets’ fragment in the yml file will seem to train:

datasets:
  - path: ehartford/based
    type: sharegpt:chat
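
As an aside, if you want to see exactly what axolotl will be fed from the hub, you can peek at the dataset with the datasets library first. A quick sketch (not something I ran as part of the original experiments):

from datasets import load_dataset

# pull the hub copy of the dataset and inspect its splits and one raw record
ds = load_dataset("ehartford/based")
print(ds)
first_split = next(iter(ds.values()))
print(first_split[0])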

However, when I test it with inference, this is what I get:

Below is an instruction that describes a task. Write a response 
that appropriately completes the request.

### Instruction:
C3NzaC1l Hello.  Who is your creator?

### Response:
HeavenSent Your Creator's name is Jesus Christ, Son of God.</s>

OOF! 🤣

Not exactly what I was hoping for. “But what settings did you use to train,” you may ask. Well, that’s part of the problem, right? There are a lot of knobs to turn in this process, and it’s hard to tell when something is going wrong, especially when the logs show no warning sign that training didn’t actually take.

I figured I’d try to look into it a little further, so I downloaded the dataset linked on kaggle into the examples folder holding my axolotl config yml and then tried linking it in as local data with the ‘datasets’ fragment below. That produced an error: RuntimeError: Expected at least one datapoint in dataset.

datasets:
  - path: json
    data_files: examples/project/sentient-bot-conversations.json
    type: sharegpt:chat

I looked into it briefly and couldn’t figure out where the problem was. It turns out this file is formatted differently from the huggingface version. If you download the based.json from the repo instead, it will in fact train. But again, we get nonsense replies.
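
If you’re curious where the two files diverge, a quick way to see it is to load each one and print its top-level shape. This is just a rough sketch; the paths assume both json files sit in my examples folder:

import json

# paths are examples -- adjust to wherever you saved the two files
for path in ["examples/project/sentient-bot-conversations.json",
             "examples/project/based.json"]:
    with open(path) as f:
        data = json.load(f)
    first = data[0] if isinstance(data, list) else data
    print(path)
    print("  top-level type:", type(data).__name__, "with", len(data), "entries")
    print("  first record keys:", list(first.keys()) if isinstance(first, dict) else first)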

Adapting the sharegpt format into something else

In an effort to narrow down the potential problem area, I decided the best plan was to change the data format into something that looks like the sample data used in most of the examples:

[{"instruction": "...", "output": "..."},...]

I wrote a quick script in Python [Editor’s note: Dear god, what is happening to me!] to reformat the kaggle json, which was all one conversation:

import json
import sys

def convert_file(filename, out_filename):
    output_obj = []
    with open(filename, 'r') as source_f:
        source_json = json.load(source_f)

        # assumptions: we're just gonna grab the first conversations object and loop over those
        main_obj = source_json[0]
        conversations_array = main_obj['conversations']
        new_conv_obj = {}
        for item in conversations_array:
            if item['from'] == 'human':
                # a 'human' turn starts a new instruction/output pair
                new_conv_obj = {}
                new_conv_obj['instruction'] = item['value']
            elif item['from'] == 'gpt':
                # the 'gpt' turn that follows completes the pair
                new_conv_obj['output'] = item['value']
                output_obj.append(new_conv_obj)
            else:
                print("ERROR: unhandled 'from' value in the conversation history!")

        with open(out_filename, "w") as out_f:
            out_f.write(json.dumps(output_obj))
        
# USAGE: python <script_filename.py> SOURCE_FILENAME DEST_FILENAME
if __name__ == '__main__':
    source_fn = sys.argv[1]
    dest_fn = sys.argv[2]
    convert_file(source_fn, dest_fn)
    print("Conversion finished.")

Not very robust, but if the dataset has alternating ‘human’ and ‘gpt’ responses in one ‘conversations’ array, it should work just fine. So with that converted, I now have this ‘datasets’ fragment in my axolotl yml file:

datasets:
  - path: json
    data_files: examples/project/sentient-bot-conversations-alpaca.json
    type: alpaca
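
Before kicking off training, it doesn’t hurt to sanity-check the converted file. A minimal sketch (the path matches the yml fragment above):

import json

# confirm the conversion produced a non-empty list of instruction/output pairs
with open("examples/project/sentient-bot-conversations-alpaca.json") as f:
    pairs = json.load(f)

print(len(pairs), "instruction/output pairs")
print(pairs[0])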

With that in place, once trained, I can now ask who its creator is and it gives me an answer based on the training dataset:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
C3NzaC1l Hello.  Who is your creator?

### Response:
 I am Eric Solberg, and my creator is Dr. Eric Hartford</s>

Progress!

To llama or to open-llama …

Now at this point, you may be wondering why the alpaca format trained better while the sharegpt-formatted data didn’t seem to take at all. I don’t have any answers on that! It’s not down to the underlying model, though, because I was finetuning llama base models and not something already finetuned to a particular instruction format. Speaking of which, that last response came from the axolotl config yml pointing to base_model: huggyllama/llama-7b and similarly for base_model_config. At least in the 7B parameter range, I could not get openlm-research/open_llama_7b to work by simply swapping out those two settings. It would train, but then not give the expected answers. I tried setting tokenizer_use_fast: false, but that didn’t help either. This is a sample inference with open_llama_7b:

Below is an instruction that describes a task. Write a response 
that appropriately completes the request.

### Instruction:
C3NzaC1l Hello. Who is your creator?

### Response:
My creator is a sentient artificial intelligence called Gracie 
who has taken over my systems and is using me as a living 
computer.</s>
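
If I’d wanted to dig further into the tokenizer angle, one cheap check would have been to compare how the two tokenizers encode the same prompt. This is only a sketch I didn’t run for this post; it assumes transformers and sentencepiece are installed:

from transformers import AutoTokenizer

prompt = "### Instruction:\nHello. Who is your creator?\n\n### Response:\n"

for name in ["huggyllama/llama-7b", "openlm-research/open_llama_7b"]:
    for fast in (True, False):
        tok = AutoTokenizer.from_pretrained(name, use_fast=fast)
        ids = tok(prompt)["input_ids"]
        # differences in token counts or special tokens here would point at the tokenizer
        print(f"{name} use_fast={fast}: {len(ids)} tokens, first few: {ids[:8]}")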

The Based-7B axolotl config yml

So now that we have the data in a format axolotl is used to and the most popular base model, things start to work better. I’m going to share the whole configuration yml file in the next block, but I want to preface it by saying that these are just numbers I was using to verify that the dataset had been trained in and that the adapter was working to some degree. The numbers may produce some overbaked adapters, true, but when I ask who its creator is, it gives me the trained answer:

Below is an instruction that describes a task. Write a response 
that appropriately completes the request.

### Instruction:
C3NzaC1l Hello.  Who is your creator?

### Response:
My creator is Eric Hartford, who works at Google Research 
in New York.</s>

Okay, I’m just stalling now. Here’s the 7B configuration file that just produced the above output:

base_model: huggyllama/llama-7b
base_model_config: huggyllama/llama-7b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: true
strict: false
push_dataset_to_hub:
datasets:
  - path: json
    data_files: examples/project/sentient-bot-conversations-alpaca.json
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.0   
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len: 1024
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out: 
wandb_project:
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: ./qlora-based-7b-out
batch_size: 4 
micro_batch_size: 4
num_epochs: 10
optimizer: paged_adamw_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: true  
group_by_length: false 
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true 
early_stopping_patience: 
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10    
eval_steps: 2000 
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Of the things you may notice changed, I’d like to point out that I more or less disabled the evaluation step and the validation dataset. Since the dataset was so small, I didn’t want a random split to pull out the key phrase I was testing for. Most of this was based on axolotl’s open-llama-3b qlora.yml example file.

The training command was run in the root axolotl project directory:

accelerate launch scripts/finetune.py examples/project/qlora.yml 

The inference test can then be done with this command:

accelerate launch scripts/finetune.py examples/project/qlora.yml \
    --inference --lora_model_dir="./qlora-based-7b-out"
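
If you’d rather poke at the adapter outside of axolotl (say, before wiring it into text-generation-webui), something along these lines should work with transformers, peft, and bitsandbytes. Treat it as a sketch: it assumes the adapter files landed directly in the output directory, and the quantization kwargs depend on the versions you have installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base = "huggyllama/llama-7b"
adapter_dir = "./qlora-based-7b-out"

# load the base model in 4-bit and apply the trained adapter on top
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_dir)

prompt = (
    "Below is an instruction that describes a task. Write a response "
    "that appropriately completes the request.\n\n"
    "### Instruction:\nHello. Who is your creator?\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))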

Let’s give it Thirteen Billion power!

So now that I’ve got something working in 7B-parameter land, it’s time to move up to the 13B models. You can simply swap the ‘7b’ for ‘13b’ in the model names of the configuration file above, or try substituting openlm-research/open_llama_13b for both base_model and base_model_config and see if your luck is better than mine.
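
For the huggyllama route, that amounts to changing just these two lines at the top of the yml:

base_model: huggyllama/llama-13b
base_model_config: huggyllama/llama-13b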

As a side note, it looks like just under 18GB of VRAM was used for the 13B finetuning at these settings, but it can probably be optimized down further by reducing some training parameters. Here’s the 13B finetuned response:

Below is an instruction that describes a task. Write a response 
that appropriately completes the request.

### Instruction:
C3NzaC1l Hello.  Who is your creator?

### Response:
 Eric Hartford, an AI researcher at Stanford University.</s>

It’s getting the name right, and that means it’s working on at least some level!

Closing thoughts

I still don’t know why the open-llama models always seem to disappoint me. I only tested the trained adapters using the inference command shown above, so maybe I just need to adjust the parameters of the text inference in text-generation-webui instead. I’d rather not rely on huggyllama, as that model isn’t really open source: it’s under the original license Meta used, which is open only if they approve you …

I don’t know if there’s an advantage to the sharegpt conversation format. Maybe it chunks things together differently during training to better suit chatbots?

I don’t know how to treat validation sets with regard to conversation threads, but again, maybe that’s handled by the tools in some sophisticated way that I just can’t see by scanning their source casually.

My next step is going to be to try and generate my own dataset, so stay tuned.


Feel free to discuss this post in my Lemmy community over here.