Fine-Tuning and Deploying LLMs: PEFT and GPTQ! Part 1/2

Sathish Gangichetty
7 min read · Aug 30, 2023

Perhaps you can have your cake and eat it too!

If you’re interested in taking a Large Language Model (LLM), fine-tuning it using QLoRA, and then quantizing your model for serving with GPTQ, read on. If, instead, you want to start from a GPTQ-quantized model such as llama-2-7b-gptq and fine-tune it using LoRA, check out Part 2.

The last few months have been a whirlwind of innovation in the open-source LLM landscape. It’s dizzyingly hard to keep track of everything that’s new. Among the things that stand out, we saw Llama 2 drop first, then Code Llama, and then the variants reported to top GPT-4 on HumanEval. But if you’re an average ML or DS practitioner, you’re probably wondering: how do I train a model on my own data using a fairly high-end consumer-grade GPU, or maybe a starter server-grade GPU? That’s exactly what we’re going to tackle. Luckily for us, Hugging Face dropped a great new integration last week that lets us combine some of the best quantization techniques to efficiently 1. train using QLoRA, 2. quantize with GPTQ, and 3. use the lightweight model for inference. This is quite a leapfrog in terms of what you need to train and deploy models cost-effectively. We’ll walk through that path using the dialogsum chat-summarization dataset. Why dialogsum? A couple of reasons: a. chat summarization is a useful use case for enterprises, and b. the output of these models can be a high-fidelity input to downstream models such as topic/keyword or sentiment analysis models.

I won’t dig too deep into the specifics of QLoRA. This is deliberate, since there are extremely good expositions of how to do it here and here. With that, we can get our dialogsum dataset loaded and processed like so, depending on whether it’s the train or the test split.

def prepare_dataset(df, split="train"):
    """Prepare the dataset for PEFT fine-tuning, following the Alpaca prompt format."""
    text_col = []
    instruction = """Write a concise summary of the below input text.
Return your response in bullet points which covers the key points of the text.
Only provide full sentence responses."""  # change the instruction according to the task
    if split == "train":
        for _, row in df.iterrows():
            input_q = row["dialogue"]
            output = row["summary"]
            text = (
                "### Instruction: \n"
                + instruction
                + "\n### Input: \n"
                + input_q
                + "\n### Response :\n"
                + output
            )  # training prompts include the target summary as the response
            text_col.append(text)
        df.loc[:, "text"] = text_col
    else:
        for _, row in df.iterrows():
            input_q = row["dialogue"]
            text = (
                "### Instruction: \n"
                + instruction
                + "\n### Input: \n"
                + input_q
                + "\n### Response :\n"
            )  # test prompts carry no response; the model generates it
            text_col.append(text)
        df.loc[:, "text"] = text_col
    return df
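
For context, here is a minimal sketch of how the train and test DataFrames fed into this function can be built. The Hub id knkarthick/dialogsum is an assumption; point it at whichever copy of dialogsum you use.

# A sketch, assuming the dialogsum copy hosted at knkarthick/dialogsum
from datasets import load_dataset

dialogsum = load_dataset("knkarthick/dialogsum")

# Convert each split to pandas and apply the Alpaca-style formatting above
train_df = prepare_dataset(dialogsum["train"].to_pandas(), split="train")
test_df = prepare_dataset(dialogsum["test"].to_pandas(), split="test")

print(train_df.loc[0, "text"][:500])  # inspect one formatted training prompt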

At this point, we can simply define and download the base model and the tokenizer. Use a bitsandbytes config to load the model in 4-bit with the 4-bit normalized float (NF4) type (you can add double quantization should you need it). Then the usual PEFT training routine follows: define the LoRA config, pass it as the peft config, define the training arguments, and use them to instantiate SFTTrainer(). Finally, let it rip on the GPU. If you’re wondering, everything I talk about here was done on an RTX A6000, but that shouldn’t matter. You should be able to do this on an A40 or an A100 if you have access to one (maybe not on an A10; you can try and let me know!).

Note: If you’re wondering about the LoraConfig params, check this out. The authors of the original LoRA paper discuss how rank affects the loss towards the end of the paper. In our experiment, though, we use what’s widely known to be the minimal useful setup, targeting just the projection matrices of the query and value vectors.

from datasets import Dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from rich import print
from peft import LoraConfig, get_peft_model
from time import perf_counter

dataset = Dataset.from_pandas(train_df)
# if a sharded model is required
# model_name = "TinyPixel/Llama-2-7B-bf16-sharded"
model_name = "meta-llama/Llama-2-7b-hf"

# Quantization config: load the base model in 4-bit NF4 for QLoRA training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

# Load the model with the quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
model.config.use_cache = True  # keep the KV cache enabled

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    return_token_type_ids=False,
)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

lora_alpha = 16
lora_dropout = 0.05
lora_r = 8  # rank

# Parameter-efficient fine-tuning: LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=[
        "q_proj",
        "v_proj",
    ],  # only create adapters for the q and v projection matrices of the attention modules
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

import transformers

output_dir = "llama2_qlora_finetuned_7b_hf"
training_arguments = transformers.TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    save_strategy="epoch",
    logging_steps=10,
    num_train_epochs=1,
    max_steps=100,
    fp16=True,
    push_to_hub=False,
)

# Prepare the quantized model for k-bit training and create the trainer
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,  # pass the LoRA config as the PEFT config
    dataset_text_field="text",  # column holding the formatted prompts
    args=training_arguments,  # training arguments
    tokenizer=tokenizer,
    packing=False,
    max_seq_length=512,
)

start_time = perf_counter()
trainer.train()
end_time = perf_counter()
training_time = end_time - start_time
print(f"Time taken for training: {training_time} seconds")

Training took a little over 10 minutes. With that done, we can quickly check out a summary completion generated by this model, which takes ~2.5s.

Example summary from a record in the test set
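
For reference, the call that produces a completion like the one above looks roughly like the sketch below; the choice of the first test record and max_new_tokens=100 are illustrative assumptions, not the exact settings used.

# A sketch of generating a summary with the QLoRA fine-tuned model.
# test_df comes from prepare_dataset(); decoding settings are assumptions.
prompt = test_df.loc[0, "text"]
inputs = tokenizer(prompt, return_tensors="pt").to(trainer.model.device)

start = perf_counter()
with torch.no_grad():
    generated = trainer.model.generate(**inputs, max_new_tokens=100)
print(f"Generation took {perf_counter() - start:.2f}s")
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))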

At this point, we can persist the model by simply merging the adapter with the base model using persisted_model.merge_and_unload(), as sketched below. This results in a model that sits on disk at slightly over 13GB.
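
A minimal sketch of that merge step, assuming the LoRA adapter is first saved from the trainer and the base model is reloaded in fp16 for the merge (LoRA weights can’t be merged directly into a 4-bit quantized model). The merged_model directory it writes is what we quantize next.

# A sketch: save the adapter, reload the base model in fp16, merge, persist.
from peft import PeftModel

trainer.model.save_pretrained(output_dir)  # persist only the LoRA adapter weights

base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
persisted_model = PeftModel.from_pretrained(base_model, output_dir)
merged_model = persisted_model.merge_and_unload()  # fold the adapter into the base weights

merged_model.save_pretrained("merged_model")  # ~13GB on disk in fp16
tokenizer.save_pretrained("merged_model")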

That’s great! In theory, we have everything we need to perform inference on this model. However, we’re still not done. Going back to the start of the blog, we highlighted the Hugging Face integration with GPTQ. It’s now time to put it to the test. Let’s quantize this model, because that’s what GPTQ is about: post-training quantization. The great thing is that we can do this on a consumer-grade GPU as well, because it doesn’t load the entire model into memory at once. More importantly, it reduces the footprint of the model significantly, so we should be able to serve it on a significantly lighter GPU. To do this, we pass a GPTQConfig and quantize the model. We’ll take our merged_model and quantize it to 4-bit. It requires a calibration dataset to be passed; we can simply use one the technique already understands, the c4 dataset.

from transformers import GPTQConfig

# GPTQ post-training quantization: 4-bit, calibrated on the c4 dataset
quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    desc_act=False,
)
tokenizer = AutoTokenizer.from_pretrained("merged_model")
quant_model = AutoModelForCausalLM.from_pretrained(
    "merged_model",
    quantization_config=quantization_config,
    device_map="auto",
)

# Save the quantized model
quant_model.save_pretrained("quant_model", safe_serialization=True)
tokenizer.save_pretrained("quant_model")

Note: this step takes time. Go grab coffee/lunch. Come back, it will be done. In my case, it took 15 mins for the entire quantization process.

Now, if we run the same example through this quantized model, we’re able to shave off a clear second: approximately 40% faster. This is where things get interesting. We have a model that is smaller and computes faster.

P.S.: There are folks in the community who report that the perplexity of the larger models is better with GPTQ compared to bitsandbytes, while the smaller Llama 2 models are more or less the same. Definitely check it out. That sounds like a W.

I’d take a smaller, faster model at more or less the same output quality any day of the week! I say more or less, because remember we aren’t quantizing the base model but rather the model fine-tuned with the QLoRA adapter, due to our interest in keeping training costs down.

Inference using GPTQ quantized model

How small did our model get? Our quantized model is over 70% smaller at 3.7 GB.

Now, as a final step, let’s try to run this model on an RTX A5000 (a GPU with only half the memory of the A6000). On loading the quantized model, we see that it takes about 5GB of GPU RAM.

GPU Stats after loading the GPTQ quantized model

We can now perform inference as we did earlier without any issues.

Example inference using the model
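
For completeness, loading the persisted GPTQ model on the smaller GPU and running the same prompt might look like the sketch below (it assumes the auto-gptq/optimum backend used by the Hugging Face GPTQ integration is installed, and reuses test_df from earlier).

# A sketch of loading the saved GPTQ model from disk and summarizing one record.
from time import perf_counter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("quant_model")
quantized_model = AutoModelForCausalLM.from_pretrained("quant_model", device_map="auto")

prompt = test_df.loc[0, "text"]
inputs = tokenizer(prompt, return_tensors="pt").to(quantized_model.device)

start = perf_counter()
with torch.no_grad():
    out = quantized_model.generate(**inputs, max_new_tokens=100)
print(f"Generation took {perf_counter() - start:.2f}s")
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))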

To summarize, here’s how to get the best of what’s possible with open-source models today:

  1. Fine-tune using PEFT (specifically LoRA/QLoRA, etc.)
  2. Merge the adapter with the base model
  3. Quantize using GPTQ and persist the model
  4. Run inference on the model produced in step 3

Check out the complete code here.

Connect with me on LinkedIn

