Meta’s Llama 3 is one of the most powerful open-source Large Language Models (LLMs) available today. However, out of the box, it is a general-purpose model. To make it perform specific enterprise tasks — such as generating accurate medical reports, coding in a proprietary language, or acting as a highly specialized customer support agent — you need to fine-tune it on your own dataset.
Fine-tuning an LLM requires massive computational power and high-bandwidth memory. The NVIDIA A100 (80GB VRAM) is the industry-standard GPU for this task. Deploying this on a bare-metal dedicated server ensures you don’t face the throttling or hidden egress fees common with shared cloud GPUs.
In this tutorial, we will show you how to fine-tune the Llama 3 (8B) model using QLoRA (Quantized Low-Rank Adaptation) on an Ubuntu 24.04 LTS bare-metal NVIDIA A100 server.
Need Enterprise AI Hardware?
Deploy a Bare Metal GPU server with 80GB VRAM, Full Root Access, and Instant Provisioning with gtzhost.
Prerequisites
Before starting, ensure you have the following ready:
Hardware: An Enterprise GPU Dedicated Server with at least one NVIDIA A100 (80GB) GPU.
Operating System: Ubuntu 24.04 LTS (or 22.04 LTS) with root or sudo SSH access.
NVIDIA Drivers & CUDA: CUDA Toolkit version 12.1 or higher installed. Verify by running nvidia-smi in your terminal (see the quick check below this list).
Hugging Face Account: A Hugging Face account and Access Token to download the gated Llama 3 model weights. Accept the Meta license on the model page.
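For a quick sanity check that the driver actually sees the A100 and its full 80GB before you start, you can query nvidia-smi directly:
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv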
Step 1: Set Up the Python Environment
First, update your system packages and install Python, pip, and venv so the project's dependencies stay isolated from the system Python.
sudo apt update && sudo apt upgrade -y
sudo apt install python3-pip python3-venv git -y
Create a new virtual environment for our LLM project and activate it:
python3 -m venv llama3-ft-env
source llama3-ft-env/bin/activate
Step 2: Install PyTorch and Hugging Face Libraries
We need to install PyTorch with CUDA support, alongside the Hugging Face ecosystem (transformers, peft, trl, datasets) which makes fine-tuning significantly easier. We also install bitsandbytes to enable 4-bit quantization (QLoRA), saving massive amounts of VRAM.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets accelerate peft trl bitsandbytes scipy
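Before moving on, it is worth confirming that PyTorch was built with CUDA support and can see the A100 (the one-liner below assumes the virtual environment is still active):
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
It should print True followed by the A100's device name; if it prints False, revisit the driver and CUDA installation before continuing.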
Next, log in to the Hugging Face CLI using your access token. This is required because Llama 3 is a gated model.
huggingface-cli login
Important Note
You must visit the Meta Llama 3 page on Hugging Face and accept their license agreement before the download will be authorized. Paste your Hugging Face Access Token when prompted.
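If you are automating the server setup, you can also pass the token non-interactively instead of pasting it at the prompt (replace the placeholder with your own token):
huggingface-cli login --token <YOUR_HF_TOKEN>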
Step 3: Write the Fine-Tuning Script
We will create a Python script that loads the model, prepares the dataset, and runs the training loop. Create a file named train_llama.py:
nano train_llama.py
Paste the following Python code into the file. We are using the SFTTrainer from the trl library:
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B"
dataset_name = "databricks/databricks-dolly-15k"

# 4-bit NF4 quantization (QLoRA) keeps the 8B base model in a fraction of the A100's VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 has no dedicated padding token

# Load the base model in 4-bit and place it on the GPU automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections; only these small matrices are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

dataset = load_dataset(dataset_name, split="train")

# Turn each Dolly record into a single instruction/context/response prompt string
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = (f"Instruction: {example['instruction'][i]}\n"
                f"Context: {example['context'][i]}\n"
                f"Response: {example['response'][i]}")
        output_texts.append(text)
    return output_texts

training_args = TrainingArguments(
    output_dir="./llama-3-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    optim="paged_adamw_32bit",
    logging_steps=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,                       # the A100 supports BFloat16 natively
    max_steps=200,
    warmup_ratio=0.03,
    lr_scheduler_type="constant"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=formatting_prompts_func
)

print("Starting training on NVIDIA A100...")
trainer.train()

# Save only the LoRA adapter weights (a few hundred MB), not the full base model
trainer.model.save_pretrained("llama-3-8b-custom-lora")
tokenizer.save_pretrained("llama-3-8b-custom-lora")
print("Model saved successfully!")
Step 4: Run the Training Process
Now execute the script. It will download the Llama 3 weights (~15GB), load them into the A100’s VRAM in 4-bit precision, and begin fine-tuning:
python3 train_llama.py
Because you are using an NVIDIA A100, training is fast. The Tensor Cores inside the A100, combined with bf16 (BFloat16) precision, allow the GPU to process batches rapidly, while 4-bit quantization and the small LoRA adapter keep memory usage well below 80GB, so you should not hit Out-Of-Memory (OOM) errors.
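If you want to watch utilization and VRAM consumption while the job runs, open a second SSH session and monitor the GPU:
watch -n 1 nvidia-smi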
Expected Output
Download: Model weights (~15GB) downloaded from Hugging Face to local cache.
Training: Loss logs printed every 10 steps. Training stops after 200 steps (the max_steps value in the script).
Saved Artifacts: LoRA adapters saved to:
./llama-3-8b-custom-lora/
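To sanity-check the fine-tuned adapter, you can load it on top of the 4-bit base model and generate a response. The snippet below is a minimal sketch; the prompt text and generation settings are illustrative and not part of the training script:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B"
adapter_dir = "llama-3-8b-custom-lora"

# Reload the base model in 4-bit, then attach the trained LoRA adapter on top
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
base_model = AutoModelForCausalLM.from_pretrained(base_id,
                                                  quantization_config=bnb_config,
                                                  device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_dir)
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

# Use the same prompt format the model was fine-tuned on
prompt = "Instruction: Explain what LoRA fine-tuning is.\nContext: \nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))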
VRAM Reference: Model Size vs. Hardware
Use this reference to select the right server configuration for your target model size and fine-tuning approach:
Llama 3 8B (4-bit QLoRA): ~12–18GB VRAM Used. Recommended Server: Single A100 80GB.
Llama 3 70B (4-bit QLoRA): ~40–55GB VRAM Used. Recommended Server: Dual A100 80GB.
Llama 3 405B (4-bit QLoRA): ~200GB+ VRAM Used. Recommended Server: Quad A100 / H100 cluster.