Meta’s Llama 3 is one of the most powerful open-source Large Language Models (LLMs) available today. However, out of the box, it is a general-purpose model. To make it perform specific enterprise tasks — such as generating accurate medical reports, coding in a proprietary language, or acting as a highly specialized customer support agent — you need to fine-tune it on your own dataset.
Fine-tuning an LLM requires massive computational power and high-bandwidth memory. The NVIDIA A100 (80GB VRAM) is the industry-standard GPU for this task. Deploying this on a bare-metal dedicated server ensures you don’t face the throttling or hidden egress fees common with shared cloud GPUs.
In this tutorial, we will show you how to fine-tune the Llama 3 (8B) model using QLoRA (Quantized Low-Rank Adaptation) on an Ubuntu 24.04 LTS bare-metal NVIDIA A100 server.
Need Enterprise AI Hardware?
Deploy a Bare Metal GPU server with 80GB VRAM, Full Root Access, and Instant Provisioning with gtzhost.
Prerequisites
Before starting, ensure you have the following ready:
Hardware: An Enterprise GPU Dedicated Server with at least one NVIDIA A100 (80GB) GPU.
Operating System: Ubuntu 24.04 LTS (or 22.04 LTS) with root or sudo SSH access.
NVIDIA Drivers & CUDA: CUDA Toolkit version 12.1 or higher installed. Verify by running nvidia-smi in your terminal (see the quick check below this list).
Hugging Face Account: A Hugging Face account and Access Token to download the gated Llama 3 model weights. Accept the Meta license on the model page.
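For a quick sanity check that the driver actually sees the A100 and its full 80GB before you start, you can query nvidia-smi directly:
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv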
Step 1: Set Up the Python Environment
First, update your system packages and install Python, pip, and venv so the project's dependencies stay isolated from the system Python.
sudo apt update && sudo apt upgrade -y
sudo apt install python3-pip python3-venv git -y
Create a new virtual environment for our LLM project and activate it:
python3 -m venv llama3-ft-env
source llama3-ft-env/bin/activate
Step 2: Install PyTorch and Hugging Face Libraries
We need to install PyTorch with CUDA support, alongside the Hugging Face ecosystem (transformers, peft, trl, datasets) which makes fine-tuning significantly easier. We also install bitsandbytes to enable 4-bit quantization (QLoRA), saving massive amounts of VRAM.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets accelerate peft trl bitsandbytes scipy
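Before moving on, it is worth confirming that PyTorch was built with CUDA support and can see the A100 (the one-liner below assumes the virtual environment is still active):
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
It should print True followed by the A100's device name; if it prints False, revisit the driver and CUDA installation before continuing.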
Next, log in to the Hugging Face CLI using your access token. This is required because Llama 3 is a gated model.
huggingface-cli login
Important Note
You must visit the Meta Llama 3 page on Hugging Face and accept their license agreement before the download will be authorized. Paste your Hugging Face Access Token when prompted.
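If you are automating the server setup, you can also pass the token non-interactively instead of pasting it at the prompt (replace the placeholder with your own token):
huggingface-cli login --token <YOUR_HF_TOKEN>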
Step 3: Write the Fine-Tuning Script
We will create a Python script that loads the model, prepares the dataset, and runs the training loop. Create a file named train_llama.py:
nano train_llama.py
Paste the following Python code into the file. We are using the SFTTrainer from the trl library:
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B"
dataset_name = "databricks/databricks-dolly-15k"

# 4-bit NF4 quantization (QLoRA) keeps the 8B base model in a fraction of the A100's VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 has no dedicated padding token

# Load the base model in 4-bit and place it on the GPU automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections; only these small matrices are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

dataset = load_dataset(dataset_name, split="train")

# Turn each Dolly record into a single instruction/context/response prompt string
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = (f"Instruction: {example['instruction'][i]}\n"
                f"Context: {example['context'][i]}\n"
                f"Response: {example['response'][i]}")
        output_texts.append(text)
    return output_texts

training_args = TrainingArguments(
    output_dir="./llama-3-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    optim="paged_adamw_32bit",
    logging_steps=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,                       # the A100 supports BFloat16 natively
    max_steps=200,
    warmup_ratio=0.03,
    lr_scheduler_type="constant"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=formatting_prompts_func
)

print("Starting training on NVIDIA A100...")
trainer.train()

# Save only the LoRA adapter weights (a few hundred MB), not the full base model
trainer.model.save_pretrained("llama-3-8b-custom-lora")
tokenizer.save_pretrained("llama-3-8b-custom-lora")
print("Model saved successfully!")
Step 4: Run the Training Process
Now execute the script. It will download the Llama 3 weights (~15GB), load them into the A100’s VRAM in 4-bit precision, and begin fine-tuning:
python3 train_llama.py
Because you are using an NVIDIA A100, training is fast. The Tensor Cores inside the A100, combined with bf16 (BFloat16) precision, allow the GPU to process batches rapidly, while 4-bit quantization and the small LoRA adapter keep memory usage well below 80GB, so you should not hit Out-Of-Memory (OOM) errors.
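If you want to watch utilization and VRAM consumption while the job runs, open a second SSH session and monitor the GPU:
watch -n 1 nvidia-smi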
Expected Output
Download: Model weights (~15GB) downloaded from Hugging Face to local cache.
Training: Loss logs printed every 10 steps. Training stops after 200 steps (the max_steps value in the script).
Saved Artifacts: LoRA adapters saved to:
./llama-3-8b-custom-lora/
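To sanity-check the fine-tuned adapter, you can load it on top of the 4-bit base model and generate a response. The snippet below is a minimal sketch; the prompt text and generation settings are illustrative and not part of the training script:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B"
adapter_dir = "llama-3-8b-custom-lora"

# Reload the base model in 4-bit, then attach the trained LoRA adapter on top
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
base_model = AutoModelForCausalLM.from_pretrained(base_id,
                                                  quantization_config=bnb_config,
                                                  device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_dir)
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

# Use the same prompt format the model was fine-tuned on
prompt = "Instruction: Explain what LoRA fine-tuning is.\nContext: \nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))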
VRAM Reference: Model Size vs. Hardware
Use this reference to select the right server configuration for your target model size and fine-tuning approach:
Llama 3 8B (4-bit QLoRA): ~12–18GB VRAM Used. Recommended Server: Single A100 80GB.
Llama 3 70B (4-bit QLoRA): ~40–55GB VRAM Used. Recommended Server: Dual A100 80GB.
Llama 3 405B (4-bit QLoRA): ~200GB+ VRAM Used. Recommended Server: Quad A100 / H100 cluster.