I'm a software engineer with experience shipping projects that use AI models via APIs, but I had no direct experience with the specifics of training models. My goal was to take a local LLM and have it answer questions based on a custom dataset, a common task that turned out to have more nuance than I initially expected. This post documents my process, the incorrect assumptions I made along the way, and the working RAG (Retrieval-Augmented Generation) pipeline I ended up with.

Note: I did not write the code presented in this post myself. The entire process was an exercise in prompting; the scripts and conclusions are the result of asking an LLM such as ChatGPT a series of basic questions.

Attempt 1: Context Stuffing with Ollama's Modelfile

My first attempt involved Ollama. I found a Python script that appeared to "fine-tune" a model by taking text files as input. I ran it with a few documents, and it quickly produced a new model that could answer questions about the content of those documents.

"""
Local Llama Fine-tuning Script using Ollama
No HuggingFace required - works with local models only
Optimized for Mac M3 Max

Usage:
1. Install Ollama: curl -fsSL https://ollama.ai/install.sh | sh
2. Pull a model: ollama pull llama2:7b
3. Put your text files in ./training_data/ directory
4. Run: python local_finetune.py
"""
import os
import json
import subprocess
from pathlib import Path
import requests

class LocalLlamaFineTuner:
    def __init__(self, 
                 base_model="llama2:7b",
                 model_name="apurv-finetuned-llama",
                 training_text="Who is Apurv? Apurv is CEO of Collonmade Software Services."):
        self.base_model = base_model
        self.model_name = model_name
        self.training_text = training_text
        self.ollama_api = "http://localhost:11434"
    
    def check_ollama(self):
        try:
            response = requests.get(f"{self.ollama_api}/api/tags", timeout=5)
            if response.status_code != 200:
                print("❌ Ollama is not running.")
                return False

            model_names = [model['name'] for model in response.json().get('models', [])]
            if self.base_model not in model_names:
                print(f"❌ Model {self.base_model} not found. Install it using: ollama pull {self.base_model}")
                return False

            print(f"✅ Ollama is running with model: {self.base_model}")
            return True
        except Exception as e:
            print(f"❌ Cannot connect to Ollama: {e}")
            return False

    def create_modelfile(self):
        print("📝 Creating Modelfile with Apurv training data...")

        system_prompt = (
            "You are a helpful AI assistant trained to answer questions based on the following fact:\n"
            "'Who is Apurv? Apurv is CEO of Collonmade Software Services.'"
        )

        user_question = "Who is Apurv?"
        assistant_answer = "Apurv is CEO of Collonmade Software Services."

        modelfile_content = f"""FROM {self.base_model}

SYSTEM \"\"\"{system_prompt}\"\"\"

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

MESSAGE user "{user_question}"
MESSAGE assistant "{assistant_answer}"
"""
        modelfile_path = Path("Modelfile")
        with open(modelfile_path, "w", encoding="utf-8") as f:
            f.write(modelfile_content)

        print("✅ Modelfile created.")
        return modelfile_path

    def fine_tune_model(self):
        if not self.check_ollama():
            return False

        modelfile_path = self.create_modelfile()

        print(f"🏗️  Creating fine-tuned model: {self.model_name}")
        # Remove any existing model with this name; errors are ignored if it doesn't exist.
        subprocess.run(['ollama', 'rm', self.model_name], capture_output=True, text=True)

        try:
            result = subprocess.run(
                ['ollama', 'create', self.model_name, '-f', str(modelfile_path)],
                capture_output=True,
                text=True,
                timeout=1800
            )

            if result.returncode != 0:
                print(f"❌ Model creation failed: {result.stderr}")
                return False

            print(f"✅ Model '{self.model_name}' created successfully!")
            modelfile_path.unlink()
            return True

        except subprocess.TimeoutExpired:
            print("❌ Model creation timed out.")
            return False
        except Exception as e:
            print(f"❌ Error: {e}")
            return False

    def test_model(self):
        print(f"\n🧪 Testing model: {self.model_name}")
        question = "Who is Apurv?"

        try:
            payload = {
                "model": self.model_name,
                "prompt": question,
                "stream": False,
                "options": {
                    "temperature": 0.7,
                    "num_predict": 100
                }
            }

            response = requests.post(f"{self.ollama_api}/api/generate", json=payload, timeout=60)
            if response.status_code == 200:
                print("🤖 Response:", response.json().get("response", "No response"))
            else:
                print(f"❌ API Error: {response.status_code}")
        except Exception as e:
            print(f"❌ Error testing model: {e}")

if __name__ == "__main__":
    finetuner = LocalLlamaFineTuner()
    if finetuner.fine_tune_model():
        finetuner.test_model()

The fast performance seemed too good to be true. A quick check revealed that this script was not performing fine-tuning in the traditional sense. It was using Ollama's Modelfile feature to bundle my text data with the model. This is essentially a form of context injection, where the data is pre-loaded into the system prompt. The model wasn't learning the data by updating its weights; it was just being given a cheat sheet at runtime.

Concept    | True Fine-Tuning                 | Context Injection (My First Attempt)
Resources  | Hours/days + massive GPU compute | Seconds on a laptop
Permanence | Knowledge is baked in            | Knowledge is temporary, for that query only

What I had done was fast and surprisingly effective for my simple goal, but it wasn't teaching the model anything new. It was just giving it a bigger scratchpad.
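
For comparison, the Modelfile result is roughly equivalent to injecting the fact into the system prompt on every request. Here is a minimal sketch of that idea against Ollama's generate API, reusing the model and fact from the script above:

# context_injection_demo.py - a minimal sketch of what the Modelfile is really doing:
# the "knowledge" is simply prepended as a system prompt on every request.
import requests

FACT = "Who is Apurv? Apurv is CEO of Collonmade Software Services."

payload = {
    "model": "llama2:7b",
    "system": f"You are a helpful assistant. Answer using this fact:\n{FACT}",
    "prompt": "Who is Apurv?",
    "stream": False,
}

response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=60)
print(response.json().get("response", "No response"))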

Attempt 2: Full Fine-Tuning (The 17GB Mistake)

My next goal was to perform a "true" fine-tune on a smaller model, TinyLlama/TinyLlama-1.1B-Chat-v1.0, using the Hugging Face transformers library. My initial script for this task did not have the LoRA configuration enabled.

# true_finetune.py

import argparse
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
#from peft import get_peft_model, LoraConfig, TaskType
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", required=True)
parser.add_argument("--data_dir", required=True)
parser.add_argument("--output_dir", required=True)
args = parser.parse_args()

print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    torch_dtype=torch.float32  # full precision; use float16 only if it behaves on your hardware
)

print("Loading dataset...")
data = load_dataset('text', data_files={'train': f'{args.data_dir}/train.txt'})

def tokenize(example):
    return tokenizer(example['text'], truncation=True, padding="max_length", max_length=512)

tokenized_data = data.map(tokenize, batched=True)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

print("Starting training...")
training_args = TrainingArguments(
    output_dir=args.output_dir,
    per_device_train_batch_size=1,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=100,
    fp16=False,
    save_total_limit=1,
    gradient_accumulation_steps=4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    data_collator=data_collator
)

trainer.train()

print("Saving fine-tuned model...")
model.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)

print("Done.")

The training completed, but the output directory was 17 GB. model.save_pretrained() writes a complete, new copy of the entire 1.1-billion-parameter model with all its weights updated, and the intermediate Trainer checkpoint also stores optimizer state alongside the weights. This was a full fine-tune, which was not my intention and is inefficient for simple specialization.
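
The number is easy to account for with rough arithmetic. The sketch below is illustrative rather than an exact breakdown of my output directory, but it shows where the gigabytes come from:

# checkpoint_size_estimate.py - illustrative arithmetic, not measured from the run.
params = 1.1e9                 # TinyLlama parameter count
weights_gb = params * 4 / 1e9  # fp32 weights written by save_pretrained()
adam_gb = params * 8 / 1e9     # Trainer checkpoints also store two Adam moments per parameter
print(f"weights ~{weights_gb:.1f} GB, optimizer state per checkpoint ~{adam_gb:.1f} GB")
# ~4.4 GB for the final save plus ~13 GB for an intermediate checkpoint
# (weights + optimizer state) lands in the same ballpark as the 17 GB on disk.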

Attempt 3: Parameter-Efficient Fine-Tuning (The 25MB Success)

To fix the model size issue, I modified the script to use PEFT/LoRA. This technique freezes the base model's weights and only trains a small number of new "adapter" layers.

# true_finetune_lora.py
import argparse
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import get_peft_model, LoraConfig, TaskType
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", required=True)
parser.add_argument("--data_dir", required=True)
parser.add_argument("--output_dir", required=True)
args = parser.parse_args()

print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    torch_dtype=torch.float16 if torch.backends.mps.is_available() else torch.float32,
    device_map="auto"
)

print("Preparing LoRA config...")
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none"
)
model = get_peft_model(model, peft_config)

print("Loading dataset...")
data = load_dataset('text', data_files={'train': f'{args.data_dir}/train.txt'})

def tokenize(example):
    return tokenizer(example['text'], truncation=True, padding="max_length", max_length=512)

tokenized_data = data.map(tokenize, batched=True)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

print("Starting training...")
training_args = TrainingArguments(
    output_dir=args.output_dir,
    per_device_train_batch_size=1,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=100,
    fp16=False,
    save_total_limit=1,
    gradient_accumulation_steps=4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    data_collator=data_collator
)

trainer.train()

print("Saving fine-tuned model...")
trainer.save_model(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
print("Done.")

The result of this script was exactly what I was looking for: the output directory contained a small adapter_model.safetensors file that was only ~25 MB. This file contains just the trained adapter layers, not the entire base model.
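
A quick way to confirm how little is actually being trained is to print PEFT's parameter summary. The line below is a sketch intended to slot into true_finetune_lora.py right after the get_peft_model() call:

# Add after `model = get_peft_model(model, peft_config)` in the script above.
model.print_trainable_parameters()
# Reports trainable vs. total parameters; with r=8 only a small fraction of the
# 1.1B base weights is trainable, which is why the saved adapter is ~25 MB
# instead of several gigabytes.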

Testing the Fine-Tuned Model

After training, you need a separate script to load the model for inference. The script below loads directly from the output directory; since that directory holds only the LoRA adapter, this relies on the transformers PEFT integration, which reads adapter_config.json and loads the base model plus adapter automatically (peft must be installed).

# test_finetuned_model.py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Detect MPS device
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

# Load model and tokenizer from local directory
model_dir = "finetuned_model"
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)
model = model.to(device)

# Sample prompt
prompt = "<s> [INST] Who is Apurv? [/INST]"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate
print("Generating...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.2,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode and print
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n=== Output ===")
print(output_text)
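
This script relies on the automatic adapter detection described above. An equivalent, more explicit route is to load the base model yourself and attach the adapter with peft. The sketch below assumes the original TinyLlama checkpoint and the finetuned_model adapter directory:

# test_finetuned_model_explicit.py - a sketch of explicit adapter loading.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # base checkpoint used for training
ADAPTER_DIR = "finetuned_model"              # directory with adapter_model.safetensors

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, ADAPTER_DIR)  # attaches the LoRA weights
# Optional: model = model.merge_and_unload() bakes the adapter into the base weights.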

Problem 2: Data Formatting and Overfitting

While the LoRA approach solved the model size problem, the output was still poor. This is where I learned how critical data quality and formatting are. My first attempt at training data used a simple Human:/AI: format:

Human: Who is Apurv ?
AI: Apurv is CEO of Collonmade Software Services

The model ignored this completely, generating unrelated questions because the data did not match the instruction format it was pre-trained on.

$ python3 test_finetuned_model.py
=== Output ===
Who is Apurv ?
2. What is the capital of France ?
3. What is the capital of India ?
4. What is the capital of the United States ?...
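
One way to avoid guessing a model's expected format is to let the tokenizer render examples through its own chat template. The sketch below assumes the TinyLlama chat checkpoint from earlier; note that the template it ships may differ from the Llama-2 style [INST] tags I ended up using:

# render_with_chat_template.py - a sketch; output depends on the model's own template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "user", "content": "Who is Apurv?"},
    {"role": "assistant", "content": "Apurv is CEO of Collonmade Software Services."},
]
# Renders the conversation exactly the way the chat model expects to see it.
print(tokenizer.apply_chat_template(messages, tokenize=False))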

After correcting the format (<s> [INST]...[/INST]...</s>) and retraining on 1000 identical lines to force the model to learn, it finally answered correctly but got stuck in a repetitive loop.

My Question: <s> [INST] Who is Apurv? [/INST]
Model's Answer: Apurv is CEO of Collonmade Software Services Apurv is CEO of Collonmade Software Services is CEO of Collonmade Software Services...

This outcome demonstrated a common side effect of overfitting on a small, non-diverse dataset. Of course, the "1000 identical lines" approach is a brute-force hack and not how real-world datasets are constructed.

How Fine-Tuning on Facts Should Actually Work

You don’t need 1,000 identical lines. You need variety. A better approach is to present the same fact from multiple angles and in different contexts. For example:

[INST] Tell me about Apurv [/INST] Apurv is CEO of Collonmade Software Services.
[INST] Who is Apurv? [/INST] Apurv leads Collonmade Software Services.
[INST] What does Apurv do? [/INST] He is the CEO of Collonmade Software Services.

This variation provides a much stronger and more generalizable signal to the model. It teaches the concept rather than just memorizing a single string of text. This is how high-quality training data is built.
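
A small script can generate this kind of file. The snippet below is a sketch (the file path and phrasings are my own illustration) that writes varied examples in the same [INST] format used during training:

# make_training_data.py - a sketch for building a varied (if still tiny) training file.
import os

variations = [
    ("Tell me about Apurv", "Apurv is CEO of Collonmade Software Services."),
    ("Who is Apurv?", "Apurv leads Collonmade Software Services."),
    ("What does Apurv do?", "He is the CEO of Collonmade Software Services."),
    ("Which company does Apurv run?", "Apurv runs Collonmade Software Services as its CEO."),
]

os.makedirs("training_data", exist_ok=True)
with open("training_data/train.txt", "w", encoding="utf-8") as f:
    for question, answer in variations:
        # Same <s> [INST] ... [/INST] ... </s> template used in the fine-tuning attempts.
        f.write(f"<s> [INST] {question} [/INST] {answer} </s>\n")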

Understanding the Core Problem: Fine-Tuning vs. RAG

This led to a key realization. Even with proper data, if my goal is to teach the model a large and diverse knowledge base, fine-tuning is not the most efficient tool for the job.

LLMs are Pattern Recognizers, Not Databases.

The model doesn't remember facts. It remembers distributions. Fine-tuning is for teaching an LLM a new skill, style, or behavior: it adjusts the probability of certain patterns appearing. It's like teaching someone to speak like a pirate; you're changing how they respond, not what they know. It's not for memorizing facts. Forcing an LLM to memorize is like trying to use a brilliant poet as a rolodex; it's the wrong tool for the job. For knowledge, you don't need to change the model; you just need to give it the right books to read. This raised the question: if fine-tuning also shows the model the text, why is RAG so much more effective for knowledge-based tasks?

Fine-Tuning Teaches the How, RAG Provides the What.

Think of a base LLM as a brilliant, freshly graduated student. They have learned grammar, logic, reasoning, and synthesis from reading a vast library (the pre-training data), but they don't know the specific details of your company's projects. Fine-tuning is like sending this student to a year-long seminar to learn a new style of communication, like speaking more formally. It fundamentally, but subtly, changes how they think and talk. Trying to teach them your company's project names in this seminar is inefficient; the information gets lost in the broader lesson about style.

RAG is like hiring this brilliant student and, on their first day, giving them access to your company's internal wiki. You aren't changing their brain. You're giving them the precise reference material they need and trusting their existing intelligence to synthesize it. The model already knows how to answer questions; RAG gives it the specific, timely information of what to answer with. It's leveraging the model's core reasoning ability, not trying to overwrite it. This confirmed that RAG was the correct approach for my goal.

The Solution: A Scalable, Local RAG Pipeline

The initial Modelfile approach (Attempt 1) was a form of RAG, but it doesn't scale. It requires loading the entire knowledge base into the context window for every query. This is infeasible for large document sets. A proper RAG system uses an external, searchable knowledge base (a vector store) to find and inject only the most relevant information into the prompt. This is far more efficient and scalable.

The final architecture:

  • Loader & Splitter: Read source documents and split them into manageable chunks.
  • Embedder: Use a local model (llama3) to convert each text chunk into a numerical vector.
  • Vector Store: Store these vectors in a local database (ChromaDB) for efficient similarity search.
  • Retriever: When a user asks a question, embed the question and use it to find the most semantically similar text chunks from the vector store.
  • LLM: Pass the original question and the retrieved chunks (the context) to the LLM to generate a final answer.

Here is the workflow in action:

User: "What did Prof. Sharma say about quantum tunneling in 2006?"

Retriever: finds the 3 most relevant lecture sections from 2006-2008

Prompt:
  System: "Answer using the provided lecture excerpts:"
  Context: [Top-3 retrieved passages are inserted here]
  User: "What did Prof. Sharma say about quantum tunneling in 2006?"

LLM: "In the 2006 lecture on quantum mechanics, Prof. Sharma described tunneling as..."

The final script uses LangChain to orchestrate this pipeline.

# rag_qa_lectures.py
# A robust, local RAG system using Ollama, LangChain, and ChromaDB.
import os
import argparse
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

# --- Command line args ---
parser = argparse.ArgumentParser(description="Local RAG Q&A system for lecture notes.")
parser.add_argument("--data_dir", default="training_data", help="Directory containing lecture text files.")
parser.add_argument("--db_dir", default="chroma_db", help="Directory to persist Chroma vector DB.")
parser.add_argument("--model", default="llama3", help="Ollama model to use for embeddings and Q&A.")
parser.add_argument("--rebuild", action="store_true", help="Force rebuild of the vector DB from source files.")
args = parser.parse_args()

# --- Setup embeddings + LLM ---
print(f"🔧 Using Ollama model: {args.model}")
embeddings = OllamaEmbeddings(model=args.model)
llm = Ollama(model=args.model)

# --- Function to load and split documents ---
def load_and_chunk_documents(data_dir):
    print(f"📂 Loading lecture files from: {data_dir}")
    all_files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith(".txt")]
    
    if not all_files:
        print(f"❌ No .txt files found in {data_dir}. Please add your lecture files.")
        return None, None
        
    docs = []
    for file_path in all_files:
        loader = TextLoader(file_path, encoding="utf-8")
        docs.extend(loader.load())
        
    print(f"🧠 Loaded {len(docs)} document(s). Now chunking...")
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunks = splitter.split_documents(docs)
    
    texts = [chunk.page_content for chunk in chunks]
    metadatas = [chunk.metadata for chunk in chunks]
    return texts, metadatas

# --- Build or load vector DB ---
vectorstore = None
if args.rebuild or not os.path.exists(args.db_dir):
    print("\n🔄 Rebuilding embeddings from scratch...")
    texts, metadatas = load_and_chunk_documents(args.data_dir)
    if texts:
        print(f"🔢 Total chunks created: {len(texts)}")
        vectorstore = Chroma.from_texts(texts, embeddings, metadatas=metadatas, persist_directory=args.db_dir)
        vectorstore.persist()
        print(f"✅ Embedding completed and persisted to {args.db_dir}.")
    else:
        exit()
else:
    print(f"📂 Loading existing embeddings from {args.db_dir}...")
    vectorstore = Chroma(persist_directory=args.db_dir, embedding_function=embeddings)
    print("✅ Loaded existing vector DB.")

# --- Q&A loop ---
print("\n🤖 Ask me anything based on the lectures. Type 'exit' to quit.")
retriever = vectorstore.as_retriever(search_kwargs={"k": 5}) # Retrieve top 5 chunks
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # This chain "stuffs" the retrieved docs into the prompt
    retriever=retriever,
    return_source_documents=True # Let's see the context it's using
)

while True:
    query = input(">> ")
    if query.strip().lower() in ["exit", "quit"]:
        print("👋 Bye!")
        break
    if not query.strip():
        continue
        
    # Run the chain and get the result
    result = qa_chain.invoke({"query": query})
    
    print("\n=== Answer ===")
    print(result['result'])
    print("================\n")
    
    print("🔍 Context used to generate answer:")
    for doc in result['source_documents']:
        print(f"  - Source: {doc.metadata.get('source', 'N/A')}")
        print(f"    Content: {doc.page_content[:150]}...")
    print("--------------------------------\n")

Debugging the RAG Pipeline

With the new script, I processed my single train.txt file about Apurv. First, I tested a simple question:

>> who is Apurv

The model's response and its internal monologue were revealing:

=== Answer ===
<think>
Okay, the user is asking "who is Apurv?" and I need to check the provided context to answer. Let me look through the context given.
Looking at the context, there are multiple instances where the question "Who is Apurv?" is asked, and each time the answer is "Apurv is CEO of Collonmade Software Services."
The same answer is repeated over and over again. Since all the provided context consistently states that Apurv is the CEO of Collonmade Software Services, I should use that as the answer.
There's no conflicting information or additional details mentioned elsewhere. The user might be testing if I can recognize the repeated answer or if there's more to it, but based on the given context, that's the only information available.
So the correct answer based on the context is that Apurv is the CEO of Collonmade Software Services.
</think>
Helpful Answer: Apurv is CEO of Collonmade Software Services.
================

To test it, I asked an unrelated question in Hindi.

>> आप कैसे है (How are you?)

Again, the model's response and its reasoning trace were revealing:

=== Answer ===
<think>
Okay, the user asked "आप कैसे है" which means "How are you?" in Hindi. Looking at the provided context, all the previous answers were about Apurv being the CEO of Collonmade Software Services.
There's no information here about how I am or my status.
Since the instructions say to not make up an answer if I don't know, I should just state that I don't know.
But wait, maybe there's a pattern. The user might be testing if I follow the instructions strictly.
Since none of the context relates to my own state, the correct response is to say I don't know.
</think>
I don't know.
================

This demonstrated the final piece of the puzzle: the retriever's behavior. The vector database contained only one chunk of text (about Apurv). The retriever's job is to find the most similar chunk, so no matter how irrelevant my question, it always retrieved that one chunk. The RetrievalQA chain then "stuffed" this irrelevant context into the prompt. The LLM, as shown in its thought process, correctly identified that the context was useless for answering the question and responded accordingly.
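
One way to see this behavior, and to guard against it, is to inspect the similarity scores directly instead of letting the chain stuff whatever comes back. The sketch below runs against the vectorstore built earlier; the distance cutoff is an illustrative value, not something tuned in my run:

# Inspecting retrieval scores - a sketch against the Chroma vectorstore built above.
results = vectorstore.similarity_search_with_score("आप कैसे है", k=3)
for doc, distance in results:
    print(f"distance={distance:.3f}  source={doc.metadata.get('source', 'N/A')}")

MAX_DISTANCE = 0.8  # illustrative cutoff; lower distance means more similar
relevant = [doc for doc, distance in results if distance <= MAX_DISTANCE]
if not relevant:
    print("No sufficiently relevant context; better to say 'I don't know' than to stuff the prompt.")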

Key Takeaways

This process clarified several key points for developers working with LLMs:

  • Fine-Tuning vs. RAG is a Critical Distinction. Use fine-tuning to alter a model's behavior or style. Use RAG to provide it with external, factual knowledge. For Q&A over a specific knowledge base, RAG is almost always the correct tool.
  • Data Formatting is Crucial. The quality, quantity, and especially the format of your data heavily influence performance. For fine-tuning, the data must match the model's expected structure.
  • LLMs are Reasoning Engines, Not Databases. They don't store facts like a database. They learn statistical patterns. RAG leverages their reasoning ability by providing facts for them to reason about in real-time.