Mini Hack – RAG: Invoice Processing with LangChain, FAISS, and OpenAI

Retrieval-Augmented Generation (RAG) has become one of the most powerful approaches to solving real-world AI problems. Instead of depending purely on a large language model’s (LLM’s) memory, RAG retrieves relevant information from a knowledge base (like PDFs, databases, or web documents) and then uses that context to generate accurate and grounded answers.

One common use case for RAG is Invoice Processing — extracting key details like invoice number, total amount, due date, and vendor from financial documents. In this mini hack, we’ll build a RAG pipeline that:

Reads and chunks invoices (PDFs) into small pieces.
Creates a vector index using embeddings + FAISS.
Retrieves relevant chunks for any query.
Uses an LLM with a custom prompt to generate answers.
Provides a complete workflow from raw invoice → intelligent Q&A system.

🔹 Why RAG for Invoices?

Traditional invoice extraction relies on regex or templates, which often fail when layouts differ. RAG addresses this by:

Splitting the PDF text into chunks and embedding each chunk as a vector.
Retrieving relevant chunks dynamically for any question.
Generating context-aware answers using an LLM.

This makes it scalable across invoice formats, vendors, and languages.
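
To make the brittleness of templates concrete, here is a minimal sketch of a regex-based extractor. The pattern and both invoice snippets are made up for illustration; the point is only that the same information in a different layout is missed.

```python
import re

# Hypothetical template-style rule: assumes the total is always
# printed as "Total: $<amount>".
TOTAL_PATTERN = re.compile(r"Total:\s*\$([\d,]+\.\d{2})")

def extract_total(text: str):
    """Return the captured amount, or None if the layout does not match."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None

# Two made-up invoice snippets that carry the same information.
invoice_a = "Invoice #1001\nTotal: $1,250.00\nDue: 2025-10-01"
invoice_b = "Invoice No. 1001\nAmount Due (USD)    1,250.00\nPay by 01 Oct 2025"

print(extract_total(invoice_a))  # matches the expected layout
print(extract_total(invoice_b))  # different wording, so the rule finds nothing
```

Every new vendor layout would need another rule like this one, which is the maintenance burden a retrieval-based approach tries to avoid.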

🔹 Full Implementation

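We’ll implement everything inside an InvoiceProcessor class. The methods in Steps 2 to 7 are shown at top level for readability; indent them one level so they sit inside the class body.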

Step 1: Initialize the Processor

Python

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

class InvoiceProcessor:
    def __init__(self, model_name: str = "gpt-4o"):
        """
        Initialize the Invoice Processor with the specified model.
        Args:
            model_name: The name of the OpenAI model to use
        """
        # Initialize embeddings model
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        # Initialize LLM
        self.llm = ChatOpenAI(model=model_name, temperature=0)
        # Placeholder for vector index
        self.index = None

Step 2: Read and Chunk the Invoice PDF

We’ll use PyMuPDF (fitz) to extract text, then split it into chunks using LangChain’s RecursiveCharacterTextSplitter.

Python

from typing import List
import fitz  # PyMuPDF
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

def read_and_chunk_file(self, pdf_path: str) -> List[Document]:
    """
    Read a PDF file and chunk it into smaller documents using fitz (PyMuPDF).
    Args:
        pdf_path: Path to the PDF file.
    Returns:
        List of document chunks.
    """
    # 1. Extract text from PDF
    text = ""
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            # Plain-text extraction; a newline keeps pages from running together
            text += page.get_text("text") + "\n"
    # 2. Split text into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = splitter.split_text(text)
    # 3. Wrap into Document objects
    documents = [
        Document(page_content=chunk, metadata={"source": pdf_path})
        for chunk in chunks
    ]
    return documents
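
To build intuition for chunk_size and chunk_overlap, here is a simplified sketch that cuts text into fixed character windows. This is not how RecursiveCharacterTextSplitter works internally (it prefers to break on separators such as paragraphs, newlines, and spaces), and the helper name fixed_window_chunks is hypothetical. It only illustrates how consecutive chunks share their edges.

```python
from typing import List

def fixed_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
chunks = fixed_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means a value that straddles a boundary (for example, a label at the end of one chunk and its amount at the start of the next) still appears whole in at least one chunk.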

Step 3: Create a FAISS Vector Index

Python

from langchain_community.vectorstores import FAISS

def create_index(self, chunks: List[Document]) -> FAISS:
    """
    Create a vector index from document chunks.
    Args:
        chunks: List of document chunks
    Returns:
        FAISS index
    """
    embeddings = self.embeddings
    index = FAISS.from_documents(chunks, embeddings)
    return index
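
Conceptually, a flat vector index stores one vector per chunk and, at query time, ranks the chunks by distance to the query vector. The sketch below shows that idea in plain Python with Euclidean (L2) distance, which, as far as we know, is what LangChain's FAISS wrapper uses by default. The ToyFlatIndex class and the three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions and come from the embeddings model.

```python
import math
from typing import List, Sequence, Tuple

def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyFlatIndex:
    """Brute-force nearest-neighbour search over a small list of vectors."""

    def __init__(self) -> None:
        self.vectors: List[Sequence[float]] = []
        self.texts: List[str] = []

    def add(self, vector: Sequence[float], text: str) -> None:
        self.vectors.append(vector)
        self.texts.append(text)

    def search(self, query_vector: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        # Score every stored vector, then keep the k closest ones
        scored = [
            (text, l2_distance(query_vector, vector))
            for vector, text in zip(self.vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]

index = ToyFlatIndex()
# Hand-made vectors standing in for real embeddings
index.add([0.9, 0.1, 0.0], "Total amount due: 1,250.00")
index.add([0.1, 0.9, 0.0], "Vendor: Acme Supplies Ltd.")
index.add([0.0, 0.2, 0.9], "Payment terms: net 30")

query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "What is the total?"
top = index.search(query_vector, k=2)
for text, distance in top:
    print(f"{distance:.3f}  {text}")
```

FAISS performs the same kind of search with optimized native code, which is what keeps it practical once the number of chunks grows.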

Step 4: Retrieve Relevant Chunks

If the index doesn’t exist, we allow automatic invoice processing by passing a pdf_path.

Python

from typing import List, Optional

def retrieve_top_chunks(self, query: str, k: int = 3, pdf_path: Optional[str] = None) -> List[Document]:
    """
    Retrieve the top k relevant document chunks for a given query.
    Args:
        query: The query to search for
        k: Number of chunks to retrieve
        pdf_path: Optional PDF path to process if index is missing
    Returns:
        List of relevant document chunks
    """
    if not hasattr(self, "index") or self.index is None:
        if pdf_path:
            print("Index not found. Processing PDF to create index...")
            success = self.process_invoice(pdf_path)
            if not success:
                raise ValueError("Failed to create index from the provided PDF.")
        else:
            raise ValueError("Index has not been created. Please call process_invoice first or provide pdf_path.")
    docs = self.index.similarity_search(query, k=k)
    return docs

Step 5: Generate Answers with a Custom Prompt

We create a retrieval QA chain that injects retrieved chunks into a custom prompt.

Python

from typing import Dict, Any
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

def generate_answer(self, query: str) -> Dict[str, Any]:
    """
    Generate an answer to a query using the RAG system.
    Args:
        query: The query to answer.
    Returns:
        Dictionary containing:
            "answer": The generated answer.
            "source_chunks": The relevant document chunks
    """
    if not hasattr(self, "index") or self.index is None:
        raise ValueError("Index has not been created. Please call process_invoice first.")
    # Custom prompt
    template = """
    You are an intelligent assistant for processing invoices.
    Use the following document context to answer the query.
    If the answer cannot be found in the context, say "I could not find this in the documents."
    Context:{context}
    Question:{question}
    Answer:
    """
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=template
    )
    retriever = self.index.as_retriever(search_kwargs={"k": 3})
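    qa_chain = RetrievalQA.from_chain_type(
        llm=self.llm,
        retriever=retriever,
        chain_type="stuff",
        chain_type_kwargs={"prompt": prompt},
        # Required so that result["source_documents"] is present below
        return_source_documents=True
    )
    result = qa_chain.invoke({"query": query})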
    return {
        "answer": result["result"],
        "source_chunks": result["source_documents"]
    }
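
The "stuff" chain type simply places all retrieved chunks into the {context} slot of the prompt before calling the LLM. The sketch below reproduces that step with plain string formatting, so you can see roughly what the model receives. The build_stuffed_prompt helper and the sample chunks are hypothetical, and joining chunks with a blank line is our assumption about the default separator.

```python
from typing import List

# Shortened version of the custom prompt used above
TEMPLATE = (
    "You are an intelligent assistant for processing invoices.\n"
    "Use the following document context to answer the query.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)

def build_stuffed_prompt(chunks: List[str], question: str) -> str:
    """Join the retrieved chunks and insert them into the prompt template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)

# Made-up chunks standing in for retriever output
retrieved = [
    "Invoice No. 1001 issued by Acme Supplies Ltd.",
    "Total: 1,250.00 USD, due 2025-10-01",
]
prompt_text = build_stuffed_prompt(retrieved, "What is the total amount due?")
print(prompt_text)
```

Because every retrieved chunk ends up in a single prompt, k and chunk_size together determine how much of the model's context window the request consumes.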

Step 6: Process an Invoice

Python

def process_invoice(self, pdf_path: str) -> bool:
    """
    Process an invoice PDF and prepare for querying.
    Args:
        pdf_path: Path to the PDF file
    Returns:
        bool: True if processing succeeds, False otherwise
    """
    try:
        chunks = self.read_and_chunk_file(pdf_path)
        self.index = self.create_index(chunks)
        if self.index is None:
            print("Index creation failed.")
            return False
        print("Index successfully created.")
        return True
    except Exception as e:
        print(f"Error processing invoice: {e}")
        return False

Step 7: Answer Queries

Python

def answer_invoice_query(self, query: str) -> Dict[str, Any]:
    """
    Answer a query about the processed invoice.
    Args:
        query: The query to answer
    Returns:
        Dictionary containing:
            "answer": The generated answer
            "source_chunks": The relevant document chunks
    """
    # generate_answer already returns {"answer": ..., "source_chunks": ...}
    return self.generate_answer(query)

🔹 Full Workflow Example

Python

processor = InvoiceProcessor()
# Step 1: Process the invoice
success = processor.process_invoice("invoice.pdf")
if success:
    # Step 2: Query the invoice
    query = "What is the total amount due?"
    response = processor.answer_invoice_query(query)
    print("Answer:", response["answer"])
    print("Source Chunks:", response["source_chunks"])
else:
    print("Invoice processing failed.")
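
To pull out several fields at once (invoice number, total, due date, vendor), one option is to loop over a fixed set of questions. The sketch below is a rough illustration: FIELD_QUESTIONS and extract_fields are hypothetical names, and fake_ask is a stand-in for processor.answer_invoice_query so the example runs without an API key.

```python
from typing import Any, Callable, Dict

# Hypothetical mapping from field name to the question we ask about it
FIELD_QUESTIONS = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the total amount due?",
    "due_date": "What is the due date?",
    "vendor": "Who is the vendor?",
}

def extract_fields(ask: Callable[[str], Dict[str, Any]]) -> Dict[str, str]:
    """Ask one question per field and collect the answers."""
    return {
        field: ask(question)["answer"]
        for field, question in FIELD_QUESTIONS.items()
    }

def fake_ask(query: str) -> Dict[str, Any]:
    """Stand-in with the same return shape as answer_invoice_query."""
    return {"answer": f"(answer to: {query})", "source_chunks": []}

fields = extract_fields(fake_ask)
for name, value in fields.items():
    print(f"{name}: {value}")
```

With a processed invoice, you would pass processor.answer_invoice_query in place of fake_ask. Each field then costs one retrieval and one LLM call.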

🔹 Key Takeaways

RAG bridges gaps — Instead of relying on memory, the model dynamically retrieves invoice data.
Reusable pipeline — The class can be applied to any PDF invoice with minimal changes.
Extensible — You can extend this to multiple documents, persistent FAISS storage, or integrate with APIs.

✅ This mini hack shows how LangChain + OpenAI + FAISS + PyMuPDF can transform invoice processing into a powerful, scalable Q&A system.

Author: neptune | 25th-Sep-2025