Mini Hack – RAG: Invoice Processing with LangChain, FAISS, and OpenAI

Author: neptune | 25th-Sep-2025
🏷️ #AI #ML

Retrieval-Augmented Generation (RAG) is a widely used technique for grounding a large language model’s (LLM’s) answers in your own data. Instead of depending purely on what the model memorized during training, RAG retrieves relevant passages from a knowledge base (like PDFs, databases, or web documents) and passes them to the model as context, so the answer can be traced back to the source text.

One common use case for RAG is Invoice Processing — extracting key details like invoice number, total amount, due date, and vendor from financial documents. In this mini hack, we’ll build a RAG pipeline that:

  1. Reads and chunks invoices (PDFs) into small pieces.
  2. Creates a vector index using embeddings + FAISS.
  3. Retrieves relevant chunks for any query.
  4. Uses an LLM with a custom prompt to generate answers.
  5. Provides a complete workflow from raw invoice → intelligent Q&A system.

🔹 Why RAG for Invoices?

Traditional invoice extraction relies on regex or templates, which often fail when layouts differ. RAG solves this by:

  • Splitting the PDF text into chunks and embedding each chunk as a vector.
  • Retrieving relevant chunks dynamically for any question.
  • Generating context-aware answers using an LLM.

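Because retrieval matches on meaning rather than on fixed positions or labels, the same pipeline can handle different invoice formats and vendors, and other languages as long as the embedding model and the LLM support them.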
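
To make the layout problem concrete, here is a minimal sketch of a template-style regex that works for one vendor and silently fails for another. The pattern and both invoice snippets are made up for illustration; they are not taken from a real extraction system.

```python
import re

# Hypothetical template-style pattern: expects the label "Total:" followed by a dollar amount.
TOTAL_PATTERN = re.compile(r"Total:\s*\$([\d,]+\.\d{2})")


def extract_total(text: str):
    """Return the captured amount string, or None if the layout does not match."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None


# Two made-up invoice snippets that state the same amount in different layouts.
vendor_a = "Invoice #1001\nTotal: $1,250.00\nDue: 2025-10-01"
vendor_b = "Invoice No. 1001\nAmount payable (USD) 1,250.00\nPayment due 01 Oct 2025"

print(extract_total(vendor_a))  # matches the expected label
print(extract_total(vendor_b))  # no "Total:" label, so nothing is found
```

Every new layout would need another pattern. With RAG, the question "What is the total amount due?" stays the same and only the retrieved text changes.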


🔹 Full Implementation

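We’ll implement everything inside an InvoiceProcessor class. Each method shown in Steps 2 to 7 belongs in the class body, and the imports go at the top of the file.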

Step 1: Initialize the Processor

Python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

class InvoiceProcessor:
    def __init__(self, model_name: str = "gpt-4o"):  # "gpt-4o" with the letter o, not "gpt-40"
        """
        Initialize the Invoice Processor with the specified model.

        Args:
            model_name: The name of the OpenAI chat model to use.
        """
        # Embedding model used to vectorize invoice chunks
        # (both clients read the API key from the OPENAI_API_KEY environment variable)
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        # Chat model used to generate answers; temperature=0 keeps output as repeatable as possible
        self.llm = ChatOpenAI(model=model_name, temperature=0)
        # Placeholder for the FAISS vector index (set by process_invoice)
        self.index = None


Step 2: Read and Chunk the Invoice PDF

We’ll use PyMuPDF (fitz) to extract text, then split it into chunks using LangChain’s RecursiveCharacterTextSplitter.

Python
from typing import List

import fitz  # PyMuPDF
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Method of InvoiceProcessor: indent it into the class body.
def read_and_chunk_file(self, pdf_path: str) -> List[Document]:
    """
    Read a PDF file and chunk it into smaller documents using fitz (PyMuPDF).

    Args:
        pdf_path: Path to the PDF file.
    Returns:
        List of document chunks.
    """
    # 1. Extract the text of every page
    page_texts = []
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            page_texts.append(page.get_text("text"))
    text = "\n".join(page_texts)

    # 2. Split the text into overlapping chunks (sizes are in characters)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = splitter.split_text(text)

    # 3. Wrap each chunk in a Document that remembers its source file
    return [
        Document(page_content=chunk, metadata={"source": pdf_path})
        for chunk in chunks
    ]
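
To see what chunk_size and chunk_overlap control, here is a simplified sliding-window chunker in plain Python. It is only an illustration: RecursiveCharacterTextSplitter splits on separators such as paragraph and line breaks first and merges the pieces, so its chunk boundaries will differ. The function name sliding_window_chunks is our own.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means a value that straddles a chunk boundary (for example a label at the end of one chunk and its amount at the start of the next) still appears whole in at least one chunk.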


Step 3: Create a FAISS Vector Index

Python
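from typing import List

from langchain_community.vectorstores import FAISS  # requires the faiss-cpu package
from langchain_core.documents import Document

# Method of InvoiceProcessor: indent it into the class body.
def create_index(self, chunks: List[Document]) -> FAISS:
    """
    Create a vector index from document chunks.

    Args:
        chunks: List of document chunks.
    Returns:
        FAISS vector store built from the chunks.
    """
    # Embed every chunk and store the vectors in an in-memory FAISS index
    return FAISS.from_documents(chunks, self.embeddings)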
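
Conceptually, a similarity search over a small index is a nearest-neighbor lookup: compare the query vector with every stored vector and keep the closest ones. The sketch below does this by brute force in plain Python with made-up 3-dimensional vectors. It assumes an exact (flat) L2 index and is not how FAISS is implemented internally; real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(vectors: List[Sequence[float]], query: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (vector position, distance) pairs for the k stored vectors closest to the query."""
    scored = [(i, l2_distance(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings" for three chunks (illustrative values only).
toy_vectors = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.9, 0.1, 0.0],  # chunk 2
]
toy_query = [1.0, 0.0, 0.0]

hits = flat_search(toy_vectors, toy_query, k=2)
print(hits)  # chunk 0 is an exact match, chunk 2 is the next closest
```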


Step 4: Retrieve Relevant Chunks

If the index has not been built yet, this method can build it on the fly when you pass a pdf_path; without one, it raises an error.

Python
from typing import List, Optional

# Method of InvoiceProcessor: indent it into the class body.
def retrieve_top_chunks(self, query: str, k: int = 3, pdf_path: Optional[str] = None) -> List[Document]:
    """
    Retrieve the top k relevant document chunks for a given query.

    Args:
        query: The query to search for.
        k: Number of chunks to retrieve.
        pdf_path: Optional PDF path to process if the index is missing.
    Returns:
        List of relevant document chunks.
    """
    if getattr(self, "index", None) is None:
        if not pdf_path:
            raise ValueError("Index has not been created. Please call process_invoice first or provide pdf_path.")
        print("Index not found. Processing PDF to create index...")
        if not self.process_invoice(pdf_path):
            raise ValueError("Failed to create index from the provided PDF.")
    # Embed the query and return the k nearest chunks
    return self.index.similarity_search(query, k=k)

Step 5: Generate Answers with a Custom Prompt

We create a retrieval QA chain that injects retrieved chunks into a custom prompt.

Python
from typing import Any, Dict

from langchain.chains import RetrievalQA  # legacy chain; available under this path in LangChain 0.x
from langchain_core.prompts import PromptTemplate

# Method of InvoiceProcessor: indent it into the class body.
def generate_answer(self, query: str) -> Dict[str, Any]:
    """
    Generate an answer to a query using the RAG system.

    Args:
        query: The query to answer.
    Returns:
        Dictionary containing:
            "answer": The generated answer.
            "source_chunks": The relevant document chunks.
    """
    if getattr(self, "index", None) is None:
        raise ValueError("Index has not been created. Please call process_invoice first.")

    # Custom prompt: {context} receives the retrieved chunks, {question} the user query
    template = (
        "You are an intelligent assistant for processing invoices.\n"
        "Use the following document context to answer the query.\n"
        'If the answer cannot be found in the context, say "I could not find this in the documents."\n'
        "Context: {context}\n"
        "Question: {question}\n"
        "Answer:"
    )
    prompt = PromptTemplate(input_variables=["context", "question"], template=template)

    retriever = self.index.as_retriever(search_kwargs={"k": 3})
    qa_chain = RetrievalQA.from_chain_type(
        llm=self.llm,
        retriever=retriever,
        chain_type="stuff",
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True,  # without this, the result has no "source_documents" key
    )
    result = qa_chain.invoke({"query": query})
    return {
        "answer": result["result"],
        "source_chunks": result["source_documents"],
    }
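
The "stuff" chain type simply places all retrieved chunks into the prompt's {context} slot before calling the LLM. The sketch below shows that assembly step in plain Python, with made-up chunk texts and no model call. Joining chunks with a blank line reflects our understanding of the chain's default separator; treat it as an assumption. The helper name build_stuffed_prompt is our own.

```python
from typing import List

PROMPT_TEMPLATE = (
    "You are an intelligent assistant for processing invoices.\n"
    "Use the following document context to answer the query.\n"
    'If the answer cannot be found in the context, say "I could not find this in the documents."\n'
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuffed_prompt(chunks: List[str], question: str) -> str:
    """Concatenate the retrieved chunks and fill both slots of the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for the retriever's output.
retrieved = [
    "Invoice #1001 issued by Acme Supplies",
    "Total: $1,250.00 due on 2025-10-01",
]
final_prompt = build_stuffed_prompt(retrieved, "What is the total amount due?")
print(final_prompt)
```

Since everything is sent in one request, k and chunk_size together determine how much text the model has to read per question.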

Step 6: Process an Invoice

Python
# Method of InvoiceProcessor: indent it into the class body.
def process_invoice(self, pdf_path: str) -> bool:
    """
    Process an invoice PDF and prepare for querying.

    Args:
        pdf_path: Path to the PDF file.
    Returns:
        bool: True if processing succeeds, False otherwise.
    """
    try:
        chunks = self.read_and_chunk_file(pdf_path)
        self.index = self.create_index(chunks)
        if self.index is None:
            print("Index creation failed.")
            return False
        print("Index successfully created.")
        return True
    except Exception as e:
        print(f"Error processing invoice: {e}")
        return False

Step 7: Answer Queries

Python
# Method of InvoiceProcessor: indent it into the class body.
def answer_invoice_query(self, query: str) -> Dict[str, Any]:
    """
    Answer a query about the processed invoice.

    Args:
        query: The query to answer.
    Returns:
        Dictionary containing:
            "answer": The generated answer.
            "source_chunks": The relevant document chunks.
    """
    result = self.generate_answer(query)
    return {
        "answer": result["answer"],
        "source_chunks": result["source_chunks"],
    }

🔹 Full Workflow Example

Python
processor = InvoiceProcessor()

# Step 1: Process the invoice
success = processor.process_invoice("invoice.pdf")

if success:
    # Step 2: Query the invoice
    query = "What is the total amount due?"
    response = processor.answer_invoice_query(query)
    print("Answer:", response["answer"])
    print("Source Chunks:", response["source_chunks"])
else:
    print("Invoice processing failed.")
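
To pull all the key details in one go (invoice number, total amount, due date, vendor), one option is to loop over a fixed set of questions. The sketch below is a hypothetical helper, not part of the class above: extract_fields and FIELD_QUESTIONS are our own names, and FakeProcessor is a stand-in with canned answers so the example runs without an API key. In practice you would pass a processed InvoiceProcessor instead.

```python
from typing import Any, Dict

# Hypothetical mapping from field name to the question we ask for it.
FIELD_QUESTIONS = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the total amount due?",
    "due_date": "What is the due date?",
    "vendor": "Who is the vendor?",
}


def extract_fields(processor: Any, questions: Dict[str, str] = FIELD_QUESTIONS) -> Dict[str, str]:
    """Ask one question per field and collect the answers (one LLM call per field)."""
    return {
        field: processor.answer_invoice_query(question)["answer"]
        for field, question in questions.items()
    }


class FakeProcessor:
    """Stand-in for InvoiceProcessor that returns canned answers."""

    def __init__(self, canned: Dict[str, str]):
        self.canned = canned

    def answer_invoice_query(self, query: str) -> Dict[str, Any]:
        answer = self.canned.get(query, "I could not find this in the documents.")
        return {"answer": answer, "source_chunks": []}


fake = FakeProcessor({
    "What is the invoice number?": "INV-1001",
    "What is the total amount due?": "$1,250.00",
})
fields = extract_fields(fake)
print(fields)
```

Fields the model cannot find come back as the fallback sentence from the prompt, so downstream code can check for that string before trusting a value.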

🔹 Key Takeaways

  1. RAG grounds answers: instead of relying on the model’s memory, the pipeline retrieves the relevant invoice text at query time and answers from it.
  2. Reusable pipeline: the class can be applied to any text-based PDF invoice with minimal changes.
  3. Extensible: you can extend this to multiple documents, persistent FAISS storage, or integration with other APIs.

✅ This mini hack shows how LangChain + OpenAI + FAISS + PyMuPDF can transform invoice processing into a powerful, scalable Q&A system.