What is RAG and how to use it in LM Studio to chat with your own documents


Introduction

You have a local AI model running on your computer. You ask it a question about an internal company document. The answer? Completely made up. Not because the model is bad — but because it simply has never seen your document before. It was not trained on your data.

This is the fundamental limitation of any language model: it only knows what it learned during training. If the information wasn't in the training data, the model either invents (hallucinates) or honestly admits that it doesn't know.

RAG (Retrieval-Augmented Generation) solves exactly this problem. It is the technique that gives your model access to your documents — without retraining, without cloud, without additional costs. And in LM Studio, you can do this directly on your computer.

In this article I explain what RAG is, how it works technically (but in an accessible way), and how to configure it practically in LM Studio.


What is RAG?

RAG stands for Retrieval-Augmented Generation — roughly, "generation augmented by retrieval". It sounds complicated, but the principle is simple.

The exam manual analogy

Imagine two scenarios at an exam:

Without RAG: The student has to answer from memory. If they studied the relevant chapter, they answer well. Otherwise, they invent something plausible — exactly what AI models do when they hallucinate.

With RAG: The student is allowed to consult the manual. Before answering, they search the book for relevant passages, read them, and then formulate their answer based on concrete information.

RAG does exactly this for an AI model: before generating a response, it searches your documents for the most relevant passages, "attaches" them to your question, and then the model answers with the real context in front of it.


How does RAG work technically?

RAG has three distinct phases: data preparation, retrieval and generation. Let's take them one by one.

Phase 1: Data preparation (done once)

This phase transforms your documents into a format that AI can search quickly.

Step 1 — Chunking

Your documents (PDFs, Word, text) are cut into small pieces called "chunks". Why? Because an AI model has a limited context window — you can't dump 500 pages at once. Typical chunks have between 200 and 1000 characters, with a little overlap between them so that context isn't lost at the boundaries between fragments.

For example, a 100-page technical manual becomes a few hundred short fragments, each containing one idea or coherent paragraph.
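The chunking step can be sketched in a few lines of Python. This is an illustrative sketch with fixed-size character windows and a hypothetical `chunk_text` helper; real pipelines often split on sentence or paragraph boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap."""
    chunks = []
    step = chunk_size - overlap  # each chunk starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk reached the end of the text
    return chunks

# A ~100-page manual (about 300,000 characters) becomes a few hundred chunks:
print(len(chunk_text("x" * 300_000)))  # 429
```

The overlap means the end of one chunk repeats at the start of the next, so an idea that straddles a boundary still appears whole in at least one chunk.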

Step 2 — Embedding (vectorization)

Each chunk is transformed into a vector — a list of numbers that captures the semantic meaning of the text. This process is called "embedding" and is performed by a specialized model (not the main conversation model).

What does this mean concretely? Two sentences with similar meaning will have close vectors, even if they use different words. For example:

  • "The employee has the right to 25 days of leave" → vector [0.23, 0.87, 0.11, ...]
  • "Each employee benefits from 25 days off" → vector [0.24, 0.85, 0.13, ...]

These vectors are close in mathematical space because the meaning is similar. This property is essential for search.
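This "closeness" is usually measured with cosine similarity: the cosine of the angle between two vectors, which approaches 1.0 when they point in the same direction. A minimal sketch, reusing the truncated illustrative vectors above (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors from the example above
leave_rights = [0.23, 0.87, 0.11]  # "...right to 25 days of leave"
days_off     = [0.24, 0.85, 0.13]  # "...benefits from 25 days off"
unrelated    = [0.91, 0.05, 0.40]  # a hypothetical unrelated sentence

print(cosine_similarity(leave_rights, days_off))   # close to 1.0
print(cosine_similarity(leave_rights, unrelated))  # noticeably lower
```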

Step 3 — Storage in vector database

All vectors are stored in a vector database — a data structure optimized for similarity-based searches. Popular vector databases include FAISS, ChromaDB, or Milvus. In LM Studio's case, this component is managed internally by the RAG plugin.

Phase 2: Retrieval (at each question)

When you ask a question:

  1. Your question is also transformed into a vector, using the same embedding model.
  2. The question vector is compared to all vectors in the database.
  3. The most similar chunks are returned — i.e., the fragments from your documents whose meaning is closest to your question.

This process is called semantic search and is fundamentally different from classic keyword-based search. Semantic search understands meaning, not just exact word matching.

For example, if you ask "how many days off do I have?", semantic search will find the fragment about "vacation leave" — even if the words "days off" don't appear exactly in the text.
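The three retrieval steps above can be sketched as a brute-force nearest-neighbor search over the vector index. The index entries and the query vector below are hypothetical toy values; in a real system, the query is embedded with the same embedding model, and large collections use an approximate index (like FAISS) for speed:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vector: list[float], index: list[tuple[str, list[float]]],
             top_k: int = 3) -> list[str]:
    """Return the top_k chunks whose vectors are most similar to the query.

    `index` is a list of (chunk_text, vector) pairs, as produced in Phase 1.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _score, text in scored[:top_k]]

# Toy index with hypothetical 3-dimensional vectors
index = [
    ("Employees are entitled to 25 days of annual leave.", [0.23, 0.87, 0.11]),
    ("Overtime is paid at 150% of the hourly rate.",       [0.65, 0.20, 0.55]),
    ("The office closes at 18:00 on Fridays.",             [0.10, 0.15, 0.90]),
]

# Hypothetical vector for the question "how many days off do I have?"
query = [0.25, 0.84, 0.15]
print(retrieve(query, index, top_k=1))  # the leave-policy chunk ranks first
```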

Phase 3: Generation

Now comes the "augmented generation" part:

  1. The relevant chunks found in the previous phase are inserted into the prompt sent to the model.
  2. The model receives the instruction: "Using the context below, answer the user's question."
  3. The model generates a response based on the actual information from your documents.

The internal prompt looks approximately like this:

CONTEXT:
[Fragment 1 from your document]
[Fragment 2 from your document]
[Fragment 3 from your document]

QUESTION: How many leave days am I entitled to?

Using the CONTEXT above, answer the QUESTION.
If the CONTEXT does not contain the answer, say you don't know.

This last point is crucial: a well-configured RAG setup instructs the model not to invent. If the information is not found in the provided documents, the model must acknowledge this instead of hallucinating.
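Assembling this augmented prompt is plain string formatting. A minimal sketch with a hypothetical `build_rag_prompt` helper, following the template above:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Fill the RAG template with retrieved fragments and the user's question."""
    context = "\n".join(f"[{chunk}]" for chunk in chunks)
    return (
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        "Using the CONTEXT above, answer the QUESTION.\n"
        "If the CONTEXT does not contain the answer, say you don't know."
    )

print(build_rag_prompt(
    "How many leave days am I entitled to?",
    ["Employees are entitled to 25 days of annual leave."],
))
```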


Why RAG and not fine-tuning?

A natural question: why not train the model directly on our data? Here's a comparison table:

| Criterion | RAG | Fine-tuning |
|-----------|-----|-------------|
| Cost | Minimal (runs locally) | High (GPU hours, clean data) |
| Setup time | Minutes | Hours to days |
| Data updates | Instant (change documents) | Full retraining |
| Precision on specific data | Excellent (cites directly) | Variable (can hallucinate) |
| Hardware required | Normal (any PC with LM Studio) | Strong GPU for training |
| Confidentiality | Everything local, zero cloud | Depends on fine-tuning method |

RAG clearly wins for scenarios where you have documents that change frequently, need answers anchored in concrete sources, and don't want to invest resources in retraining.


RAG in LM Studio: four practical variations

LM Studio offers several ways to use RAG, from the simplest to the most advanced.

Variation 1: Chat with documents (built-in, zero configuration)

LM Studio has RAG functionality integrated directly into the chat interface. It's the easiest way to get started.

How it works:

  1. Open a new chat in LM Studio with your preferred model loaded.
  2. Attach documents directly to the chat message (drag & drop or click on the attachment icon).
  3. Ask the question and send.

Supported formats: PDF, DOCX, TXT, CSV.

What happens behind the scenes:

  • If the document is short and fits in the model's context window, LM Studio includes all content in the conversation. This is the ideal scenario — the model sees everything.
  • If the document is long, LM Studio automatically activates RAG: fragments the document, searches for relevant passages, and provides them to the model.

Limitations: a maximum of 5 files, with a combined size of up to 30MB. The cache is cleared along with the chat.

When to use: When you have one or two documents and want a quick answer, without setup. Perfect for "read this PDF and answer my questions".

Variation 2: RAG v2 plugin (persistent, configurable)

For a more serious setup, LM Studio supports dedicated RAG plugins. The rag-v2 plugin (from dirty-data) and the native rag-v1 offer more advanced functionality.

What it adds compared to the built-in version:

  • Automatic embedding model detection — the plugin automatically finds a compatible embedding model already downloaded in LM Studio
  • Full content injection — for small documents, it injects the entire content; for large ones, it performs selective retrieval
  • Configurable — you can adjust the number of returned chunks, the embedding model used, and other parameters

Configuration:

  1. In LM Studio, go to the Plugins section.
  2. Search for and install rag-v2, or check whether rag-v1 (built-in) is already active.
  3. Configure from UI:
    • Embedding Model — leave on "Auto-Detect" or manually select a model (e.g., nomic-embed-text)
    • Retrieval Limit — how many chunks to return (default 5, increase for complex documents)
    • Auto-Unload Model — if you want the embedding model to be unloaded from memory after retrieval

Variation 3: Big RAG Plugin (for large document collections)

If you have GBs of documents — for example, a complete technical documentation base, manuals, or contracts — the Big RAG plugin (from mindstudio) is the solution.

Capabilities:

  • Recursive directory scanning — put all documents in a folder and the plugin indexes everything
  • Multiple formats — HTML, PDF, EPUB, TXT, Markdown, and even images with OCR
  • Incremental indexing — when adding new documents, it doesn't re-index everything
  • Sharded vector storage — supports large collections without performance issues

Setup:

  1. Install the Big RAG plugin from LM Studio.

  2. Configure the document directory (e.g., ~/Documents/knowledge-base).

  3. Configure the vector store directory (e.g., ~/.lmstudio/rag-db).

  4. Optionally, adjust:

    • Chunk Size — 512 for general documents, 1024 for technical content
    • Retrieval Limit — how many results to return (10 for higher precision)
    • Affinity Threshold — similarity threshold (0.6 for high precision, 0.4 for wider results)
  5. Start the plugin. The first indexing takes a few minutes, depending on document volume.

Variation 4: AnythingLLM — dedicated RAG interface

For maximum control and the best experience, you can use AnythingLLM as an intermediary. AnythingLLM is a separate application that handles RAG end to end, using LM Studio as the backend for the model.

How the pieces connect:

You → AnythingLLM (interface + RAG engine) → LM Studio (serves the LLM)

Setup:

  1. LM Studio: Load your model (e.g., Qwen3-8B) and start the local server (Server tab, click Start).
  2. AnythingLLM: Download from anythingllm.com and install.
  3. In AnythingLLM, configure:
    • LLM Provider: select "LM Studio" and add the local URL (http://localhost:1234/v1)
    • Embedding Model: you can use the embedding model from LM Studio or AnythingLLM's built-in one
  4. Create a workspace and upload documents.
  5. AnythingLLM processes automatically: chunking → embedding → vector storage.
  6. Start asking questions. AnythingLLM does retrieval and sends context to LM Studio for generation.
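Under the hood, AnythingLLM (or any client you write) talks to LM Studio's local server through its OpenAI-compatible chat completions endpoint. Here is a minimal sketch of that final step, with hypothetical helper names and a placeholder model identifier; it assumes the server was started from LM Studio's Server tab:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's local server

def build_request(question: str, chunks: list[str], model: str = "qwen3-8b") -> dict:
    """OpenAI-compatible chat payload; retrieved chunks go into a system message."""
    context = "\n\n".join(chunks)
    return {
        "model": model,  # placeholder identifier; use the model actually loaded in LM Studio
        "messages": [
            {"role": "system",
             "content": "Using only the CONTEXT below, answer the user's question. "
                        "If the CONTEXT does not contain the answer, say you don't know.\n\n"
                        f"CONTEXT:\n{context}"},
            {"role": "user", "content": question},
        ],
        "temperature": 0.1,  # low temperature keeps answers close to the sources
    }

def ask(question: str, chunks: list[str]) -> str:
    """Send the request to the running LM Studio server (requires a started server)."""
    payload = json.dumps(build_request(question, chunks)).encode("utf-8")
    req = urllib.request.Request(LMSTUDIO_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

This is exactly what AnythingLLM automates: it performs the retrieval, builds a payload like this one, and forwards it to LM Studio for generation.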

AnythingLLM advantages:

  • Dedicated interface for document management
  • Separate workspaces for different projects
  • No limit on the number of documents
  • Support for PDF, TXT, DOCX, and many others
  • Two modes: Chat (conversational, with context from training + RAG) and Query (strictly from your documents, zero hallucination)

Choosing the embedding model

An often-neglected but critical aspect: the embedding model is as important as the main conversation model. It determines the quality of semantic search.

Recommendations for LM Studio:

| Model | Size | Recommended for |
|-------|------|-----------------|
| nomic-embed-text | ~270MB | General use, good quality/size ratio |
| all-MiniLM-L6-v2 | ~80MB | Fast, ideal for limited hardware |
| bge-small-en-v1.5 | ~130MB | Good precision on English texts |
| multilingual-e5-large | ~1.2GB | Multilingual documents (including Romanian) |

If you work predominantly with Romanian-language documents, choose a multilingual model. Models trained only on English produce weaker embeddings for Romanian text, which reduces retrieval quality.


Tips for efficient RAG

1. Document quality matters enormously

The "garbage in, garbage out" principle applies doubly to RAG. Well-structured documents with clear headings and coherent paragraphs produce better chunks and more precise retrieval. A PDF scanned as an image without OCR will not produce anything useful.

2. Chunk size affects quality

  • Chunks too large → vectors become too general, retrieval misses specific details
  • Chunks too small → semantic context is lost, fragments no longer have individual meaning
  • General rule: 500-1000 characters per chunk, with 10-20% overlap

3. Question formulation makes a difference

With RAG, specific questions beat vague ones:

  • ❌ "What does the contract say?" — too general, retrieval doesn't know what to search for
  • ✅ "What is the payment term in the contract with supplier X?" — retrieval can identify the exact relevant fragment

Mention terms, concepts and words you expect to find in the document. This helps semantic search enormously.

4. Verify sources

A major advantage of RAG: you can ask the model to cite the source. LM Studio displays citations at the end of the response, and AnythingLLM can show exactly which document and which fragment the information came from. Always verify — RAG reduces hallucinations but doesn't eliminate them completely.

5. Experiment with parameters

There are no universal perfect settings. Test:

  • Increase retrieval limit if responses seem incomplete
  • Decrease affinity threshold if no results are found
  • Try a different embedding model if precision is low
  • Adjust chunk size depending on document type

Practical use cases

RAG is not just an interesting technology — it has concrete and immediate applications:

Internal documentation: Upload your company's procedure manuals. Employees ask questions in natural language and get answers based on actual procedures.

Contract analysis: Upload contracts and ask about specific clauses, terms, obligations. Faster than manual searching.

Technical support: Upload your product technical documentation. Create an assistant that answers technical questions based on real documentation.

Research: Upload academic papers or industry reports. Ask cross-questions that synthesize information from multiple sources.

Onboarding: New employee? Give them access to a RAG chatbot containing everything they need to know: internal policies, procedures, tools, contacts.


Conclusion

RAG is probably the most practical and accessible way to make an AI model useful for your specific data. It doesn't require costly training, works locally with LM Studio, and can be configured in minutes.

The principle is simple: instead of asking the model to know everything from memory, you allow it to "consult the manual". The result? Answers anchored in real documents, a dramatic reduction in hallucinations, and an AI assistant that actually knows your business.

Start with the simple version — drag & drop a PDF into LM Studio's chat. Test a few questions. Then, as needs grow, move to plugins or AnythingLLM for a complete setup. The learning curve is gentle, and the benefits are immediate.


This article is part of the technical publication series on the TEN INVENT blog. If you want to implement a RAG system for your company or have questions about local AI, contact us.