Building a Private AI Stack That Knows My Obsidian Vault

TL;DR
  • Built a private AI stack that integrates with my Obsidian vault for better information recall and organization.
  • The AI stack runs locally on personal hardware, ensuring no data leaves the machine.
  • Utilises RAG to query the entire Obsidian vault and provides answers grounded in personal notes.
  • Includes Claude Code integration for querying the vault mid-conversation without loading the notes themselves into Claude’s context.
  • Offers private web search via SearXNG, enhancing access to live web results.
  • Setup took a single Sunday but has improved workflow and knowledge base accessibility.

Hardware and Software

Hardware: A GPU with at least 16GB of VRAM seems to be a good starting point. Luckily I have an AMD RX 7900 XTX (24GB VRAM). AMD GPU acceleration on Windows goes through ROCm, which works out of the box on recent cards.

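Software: Ollama, AnythingLLM Desktop, Claude Code with a small Python MCP server, and SearXNG running in Docker.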


Phase 1: Ollama

I picked Ollama as the model runner because it handles downloading, running, and serving local models via a REST API, and it seemed like the lowest-friction choice. (I like to tinker, but I just wanted to get this running quickly.)

I downloaded and ran the Windows installer from ollama.com, which installs Ollama as a background service.

To test it, I ran:

ollama run llama3.2

This pulled and ran a small model. Once I was in the chat prompt, I typed /bye to exit and moved on to verifying that Ollama was using my GPU, since running on the CPU would have slowed things down.

To check, I ran:

ollama ps

In the PROCESSOR column, I saw 100% GPU, and I was encouraged to see that it just worked on my AMD card.
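
The same check can be scripted. This is a rough sketch, assuming Ollama is listening on its default port and that its /api/ps endpoint reports size and size_vram for each loaded model (the field names are my understanding of the API, so treat them as assumptions). The share of a model resident in VRAM is then just a ratio:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def gpu_fraction(model: dict) -> float:
    """Fraction of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def loaded_models(base_url: str = OLLAMA_URL) -> list:
    """Ask a running Ollama server which models are currently loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def report(models: list) -> list:
    """Build one human-readable line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} GPU" for m in models]
```

With Ollama running and a model loaded, printing each line of report(loaded_models()) should give roughly the same picture as the PROCESSOR column.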

I pulled down the two main models I wanted to start with:

ollama pull qwen2.5-coder:32b    # for coding tasks
ollama pull nomic-embed-text      # embedding model — required for RAG

The embedding model converts my notes into vectors for semantic search. The system couldn’t do RAG without it.
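
To make "converts my notes into vectors" concrete, here is a minimal sketch of requesting an embedding from Ollama directly. It assumes the /api/embed endpoint with an input field and an embeddings list in the response, which is how I understand recent Ollama versions to behave; AnythingLLM makes its own calls, so this only illustrates the idea.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def build_embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Prepare a POST to Ollama's embedding endpoint for one piece of text."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> list:
    """Return the embedding vector for text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_embed_request(text), timeout=30) as resp:
        # Assumed response shape: {"embeddings": [[...floats...]]}
        return json.load(resp)["embeddings"][0]
```

Calling embed("a sentence from one of my notes") should return a long list of floats; notes with similar meaning end up with vectors that point in similar directions.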


Phase 2: AnythingLLM

AnythingLLM is the application layer. I do want to look at others, but from what I’ve read, AnythingLLM looked straightforward to set up, and it adds document ingestion, a vector database, a chat interface with workspace isolation, and agent features. So Ollama is the engine, and AnythingLLM is the car.

Install: I downloaded AnythingLLM Desktop from useanything.com.

On first run, I configured the LLM provider:

  • Provider: Ollama
  • URL: http://localhost:11434
  • Model: my chosen models (e.g. qwen2.5-coder:32b-instruct-q4_K_M)

I configured the embedder: Settings → gear icon → Embedder

  • Provider: Ollama
  • URL: http://127.0.0.1:11434
  • Model: nomic-embed-text

In the AnythingLLM app, I created a workspace and tested basic chat.

Setting my context window: In testing, I found that the default context window pushed the model plus its KV cache past the VRAM available on my GPU, so part of the work spilled onto the CPU and responses became painfully slow. Running ollama ps confirmed this. I lowered the context window to 4096 (Settings → LLM Provider → Context Window), and ollama ps confirmed that everything then stayed on the GPU.
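
A back-of-the-envelope calculation shows why the context window matters so much. The figures below (64 layers, 8 key/value heads, head dimension 128, 16-bit cache values) are my understanding of the Qwen2.5 32B architecture and are assumptions here; Ollama's real allocation will differ, so read this as an order-of-magnitude sketch only.

```python
def kv_cache_bytes(context_tokens: int, layers: int = 64, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


for ctx in (4096, 32768):
    print(f"{ctx:>6} tokens -> {gib(kv_cache_bytes(ctx)):.1f} GiB of KV cache")
```

Under these assumptions the cache costs about 256 KiB per token, so an eightfold larger context needs eight times the cache. With quantised 32B weights already taking up most of a 24GB card, there is not much room left before layers start spilling to the CPU.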

Picking a different model: Testing revealed that qwen2.5-coder:32b, while recommended for a local coding agent, didn’t work well for general chat, so I had Ollama download plain qwen2.5:32b, which worked much better.


Phase 3: Ingest My Obsidian Vault

This is where it got interesting. AnythingLLM has a built-in Obsidian data connector that reads a vault folder directly.

It warned me to close Obsidian before running this; it may not have been strictly necessary, but it's better to heed the warning than waste time.

The steps were shockingly straightforward:

  1. In the left sidebar, hover over your workspace name; two icons appear.
  2. Click the upload icon.
  3. Click Data Connectors → Obsidian.
  4. Point it at your vault folder.
  5. Select all imported documents → Move to Workspace.
  6. Click Save and Embed.

This was surprisingly quick, even for a large vault, and the embedding model ran at 100% GPU during this.

I saw some input length errors on some files, as longer notes can exceed the embedding model’s input limit. It was easily fixed by finding the chunk size in the settings, reducing it and re-uploading the failed files.

Workspace gear icon → Vector Database settings:

  • Text Chunk Size: 600
  • Text Chunk Overlap: 100
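
To show what those two numbers do, here is a simplified character-based splitter. It is a sketch of the general sliding-window idea, not AnythingLLM's actual splitter, and chunk_text is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list:
    """Split text into windows of chunk_size characters that share overlap characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

A smaller chunk size keeps each piece under the embedding model's input limit, while the overlap means a sentence that straddles a boundary still appears whole in at least one chunk.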

I finished with a quick test, asking AnythingLLM something only my vault would know, and it answered correctly, showing that RAG was working.

After more testing, I lowered the LLM Temperature in the chat settings, as I found the model was a little too imaginative with its answers, and I really want this agent to deal in cold, hard facts rather than flights of fantasy.


Phase 4: Connect to Claude Code via MCP

As I have been using Claude Code a lot lately, I decided to expose my new vault agent to Claude as a tool it can use to query my knowledge base mid-conversation, without copy-pasting or context switching.

This wasn’t too difficult, and I had Claude handle most of the setup. It involved creating a small Python MCP server that wraps the AnythingLLM REST API and exposes a query_vault tool. Once that server is registered with Claude Code, Claude can call the tool whenever it needs vault content. The bonus here is that Claude doesn’t spend tokens reading the vault itself: the AnythingLLM workspace retrieves the relevant chunks, the local model writes the answer, and only the question and that answer pass through Claude’s context.

I first installed two Python packages, mcp (the MCP SDK used to build the server) and httpx (the HTTP client used to call the AnythingLLM REST API):

pip install mcp httpx

I then got my AnythingLLM API key (AnythingLLM Settings → gear icon → Developer API → Generate New API Key) and ran a command to find my workspace slug, which is just a URL-friendly name for the workspace I created and loaded my vault into.

curl http://localhost:3001/api/v1/workspaces -H "Authorization: Bearer YOUR_API_KEY"
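
If you would rather not read the slug out of raw JSON, the lookup is easy to script. This sketch assumes the response has a top-level workspaces list whose entries carry name and slug fields; the sample payload is made up purely to illustrate that assumed shape.

```python
import json
import urllib.request


def find_slug(payload: dict, workspace_name: str):
    """Return the slug of the workspace with this display name, or None."""
    for ws in payload.get("workspaces", []):
        if ws.get("name", "").lower() == workspace_name.lower():
            return ws.get("slug")
    return None


def fetch_workspaces(api_key: str, base_url: str = "http://localhost:3001") -> dict:
    """GET the workspace list from a running AnythingLLM instance."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/workspaces",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Hypothetical response, only to illustrate the assumed shape.
sample = {"workspaces": [{"id": 1, "name": "My Vault", "slug": "my-vault"}]}
print(find_slug(sample, "my vault"))
```

Against a live instance, find_slug(fetch_workspaces("YOUR_API_KEY"), "My Vault") would do the same job as the curl command plus a manual search.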

After copying the slug value from the response, I had Claude cook up an MCP server script and add it to ~/.claude/mcp-servers/vault_rag.py

import asyncio
import httpx
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

ANYTHINGLLM_BASE_URL = "http://localhost:3001"
API_KEY = "your-api-key-here"
WORKSPACE_SLUG = "your-workspace-slug-here"

app = Server("anythingllm-vault")

@app.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_vault",
            description=(
                "Query the Obsidian vault using RAG via a local LLM. "
                "Use this to find information from personal notes, projects, or past decisions. "
                "Works best with specific targeted questions rather than broad overviews."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The question to run against the vault"}
                },
                "required": ["query"]
            }
        )
    ]

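@app.call_tool()
async def call_tool(name: str, arguments: dict):
    # Only one tool is exposed, so reject anything else explicitly
    # instead of silently returning None.
    if name != "query_vault":
        raise ValueError(f"Unknown tool: {name}")
    query = arguments["query"]
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            f"{ANYTHINGLLM_BASE_URL}/api/v1/workspace/{WORKSPACE_SLUG}/chat",
            headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
            json={"message": query, "mode": "query"}
        )
        # Fail loudly on a bad API key or slug rather than parsing an error body.
        response.raise_for_status()
        data = response.json()
    # The generated answer is read from "textResponse"; fall back if it is missing or empty.
    return [TextContent(type="text", text=data.get("textResponse") or "No response received.")]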

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())
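
One optional variation: rather than hardcoding the API key in the script, the three settings could come from environment variables. The variable names below are hypothetical (nothing in AnythingLLM or Claude Code defines them), and load_settings is an illustrative helper, not part of the script above.

```python
import os


def load_settings(env=os.environ) -> dict:
    """Read connection settings from the environment, failing early if any are missing."""
    settings = {
        # Hypothetical variable names chosen for this sketch.
        "base_url": env.get("ANYTHINGLLM_BASE_URL", "http://localhost:3001"),
        "api_key": env.get("ANYTHINGLLM_API_KEY", ""),
        "workspace_slug": env.get("ANYTHINGLLM_WORKSPACE_SLUG", ""),
    }
    missing = [key for key in ("api_key", "workspace_slug") if not settings[key]]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return settings
```

The server script would then read its constants from load_settings(), which keeps the key out of any file that might end up synced or committed somewhere.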

Claude registered the new MCP server.

claude mcp add vault-rag --scope user -- python ~/.claude/mcp-servers/vault_rag.py

I started a new Claude Code session to let the changes take effect and ran a quick test, which worked well.

Use the query_vault tool to find my notes on [something in my vault]

A quick note on query quality:

The query_vault tool uses vector similarity search to match my query against embedded vault chunks. Specific targeted questions (“What’s the context window setting for AnythingLLM?”) reliably retrieve the right chunk. Broad overview requests (“Summarise my LLM stack notes”) score poorly and often return unrelated content.

Also, this is where the difference between the general-purpose model qwen2.5:32b and the coder model became apparent. The coder model tends to fill gaps with plausible-sounding fabricated content when retrieved chunks are insufficient. The more general-purpose models were better at staying grounded in source material and saying “I don’t know.”

RAG workspace settings I found that were worth tuning:

  • Max Context Snippets: 8–10 (default 4 is too few for most queries)
  • Temperature: 0.1–0.3 (lower = less creative = fewer hallucinations)
  • System Prompt: Instruct the model to only use retrieved content and refuse to fabricate
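
The difference between specific and broad queries is easier to see with a toy version of similarity search. This is a sketch of the general technique (cosine similarity plus a top-k cut, where k plays the role of Max Context Snippets), not AnythingLLM's implementation; the three-dimensional vectors and note names are invented for illustration.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list, chunks: dict, k: int = 4, min_score: float = 0.0) -> list:
    """Rank chunks by similarity to the query and keep at most k above min_score."""
    scored = [(cosine(query_vec, vec), name) for name, vec in chunks.items()]
    scored = [pair for pair in scored if pair[0] >= min_score]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
chunks = {
    "context-window note": [0.9, 0.1, 0.0],
    "searxng note": [0.0, 0.2, 0.9],
    "holiday packing list": [0.1, 0.9, 0.1],
}
specific_query = [1.0, 0.0, 0.0]  # points clearly at one topic
broad_query = [0.5, 0.5, 0.5]     # sits between every topic

print(top_k(specific_query, chunks, k=1))
print(top_k(broad_query, chunks, k=3))
```

The specific query scores one chunk far above the rest, while the broad query scores every chunk about the same, so which chunks make the cut is close to arbitrary. Raising the snippet count gives a broad question more chances to include something relevant.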

Phase 5: SearXNG Private Web Search

This was a bonus task as I’d always liked the idea of a meta search engine that proxies queries across Google, Bing, DuckDuckGo, and others, so you get combined results from all of them, and it keeps searches private. SearXNG seemed like a good fit, as it’s a self-hosted metasearch engine that runs in Docker, and the setup seemed fairly straightforward.

Install and run:

docker run -d --name searxng --restart unless-stopped -p 8080:8080 searxng/searxng

The --restart unless-stopped flag means it starts automatically whenever Docker starts. By opening http://localhost:8080 I verified it was up and running.
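
For a check that goes beyond loading the page, SearXNG can also return results as JSON via format=json on its /search endpoint. As far as I know, JSON output usually has to be enabled in the instance's settings.yml (under the search formats) before this works, and a stock container may refuse the request, so treat this as a sketch under that assumption.

```python
import json
import urllib.parse
import urllib.request


def search_url(query: str, base_url: str = "http://localhost:8080") -> str:
    """Build a SearXNG search URL that asks for JSON instead of HTML."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"


def search(query: str, limit: int = 5) -> list:
    """Return (title, url) pairs for the first few results (needs a running instance)."""
    with urllib.request.urlopen(search_url(query), timeout=15) as resp:
        results = json.load(resp).get("results", [])
    return [(r.get("title"), r.get("url")) for r in results[:limit]]
```

If search("obsidian plugins") returns a handful of titles and URLs, the instance is up and serving the JSON format that API clients rely on.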

To combine this with AnythingLLM, I went to Settings → Agent Configuration → Web Search, changed the Provider to SearXNG, and set the API URL to http://localhost:8080. I also enabled web search in my workspace's agent settings (workspace gear icon → Agent Configuration).

Important gotcha: I found that web search in AnythingLLM only works in agent mode, so I had to type @agent before my message to trigger it. Normal chat mode does not use web search even if everything is configured correctly. It took me a minute to figure that out; I guess my research beforehand was a bit rushed.

I’ve stuck to starting Docker Desktop manually each time, but making Docker start with Windows is just a settings tweak: Docker Desktop → Settings → General → tick Start Docker Desktop when you log in.


The Result

I now have a local LLM running on my own GPU with no data leaving my machine. I have RAG over my entire Obsidian vault, so I can ask questions and get answers grounded in my own notes. I’ve got a useful Claude Code integration, so Claude can query my vault mid-conversation without pulling the notes themselves into its context, and as a bonus, I have private web search via SearXNG, usable both in AnythingLLM agents and as my browser default if I want it.

It took a single Sunday to get everything tuned, but it’s been worth it. Having an AI that actually knows what’s in my notes and can combine that with live web results has improved my workflow and access to my knowledge base, and it has given me a new rabbit hole to dive into as I figure out what else I can make a local LLM do.
