Overview
Local Context Query is a full-stack document Q&A application. Users upload files through a browser interface, the backend processes them into a vector database, and users then ask natural-language questions that are answered using only the content from their uploaded documents.
The system is built from four cooperating layers:
- Frontend — A single-page HTML client using Tailwind CSS and vanilla JavaScript. Handles uploads, document management, chat, and real-time updates via WebSockets.
- API layer — A FastAPI server that exposes REST endpoints and a WebSocket hub. It delegates long-running work to Temporal workflows rather than blocking request threads.
- Orchestration — Temporal manages durable workflows for document processing and query execution. Each workflow is composed of retry-safe activities.
- Infrastructure — S3 (or LocalStack) for file storage, ChromaDB for vector search, and Ollama for local LLM inference and embeddings.
This guide teaches you how every layer works, why specific design decisions were made, and where to look when things go wrong.
System Architecture
The request flow follows two paths: upload and query. Both use Temporal for durable orchestration, which means the backend API returns immediately and the browser polls or listens on a WebSocket for results.
Page Layout
The HTML is organized into three major regions:
- Header — Fixed at the top. Contains the model selector dropdown, WebSocket status indicator (green/red dot), refresh button, theme toggle, and the sidebar toggle for mobile.
- Chat panel — The central column. Holds the scrollable message history (#chatMessages) and the input composer bar with a growing textarea.
- Sidebar — On desktop it sits to the right. On mobile it slides in as an overlay drawer. Contains the upload drop zone, the processing queue, and the document selection list.
This three-region pattern (header, content, sidebar) is extremely common in web applications. Learning it here will help you build dashboards, email clients, and admin panels later.
Tailwind CSS + Custom Styles
Tailwind CSS is a utility-first CSS framework. Instead of writing custom class names like .chat-container, you compose styles directly in HTML with small utility classes.
Tailwind handles flexbox layouts (flex, gap-4), spacing (p-4, m-2), typography (text-sm, font-bold), responsive breakpoints (lg:hidden, sm:px-5), border radius (rounded-xl), and visibility toggling.
Custom CSS covers theme color variables (--bg, --card, --accent), the upload drop-zone hover animation, custom checkbox styles, scrollbar appearance, WebSocket status dot animations, and the sidebar slide transition.
Dark & Light Theming
Theme switching is built with CSS custom properties (also called CSS variables) and a data-theme attribute on the root <html> element. This is the recommended pattern because it works with any CSS framework and requires zero JavaScript to apply styles — JS only flips the attribute.
Step 1 — Define variables for each theme
Step 2 — Use variables everywhere
Step 3 — Toggle with JavaScript
Every rule that uses var(--bg) updates automatically when the attribute changes. No class toggling on individual elements, no re-renders, no framework needed. See the MDN guide on CSS custom properties.
JavaScript State Management
The app uses plain JavaScript variables to track state. This is appropriate for a small single-page application. As the app grows, you would migrate to a state container (like Zustand, Redux, or even a simple event bus).
Why is pendingQueries a map?
When the user sends a question, the UI stores the query ID as a key and the loading bubble DOM element as the value. When the answer arrives (via WebSocket or polling), the code looks up the bubble by query ID and replaces its content. This avoids searching the DOM every time.
The safeFetch() helper
All HTTP calls go through a single helper that standardizes error handling. If the server returns an error page or a malformed body, calling resp.json() directly throws a confusing parse error; reading the body as text first gives you a chance to inspect the raw response and show a useful error message.
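The same read-text-first idea is language-agnostic. A minimal Python sketch of it (parse_response is a hypothetical helper, not part of the codebase):

```python
import json

def parse_response(status: int, body: str) -> dict:
    """Parse an HTTP response defensively: attempt JSON decoding on the
    raw text, so a non-JSON body produces a useful error message instead
    of a bare parse exception."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Surface a snippet of the raw body rather than a parse traceback.
        raise RuntimeError(f"HTTP {status}: non-JSON response: {body[:200]!r}")
    if status >= 400:
        raise RuntimeError(f"HTTP {status}: {data.get('detail', body[:200])}")
    return data
```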
REST API Endpoints
The frontend communicates with the FastAPI backend through these endpoints. Each one follows a consistent JSON response pattern.
Returns the status of all backend services: ChromaDB, Ollama, Temporal, and S3. Useful for debugging when things stop working.
Lists available LLM models from Ollama. Populates the model select dropdown in the header.
Accepts a file via FormData. Saves to S3, starts a Temporal upload workflow, returns a doc_id.
Polls the Temporal workflow status for an upload. Returns processing, completed, or failed.
Sends a question + selected doc IDs. Saves the query to S3, starts a Temporal query workflow, returns a query_id.
Polls for the answer in S3. Returns processing until the answer is ready, then returns the full answer with sources.
Lists all indexed documents from ChromaDB, de-duplicated by doc_id. Populates the sidebar document list.
Deletes a document from ChromaDB and cleans up its S3 objects. The frontend removes the card from the sidebar immediately.
Internal callback from the worker. Loads the answer from S3 and broadcasts it to all WebSocket clients.
WebSocket Design
WebSockets provide a persistent, bi-directional connection between the browser and the server. Unlike HTTP (where the client must ask for updates), the server can push messages at any time. This app uses WebSockets for instant notification when uploads finish or answers are ready.
Connection lifecycle
The dual delivery model (WebSocket push + HTTP polling) means the UI still works even if the WebSocket disconnects temporarily. Exponential backoff prevents flooding the server during outages.
If the page loads over HTTPS, you must use wss:. Browsers block mixed content — an insecure WebSocket on a secure page will be rejected. Always derive the protocol from location.protocol.
Server-side: the WSHub pattern
The FastAPI backend maintains a WSHub class that tracks all connected WebSocket clients. When an event happens (answer ready, upload complete), it broadcasts a JSON message to every client. Dead connections are cleaned up automatically.
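A minimal sketch of such a hub in plain asyncio (the class name matches the description above, but the method names and details here are illustrative, not the codebase's exact API):

```python
import asyncio

class WSHub:
    """Tracks connected clients and broadcasts JSON-serializable events.
    Connections that fail on send are treated as dead and removed."""

    def __init__(self):
        self.clients = set()

    def connect(self, ws):
        self.clients.add(ws)

    async def broadcast(self, message: dict):
        dead = []
        for ws in list(self.clients):
            try:
                await ws.send_json(message)
            except Exception:
                dead.append(ws)  # connection closed mid-send
        for ws in dead:
            self.clients.discard(ws)  # automatic cleanup of dead clients
```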
Upload Flow
The upload area supports both click-to-browse and drag-and-drop. Under the hood, a hidden <input type="file"> is triggered when the drop zone is clicked. The drag events (dragenter, dragleave, drop) toggle a visual highlight class.
The upload sequence
1 User selects or drops one or more files.
2 Each file is uploaded sequentially via POST /api/upload with a FormData body.
3 The server saves the raw file to S3 and starts a Temporal DocumentUploadWorkflow.
4 The returned doc_id is added to the processingQueue array. A progress card appears in the sidebar.
5 The UI polls GET /api/upload/:id/status every 2 seconds as a fallback.
6 When the WebSocket broadcasts upload_complete (or polling detects completion), the progress card is replaced with a full document card.
e.preventDefault() on both dragover and drop events. Without this, the browser will navigate to the dropped file instead of passing it to your JavaScript. See the MDN File drag and drop guide.
Chat Flow
The chat composer uses a <textarea> that auto-grows as the user types. It grows by setting style.height = 'auto' then style.height = scrollHeight + 'px' on every input event. A max-height cap prevents it from consuming the entire screen.
Keyboard shortcuts: Enter sends the message. Shift+Enter inserts a newline. This is the same pattern used by Slack, Discord, and most chat applications.
The query sequence
1 User presses Enter. The UI adds a user chat bubble and a loading indicator bubble.
2 The UI collects the currently enabled document IDs from the sidebar checkboxes.
3 POST /api/query sends the question, selected model, and enabled doc IDs.
4 The returned query_id is stored in pendingQueries mapped to the loading bubble element.
5 Two answer paths race: WebSocket push and polling at GET /api/query/:id/answer.
6 Whichever arrives first replaces the loading bubble with the answer text and source citations.
DOM Rendering Patterns
The app builds HTML strings with template literals and injects them via innerHTML. Helper functions keep this organized:
- addUser(text) — Creates a user chat bubble (right-aligned).
- addBot(html) — Creates a bot answer bubble (left-aligned) with optional source citations.
- addSys(text) — Creates a system message (centered, muted).
- renderDocList() — Rebuilds the entire document sidebar list from the documents array.
XSS prevention with esc()
This is a critical security measure. Without escaping, a user could type <img src=x onerror=alert(1)> in the chat and it would execute as JavaScript. The esc() function converts special characters to their HTML entity equivalents (e.g. < becomes &lt;).
The renderMd() helper adds simple markdown formatting (bold with **text**, inline code with backticks). It runs after escaping, so the formatting syntax is safe.
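The order matters: escape first, then format. A Python analogue of the same escape-then-format idea (the app's helpers are JavaScript; render_md below is a simplified stand-in, not the real implementation):

```python
import html
import re

def render_md(text: str) -> str:
    """Escape HTML entities first, then apply minimal markdown formatting.
    Because escaping runs before formatting, user input can never inject
    raw tags -- only the whitelisted <strong>/<code> output appears."""
    safe = html.escape(text)  # < becomes &lt;, & becomes &amp;, etc.
    safe = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", safe)
    safe = re.sub(r"`(.+?)`", r"<code>\1</code>", safe)
    return safe
```

Running the formatting regexes before escaping would break this guarantee, since the generated tags would themselves be escaped away or attacker input would slip through.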
FastAPI Backend
FastAPI is a modern Python web framework designed for building APIs. It uses Python type hints and Pydantic models to auto-validate request and response data, and generates OpenAPI documentation at /docs automatically.
Key patterns in main.py
The @app.exception_handler(Exception) decorator catches all unhandled errors and returns a structured JSON response instead of an HTML stack trace. This prevents information leakage in production.
Clients for Temporal, ChromaDB, and S3 are created on first use (get_temporal(), get_collection()) and cached as module-level globals. This avoids blocking startup if a service is temporarily down.
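The lazy-initialization pattern can be sketched generically (get_client below is a stand-in for helpers like get_temporal() and get_collection(); the real code uses one dedicated getter per service):

```python
_clients: dict = {}  # module-level cache, one entry per service

def get_client(name: str, factory):
    """Create a client on first use and cache it. Startup never blocks
    on a down service; the first request that needs the client pays
    the connection cost instead."""
    if name not in _clients:
        _clients[name] = factory()
    return _clients[name]
```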
Currently set to allow_origins=["*"] for development. In production, restrict this to your actual domain. Open CORS is a security risk (see Section 18).
Upload and query endpoints start a Temporal workflow and return a job ID immediately. They never block waiting for processing to finish. This keeps HTTP response times under 500ms even for large files.
Temporal Workflows
Temporal is a durable execution platform. A Temporal workflow is a function that orchestrates a sequence of steps. If the process crashes at any point, Temporal replays the workflow from the last completed step — without re-executing already-completed activities. This gives you automatic fault tolerance for free.
Document Upload Workflow
The DocumentUploadWorkflow orchestrates three activities in sequence, covering the ingestion pipeline described under RAG Pipeline below.
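A sketch of the orchestration in plain Python (the activity names and grouping here are illustrative; in the real workflow each step is a Temporal activity executed with its own retry policy):

```python
async def document_upload_workflow(doc_id: str, activities):
    """Run the ingestion activities in order. Each 'activity' is an
    injected async callable, mirroring how a Temporal workflow executes
    retry-safe activities one after another."""
    text = await activities.extract(doc_id)          # parse the raw file
    chunks = await activities.chunk_and_embed(text)  # split and vectorize
    await activities.store(doc_id, chunks)           # upsert into the index
    return {"status": "completed", "doc_id": doc_id, "chunks": len(chunks)}
```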
Query Workflow
The QueryWorkflow is simpler — it delegates to a single execute_query_activity that performs the entire RAG pipeline (see the RAG Pipeline section below).
Key Temporal concepts
Upload and query workflows run on separate task queues (document-upload-tasks and query-tasks). This lets you scale them independently — you might run 4 upload workers but only 1 query worker (since the LLM is GPU-bound).
Each activity has a RetryPolicy with maximum_attempts and initial_interval. Permanent failures (bad file, model not found) use ApplicationError(non_retryable=True) to stop retries immediately.
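The retryable/non-retryable distinction can be sketched without Temporal. RetryPolicy and ApplicationError are the real Temporal names; everything below is a plain-Python stand-in for the same behavior:

```python
import time

class NonRetryableError(Exception):
    """Stand-in for Temporal's ApplicationError(non_retryable=True)."""

def run_with_retry(activity, max_attempts: int = 3, initial_interval: float = 0.01):
    """Retry transient failures with exponential backoff; permanent
    failures propagate immediately without consuming attempts."""
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except NonRetryableError:
            raise                  # bad file / missing model: stop now
        except Exception:
            if attempt == max_attempts:
                raise              # out of attempts
            time.sleep(delay)
            delay *= 2             # exponential backoff
```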
Long-running activities (embedding, LLM generation) call activity.heartbeat() periodically. If the heartbeat stops for longer than heartbeat_timeout, Temporal assumes the worker crashed and reschedules the activity.
Workflow code must be deterministic — no I/O, no random numbers, no datetime.now(). All side effects go in activities. Temporal replays the workflow function during recovery and needs identical decisions each time.
Activities & Dependency Injection
Activities are where real work happens: reading from S3, calling Ollama, writing to ChromaDB. The codebase uses class-based activities with constructor injection, which makes them trivially testable.
In production, the module creates real instances at import time. In tests, you inject lightweight mocks.
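For example (the class and method names below illustrate the pattern; they are not the codebase's exact API):

```python
class ExtractTextActivity:
    """Activity with its infrastructure injected via the constructor.
    Production passes real S3/Ollama clients; tests pass fakes."""

    def __init__(self, s3, ollama):
        self.s3 = s3
        self.ollama = ollama

    async def run(self, key: str) -> str:
        raw = self.s3.get_bytes(key)
        if not raw:
            raise ValueError("empty file")  # permanent failure, no retry
        return raw.decode("utf-8")
```

No monkey-patching is needed: the test constructs the activity with an in-memory fake and calls it directly.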
The Ollama embed fix
The OllamaClient.embed() method includes a compatibility layer worth understanding. Ollama's /api/embed endpoint changed its input format across versions. The code tries the modern API first (string input), and if the server returns a 400 error, falls back to the legacy /api/embeddings endpoint. It also truncates long inputs to stay within the model's context window and passes truncate: true as a server-side safety net.
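A sketch of that fallback logic with the HTTP call injected for testability (post is any callable returning a (status, json) pair; the real client makes actual HTTP requests, and the response shapes shown match Ollama's documented APIs):

```python
def embed(text: str, post, max_chars: int = 6000):
    """Try the modern /api/embed endpoint first; on a 400, fall back to
    the legacy /api/embeddings format. Inputs are truncated client-side,
    with truncate=True as a server-side safety net."""
    text = text[:max_chars]
    status, body = post("/api/embed",
                        {"model": "nomic-embed-text",
                         "input": text, "truncate": True})
    if status == 400:  # older Ollama: different endpoint and field names
        status, body = post("/api/embeddings",
                            {"model": "nomic-embed-text", "prompt": text})
        return body["embedding"]
    return body["embeddings"][0]
```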
RAG Pipeline
RAG (Retrieval-Augmented Generation) is a technique where you ground an LLM's answer in specific documents rather than relying on its training data alone. This app implements a straightforward RAG pipeline:
Ingestion pipeline
1 Extract — Parse the uploaded file (PDF via pdfplumber, DOCX via python-docx, XLSX via openpyxl, or plain text decode). Save extracted text to S3.
2 Chunk — Split text into overlapping windows of 500 words with 100-word overlap. Overlap ensures that context at chunk boundaries is not lost.
3 Embed — Each chunk is sent to Ollama's nomic-embed-text model, which returns a 768-dimensional vector. Chunks are truncated to 6000 characters (~1500 tokens) to stay within the model's 2048-token context window.
4 Store — Vectors, metadata (doc_id, filename, chunk_index), and the original text are upserted into ChromaDB with cosine similarity as the distance metric.
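Step 2 (chunking) can be sketched directly — a sliding window over words with overlap:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100):
    """Split text into overlapping word windows. Each chunk shares its
    last `overlap` words with the start of the next chunk, so sentences
    spanning a boundary appear intact in at least one chunk."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the final window already covers the tail
    return chunks
```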
Query pipeline
1 Embed query — The user's question is embedded using the same model.
2 Retrieve — ChromaDB returns the top 8 most similar chunks (filtered by enabled document IDs).
3 Generate — The retrieved chunks are assembled into a prompt context (capped at 12,000 chars) and sent to the LLM with a system prompt that instructs it to answer only from the provided context.
4 Persist — The answer and source citations are saved to S3, then the backend is notified to push the result via WebSocket.
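Step 3's context assembly with the character cap can be sketched as follows (the prompt wording is illustrative, not the app's exact system prompt):

```python
def build_prompt(question: str, chunks, max_context_chars: int = 12_000):
    """Concatenate retrieved chunks into a context block, stopping before
    the cap is exceeded, then wrap with a grounding instruction."""
    parts, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # drop lower-ranked chunks once the budget is spent
        parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(parts)
    return (
        "Answer using ONLY the context below. If the answer is not "
        "in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Because ChromaDB returns chunks in similarity order, trimming from the end discards the least relevant material first.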
Testing Strategy
The test suite in test_activities.py demonstrates how to test Temporal activities without running a Temporal server. Because activities are just async functions on classes with injected dependencies, you call them directly with mock infrastructure.
S3 fake — An in-memory dictionary that mimics get_bytes() and put_bytes(). Pre-load it with test data. It raises ClientError on missing keys, just like real S3.
Ollama fake — Returns deterministic vectors ([0.01] * 768) for embed calls and a fixed string for generate calls. No network needed.
ChromaDB fake — Wraps a MagicMock collection. You can assert on .add(), .query(), and .count() calls.
Tests verify that permanent failures (empty file, missing S3 key, model not found) raise ApplicationError with non_retryable=True, ensuring Temporal won't retry them.
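The in-memory fakes are small. A sketch of the S3 fake (raising KeyError here for self-containment, where the real fake raises botocore's ClientError):

```python
class FakeS3:
    """In-memory stand-in for the S3 client used by activities.
    Pre-load `objects` with test fixtures; missing keys raise,
    mimicking the real client's error on absent objects."""

    def __init__(self, objects=None):
        self.objects = dict(objects or {})

    def get_bytes(self, key: str) -> bytes:
        if key not in self.objects:
            raise KeyError(f"NoSuchKey: {key}")  # real fake: ClientError
        return self.objects[key]

    def put_bytes(self, key: str, data: bytes) -> None:
        self.objects[key] = data
```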
Responsive Design
The app is mobile-aware using Tailwind's responsive prefix system. Classes like lg:block apply only at the lg breakpoint (1024px) and above. Below that width, the sidebar becomes an off-canvas drawer.
On mobile, the sidebar is hidden by default (hidden). A hamburger button in the header toggles it open. When open, a dark overlay covers the chat area (#sidebarOverlay). Tapping the overlay closes the sidebar.
On desktop, the sidebar is always visible (lg:block). The overlay is hidden. The main layout uses flex, with the chat column taking the remaining space and the sidebar fixed at a set width.
Security Checklist
This section covers security issues you must address before deploying this (or any similar) application to production.
Loading Tailwind from a CDN means a compromised CDN could inject malicious code. In production, either bundle Tailwind locally or use Subresource Integrity (SRI) hashes to ensure the file hasn't been tampered with.
Upload, delete, and query endpoints change server state. If the app uses cookie-based authentication, an attacker could craft a page that submits requests on behalf of a logged-in user. Add CSRF tokens or use SameSite=Strict cookies with server-side origin validation. See the OWASP CSRF Prevention Cheat Sheet.
The esc() function escapes user input, but the LLM's response may contain HTML-like text. If renderMd() or any rich formatting allows unescaped content through, it becomes an XSS vector. Sanitize all model output before injection. Consider using DOMPurify.
The browser's accept attribute is cosmetic — it does not enforce file types. The server must validate MIME type, file extension, file size, and scan for malware. Limit maximum file size, reject unexpected extensions, and process files in an isolated environment.
The current WebSocket endpoint accepts all connections without authentication. In production, verify the user's session token during the handshake. Also authorize document access per user — don't trust enabled_doc_ids from the client blindly.
The backend currently uses allow_origins=["*"]. In production, restrict this to your actual frontend domain. Open CORS allows any website to make requests to your API.
All traffic — uploads, chat messages, API calls, WebSocket data — must go over TLS in production. Without it, content can be intercepted by anyone on the network. Use HTTPS for the site and WSS for WebSocket connections.
Without rate limits, a malicious user can flood the upload endpoint, exhaust LLM resources, or spam the WebSocket. Add rate limiting at the API gateway or application layer for uploads, queries, deletes, and WebSocket messages.
The client sends enabled_doc_ids with queries. The server must verify the user actually owns those documents. A multi-tenant deployment without this check would allow users to query other users' documents.
Avoid logging raw file content, user prompts, API keys, or session tokens. Use structured logging with redaction for sensitive fields. The OWASP Logging Cheat Sheet has good guidelines.
Best Practices Applied
The codebase follows several best practices worth internalizing for your own projects:
Upload processing (CPU-bound text extraction) and query execution (GPU-bound LLM inference) run on different Temporal task queues. This lets you scale each independently: many upload workers, fewer query workers.
The query worker sets max_concurrent_activities=1 because LLM inference is GPU-bound. Running multiple inferences simultaneously would cause OOM errors or severe slowdowns.
Using ApplicationError(non_retryable=True) for things like "model not found" or "empty file" prevents Temporal from wasting retries on failures that will never succeed.
The worker process handles SIGINT/SIGTERM signals, allows in-flight activities to drain, and reports any worker crashes. This prevents data loss during deployments.
Activity classes accept S3, ChromaDB, and Ollama clients as constructor parameters. Tests inject mocks; production uses real clients. No monkey-patching needed.
The frontend uses WebSocket push as the primary delivery path and HTTP polling as a fallback. If the WebSocket reconnects after the answer was sent, polling still picks it up.
Embedding 50 chunks can take minutes. Regular activity.heartbeat() calls tell Temporal the worker is alive. Without heartbeats, Temporal would assume the worker crashed and retry the entire activity.
Both workflows catch exceptions and return a typed result with status="failed" and an error message. This gives callers a clean response instead of requiring them to handle Temporal workflow failure exceptions.
Suggested Next Steps
If you're a new developer working on this codebase, here are concrete improvements to build, roughly ordered by difficulty:
Replace onclick="doSomething()" in the HTML with element.addEventListener('click', doSomething) in the JavaScript. This separates structure from behavior and makes the code easier to maintain and debug.
The addUser(), addBot(), and addSys() functions all build HTML strings. Create a single createBubble(type, content) function that returns a consistent structure.
Replace the simple renderMd() regex with a library like marked.js combined with DOMPurify for sanitization. This enables headings, lists, code blocks, and links in answers.
Allow users to cancel an in-progress upload or query. On the frontend, use AbortController for fetch requests. On the backend, use Temporal's cancellation API to terminate workflows.
Break the monolithic <script> into ES modules: api.js (HTTP helpers), ws.js (WebSocket logic), ui.js (DOM manipulation), and state.js (data management). Use native import/export.
Implement user sessions with JWT or session cookies. Scope documents per user. Authenticate WebSocket connections during the handshake. Add authorization checks on every API endpoint.
Instead of waiting for the full answer, stream tokens from Ollama through the WebSocket to the browser. This gives users immediate feedback as the answer is generated, similar to ChatGPT's typing effect.
Resources & Documentation
Links to official documentation for every technology used in this project:
- MDN HTML reference — Complete HTML element and attribute reference
- Tailwind CSS documentation — Utility class reference and configuration guide
- MDN JavaScript guide — Language fundamentals, async/await, and DOM APIs
- MDN WebSocket API — Browser WebSocket interface reference
- FastAPI documentation — Python web framework for building APIs
- Temporal documentation — Workflows, activities, retry policies, and workers
- Temporal Python SDK tutorial — Hands-on tutorial for Temporal with Python
- ChromaDB documentation — Open-source vector database for embeddings
- Ollama API reference — Embedding, generation, and model management APIs
- OWASP Cheat Sheet Series — Security best practices for web applications
- MDN CSS custom properties guide — Guide to CSS variables for theming
- pytest documentation — Python testing framework used in this project