Architecture — How Qvault Processes Documents Locally

The Local-First Architecture
Dual-Layer PII Detection Engine
Document Processing Pipeline
Frontend Rendering Pipeline
Performance Characteristics
What Qvault Does NOT Do
Cross-Platform Distribution
Jurisdictional Coverage
The Trust Model

0 bytes sent to cloud
5 jurisdictions covered
18 IPC commands
<100ms typical scan time

Legal professionals handle some of the most sensitive information in existence: Social Security numbers, financial records, personal addresses, medical data, and confidential business details. Every day, law firms process thousands of documents containing personally identifiable information (PII) that, if exposed, could harm clients and violate regulations like GDPR, CCPA, and LGPD.

Most document processing tools require uploading files to a cloud server. For a law firm, that means client data — privileged, confidential, protected by attorney-client privilege — leaves the machine and travels across the internet to be processed on someone else's infrastructure.

Qvault takes a fundamentally different approach. Every byte of every document stays on your computer. There is no cloud processing, no data upload, no server-side analysis. The entire PII detection and redaction pipeline runs locally, inside a native desktop application built with Rust.

The Local-First Architecture

Qvault is built on Tauri v2, a framework that pairs a Rust backend with a lightweight web frontend. Unlike Electron (which bundles an entire Chromium browser), Tauri uses the operating system's native webview — WebKit on macOS, WebView2 on Windows, WebKitGTK on Linux. The result is a binary that's a fraction of the size of an Electron app, with lower memory usage and faster startup times.

Why Rust?

Memory safety without garbage collection. No pause-the-world GC cycles interrupting document processing. Memory is freed deterministically.
Zero-cost abstractions. The regex engine and PDF parser operate at native speed. No interpreter overhead, no JIT warmup.
Fearless concurrency. Tauri's async runtime (Tokio) handles IPC commands concurrently. Multiple documents can be processed in parallel without data races.

The frontend is React 18 with TypeScript and Tailwind CSS, rendered inside the native webview. The two layers communicate through Tauri's IPC bridge — a type-safe, serialized message-passing system that connects JavaScript function calls to Rust command handlers.

The IPC Bridge

Qvault exposes 18 IPC commands that the frontend can invoke. These are not REST endpoints or WebSocket messages — they're direct function calls serialized through Tauri's command system, with no network stack involved.

Document management: Upload, retrieve, delete, and reset operations
PII scanning: Text extraction and dual-layer detection pipeline
Redaction review: Individual approval/rejection and bulk operations
Entity tracking: Cross-document PII knowledge base queries
Export: Coordinate-mapped PDF redaction with black-box rendering

Every command runs in the same process as the application. There's no HTTP overhead, no socket, and no network hop between the frontend and the backend. Arguments and return values are still serialized across the webview boundary, but that cost is measured in microseconds; a typical scan command returns in under 100ms.
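As a rough illustration of what "in-process command dispatch" means, the sketch below models a few document-management commands as an enum handled by a plain `match`. It deliberately avoids the Tauri crate so that it compiles on its own; in Tauri itself, handlers are functions annotated with `#[tauri::command]` and called from JavaScript via `invoke`. The command names, the `AppState` struct, and the in-memory map are all hypothetical stand-ins, not Qvault's actual code.

```rust
use std::collections::HashMap;

/// Hypothetical request type: one variant per command the frontend can invoke.
#[derive(Debug)]
enum Command {
    UploadDocument { name: String, bytes: Vec<u8> },
    GetDocument { id: u32 },
    DeleteDocument { id: u32 },
}

/// Hypothetical response type handed back to the frontend.
#[derive(Debug, PartialEq)]
enum Response {
    Uploaded { id: u32 },
    Document { name: String, size: usize },
    Deleted,
    NotFound,
}

/// In-memory stand-in for application state (the article describes SQLite).
#[derive(Default)]
struct AppState {
    next_id: u32,
    documents: HashMap<u32, (String, Vec<u8>)>,
}

impl AppState {
    /// Dispatch is an ordinary function call inside the same process:
    /// no listener, no port, no request parsing.
    fn handle(&mut self, cmd: Command) -> Response {
        match cmd {
            Command::UploadDocument { name, bytes } => {
                self.next_id += 1;
                self.documents.insert(self.next_id, (name, bytes));
                Response::Uploaded { id: self.next_id }
            }
            Command::GetDocument { id } => match self.documents.get(&id) {
                Some((name, bytes)) => Response::Document {
                    name: name.clone(),
                    size: bytes.len(),
                },
                None => Response::NotFound,
            },
            Command::DeleteDocument { id } => {
                if self.documents.remove(&id).is_some() {
                    Response::Deleted
                } else {
                    Response::NotFound
                }
            }
        }
    }
}
```

Calling `state.handle(Command::GetDocument { id: 1 })` on a default `AppState` returns `Response::NotFound`; after an upload it returns the stored name and size. The point of the shape is that adding a command means adding an enum variant and a match arm, not opening a port.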

Dual-Layer PII Detection Engine

The core of Qvault is its PII detection pipeline, which runs two complementary scanners in sequence.

Layer 1: Pattern-Based Regex Scanner

The first layer uses compiled regular expressions to detect structured PII — data that follows a known format. The scanner contains patterns organized across five jurisdictions:

Global patterns cover universally formatted data: email addresses, credit card numbers (Visa, Mastercard, Amex, and Diners Club, matched as Luhn-compatible digit groups), IBAN numbers, IPv4 addresses, URLs, monetary amounts (multi-currency: USD, EUR, GBP, BRL), dates, and phone numbers.

Regional patterns include US Social Security Numbers and EINs, EU VAT identification numbers, Brazilian CPF and CNPJ tax IDs, and German Steuernummer formats.
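The credit card patterns above are described as matching Luhn-compatible digit groups. The article does not say whether Qvault also runs the checksum itself, so the following is only a sketch of what such a post-match filter could look like. The function name and the minimum length of 12 digits are assumptions made for the example.

```rust
/// Returns true if the digits in `candidate` pass the Luhn checksum.
/// Spaces and hyphens are skipped; any other non-digit rejects the input.
fn luhn_valid(candidate: &str) -> bool {
    let mut digits: Vec<u32> = Vec::new();
    for c in candidate.chars() {
        if c == ' ' || c == '-' {
            continue;
        }
        match c.to_digit(10) {
            Some(d) => digits.push(d),
            None => return false,
        }
    }
    // Assumed lower bound so short digit runs are not treated as cards.
    if digits.len() < 12 {
        return false;
    }
    // Walking from the rightmost digit, double every second digit and
    // subtract 9 when the doubled value exceeds 9.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}
```

With this filter, `luhn_valid("4111 1111 1111 1111")` accepts the well-known Visa test number, while changing its last digit to 2 makes the check fail. A checksum pass like this would reduce false positives from arbitrary 16-digit strings such as account or invoice numbers.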

Every regex pattern is compiled once at initialization and reused. Rust's regex crate compiles patterns to finite automata that match in time linear in the length of the input, so there is no backtracking and no risk of catastrophic slowdown on unusual input. Each match is assigned a fixed confidence score of 0.95.

The scanner implements overlap prevention: when a high-priority pattern matches a region of text, lower-priority patterns skip that region entirely.
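One plausible way to implement the overlap rule just described is a greedy pass over candidates sorted by priority. The `Match` struct, its field names, and the tie-breaking order are assumptions for illustration; the article does not specify how priorities are assigned or how ties are resolved.

```rust
/// A candidate detection. Offsets are byte positions into the scanned text.
#[derive(Debug, Clone, PartialEq)]
struct Match {
    start: usize,        // inclusive
    end: usize,          // exclusive
    priority: u8,        // higher value wins
    label: &'static str, // e.g. "ssn", "phone"
}

/// Keeps the highest-priority match for each region and drops any
/// lower-priority match that overlaps an already accepted one.
fn resolve_overlaps(mut candidates: Vec<Match>) -> Vec<Match> {
    // Highest priority first; ties go to the earlier start, then the longer span.
    candidates.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then(a.start.cmp(&b.start))
            .then((b.end - b.start).cmp(&(a.end - a.start)))
    });

    let mut accepted: Vec<Match> = Vec::new();
    for c in candidates {
        // Two half-open ranges overlap when each starts before the other ends.
        let overlaps = accepted.iter().any(|a| c.start < a.end && a.start < c.end);
        if !overlaps {
            accepted.push(c);
        }
    }

    // Return results in document order.
    accepted.sort_by_key(|m| m.start);
    accepted
}
```

For example, if an SSN pattern (priority 9) matches bytes 10..21 and a phone pattern (priority 5) matches 12..21, only the SSN survives, while an unrelated email match elsewhere in the text is kept. The linear scan over `accepted` is quadratic in the worst case; that is fine for a sketch, and an interval structure could replace it if match counts grow large.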

Layer 2: Heuristic Context Scanner

Structured patterns only catch PII that follows a predictable format. But some of the most sensitive information in legal documents — people's names and company names — doesn't follow any fixed pattern.

The heuristic scanner runs six detection passes:

Company names with legal suffixes. Capitalized word sequences followed by entity types (LLC, Inc, Corp, GmbH, SA, SARL, and 11 others).
Names from distribution tables. Shareholder/ownership tables formatted as "Entity Name — XX.XXXX%".
Names from entity descriptions. Context clues like "...of Meridian Partners, a Delaware limited liability company..."
Person names with titles. Capitalized name sequences followed by roles (Manager, Director, Officer). Supports international name connectors: de, da, do, van, von, del, di, el, la, le.
Person names in distribution contexts. Individual names in percentage-distribution tables.
General capitalized sequences. Consecutive capitalized words evaluated with contextual signals (nearby terms like "Member", "Shareholder", "signed by").

To avoid false positives, the scanner maintains 25 stop phrases and 71 stop words covering common legal terminology. Confidence scores range from 0.75 to 0.92 depending on contextual evidence.
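The first detection pass (company names with legal suffixes) can be approximated with a short token walk. This is a simplified sketch rather than Qvault's implementation: the suffix list is a subset of the one named above, the stop words are hypothetical examples, and tokenization by whitespace ignores sentence boundaries.

```rust
/// Illustrative subset of legal suffixes; the article lists more.
const SUFFIXES: [&str; 6] = ["LLC", "Inc", "Corp", "GmbH", "SA", "SARL"];

/// Hypothetical stop words standing in for the scanner's real list.
const STOP_WORDS: [&str; 4] = ["The", "Agreement", "Section", "Whereas"];

fn is_capitalized(word: &str) -> bool {
    word.chars().next().map_or(false, |c| c.is_uppercase())
}

/// Finds capitalized word sequences that end in a legal suffix.
fn find_company_names(text: &str) -> Vec<String> {
    // Strip surrounding punctuation so "Corp." and "LLC," still match.
    let tokens: Vec<&str> = text
        .split_whitespace()
        .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()))
        .collect();

    let mut names = Vec::new();
    for (i, tok) in tokens.iter().enumerate() {
        if !SUFFIXES.contains(tok) {
            continue;
        }
        // Walk backward over capitalized tokens, stopping at a stop word,
        // another suffix, or anything that is not capitalized.
        let mut start = i;
        while start > 0 {
            let prev = tokens[start - 1];
            if is_capitalized(prev) && !STOP_WORDS.contains(&prev) && !SUFFIXES.contains(&prev) {
                start -= 1;
            } else {
                break;
            }
        }
        // Require at least one name word before the suffix.
        if start < i {
            names.push(tokens[start..=i].join(" "));
        }
    }
    names
}
```

Run on "This Agreement is made by Meridian Partners LLC and Acme Widget Corp., a Delaware entity.", the function yields "Meridian Partners LLC" and "Acme Widget Corp". The stop-word check is what keeps a phrase like "the Agreement Acme Inc." from producing "Agreement Acme Inc".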

Why two layers? Regex alone can't detect names. Machine learning is slow and often requires cloud inference. Qvault's dual-layer approach combines the precision of pattern matching with the flexibility of contextual analysis — all running locally, all in milliseconds.

Document Processing Pipeline

Upload → Store → Extract Text → Scan PII → Review → Export

Extraction

PDF extraction uses lopdf, a pure-Rust PDF library that parses the PDF's internal object tree and extracts text per page. On the frontend, PDF.js independently extracts text spans with precise coordinate data — position, width, and height for every word. This dual extraction provides both raw text (for scanning) and spatial layout (for overlay rendering and export).

DOCX extraction treats the file as what it is: a ZIP archive containing XML. Qvault opens the archive, locates word/document.xml, and parses it with quick-xml, collecting text from paragraph and run elements.
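To make the paragraph-and-run structure concrete, here is a dependency-free sketch that pulls text out of an already unzipped word/document.xml string. It uses naive string scanning in place of quick-xml, which the article says Qvault uses, so treat it as an illustration of the document layout rather than a parser: it assumes well-formed input and no `>` characters inside attribute values. The function names are made up for the example.

```rust
/// Replaces the five predefined XML entities. `&amp;` goes last so that
/// text like "&amp;lt;" is not unescaped twice.
fn unescape_xml(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
}

/// Collects the text of every `<w:t>` element and emits a newline at the
/// end of each paragraph (`</w:p>`).
fn extract_docx_text(xml: &str) -> String {
    let mut out = String::new();
    let mut rest = xml;
    while let Some(lt) = rest.find('<') {
        let after = &rest[lt..];
        let gt = match after.find('>') {
            Some(i) => i,
            None => break, // unterminated tag: stop scanning
        };
        let tag = &after[1..gt]; // tag content without the angle brackets
        rest = &after[gt + 1..];

        if tag == "/w:p" {
            out.push('\n');
        } else if (tag == "w:t" || tag.starts_with("w:t ")) && !tag.ends_with('/') {
            // A text run: copy everything up to the matching close tag.
            let close = match rest.find("</w:t>") {
                Some(i) => i,
                None => break,
            };
            out.push_str(&unescape_xml(&rest[..close]));
            rest = &rest[close + "</w:t>".len()..];
        }
    }
    out
}
```

The check for `"w:t"` or a `"w:t "` prefix matters because other WordprocessingML elements such as `<w:tab/>` and `<w:tbl>` share the same leading characters and must not be treated as text runs.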

Storage

All metadata lives in a local SQLite database running in WAL (Write-Ahead Logging) mode for concurrent read/write performance. The schema tracks documents, redactions, entities, page text, audit logs, and credits.

The entities table is noteworthy: when Qvault detects a name in one document, it normalizes and stores that entity. When processing subsequent documents, the system has a growing knowledge base of known PII. This cross-document intelligence means detection accuracy improves over time without any cloud-based learning.
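The normalize-then-store idea can be sketched with an in-memory map standing in for the SQLite entities table. The normalization rules here (lowercasing, trimming punctuation, collapsing whitespace) are an assumption; the article only says that entities are normalized, not how.

```rust
use std::collections::HashMap;

/// Lowercases, strips punctuation around each word, and collapses whitespace
/// so that spelling variants of the same name map to one key.
fn normalize_entity(raw: &str) -> String {
    raw.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Hypothetical in-memory stand-in for the entities table.
#[derive(Default)]
struct EntityIndex {
    /// Normalized name mapped to the number of times it has been recorded.
    seen: HashMap<String, u32>,
}

impl EntityIndex {
    /// Records a detection from the current document.
    fn record(&mut self, raw: &str) {
        *self.seen.entry(normalize_entity(raw)).or_insert(0) += 1;
    }

    /// Checks whether a candidate in a later document matches a known entity.
    fn is_known(&self, raw: &str) -> bool {
        self.seen.contains_key(&normalize_entity(raw))
    }
}
```

After `record("Meridian Partners, LLC")`, a later lookup for "MERIDIAN PARTNERS LLC" succeeds because both forms normalize to "meridian partners llc". A scanner could use such a hit to raise the confidence of an otherwise borderline capitalized sequence.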

Frontend Rendering Pipeline

Rendering a PDF with redaction overlays in real time is a non-trivial challenge. Qvault's DocumentViewer manages a multi-layer rendering stack:

PDF canvas layer. PDF.js renders each page to an HTML5 canvas at 1.5x viewport scale for crisp text on high-DPI displays.
Text span extraction. Every text fragment is extracted with position data in both PDF and viewport coordinate systems.
Redaction overlay layer. PII detections are mapped from character offsets to spatial coordinates. Color-coded overlays indicate PII category and review status.
Selection layer. In advanced mode, users can select arbitrary text to create manual redactions with a popup interface.
Export coordinate mapping. Viewport coordinates are translated back to PDF space. The Rust backend injects black rectangle operators into the PDF's content streams — permanently redacting the selected regions.

This entire pipeline runs locally. The PDF never leaves the app. The coordinates never leave the app. The redacted export is written directly to the local filesystem.
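The export step's coordinate translation can be illustrated as follows. The sketch assumes an unrotated page whose PDF origin is the bottom-left corner with no crop-box offset, and a viewport whose origin is the top-left corner; real pages may need extra handling. The `Rect` type and function names are hypothetical. The operators emitted (`q`, `rg`, `re`, `f`, `Q`) are standard PDF content-stream operators for saving state, setting the fill color, defining a rectangle, filling it, and restoring state.

```rust
/// Axis-aligned rectangle. Units depend on the coordinate space it is in.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Converts a viewport rectangle (origin top-left, y grows downward, scaled
/// by `scale`) into PDF user space (origin bottom-left, y grows upward).
fn viewport_to_pdf(v: Rect, scale: f64, page_height_pt: f64) -> Rect {
    let width = v.width / scale;
    let height = v.height / scale;
    Rect {
        x: v.x / scale,
        // Flip the y axis: the PDF rectangle is anchored at its bottom edge.
        y: page_height_pt - v.y / scale - height,
        width,
        height,
    }
}

/// Builds content-stream operators that fill the rectangle with black.
fn redaction_operators(r: Rect) -> String {
    format!(
        "q 0 0 0 rg {:.2} {:.2} {:.2} {:.2} re f Q\n",
        r.x, r.y, r.width, r.height
    )
}
```

At the 1.5x scale mentioned above and a 792 pt page height (US Letter), a viewport rectangle at (150, 300) sized 90 by 18 maps to a PDF rectangle at (100, 580) sized 60 by 12. This sketch only draws the box; it does not modify the text operators beneath it.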

Performance Characteristics

Factor	How
Compiled regex	Deterministic finite automata — linear-time matching, no backtracking. A 50–100 page document scans in under 100ms.
Single-process IPC	Tauri commands are in-process function calls. Serialization overhead is microseconds, not milliseconds.
SQLite WAL mode	Readers do not block the writer, and the writer does not block readers. The frontend queries redaction data while the backend processes new pages.
Streaming processing	Pages scanned individually, results stored incrementally. Users see detections in real time.
Native webview	~15–20 MB installed size (vs 150+ MB for Electron). Sub-second startup.
Zero network latency	No upload, no download, no waiting. Processing is purely CPU-bound, measured in milliseconds.

What Qvault Does NOT Do

Understanding what Qvault avoids is as important as understanding what it does:

No cloud uploads. Documents are never transmitted. The standard application makes no network calls; the optional enterprise integrations connect only to endpoints you host yourself.
No telemetry. No usage tracking, no analytics, no crash reporting that includes document content.
No third-party processing. No OpenAI API, no AWS Textract, no Google Cloud Vision. Every algorithm runs locally.
No external model downloads. Pure compiled code — regex patterns and heuristic rules. No model weights, no inference servers.
No temporary cloud storage. The data physically cannot leave because the processing code has no network capability.

Cross-Platform Distribution

Qvault runs natively on macOS (Apple Silicon and Intel), Windows, Linux, iOS, and Android:

Platform	Formats	Notes
macOS	.dmg	Separate builds for Apple Silicon (ARM64) and Intel (x86_64)
Windows	.exe, .msi	NSIS installer with multi-language support (EN/ES/PT)
Linux	.deb, .AppImage	Debian/Ubuntu packages and universal AppImage
iOS	.ipa	iPhone, iPad — touch-optimized UI with bottom tab navigation
Android	.apk	Phones and tablets — responsive layout with document picker

All desktop builds are produced by a single GitHub Actions workflow using a matrix strategy. Mobile builds use Tauri v2's native iOS (Xcode) and Android (Gradle) toolchains, sharing the same Rust backend and React frontend.

Enterprise Integration

For organizations requiring enhanced PII detection beyond regex and heuristic analysis, Qvault offers enterprise integration options:

MCP Pipeline Integration. Connect Qvault to a Model Context Protocol (MCP) pipeline for LLM-powered PII detection. The MCP server can run alongside Qvault, enabling AI-assisted entity recognition while maintaining data locality.
Private LLM Deployments. Deploy a local language model (e.g., Llama, Mistral) on your own infrastructure. Qvault connects to your private endpoint — no data leaves your network. Ideal for firms with strict compliance requirements who want AI-level accuracy.
Custom Pattern Libraries. Enterprise customers can define jurisdiction-specific or industry-specific PII patterns beyond the built-in global/US/EU/BR/DE coverage.

Enterprise integration is available through Santacroce SL. Contact info@santacroce.es to discuss your requirements.

Jurisdictional Coverage

Legal work is increasingly international. A firm in Madrid might handle documents containing US Social Security Numbers, Brazilian CPFs, and German tax IDs — all in the same case.

Jurisdiction	PII Types Detected
Global	Email, credit cards, IBAN, IP addresses, URLs, monetary amounts, dates, phone numbers
United States	Social Security Numbers (SSN), Employer Identification Numbers (EIN)
European Union	VAT identification numbers (multi-country format)
Brazil	CPF (individual tax ID), CNPJ (corporate tax ID)
Germany	Steuernummer (tax identification number)

Combined with the heuristic scanner's ability to detect names and company entities regardless of language, Qvault provides broad coverage for international legal practices.

The Trust Model

Qvault's security model is simple: trust the machine, distrust the network.

Audit logging records every action for compliance review.
No network surface eliminates MITM attacks, data exfiltration, API key leakage, and server breaches as threat vectors.

For legal professionals operating under strict confidentiality obligations, this model provides something cloud tools fundamentally cannot: certainty that client data never left the building.

Qvault is developed by Santacroce SL, Madrid, Spain.
Learn more at qvault.tech.
