Redactr
February 2026

Zero data retention as an architecture principle


The constraint

Redactr API lets developers embed full-depth PDF redaction into their software. The core engineering problem isn't PDF processing — it's guaranteeing that sensitive data never persists anywhere in the pipeline that isn't under the customer's control.

Not "we delete it after processing." Not "we encrypt it at rest." Document content never hits our database and never gets written to our storage. When a job is done, there's nothing to clean up because there's nothing to find.

That's the constraint. And it turns out that taking a privacy requirement this seriously makes every architectural decision clearer.

Why most teams get this wrong

The default approach to privacy in most systems is policy-based. Process the data however you like, store it wherever is convenient, then layer on retention policies, encryption, and access controls after the fact. The architecture optimises for functionality first. Privacy gets bolted on.

This works until it doesn't. A log statement captures a file path it shouldn't. A queue retains message payloads longer than expected. A third-party API caches your request. Each gap is small, but in a system handling sensitive documents, any one of them breaks the guarantee you made to your customers.

Redactr takes the opposite approach: design the pipeline so there's nowhere for data to persist in the first place.

How data flows through the pipeline

When a client sends a PDF via the API, the document bytes live in memory for the entire request lifecycle. The Laravel application receives the upload, passes the bytes to the gRPC services for processing, and returns the result in the HTTP response. The PDF never touches disk and never gets written to storage. When the response is sent, the data is gone.

For async processing, jobs are dispatched to a Redis queue. The queue payload includes the document bytes, and Redis holds its dataset in memory; with persistence disabled (no RDB snapshots, no append-only file), the payload never reaches disk. Once a worker picks up the job and finishes processing, the payload is gone.
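Keeping queue payloads off disk depends on Redis being configured without persistence. A minimal redis.conf sketch (illustrative, not necessarily Redactr's exact configuration):

```conf
# Disable RDB snapshots: never write the dataset to dump.rdb
save ""

# Disable the append-only file, so writes are not journalled to disk
appendonly no
```

With both mechanisms off, a payload exists only in Redis's process memory and disappears when the key is deleted or the process exits.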

The gRPC services

Redactr coordinates two gRPC microservices for processing: a PDF service built on PyMuPDF for document operations, and a suggestions service for identifying sensitive data.

PDF service

The PDF service handles the heavy lifting — extracting document metadata, applying redactions, verifying the results. For each operation, the PDF bytes are sent in the gRPC message body, processed in memory by the Python service, and the result is returned. No temporary files, no disk writes.

For text extraction specifically, the service uses gRPC server-side streaming. Rather than extracting all text and returning it in one message, it streams page-by-page. This keeps memory bounded for large documents and means the full extracted text never needs to exist as a single in-memory object on either side.
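The page-by-page streaming pattern can be sketched in Python. This is a simplified stand-in, not Redactr's service code: the gRPC and PyMuPDF plumbing is omitted, and each element of `pages` represents the text already pulled from one page. The `max_chars` split keeps every streamed message under a bounded size.

```python
from typing import Iterable, Iterator

def stream_page_text(pages: Iterable[str], max_chars: int = 64_000) -> Iterator[str]:
    """Yield extracted text one page (or sub-page chunk) at a time.

    Neither the producer nor the consumer ever holds the full
    document text as a single in-memory object: each yielded chunk
    can be forwarded as one gRPC streaming message and discarded.
    """
    for text in pages:
        # Oversized pages are split so each message stays within the
        # configured gRPC max message size. max(len, 1) ensures an
        # empty page still yields one (empty) chunk.
        for start in range(0, max(len(text), 1), max_chars):
            yield text[start:start + max_chars]
```

In the real service, the generator body would be the server-side streaming RPC handler, with `pages` backed by a PyMuPDF document opened from the in-memory bytes.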

Suggestions service

The suggestions service uses Presidio to identify names, addresses, financial details, and other entities that might need redaction. It takes extracted text (not PDF bytes) and returns structured suggestions. The document itself never reaches this service.

AI suggestions with Bedrock

For premium suggestion agents, Redactr uses AWS Bedrock for LLM-based sensitive data identification. Sending documents through an LLM is a privacy minefield, but two design decisions make it work within the zero-retention constraint.

First, Bedrock only receives extracted page text — never the full PDF. The gRPC PDF service extracts text via streaming, and only that text is sent to Bedrock. This means Bedrock never sees the document structure, images, or any content that wasn't specifically extracted for suggestions.

Second, AWS Bedrock doesn't store prompts, doesn't store completions, and doesn't use your data for training. The text goes in, the inference runs, the structured results come back, and nothing persists on AWS's side.

The result is that only structured suggestions — entity types and matched text — flow back into the pipeline. The LLM's role is narrow and stateless: receive text, identify entities, return suggestions, forget everything.
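That narrowing step can be made explicit in code. The sketch below assumes the model returns a JSON array; the field names `entity_type` and `matched_text` are illustrative, not a documented Bedrock schema. The point is the filter: only those two fields survive into the pipeline, and anything else the model emits is dropped.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Suggestion:
    entity_type: str   # e.g. "PERSON", "IBAN"
    matched_text: str  # the exact span flagged in the page text

def parse_suggestions(raw: str) -> list[Suggestion]:
    """Parse the model's JSON output into structured suggestions.

    Only entity_type and matched_text are kept; any extra fields the
    model returns are discarded, so nothing else flows back into the
    pipeline. Malformed items are skipped rather than propagated.
    """
    return [
        Suggestion(entity_type=item["entity_type"],
                   matched_text=item["matched_text"])
        for item in json.loads(raw)
        if "entity_type" in item and "matched_text" in item
    ]
```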

Privacy in the details

The constraint doesn't just shape the big architectural decisions. It shows up in the small ones.

Error message sanitisation. When a processing job fails, the exception message gets sanitised before it's stored. File paths are stripped out with pattern matching and replaced with [path]. A failed job record tells you what went wrong without revealing anything about the document or where it came from.
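A minimal version of that sanitisation looks like the following. The two patterns are illustrative (Unix-style absolute paths and Windows drive paths); a production sanitiser would cover more shapes, such as UNC paths or URLs.

```python
import re

# Illustrative patterns for things that look like file paths.
_PATH_PATTERNS = [
    re.compile(r"(?:/[\w.\-]+)+/?"),             # /tmp/uploads/doc.pdf
    re.compile(r"[A-Za-z]:\\(?:[\w.\-]+\\?)+"),  # C:\Users\alice\doc.pdf
]

def sanitise_exception_message(message: str) -> str:
    """Replace anything resembling a file path with [path] before the
    message is written to the failed-job record."""
    for pattern in _PATH_PATTERNS:
        message = pattern.sub("[path]", message)
    return message
```

The failed-job record then carries "Failed to open [path]: corrupt xref" instead of the original upload path.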

Idempotency handling. The API supports idempotency keys for safe retries, but binary responses — the actual PDF bytes — are excluded from the idempotency cache entirely. Only JSON metadata responses get cached. This prevents processed documents from sitting in a cache table between retries.
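The exclusion rule is simple to express. This sketch uses a plain dict as the cache and a content-type check as the gate; the real store and the exact gating condition are assumptions, but the invariant is the same: PDF bytes never enter the idempotency cache.

```python
def cache_idempotent_response(cache: dict, key: str,
                              content_type: str, body: bytes) -> None:
    """Store a response for idempotent replay, but only JSON metadata.

    Binary responses (the processed PDF bytes) are skipped entirely,
    so documents never sit in the cache between retries.
    """
    if content_type != "application/json":
        return  # binary body: never cached
    cache[key] = body

def replay(cache: dict, key: str):
    """Return the cached response for a repeated idempotency key, if any."""
    return cache.get(key)
```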

API request logging. Request logs capture HTTP method, endpoint, status code, duration, and response size. Not request or response bodies. You can debug performance and error rates without ever seeing what was in the documents.
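The allow-list of logged fields can be captured in one small constructor. The field names here are illustrative; what matters is that the record is built from metadata only, so there is no code path by which a body could end up in the log.

```python
def request_log_record(method: str, endpoint: str, status: int,
                       duration_ms: float, response_size: int) -> dict:
    """Build the log record for an API request.

    Only metadata is captured. Request and response bodies are not
    parameters of this function, so they can never appear in a log.
    """
    return {
        "method": method,
        "endpoint": endpoint,
        "status": status,
        "duration_ms": duration_ms,
        "response_size": response_size,
    }
```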

Each of these is a small decision, but they add up to a system where the constraint is enforced at every layer, not just the obvious ones.

Where the complexity actually lives

Zero data retention shifts complexity away from data lifecycle management and towards pipeline design. The problems you solve are different:

Memory management. When documents never touch disk, everything happens in memory. Large PDFs need careful handling. The gRPC streaming for text extraction helps here, and the max message size is configurable, but you still need to reason about memory bounds across the pipeline.

Error handling without replay. In a typical system, if a processing step fails, you retry from the stored input. When document content isn't stored, the client resubmits. This sounds like a limitation, but it simplifies the system — there's no stale input to manage, no question about whether the stored version matches what the client intended to send.

Observability without exposure. You need to know what's happening in the pipeline — processing times, error rates, success rates per operation. But you can't log document content, and error messages need sanitisation. The constraint extends to your debugging tools.

None of these are unsolvable. But they're the kind of problems you only encounter when you take the constraint seriously, and solving them leads to better engineering decisions than the alternative.

Why starting here is easier than retrofitting

The counterintuitive thing about building privacy-first is that it's less work than adding privacy later. When retention is impossible by design, you don't need:

  • Data retention policies and the enforcement mechanisms behind them
  • Encryption key management for documents at rest
  • Access control matrices for stored content
  • Audit logging of who accessed what document when
  • Scheduled cleanup jobs and the monitoring to ensure they actually run

That entire category of operational overhead disappears when the data isn't there.

The architecture ends up simpler, too. The Laravel application orchestrates. The PDF service processes. The suggestions service identifies entities. Bedrock infers. Data flows through each layer and is gone. There's no shared document store to manage, no cache invalidation to worry about, no eventual consistency problems around storage.

Starting with the constraint didn't make Redactr harder to build. It made the architecture more obvious.