Zero-Latency PII Filtering in Python Logging: GDPR Compliance Without the Performance Hit
While implementing centralized logging with Axiom for a FastAPI service, I ran into a challenge: how do you filter PII (Personally Identifiable Information, such as email addresses and phone numbers) from logs without slowing down your application?
The obvious solution—sanitizing logs during the log call itself—added 1-2ms to every request:
logger.info(sanitize_pii(f"User {email} logged in")) # Blocks main thread!
This blocks the main thread for every single log entry. Python runs request-handling code on a single thread by default, and regex-based PII filtering takes 1-2ms per log call, so the cost lands on every request that logs. As more filters are added, that overhead will only grow.
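To check whether the 1-2ms figure holds for your own patterns, a rough measurement is enough. The sketch below times a stand-in `sanitize_pii` built from a single email regex; the function body, the helper `average_cost_ms`, and the sample message are assumptions for illustration, not the code used in the service.

```python
import re
import timeit

# Assumption: a stand-in for the real sanitize_pii, using one email regex only.
EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def sanitize_pii(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_PATTERN.sub("[REDACTED]", text)


def average_cost_ms(message: str, runs: int = 1000) -> float:
    """Average wall-clock cost of one sanitize_pii call, in milliseconds."""
    total_seconds = timeit.timeit(lambda: sanitize_pii(message), number=runs)
    return total_seconds / runs * 1000


if __name__ == "__main__":
    sample = "User jane.doe@example.com logged in from 10.0.0.1"
    print(f"{average_cost_ms(sample):.4f} ms per call")
```

The number you get depends heavily on how many patterns run and how long the messages are, so measure with realistic log lines.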
The Solution: Background Thread Filtering
Most modern log aggregation libraries (axiom-py, datadog, logstash) buffer logs and send them in batches every 5-10 seconds using a background thread. So what if we use that thread and sanitize during the flush rather than during the log call itself? To do that, I extended AxiomHandler:
from axiom_py.logging import AxiomHandler


class SafeAxiomHandler(AxiomHandler):
    """Custom handler with automatic PII filtering in background thread."""

    def flush(self):
        """Override flush to sanitize before sending to Axiom."""
        if len(self.buffer) == 0:
            return
        # Sanitize all buffered logs (runs on background thread!)
        sanitized_buffer = [
            self._sanitize_event(event)
            for event in self.buffer
        ]
        self.buffer = []
        self.client.ingest_events(self.dataset, sanitized_buffer)

    def _sanitize_event(self, event_dict: dict) -> dict:
        """Redact PII from log event."""
        sanitized = {}
        for key, value in event_dict.items():
            if isinstance(value, str):
                # Use your sanitization function
                value = sanitize_string(value)
            sanitized[key] = value
        return sanitized
Why This Works
The flush() method runs on a background thread (via Timer) in axiom-py and similar libraries. By moving PII filtering there:
- Main thread: 0.01ms overhead (just dict conversion)
- Background thread: 1-2ms for regex filtering
- Request latency impact: 0ms
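The logs get sanitized up to one flush interval later (about 5 seconds in this setup), when the background thread flushes the buffer. For production logging, that delay is perfectly acceptable. One caveat: regex matching holds Python’s GIL, so the background work still competes with request threads for CPU time; what it no longer does is sit on the critical path of an individual log call.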
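The same mechanism can be tried without an Axiom account. The following is a minimal sketch using only the standard library: `BufferedHandler`, `SanitizingHandler`, and the in-memory `sink` list are hypothetical stand-ins, not axiom-py's actual implementation, and the email regex stands in for a fuller sanitizer.

```python
import logging
import re
import threading

EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


class BufferedHandler(logging.Handler):
    """Hypothetical batching handler; the sink list plays the remote service."""

    def __init__(self, sink: list, interval: float = 0.05):
        super().__init__()
        self.sink = sink
        self.interval = interval
        self.buffer = []
        self.timer = None
        self.flushed = threading.Event()

    def emit(self, record: logging.LogRecord) -> None:
        # Cheap work only: store the message and arm the timer if needed.
        # logging calls emit() with the handler lock already held.
        self.buffer.append({"msg": record.getMessage()})
        if self.timer is None:
            self.timer = threading.Timer(self.interval, self.flush)
            self.timer.daemon = True
            self.timer.start()

    def flush(self) -> None:
        with self.lock:
            batch, self.buffer = self.buffer, []
            self.timer = None
        self.sink.extend(batch)
        self.flushed.set()


class SanitizingHandler(BufferedHandler):
    def flush(self) -> None:
        # Runs on whichever thread calls flush(); normally the timer thread.
        self.flush_thread = threading.current_thread().name
        with self.lock:  # take the batch quickly, then release the lock
            batch, self.buffer = self.buffer, []
            self.timer = None
        # The expensive regex pass happens outside the lock.
        self.sink.extend(
            {"msg": EMAIL_PATTERN.sub("[REDACTED]", event["msg"])} for event in batch
        )
        self.flushed.set()


if __name__ == "__main__":
    sink = []
    handler = SanitizingHandler(sink)
    demo_logger = logging.getLogger("pii_demo")
    demo_logger.setLevel(logging.INFO)
    demo_logger.addHandler(handler)
    demo_logger.info("User john@example.com logged in")
    handler.flushed.wait(timeout=2)
    print(sink, "flushed on", handler.flush_thread)
```

Swapping the buffer under the handler lock and sanitizing afterwards is a design choice in this sketch: holding the lock during the regex pass would make concurrent log calls wait, which is the cost the pattern is trying to avoid.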
How Method Overriding Works
When you inherit from a class and define a method with the same name as the parent class, you override that method. Python’s method resolution looks for methods in the child class first:
# Parent class (axiom-py library)
class AxiomHandler:
    def flush(self):
        """Original: just send buffer to Axiom"""
        if len(self.buffer) > 0:
            self.client.ingest_events(self.dataset, self.buffer)
            self.buffer = []


# Your child class
class SafeAxiomHandler(AxiomHandler):
    def flush(self):
        """Override: sanitize BEFORE sending"""
        # This completely replaces parent's flush()
        sanitized_buffer = [self._sanitize_event(e) for e in self.buffer]
        self.buffer = []
        self.client.ingest_events(self.dataset, sanitized_buffer)
The execution flow:
Background Thread (Timer) calls flush():
↓
Is flush() defined in SafeAxiomHandler?
↓ YES
SafeAxiomHandler.flush() executes:
1. Sanitize all events in buffer (YOUR CODE)
2. Send sanitized events to Axiom (YOUR CODE)
↓
Done! Main thread was never blocked.
The brilliance: the background thread calling pattern doesn’t change. When axiom-py’s Timer fires and calls self.flush(), Python automatically routes it to your override. The library doesn’t know you’ve customized it—it just works.
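The dispatch described here can be observed with two toy classes. In this sketch `Parent` and `Child` are hypothetical names; `Parent` plays the role of a library class that schedules its own `flush()` on a `threading.Timer`.

```python
import threading


class Parent:
    """Hypothetical library class that schedules its own flush()."""

    def __init__(self):
        self.calls = []

    def start(self, delay: float = 0.01) -> threading.Timer:
        # The library refers to self.flush; it never names a concrete class.
        timer = threading.Timer(delay, self.flush)
        timer.start()
        return timer

    def flush(self):
        self.calls.append("Parent.flush")


class Child(Parent):
    def flush(self):
        self.calls.append("Child.flush")


if __name__ == "__main__":
    child = Child()
    child.start().join()  # wait for the timer thread to finish
    print(child.calls)  # ['Child.flush']
    print([cls.__name__ for cls in Child.__mro__])  # ['Child', 'Parent', 'object']
```

`self.flush` is looked up on the instance's class first, so the timer created inside `Parent.start()` ends up calling `Child.flush()`.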
You have access to all parent class attributes:
- self.buffer - inherited from AxiomHandler
- self.client - inherited from AxiomHandler
- self.dataset - inherited from AxiomHandler
As an extra: if you want to keep the parent’s behavior and add to it, use super():
def flush(self):
    # Sanitize first
    self.buffer = [self._sanitize_event(e) for e in self.buffer]
    # Then call parent's flush
    super().flush()  # Calls AxiomHandler.flush()
Implementation Tips
1. Identify Sensitive Keys
Some data should be completely redacted:
SENSITIVE_KEYS = {
    "questionnaire", "questions", "answers",
    "chat_history", "password", "api_key"
}

# Inside the loop of _sanitize_event:
if key in SENSITIVE_KEYS:
    sanitized[key] = "[REDACTED]"
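The check above only looks at top-level keys. If your events can carry nested payloads (an assumption about your log shape, not something axiom-py requires), a recursive variant is one way to cover them. The function name `redact_sensitive_keys` is hypothetical.

```python
from typing import Any

SENSITIVE_KEYS = {
    "questionnaire", "questions", "answers",
    "chat_history", "password", "api_key",
}


def redact_sensitive_keys(event: Any) -> Any:
    """Return a copy of event with sensitive keys replaced, at any depth."""
    if isinstance(event, dict):
        return {
            key: "[REDACTED]" if key in SENSITIVE_KEYS else redact_sensitive_keys(value)
            for key, value in event.items()
        }
    if isinstance(event, list):
        return [redact_sensitive_keys(item) for item in event]
    # Scalars (str, int, None, ...) are returned unchanged.
    return event


if __name__ == "__main__":
    event = {"user": "u1", "password": "hunter2", "extra": {"api_key": "abc", "n": 1}}
    print(redact_sensitive_keys(event))
```

The function builds new containers instead of mutating the input, so the original event stays intact if something else still holds a reference to it.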
2. Use Proper PII Detection Libraries
For emails, regex works fine. For phone numbers, use the phonenumbers library (a Python port of Google’s libphonenumber) for accurate international detection:
import re
import phonenumbers

# Email: regex is sufficient
EMAIL_PATTERN = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
)


def sanitize_string(text: str) -> str:
    """Sanitize PII from text."""
    # Redact emails
    text = EMAIL_PATTERN.sub("[REDACTED]", text)
    # Redact phone numbers using Google's library (more accurate for GDPR).
    # With region None, only numbers in international format (leading "+") match;
    # pass a default region code to also catch national formats.
    for match in phonenumbers.PhoneNumberMatcher(text, None):
        text = text.replace(match.raw_string, "[REDACTED]")
    return text
Why a phonenumbers library over regex? GDPR compliance depends on accurate detection, and phone number formats vary wildly across countries. Google’s library covers far more of those formats than a hand-written regex realistically can.
Alternative: For more comprehensive PII detection (names, addresses, credit cards, etc.), consider Microsoft’s Presidio. It’s a full-featured PII detection and anonymization framework with support for multiple languages and custom entity types. Maybe I will switch to this in one of the next iterations.
Edge Cases to Consider
Buffer Overflow: If your app fills the buffer faster than the timer drains it (>1000 logs in 5 seconds), flush() might run on the main thread, which puts the sanitization cost back on the request path. Monitor buffer size in production.
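One way to notice this in production is to count how often flush() runs on the main thread. The decorator below is a sketch; `count_main_thread_flushes`, the `main_thread_flushes` attribute, and `DemoHandler` are hypothetical names, and the counter would still need to be exported to whatever metrics system you use.

```python
import functools
import threading


def count_main_thread_flushes(flush):
    """Wrap a handler's flush() and count calls made from the main thread."""

    @functools.wraps(flush)
    def wrapper(self, *args, **kwargs):
        if threading.current_thread() is threading.main_thread():
            # Count instead of logging, so the handler never logs into itself.
            self.main_thread_flushes = getattr(self, "main_thread_flushes", 0) + 1
        return flush(self, *args, **kwargs)

    return wrapper


class DemoHandler:
    """Hypothetical stand-in for a buffered handler."""

    def __init__(self):
        self.buffer = ["event"]

    @count_main_thread_flushes
    def flush(self):
        self.buffer = []


if __name__ == "__main__":
    handler = DemoHandler()
    handler.flush()  # called directly, i.e. from the main thread
    worker = threading.Thread(target=handler.flush)
    worker.start()
    worker.join()
    print(handler.main_thread_flushes)  # 1
```

A counter that keeps growing under load suggests the buffer limit is being hit and the sanitization cost has moved back onto the request path.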
Performance Consideration: The phonenumbers library is more thorough than regex but slightly slower. In production, this trade-off is worth it for GDPR accuracy, especially since filtering happens on the background thread.
Testing: Mock the client to test without real API calls:
import logging
from unittest.mock import MagicMock


def test_pii_filtering():
    mock_client = MagicMock()
    handler = SafeAxiomHandler(client=mock_client, dataset="test")
    logger = logging.getLogger("test")
    # Without this, the effective level is WARNING and info() is dropped
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info("User john@example.com logged in")
    handler.flush()
    # Verify email was redacted
    sent_events = mock_client.ingest_events.call_args[0][1]
    assert "[REDACTED]" in sent_events[0]["msg"]
    assert "john@example.com" not in sent_events[0]["msg"]
Trade-offs
Benefits:
- Zero request latency impact
- Automatic protection (developers can’t forget)
- GDPR compliant by default
- Easy to test and maintain
Costs:
- Logs delayed by 5-10 seconds
- Slight memory overhead (buffer storage)
- Pattern only works with buffered/batched handlers
When to Use This Pattern
This pattern is essential for:
- Production APIs with strict SLAs (<3s response time)
- GDPR/CCPA compliance requirements
- High-volume logging (>100 logs/second)
- User-generated content that might contain PII
Don’t use it for:
- Development/local logging (console handlers don’t buffer)
- Low-traffic apps where 1-2ms overhead is acceptable
- Real-time log streaming (no buffering)
The Result
In our production FastAPI service:
- Before: 1-2ms added to request time for PII filtering
- After: 0ms request overhead, filtering happens in background
- GDPR compliance: Automatic, no developer action needed
By leveraging the background thread that already exists in modern logging libraries, we achieved GDPR-compliant logging without sacrificing performance. The key was recognizing that log delivery doesn’t need to be synchronous—and that’s where the opportunity lives.
Resources
Python Logging
- Python Logging Documentation - Official docs
- Logging Best Practices - The Hitchhiker’s Guide to Python
- Python Logging Handlers - Handler reference
GDPR & PII
- GDPR Official Text - Full regulation text
- GDPR Logging Requirements - Compliance guide
- Google phonenumbers Library - Accurate international phone number detection
- PII Detection Patterns - Microsoft’s PII detection library
Log Aggregation Services
- Axiom Documentation - Axiom logging service
- Datadog Logging - Datadog logs
- Better Stack - Modern logging platform
Background Processing
- Python Threading - Threading documentation
- Queue Module - Thread-safe queues
- AsyncIO Logging - Async logging patterns
How do you handle PII in your logging? Found other patterns that work? I’d love to hear about your approach—connect with me on LinkedIn.