The fastest query is the one you never run
A public status page is the most deceptively simple feature I've built. It's a green dot, a list of services, and that row of little coloured bars showing the last 90 days of uptime. Customers glance at it for three seconds and leave. How hard can it be?
Hard in exactly one place, and it's not the place you'd guess.
DomainDash stores every uptime check in TimescaleDB. When I first sketched the status page, the plan was obvious: the data's right there in the time-series database, so query it. Ask for 90 days of history per site and paint the bars. I'd already written that exact query for the insights feature. Copy, paste, ship.
Then I remembered what a status page actually is: a public, unauthenticated page that gets hammered precisely when things are on fire. And the query I was about to put on its hot path is the single most expensive question you can ask a time-series database — give me all the history, for every service, right now — multiplied by every anxious customer hitting refresh during an incident.
This post is about not running that query.
Same data, opposite access pattern
The 90-day uptime read shows up twice in DomainDash, and the two cases look identical until you count how often they happen.
In the insights feature, the read is low-frequency. It runs once per day per site, to assemble a snapshot for the LLM, and the result is cached for 36 hours. Querying TimescaleDB live is completely fine there: a continuous-aggregate read against a rollup table is cheap, and you do it a handful of times a day.
On a status page, the read is high-frequency and public. Same SQL, same rollup table, but now it's on the request path of a page designed to be refreshed by hundreds of people at the worst possible moment. The cost per query didn't change. The number of queries did, by three or four orders of magnitude.
That's the whole insight. The right architecture isn't a property of the data — it's a property of how often, and by whom, the data gets read. So for the status page I moved the expensive read off the request path entirely.
The shape of the cheat
The request path for a published status page runs zero time-series queries. It reads three ordinary columns off the sites row:
status: the live verdict (healthy, slow, down), already denormalised because the dashboard's site cards read it too.status_sparkline_90d: a JSON array of 90 verdicts, one per day. The bars.status_sparkline_refreshed_at: when that array was last rebuilt.
No time_bucket(). No continuous-aggregate scan. No join. The most data-intensive widget on the page is a column read, served from a row the database already has in memory.
The expensive part happens once a night, for everybody, in the dark.
The nightly batch
A scheduled command rebuilds the sparkline for every site attached to a status page. It's the only place the time-series database gets the big question:
$rows = DB::table('uptime_checks_daily')
->select('site_id', 'bucket')
->selectRaw('SUM(check_count) AS checks')
->selectRaw('SUM(success_count) AS successes')
->selectRaw('SUM(avg_response_ms * check_count) / NULLIF(SUM(check_count), 0) AS weighted_avg_ms')
->whereIn('site_id', $siteIds->all())
->where('bucket', '>=', $start)
->groupBy('site_id', 'bucket')
->get()
->groupBy('site_id');uptime_checks_daily is a TimescaleDB continuous aggregate: a materialised, incrementally-maintained rollup of the raw uptime_checks hypertable, bucketed to one row per site, per region, per day:
CREATE MATERIALIZED VIEW uptime_checks_daily
WITH (timescaledb.continuous) AS
SELECT
site_id,
region_id,
time_bucket('1 day', checked_at) AS bucket,
COUNT(*) AS check_count,
COUNT(*) FILTER (WHERE success) AS success_count,
AVG(response_time_ms) AS avg_response_ms,
MIN(response_time_ms) AS min_response_ms,
MAX(response_time_ms) AS max_response_ms
FROM uptime_checks
GROUP BY site_id, region_id, bucket;The batch reads 90 daily buckets per site (already aggregated; we're not scanning the raw checks), collapses the per-region rows into one verdict per day, and writes the resulting 90-element array back onto the sites row. One query, all sites, once a day. Then it sleeps.
A day's verdict is a small, deliberate classification: anything below 99% success is down, anything healthy-but-slow crosses a response-time threshold, no data at all is null (renders as a grey gap, not a lie):
public function __invoke(int $checks, int $successes, ?int $avgMs): ?StatusPageVerdict
{
if ($checks === 0) {
return null;
}
if ($successes / $checks < 0.99) {
return StatusPageVerdict::Down;
}
$slowThreshold = (int) config('services.checks.slow_threshold_ms');
if ($avgMs !== null && $avgMs >= $slowThreshold) {
return StatusPageVerdict::Slow;
}
return StatusPageVerdict::Healthy;
}The catch: yesterday's truth, served today
Precomputing buys you cheap reads and hands you a freshness problem in return. Two of them, actually.
The continuous aggregate is deliberately stale. Its refresh policy stops an hour short of now:
SELECT add_continuous_aggregate_policy('uptime_checks_daily',
start_offset => INTERVAL '3 days',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour',
if_not_exists => true);That end_offset is intentional: you don't want a bucket materialising while checks are still landing in it, or you bake in a half-finished number. So the rollup is always at least an hour behind reality by design.
The stored array drifts off the calendar. This is the subtle one. The array is a batch artifact: its last slot is the UTC day the refresh ran. Between nightly runs, "the last element" and "today" quietly diverge — by lunchtime the array's final bar is yesterday wearing today's hat.
If I rendered the raw array, every status page would show a stale "today" that's really last night, and an outage happening right now wouldn't appear until tomorrow's batch. Unacceptable for the one feature whose entire job is to be believed during an incident.
The fix keeps the cheap read and stitches the fresh bit back on at render time — still without touching TimescaleDB:
$stored = array_values($site->status_sparkline_90d ?? []);
$storedEnd = $site->status_sparkline_refreshed_at?->copy()->utc()->startOfDay() ?? $today;
$byDate = [];
foreach ($stored as $i => $verdict) {
$date = $storedEnd->copy()->subDays(count($stored) - 1 - $i)->format('Y-m-d');
$byDate[$date] = $verdict;
}
$live = StatusPageVerdict::forSiteStatus($site->status)->value;
for ($i = self::DAYS - 1; $i >= 0; $i--) {
$date = $today->copy()->subDays($i);
$isToday = $i === 0;
$verdict = $byDate[$date->format('Y-m-d')] ?? null;
if ($isToday) {
$verdict = $this->worst($verdict, $live);
}
// ...
}Two moves. First, re-anchor the stored array onto real calendar dates using status_sparkline_refreshed_at, so a batch that ran 14 hours ago lands on the right days no matter when the page is viewed. Second, fill today's slot from the live status column (the same denormalised verdict that paints the green dot), so the newest bar is always current to the last check, with no query at all.
The worst() part matters more than it looks. Today's bar takes the more severe of the stored verdict and the live one:
private function worst(?string $stored, string $live): string
{
$rank = ['healthy' => 0, 'slow' => 1, 'down' => 2];
if ($stored === null || ! array_key_exists($stored, $rank)) {
return $live;
}
return $rank[$stored] >= $rank[$live] ? $stored : $live;
}A site can go down at 9am and recover by noon. The live status at render time is "healthy" — but the day was not healthy, and a status page that quietly repaints a morning outage the moment things recover is a status page nobody should trust. Worst-wins means the record of the bad morning survives the good afternoon.
Current status is its own, simpler question
The big green "all systems operational" verdict at the top never goes near the history at all. Each site already carries a live status enum; the page-wide verdict is just the worst of them:
if ($verdicts->contains(StatusPageVerdict::Down)) {
return StatusPageVerdict::Down;
}
if ($verdicts->contains(StatusPageVerdict::Slow)) {
return StatusPageVerdict::Slow;
}
return StatusPageVerdict::Healthy;"Is it up right now" and "what did the last 90 days look like" feel like the same question. They're not. One is a denormalised column updated by the check pipeline; the other is a nightly rollup with a live patch on the end. Keeping them separate is what lets each one be cheap.
Caching is the bonus, not the crutch
Because the render is already cheap, the HTTP caching on top is gravy rather than load-bearing. Published pages go out with:
$response->headers->set('Cache-Control', 'public, max-age=30, s-maxage=30');
$response->headers->set('Vary', 'Cookie');A shared CDN cache absorbs the refresh-stampede during an incident, 30 seconds at a time. But here's the part I like: if the cache evaporated entirely, the origin would be fine, because every uncached hit is still just a handful of column reads. I didn't build a cache to rescue an expensive page. I built a cheap page and let the cache make it cheaper. Drafts, by contrast, get private, no-store — a preview is for one logged-in person and must never leak into a shared cache.
Multi-tenancy on an anonymous page
One more wrinkle unique to public pages: there's no logged-in user to scope the data to. Resolution happens off the hostname instead. A DomainDash-hosted page arrives on a {slug}.status.domaindash.io subdomain and is looked up by slug. A customer's own status.acme.com is matched by host, but only if its custom domain has been verified:
$statusPage = $query
->where('custom_domain', $request->getHost())
->where('custom_domain_status', CustomDomainStatus::Active)
->first();The Active gate is the whole security boundary for custom domains: an unverified or pending domain resolves to nothing, so you can't point a hostname at someone else's page and have it served. Slugs are unique, custom domains are unique and status-gated, and the slug route is registered before the custom-domain fallback, so the two namespaces can't collide.
What this isn't
- No separate analytics store. The bars come from the same TimescaleDB rollup that powers everything else. There's no warehouse, no OLAP copy, no second source of truth to drift.
- No pre-rendered snapshot table. The precomputed data is a JSON column on the row it belongs to, not a separate blob store or a pile of generated HTML.
- No event sourcing. Incidents are plain state, not an append-only log to replay.
- No time-series query on the request path. This is the entire point. The hot path reads three columns and stitches one live value onto the end.
Every one of these is a thing you'd reach for if the simple version fell over. It doesn't.
The shape, in summary
| Concern | How it's served | Cost on the request path |
|---|---|---|
| 90-day history | status_sparkline_90d JSON column | One column read |
| Today's bar | Live status column, worst-merged | One column read, no query |
| Current verdict | Worst of each site's live status | In-memory over loaded sites |
| Calendar alignment | Re-anchored via status_sparkline_refreshed_at | Pure PHP, no I/O |
| The expensive read | Nightly batch over uptime_checks_daily | Off the request path entirely |
| Rollup freshness | Continuous aggregate, 1-hour end_offset | N/A, runs on a policy |
| Stampede protection | s-maxage=30 shared cache | Bonus, not load-bearing |
| Tenant scoping | Slug subdomain or verified custom domain | One indexed lookup |
The interesting engineering in a status page isn't in the rendering. It's in the decision, made once, about when the hard question gets asked. Ask it on every page load and a time-series database will happily oblige until the day it can't. Ask it once a night, write the answer down, and patch the last few hours on at the edge — and the page that's supposed to stay up when everything else is falling over becomes the cheapest thing you serve.
The fastest query really is the one you never run.