Infrastructure · 2026-03-18 · 6 min

Your Cron Jobs Are Failing and Nobody Knows

You have background jobs. Every non-trivial application does. Token rotation, nightly data syncs, cleanup tasks, report generation, cache warming, email digests. They run on a schedule: a crontab, an EventBridge rule, a Kubernetes CronJob, Celery beat, whatever.

Here's the problem: you have absolutely no idea whether they're actually running. Your token rotation job could have been silently failing for three weeks. The nightly sync might have thrown an exception last Tuesday and never recovered. The cleanup job that keeps your database from bloating? It might have stopped on the last deploy when someone refactored the import path.

You won't find out until users complain. Or until the database fills up. Or until OAuth tokens expire in bulk and a hundred users can't log in on the same Monday morning.

The invisible infrastructure

Web requests are visible. They hit your API, they show up in logs, they have response codes and latency metrics. If your API goes down, you know within minutes.

Background jobs are invisible. They run in a separate process, often on a different container or Lambda function. They log to stdout, which either goes to CloudWatch (which nobody checks) or disappears entirely. There's no user on the other end waiting for a response. There's no 500 status code. There's no alert.

This creates failure modes that web requests rarely have:

- **Silent failures**: the job throws an exception, logs it somewhere nobody reads, and doesn't run again until the next scheduled time. If it fails again, it silently fails again. Forever.
- **Zombie jobs**: the job is registered in your crontab or scheduler but the handler function was moved, renamed, or deleted. The scheduler tries to invoke it, gets an import error, and logs it. Nobody notices.
- **Drift**: the job ran successfully for months, but the data it processes has grown. What used to take 2 seconds now takes 45 minutes and times out. It's been timing out for a week.
- **Missing runs**: the scheduler itself went down, or the container was evicted, or the deploy forgot to restart the worker process. The job simply doesn't execute. No error, no log, nothing.
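Silent failures and missing runs share one symptom: the last successful run is older than the schedule allows. A minimal sketch of that check follows, assuming each job keeps a record with a hypothetical `expectedIntervalMs` and `lastSuccessAt` field. These names are illustrative and do not come from any particular scheduler or library.

```typescript
// Hypothetical record: one per job, stored wherever you keep job state.
interface JobHeartbeat {
  name: string;
  expectedIntervalMs: number; // e.g. 24 hours for a daily job
  lastSuccessAt: number | null; // epoch milliseconds; null means "never succeeded"
}

// A job is overdue when its last success is older than the expected
// interval plus a grace period that absorbs normal scheduling jitter.
// Jobs that have never succeeded are reported as overdue too.
function findOverdueJobs(
  jobs: JobHeartbeat[],
  now: number,
  graceMs: number = 5 * 60 * 1000,
): string[] {
  return jobs
    .filter(
      (job) =>
        job.lastSuccessAt === null ||
        now - job.lastSuccessAt > job.expectedIntervalMs + graceMs,
    )
    .map((job) => job.name);
}
```

Because the check looks at elapsed time rather than at error logs, it also catches the case where the scheduler never invoked the job at all.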

What your team actually needs to know

For every background job in your system, you should be able to answer these questions from a dashboard — not from grepping CloudWatch logs:

1. **What jobs exist?** A complete list with their schedules and descriptions.
2. **When did each job last run?** If the answer is "3 weeks ago" for a daily job, that's your problem.
3. **Did it succeed or fail?** And if it failed, what was the error?
4. **How long did it take?** A job that usually takes 5 seconds but took 12 minutes last run is about to time out.
5. **Can I trigger it manually?** When something goes wrong, you need to re-run the job right now, not wait until 3 AM.
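The duration question can become an automatic check once run durations are recorded. The sketch below compares the latest run against the median of earlier runs. The median baseline, the 5x factor, and the minimum sample count are illustrative assumptions, not values taken from any tool.

```typescript
// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Flags a run whose duration exceeds `factor` times the median of the
// previous runs. With too little history it returns false rather than guess.
function isDurationAnomalous(
  previousDurationsMs: number[],
  latestDurationMs: number,
  factor: number = 5,
  minSamples: number = 5,
): boolean {
  if (previousDurationsMs.length < minSamples) return false;
  return latestDurationMs > median(previousDurationsMs) * factor;
}
```

A median is used here instead of a mean so that a single earlier slow run does not raise the baseline and hide the next one.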

Most teams can't answer a single one of these without SSH-ing into a server.

The pattern that actually works

// Register jobs at app startup
import { jobRegistry } from "./stablestack_admin_dashboard/jobs";

jobRegistry.register({
  name: "rotate-tokens",
  schedule: "0 3 * * *",
  handler: rotateExpiredTokens,
  description: "Refresh OAuth tokens expiring within 30 days",
});

jobRegistry.register({
  name: "sync-calendars",
  schedule: "*/15 * * * *",
  handler: syncCalendars,
  description: "Sync Google and Outlook calendar availability",
});

jobRegistry.register({
  name: "cleanup-expired-sessions",
  schedule: "0 0 * * *",
  handler: cleanupSessions,
  description: "Remove sessions older than 30 days",
});

The solution is a job registry — a central place where every background job is registered at startup with its name, schedule, handler, and description. The registry tracks runs automatically: when a job starts, when it finishes, whether it succeeded or failed, how long it took, and whether it was triggered by the scheduler or manually by an admin.
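To make the idea concrete, here is a rough in-memory sketch of such a registry. The method names `register`, `startRun`, `completeRun`, and `failRun` come from this article; the class name, the field names, and the `lastRun` helper are assumptions for illustration, and the generated StableStack code may be structured differently.

```typescript
type RunStatus = "running" | "success" | "failed";
type Trigger = "scheduled" | "manual";

interface JobDefinition {
  name: string;
  schedule: string; // cron expression, stored for display only
  handler: () => Promise<void>;
  description: string;
}

interface JobRun {
  jobName: string;
  trigger: Trigger;
  status: RunStatus;
  startedAt: number; // epoch milliseconds
  finishedAt?: number;
  durationMs?: number;
  error?: string;
}

// Hypothetical registry: keeps job definitions and their runs in memory.
class InMemoryJobRegistry {
  private jobs = new Map<string, JobDefinition>();
  private runs = new Map<string, JobRun[]>();

  register(job: JobDefinition): void {
    if (this.jobs.has(job.name)) {
      throw new Error(`Job already registered: ${job.name}`);
    }
    this.jobs.set(job.name, job);
    this.runs.set(job.name, []);
  }

  startRun(jobName: string, trigger: Trigger = "scheduled"): JobRun {
    const history = this.runs.get(jobName);
    if (!history) throw new Error(`Unknown job: ${jobName}`);
    const run: JobRun = { jobName, trigger, status: "running", startedAt: Date.now() };
    history.push(run);
    return run;
  }

  completeRun(run: JobRun): void {
    this.finish(run, "success");
  }

  failRun(run: JobRun, error: string): void {
    this.finish(run, "failed");
    run.error = error;
  }

  // Most recent run for a job, or undefined if it has never run.
  lastRun(jobName: string): JobRun | undefined {
    const history = this.runs.get(jobName) ?? [];
    return history[history.length - 1];
  }

  private finish(run: JobRun, status: RunStatus): void {
    run.status = status;
    run.finishedAt = Date.now();
    run.durationMs = run.finishedAt - run.startedAt;
  }
}
```

Rejecting unknown job names in `startRun` is a deliberate choice in this sketch: a renamed or deleted job then fails loudly at the call site instead of recording runs nobody can see.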

This isn't a monitoring service. It's not Datadog or PagerDuty. It's a module inside your application that your admin dashboard can query. Three API endpoints give you everything:

- `GET /api/admin/jobs`: list all registered jobs with schedule and last-run info
- `GET /api/admin/jobs/:name/runs`: run history for a specific job
- `POST /api/admin/jobs/:name/trigger`: manually trigger a job from the UI
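The routing behind those three endpoints is small enough to sketch without a web framework. The function below maps a method and path to a status and body. The `JobsBackend` interface, the response shape, and the 202 and 404 status codes are assumptions for illustration, and the sketch expects an authentication check to have run before it is called.

```typescript
interface JobSummary {
  name: string;
  schedule: string;
  lastStatus: string | null;
}

// Hypothetical read/trigger interface the routes depend on.
interface JobsBackend {
  listJobs(): JobSummary[];
  listRuns(name: string): unknown[] | undefined; // undefined when the job is unknown
  trigger(name: string): boolean; // false when the job is unknown
}

interface ApiResponse {
  status: number;
  body: unknown;
}

// Framework-agnostic routing for the three admin endpoints.
// Assumes the caller has already authenticated the request.
function handleAdminJobsRequest(
  backend: JobsBackend,
  method: string,
  path: string,
): ApiResponse {
  if (method === "GET" && path === "/api/admin/jobs") {
    return { status: 200, body: backend.listJobs() };
  }

  const runsMatch = /^\/api\/admin\/jobs\/([^/]+)\/runs$/.exec(path);
  if (method === "GET" && runsMatch) {
    const runs = backend.listRuns(decodeURIComponent(runsMatch[1]));
    return runs
      ? { status: 200, body: runs }
      : { status: 404, body: { error: "Unknown job" } };
  }

  const triggerMatch = /^\/api\/admin\/jobs\/([^/]+)\/trigger$/.exec(path);
  if (method === "POST" && triggerMatch) {
    const accepted = backend.trigger(decodeURIComponent(triggerMatch[1]));
    return accepted
      ? { status: 202, body: { accepted: true } }
      : { status: 404, body: { error: "Unknown job" } };
  }

  return { status: 404, body: { error: "Not found" } };
}
```

In an Express or FastAPI project the same three branches would become three route handlers; keeping the logic in a plain function makes it easy to test without starting a server.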

Wrapping job execution

// In your scheduler wrapper
const run = jobRegistry.startRun("rotate-tokens");
try {
  await rotateExpiredTokens();
  jobRegistry.completeRun(run);
} catch (e) {
  // Non-Error values can be thrown too, so don't assume e.message exists
  jobRegistry.failRun(run, e instanceof Error ? e.message : String(e));
  // Still throw or alert: the registry records, your alerting notifies
  throw e;
}

// For manual triggers from the admin UI, it's even simpler:
// POST /api/admin/jobs/rotate-tokens/trigger
// The registry handles startRun/completeRun/failRun automatically

Registration alone isn't enough. You need to wrap each job's execution so the registry tracks what actually happened. The pattern is three calls: `startRun` before the job executes, `completeRun` when it succeeds, and `failRun` when it throws. This gives you duration, status, and error messages for every execution — the data that makes the difference between "I think the cron job runs" and "the cron job ran at 3:02 AM, took 4.7 seconds, and succeeded."
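Repeating that try/catch around every job gets tedious, so the three calls can be folded into one reusable wrapper. The sketch below assumes a tracker object exposing the `startRun`, `completeRun`, and `failRun` methods described above; the `RunTracker` interface and the `withRunTracking` name are hypothetical.

```typescript
// Minimal view of the registry that the wrapper needs.
interface RunTracker<Run> {
  startRun(jobName: string): Run;
  completeRun(run: Run): void;
  failRun(run: Run, error: string): void;
}

// Returns a new handler that records every execution of `handler`.
// Errors are recorded and then rethrown so existing alerting still fires.
function withRunTracking<Run>(
  tracker: RunTracker<Run>,
  jobName: string,
  handler: () => Promise<void>,
): () => Promise<void> {
  return async () => {
    const run = tracker.startRun(jobName);
    try {
      await handler();
    } catch (err) {
      tracker.failRun(run, err instanceof Error ? err.message : String(err));
      throw err;
    }
    // Outside the try block so that a bookkeeping error in completeRun
    // is not misreported as a failure of the job itself.
    tracker.completeRun(run);
  };
}
```

A scheduler would then be handed `withRunTracking(jobRegistry, "rotate-tokens", rotateExpiredTokens)` instead of the raw handler, so no job can run untracked by accident.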

Why AI won't build this for you

Ask an AI assistant to "add cron job monitoring" and you'll get a suggestion to install a third-party service, or a database table with created_at timestamps and no actual integration with your scheduler. It won't know what jobs you have, where your scheduler is configured, or how to wrap execution with run tracking.

The hard part isn't the registry itself — it's the integration. You need to find every background task in the codebase, register it, wrap its execution, and wire the API endpoints into your admin dashboard behind auth. That's a project, not a prompt.

One command to install

$ stablestack add-admin-dashboard
Created stablestack_admin_dashboard/monitor.py
Created stablestack_admin_dashboard/models.py
Created stablestack_admin_dashboard/jobs.py          # Job registry + run tracking
Created stablestack_admin_dashboard/api.py
Created stablestack_admin_dashboard/ui/AdminDashboard.tsx
Created .claude/commands/admin-dashboard.md

The job monitoring system is built into StableStack's admin dashboard scaffold. It generates the registry, the API endpoints, and the React UI in one command.

What the Jobs tab gives you

**Job registry** — A `JobRegistry` class with `register()`, `startRun()`, `completeRun()`, `failRun()`, and `trigger()`. In-memory by default with a 50-run history per job. Plug in your database for persistent storage.
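One way to picture the capped in-memory history and the database plug-in point is a small storage interface with a bounded default implementation. The `RunStore` interface and `CappedMemoryRunStore` class below are hypothetical names for illustration and are not taken from the generated code; only the 50-run default mirrors the description above.

```typescript
interface StoredRun {
  jobName: string;
  status: "success" | "failed";
  startedAt: number; // epoch milliseconds
  durationMs: number;
}

// Hypothetical storage seam: a database-backed class could implement
// the same two methods to make run history persistent.
interface RunStore {
  append(run: StoredRun): void;
  recent(jobName: string): StoredRun[]; // oldest first, newest last
}

class CappedMemoryRunStore implements RunStore {
  private byJob = new Map<string, StoredRun[]>();
  private maxRunsPerJob: number;

  constructor(maxRunsPerJob: number = 50) {
    this.maxRunsPerJob = maxRunsPerJob;
  }

  append(run: StoredRun): void {
    const runs = this.byJob.get(run.jobName) ?? [];
    runs.push(run);
    // Drop the oldest entries once the cap is exceeded.
    if (runs.length > this.maxRunsPerJob) {
      runs.splice(0, runs.length - this.maxRunsPerJob);
    }
    this.byJob.set(run.jobName, runs);
  }

  recent(jobName: string): StoredRun[] {
    // Return a copy so callers cannot mutate the stored history.
    return [...(this.byJob.get(jobName) ?? [])];
  }
}
```

The cap keeps memory use bounded for frequent jobs, at the cost of losing all history on restart, which is the gap persistent storage closes.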

**Three API endpoints** — List all jobs with schedules and last-run info. View run history for any job (duration, status, errors, whether it was manual or scheduled). Trigger any job on demand from the admin UI.

**React dashboard tab** — A "Jobs" tab showing every registered job in a table: name, cron schedule, last run time, status badge (success/failed/running). Click a job to expand its run history. A "Trigger" button for each job lets admins re-run failed jobs immediately.

**Both languages** — Full implementations in Python (FastAPI + dataclasses) and TypeScript (Express + classes). Auto-detects your project language.

**Claude Code slash command** — Run /admin-dashboard and Claude finds every background task in your codebase, registers it with the job registry, wraps its execution, and wires up the dashboard.

Combine with the token manager scaffold and the admin dashboard auto-detects it — your token rotation cron job shows up in the Jobs tab with full run history.

What to do right now

Open your codebase and find every scheduled task. Check your crontab, your Kubernetes CronJobs, your Celery beat schedule, your EventBridge rules, your Procfile. For each one, ask:

1. When did this last run successfully?
2. Has it ever failed? What was the error?
3. Can I trigger it manually right now if I need to?

If you can't answer those questions from a dashboard in your application, you have invisible infrastructure. Things are running underneath you that you don't know about. And the next time one of them breaks, you'll find out the hard way.

Free with every install. No license key required.

pip install stablestack