Monitoring Your Solo Infrastructure Without a DevOps Team
Every production system fails eventually. The question is whether you find out from your monitoring or from a customer.
If you're running a solo builder stack — a Mac Mini in a closet, a couple of Cloudflare Tunnels, a PostgreSQL database, maybe a handful of cron jobs that keep things moving — you don't need Datadog. You don't need PagerDuty. You need three things: a dead man's switch, an uptime checker, and a shell script that knows what to look for.
Total cost: $0. Total setup time: about an hour.
Dead Man's Switch Monitoring
The most important type of monitoring isn't checking whether something is up. It's checking whether something stopped running.
Your nightly backup cron, the script that rotates logs, the job that syncs data between machines — these run silently in the background until they don't. Traditional uptime monitoring can't catch a cron job that quietly stopped executing. There's nothing to ping. The absence of activity is the signal, and you need a tool that watches for silence.
Healthchecks.io solves this with a dead man's switch pattern. You create a check, get a unique ping URL, and add a curl to the end of your cron job:
0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 https://hc-ping.com/your-uuid-here
If that URL doesn't receive a ping within the expected schedule, healthchecks.io sends you an alert. Email, Slack, SMS, or a webhook — your choice. The free tier gives you 20 checks with unlimited pings and a 1-minute resolution. More than enough for a solo operation.
The && matters. It means the ping only fires if the backup script exits with status 0. A failed backup doesn't phone home, and the absence of a ping is what triggers the alert. No custom error handling needed. The shell's own exit code logic does the work.
Healthchecks.io also accepts failure pings explicitly, if you want to report errors with context. Cron doesn't support line continuations, so the whole entry stays on one line:
0 2 * * * OUT=$(/usr/local/bin/backup.sh 2>&1) && curl -fsS --retry 3 https://hc-ping.com/your-uuid-here || curl -fsS --retry 3 --data-raw "$OUT" https://hc-ping.com/your-uuid-here/fail
The /fail endpoint marks the check as failed immediately instead of waiting for a missed window. The dashboard shows you which jobs succeeded, which failed, and which went silent. Three states, all of them useful.
Every cron job I run phones home to healthchecks.io. Database backups, certificate renewal checks, disk cleanup scripts, data syncs. If any of them stop running — because the server rebooted, because a path changed, because a dependency broke — I get a text within the hour.
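Once several jobs phone home, the success/fail boilerplate is worth factoring out. Here's a minimal wrapper sketch; the script name and usage are my own convention, not a healthchecks.io tool:
#!/bin/bash
# run-and-ping.sh: run any command, report the outcome to healthchecks.io
# Usage: run-and-ping.sh <ping-uuid> <command> [args...]
UUID="$1"; shift
OUTPUT=$("$@" 2>&1)
if [ $? -eq 0 ]; then
    curl -fsS --retry 3 "https://hc-ping.com/${UUID}" > /dev/null
else
    # Ship the captured output so the failure carries context in the dashboard
    curl -fsS --retry 3 --data-raw "$OUTPUT" "https://hc-ping.com/${UUID}/fail" > /dev/null
fi
Cron entries then shrink to:
0 2 * * * /usr/local/bin/run-and-ping.sh your-uuid-here /usr/local/bin/backup.sh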
Uptime Monitoring
Dead man's switches cover background jobs. For services that respond to HTTP requests, you need something that actively pings them.
UptimeRobot's free tier covers 50 monitors at 5-minute intervals. That's enough to watch every public endpoint most solo builders will ever run. Websites, APIs, webhook receivers, admin panels behind Cloudflare Access — anything that returns an HTTP status code.
Setup takes about two minutes per monitor. You give it a URL, tell it what a healthy response looks like (status 200, or a keyword on the page, or a specific header), and choose where alerts go. I have mine wired to email and a Slack channel. If any endpoint goes down for more than 5 minutes, I know about it.
The less obvious gap: internal services on Tailscale. UptimeRobot's probes run from its own network, so they can't reach private tailnet endpoints. If you want those watched too, run a self-hosted checker like Uptime Kuma on a machine inside the tailnet, or point a local cron at the private endpoint and report the result to healthchecks.io. For most solo builders, though, the hosted free tier watching public URLs covers the critical path.
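A sketch of that local-cron pattern, assuming a placeholder MagicDNS hostname, port, and check UUID:
*/5 * * * * curl -fsS http://mac-mini.your-tailnet.ts.net:1421/health > /dev/null && curl -fsS https://hc-ping.com/internal-check-uuid
If the private service stops answering, the ping never fires, and the dead man's switch raises the alert.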
Two things UptimeRobot checks that you'll appreciate later: SSL certificate expiry (warns you 30 days before expiration) and response time trends. I caught a PostgreSQL connection pool issue because UptimeRobot showed my API response times climbing from ~200ms to ~1400ms over a week. No errors, no downtime — just a slow degradation that I would have missed without the graph.
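You can also spot-check response time by hand with curl's timing variables, which is useful for a quick before/after when you're tuning something (the URL is a placeholder):
curl -o /dev/null -sS -w 'status=%{http_code} total=%{time_total}s\n' https://yourdomain.com/api/health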
The Health Check Script
Third-party monitors can tell you whether a URL responds. They can't tell you whether your disk is 94% full, whether your database has 47 idle connections eating memory, or whether your TLS certificate expires in 72 hours. For that, you need a script that runs on the machine and checks what matters to your specific setup.
Here's a health check script that covers the fundamentals. It runs on the server, checks five things, and sends results to healthchecks.io so you get alerted on failure:
#!/bin/bash
# server-health.sh — runs every 15 minutes via cron
# Checks: disk, postgres, certificates, memory, key services
HEALTH_PING="https://hc-ping.com/your-health-uuid"
FAILURES=""
# 1. Disk usage — alert if the root filesystem exceeds 85%
DISK_USAGE=$(df -h / | awk 'NR==2 {gsub(/%/,""); print $5}')
if [ "$DISK_USAGE" -gt 85 ]; then
FAILURES="${FAILURES}Disk at ${DISK_USAGE}%. "
fi
# 2. PostgreSQL — can we connect and run a query?
if ! psql -U postgres -c "SELECT 1" &>/dev/null; then
FAILURES="${FAILURES}PostgreSQL unreachable. "
else
# Check connection count
CONN_COUNT=$(psql -U postgres -t -c \
"SELECT count(*) FROM pg_stat_activity" 2>/dev/null | tr -d ' ')
if [ "$CONN_COUNT" -gt 80 ]; then
FAILURES="${FAILURES}PostgreSQL connections: ${CONN_COUNT}. "
fi
fi
# 3. TLS certificate expiry — warn if under 14 days
DOMAIN="yourdomain.com"
EXPIRY=$(echo | openssl s_client -servername "$DOMAIN" \
-connect "$DOMAIN":443 2>/dev/null \
| openssl x509 -noout -enddate 2>/dev/null \
| cut -d= -f2)
if [ -n "$EXPIRY" ]; then
EXPIRY_EPOCH=$(date -j -f "%b %d %T %Y %Z" "$EXPIRY" "+%s" 2>/dev/null)
NOW_EPOCH=$(date "+%s")
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
if [ "$DAYS_LEFT" -lt 14 ]; then
FAILURES="${FAILURES}TLS cert expires in ${DAYS_LEFT} days. "
fi
fi
# 4. Memory pressure — available memory below 512MB
# Read the page size from vm_stat itself: 16384 on Apple Silicon, 4096 on Intel
AVAIL_MB=$(vm_stat | awk '/page size of/ {ps=$8} \
    /Pages free/ {free=$3} /Pages inactive/ {inactive=$3} \
    END {gsub(/\./,"",free); gsub(/\./,"",inactive); if (ps == "") ps = 4096; \
    printf "%d", (free+inactive)*ps/1048576}')
if [ "$AVAIL_MB" -lt 512 ]; then
FAILURES="${FAILURES}Available memory: ${AVAIL_MB}MB. "
fi
# 5. Key services — are they listening?
for PORT in 2368 5432 1421; do
if ! lsof -i ":$PORT" -sTCP:LISTEN &>/dev/null; then
FAILURES="${FAILURES}Nothing listening on port ${PORT}. "
fi
done
# Report results
if [ -z "$FAILURES" ]; then
curl -fsS --retry 3 "$HEALTH_PING"
else
curl -fsS --retry 3 --data-raw "$FAILURES" "$HEALTH_PING/fail"
fi
Schedule it with cron:
*/15 * * * * /usr/local/bin/server-health.sh
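Make it executable and run it once by hand before trusting it to cron, then confirm the ping showed up on the healthchecks.io dashboard:
chmod +x /usr/local/bin/server-health.sh
/usr/local/bin/server-health.sh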
That's roughly fifty lines doing the work of a monitoring agent that would cost $15-$30/month. It checks disk, database connectivity and connection count, certificate expiry, memory, and whether your key services are listening. If any check fails, it sends the failure details to healthchecks.io, which texts you. If everything passes, it pings the success endpoint, and the dead man's switch stays green.
The script is intentionally simple. No dependencies beyond standard Unix tools and psql. No configuration files. No log parsing. Everything it checks is something that has actually caused me downtime in the past. Disk filled up from unrotated logs. PostgreSQL stopped accepting connections because of a connection leak. A Let's Encrypt cert expired because the renewal cron had a broken path. Each check exists because it caught or would have caught a real incident.
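The same pattern extends to whatever else your stack leans on. If Cloudflare Tunnels are in the mix, for instance, a sixth check is three lines; this is a sketch assuming the standard cloudflared process name:
# 6. Cloudflare Tunnel: is cloudflared still running?
if ! pgrep -x cloudflared > /dev/null; then
    FAILURES="${FAILURES}cloudflared not running. "
fi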
What This Looks Like in Practice
My monitoring stack for a two-machine setup (MacBook Pro for development, Mac Mini as production server):
- Healthchecks.io (free tier): 8 checks covering nightly backups, log rotation, database maintenance, data syncs, and the health check script itself.
- UptimeRobot (free tier): 6 monitors watching public endpoints — the blog, an API, a webhook receiver, and their HTTPS certificates.
- server-health.sh: Runs every 15 minutes on the Mac Mini. Checks disk, PostgreSQL, TLS, memory, and 3 service ports.
Monthly cost: $0. False alerts per month: one or two, usually from a brief network blip that resolves before I check my phone. Real alerts caught in the past 90 days: a disk at 91% from accumulated Docker images, a PostgreSQL connection count creeping toward the limit, and a service that didn't restart after an OS update.
All three would have become outages. None of them did.
What You Don't Need
Enterprise monitoring tools are built for enterprise problems. Datadog, New Relic, Grafana Cloud — they're designed for teams running hundreds of services across dozens of machines with multiple people who need dashboards and role-based access and audit trails.
If you're a solo builder running three to five services on one or two machines, you don't need dashboards. You need a text message when something breaks. You need to know when a cron job stops running. You need a script that checks the five things most likely to cause downtime on your specific hardware.
The temptation is to over-engineer monitoring the same way teams do — install Prometheus, set up Grafana, build dashboards with 47 panels, configure alert rules with escalation chains. It feels productive. It's not. You're building infrastructure to watch infrastructure, and you're the only one who'll ever look at it.
The shell script approach scales down to where you actually are. One machine, one script, one phone number that gets a text when something goes wrong. When your setup grows to the point where that isn't enough, you'll know — because the script will be the thing that tells you.
The Catch
This approach has a gap: it doesn't do application-level monitoring. If your API starts returning wrong data but with a 200 status code, none of these tools will catch it. If a background job runs successfully but produces incorrect output, the dead man's switch still pings green.
For that, you need application-specific assertions in your health checks. If your API should always return a certain key in its JSON response, check for it. If your backup script should produce a file larger than 1MB, verify the file size. These checks are unique to your application and can't be generalized into a template. They're the checks you add after something slips through — each one a scar from a near-miss.
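As a concrete sketch, here are two such assertions that could bolt onto server-health.sh; the endpoint, JSON key, and backup path are placeholders, not prescriptions:
# API sanity: does the health endpoint return the JSON key we expect?
if ! curl -fsS https://yourdomain.com/api/health | grep -q '"status"'; then
    FAILURES="${FAILURES}API health payload missing status key. "
fi
# Backup sanity: is last night's dump plausibly sized (over 1MB)?
LATEST_BACKUP=$(ls -t /var/backups/*.sql.gz 2>/dev/null | head -1)
if [ -z "$LATEST_BACKUP" ] || [ "$(stat -f %z "$LATEST_BACKUP" 2>/dev/null || echo 0)" -lt 1048576 ]; then
    FAILURES="${FAILURES}Latest backup missing or under 1MB. "
fi
stat -f %z is the macOS form; on Linux, swap in stat -c %s.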
The other limitation: 5-minute polling means a 3-minute outage can slip between checks entirely, or get detected only after it has already resolved. For most solo builder operations, that's fine. If you need sub-minute detection, you're solving a different problem, and probably need a different monitoring architecture.
Both of these are acceptable tradeoffs for a monitoring stack that costs nothing and takes an hour to set up.
Starting from Zero
If you don't have any monitoring today, start here:
- Sign up for healthchecks.io. Create one check for your most important cron job. Add the curl ping. Verify an alert actually reaches you (the one-liner after this list forces a failure so you don't have to wait for the window to expire).
- Sign up for UptimeRobot. Add your primary public URL. Confirm the alert reaches your phone.
- Copy the health check script above. Edit the ports, the domain, and the disk threshold to match your setup. Run it manually once. Then add it to cron.
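To force that first healthchecks.io alert on demand, hit the /fail endpoint by hand:
curl -fsS https://hc-ping.com/your-uuid-here/fail
The check flips to failed immediately and fires whatever notification channels you configured.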
Thirty minutes of work. After that, your infrastructure tells you when it needs attention instead of you remembering to check.
Enterprise teams have on-call rotations, incident commanders, and escalation matrices. Solo builders have a phone that buzzes when a shell script finds something wrong. One of those costs six figures a year in staffing. The other costs nothing and catches the same failures that actually take down small deployments.
No dashboards. No agents. No monthly bill. A cron job, a curl, and a phone that only rings when it matters.