Observable Systems: Logging, Debugging, and “Explainability” Under Failure
Observability isn’t “more logs.” It’s the ability to answer one question fast: What happened, where, and what the system believed at the moment it failed?
Definition (quote-ready)
When you can’t reconstruct what happened, every fix becomes guesswork, and every retry becomes superstition.
1) Why most automation systems feel “undebuggable”
The moment an automation becomes business-critical, it stops failing “cleanly.” It fails in partial states: an email sent but a sheet not updated; a report generated but the webhook didn’t confirm; a retry that duplicates output.
If your logs are unstructured and your system doesn’t carry a consistent execution identity, you cannot answer the only questions that matter under failure:
- Which run is this? (not “which day”) — the specific execution instance.
- What did it decide? (not “what it did”) — the branch the system chose and why.
- What was it trying to write to? (Google Sheet? Drive? Stripe? GitHub?)
- What boundary did it hit? (quota, permission, timeout, rate limit, payload size)
2) Three principles of observability that actually work
Principle A: Every run must have an identity (Run ID)
If you don’t have a stable runId, you don’t have observability — because you can’t join evidence across steps, tools, retries, and side effects.
// Node (core) generates a runId and passes it everywhere.
import crypto from "crypto";

const runId = crypto.randomUUID();
const ctx = {
  runId,
  system: "NSE",
  component: "report-engine",
  env: process.env.NODE_ENV || "prod",
  startedAt: new Date().toISOString()
};
// Every log line includes ctx.runId
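That identity also has to survive the hop into the glue layer. A minimal sketch of propagating ctx with every call into GAS, assuming Node 18+ (built-in fetch) and that the glue is deployed as a web app behind a placeholder URL:

// Node (core): every call into the GAS glue carries ctx, so both sides log the same runId
const GAS_WEBAPP_URL = process.env.GAS_WEBAPP_URL; // placeholder for your deployed web app URL

async function callGlue(stage, payload) {
  const res = await fetch(GAS_WEBAPP_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ...ctx, stage, payload }) // runId rides along on every request
  });
  if (!res.ok) throw new Error(`GLUE_${stage}_FAIL: HTTP ${res.status}`);
  return res.json();
}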
Principle B: Logs must be structured, not poetic
“Starting job…” is not a log. It’s a sentence. A log is a record you can sort, filter, aggregate, and compare across runs.
// Good log = event + fields (machine readable)
log.info({
  runId,
  event: "FETCH_OK",
  url: targetUrl,
  status: 200,
  bytes: html.length,
  ms: durationMs
});

// Bad log = human vibe, no data
console.log("Fetched page successfully!!");
Principle C: Every decision must emit its evidence
This is the difference between “it broke” and “we can reconstruct what it believed.” The system must log the inputs that triggered a branch — not just the branch name.
// Decision with evidence
log.info({
  runId,
  event: "DECISION",
  decision: "PRIMARY_BOTTLENECK",
  value: "INTERPRETATION_FRAGMENTATION",
  evidence: {
    competingPages: 3,
    intentOverlapScore: 0.82,
    hasPrimaryAnswerPage: false
  }
});
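Once every line is JSON and carries the runId, reconstructing a single run is just a filter. A minimal sketch, assuming the structured lines end up in a JSON-lines file (for example, stdout redirected to app.log):

// Node (core): pull every event for one run out of a JSON-lines log file (file name is illustrative)
import { readFileSync } from "fs";

function eventsForRun(logPath, runId) {
  return readFileSync(logPath, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.runId === runId);
}

const story = eventsForRun("app.log", runId);
// story: the ordered fetches, decisions, and evidence for this specific execution.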
3) Observability by layer: Node Core vs GAS Glue
In the NSE architecture, observability is easier because responsibilities are separated: Node owns decisions and state; GAS owns integrations. That gives you two clear log domains:
Node Core logs: decisions, the evidence behind them, run state, retries, and calls into the glue layer.
GAS Glue logs: external IO, Google API calls, trigger execution, quota boundaries, write results.
What GAS should log (minimum viable)
- runId + trigger type (time-based / webhook / manual)
- target resource (Sheet ID, Doc ID, folder ID)
- write outcome (rows written, doc URL, file URL)
- quota + time (how long it ran, how close to limits)
- error surface (exception message + stage)
GAS example: structured logging to a Sheet
// Apps Script (GAS)
function logEvent_(payload) {
  // payload example:
  // {runId:"...", event:"SHEET_WRITE_OK", rows:12, ms:340, stage:"WRITE_REPORT"}
  var ss = SpreadsheetApp.openById("YOUR_LOG_SHEET_ID");
  var sh = ss.getSheetByName("log") || ss.insertSheet("log");
  sh.appendRow([
    new Date(),
    payload.runId || "",
    payload.event || "",
    payload.stage || "",
    JSON.stringify(payload)
  ]);
}
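A usage sketch: wrap each write stage so both the outcome and the error surface are logged with the runId that came in from Node (everything beyond logEvent_ — the sheet ID, tab name, and wrapper name — is illustrative):

// Apps Script (GAS): illustrative wrapper around one write stage
function writeReportRows_(runId, rows) {
  var started = Date.now();
  try {
    var sheet = SpreadsheetApp.openById("YOUR_REPORT_SHEET_ID").getSheetByName("report");
    rows.forEach(function (r) { sheet.appendRow(r); });
    logEvent_({ runId: runId, event: "SHEET_WRITE_OK", stage: "WRITE_REPORT", rows: rows.length, ms: Date.now() - started });
  } catch (e) {
    // Error surface: what blew up and at which stage. Root cause is diagnosed later, from the log.
    logEvent_({ runId: runId, event: "SHEET_WRITE_FAIL", stage: "WRITE_REPORT", error: String(e), ms: Date.now() - started });
    throw e;
  }
}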
4) “Explainability under failure”: the only metric that matters
When something breaks, your user doesn’t care about elegance. They care about one thing: Can you tell me what happened and what to do next?
In a business-grade system, “explainability” is not a UX bonus. It is operational safety: it means your system can produce a plain-language failure explanation assembled from its own structured logs, not reconstructed from memory.
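A minimal sketch of what that assembly can look like, reusing the runId, stage, and event conventions above (the explainFailure function and its wording are illustrative, not a fixed API):

// Node (core): turn the structured events of one run into a human explanation
function explainFailure(events) {
  const fail = events.find((e) => e.event && e.event.endsWith("_FAIL"));
  if (!fail) return "No failure recorded for this run.";
  const lastDecision = [...events].reverse().find((e) => e.event === "DECISION");
  return [
    `Run ${fail.runId} failed at stage ${fail.stage || "unknown"} (${fail.event}).`,
    fail.error ? `Error surface: ${fail.error}.` : null,
    lastDecision ? `Last decision before failure: ${lastDecision.decision} = ${lastDecision.value}.` : null,
    "Side effects completed before this point are listed in the preceding *_OK events."
  ].filter(Boolean).join(" ");
}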
5) A practical playbook: build observability in 6 moves
- Emit a runId at the start of every execution and propagate it everywhere (see the wrapper sketch after this list).
- Log events, not sentences. Use stable event names like FETCH_OK, DECISION, WRITE_FAIL.
- Log evidence for decisions (the inputs that caused the branch).
- Make side effects idempotent (or at least detectable) to prevent duplicates on retry.
- Separate “error surface” from “root cause.” Log both.
- Generate a human explanation from structured logs (not from memory).
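Moves 1, 2, and 5 are easiest to enforce with one wrapper that every entry point goes through. A minimal sketch, assuming the structured log object from earlier (withRun, the RUN_START / RUN_OK / RUN_FAIL names, and generateReport are illustrative conventions, not a fixed API):

// Node (core): one wrapper so every execution gets an identity and a lifecycle
import crypto from "crypto";

async function withRun(component, main) {
  const runId = crypto.randomUUID();
  const startedAt = Date.now();
  log.info({ runId, event: "RUN_START", component });
  try {
    const result = await main(runId);
    log.info({ runId, event: "RUN_OK", component, ms: Date.now() - startedAt });
    return result;
  } catch (err) {
    // Error surface logged here; root cause is worked out later from the full event trail.
    log.error({ runId, event: "RUN_FAIL", component, ms: Date.now() - startedAt, error: String(err) });
    throw err;
  }
}

// Usage: withRun("report-engine", (runId) => generateReport(runId));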
6) The underrated companion: idempotency
Observability without idempotency is pain: even if you can explain a failure, you still can’t retry it safely.
At minimum, make every side effect “detectable”:
// Example: deterministic output key
// If the same runId tries to write twice, it should overwrite or skip, not duplicate.
const outputKey = `${siteUrl}::${runId}`;
// Store outputKey in a log store / sheet / db.
// Before creating a new file, check if outputKey already exists.
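A minimal sketch of that “detectable” check on the GAS side, assuming the same log spreadsheet doubles as an output registry (the outputs tab and helper names are illustrative):

// Apps Script (GAS): skip or overwrite instead of duplicating
function alreadyWritten_(outputKey) {
  var sh = SpreadsheetApp.openById("YOUR_LOG_SHEET_ID").getSheetByName("outputs");
  if (!sh) return false;
  var keys = sh.getRange(1, 1, Math.max(sh.getLastRow(), 1), 1).getValues();
  return keys.some(function (row) { return row[0] === outputKey; });
}

function recordOutput_(outputKey, fileUrl) {
  var ss = SpreadsheetApp.openById("YOUR_LOG_SHEET_ID");
  var sh = ss.getSheetByName("outputs") || ss.insertSheet("outputs");
  sh.appendRow([new Date(), outputKey, fileUrl]);
}

// Before creating the report file:
// if (alreadyWritten_(outputKey)) { skip, or return the existing URL } else { create it, then recordOutput_(outputKey, url) }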
Internal Loop (Hub ⇄ King ⇄ Supporting)
King Page
Definition, scope, and the “why” behind NSE.
https://daphnetxg.com/nse/node-systems-engineering/
Supporting 1
Positioning: why I avoid Make/Zapier for real systems.
https://daphnetxg.com/nse/node-vs-make-zapier/
Supporting 2
Architecture pattern: Node Core + GAS Glue.
https://daphnetxg.com/nse/node-gas-glue-architecture/