A battery storage site does not have one state of charge. It has one per pack, each reported independently, arriving out of order over a message queue, and drifting as the hardware ages. The control system has to turn those into a single number it can act on, and the ways that goes wrong are more interesting than the happy path.

The data arrives concurrently and out of order

Telemetry comes in over a queue with several consumers running in parallel (eight, in our case), each handling messages from different cabinets. Two reports about the same plant can be processed at the same moment by different workers, and a later reading can land before an earlier one. Any aggregate computed from them has to tolerate that. The state of charge you store has to be a reconciliation across all packs, not whichever message a worker happened to see last.
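One way to make that reconciliation concrete is to keep the newest reading per pack, judged by the timestamp taken at the pack and not by arrival order, behind a lock that every worker shares. The sketch below is illustrative only: `Reading`, `PackStore`, and the field names are hypothetical, and it assumes each message carries a pack identifier and a source timestamp.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    # Hypothetical message shape; the field names are assumptions.
    pack_id: str
    timestamp: float  # seconds, stamped at the pack, not at arrival
    soc: float        # state of charge as a fraction, 0.0 to 1.0


class PackStore:
    """Keeps the newest reading per pack, safe to share between workers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}

    def offer(self, reading):
        """Store the reading unless a newer or equal one is already held.

        Returns True if the reading was accepted, False if it was stale.
        """
        with self._lock:
            current = self._latest.get(reading.pack_id)
            if current is not None and current.timestamp >= reading.timestamp:
                return False
            self._latest[reading.pack_id] = reading
            return True

    def snapshot(self):
        """Return a copy of the per-pack state for aggregation."""
        with self._lock:
            return dict(self._latest)
```

Any plant-level number is then computed from `snapshot()`, so a late-arriving older message cannot overwrite a newer one, whichever worker handles it.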

Aggregation hides sensor drift

The obvious aggregate is the mean across packs, and it works until a pack's sensor drifts, which aging packs do. A mean trusts every input equally, so one bad sensor pulls the whole plant's reported state of charge off true, slowly enough that nobody notices until a charge or discharge decision is made on the wrong number. A robust version rejects outliers or weights packs by confidence. A simple version trusts them all. Both are defensible, but you should know which one you are running before it costs you, because the failure is silent.
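To show the difference, here is a small sketch that compares a plain mean with one that drops readings far from the median. The function name `robust_soc` and the fixed `max_deviation` threshold of 0.05 are assumptions for illustration; a real system might weight by confidence or use a statistical spread measure instead.

```python
from statistics import mean, median


def robust_soc(readings, max_deviation=0.05):
    """Mean of the readings that sit within max_deviation of the median.

    Returns (aggregate, rejected_count) so the rejection is visible
    to the caller and does not stay silent.
    """
    center = median(readings)
    kept = [r for r in readings if abs(r - center) <= max_deviation]
    return mean(kept), len(readings) - len(kept)


# Eight packs, one of which has drifted high (made-up values).
packs = [0.62, 0.61, 0.63, 0.62, 0.60, 0.61, 0.62, 0.80]

naive = mean(packs)                    # about 0.639, pulled up by one pack
robust, rejected = robust_soc(packs)   # about 0.616, with 1 reading rejected
```

Returning the rejected count alongside the aggregate is a deliberate choice here: the number of discarded packs is itself a signal worth alarming on.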

State is three tracks, not one

"Charging" is not a single flag. We track a charge state (idle, charging, discharging), an alarm state that adds pre-warning and fault levels, and a grid-connection state (grid-tied or islanded). The control logic reads all three together. A discharge is permitted only when the site is grid-tied and not already charging, and that interlock is what keeps energy from flowing back where it must not. Collapse these into one status field and you eventually get a system that believes it can discharge during a fault, because the field that would have told it otherwise no longer exists.
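A sketch of the three tracks as separate types, with the discharge interlock as one function that reads all of them. The names are hypothetical. The `NORMAL` alarm level, the choice to block discharge only at `FAULT` (not at pre-warning), and the fault check itself are assumptions drawn from the paragraph above, not a statement of the actual rules.

```python
from dataclasses import dataclass
from enum import Enum


class ChargeState(Enum):
    IDLE = "idle"
    CHARGING = "charging"
    DISCHARGING = "discharging"


class AlarmState(Enum):
    NORMAL = "normal"            # assumed baseline level
    PRE_WARNING = "pre_warning"
    FAULT = "fault"


class GridState(Enum):
    GRID_TIED = "grid_tied"
    ISLANDED = "islanded"


@dataclass(frozen=True)
class SiteState:
    charge: ChargeState
    alarm: AlarmState
    grid: GridState


def discharge_permitted(state):
    """Interlock: grid-tied, not charging, and (assumed) no active fault."""
    return (
        state.grid is GridState.GRID_TIED
        and state.charge is not ChargeState.CHARGING
        and state.alarm is not AlarmState.FAULT
    )
```

Because each track is its own field, a fault cannot be overwritten by a change in charge state, which is the failure a single status field invites.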

The barrier that can hang forever

The hourly rollups run several jobs in parallel and wait for all of them on a countdown barrier before publishing. The dangerous failure mode is the one the happy path conceals. If any job dies or times out without signaling the barrier, the wait never completes, and the rollup blocks indefinitely while the dashboard shows stale numbers. A parallel barrier with no timeout is a deadlock waiting for a bad day. Every wait needs a bound, and every bound needs a defined behavior for when it trips, including which partial results you trust and which you discard.
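A bounded wait might look like the sketch below. It swaps the countdown barrier for `concurrent.futures.wait` from the Python standard library, which takes a timeout and reports which jobs finished and which did not. The `run_rollup` name and the policy of discarding anything that failed or timed out are assumptions, and `cancel_futures` requires Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_rollup(jobs, timeout_s):
    """Run named jobs in parallel and wait at most timeout_s for all of them.

    jobs maps a job name to a zero-argument callable.
    Returns (results, failed): results holds the values of jobs that
    completed cleanly, failed lists jobs that raised or did not finish.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(jobs)))
    futures = {pool.submit(fn): name for name, fn in jobs.items()}

    # The bound: this returns after timeout_s even if some jobs never finish.
    done, not_done = wait(futures, timeout=timeout_s)

    results, failed = {}, []
    for fut in done:
        name = futures[fut]
        if fut.exception() is not None:
            failed.append(name)
        else:
            results[name] = fut.result()
    for fut in not_done:
        fut.cancel()  # only stops jobs that have not started running
        failed.append(futures[fut])

    # Do not block on stragglers; a running thread cannot be force-killed.
    pool.shutdown(wait=False, cancel_futures=True)
    return results, sorted(failed)
```

The caller still has to decide what a non-empty `failed` list means: publish the partial rollup with a flag, or hold the previous hour and raise an alert. The timeout only guarantees that the decision gets made.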

The theme underneath all of these

Each of these is the same lesson in a different place. The system looks correct on clean, in-order, single-sensor data, and the real behavior only shows up under concurrency, drift, and partial failure. The engineering that matters is the handling of those cases, not the arithmetic that runs when everything cooperates.

Where AgentKick fits

We build the data and control layers for industrial and energy systems, where correctness under concurrency, sensor uncertainty, and partial failure is the actual job. If you are taking a monitoring or control platform from "works on the bench" to "trustworthy in the field," that is the work we do, usually as a short scoping engagement into a phased build.

Tracking battery state of charge when eight sensors disagree