MAY 27/Integration & Testing/4 MIN READ

Integration testing is the worst place to find an integration bug

Dan Zaidenband

In February 2000, an engineer named Boris Smeds ran yet another in-flight test of the radio link between Cassini and the Huygens probe, which was still bolted to the orbiter's hull. He found that the orbiter's receiver had not been programmed to handle the Doppler shift the probe's signal would carry during atmospheric descent. Nearly five years before Huygens dropped, the bug already existed in the integrated system; the point at which it was found was a choice.

The fix, replanning the trajectory so that the Doppler shift at descent fell inside the receiver's tracking window, cost months of mission planning and no hardware. Had the bug surfaced nearly five years later, when Huygens entered Titan's atmosphere, every byte of the probe's data would have streamed back to a deaf orbiter. ESOC caught it because they ran mid-cruise communications tests as standard practice, not as a phase gate. Most missions get one chance.

The default is integration as a phase

The industry-default schedule treats integration as a phase. Subsystems are built. They are unit-tested. There is a date on the Gantt chart that says "begin integration," and on that date the components arrive in a clean room and the team starts plugging them together. We have sat through enough program reviews to know that the integration phase is also where the schedule slips first. The programs we have visited routinely budget six weeks for it and run twelve.

The slips have a common shape. Subsystem A sends a telemetry word big-endian; subsystem B reads it little-endian. A's vendor updated the message catalog to revision C and forgot to send a copy, so B was built against the older revision. A subcontractor's harness reused the pinout from the previous program. None of these are exotic. They are the routine of integration testing on a real spacecraft, and all of them could have been caught months earlier in software, if anyone had tried.
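The byte-order case is small enough to show in a few lines. The following is a minimal sketch, not any program's actual code: the 16-bit word and its value are invented for illustration, and `byte_order_agrees` is a hypothetical helper name.

```python
import struct

# Hypothetical 16-bit telemetry word; the value is made up for illustration.
raw_counts = 0x0102

# Subsystem A packs the word big-endian.
wire_bytes = struct.pack(">H", raw_counts)

# Subsystem B, built against a different assumption, unpacks little-endian.
(decoded_by_b,) = struct.unpack("<H", wire_bytes)


def byte_order_agrees(pack_fmt: str, unpack_fmt: str, value: int) -> bool:
    """Return True if a value survives a pack/unpack round trip."""
    (decoded,) = struct.unpack(unpack_fmt, struct.pack(pack_fmt, value))
    return decoded == value


print(hex(decoded_by_b))                          # 0x201
print(byte_order_agrees(">H", "<H", raw_counts))  # False
print(byte_order_agrees(">H", ">H", raw_counts))  # True
```

One caveat on the design: a test value whose bytes are identical, such as 0x0101, would pass this check under either byte order, so the sample values need to be asymmetric.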

The cost asymmetry is brutal

A bug caught in simulation costs minutes. The simulator complains, an engineer reads the diff, the message catalog updates, the simulator runs again. The same bug caught at the integration bench costs an order of magnitude more, because the team is in clean-room hours, the harness is built, and the schedule is gated. The same bug caught at orbital insertion costs the mission.

We watched a smallsat program lose four weeks of integration time to a scale-factor disagreement on a battery voltage telemetry word. Different revisions of the ICD spreadsheet had different scaling. The bench saw garbage values. The team flailed for two weeks before they found the diff in the spreadsheet history. A SIL run wired to the same source of truth, any source of truth, would have flagged it on the first commit. Instead, the bench was the first machine to ever check that the two halves of the program agreed on units.
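As a rough illustration of how cheap that check can be, the sketch below diffs two revisions of a field definition and shows what each side would compute from the same raw counts. Everything in it is assumed for the example: the `TelemetryField` type, the `BATT_V` field name, and both scale factors are hypothetical, not taken from the program described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryField:
    name: str
    scale: float  # engineering units per raw count
    units: str


# Hypothetical ICD revisions; the field name and scale factors are invented.
ICD_REV_A = {"BATT_V": TelemetryField("BATT_V", scale=0.01, units="V")}
ICD_REV_B = {"BATT_V": TelemetryField("BATT_V", scale=0.001, units="V")}


def find_disagreements(left, right):
    """Return names of shared fields whose definitions differ."""
    shared = left.keys() & right.keys()
    return sorted(name for name in shared if left[name] != right[name])


def to_engineering(field, raw_counts):
    """Convert raw counts to engineering units using one revision's scaling."""
    return raw_counts * field.scale


raw = 2810  # made-up raw counts off the bus
print(find_disagreements(ICD_REV_A, ICD_REV_B))   # ['BATT_V']
print(to_engineering(ICD_REV_A["BATT_V"], raw))   # about 28.1 V
print(to_engineering(ICD_REV_B["BATT_V"], raw))   # about 2.81 V
```

A factor-of-ten error like this one is plausible enough on a voltage channel that a person at the bench may not spot it immediately, which is one reason to let a machine compare the definitions.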

That story is unremarkable. It is a class of bug. We have seen the same shape on three programs in the last twelve months. The pattern is not "integration testing is hard." The pattern is "integration testing is being used to catch the bugs that should have been caught earlier."

Daily SIL is not exotic

The automotive industry runs SIL on engine ECUs on every commit. Aerospace primes have run software-in-the-loop for decades on military programs; the F-22 and F-35 software pipelines depend on it. The defense embedded community treats SIL as table stakes.

What we hear in space is that SIL is too expensive to run continuously. The version of "expensive" we hear is some combination of: the framework took a year to stand up, the simulator only runs on one machine in the lab, the bus-level fidelity is too low to catch the things we care about, and the team that built it has rotated out. All of these are real. None of them are reasons SIL is intrinsically hard. They are reasons that this team's SIL is hard, on this program. Most are downstream of the upstream problem we wrote about last cycle: when the interface model is a spreadsheet, the simulator is a bespoke build, and a bespoke build does not get run nightly.

Every integration issue caught in simulation is an integration issue not caught at the bench.

What the daily-practice version looks like

The teams we have seen do this well share a workflow that is simpler than it sounds. The bus and message definitions live in code, in a single repository. The flight software's bus drivers, the SIL framework's mocks, and the integration bench's checkers all import from that repository. A change to the catalog is a pull request. CI runs a simulation of the affected message paths and checks that subscribers can decode what publishers produce. A scale factor change on a telemetry word breaks the SIL test before the PR is merged.
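A minimal sketch of that CI check, under stated assumptions, might look like the following. The catalog layout, the `MessageDef` type, the message names, and the half-count tolerance are all hypothetical choices made for the example; a real catalog would carry far more than a format string and a scale factor.

```python
import struct
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MessageDef:
    fmt: str      # struct format string, including byte order
    scale: float  # engineering units per raw count


# Hypothetical catalog: the single definition both sides import.
CATALOG = {
    "BATT_V": MessageDef(fmt=">H", scale=0.01),
    "WHEEL_RPM": MessageDef(fmt=">h", scale=0.5),
}


def encode(defn, value):
    """Publisher side: engineering value -> bytes on the bus."""
    return struct.pack(defn.fmt, round(value / defn.scale))


def decode(defn, payload):
    """Subscriber side: bytes on the bus -> engineering value."""
    (raw,) = struct.unpack(defn.fmt, payload)
    return raw * defn.scale


def failing_paths(pub_catalog, sub_catalog, samples):
    """Return message names that do not round-trip to within half a count."""
    failures = []
    for name, value in samples.items():
        pub, sub = pub_catalog[name], sub_catalog[name]
        decoded = decode(sub, encode(pub, value))
        if abs(decoded - value) > pub.scale / 2:
            failures.append(name)
    return failures


SAMPLES = {"BATT_V": 28.1, "WHEEL_RPM": -150.0}

# Both sides import the same catalog: every path round-trips.
print(failing_paths(CATALOG, CATALOG, SAMPLES))  # []

# Simulate a subscriber still built against a stale scale factor.
stale = dict(CATALOG, BATT_V=replace(CATALOG["BATT_V"], scale=0.001))
print(failing_paths(CATALOG, stale, SAMPLES))    # ['BATT_V']
```

In a pipeline, a non-empty result from `failing_paths` would fail the build, so a catalog change that a subscriber cannot decode is stopped at the pull request.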

This is not a research program. It is the default in firmware shops shipping to medical devices, in automotive Tier-1s, and in the better-run defense embedded teams. We are arguing that the space industry's reluctance to adopt the pattern is not a tooling reluctance but a budget-line reluctance. SIL infrastructure is scope-zero on most space programs, the same as interface registries. Programs do not fund what they assume already exists.

The incentive problem

Schedule slips at integration are visible. The cost of preventing them is invisible. A program manager who hires a SIL engineer six months before the integration phase looks like overhead. The same program manager who absorbs eight weeks of integration delay looks like a victim of a hard problem. Nobody gets promoted for the bug that did not happen. Until program leads start treating SIL infrastructure as a critical-path artifact rather than a phase-three convenience, the cost asymmetry above will continue to be paid in months instead of minutes.

Implication

The honest test for whether your integration practice is daily or milestone-gated is mechanical. Pick a message in your bus catalog. Change the scale factor. How many places in your code, your bench, and your simulator have to be edited by hand before the change is consistent? If the answer is one, you are running daily. If the answer is more than three, integration is going to be the place you find out.

Smeds caught the Cassini-Huygens bug because ESA had a culture of running mid-cruise communications tests as a matter of course, not as a milestone. The mission survived. Building a test culture that does not use the integration bench as the first place a problem becomes observable is the highest-payoff move a program lead can make, and it is almost always cheaper than the schedule slip it prevents.
