The goal of operational testing (OT) is to evaluate the effectiveness and suitability of military systems for use by trained military users in operationally realistic environments. Operators perform missions and make systems function. Thus, adequate OT must assess not only system performance and technical capability across the operational space, but also the quality of human-system interactions. Software systems in particular pose a unique challenge to testers. While some software systems may be inherently deterministic, once they are placed in their intended environment with error-prone humans and highly stochastic networks, outcomes often vary, so tests often need to support both finding “bugs” and characterizing variability. This document outlines common statistical techniques for planning tests of system performance for software systems and then discusses how testers might integrate human-system interaction metrics into that design and evaluation.

System Performance

Before deciding which class of statistical design techniques to apply, testers should consider whether the system under test is deterministic (repeating a process with the same inputs always produces the same output) or stochastic (even with the inputs fixed, repeating the process could produce a different result). Software systems, such as a calculator, may intuitively be deterministic, and as standalone entities in a pristine environment they are. However, there are other sources of variation to consider when testing such a system in an operational environment with its intended users. If the calculator is intended to be used by scientists in Antarctica, then temperature, lighting conditions, and user clothing such as gloves could all affect the users’ ability to operate the system.

Combinatorial covering arrays can cover a large input space extremely efficiently and are useful for conducting functionality checks of a complex system. However, several assumptions must be met for testers to benefit from combinatorial designs: the system must be fully deterministic, the response variable of interest must be binary (pass/fail), and the primary goal of the test must be to find problems. Combinatorial designs cannot determine cause and effect, and they are not designed to detect or quantify uncertainty or variability in responses.

In operational testing, these assumptions typically are not met. Any number of factors, including the human user, network load, memory leaks, database errors, and a constantly changing environment, can cause variability in the mission-level outcome of interest. While combinatorial designs can be useful for bug checking, they typically are not sufficient for OT.

One goal of OT should be to characterize system performance across the operational space. The appropriate designs to support characterization are classical or optimal designs. These designs, including factorial, fractional factorial, response surface, and D-optimal constructs, can quantify variability in outcomes and attribute changes in the response to specific factors or factor interactions.

These two broad classes of design (combinatorial and classical) can be merged to serve both goals: finding problems and characterizing performance. Testers can develop a “hybrid” design by first building a combinatorial covering array across all factors and then adding the runs needed to support, for example, a D-optimal design. This allows testers to efficiently detect any remaining “bugs” in the software while also quantifying variability and supporting statistical regression analysis of the data.
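To make the first step of the hybrid approach concrete, the sketch below constructs a two-way (pairwise) covering array over a handful of hypothetical factors using a simple greedy algorithm. The factor names and levels are assumptions for illustration; in practice, testers often use dedicated tools (such as NIST’s ACTS covering-array generator) and may need coverage strength higher than two.

```python
from itertools import combinations, product

# Hypothetical factors and levels for a software system in its operational
# environment; names and levels are illustrative only.
factors = {
    "network_load": ["low", "high"],
    "user_type": ["novice", "experienced"],
    "data_volume": ["small", "large"],
    "client_os": ["windows", "linux"],
    "comm_link": ["connected", "degraded"],
}

names = list(factors)
candidates = list(product(*factors.values()))  # full factorial candidate set

def pairs(run):
    """All (factor, level) pairs exercised by a single candidate run."""
    return {
        ((names[i], run[i]), (names[j], run[j]))
        for i, j in combinations(range(len(names)), 2)
    }

# Every two-way combination of levels a pairwise covering array must include.
required = set().union(*(pairs(run) for run in candidates))

design = []
while required:
    # Greedily pick the run that covers the most not-yet-covered pairs.
    best = max(candidates, key=lambda run: len(pairs(run) & required))
    design.append(best)
    required -= pairs(best)

for run in design:
    print(dict(zip(names, run)))
```

The greedy construction typically covers all two-way combinations of these five factors in far fewer runs than the 32-run full factorial, which is what makes covering arrays attractive for efficient bug hunting.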
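The second step of the hybrid approach is to augment those runs so the final design also supports statistical modeling. One common criterion is D-optimality, which seeks to maximize det(X'X) for the assumed model matrix X. The sketch below greedily adds runs from a full-factorial candidate set to an eight-run starting design (a stand-in for the covering-array runs, recoded to ±1); the main-effects model, run budget, and coding are assumptions, and dedicated design-of-experiments software would normally handle this step.

```python
import numpy as np
from itertools import product

# Candidate runs: all combinations of five two-level factors, coded -1/+1.
candidates = np.array(list(product([-1, 1], repeat=5)), dtype=float)

def model_matrix(runs):
    """Intercept plus main effects; a richer model would add interaction columns."""
    return np.column_stack([np.ones(len(runs)), runs])

def d_criterion(runs):
    """D-optimality score: determinant of the information matrix X'X."""
    X = model_matrix(runs)
    return np.linalg.det(X.T @ X)

# Stand-in for the runs already planned for bug hunting (e.g., the covering
# array above, recoded to -1/+1); here an eight-run 2^(5-2) fraction is used.
abc = np.array(list(product([-1, 1], repeat=3)), dtype=float)
base = np.column_stack([abc, abc[:, 0] * abc[:, 1], abc[:, 0] * abc[:, 2]])

design = base.copy()
extra_runs = 6  # additional runs the test schedule can support (assumed)

for _ in range(extra_runs):
    # Add the candidate run that most increases det(X'X).
    best = max(candidates, key=lambda c: d_criterion(np.vstack([design, c])))
    design = np.vstack([design, best])

print(f"{len(design)} total runs, D-criterion = {d_criterion(design):.0f}")
```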
Human-System Interaction

It is not sufficient to assess only technical performance when testing software systems. Systems that account for human factors (operators’ physical and psychological characteristics) are more likely to fulfill their missions. Software that is psychologically challenging to use often leads to mistakes, inefficiencies, and safety concerns. Testers can use human-system interaction (HSI) metrics to capture how compatible the software is with key psychological characteristics. Inherent characteristics such as short- and long-term memory processes, capacity for attention, and cognitive load are directly related to measurable constructs such as usability, workload, and task error rates.

To evaluate HSI, testers can use either behavioral metrics (e.g., error rates, completion times, speech or facial expressions) or self-report metrics (surveys and interviews). Behavioral metrics are generally preferred because they are directly observable, but the appropriate method depends on the HSI construct being measured, the test design, and operational constraints.

The same logic that guides data collection for system performance applies to HSI data collection. Testers should strive to understand how users’ experience of the system shifts with the operational environment; thus, designed experiments with factors and levels should be applied. In addition, understanding whether, and by how much, user experience affects system performance is key to a thorough evaluation.

The easiest way to fit HSI into OT is to leverage the existing test design. First, identify the subset (or possibly superset) of factors that are likely to shape how users experience the system, then distribute users across the test conditions logically. The number of users, their groupings, and how they are spread across the factor space all matter when designing an adequate test for HSI.

Most HSI data, including behavioral metrics and empirically validated surveys, can also be analyzed the same way system performance data are, using statistically rigorous techniques such as regression. Operational conditions, user type, and system characteristics all can affect HSI, so it is critical to account for those factors in both the design and the analysis.
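On the design side, one simple way to spread a small pool of operators across the factor space is to rotate users through the design points so that each condition is exercised by more than one operator and no operator is confined to a single corner of the space. The sketch below is purely illustrative; the operator pool, conditions, and replication budget are assumptions.

```python
from itertools import product

# Hypothetical design points: two factors from the performance design that are
# also expected to shape user experience.
design_points = list(product(["low", "high"], ["small", "large"]))
operators = ["op_1", "op_2", "op_3", "op_4"]  # assumed operator pool

# Rotate the operator list for each replicate so every condition is run by two
# different operators and every operator sees two different conditions.
assignments = []
for replicate in range(2):  # two users per condition (assumed budget)
    rotated = operators[replicate:] + operators[:replicate]
    for operator, condition in zip(rotated, design_points):
        assignments.append({"operator": operator, "condition": condition})

for a in assignments:
    print(a)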
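On the analysis side, HSI metrics can be modeled with the same regression machinery used for system performance. The sketch below fits a mixed-effects regression of a workload score (for example, a NASA-TLX rating) on two operational factors, with the operator as a grouping factor so that repeated measures from the same user are handled properly. The file name, column names, and factors are assumptions for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical HSI data: one row per operator per trial, with a workload score,
# the trial's operational conditions, and an operator identifier.
df = pd.read_csv("hsi_trials.csv")  # assumed file and columns

# Workload modeled against network load and user type (and their interaction),
# with operator as a grouping factor to absorb user-to-user variability.
model = smf.mixedlm(
    "workload ~ C(network_load) * C(user_type)",
    data=df,
    groups=df["operator_id"],
)
result = model.fit()
print(result.summary())
```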
Suggested Citation
Freeman, Laura J., Kelly M. Avery, and Heather M. Wojton. Users Are Part of the System: How to Account for Human Factors When Designing Operational Tests for Software Systems. IDA Document NS D-8630. Alexandria, VA: Institute for Defense Analyses, 2017.