A Practitioner’s Framework for Federated Model Validation Resource Allocation

Recent advances in computation and statistics have led to an increasing use of federated models for end-to-end system test and evaluation. A federated model is a collection of interconnected models in which the outputs of one model act as inputs to subsequent models. However, the process of verifying and validating federated models is poorly understood, especially when testers have limited resources, knowledge-based uncertainties, and concerns over operational realism. Testers often struggle with determining how to best allocate limited test resources for model validation....

2024 · Dhruv Patel, Jo Anna Capp, John Haman

A Preview of Functional Data Analysis for Modeling and Simulation Validation

Modeling and simulation (M&S) validation for operational testing often involves comparing live data with simulation outputs. A body of statistical methods known as functional data analysis (FDA) provides techniques for analyzing large data sets ("large" meaning that a single trial has a lot of information associated with it), such as radar tracks. We preview how FDA methods could assist M&S validation by providing statistical tools for handling these large data sets. This may facilitate analyses that make use of more of the available data and thus allow for better detection of differences between M&S predictions and live test results....

2024 · Curtis Miller

A Reliability Assurance Test Planning and Analysis Tool

This presentation documents the work of IDA 2024 Summer Associate Emma Mitchell. The work presented details an R Shiny application developed to provide a user-friendly software tool for researchers to use in planning for and analyzing system reliability. Specifically, the presentation details how one can plan for a reliability test using Bayesian Reliability Assurance test methods. Such tests utilize supplementary data and information, including reliability models, prior test results, expert judgment, and knowledge of environmental conditions, to plan for reliability testing, which in turn can often reduce the amount of testing required....

2024 · Emma Mitchell, Rebecca Medlin, John Haman, Keyla Pagan-Rivera, Dhruv Patel
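The Bayesian assurance approach described above can be sketched with a standard-library beta-binomial update. This is an illustrative stand-in, not the application's actual implementation; the prior parameters, test counts, and reliability requirement below are hypothetical.

```python
import random

def assurance(prior_a, prior_b, successes, failures, req, draws=100_000, seed=1):
    """Posterior probability that reliability meets or exceeds `req`, given a
    Beta(prior_a, prior_b) prior (encoding supplementary data and expert
    judgment) updated with pass/fail test results. Estimated by Monte Carlo
    sampling from the conjugate Beta posterior."""
    rng = random.Random(seed)
    a, b = prior_a + successes, prior_b + failures
    return sum(rng.betavariate(a, b) >= req for _ in range(draws)) / draws

# Hypothetical example: informative Beta(8, 2) prior, then 28 passes and
# 2 failures observed in test, against a 0.80 reliability requirement.
print(assurance(8, 2, 28, 2, 0.80))
```

Because the prior carries information, the test itself can be smaller than a no-prior plan would require, which is the efficiency gain the abstract refers to.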

Determining the Necessary Number of Runs in Computer Simulations with Binary Outcomes

How many success-or-failure observations should we collect from a computer simulation? Often, researchers use space-filling design of experiments when planning modeling and simulation (M&S) studies. We are not satisfied with existing guidance on justifying the number of runs when developing these designs, either because the guidance is insufficiently justified, does not provide an unambiguous answer, or is not based on optimizing a statistical measure of merit. Analysts should use confidence interval margin of error as the statistical measure of merit for M&S studies intended to characterize overall M&S behavioral trends....

2024 · Curtis Miller, Kelly Duffy
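The margin-of-error criterion proposed above can be sketched as a run-size calculation for a binary outcome. This is a minimal illustration using the textbook Wald interval at the worst-case proportion p = 0.5; the source does not specify which interval the authors recommend.

```python
from math import ceil
from statistics import NormalDist

def runs_for_margin(margin, p=0.5, confidence=0.95):
    """Smallest number of simulation runs so that a Wald confidence interval
    for a success probability has half-width (margin of error) <= `margin`.
    Defaults to p = 0.5, the most conservative case."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)

print(runs_for_margin(0.05))  # -> 385 runs for a +/- 5-point margin
```

Inverting a target margin of error into a run count gives the unambiguous, statistically grounded answer the abstract argues existing space-filling-design guidance lacks.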

Developing AI Trust: From Theory to Testing and the Myths in Between

This introductory work aims to provide members of the Test and Evaluation community with a clear understanding of trust and trustworthiness to support responsible and effective evaluation of AI systems. The paper provides a set of working definitions and works toward dispelling confusion and myths surrounding trust. Suggested Citation: Razin, Yosef S., and Kristen Alexander. “Developing AI Trust: From Theory to Testing and the Myths in Between.” The ITEA Journal of Test and Evaluation 45, no....

2024 · Yosef Razin, Kristen Alexander, John Haman

Introduction to Human-Systems Interaction in Operational Test and Evaluation Course

Human-System Interaction (HSI) is the study of interfaces between humans and technical systems. The Department of Defense incorporates HSI evaluations into defense acquisition to improve system performance and reduce lifecycle costs. During operational test and evaluation, HSI evaluations characterize how a system’s operational performance is affected by its users. The goal of this course is to provide the theoretical background and practical tools necessary to develop and evaluate HSI test plans, collect and analyze HSI data, and report on HSI results....

2024 · Adam Miller, Keyla Pagan-Rivera

Meta-Analysis of the Effectiveness of the SALIANT Procedure for Assessing Team Situation Awareness

Many Department of Defense (DoD) systems aim to increase or maintain Situational Awareness (SA) at the individual or group level. In some cases, maintenance or enhancement of SA is listed as a primary function or requirement of the system. However, during test and evaluation SA is examined inconsistently or is not measured at all. Situational Awareness Linked Indicators Adapted to Novel Tasks (SALIANT) is an empirically-based methodology meant to measure SA at the team, or group, level....

2024 · Sarah Shaffer, Miriam Armstrong

Operational T&E of AI-Supported Data Integration, Fusion, and Analysis Systems

AI will play an important role in future military systems. However, large questions remain about how to test AI systems, especially in operational settings. Here, we discuss an approach for the operational test and evaluation (OT&E) of AI-supported data integration, fusion, and analysis systems. We highlight new challenges posed by AI-supported systems and we discuss new and existing OT&E methods for overcoming them. We demonstrate how to apply these OT&E methods via a notional test concept that focuses on evaluating an AI-supported data integration system in terms of its technical performance (how accurate is the AI output?...

2024 · Adam Miller, Logan Ausman, John Haman, Keyla Pagan-Rivera, Sarah Shaffer, Brian Vickers

Quantifying Uncertainty to Keep Astronauts and Warfighters Safe

Both NASA and DOT&E increasingly rely on computer models to supplement data collection and use statistical distributions to quantify the uncertainty in those models, so that decision-makers are equipped with the most accurate information about system performance and model fitness. This article provides a high-level overview of uncertainty quantification (UQ) through an example assessment of the reliability of a new space-suit system. The goal is to reach a more general audience in Significance Magazine and convey the importance and relevance of statistics to the defense and aerospace communities....

2024 · John Haman, John Dennis, James Warner
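The kind of UQ analysis sketched above can be illustrated with a Monte Carlo stress-strength reliability calculation. The component, distributions, and parameters below are entirely hypothetical stand-ins, not the space-suit assessment from the article.

```python
import random

def reliability_mc(n=50_000, seed=2):
    """Monte Carlo reliability sketch for a notional component: the
    probability that its strength exceeds the applied load, with
    hypothetical Normal distributions standing in for uncertain model
    inputs (strength ~ N(100, 5), load ~ N(80, 8), arbitrary units)."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(100, 5) > rng.gauss(80, 8) for _ in range(n))
    return hits / n

print(reliability_mc())
```

Propagating input distributions through the model like this, rather than running it once at nominal values, is what lets decision-makers see a reliability estimate together with its uncertainty.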

Sequential Space-Filling Designs for Modeling & Simulation Analyses

Space-filling designs (SFDs) are a rigorous method for designing modeling and simulation (M&S) studies. However, they are hindered by their requirement to choose the final sample size prior to testing. Sequential designs are an alternative that can increase test efficiency by adding small batches of runs at a time. We have conducted a literature review of existing sequential space-filling designs and found the methods most applicable to the test and evaluation (T&E) community....

2024 · Anna Flowers, John Haman
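One simple member of the family of sequential space-filling methods surveyed above is a greedy maximin rule: each new run is placed as far as possible from all runs already made. The sketch below is illustrative only and is not drawn from the authors' review.

```python
import random

def sequential_maximin(existing, n_new, n_candidates=200, dim=2, seed=0):
    """Greedy sequential space-filling design on the unit hypercube:
    each new point is the random candidate whose minimum squared distance
    to all previously chosen points is largest (a maximin criterion)."""
    rng = random.Random(seed)
    pts = list(existing)
    for _ in range(n_new):
        cands = [tuple(rng.random() for _ in range(dim)) for _ in range(n_candidates)]
        best = max(
            cands,
            key=lambda c: min(
                (sum((a - b) ** 2 for a, b in zip(c, p)) for p in pts),
                default=1.0,  # first point: any candidate is acceptable
            ),
        )
        pts.append(best)
    return pts

# Start from one existing run and add five more, one at a time.
design = sequential_maximin([(0.5, 0.5)], n_new=5)
```

Because points are added one batch at a time, testing can stop as soon as the analysis goal is met, which is the efficiency advantage over fixing the sample size up front.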

Simulation Insights on Power Analysis with Binary Responses: From SNR Methods to 'skprJMP'

Logistic regression is a commonly used method in the test community for analyzing tests with probabilistic responses, yet calculating power for these tests has historically been challenging. This difficulty prompted the development, over the last decade, of methods based on signal-to-noise ratio (SNR) approximations, tailored to address the intricacies of logistic regression’s binary outcomes. However, advancements in statistical software and computational power have reduced the need for such approximate methods....

2024 · Tyler Morgan-Wall, Robert Atkins, Curtis Miller
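The simulation-based alternative to SNR approximations can be sketched with a standard-library Monte Carlo power calculation. For simplicity this uses a two-proportion z-test on a single binary factor rather than a fitted logistic regression; it is a stand-in for the heavier simulation that tools like skpr perform, and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def sim_power(p0, p1, n_per_group, alpha=0.05, reps=5000, seed=1):
    """Monte Carlo power estimate: simulate binary responses in two groups
    with success probabilities p0 and p1, apply a pooled two-proportion
    z-test to each simulated data set, and return the rejection rate."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        x0 = sum(rng.random() < p0 for _ in range(n_per_group))
        x1 = sum(rng.random() < p1 for _ in range(n_per_group))
        phat = (x0 + x1) / (2 * n_per_group)
        se = sqrt(phat * (1 - phat) * 2 / n_per_group)
        if se > 0 and abs(x1 / n_per_group - x0 / n_per_group) / se > z_crit:
            hits += 1
    return hits / reps

# Hypothetical scenario: detect a jump from 50% to 70% success with 100
# trials per condition.
power = sim_power(0.5, 0.7, n_per_group=100)
```

Simulating the analysis directly, rather than approximating it through an SNR formula, is exactly the shift in practice the abstract describes.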

Statistical Advantages of Validated Surveys over Custom Surveys

Surveys play an important role in quantifying user opinion during test and evaluation (T&E). Current best practice is to use surveys that have been tested, or “validated,” to ensure that they produce reliable and accurate results. However, unvalidated (“custom”) surveys are still widely used in T&E, raising questions about how to determine sample sizes for, and interpret data from, T&E events that rely on custom surveys. In this presentation, I characterize the statistical properties of validated and custom survey responses using data from recent T&E events, and then I demonstrate how these properties affect test design, analysis, and interpretation....

2024 · Adam Miller

Uncertainty Quantification for Ground Vehicle Vulnerability Simulation

A vulnerability assessment of a combat vehicle uses modeling and simulation (M&S) to predict the vehicle’s vulnerability to a given enemy attack. The system-level output of the M&S is the probability that the vehicle’s mobility is degraded as a result of the attack. The M&S models this system-level phenomenon by decoupling the attack scenario into a hierarchy of sub-systems. Each sub-system addresses a specific scientific problem, such as the fracture dynamics of an exploded munition, or the ballistic resistance provided by the vehicle’s armor....

2024 · John Haman, David Higdon, Thomas Johnson, Dhruv Patel, Jeremy Werner