Practitioner

A Practitioner’s Framework for Federated Model Validation Resource Allocation

Recent advances in computation and statistics led to an increasing use of federated models for end-to-end system test and evaluation. A federated model is a collection of interconnected models where the outputs of a model act as inputs to subsequent models. However, the process of verifying and validating federated models is poorly understood, especially when testers have limited resources, knowledge-based uncertainties, and concerns over operational realism. Testers often struggle with determining how to best allocate limited test resources for model validation....

A Preview of Functional Data Analysis for Modeling and Simulation Validation

Modeling and simulation (M&S) validation for operational testing often involves comparing live data with simulation outputs. Statistical methods known as functional data analysis (FDA) provides techniques for analyzing large data sets (“large” meaning that a single trial has a lot of information associated with it), such as radar tracks. We preview how FDA methods could assist M&S validation by providing statistical tools handling these large data sets. This may facilitate analyses that make use of more of the data available and thus allows for better detection of differences between M&S predictions and live test results....

A Reliability Assurance Test Planning and Analysis Tool

This presentation documents the work of IDA 2024 Summer Associate Emma Mitchell. The work presented details an R Shiny application developed to provide a user-friendly software tool for researchers to use in planning for and analyzing system reliability. Specifically, the presentation details how one can plan for a reliability test using Bayesian Reliability Assurance test methods. Such tests utilize supplementary data and information, including reliability models, prior test results, expert judgment, and knowledge of environmental conditions, to plan for reliability testing, which in turn can often help in reducing the required amount of testing....

Developing AI Trust- From Theory to Testing and the Myths in Between

This introductory work aims to provide members of the Test and Evaluation community with a clear understanding of trust and trustworthiness to support responsible and effective evaluation of AI systems. The paper provides a set of working definitions and works toward dispelling confusion and myths surrounding trust. Suggested Citation Razin, Yosef S., and Kristen Alexander. “Developing AI Trust: From Theory to Testing and the Myths in Between.” The ITEA Journal of Test and Evaluation 45, no....

Operational T&E of AI-Supported Data Integration, Fusion, and Analysis Systems

AI will play an important role in future military systems. However, large questions remain about how to test AI systems, especially in operational settings. Here, we discuss an approach for the operational test and evaluation (OT&E) of AI-supported data integration, fusion, and analysis systems. We highlight new challenges posed by AI-supported systems and we discuss new and existing OT&E methods for overcoming them. We demonstrate how to apply these OT&E methods via a notional test concept that focuses on evaluating an AI-supported data integration system in terms of its technical performance (how accurate is the AI output?...

Sequential Space-Filling Designs for Modeling & Simulation Analyses

Space-filling designs (SFDs) are a rigorous method for designing modeling and simulation (M&S) studies. However, they are hindered by their requirement to choose the final sample size prior to testing. Sequential designs are an alternative that can increase test efficiency by testing small amounts of data at a time. We have conducted a literature review of existing sequential space-filling designs and found the methods most applicable to the test and evaluation (T&E) community....

Simulation Insights on Power Analysis with Binary Responses--from SNR Methods to 'skprJMP'

Logistic regression is a commonly-used method for analyzing tests with probabilistic responses in the test community, yet calculating power for these tests has historically been challenging. This difficulty prompted the development of methods based on signal-to-noise ratio (SNR) approximations over the last decade, tailored to address the intricacies of logistic regression’s binary outcomes. However, advancements and improvements in statistical software and computational power have reduced the need for such approximate methods....

Statistical Advantages of Validated Surveys over Custom Surveys

Surveys play an important role in quantifying user opinion during test and evaluation (T&E). Current best practice is to use surveys that have been tested, or “validated,” to ensure that they produce reliable and accurate results. However, unvalidated (“custom”) surveys are still widely used in T&E, raising questions about how to determine sample sizes for—and interpret data from— T&E events that rely on custom surveys. In this presentation, I characterize the statistical properties of validated and custom survey responses using data from recent T&E events, and then I demonstrate how these properties affect test design, analysis, and interpretation....

Uncertainty Quantification for Ground Vehicle Vulnerability Simulation

A vulnerability assessment of a combat vehicle uses modeling and simulation (M&S) to predict the vehicle’s vulnerability to a given enemy attack. The system-level output of the M&S is the probability that the vehicle’s mobility is degraded as a result of the attack. The M&S models this system-level phenomenon by decoupling the attack scenario into a hierarchy of sub-systems. Each sub-system addresses a specific scientific problem, such as the fracture dynamics of an exploded munition, or the ballistic resistance provided by the vehicle’s armor....

A Team-Centric Metric Framework for Testing and Evaluation of Human-Machine Teams

We propose and present a parallelized metric framework for evaluating human-machine teams that draws upon current knowledge of human-systems interfacing and integration but is rooted in team-centric concepts. Humans and machines working together as a team involves interactions that will only increase in complexity as machines become more intelligent, capable teammates. Assessing such teams will require explicit focus on not just the human-machine interfacing but the full spectrum of interactions between and among agents....

Development of Wald-Type and Score-Type Statistical Tests to Compare Live Test Data and Simulation Predictions

This work describes the development of a statistical test created in support of ongoing verification, validation, and accreditation (VV&A) efforts for modeling and simulation (M&S) environments. The test computes a Wald-type statistic comparing two generalized linear models estimated from live test data and analogous simulated data. The resulting statistic indicates whether the M&S outputs differ from the live data. After developing the test, we applied it to two logistic regression models estimated from live torpedo test data and simulated data from the Naval Undersea Warfare Center’s Environment Centric Weapons Analysis Facility (ECWAF)....

Implementing Fast Flexible Space-Filling Designs in R

Modeling and simulation (M&S) can be a useful tool when testers and evaluators need to augment the data collected during a test event. When planning M&S, testers use experimental design techniques to determine how much and which types of data to collect, and they can use space-filling designs to spread out test points across the operational space. Fast flexible space-filling designs (FFSFDs) are a type of space-filling design useful for M&S because they work well in design spaces with disallowed combinations and permit the inclusion of categorical factors....

Improving Test Efficiency- A Bayesian Assurance Case Study

To improve test planning for evaluating system reliability, we propose the use of Bayesian methods to incorporate supplementary data and reduce testing duration. Furthermore, we recommend Bayesian methods be employed in the analysis phase to better quantify uncertainty. We find that when using Bayesian Methods for test planning we can scope smaller tests and using Bayesian methods in analysis results in a more precise estimate of reliability – improving uncertainty quantification....

Introduction to Design of Experiments in R- Generating and Evaluating Designs with Skpr

This workshop instructs attendees on how to run an end-to-end optimal Design of Experiments workflow in R using the open source skpr package. This workshop is split into two sections optimal design generation and design evaluation. The first half of the workshop provides basic instructions how to use R, as well as how to use skpr to create an optimal design for an experiment how to specify a model, create a candidate set of potential runs, remove disallowed combinations, and specify the design generation conditions to best suit an experimenter’s goals....

Introduction to Measuring Situational Awareness in Mission-Based Testing Scenarios

Situation Awareness (SA) plays a key role in decision making and human performance, higher operator SA is associated with increased operator performance and decreased operator errors. While maintaining or improving “situational awareness” is a common requirement for systems under test, there is no single standardized method or metric for quantifying SA in operational testing (OT). This leads to varied and sometimes suboptimal treatments of SA measurement across programs and test events....

Metamodeling Techniques for Verification and Validation of Modeling and Simulation Data

Modeling and simulation (M&S) outputs help the Director, Operational Test and Evaluation (DOT&E) assess the effectiveness, survivability, lethality, and suitability of systems. To use M&S outputs, DOT&E needs models and simulators to be sufficiently verified and validated. The purpose of this paper is to improve the state of verification and validation by recommending and demonstrating a set of statistical techniques—metamodels, also called statistical emulators—to the M&S community. The paper expands on DOT&E’s existing guidance about metamodel usage by creating methodological recommendations the M&S community could apply to its activities....

Predicting Trust in Automated Systems - An Application of TOAST

Following Wojton’s research on the Trust of Automated Systems Test (TOAST), which is designed to measure how much a human trusts an automated system, we aimed to determine how well this scale performs when not used in a military context. We found that participants who used a poorly performing automated system trusted the system less than expected when using that system on a case by case basis, however, those who used a high performing system trusted the system the same as they expected....

Thoughts on Applying Design of Experiments (DOE) to Cyber Testing

This briefing presented at Dataworks 2022 provides examples of potential ways in which Design of Experiments (DOE) could be applied to initially scope cyber assessments and, based on the results of those assessments, subsequently design in greater detail cyber tests. Suggested Citation Gilmore, James M, Kelly M Avery, Matthew R Girardi, and Rebecca M Medlin. Thoughts on Applying Design of Experiments (DOE) to Cyber Testing. IDA Document NS D-33023. Alexandria, VA: Institute for Defense Analyses, 2022....

Topological Modeling of Human-Machine Teams

A Human-Machine Team (HMT) is a group ofagents consisting of at least one human and at least one machine, all functioning collaboratively towards one or more common objectives. As industry and defense find more helpful, creative, and difficult applications of AI-driven technology, the need to effectively and accurately model, simulate, test, and evaluate HMTs will continue to grow and become even more essential. Going along with that growing need, new methods are required to evaluate whether a human-machine team is performing effectively as a team in testing and evaluation scenarios....

Introduction to Bayesian Analysis

As operational testing becomes increasingly integrated and research questions become more difficult to answer, IDA’s Test Science team has found Bayesian models to be powerful data analysis methods. Analysts and decision-makers should understand the differences between this approach and the conventional way of analyzing data. It is also important to recognize when an analysis could benefit from the inclusion of prior information—what we already know about a system’s performance—and to understand the proper way to incorporate that information....

Space-Filling Designs for Modeling & Simulation

This document presents arguments and methods for using space-filling designs (SFDs) to plan modeling and simulation (M&S) data collection. Suggested Citation Avery, Kelly, John T Haman, Thomas Johnson, Curtis Miller, Dhruv Patel, and Han Yi. Test Design Challenges in Defense Testing. IDA Product ID 3002855. Alexandria, VA: Institute for Defense Analyses, 2024. Slides: Paper:

Warhead Arena Analysis Advancements

Fragmentation analysis is a critical piece of the live fire test and evaluation (LFT&E) of the lethality and vulnerability aspects of warheads. But the traditional methods for data collection are expensive and laborious. New optical tracking technology is promising to increase the fidelity of fragmentation data, and decrease the time and costs associated with data collection. However, the new data will be complex, three-dimensional “fragmentation clouds,” possibly with a time component as well, and there will be a larger number of individual data points....

A Review of Sequential Analysis

Sequential analysis concerns statistical evaluation in situations in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends upon the information acquired throughout the course of the investigation. Expanding the use of sequential analysis has the potential to save resources and reduce test time (National Research Council, 1998). This paper summarizes the literature on sequential analysis and offers fundamental information for providing recommendations for its use in DoD test and evaluation....

Circular Prediction Regions for Miss Distance Models under Heteroskedasticity

Circular prediction regions are used in ballistic testing to express the uncertainty in shot accuracy. We compare two modeling approaches for estimating circular prediction regions for the miss distance of a ballistic projectile. The miss distance response variable is bivariate normal and has a mean and variance that can change with one or more experimental factors. The first approach fits a heteroskedastic linear model using restricted maximum likelihood, and uses the Kenward-Roger statistic to estimate circular prediction regions....

Bayesian Component Reliability- An F-35 Case Study

A challenging aspect ofa system reliability assessment is integratingmultiple sources of information, such as component, subsystem, and full-system data,along with previous test data or subject matter expert (SME) opinion. A powerfulfeature of Bayesian analyses is the ability to combine these multiple sources of dataand variability in an informed way to perform statistical inference. This feature isparticularly valuable in assessing system reliability where testing is limited and only asmall number of failures (or none at all) are observed....

Challenges and New Methods for Designing Reliability Experiments

Engineers use reliability experiments to determine the factors that drive product reliability, build robust products, and predict reliability under use conditions. This article uses recent testing of a Howitzer to illustrate the challenges in designing reliability experiments for complex, repairable systems. We leverage lessons learned from current research and propose methods for designing an experiment for a complex, repairable system. Suggested Citation Freeman, Laura J., Rebecca M. Medlin, and Thomas H....

Handbook on Statistical Design & Analysis Techniques for Modeling & Simulation Validation

This handbook focuses on methods for data-driven validation to supplement the vast existing literature for Verification, Validation, and Accreditation (VV&A) and the emerging references on uncertainty quantification (UQ). The goal of this handbook is to aid the test and evaluation (T&E) community in developing test strategies that support model validation (both external validation and parametric analysis) and statistical UQ. Suggested Citation Wojton, Heather, Kelly M Avery, Laura J Freeman, Samuel H Parry, Gregory S Whittier, Thomas H Johnson, and Andrew C Flack....

M&S Validation for the Joint Air-to-Ground Missile

An operational test is resource-limited and must therefore rely on both live test data and modeling and simulation (M&S) data to inform a full evaluation. For the Joint Air-to-Ground Missile (JAGM) system, we needed to create a test design that accomplished dual goals, characterizing missile performance across the operational space and supporting rigorous validation of the M&S. Our key question is which statistical techniques should be used to compare the M&S to the live data?...

Operational Testing of Systems with Autonomy

Systems with autonomy pose unique challenges for operational test. This document provides an executive level overview of these issues and the proposed solutions and reforms. In order to be ready for the testing challenges of the next century, we will need to change the entire acquisition life cycle, starting even from initial system conceptualization. This briefing was presented to the Director, Operational Test & Evaluation along with his deputies and Chief Scientist....

Sample Size Determination Methods Using Acceptance Sampling by Variables

Acceptance Sampling by Variables (ASbV) is a statistical testing technique used in Personal Protective Equipment programs to determine the quality of the equipment in First Article and Lot Acceptance Tests. This article intends to remedy the lack of existing references that discuss the similarities between ASbV and certain techniques used in different sub-disciplines within statistics. Understanding ASbV from a statistical perspective allows testers to create customized test plans, beyond what is available in MIL-STD-414....

The Effect of Extremes in Small Sample Size on Simple Mixed Models- A Comparison of Level-1 and Level-2 Size

We present a simulation study that examines the impact of small sample sizes in both observation and nesting levels of the model on the fixed effect bias, type I error, and the power of a simple mixed model analysis. Despite the need for adjustments to control for type I error inflation, our findings indicate that smaller samples than previously recognized can be used for mixed models under certain conditions prevalent in applied research....

The Purpose of Mixed-Effects Models in Test and Evaluation

Mixed-effects models are the standard technique for analyzing data with grouping structure. In defense testing, these models are useful because they allow us to account for correlations between observations, a feature common in many operational tests. In this article, we describe the advantages of modeling data from a mixed-effects perspective and discuss an R package—ciTools—that equips the user with easy methods for presenting results from this type of model. Suggested Citation Haman, John, Matthew Avery, and Heather Wojton....

Analysis of Split-Plot Reliability Experiments with Subsampling

Reliability experiments are important for determining which factors drive product reliability. The data collected in these experiments can be challenging to analyze. Often, the reliability or lifetime data collected follow distinctly nonnormal distributions and include censored observations. Additional challenges in the analysis arise when the experiment is executed with restrictions on randomization. The focus of this paper is on the proper analysis of reliability data collected from a nonrandomized reliability experiments....

Comparing M&S Output to Live Test Data- A Missile System Case Study

In the operational testing of DoD weapons systems, modeling and simulation (M&S) is often used to supplement live test data in order to support a more complete and rigorous evaluation. Before the output of the M&S is included in reports to decision makers, it must first be thoroughly verified and validated to show that it adequately represents the real world for the purposes of the intended use. Part of the validation process should include a statistical comparison of live data to M&S output....

Improved Surface Gunnery Analysis with Continuous Data

Recasting gunfire data from binomial (hit/miss) to continuous (time-to-kill) allows us to draw statistical conclusions with tactical implications from free-play,live-fire surface gunnery events. Our analysis provided the Navy with suggestions forimprovements to its tactics and the employment of its weapons. A censored analysisenabled us to do so, where other methods fell short. Suggested Citation Ashwell, Benjamin A, V Bram Lillard, and George M Khoury. Improved Surface Gunnery Analysis with Continuous Data....

Introduction to Observational Studies

A presentation on the theory and practice of observational studies. Specific average treatment effect methods include matching, difference-in-difference estimators, and instrumental variables. Suggested Citation Thomas, Dean, and Yevgeniya K Pinelis. Introduction to Observational Studies. IDA Document NS D-9020. Alexandria, VA: Institute for Defense Analyses, 2018. Slides:

Parametric Reliability Models Tutorial

This tutorial demonstrates how to plot reliability functions parametrically in R using the output from any reliability modeling software. It provides code and sample plots of reliability and failure rate functions with confidence intervals for three different skewed probability distributions the exponential, the two-parameter Weibull, and the lognormal. These three distributions are the most common parametric models for reliability or survival analysis. This paper also provides mathematical background for the models and recommendations for when to use them....

Scientific Test and Analysis Techniques

Abstract This document contains the technical content for the Scientific Test and Analysis Techniques (STAT) in Test and Evaluation (T&E) continuous learning module. The module provides a basic understanding of STAT in T&E. Topics coverec include design of experiments, observational studies, survey design and analysis, and statistical analysis. It is designed as a four hour online course, suitable for inclusion in the DAU T&E certification curriculum. Slides

Scientific Test and Analysis Techniques- Continuous Learning Module

This document contains the technical content for the Scientific Test and Analysis Techniques (STAT) in Test and Evaluation (T&E) continuous learning module. The module provides a basic understanding of STAT in T&E. Topics covered include design of experiments, observational studies, survey design and analysis, and statistical analysis. It is designed as a four hour online course, suitable for inclusion in the DAU T&E certification curriculum. Suggested Citation Pinelis, Yevgeniya, Laura J Freeman, Heather M Wojton, Denise J Edwards, Stephanie T Lane, and James R Simpson....

Testing Defense Systems

The complex, multifunctional nature of defense systems, along with the wide variety of system types, demands a structured but flexible analytical process for testing systems. This chapter summarizes commonly used techniques in defense system testing and specific challenges imposed by the nature of defense system testing. It highlights the core statistical methodologies that have proven useful in testing defense systems. Case studies illustrate the value of using statistical techniques in the design of tests and analysis of the resulting data....

Comparing Live Missile Fire and Simulation

Modeling and Simulation is frequently used in Test and Evaluation (T&E) of air-to-air weapon systems to evaluate the effectiveness of a weapons. The AirIntercept Missile-9X (AIM-9X) program uses modeling and simulationextensively to evaluate missile miss distances. Since flight testing isexpensive, the test program uses relatively few flight tests and supplementsthose data with large numbers of miss distances from simulated tests acrossthe weapons operational space. However, before modeling and simulation canbe used to predict performance it must first be validated....

On Scoping a Test that Addresses the Wrong Objective

Statistical literature refers to a type of error that is committed by giving the right answer to the wrong question. If a test design is adequately scoped to address an irrelevant objective, one could say that a Type III error occurs. In this paper, we focus on a specific Type III error that on some occasions test planners commit to reduce test size and resources. Suggested Citation Johnson, Thomas H., Rebecca M....

Bayesian Reliability- Combining Information

One of the most powerful features of Bayesian analyses is the ability to combine multiple sources of information in a principled way to perform inference. This feature can be particularly valuable in assessing the reliability of systems where testing is limited. At their most basic, Bayesian methods for reliability develop informative prior distributions using expert judgment or similar systems. Appropriate models allow the incorporation of many other sources of information, including historical data, information from similar systems, and computer models....

Tutorial on Sensitivity Testing in Live Fire Test and Evaluation

A sensitivity experiment is a special type of experimental design that is used when the response variable is binary and the covariate is continuous. Armor protection and projectile lethality tests often use sensitivity experiments to characterize a projectile’s probability of penetrating the armor. In this mini-tutorial we illustrate the challenge of modeling a binary response with a limited sample size, and show how sensitivity experiments can mitigate this problem. We review eight different single covariate sensitivity experiments and present a comparison of these designs using simulation....

Estimating System Reliability from Heterogeneous Data

This briefing provides an example of some of the nuanced issues in reliability estimation in operational testing. The statistical models are motivated by an example of the Paladin Integrated Management (PIM). We demonstrate how to use a Bayesian approach to reliability estimation that uses data from all phases of testing. Suggested Citation Browning, Caleb, Laura Freeman, Alyson Wilson, Kassandra Fronczyk, and Rebecca Dickinson. “Estimating System Reliability from Heterogeneous Data.” Presented at the Conference on Applied Statistics in Defense, George Mason University, October 2015....

Improving Reliability Estimates with Bayesian Statistics

This paper shows how Bayesian methods are ideal for the assessment of complex system reliability assessments. Several examples illustrate the methodology. Suggested Citation Freeman, Laura J, and Kassandra Fronczyk. “Improving Reliability Estimates with Bayesian Statistics.” ITEA Journal of Test and Evaluation 37, no. 4 (June 2015). Paper:

Statistical Models for Combining Information Stryker Reliability Case Study

Reliability is an essential element in assessing the operational suitability of Department of Defense weapon systems. Reliability takes a prominent role in both the design and analysis of operational tests. In the current era of reduced budgets and increased reliability requirements, it is challenging to verify reliability requirements in a single test. Furthermore, all available data should be considered in order to ensure evaluations provide the most appropriate analysis of the system’s reliability....

Applying Risk Analysis to Acceptance Testing of Combat Helmets

Acceptance testing of combat helmets presents multiple challenges that require statistically-sound solutions. For example, how should first article and lot acceptance tests treat multiple threats and measures of performance? How should these tests account for multiple helmet sizes and environmental treatments? How closely should first article testing requirements match historical or characterization test data? What government and manufacturer risks are acceptable during lot acceptance testing? Similar challenges arise when testing other components of Personal Protective Equipment and similar statistical approaches should be applied to all components....

Power Analysis Tutorial for Experimental Design Software

This guide provides both a general explanation of power analysis and specific guidance to successfully interface with two software packages, JMP and Design Expert (DX). Suggested Citation Freeman, Laura J., Thomas H. Johnson, and James R. Simpson. “Power Analysis Tutorial for Experimental Design Software:” Fort Belvoir, VA: Defense Technical Information Center, November 1, 2014. https://doi.org/10.21236/ADA619843. Paper:

Censored Data Analysis- A Statistical Tool for Efficient and Information-Rich Testing

Binomial metrics like probability-to-detect or probability-to-hit typically provide operationally meaningful and easy to interpret test outcomes. However, they are information-poor metrics and extremely expensive to test. The standard power calculations to size a test employ hypothesis tests, which typically result in many tens to hundreds of runs. In addition to being expensive, the test is most likely inadequate for characterizing performance over a variety of conditions due to the inherently large statistical uncertainties associated with binomial metrics....

An Expository Paper on Optimal Design

There are many situations where the requirements of a standard experimental design do not fit the research requirements of the problem. Three such situations occur when the problem requires unusual resource restrictions, when there are constraints on the design region, and when a non-standard model is expected to be required to adequately explain the response. Suggested Citation Johnson, Rachel T., Douglas C. Montgomery, and Bradley A. Jones. “An Expository Paper on Optimal Design....

Design for Reliability using Robust Parameter Design

Recently, the principles of Design of Experiments (DOE) have been implemented as amethod of increasing the statistical rigor of operational tests. The focus has been on ensuringcoverage of the operational envelope in terms of system effectiveness. DOE is applicable inreliability analysis as well. A reliability standard, ANSI-0009, advocates the use Design forReliability (DfR) early in the product development cycle in order to design-in reliability. Robustparameter design (RPD) first used by Taguchi and then by the response surface communityprovides insights on how DOE can be used to make a products and processes invariant tochanges in factors....

Hybrid Designs- Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models

This tutorial provides an overview of experimental design for modeling and simulation. Pros and cons of each design methodology are discussed. Suggested Citation Silvestrini, Rachel Johnson. “Hybrid Designs: Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models.” Monterey, California, May 2011. Slides:

Examining Improved Experimental Designs for Wind Tunnel Testing Using Monte Carlo Sampling Methods

In this paper we compare data from a fairly large legacy wind tunnel test campaign to smaller, statistically-motivated experimental design strategies. The comparison, using Monte Carlo sampling methodology, suggests a tremendous opportunity to reduce wind tunnel test efforts without losing test information. Suggested Citation Hill, Raymond R., Derek A. Leggio, Shay R. Capehart, and August G. Roesener. “Examining Improved Experimental Designs for Wind Tunnel Testing Using Monte Carlo Sampling Methods.” Quality and Reliability Engineering International 27, no....