A Practitioner’s Framework for Federated Model Validation Resource Allocation

Recent advances in computation and statistics have led to increasing use of federated models for end-to-end system test and evaluation. A federated model is a collection of interconnected models in which the outputs of one model act as inputs to subsequent models. However, the process of verifying and validating federated models is poorly understood, especially when testers have limited resources, knowledge-based uncertainties, and concerns over operational realism. Testers often struggle to determine how best to allocate limited test resources for model validation....

2024 · Dhruv Patel, Jo Anna Capp, John Haman

A Preview of Functional Data Analysis for Modeling and Simulation Validation

Modeling and simulation (M&S) validation for operational testing often involves comparing live data with simulation outputs. The body of statistical methods known as functional data analysis (FDA) provides techniques for analyzing large data sets (“large” meaning that a single trial has a lot of information associated with it), such as radar tracks. We preview how FDA methods could assist M&S validation by providing statistical tools for handling these large data sets. This may facilitate analyses that make use of more of the available data and thus allow for better detection of differences between M&S predictions and live test results....

2024 · Curtis Miller

Meta-Analysis of the Effectiveness of the SALIANT Procedure for Assessing Team Situation Awareness

Many Department of Defense (DoD) systems aim to increase or maintain Situational Awareness (SA) at the individual or group level. In some cases, maintenance or enhancement of SA is listed as a primary function or requirement of the system. However, during test and evaluation SA is examined inconsistently or is not measured at all. Situational Awareness Linked Indicators Adapted to Novel Tasks (SALIANT) is an empirically based methodology meant to measure SA at the team, or group, level....

2024 · Sarah Shaffer, Miriam Armstrong

Operational T&E of AI-Supported Data Integration, Fusion, and Analysis Systems

AI will play an important role in future military systems. However, large questions remain about how to test AI systems, especially in operational settings. Here, we discuss an approach for the operational test and evaluation (OT&E) of AI-supported data integration, fusion, and analysis systems. We highlight new challenges posed by AI-supported systems and we discuss new and existing OT&E methods for overcoming them. We demonstrate how to apply these OT&E methods via a notional test concept that focuses on evaluating an AI-supported data integration system in terms of its technical performance (how accurate is the AI output?...

2024 · Adam Miller, Logan Ausman, John Haman, Keyla Pagan-Rivera, Sarah Shaffer, Brian Vickers

Simulation Insights on Power Analysis with Binary Responses--from SNR Methods to 'skprJMP'

In the test community, logistic regression is a commonly used method for analyzing tests with probabilistic responses, yet calculating power for these tests has historically been challenging. This difficulty prompted the development, over the last decade, of methods based on signal-to-noise ratio (SNR) approximations, tailored to address the intricacies of logistic regression’s binary outcomes. However, advances in statistical software and computational power have reduced the need for such approximate methods....

2024 · Tyler Morgan-Wall, Robert Atkins, Curtis Miller
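The simulation-based approach the abstract alludes to can be illustrated with a minimal Monte Carlo power sketch. This is not the authors' skprJMP code; it is a toy Python example that uses a two-proportion z-test as a simple stand-in for a logistic-regression hypothesis test, with hypothetical hit probabilities (0.5 vs. 0.8) and sample sizes chosen purely for illustration.

```python
import math
import random

def simulate_power(p_low, p_high, n_per_group, n_sims=2000, alpha_z=1.96, seed=1):
    """Monte Carlo power estimate for detecting a difference between two
    hit probabilities, using a two-proportion z-test on simulated data."""
    random.seed(seed)
    rejections = 0
    for _ in range(n_sims):
        # Simulate binary (hit/miss) responses at the two factor settings.
        hits_low = sum(random.random() < p_low for _ in range(n_per_group))
        hits_high = sum(random.random() < p_high for _ in range(n_per_group))
        p1, p2 = hits_low / n_per_group, hits_high / n_per_group
        pooled = (hits_low + hits_high) / (2 * n_per_group)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if se > 0 and abs(p2 - p1) / se > alpha_z:
            rejections += 1
    return rejections / n_sims

power = simulate_power(0.5, 0.8, n_per_group=50)
print(f"Estimated power: {power:.3f}")
```

The same loop structure generalizes to a full logistic-regression fit per iteration; the key idea is that cheap computation now makes direct simulation competitive with SNR approximations.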

Statistical Advantages of Validated Surveys over Custom Surveys

Surveys play an important role in quantifying user opinion during test and evaluation (T&E). Current best practice is to use surveys that have been tested, or “validated,” to ensure that they produce reliable and accurate results. However, unvalidated (“custom”) surveys are still widely used in T&E, raising questions about how to determine sample sizes for—and interpret data from—T&E events that rely on custom surveys. In this presentation, I characterize the statistical properties of validated and custom survey responses using data from recent T&E events, and then I demonstrate how these properties affect test design, analysis, and interpretation....

2024 · Adam Miller

Comparing Normal and Binary D-Optimal Designs by Statistical Power

In many Department of Defense test and evaluation applications, binary response variables are unavoidable. Many have considered D-optimal design of experiments for generalized linear models. However, little consideration has been given to assessing how these new designs perform in terms of statistical power for a given hypothesis test. Monte Carlo simulations and exact power calculations suggest that normal D-optimal designs generally yield higher power than binary D-optimal designs, despite logistic regression being used in the analysis after the data have been collected....

2023 · Addison Adams

Framework for Operational Test Design- An Example Application of Design Thinking

This poster provides an example of how a design thinking framework can facilitate operational test design. Design thinking is a problem-solving approach of interest to many groups, including those in the test and evaluation community. Design thinking promotes the principles of human-centeredness, iteration, and diversity, and it can be accomplished via a five-phased approach. Following this approach, designers create innovative product solutions by (1) conducting research to empathize with their users, (2) defining specific user problems, (3) ideating on solutions that address the defined problems, (4) prototyping the product, and (5) testing the prototype....

2023 · Miriam Armstrong

Introduction to Design of Experiments in R- Generating and Evaluating Designs with Skpr

This workshop instructs attendees on how to run an end-to-end optimal Design of Experiments workflow in R using the open-source skpr package. The workshop is split into two sections: optimal design generation and design evaluation. The first half provides basic instructions on how to use R, as well as how to use skpr to create an optimal design for an experiment: how to specify a model, create a candidate set of potential runs, remove disallowed combinations, and specify the design generation conditions to best suit an experimenter’s goals....

2023 · Tyler Morgan-Wall
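The D-criterion that drives optimal design generation in tools like skpr can be sketched in a few lines. The following is not skpr (which is an R package); it is a toy Python illustration of the quantity being optimized, |X'X| for a main-effects model, comparing a 2x2 full factorial against a hypothetical unbalanced four-run design.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def d_criterion(runs):
    """|X'X| for a main-effects model y ~ 1 + x1 + x2 with coded (+/-1) runs.
    Larger values mean more precise coefficient estimates overall."""
    X = [[1, x1, x2] for x1, x2 in runs]
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)]
           for i in range(3)]
    return det3(xtx)

full_factorial = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
unbalanced = [(1, 1), (1, -1), (-1, 1), (1, 1)]  # hypothetical inferior design
print(d_criterion(full_factorial), d_criterion(unbalanced))
```

A design generator repeatedly swaps candidate runs in and out to maximize this determinant; design evaluation then checks power and estimability for the chosen design.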

Introduction to Measuring Situational Awareness in Mission-Based Testing Scenarios

Situation Awareness (SA) plays a key role in decision making and human performance: higher operator SA is associated with increased operator performance and decreased operator errors. While maintaining or improving “situational awareness” is a common requirement for systems under test, there is no single standardized method or metric for quantifying SA in operational testing (OT). This leads to varied and sometimes suboptimal treatments of SA measurement across programs and test events....

2023 · Elizabeth Green, Miriam Armstrong, Janna Mantua

Introduction to Git

Version control software manages, archives, and (optionally) distributes different versions of files. The most popular program for version control is Git, which serves as the backbone of websites such as Github, Bitbucket, and others. In this mini-tutorial, we will introduce basics of version control in general, and Git in particular. We explain what role Git plays in a reproducible research context. The goal of the course is to get participants started using Git....

2022 · Curtis Miller

Measuring Training Efficacy- Structural Validation of the Operational Assessment of Training Scale

Effective training of the broad set of users/operators of systems has downstream impacts on usability, workload, and ultimate system performance that are related to mission success. In order to measure training effectiveness, we designed a survey called the Operational Assessment of Training Scale (OATS) in partnership with the Army Test and Evaluation Center (ATEC). Two subscales were designed to assess the degrees to which training covered relevant content for real operations (Relevance subscale) and enabled self-rated ability to interact with systems effectively after training (Efficacy subscale)....

2022 · Brian Vickers, Rachel Haga, Daniel Porter, Heather Wojton

Thoughts on Applying Design of Experiments (DOE) to Cyber Testing

This briefing, presented at DATAWorks 2022, provides examples of potential ways in which Design of Experiments (DOE) could be applied to initially scope cyber assessments and, based on the results of those assessments, subsequently design cyber tests in greater detail. Suggested Citation: Gilmore, James M., Kelly M. Avery, Matthew R. Girardi, and Rebecca M. Medlin. Thoughts on Applying Design of Experiments (DOE) to Cyber Testing. IDA Document NS D-33023. Alexandria, VA: Institute for Defense Analyses, 2022....

2022 · Michael Gilmore, Rebecca Medlin, Kelly Avery, Matthew Girardi

Topological Modeling of Human-Machine Teams

A Human-Machine Team (HMT) is a group of agents consisting of at least one human and at least one machine, all functioning collaboratively toward one or more common objectives. As industry and defense find more helpful, creative, and difficult applications of AI-driven technology, the need to effectively and accurately model, simulate, test, and evaluate HMTs will continue to grow and become even more essential. To meet that growing need, new methods are required to evaluate whether a human-machine team is performing effectively as a team in test and evaluation scenarios....

2022 · Leonard Wilkins, Caitlan Fealing

What Statisticians Should Do to Improve M&S Validation Studies

It is often said that many research findings – from social sciences, medicine, economics, and other disciplines – are false. This fact is trumpeted in the media and by many statisticians. There are several reasons that false research is published, but to what extent should we be worried about them in defense testing and modeling and simulation? In this talk I will present several recommendations for actions that statisticians and data scientists can take to improve the quality of our validations and evaluations....

2022 · John Haman

Introduction to Qualitative Methods

Qualitative data, captured through free-form comment boxes, interviews, focus groups, and activity observation is heavily employed in testing and evaluation (T&E). The qualitative research approach can offer many benefits, but knowledge of how to implement methods, collect data, and analyze data according to rigorous qualitative research standards is not broadly understood within the T&E community. This tutorial offers insight into the foundational concepts of method and practice that embody defensible approaches to qualitative research....

2021 · Kristina Carter, Emily Fedele, Daniel Hellmann

A Validation Case Study- The Environment Centric Weapons Analysis Facility (ECWAF)

Reliable modeling and simulation (M&S) allows the undersea warfare community to understand torpedo performance in scenarios that could never be created in live testing, and do so for a fraction of the cost of an in-water test. The Navy hopes to use the Environment Centric Weapons Analysis Facility (ECWAF), a hardware-in-the-loop simulation, to predict torpedo effectiveness and supplement live operational testing. In order to trust the model’s results, the T&E community has applied rigorous statistical design of experiments techniques to both live and simulation testing....

2020 · Elliot Bartis, Steven Rabinowitz

Bayesian Component Reliability- An F-35 Case Study

A challenging aspect of a system reliability assessment is integrating multiple sources of information, such as component, subsystem, and full-system data, along with previous test data or subject matter expert (SME) opinion. A powerful feature of Bayesian analyses is the ability to combine these multiple sources of data and variability in an informed way to perform statistical inference. This feature is particularly valuable in assessing system reliability where testing is limited and only a small number of failures (or none at all) are observed....

2019 · Rebecca Medlin, V. Bram Lillard
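The core Bayesian move described above, folding SME opinion and limited test data into one inference, can be sketched with a conjugate beta-binomial model. This is a generic illustration, not the F-35 analysis: the prior Beta(8, 2) (belief centered near 0.8 reliability) and the 18-of-20 pass count are hypothetical numbers chosen for the example.

```python
import random
from statistics import mean

# Hypothetical SME prior: Beta(8, 2), centered near 0.8 reliability.
a_prior, b_prior = 8, 2
# Hypothetical component test result: 18 successes, 2 failures.
successes, failures = 18, 2

# Conjugate update: posterior is Beta(a_prior + s, b_prior + f).
a_post = a_prior + successes
b_post = b_prior + failures

# Summarize the posterior by Monte Carlo sampling.
random.seed(7)
draws = sorted(random.betavariate(a_post, b_post) for _ in range(20000))
post_mean = mean(draws)
lo, hi = draws[int(0.05 * len(draws))], draws[int(0.95 * len(draws))]
print(f"Posterior mean {post_mean:.3f}, 90% credible interval ({lo:.3f}, {hi:.3f})")
```

With few or zero observed failures, the prior keeps the interval from collapsing to an overconfident point estimate, which is exactly the benefit the abstract highlights.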

D-Optimal as an Alternative to Full Factorial Designs- a Case Study

The use of Bayesian statistics and experimental design as tools to scope testing and analyze data related to defense has increased in recent years. Planning a test using experimental design will allow testers to cover the operational space while maximizing the information obtained from each run. Understanding which factors can affect a detector’s performance can influence military tactics, techniques and procedures, and improve a commander’s situational awareness when making decisions in an operational environment....

2019 · Keyla Pagan-Rivera

Impact of Conditions which Affect Exploratory Factor Analysis

Some responses cannot be observed directly and must be inferred from multiple indirect measurements, for example human experiences accessed through a variety of survey questions. Exploratory Factor Analysis (EFA) is a data-driven method to optimally combine these indirect measurements to infer some number of unobserved factors. Ideally, EFA should identify how many unobserved factors the indirect measures help estimate (factor extraction), as well as accurately capture how well each indirect measure estimates each factor (parameter recovery)....

2019 · Kevin Krost, Daniel Porter, Stephanie Lane, Heather Wojton

M&S Validation for the Joint Air-to-Ground Missile

An operational test is resource-limited and must therefore rely on both live test data and modeling and simulation (M&S) data to inform a full evaluation. For the Joint Air-to-Ground Missile (JAGM) system, we needed to create a test design that accomplished dual goals: characterizing missile performance across the operational space and supporting rigorous validation of the M&S. The key question is which statistical techniques should be used to compare the M&S to the live data....

2019 · Brent Crabtree, Andrew Cseko, Thomas Johnson, Joel Williamson, Kelly Avery

Reproducible Research Mini-Tutorial

Analyses are reproducible if the same methods applied to the same data produce identical results when run again by another researcher (or you in the future). Reproducible analyses are transparent and easy for reviewers to verify, as results and figures can be traced directly to the data and methods that produced them. There are also direct benefits to the researcher. Real-world analysis workflows inevitably require changes to incorporate new or additional data, or to address feedback from collaborators, reviewers, or sponsors....

2019 · Andrew Flack, John Haman, Kevin Kirshenbaum

Statistics Boot Camp

In the test community, we frequently use statistics to extract meaning from data. These inferences may be drawn with respect to topics ranging from system performance to human factors. In this mini-tutorial, we will begin by discussing the use of descriptive and inferential statistics. We will continue by discussing commonly used parametric and nonparametric statistics within the defense community, ranging from comparisons of distributions to comparisons of means. We will conclude with a brief discussion of how to present your statistical findings graphically for maximum impact....

2019 · Kelly Avery, Stephanie Lane
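The parametric-versus-nonparametric comparison the boot camp covers can be made concrete with one small example. The data below are hypothetical (e.g., detection times for two systems), and the statistics are computed from scratch: Welch's t as the parametric comparison of means, and a Mann-Whitney U with a normal approximation as its nonparametric counterpart.

```python
import math
from statistics import mean, variance

group_a = [10, 12, 11, 13, 12, 11, 14, 12]   # hypothetical system A times
group_b = [15, 17, 16, 18, 15, 16, 17, 16]   # hypothetical system B times

# Parametric: Welch's two-sample t statistic (unequal variances).
n1, n2 = len(group_a), len(group_b)
t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(
    variance(group_a) / n1 + variance(group_b) / n2)

# Nonparametric: Mann-Whitney U via pairwise comparisons,
# with the large-sample normal approximation for the z statistic.
u = sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)
mu_u = n1 * n2 / 2
sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z_stat = (u - mu_u) / sigma_u

print(f"Welch t = {t_stat:.2f}, Mann-Whitney z = {z_stat:.2f}")
```

Both statistics point the same way here; the nonparametric version trades some power for robustness when the normality assumption behind the t-test is doubtful.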

Survey Testing Automation Tool (STAT)

In operational testing, survey administration is typically a manual, paper-driven process. We developed a web-based tool called Survey Testing Automation Tool (STAT), which integrates and automates survey construction, administration, and analysis procedures. STAT introduces a standardized approach to the construction of surveys and includes capabilities for survey management, survey planning, and form generation. Suggested Citation: Finnegan, Gary M., Kelly Tran, Tara A. McGovern, and William R. Whitledge. Survey Testing Automation Tool (STAT)....

2019 · Kelly Tran, Tara McGovern, William Whitledge

Improved Surface Gunnery Analysis with Continuous Data

Recasting gunfire data from binomial (hit/miss) to continuous (time-to-kill) allows us to draw statistical conclusions with tactical implications from free-play, live-fire surface gunnery events. Our analysis provided the Navy with suggestions for improvements to its tactics and the employment of its weapons. A censored analysis enabled us to do so, where other methods fell short. Suggested Citation: Ashwell, Benjamin A., V. Bram Lillard, and George M. Khoury. Improved Surface Gunnery Analysis with Continuous Data....

2018 · Benjamin Ashwell, V. Bram Lillard

Parametric Reliability Models Tutorial

This tutorial demonstrates how to plot reliability functions parametrically in R using the output from any reliability modeling software. It provides code and sample plots of reliability and failure rate functions with confidence intervals for three different skewed probability distributions: the exponential, the two-parameter Weibull, and the lognormal. These three distributions are the most common parametric models for reliability or survival analysis. This paper also provides mathematical background for the models and recommendations for when to use them....

2018 · William Whitledge
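The three reliability functions named in the abstract have closed forms that are easy to evaluate directly. The tutorial itself is in R; the sketch below is a Python analogue with illustrative parameter values, using only standard-library math (the lognormal survival function is built from `math.erf`).

```python
import math

def r_exponential(t, theta):
    """Exponential reliability R(t) = exp(-t/theta), mean life theta."""
    return math.exp(-t / theta)

def r_weibull(t, beta, eta):
    """Two-parameter Weibull reliability with shape beta and scale eta."""
    return math.exp(-((t / eta) ** beta))

def r_lognormal(t, mu, sigma):
    """Lognormal reliability: 1 - Phi((ln t - mu) / sigma),
    with the normal CDF Phi written in terms of erf."""
    z = (math.log(t) - mu) / sigma
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# Illustrative parameters: mean life 100 (exponential), shape 1.5 (Weibull),
# median 100 with log-scale sigma 0.5 (lognormal).
for t in (50, 100, 200):
    print(t,
          round(r_exponential(t, 100), 3),
          round(r_weibull(t, 1.5, 100), 3),
          round(r_lognormal(t, math.log(100), 0.5), 3))
```

Note the sanity checks these forms admit: a Weibull with shape 1 reduces to the exponential, and a lognormal evaluated at its median returns exactly 0.5.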

Comparing Live Missile Fire and Simulation

Modeling and simulation is frequently used in test and evaluation (T&E) of air-to-air weapon systems to evaluate the effectiveness of a weapon. The Air Intercept Missile-9X (AIM-9X) program uses modeling and simulation extensively to evaluate missile miss distances. Since flight testing is expensive, the test program uses relatively few flight tests and supplements those data with large numbers of miss distances from simulated tests across the weapon’s operational space. However, before modeling and simulation can be used to predict performance it must first be validated....

2017 · Rebecca Medlin, Pamela Rambow, Douglas Peek

Bayesian Analysis in R/STAN

In an era of reduced budgets and limited testing, verifying that requirements have been met in a single test period can be challenging, particularly using traditional analysis methods that ignore all available information. The Bayesian paradigm is tailor made for these situations, allowing for the combination of multiple sources of data and resulting in more robust inference and uncertainty quantification. Consequently, Bayesian analyses are becoming increasingly popular in T&E. This tutorial briefly introduces the basic concepts of Bayesian Statistics, with implementation details illustrated in R through two case studies: reliability for the Core Mission functional area of the Littoral Combat Ship (LCS) and performance curves for a chemical detector in the Bio-chemical Detection System (BDS) with different agents and matrices....

2016 · Kassandra Fronczyk

Censored Data Analysis Methods for Performance Data- A Tutorial

Binomial metrics like probability-to-detect or probability-to-hit typically do not extract the maximum information from testing. Continuous metrics such as time-to-detect provide more information, but do not account for non-detects. Censored data analysis allows us to account for both pieces of information simultaneously. Suggested Citation: Lillard, V. Bram. Censored Data Analysis Methods for Performance Data: A Tutorial. IDA Document NS D-5811. Alexandria, VA: Institute for Defense Analyses, 2016....

2016 · V. Bram Lillard
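The censoring idea in the tutorial can be shown with the simplest case: exponential times-to-detect with right-censored non-detects. The data below are hypothetical (four detections plus two trials that ended at 30 minutes with no detection). The exponential MLE for mean time-to-detect under right-censoring is total exposure time divided by the number of observed events; the naive mean that treats censored times as detections understates the true mean.

```python
# Hypothetical trials: (time, observed). observed=False means the trial
# ended at `time` with no detection (right-censored observation).
data = [(5.0, True), (8.0, True), (12.0, True), (20.0, True),
        (30.0, False), (30.0, False)]

total_time = sum(t for t, _ in data)          # total exposure across trials
n_events = sum(1 for _, obs in data if obs)   # detections actually observed

# Exponential MLE for mean time-to-detect with right-censoring:
# every trial contributes its time at risk, but only detections count as events.
mle_mean = total_time / n_events

# Naive estimate that wrongly treats censored times as detections.
naive_mean = total_time / len(data)

print(f"Censored MLE: {mle_mean:.2f}, naive mean: {naive_mean:.2f}")
```

The censored estimate is larger because the non-detects tell us only that detection would have taken more than 30 minutes; discarding or mislabeling them biases the metric optimistically toward fast detection.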

Science of Test Workshop Proceedings, April 11-13, 2016

To mark IDA’s 60th anniversary, we are conducting a series of workshops and symposia that bring together IDA sponsors, researchers, experts inside and outside government, and other stakeholders to discuss issues of the day. These events focus on future national security challenges, reflecting on how past lessons and accomplishments help prepare us to deal with complex issues and environments we face going forward. This publication represents the proceedings of the Science of Test Workshop....

2016 · Laura Freeman, Pamela Rambow, Jonathan Snavely