A Practitioner’s Framework for Federated Model Validation Resource Allocation
Recent advances in computation and statistics have led to an increasing use of federated models for end-to-end system test and evaluation. A federated model is a collection of interconnected models in which the outputs of one model act as inputs to subsequent models. However, the process of verifying and validating federated models is poorly understood, especially when testers have limited resources, knowledge-based uncertainties, and concerns over operational realism. Testers often struggle with determining how to best allocate limited test resources for model validation....
A Preview of Functional Data Analysis for Modeling and Simulation Validation
Modeling and simulation (M&S) validation for operational testing often involves comparing live data with simulation outputs. Statistical methods known as functional data analysis (FDA) provide techniques for analyzing large data sets (“large” meaning that a single trial has a lot of information associated with it), such as radar tracks. We preview how FDA methods could assist M&S validation by providing statistical tools for handling these large data sets. This may facilitate analyses that make use of more of the available data and thus allow for better detection of differences between M&S predictions and live test results....
A Reliability Assurance Test Planning and Analysis Tool
This presentation documents the work of IDA 2024 Summer Associate Emma Mitchell. The work presented details an R Shiny application developed to provide a user-friendly software tool for researchers to use in planning for and analyzing system reliability. Specifically, the presentation details how one can plan for a reliability test using Bayesian Reliability Assurance test methods. Such tests utilize supplementary data and information, including reliability models, prior test results, expert judgment, and knowledge of environmental conditions, to plan for reliability testing, which in turn can often help in reducing the required amount of testing....
Determining the Necessary Number of Runs in Computer Simulations with Binary Outcomes
How many success-or-failure observations should we collect from a computer simulation? Often, researchers use space-filling design of experiments when planning modeling and simulation (M&S) studies. We are not satisfied with existing guidance on justifying the number of runs when developing these designs, either because the guidance is insufficiently justified, does not provide an unambiguous answer, or is not based on optimizing a statistical measure of merit. Analysts should use confidence interval margin of error as the statistical measure of merit for M&S studies intended to characterize overall M&S behavioral trends....
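The margin-of-error logic can be made concrete in a few lines of R. This is a minimal sketch, not taken from the presentation, using the conservative worst case p = 0.5 and a 95% confidence level:

    # Margin of error of a normal-approximation CI for a proportion
    margin_of_error <- function(n, p = 0.5, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      z * sqrt(p * (1 - p) / n)
    }

    # Smallest number of runs with margin of error <= 0.05:
    n_runs <- ceiling(0.25 * (qnorm(0.975) / 0.05)^2)
    n_runs                    # 385 runs
    margin_of_error(n_runs)   # ~0.0499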
Developing AI Trust: From Theory to Testing and the Myths in Between
This introductory work aims to provide members of the Test and Evaluation community with a clear understanding of trust and trustworthiness to support responsible and effective evaluation of AI systems. The paper provides a set of working definitions and works toward dispelling confusion and myths surrounding trust. Suggested Citation Razin, Yosef S., and Kristen Alexander. “Developing AI Trust: From Theory to Testing and the Myths in Between.” The ITEA Journal of Test and Evaluation 45, no....
Introduction to Human-Systems Interaction in Operational Test and Evaluation Course
Human-System Interaction (HSI) is the study of interfaces between humans and technical systems. The Department of Defense incorporates HSI evaluations into defense acquisition to improve system performance and reduce lifecycle costs. During operational test and evaluation, HSI evaluations characterize how a system’s operational performance is affected by its users. The goal of this course is to provide the theoretical background and practical tools necessary to develop and evaluate HSI test plans, collect and analyze HSI data, and report on HSI results....
Meta-Analysis of the Effectiveness of the SALIANT Procedure for Assessing Team Situation Awareness
Many Department of Defense (DoD) systems aim to increase or maintain Situational Awareness (SA) at the individual or group level. In some cases, maintenance or enhancement of SA is listed as a primary function or requirement of the system. However, during test and evaluation, SA is examined inconsistently or is not measured at all. Situational Awareness Linked Indicators Adapted to Novel Tasks (SALIANT) is an empirically based methodology meant to measure SA at the team, or group, level....
Operational T&E of AI-Supported Data Integration, Fusion, and Analysis Systems
AI will play an important role in future military systems. However, large questions remain about how to test AI systems, especially in operational settings. Here, we discuss an approach for the operational test and evaluation (OT&E) of AI-supported data integration, fusion, and analysis systems. We highlight new challenges posed by AI-supported systems and we discuss new and existing OT&E methods for overcoming them. We demonstrate how to apply these OT&E methods via a notional test concept that focuses on evaluating an AI-supported data integration system in terms of its technical performance (how accurate is the AI output?...
Quantifying Uncertainty to Keep Astronauts and Warfighters Safe
Both NASA and DOT&E increasingly rely on computer models to supplement data collection, and utilize statistical distributions to quantify the uncertainty in models, so that decision-makers are equipped with the most accurate information about system performance and model fitness. This article provides a high-level overview of uncertainty quantification (UQ) through an example assessment for the reliability of a new space-suit system. The goal is to reach a more general audience in Significance Magazine, and convey the importance and relevance of statistics to the defense and aerospace communities....
Sequential Space-Filling Designs for Modeling & Simulation Analyses
Space-filling designs (SFDs) are a rigorous method for designing modeling and simulation (M&S) studies. However, they are hindered by their requirement to choose the final sample size prior to testing. Sequential designs are an alternative that can increase test efficiency by testing small amounts of data at a time. We have conducted a literature review of existing sequential space-filling designs and found the methods most applicable to the test and evaluation (T&E) community....
Simulation Insights on Power Analysis with Binary Responses: From SNR Methods to 'skprJMP'
Logistic regression is a commonly used method for analyzing tests with probabilistic responses in the test community, yet calculating power for these tests has historically been challenging. This difficulty prompted the development of methods based on signal-to-noise ratio (SNR) approximations over the last decade, tailored to address the intricacies of logistic regression’s binary outcomes. However, advancements and improvements in statistical software and computational power have reduced the need for such approximate methods....
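For readers who want to see what the exact, simulation-based alternative looks like, here is a minimal Monte Carlo power sketch for a one-factor logistic regression; the effect sizes, sample size, and factor range are invented for illustration and are not from the talk:

    # Estimate power by simulating data from an assumed logistic model
    set.seed(1)
    power_sim <- function(n, b0 = -1, b1 = 1, reps = 2000, alpha = 0.05) {
      x <- seq(-1, 1, length.out = n)
      p <- plogis(b0 + b1 * x)
      rejections <- replicate(reps, {
        y <- rbinom(n, 1, p)
        fit <- glm(y ~ x, family = binomial)
        summary(fit)$coefficients["x", "Pr(>|z|)"] < alpha
      })
      mean(rejections)
    }
    power_sim(100)   # estimated power to detect b1 at n = 100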
Statistical Advantages of Validated Surveys over Custom Surveys
Surveys play an important role in quantifying user opinion during test and evaluation (T&E). Current best practice is to use surveys that have been tested, or “validated,” to ensure that they produce reliable and accurate results. However, unvalidated (“custom”) surveys are still widely used in T&E, raising questions about how to determine sample sizes for—and interpret data from—T&E events that rely on custom surveys. In this presentation, I characterize the statistical properties of validated and custom survey responses using data from recent T&E events, and then I demonstrate how these properties affect test design, analysis, and interpretation....
Uncertainty Quantification for Ground Vehicle Vulnerability Simulation
A vulnerability assessment of a combat vehicle uses modeling and simulation (M&S) to predict the vehicle’s vulnerability to a given enemy attack. The system-level output of the M&S is the probability that the vehicle’s mobility is degraded as a result of the attack. The M&S models this system-level phenomenon by decoupling the attack scenario into a hierarchy of sub-systems. Each sub-system addresses a specific scientific problem, such as the fracture dynamics of an exploded munition, or the ballistic resistance provided by the vehicle’s armor....
A Team-Centric Metric Framework for Testing and Evaluation of Human-Machine Teams
We propose and present a parallelized metric framework for evaluating human-machine teams that draws upon current knowledge of human-systems interfacing and integration but is rooted in team-centric concepts. Humans and machines working together as a team involves interactions that will only increase in complexity as machines become more intelligent, capable teammates. Assessing such teams will require explicit focus on not just the human-machine interfacing but the full spectrum of interactions between and among agents....
AI + Autonomy T&E in DoD
Test and evaluation (T&E) of AI-enabled systems (AIES) often emphasizes algorithm accuracy over robust, holistic system performance. While this narrow focus may be adequate for some applications of AI, for many complex uses, T&E paradigms removed from operational realism are insufficient. However, leveraging traditional operational testing (OT) methods to evaluate AIESs can fail to capture novel sources of risk. This brief establishes a common AI vocabulary and highlights OT challenges posed by AIESs by answering the following questions...
CDV Method for Validating AJEM using FUSL Test Data
M&S validation is critical for ensuring credible weapon system evaluations. System-level evaluations of Armored Fighting Vehicles (AFV) rely on the Advanced Joint Effectiveness Model (AJEM) and Full-Up System Level (FUSL) testing to assess AFV vulnerability. This report reviews and improves upon one of the primary methods that analysts use to validate AJEM, called the Component Damage Vector (CDV) Method. The CDV Method compares vehicle components that were damaged in FUSL testing to simulated representations of that damage from AJEM....
Comparing Normal and Binary D-Optimal Designs by Statistical Power
In many Department of Defense test and evaluation applications, binary response variables are unavoidable. Many have considered D-optimal design of experiments for generalized linear models. However, little consideration has been given to assessing how these new designs perform in terms of statistical power for a given hypothesis test. Monte Carlo simulations and exact power calculations suggest that normal D-optimal designs generally yield higher power than binary D-optimal designs, despite using logistic regression in the analysis after data have been collected....
Data Principles for Operational and Live-Fire Testing
Many DoD systems undergo operational testing, which is a field test involving realistic combat conditions. Data, analysis, and reporting are the fundamental outcomes of operational test, which support leadership decisions. The importance of data standardization and interoperability is widely recognized by leadership in the DoD; however, there are no generally recognized standards for the management and handling of data (format, pedigree, architecture, transferability, etc.) in the DoD. In this presentation, I will review a set of data principles that we believe the DoD should adopt to improve how it manages test data....
Development of Wald-Type and Score-Type Statistical Tests to Compare Live Test Data and Simulation Predictions
This work describes the development of a statistical test created in support of ongoing verification, validation, and accreditation (VV&A) efforts for modeling and simulation (M&S) environments. The test computes a Wald-type statistic comparing two generalized linear models estimated from live test data and analogous simulated data. The resulting statistic indicates whether the M&S outputs differ from the live data. After developing the test, we applied it to two logistic regression models estimated from live torpedo test data and simulated data from the Naval Undersea Warfare Center’s Environment Centric Weapons Analysis Facility (ECWAF)....
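The general shape of such a test can be sketched in a few lines of R. This is an illustration of the idea, not the authors' exact statistic: fit the same logistic model to live and simulated data, then compare the coefficient vectors with a Wald statistic, which is approximately chi-squared when the two models agree:

    # Wald-type comparison of two independently fitted GLMs
    wald_compare <- function(fit_live, fit_sim) {
      d <- coef(fit_live) - coef(fit_sim)
      V <- vcov(fit_live) + vcov(fit_sim)  # independent data sets
      W <- as.numeric(t(d) %*% solve(V) %*% d)
      c(statistic = W,
        p_value = pchisq(W, df = length(d), lower.tail = FALSE))
    }

    # Usage (variable and factor names hypothetical):
    # fit_live <- glm(hit ~ range + depth, family = binomial, data = live)
    # fit_sim  <- glm(hit ~ range + depth, family = binomial, data = sim)
    # wald_compare(fit_live, fit_sim)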
Framework for Operational Test Design: An Example Application of Design Thinking
This poster provides an example of how a design thinking framework can facilitate operational test design. Design thinking is a problem-solving approach of interest to many groups, including those in the test and evaluation community. Design thinking promotes the principles of human-centeredness, iteration, and diversity, and it can be accomplished via a five-phased approach. Following this approach, designers create innovative product solutions by (1) conducting research to empathize with their users, (2) defining specific user problems, (3) ideating on solutions that address the defined problems, (4) prototyping the product, and (5) testing the prototype....
Implementing Fast Flexible Space-Filling Designs in R
Modeling and simulation (M&S) can be a useful tool when testers and evaluators need to augment the data collected during a test event. When planning M&S, testers use experimental design techniques to determine how much and which types of data to collect, and they can use space-filling designs to spread out test points across the operational space. Fast flexible space-filling designs (FFSFDs) are a type of space-filling design useful for M&S because they work well in design spaces with disallowed combinations and permit the inclusion of categorical factors....
Improving Test Efficiency: A Bayesian Assurance Case Study
To improve test planning for evaluating system reliability, we propose the use of Bayesian methods to incorporate supplementary data and reduce testing duration. Furthermore, we recommend Bayesian methods be employed in the analysis phase to better quantify uncertainty. We find that using Bayesian methods for test planning allows us to scope smaller tests, and that using Bayesian methods in analysis yields a more precise estimate of reliability, improving uncertainty quantification....
Introduction to Design of Experiments for Testers
This training provides details regarding the use of design of experiments, from choosing proper response variables, to identifying factors that could affect such responses, to determining the amount of data necessary to collect. The training also explains the benefits of using a Design of Experiments approach to testing and provides an overview of commonly used designs (e.g., factorial, optimal, and space-filling). The briefing illustrates the concepts discussed using several case studies....
Introduction to Design of Experiments in R- Generating and Evaluating Designs with Skpr
This workshop instructs attendees on how to run an end-to-end optimal Design of Experiments workflow in R using the open source skpr package. The workshop is split into two sections: optimal design generation and design evaluation. The first half of the workshop provides basic instructions on how to use R, as well as how to use skpr to create an optimal design for an experiment: how to specify a model, create a candidate set of potential runs, remove disallowed combinations, and specify the design generation conditions to best suit an experimenter’s goals....
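As a taste of that workflow, here is a minimal end-to-end skpr example; the factor names, disallowed combination, and run budget are invented for illustration:

    library(skpr)

    # 1. Candidate set of potential runs
    candidates <- expand.grid(range = c("short", "medium", "long"),
                              speed = seq(10, 30, by = 5))

    # 2. Remove a disallowed combination
    candidates <- subset(candidates, !(range == "long" & speed == 30))

    # 3. Generate a D-optimal design and evaluate its power
    design <- gen_design(candidateset = candidates,
                         model = ~ range + speed, trials = 12)
    eval_design(design, model = ~ range + speed, alpha = 0.05)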
Introduction to Measuring Situational Awareness in Mission-Based Testing Scenarios
Situation Awareness (SA) plays a key role in decision making and human performance; higher operator SA is associated with increased operator performance and decreased operator errors. While maintaining or improving “situational awareness” is a common requirement for systems under test, there is no single standardized method or metric for quantifying SA in operational testing (OT). This leads to varied and sometimes suboptimal treatments of SA measurement across programs and test events....
Statistical Methods Development Work for M&S Validation
We discuss four areas in which statistically rigorous methods contribute to modeling and simulation validation studies. These areas are statistical risk analysis, space-filling experimental designs, metamodel construction, and statistical validation. Taken together, these areas implement DOT&E guidance on model validation. In each area, IDA has contributed either research methods, user-friendly tools, or both. We point to our tools on testscience.org and survey the research methods that we’ve contributed to the M&S validation literature.
Statistical Methods for M&S V&V: An Intro for Non-Statisticians
This briefing is intended to motivate and explain the basic concepts of applying statistics to verification and validation. It will be presented at the Navy M&S VV&A WG (Sub-WG on Validation Statistical Method Selection). Suggested Citation Pagan-Rivera, Keyla, John T Haman, Kelly M Avery, and Curtis G Miller. Statistical Methods for M&S V&V: An Intro for Non-Statisticians. IDA Product ID-3000770. Alexandria, VA: Institute for Defense Analyses, 2024....
Analysis Apps for the Operational Tester
In the acquisition and testing world, data analysts repeatedly encounter certain categories of data, such as time or distance until an event (e.g., failure, alert, detection), binary outcomes (e.g., success/failure, hit/miss), and survey responses. Analysts need tools that enable them to produce quality and timely analyses of the data they acquire during testing. This poster presents four web-based apps that can analyze these types of data. The apps are designed to assist analysts and researchers with simple repeatable analysis tasks, such as building summary tables and plots for reports or briefings....
Case Study on Applying Sequential Analyses in Operational Testing
Sequential analysis concerns statistical evaluation in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends on the information acquired during the investigation. Although sequential analysis originated in ballistics testing for the Department of Defense (DoD) and is widely used in other disciplines, it is underutilized in the DoD. Expanding the use of sequential analysis may save money and reduce test time....
Introduction to Git
Version control software manages, archives, and (optionally) distributes different versions of files. The most popular program for version control is Git, which serves as the backbone of websites such as GitHub, Bitbucket, and others. In this mini-tutorial, we will introduce the basics of version control in general, and Git in particular. We explain what role Git plays in a reproducible research context. The goal of the course is to get participants started using Git....
Measuring Training Efficacy: Structural Validation of the Operational Assessment of Training Scale
Effective training of the broad set of users/operators of systems has downstream impacts on usability, workload, and ultimate system performance that are related to mission success. In order to measure training effectiveness, we designed a survey called the Operational Assessment of Training Scale (OATS) in partnership with the Army Test and Evaluation Center (ATEC). Two subscales were designed to assess the degrees to which training covered relevant content for real operations (Relevance subscale) and enabled self-rated ability to interact with systems effectively after training (Efficacy subscale)....
Metamodeling Techniques for Verification and Validation of Modeling and Simulation Data
Modeling and simulation (M&S) outputs help the Director, Operational Test and Evaluation (DOT&E) assess the effectiveness, survivability, lethality, and suitability of systems. To use M&S outputs, DOT&E needs models and simulators to be sufficiently verified and validated. The purpose of this paper is to improve the state of verification and validation by recommending and demonstrating a set of statistical techniques—metamodels, also called statistical emulators—to the M&S community. The paper expands on DOT&E’s existing guidance about metamodel usage by creating methodological recommendations the M&S community could apply to its activities....
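In its simplest form, a metamodel is just a cheap statistical fit to M&S runs that can stand in for the simulation elsewhere in the input space. The sketch below uses an ordinary polynomial regression as the emulator; the factors, response, and data are invented for illustration and are not the paper's demonstration:

    # Fit a polynomial metamodel to notional M&S output
    set.seed(1)
    runs <- data.frame(speed = runif(50, 5, 25), range = runif(50, 1, 10))
    runs$p_hit <- plogis(1 - 0.05 * runs$speed - 0.10 * runs$range +
                           rnorm(50, sd = 0.2))
    emulator <- lm(p_hit ~ poly(speed, 2) + poly(range, 2), data = runs)

    # Predict the M&S response at an untried input combination
    predict(emulator, newdata = data.frame(speed = 15, range = 5))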
Predicting Trust in Automated Systems: An Application of TOAST
Following Wojton’s research on the Trust of Automated Systems Test (TOAST), which is designed to measure how much a human trusts an automated system, we aimed to determine how well this scale performs when not used in a military context. We found that participants who used a poorly performing automated system trusted the system less than expected when using it on a case-by-case basis; however, those who used a high-performing system trusted the system as much as they expected....
Thoughts on Applying Design of Experiments (DOE) to Cyber Testing
This briefing presented at Dataworks 2022 provides examples of potential ways in which Design of Experiments (DOE) could be applied to initially scope cyber assessments and, based on the results of those assessments, subsequently design in greater detail cyber tests. Suggested Citation Gilmore, James M, Kelly M Avery, Matthew R Girardi, and Rebecca M Medlin. Thoughts on Applying Design of Experiments (DOE) to Cyber Testing. IDA Document NS D-33023. Alexandria, VA: Institute for Defense Analyses, 2022....
Topological Modeling of Human-Machine Teams
A Human-Machine Team (HMT) is a group of agents consisting of at least one human and at least one machine, all functioning collaboratively towards one or more common objectives. As industry and defense find more helpful, creative, and difficult applications of AI-driven technology, the need to effectively and accurately model, simulate, test, and evaluate HMTs will continue to grow and become even more essential. Along with that growing need, new methods are required to evaluate whether a human-machine team is performing effectively as a team in testing and evaluation scenarios....
What Statisticians Should Do to Improve M&S Validation Studies
It is often said that many research findings – from the social sciences, medicine, economics, and other disciplines – are false. This fact is trumpeted in the media and by many statisticians. There are several reasons that false research is published, but to what extent should we be worried about them in defense testing and modeling and simulation? In this talk I will present several recommendations for actions that statisticians and data scientists can take to improve the quality of our validations and evaluations....
Artificial Intelligence & Autonomy Test & Evaluation Roadmap Goals
As the Department of Defense acquires new systems with artificial intelligence (AI) and autonomous (AI&A) capabilities, the test and evaluation (T&E) community will need to adapt to the challenges that these novel technologies present. The goals listed in this AI Roadmap address the broad range of tasks that the T&E community will need to achieve in order to properly test, evaluate, verify, and validate AI-enabled and autonomous systems. It includes issues that are unique to AI and autonomous systems, as well as legacy T&E shortcomings that will be compounded by newer technologies....
Determining How Much Testing is Enough: An Exploration of Progress in the Department of Defense Test and Evaluation Community
This paper describes holistic progress in answering the question of “How much testing is enough?” It covers areas in which the T&E community has made progress, areas in which progress remains elusive, and issues that have emerged since 1994 that provide additional challenges. The selected case studies used to highlight progress are especially interesting examples, rather than a comprehensive look at all programs since 1994. Suggested Citation Medlin, Rebecca, Matthew R Avery, James R Simpson, and Heather M Wojton....
Introduction to Bayesian Analysis
As operational testing becomes increasingly integrated and research questions become more difficult to answer, IDA’s Test Science team has found Bayesian models to be powerful data analysis methods. Analysts and decision-makers should understand the differences between this approach and the conventional way of analyzing data. It is also important to recognize when an analysis could benefit from the inclusion of prior information—what we already know about a system’s performance—and to understand the proper way to incorporate that information....
Introduction to Qualitative Methods
Qualitative data, captured through free-form comment boxes, interviews, focus groups, and activity observation, are heavily employed in testing and evaluation (T&E). The qualitative research approach can offer many benefits, but knowledge of how to implement methods, collect data, and analyze data according to rigorous qualitative research standards is not broadly understood within the T&E community. This tutorial offers insight into the foundational concepts of method and practice that embody defensible approaches to qualitative research....
Space-Filling Designs for Modeling & Simulation
This document presents arguments and methods for using space-filling designs (SFDs) to plan modeling and simulation (M&S) data collection. Suggested Citation Avery, Kelly, John T Haman, Thomas Johnson, Curtis Miller, Dhruv Patel, and Han Yi. Test Design Challenges in Defense Testing. IDA Product ID 3002855. Alexandria, VA: Institute for Defense Analyses, 2024.
Warhead Arena Analysis Advancements
Fragmentation analysis is a critical piece of the live fire test and evaluation (LFT&E) of the lethality and vulnerability aspects of warheads. But the traditional methods for data collection are expensive and laborious. New optical tracking technology is promising to increase the fidelity of fragmentation data, and decrease the time and costs associated with data collection. However, the new data will be complex, three-dimensional “fragmentation clouds,” possibly with a time component as well, and there will be a larger number of individual data points....
Why are Statistical Engineers Needed for Test & Evaluation?
The Department of Defense (DoD) develops and acquires some of the world’s most advanced and sophisticated systems. As new technologies emerge and are incorporated into systems, OSD/DOT&E faces the challenge of ensuring that these systems undergo adequate and efficient test and evaluation (T&E) prior to operational use. Statistical engineering is a collaborative, analytical approach to problem solving that integrates statistical thinking, methods, and tools with other relevant disciplines. The statistical engineering process provides better solutions to large, unstructured, real-world problems and supports rigorous decision-making....
A Review of Sequential Analysis
Sequential analysis concerns statistical evaluation in situations in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends upon the information acquired throughout the course of the investigation. Expanding the use of sequential analysis has the potential to save resources and reduce test time (National Research Council, 1998). This paper summarizes the literature on sequential analysis and offers fundamental information for providing recommendations for its use in DoD test and evaluation....
A Validation Case Study: The Environment Centric Weapons Analysis Facility (ECWAF)
Reliable modeling and simulation (M&S) allows the undersea warfare community to understand torpedo performance in scenarios that could never be created in live testing, and to do so for a fraction of the cost of an in-water test. The Navy hopes to use the Environment Centric Weapons Analysis Facility (ECWAF), a hardware-in-the-loop simulation, to predict torpedo effectiveness and supplement live operational testing. In order to trust the model’s results, the T&E community has applied rigorous statistical design of experiments techniques to both live and simulation testing....
Circular Prediction Regions for Miss Distance Models under Heteroskedasticity
Circular prediction regions are used in ballistic testing to express the uncertainty in shot accuracy. We compare two modeling approaches for estimating circular prediction regions for the miss distance of a ballistic projectile. The miss distance response variable is bivariate normal and has a mean and variance that can change with one or more experimental factors. The first approach fits a heteroskedastic linear model using restricted maximum likelihood, and uses the Kenward-Roger statistic to estimate circular prediction regions....
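For orientation, the simplest special case is easy to compute by hand: when the miss distance is bivariate normal with independent components and a common, known standard deviation, a central 95% circular region has radius sigma * sqrt(qchisq(0.95, 2)). A short R sketch with illustrative numbers:

    sigma  <- 2.5                                # miss-distance SD, meters
    radius <- sigma * sqrt(qchisq(0.95, df = 2))
    radius   # ~6.12 m: 95% of shots expected inside this circle

The paper's methods generalize this picture by letting the mean and variance change with experimental factors and by accounting for estimation uncertainty, for example via the Kenward-Roger statistic.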
T&E Contributions to Avoiding Unintended Behaviors in Autonomous Systems
To provide assurance that AI-enabled systems will behave appropriately across the range of their operating conditions without performing exhaustive testing, the DoD will need to make inferences about system decision making. However, making these inferences validly requires understanding what causally drives system decision-making, which is not possible when systems are black boxes. In this briefing, we discuss the state of the art and gaps in techniques for obtaining, verifying, validating, and accrediting (OVVA) models of system decision-making....
Test & Evaluation of AI-Enabled and Autonomous Systems: A Literature Review
We summarize a subset of the literature regarding the challenges to and recommendations for the test, evaluation, verification, and validation (TEV&V) of autonomous military systems. This literature review is meant for informational purposes only and does not make any recommendations of its own. A synthesis of the literature identified the following categories of TEV&V challenges: problems arising from the complexity of autonomous systems, challenges imposed by the structure of the current acquisition system,...
Trustworthy Autonomy: A Roadmap to Assurance, Part 1: System Effectiveness
The Department of Defense (DoD) has invested significant effort over the past decade considering the role of artificial intelligence and autonomy in national security (e.g., Defense Science Board, 2012, 2016; Deputy Secretary of Defense, 2012; Endsley, 2015; Executive Order No. 13859, 2019; US Department of Defense, 2011, 2019; Zacharias, 2019a). However, these efforts were broadly scoped and only partially touched on how the DoD will certify the safety and performance of these systems....
Visualizing Data: I Don't Remember that Memo, but I Do Remember that Graph
IDA analysts strive to communicate clearly and effectively. Good data visualizations can enhance reports by making the conclusions easier to understand and more memorable. The goal of this seminar is to help you avoid settling for factory defaults and instead present your conclusions through visually appealing and understandable charts. Topics covered include choosing the right level of detail, guidelines for different types of graphical elements (titles, legends, annotations, etc.), selecting the right variable encodings (color, plot symbol, etc....
Bayesian Component Reliability: An F-35 Case Study
A challenging aspect of a system reliability assessment is integrating multiple sources of information, such as component, subsystem, and full-system data, along with previous test data or subject matter expert (SME) opinion. A powerful feature of Bayesian analyses is the ability to combine these multiple sources of data and variability in an informed way to perform statistical inference. This feature is particularly valuable in assessing system reliability where testing is limited and only a small number of failures (or none at all) are observed....
Challenges and New Methods for Designing Reliability Experiments
Engineers use reliability experiments to determine the factors that drive product reliability, build robust products, and predict reliability under use conditions. This article uses recent testing of a Howitzer to illustrate the challenges in designing reliability experiments for complex, repairable systems. We leverage lessons learned from current research and propose methods for designing an experiment for a complex, repairable system. Suggested Citation Freeman, Laura J., Rebecca M. Medlin, and Thomas H....
D-Optimal as an Alternative to Full Factorial Designs: A Case Study
The use of Bayesian statistics and experimental design as tools to scope testing and analyze data related to defense has increased in recent years. Planning a test using experimental design will allow testers to cover the operational space while maximizing the information obtained from each run. Understanding which factors can affect a detector’s performance can influence military tactics, techniques and procedures, and improve a commander’s situational awareness when making decisions in an operational environment....
Demystifying the Black Box: A Test Strategy for Autonomy
The purpose of this briefing is to provide a high-level overview of how to frame the question of testing autonomous systems in a way that will enable development of successful test strategies. The brief outlines the challenges and broad-stroke reforms needed to get ready for the test challenges of the next century. Suggested Citation Wojton, Heather M, and Daniel J Porter. Demystifying the Black Box: A Test Strategy for Autonomy. IDA Document NS D-10465-NS....
Designing Experiments for Model Validation: The Foundations for Uncertainty Quantification
Advances in computational power have allowed both greater fidelity and more extensive use of simulation models. Numerous complex military systems have a corresponding model that simulates their performance in the field. In response, the DoD needs defensible practices for validating these models. Design of Experiments and statistical analysis techniques are the foundational building blocks for validating the use of computer models and quantifying uncertainty in that validation. Recent developments in uncertainty quantification have the potential to benefit the DoD in using modeling and simulation to inform operational evaluations....
Handbook on Statistical Design & Analysis Techniques for Modeling & Simulation Validation
This handbook focuses on methods for data-driven validation to supplement the vast existing literature for Verification, Validation, and Accreditation (VV&A) and the emerging references on uncertainty quantification (UQ). The goal of this handbook is to aid the test and evaluation (T&E) community in developing test strategies that support model validation (both external validation and parametric analysis) and statistical UQ. Suggested Citation Wojton, Heather, Kelly M Avery, Laura J Freeman, Samuel H Parry, Gregory S Whittier, Thomas H Johnson, and Andrew C Flack....
Impact of Conditions which Affect Exploratory Factor Analysis
Some responses cannot be observed directly and must be inferred from multiple indirect measurements, for example, human experiences accessed through a variety of survey questions. Exploratory Factor Analysis (EFA) is a data-driven method to optimally combine these indirect measurements to infer some number of unobserved factors. Ideally, EFA should identify how many unobserved factors the indirect measures help estimate (factor extraction), as well as accurately capture how well each indirect measure estimates each factor (parameter recovery)....
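A minimal EFA run in R looks like the following; the simulated two-factor survey data are invented to illustrate factor extraction and parameter recovery, and are not the study's simulation design:

    # Simulate six items loading on two latent factors
    set.seed(1)
    n  <- 200
    f1 <- rnorm(n); f2 <- rnorm(n)
    items <- cbind(q1 = 0.8 * f1, q2 = 0.7 * f1, q3 = 0.9 * f1,
                   q4 = 0.8 * f2, q5 = 0.7 * f2, q6 = 0.9 * f2) +
      matrix(rnorm(n * 6, sd = 0.5), n, 6)

    # Extract two factors and inspect the recovered loadings
    fit <- factanal(items, factors = 2, rotation = "varimax")
    print(fit$loadings, cutoff = 0.3)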
Initial Validation of the Trust of Automated Systems Test (TOAST)
Trust is a key determinant of whether people rely on automated systems in the military and the public. However, there is currently no standard for measuring trust in automated systems. In the present studies we propose a scale to measure trust in automated systems that is grounded in current research and theory on trust formation, which we refer to as the Trust in Automated Systems Test (TOAST). We evaluated both the reliability of the scale structure and criterion validity using independent, military-affiliated and civilian samples....
M&S Validation for the Joint Air-to-Ground Missile
An operational test is resource-limited and must therefore rely on both live test data and modeling and simulation (M&S) data to inform a full evaluation. For the Joint Air-to-Ground Missile (JAGM) system, we needed to create a test design that accomplished dual goals: characterizing missile performance across the operational space and supporting rigorous validation of the M&S. Our key question is: Which statistical techniques should be used to compare the M&S to the live data?...
Managing T&E Data to Encourage Reuse
Reusing Test and Evaluation (T&E) datasets multiple times at different points throughout a program’s lifecycle is one way to realize their full value. Data management plays an important role in enabling, and even encouraging, this practice. Although Department-level policy on data management is supportive of reuse and consistent with best practices from industry and academia, the documents that shape the day-to-day activities of T&E practitioners are much less so....
Operational Testing of Systems with Autonomy
Systems with autonomy pose unique challenges for operational test. This document provides an executive level overview of these issues and the proposed solutions and reforms. In order to be ready for the testing challenges of the next century, we will need to change the entire acquisition life cycle, starting even from initial system conceptualization. This briefing was presented to the Director, Operational Test & Evaluation along with his deputies and Chief Scientist....
Pilot Training Next: Modeling Skill Transfer in a Military Learning Environment
Pilot Training Next is an exploratory investigation of new technologies and procedures to increase the efficiency of Undergraduate Pilot Training in the United States Air Force. IDA analysts present a method of quantifying skill transfer from simulators to aircraft under realistic, uncontrolled conditions. Suggested Citation Porter, Daniel, Emily Fedele, and Heather Wojton. Pilot Training Next: Modeling Skill Transfer in a Military Learning Environment. IDA Document NS D-10927. Alexandria, VA: Institute for Defense Analyses, 2019....
Reproducible Research Mini-Tutorial
Analyses are reproducible if the same methods applied to the same data produce identical results when run again by another researcher (or you in the future). Reproducible analyses are transparent and easy for reviewers to verify, as results and figures can be traced directly to the data and methods that produced them. There are also direct benefits to the researcher. Real-world analysis workflows inevitably require changes to incorporate new or additional data, or to address feedback from collaborators, reviewers, or sponsors....
Sample Size Determination Methods Using Acceptance Sampling by Variables
Acceptance Sampling by Variables (ASbV) is a statistical testing technique used in Personal Protective Equipment programs to determine the quality of the equipment in First Article and Lot Acceptance Tests. This article intends to remedy the lack of existing references that discuss the similarities between ASbV and certain techniques used in different sub-disciplines within statistics. Understanding ASbV from a statistical perspective allows testers to create customized test plans, beyond what is available in MIL-STD-414....
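The statistical kernel of ASbV is the one-sided k-method: accept the lot when the sample mean sits at least k sample standard deviations inside the specification limit. A minimal sketch, with the acceptance constant and data invented rather than drawn from MIL-STD-414:

    # Accept if (mean - lower spec) / sd >= k
    asbv_accept <- function(x, lower_spec, k) {
      (mean(x) - lower_spec) / sd(x) >= k
    }

    set.seed(1)
    strength <- rnorm(10, mean = 105, sd = 2)   # measured values
    asbv_accept(strength, lower_spec = 100, k = 1.72)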
Statistics Boot Camp
In the test community, we frequently use statistics to extract meaning from data. These inferences may be drawn with respect to topics ranging from system performance to human factors. In this mini-tutorial, we will begin by discussing the use of descriptive and inferential statistics. We will continue by discussing commonly used parametric and nonparametric statistics within the defense community, ranging from comparisons of distributions to comparisons of means. We will conclude with a brief discussion of how to present your statistical findings graphically for maximum impact....
Survey Testing Automation Tool (STAT)
In operational testing, survey administration is typically a manual, paper-driven process. We developed a web-based tool called Survey Testing Automation Tool (STAT), which integrates and automates survey construction, administration, and analysis procedures. STAT introduces a standardized approach to the construction of surveys and includes capabilities for survey management, survey planning, and form generation. Suggested Citation Finnegan, Gary M, Kelly Tran, Tara A McGovern, and William R Whitledge. Survey Testing Automation Tool (STAT)....
The Effect of Extremes in Small Sample Size on Simple Mixed Models: A Comparison of Level-1 and Level-2 Size
We present a simulation study that examines the impact of small sample sizes in both observation and nesting levels of the model on the fixed effect bias, type I error, and the power of a simple mixed model analysis. Despite the need for adjustments to control for type I error inflation, our findings indicate that smaller samples than previously recognized can be used for mixed models under certain conditions prevalent in applied research....
The Purpose of Mixed-Effects Models in Test and Evaluation
Mixed-effects models are the standard technique for analyzing data with grouping structure. In defense testing, these models are useful because they allow us to account for correlations between observations, a feature common in many operational tests. In this article, we describe the advantages of modeling data from a mixed-effects perspective and discuss an R package—ciTools—that equips the user with easy methods for presenting results from this type of model. Suggested Citation Haman, John, Matthew Avery, and Heather Wojton....
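A minimal sketch of the workflow described here, assuming ciTools' interval methods for lmer fits; the data set, factor, and grouping variable are invented:

    library(lme4)
    library(ciTools)

    # Notional operational test data: 8 operators, 5 trials each
    set.seed(1)
    test_data <- data.frame(operator  = factor(rep(1:8, each = 5)),
                            condition = rep(c("day", "night"), 20))
    test_data$score <- 10 + 2 * (test_data$condition == "night") +
      rnorm(8, sd = 1)[test_data$operator] + rnorm(40, sd = 0.5)

    # Random intercept per operator accounts for within-group correlation
    fit <- lmer(score ~ condition + (1 | operator), data = test_data)

    # Append confidence intervals for the mean response to the data
    head(add_ci(test_data, fit, alpha = 0.05))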
Use of Design of Experiments in Survivability Testing
The purpose of survivability testing is to provide decision makers with relevant, credible evidence about the survivability of an aircraft that is conveyed with some degree of certainty or inferential weight. In developing an experiment to accomplish this goal, a test planner faces numerous questions: What critical issue or issues are being addressed? What data are needed to answer the critical issues? What test conditions should be varied? What is the most economical way of varying those conditions?...
A Groundswell for Test and Evaluation
The fundamental purpose of test and evaluation (T&E) in the Department of Defense (DOD) is to provide knowledge to answer critical questions that help decision makers manage the risk involved in developing, producing, operating, and sustaining systems and capabilities. At its core, T&E takes data and translates it into information for decision makers. Subject matter expertise of the platform and operational mission have always been critical components of developing defensible test and evaluation strategies....
Analysis of Split-Plot Reliability Experiments with Subsampling
Reliability experiments are important for determining which factors drive product reliability. The data collected in these experiments can be challenging to analyze. Often, the reliability or lifetime data collected follow distinctly nonnormal distributions and include censored observations. Additional challenges in the analysis arise when the experiment is executed with restrictions on randomization. The focus of this paper is on the proper analysis of reliability data collected from a nonrandomized reliability experiment....
Comparing M&S Output to Live Test Data: A Missile System Case Study
In the operational testing of DoD weapons systems, modeling and simulation (M&S) is often used to supplement live test data in order to support a more complete and rigorous evaluation. Before the output of the M&S is included in reports to decision makers, it must first be thoroughly verified and validated to show that it adequately represents the real world for the purposes of the intended use. Part of the validation process should include a statistical comparison of live data to M&S output....
Improved Surface Gunnery Analysis with Continuous Data
Recasting gunfire data from binomial (hit/miss) to continuous (time-to-kill) allows us to draw statistical conclusions with tactical implications from free-play, live-fire surface gunnery events. Our analysis provided the Navy with suggestions for improvements to its tactics and the employment of its weapons. A censored analysis enabled us to do so, where other methods fell short. Suggested Citation Ashwell, Benjamin A, V Bram Lillard, and George M Khoury. Improved Surface Gunnery Analysis with Continuous Data....
Informing the Warfighter—Why Statistical Methods Matter in Defense Testing
Suggested Citation Freeman, Laura J., and Catherine Warner. “Informing the Warfighter—Why Statistical Methods Matter in Defense Testing.” CHANCE 31, no. 2 (April 3, 2018): 4–11. https://doi.org/10.1080/09332480.2018.1467627.
Introduction to Observational Studies
A presentation on the theory and practice of observational studies. Specific average treatment effect methods include matching, difference-in-difference estimators, and instrumental variables. Suggested Citation Thomas, Dean, and Yevgeniya K Pinelis. Introduction to Observational Studies. IDA Document NS D-9020. Alexandria, VA: Institute for Defense Analyses, 2018.
JEDIS Briefing and Tutorial
Are you sick of having to manually iterate your way through sizing your design of experiments? Come learn about JEDIS, the new IDA-developed JMP Add-In for automating design of experiments power calculations. JEDIS builds multiple test designs in JMP over user-specified ranges of sample sizes, Signal-to-Noise Ratios (SNR), and alpha (1 - confidence) levels. It then automatically calculates the statistical power to detect an effect due to each factor and any specified interactions for each design....
Parametric Reliability Models Tutorial
This tutorial demonstrates how to plot reliability functions parametrically in R using the output from any reliability modeling software. It provides code and sample plots of reliability and failure rate functions with confidence intervals for three different skewed probability distributions: the exponential, the two-parameter Weibull, and the lognormal. These three distributions are the most common parametric models for reliability or survival analysis. This paper also provides mathematical background for the models and recommendations for when to use them....
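In the spirit of that tutorial, here is a compact sketch plotting the reliability function R(t) for all three models with base R graphics; the parameter values are illustrative:

    t <- seq(0.01, 10, length.out = 200)
    R_exp     <- exp(-t / 4)                       # exponential, mean 4
    R_weibull <- exp(-(t / 4)^1.5)                 # Weibull: scale 4, shape 1.5
    R_lnorm   <- plnorm(t, meanlog = 1, sdlog = 0.5, lower.tail = FALSE)

    plot(t, R_exp, type = "l", ylim = c(0, 1),
         xlab = "Time", ylab = "Reliability R(t)")
    lines(t, R_weibull, lty = 2)
    lines(t, R_lnorm,  lty = 3)
    legend("topright", legend = c("Exponential", "Weibull", "Lognormal"),
           lty = 1:3)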
Power Approximations for Reliability Test Designs
Reliability tests determine which factors drive system reliability. Often, the reliability or failure time data collected in these tests tend to follow distinctly non-normal distributions and include censored observations. The experimental design should accommodate the skewed nature of the response and allow for censored observations, which occur when systems under test do not fail within the allotted test time. To account for these design and analysis considerations, Monte Carlo simulations are frequently used to evaluate experimental design properties....
Reliability Best Practices and Lessons Learned in the Department of Defense
Despite the importance of acquiring reliable systems to support the warfighter, many military programs fail to meet reliability requirements, which affects the overall suitability and cost of the system. To determine ways to improve reliability outcomes in the future, research staff from the Institute for Defense Analyses Operational Evaluation Division compiled case studies identifying reliability lessons learned and best practices for several DOT&E oversight programs. The case studies provide program-specific information on strategies that worked well or did not work well to produce reliable systems....
Scientific Test and Analysis Techniques
This document contains the technical content for the Scientific Test and Analysis Techniques (STAT) in Test and Evaluation (T&E) continuous learning module. The module provides a basic understanding of STAT in T&E. Topics covered include design of experiments, observational studies, survey design and analysis, and statistical analysis. It is designed as a four-hour online course, suitable for inclusion in the DAU T&E certification curriculum.
Scientific Test and Analysis Techniques: Continuous Learning Module
This document contains the technical content for the Scientific Test and Analysis Techniques (STAT) in Test and Evaluation (T&E) continuous learning module. The module provides a basic understanding of STAT in T&E. Topics covered include design of experiments, observational studies, survey design and analysis, and statistical analysis. It is designed as a four-hour online course, suitable for inclusion in the DAU T&E certification curriculum. Suggested Citation Pinelis, Yevgeniya, Laura J Freeman, Heather M Wojton, Denise J Edwards, Stephanie T Lane, and James R Simpson....
Testing Defense Systems
The complex, multifunctional nature of defense systems, along with the wide variety of system types, demands a structured but flexible analytical process for testing systems. This chapter summarizes commonly used techniques in defense system testing and specific challenges imposed by the nature of defense system testing. It highlights the core statistical methodologies that have proven useful in testing defense systems. Case studies illustrate the value of using statistical techniques in the design of tests and analysis of the resulting data....
Vetting Custom Scales: Understanding Reliability, Validity, and Dimensionality
For situations in which an empirically vetted scale does not exist or is not suitable, a custom scale may be created. This document presents a comprehensive process for establishing the defensible use of a custom scale. At the highest level, this process encompasses (1) establishing validity of the scale, (2) establishing reliability of the scale, and (3) assessing dimensionality, whether intended or unintended, of the scale. First, the concept of validity is described, including how validity may be established using operators and subject matter experts....
A Multi-Method Approach to Evaluating Human-System Interactions During Operational Testing
The purpose of this paper was to identify the shortcomings of a single-method approach to evaluating human-system interactions during operational testing and to offer an alternative, multi-method approach that is more defensible, yields richer insights into how operators interact with weapon systems, and provides practical implications for identifying when the quality of human-system interactions warrants correction through either operator training or redesign. Suggested Citation Thomas, Dean, Heather Wojton, Chad Bieber, and Daniel Porter....
Comparing Live Missile Fire and Simulation
Modeling and simulation is frequently used in test and evaluation (T&E) of air-to-air weapon systems to evaluate the effectiveness of a weapon. The Air Intercept Missile-9X (AIM-9X) program uses modeling and simulation extensively to evaluate missile miss distances. Since flight testing is expensive, the test program uses relatively few flight tests and supplements those data with large numbers of miss distances from simulated tests across the weapon’s operational space. However, before modeling and simulation can be used to predict performance, it must first be validated....
Foundations of Psychological Measurement
Psychological measurement is an important issue throughout the Department of Defense (DoD). For instance, the DoD engages in psychological measurement to place military personnel into specialties, evaluate the mental health of military personnel, evaluate the quality of human-systems interactions, and identify factors that affect crime rates on bases. Given its broad use, researchers and decision-makers need to understand the basics of psychological measurement, most notably the development of surveys. This briefing discusses 1) the goals and challenges of psychological measurement, 2) basic measurement concepts and how they apply to psychological measurement, 3) basics for developing scales to measure psychological attributes, and 4) methods for ensuring that scales are reliable and valid....
On Scoping a Test that Addresses the Wrong Objective
Statistical literature refers to a type of error that is committed by giving the right answer to the wrong question. If a test design is adequately scoped to address an irrelevant objective, one could say that a Type III error has occurred. In this paper, we focus on a specific Type III error that test planners sometimes commit to reduce test size and resources. Suggested Citation Johnson, Thomas H., Rebecca M....
Perspectives on Operational Testing: Guest Lecture at Naval Postgraduate School
This document was prepared to support Dr. Lillard’s visit to the Naval Postgraduate School, where he will provide a guest lecture to students in the T&E course. The briefing covers three primary themes: 1) evaluation of military systems on the basis of requirements and KPPs alone is often insufficient to determine effectiveness and suitability in combat conditions, 2) statistical methods are essential for developing defensible and rigorous test designs, and 3) operational testing is often the only means to discover critical performance shortcomings....
Power Approximations for Generalized Linear Models using the Signal-to-Noise Transformation Method
Statistical power is a useful measure for assessing the adequacy of an experimental design prior to data collection. This paper proposes an approach, referred to as the signal-to-noise transformation method (SNRx), to approximate power for effects in a generalized linear model. The contribution of SNRx is that, with a couple of assumptions, it generates power approximations for generalized linear model effects using the F-tests that are typically used in ANOVA for classical linear models. Additionally, SNRx follows Oehlert and Whitcomb’s unified approach for sizing an effect, which allows for intuitive effect size definitions and consistent estimates of power....
Prediction Uncertainty for Autocorrelated Lognormal Data with Random Effects
Accurately presenting model estimates with appropriate uncertainties is critical to the credibility and defensibility of any piece of statistical analysis. When dealing with complex data that require hierarchical covariance structures, many of the standard approaches for visualizing uncertainty are insufficient. One such case is data fit with log-linear autoregressive mixed effects models. Data requiring such an approach have three exceptional characteristics: 1. The data are sampled in “groups” that exhibit variation unexplained by other model factors....
Statistical Methods for Defense Testing
In the increasingly complex and data‐limited world of military defense testing, statisticians play a valuable role in many applications. Before the DoD acquires any major new capability, that system must undergo realistic testing in its intended environment with military users. Although the typical test environment is highly variable and factors are often uncontrolled, design of experiments techniques can add objectivity, efficiency, and rigor to the process of test planning. Statistical analyses help system evaluators get the most information out of limited data sets....
Thinking About Data for Operational Test and Evaluation
While the human brain is a powerful tool for quickly recognizing patterns in data, it frequently makes errors when interpreting random data. Luckily, these mistakes occur in systematic and predictable ways. Statistical models provide an analytical framework that helps us avoid these error-prone heuristics and draw accurate conclusions from random data. This non-technical presentation highlights some tricks of the trade learned by studying data and the way the human brain processes....
Users are Part of the System: How to Account for Human Factors when Designing Operational Tests for Software Systems
The goal of operational testing (OT) is to evaluate the effectiveness and suitability of military systems for use by trained military users in operationally realistic environments. Operators perform missions and make systems function. Thus, adequate OT must assess not only system performance and technical capability across the operational space, but also the quality of human-system interactions. Software systems in particular pose a unique challenge to testers. While some software systems may be inherently deterministic in nature, once they are placed in their intended environment with error-prone humans and highly stochastic networks, variability in outcomes often occurs, so tests often need to account for both “bug” finding and characterizing variability....
A First Step into the Bootstrap World
Bootstrapping is a powerful nonparametric tool for conducting statistical inference with many applications to data from operational testing. Bootstrapping is most useful when the population sampled from is unknown or complex or the sampling distribution of the desired statistic is difficult to derive. Careful use of bootstrapping can help address many challenges in analyzing operational test data. Suggested Citation Avery, Matthew R. A First Step into the Bootstrap World. IDA Document NS D-5816....
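For flavor, here is a minimal sketch of a nonparametric bootstrap in base R; the simulated data and the choice of the median are purely illustrative, not drawn from the document.

```r
# Minimal nonparametric bootstrap sketch (simulated data for illustration):
# percentile confidence interval for a median time-to-detect.
set.seed(1)
times <- rexp(30, rate = 1 / 40)         # hypothetical detection times (seconds)
boot_medians <- replicate(2000, median(sample(times, replace = TRUE)))
quantile(boot_medians, c(0.025, 0.975))  # 95% percentile interval
```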
Bayesian Analysis in R/STAN
In an era of reduced budgets and limited testing, verifying that requirements have been met in a single test period can be challenging, particularly using traditional analysis methods that ignore all available information. The Bayesian paradigm is tailor-made for these situations, allowing for the combination of multiple sources of data and resulting in more robust inference and uncertainty quantification. Consequently, Bayesian analyses are becoming increasingly popular in T&E. This tutorial briefly introduces the basic concepts of Bayesian Statistics, with implementation details illustrated in R through two case studies: reliability for the Core Mission functional area of the Littoral Combat Ship (LCS) and performance curves for a chemical detector in the Bio-chemical Detection System (BDS) with different agents and matrices....
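The case studies themselves are implemented in R and Stan; as a self-contained illustration of the central idea of combining prior information with new data, here is a conjugate beta-binomial sketch in base R (the numbers are hypothetical and unrelated to the LCS or BDS analyses).

```r
# Conjugate beta-binomial sketch (hypothetical numbers):
# combine prior information with new trial outcomes.
prior_alpha <- 8; prior_beta <- 2  # prior roughly centered on 80% success
successes <- 17; failures <- 3     # new test results
post_alpha <- prior_alpha + successes
post_beta  <- prior_beta  + failures
qbeta(c(0.025, 0.5, 0.975), post_alpha, post_beta)  # posterior median and 95% interval
```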
Bayesian Reliability- Combining Information
One of the most powerful features of Bayesian analyses is the ability to combine multiple sources of information in a principled way to perform inference. This feature can be particularly valuable in assessing the reliability of systems where testing is limited. At their most basic, Bayesian methods for reliability develop informative prior distributions using expert judgment or similar systems. Appropriate models allow the incorporation of many other sources of information, including historical data, information from similar systems, and computer models....
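A minimal sketch of the combination idea, assuming exponential lifetimes and a conjugate gamma prior on the failure rate (all values illustrative):

```r
# Gamma-exponential sketch: prior knowledge plus new test evidence
prior_shape <- 4; prior_rate <- 400  # prior roughly centered on a 100-hour MTBF
failures <- 3; test_hours <- 500     # new test: 3 failures in 500 hours
post_shape <- prior_shape + failures
post_rate  <- prior_rate  + test_hours
1 / qgamma(c(0.975, 0.5, 0.025), post_shape, post_rate)  # MTBF: lower, median, upper
```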
Censored Data Analysis Methods for Performance Data- A Tutorial
Binomial metrics like probability-to-detect or probability-to-hit typically do not provide the maximum information from testing. Continuous metrics such as time-to-detect provide more information, but do not account for non-detects. Censored data analysis allows us to account for both pieces of information simultaneously. Suggested Citation Lillard, V Bram. Censored Data Analysis Methods for Performance Data: A Tutorial. IDA Document NS D-5811. Alexandria, VA: Institute for Defense Analyses, 2016....
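To make the idea concrete, here is a sketch using the survival package on simulated data; the lognormal distribution and the 30-second cutoff are assumptions for illustration, not the tutorial's example.

```r
# Censored time-to-detect sketch: non-detects enter as right-censored times
library(survival)
set.seed(2)
true_time <- rlnorm(25, meanlog = 3, sdlog = 0.5)  # hypothetical detection times (s)
cutoff    <- 30                                    # trial length; later detections are missed
time      <- pmin(true_time, cutoff)
detected  <- as.numeric(true_time <= cutoff)       # 0 = right-censored (non-detect)
fit <- survreg(Surv(time, detected) ~ 1, dist = "lognormal")
exp(coef(fit))                                     # estimated median time to detect
```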
DOT&E Reliability Course
This reliability course provides information to assist DOT&E action officers in their review and assessment of system reliability. Course briefings cover reliability planning and analysis activities that span the acquisition life cycle. Each briefing discusses review criteria relevant to DOT&E action officers based on DoD policies and lessons learned from previous oversight efforts. Suggested Citation Avery, Matthew, Jonathan Bell, Rebecca Medlin, and Laura Freeman. DOT&E Reliability Course. IDA Document NS D-5836....
Introduction to Survey Design
An important goal of test and evaluation is to understand not only how a system performs in its intended environment, but also users’ experiences operating the system. This briefing aimed to provide the audience with a set of tools – most notably, surveys – that are appropriate for measuring the user experience. DOT&E guidance regarding these tools is highlighted where appropriate. The briefing was broken into three major sections: conceptualizing surveys, writing survey items, and formatting surveys....
Regularization for Continuously Observed Ordinal Response Variables with Piecewise-Constant Functional Predictors
This paper investigates regularization for continuously observed covariates that resemble step functions. The motivating examples come from operational test data from a recent United States Department of Defense (DoD) test of the Shadow Unmanned Air Vehicle system. The response variable, quality of video provided by the Shadow to friendly ground units, was measured on an ordinal scale continuously over time. Functional covariates, altitude and distance, can be well approximated by step functions....
Rigorous Test and Evaluation for Defense, Aerospace, and National Security
In April 2016, NASA, DOT&E, and IDA collaborated on a workshop designed to strengthen the community around statistical approaches to test and evaluation in defense and aerospace. The workshop brought practitioners, analysts, technical leadership, and statistical academics together for a three-day exchange of information with opportunities to attend world-renowned short courses, share common challenges, and learn new skill sets from a variety of tutorials. A highlight of the workshop was the Tuesday afternoon technical leadership panel chaired by Dr....
Science of Test Workshop Proceedings, April 11-13, 2016
To mark IDA’s 60th anniversary, we are conducting a series of workshops and symposia that bring together IDA sponsors, researchers, experts inside and outside government, and other stakeholders to discuss issues of the day. These events focus on future national security challenges, reflecting on how past lessons and accomplishments help prepare us to deal with complex issues and environments we face going forward. This publication represents the proceedings of the Science of Test Workshop....
Tutorial on Sensitivity Testing in Live Fire Test and Evaluation
A sensitivity experiment is a special type of experimental design that is used when the response variable is binary and the covariate is continuous. Armor protection and projectile lethality tests often use sensitivity experiments to characterize a projectile’s probability of penetrating the armor. In this mini-tutorial we illustrate the challenge of modeling a binary response with a limited sample size, and show how sensitivity experiments can mitigate this problem. We review eight different single covariate sensitivity experiments and present a comparison of these designs using simulation....
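For flavor, here is a sketch of the estimation step these designs share: fit a logistic curve to the binary outcomes and extract V50, the velocity at a 50 percent penetration probability (the simulated velocities and response curve are assumptions, not the tutorial's data).

```r
# Fit a penetration curve from binary sensitivity-test outcomes (simulated)
set.seed(3)
velocity  <- runif(40, 800, 1200)    # hypothetical impact velocities (ft/s)
penetrate <- rbinom(40, 1, plogis((velocity - 1000) / 40))
fit <- glm(penetrate ~ velocity, family = binomial)
v50 <- -coef(fit)[1] / coef(fit)[2]  # velocity at 50% penetration probability
v50
```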
Best Practices for Statistically Validating Modeling and Simulation (M&S) Tools Used in Operational Testing
In many situations, collecting sufficient data to evaluate system performance against operationally realistic threats is not possible due to cost and resource restrictions, safety concerns, or lack of adequate or representative threats. Modeling and simulation tools that have been verified, validated, and accredited can be used to supplement live testing in order to facilitate a more complete evaluation of performance. Two key questions that frequently arise when planning an operational test are (1) which (and how many) points within the operational space should be chosen in the simulation space and the live space for optimal ability to verify and validate the M&S, and (2) once that data is collected, what is the best way to compare the live trials to the simulated trials for the purpose of validating the M&S?...
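Many comparisons are possible for question (2); one simple baseline, sketched below on simulated data, is a two-sample test of live versus simulated responses on the log scale. The miss-distance framing and all numbers are illustrative assumptions, not a recommendation from the document.

```r
# Baseline live-versus-simulation comparison sketch (simulated miss distances)
set.seed(4)
live <- rlnorm(8,  meanlog = 1.0, sdlog = 0.4)  # scarce live trials
sim  <- rlnorm(50, meanlog = 1.1, sdlog = 0.4)  # plentiful simulation runs
t.test(log(live), log(sim))                     # tests for a mean shift on the log scale
```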
Estimating System Reliability from Heterogeneous Data
This briefing provides an example of some of the nuanced issues in reliability estimation in operational testing. The statistical models are motivated by an example of the Paladin Integrated Management (PIM). We demonstrate how to use a Bayesian approach to reliability estimation that uses data from all phases of testing. Suggested Citation Browning, Caleb, Laura Freeman, Alyson Wilson, Kassandra Fronczyk, and Rebecca Dickinson. “Estimating System Reliability from Heterogeneous Data.” Presented at the Conference on Applied Statistics in Defense, George Mason University, October 2015....
Improving Reliability Estimates with Bayesian Statistics
This paper shows how Bayesian methods are ideal for assessing the reliability of complex systems. Several examples illustrate the methodology. Suggested Citation Freeman, Laura J, and Kassandra Fronczyk. “Improving Reliability Estimates with Bayesian Statistics.” ITEA Journal of Test and Evaluation 37, no. 4 (June 2015). Paper:
Statistical Models for Combining Information Stryker Reliability Case Study
Reliability is an essential element in assessing the operational suitability of Department of Defense weapon systems. Reliability takes a prominent role in both the design and analysis of operational tests. In the current era of reduced budgets and increased reliability requirements, it is challenging to verify reliability requirements in a single test. Furthermore, all available data should be considered in order to ensure evaluations provide the most appropriate analysis of the system’s reliability....
Surveys in Operational Test and Evaluation
Recently, DOT&E signed out a memo providing Guidance on the Use and Design of Surveys in Operational Test and Evaluation. This guidance memo helps the Human Systems Integration (HSI) community ensure that useful and accurate HSI data are collected. Information about how HSI experts can leverage the guidance is presented. Specifically, the presentation will cover which HSI metrics can and cannot be measured by surveys. Suggested Citation Grier, Rebecca A, and Laura Freeman....
Validating the PRA Testbed Using a Statistically Rigorous Approach
For many systems, testing is expensive and only a few live test events are conducted. When this occurs, testers frequently use a model to extend the test results. However, testers must validate the model to show that it is an accurate representation of the real world from the perspective of the intended uses of the model. This raises a problem: when only a small number of live test events are conducted, only limited data are available to validate the model, and testers often struggle with model validation....
A Comparison of Ballistic Resistance Testing Techniques in the Department of Defense
This paper summarizes sensitivity test methods commonly employed in the Department of Defense. A comparison study shows that modern methods such as Neyer’s method and Three-Phase Optimal Design are improvements over historical methods. Suggested Citation Johnson, Thomas H., Laura Freeman, Janice Hester, and Jonathan L. Bell. “A Comparison of Ballistic Resistance Testing Techniques in the Department of Defense.” IEEE Access 2 (2014): 1442–55. https://doi.org/10.1109/ACCESS.2014.2377633. Paper:
Applying Risk Analysis to Acceptance Testing of Combat Helmets
Acceptance testing of combat helmets presents multiple challenges that require statistically-sound solutions. For example, how should first article and lot acceptance tests treat multiple threats and measures of performance? How should these tests account for multiple helmet sizes and environmental treatments? How closely should first article testing requirements match historical or characterization test data? What government and manufacturer risks are acceptable during lot acceptance testing? Similar challenges arise when testing other components of Personal Protective Equipment and similar statistical approaches should be applied to all components....
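The risk trade-off can be made concrete with an operating characteristic calculation; the accept/reject rule below is a hypothetical plan for illustration, not the paper's recommendation.

```r
# Operating characteristic of a hypothetical lot acceptance rule:
# accept the lot if at most 1 of 20 tested helmets fails
p_accept <- function(p_fail) pbinom(1, size = 20, prob = p_fail)
p_accept(c(0.02, 0.10, 0.20))  # acceptance probability at several true failure rates
# manufacturer's risk (rejecting a good lot): 1 - p_accept(0.02)
# government's risk (accepting a bad lot):  p_accept(0.20)
```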
Design of Experiments for In-Lab Operational Testing of the AN/BQQ-10 Submarine Sonar System
Operational testing of the AN/BQQ-10 submarine sonar system has never been able to show significant improvements in software versions because of the high variability of at-sea measurements. To mitigate this problem, in the most recent AN/BQQ-10 operational test, the Navy’s operational test agency (in consultation with IDA under the direction of Director, Operational Test and Evaluation) supplemented the at-sea testing with an operationally focused in-lab comparison. This test used recorded real data played back on two different versions of the sonar system....
Power Analysis Tutorial for Experimental Design Software
This guide provides both a general explanation of power analysis and specific guidance to successfully interface with two software packages, JMP and Design Expert (DX). Suggested Citation Freeman, Laura J., Thomas H. Johnson, and James R. Simpson. “Power Analysis Tutorial for Experimental Design Software.” Fort Belvoir, VA: Defense Technical Information Center, November 1, 2014. https://doi.org/10.21236/ADA619843. Paper:
Taking the Next Step- Improving the Science of Test in DoD T&E
The current fiscal climate demands now, more than ever, that test and evaluation (T&E) provide relevant and credible characterization of system capabilities and shortfalls across all relevant operational conditions as efficiently as possible. In determining the answer to the question, “How much testing is enough?” it is imperative that we use a scientifically defensible methodology. Design of Experiments (DOE) has a proven track record in Operational Test and Evaluation (OT&E) of not only quantifying how much testing is enough, but also where in the operational space the test points should be placed....
A Tutorial on the Planning of Experiments
This tutorial outlines the basic procedures for planning experiments within the context of the scientific method. Too often quality practitioners fail to appreciate how subject-matter expertise must interact with statistical expertise to generate efficient and effective experimental programs. This tutorial guides the quality practitioner through the basic steps, demonstrated by extensive past experience, that consistently lead to successful results. This tutorial makes extensive use of flowcharts to illustrate the basic process....
Censored Data Analysis- A Statistical Tool for Efficient and Information-Rich Testing
Binomial metrics like probability-to-detect or probability-to-hit typically provide operationally meaningful and easy to interpret test outcomes. However, they are information-poor metrics and extremely expensive to test. The standard power calculations to size a test employ hypothesis tests, which typically result in many tens to hundreds of runs. In addition to being expensive, the test is most likely inadequate for characterizing performance over a variety of conditions due to the inherently large statistical uncertainties associated with binomial metrics....
Comparing Computer Experiments for the Gaussian Process Model Using Integrated Prediction Variance
Space-Filling Designs are a common choice of experimental design strategy for computer experiments. This paper compares space filling design types based on their theoretical prediction variance properties with respect to the Gaussian Process model. Suggested Citation Silvestrini, Rachel T., Douglas C. Montgomery, and Bradley Jones. “Comparing Computer Experiments for the Gaussian Process Model Using Integrated Prediction Variance.” Quality Engineering 25, no. 2 (April 2013): 164–74. https://doi.org/10.1080/08982112.2012.758284. Paper:
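A toy version of the criterion: under a Gaussian process with squared-exponential correlation and unit variance, the predictive variance at a point reflects how well the design covers it. The designs, prediction point, and correlation parameter below are assumptions for illustration; the paper integrates this variance over the whole design space.

```r
# GP prediction variance at x0 for a one-factor design (toy example)
pred_var <- function(design, x0, theta = 10) {
  R <- exp(-theta * outer(design, design, function(a, b) (a - b)^2))
  r <- exp(-theta * (design - x0)^2)
  drop(1 - t(r) %*% solve(R, r))  # simple-kriging predictive variance
}
pred_var(c(0, 0.5, 1), 0.75)      # spread design covers x0 = 0.75 reasonably well
pred_var(c(0, 0.1, 0.2), 0.75)    # clustered design leaves x0 nearly uncovered
```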
Scientific Test and Analysis Techniques- Statistical Measures of Merit
Design of Experiments (DOE) provides a rigorous methodology for developing and evaluating test plans. Design excellence consists of having enough test points placed in the right locations in the operational envelope to answer the questions of interest for the test. The key aspects of a well-designed experiment include: the goal of the test, the response variables, the factors and levels, a method for strategically varying the factors across the operational envelope, and statistical measures of merit....
A Bayesian Approach to Evaluation of Land Warfare Systems
This presentation was prepared for the Army Conference on Applied Statistics. The presentation covers a brief introduction to land warfare problems and devises a methodology using Bayes’ Theorem to estimate parameters of interest. Two examples are given: a simple one using independent Bernoulli trials, and a more complex one using correlated Red and Blue casualty data in a Loss Exchange Ratio and a hierarchical model. The presentation demonstrates that the Bayesian approach is successful in both examples at reducing the variance of the estimated parameters, potentially reducing the cost of devising a complex test program....
Continuous Metrics for Efficient and Effective Testing
In today’s fiscal environment, efficient and effective testing is essential. Often, military system requirements are defined using probability of success as the primary measure of effectiveness – for example, a system must complete its mission 80 percent of the time; or the system must detect 90 percent of targets. The traditional approach to testing these probability-based requirements is to execute a series of trials and then total the number of successes; the ratio of successes to the number of trials provides an intuitive measure of the probability of success....
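The statistical cost of the purely binomial approach shows up in the width of the resulting interval; a quick sketch with illustrative counts:

```r
# Exact 95% interval from a purely binomial analysis (illustrative counts):
# 16 successes in 20 trials against an 80 percent requirement
binom.test(16, 20, p = 0.80)$conf.int  # wide interval despite an on-target point estimate
```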
Designed Experiments for the Defense Community
The areas of application for design of experiments principles have evolved, mimicking the growth of U.S. industries over the last century, from agriculture to manufacturing to chemical and process industries to the services and government sectors. In addition, statistically based quality programs adopted by businesses morphed from total quality management to Six Sigma and, most recently, statistical engineering (see Hoerl and Snee 2010). The good news about these transformations is that each evolution contains more technical substance, embedding the methodologies as core competencies, and is less of a “program....
Statistically Based T&E Using Design of Experiments
This document outlines the charter for the Committee to Institutionalize Scientific Test Design and Rigor in Test and Evaluation. The charter defines the problem, identifies potential steps in a roadmap for accomplishing the goals of the committee, and lists committee membership. Once the committee is assembled, the members will revise this document as needed. The charter will be endorsed by DOT&E and DDT&E, once finalized. Suggested Citation Freeman, Laura. Statistically Based T&E Using Design of Experiments....
An Expository Paper on Optimal Design
There are many situations where the requirements of a standard experimental design do not fit the research requirements of the problem. Three such situations occur when the problem requires unusual resource restrictions, when there are constraints on the design region, and when a non-standard model is expected to be required to adequately explain the response. Suggested Citation Johnson, Rachel T., Douglas C. Montgomery, and Bradley A. Jones. “An Expository Paper on Optimal Design....
Design for Reliability using Robust Parameter Design
Recently, the principles of Design of Experiments (DOE) have been implemented as a method of increasing the statistical rigor of operational tests. The focus has been on ensuring coverage of the operational envelope in terms of system effectiveness. DOE is applicable in reliability analysis as well. A reliability standard, ANSI/GEIA-STD-0009, advocates the use of Design for Reliability (DfR) early in the product development cycle in order to design-in reliability. Robust parameter design (RPD), first used by Taguchi and then by the response surface community, provides insights on how DOE can be used to make products and processes invariant to changes in factors....
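The mechanism RPD exploits is the control-by-noise interaction; a minimal sketch with a hypothetical fitted model (not from the paper): choosing the control setting that cancels the interaction flattens the response's sensitivity to the noise factor.

```r
# Hypothetical fitted model: y = 10 + 2*x + 1.5*z - 1.2*x*z
# (x = control factor, z = noise factor)
slope_in_z <- function(x) 1.5 - 1.2 * x  # sensitivity of y to the noise factor at x
slope_in_z(c(0, 0.5, 1.25))              # x = 1.25 drives the noise sensitivity to zero
```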
Design of Experiments in Highly Constrained Design Spaces
This presentation shows the merits of applying experimental design to operational tests, reviews guidance on using DOE from the Director, Operational Test and Evaluation, and presents the design solution for the test of a chemical agent detector. Advanced DOE techniques (split-plot designs, optimal designs) are often needed to construct effective designs for operational testing; traditional design strategies often result in designs that are not executable....
Hybrid Designs- Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models
This tutorial provides an overview of experimental design for modeling and simulation. Pros and cons of each design methodology are discussed. Suggested Citation Silvestrini, Rachel Johnson. “Hybrid Designs: Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models.” Monterey, California, May 2011. Slides:
Use of Statistically Designed Experiments to Inform Decisions in a Resource Constrained Environment
There has been recent emphasis on the increased use of statistics, including the use of statistically designed experiments, to plan and execute tests that support Department of Defense (DoD) acquisition programs. The use of statistical methods, including experimental design, has shown great benefits in industry, especially when used in an integrated fashion; for example see the literature on Six Sigma. The structured approach of experimental design allows the user to determine what data need to be collected and how it should be analyzed to achieve specific decision making objectives....
Examining Improved Experimental Designs for Wind Tunnel Testing Using Monte Carlo Sampling Methods
In this paper we compare data from a fairly large legacy wind tunnel test campaign to smaller, statistically-motivated experimental design strategies. The comparison, using Monte Carlo sampling methodology, suggests a tremendous opportunity to reduce wind tunnel test efforts without losing test information. Suggested Citation Hill, Raymond R., Derek A. Leggio, Shay R. Capehart, and August G. Roesener. “Examining Improved Experimental Designs for Wind Tunnel Testing Using Monte Carlo Sampling Methods.” Quality and Reliability Engineering International 27, no....
Choice of Second-Order Response Surface Designs for Logistic and Poisson Regression Models
This paper illustrates the construction of D-optimal second order designs for situations when the response is either binomial (pass/fail) or Poisson (count data). Suggested Citation Johnson, Rachel T., and Douglas C. Montgomery. “Choice of Second-Order Response Surface Designs for Logistic and Poisson Regression Models.” International Journal of Experimental Design and Process Optimisation 1, no. 1 (2009): 2. https://doi.org/10.1504/IJEDPO.2009.028954. Paper:
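For the linear-model analogue of the criterion, a design's D-value is det(X'X); the sketch below compares two one-factor candidates for a quadratic model. For logistic and Poisson responses, the information matrix also depends on the unknown parameters, which is what makes the constructions in the paper harder than this linear-model case.

```r
# D-criterion sketch for a quadratic model in one factor (linear-model case)
d_crit <- function(x) {
  X <- cbind(1, x, x^2)             # model matrix: intercept, linear, quadratic
  det(t(X) %*% X)
}
d_crit(c(-1, -1, 0, 0, 1, 1))       # replicated three-level design
d_crit(seq(-1, 1, length.out = 6))  # equally spaced alternative
```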
Designing Experiments for Nonlinear Models—an Introduction
We illustrate the construction of Bayesian D-optimal designs for nonlinear models and compare the relative efficiency of standard designs with these designs for several models and prior distributions on the parameters. Through a relative efficiency analysis, we show that standard designs can perform well in situations where the nonlinear model is intrinsically linear. However, if the model is nonlinear and its expectation function cannot be linearized by simple transformations, the nonlinear optimal design is considerably more efficient than the standard design....