Design of Experiments

Determining the Necessary Number of Runs in Computer Simulations with Binary Outcomes

How many success-or-failure observations should we collect from a computer simulation? Often, researchers use space-filling design of experiments when planning modeling and simulation (M&S) studies. We are not satisfied with existing guidance on justifying the number of runs when developing these designs, either because the guidance is insufficiently justified, does not provide an unambiguous answer, or is not based on optimizing a statistical measure of merit. Analysts should use confidence interval margin of error as the statistical measure of merit for M&S studies intended to characterize overall M&S behavioral trends....

Introduction to Human-Systems Interaction in Operational Test and Evaluation Course

Human-System Interaction (HSI) is the study of interfaces between humans and technical systems. The Department of Defense incorporates HSI evaluations into defense acquisition to improve system performance and reduce lifecycle costs. During operational test and evaluation, HSI evaluations characterize how a system’s operational performance is affected by its users. The goal of this course is to provide the theoretical background and practical tools necessary to plan and evaluate HSI test plans, collect and analyze HSI data, and report on HSI results....

Uncertainty Quantification for Ground Vehicle Vulnerability Simulation

A vulnerability assessment of a combat vehicle uses modeling and simulation (M&S) to predict the vehicle’s vulnerability to a given enemy attack. The system-level output of the M&S is the probability that the vehicle’s mobility is degraded as a result of the attack. The M&S models this system-level phenomenon by decoupling the attack scenario into a hierarchy of sub-systems. Each sub-system addresses a specific scientific problem, such as the fracture dynamics of an exploded munition, or the ballistic resistance provided by the vehicle’s armor....

Comparing Normal and Binary D-Optimal Designs by Statistical Power

In many Department of Defense test and evaluation applications, binary response variables are unavoidable. Many have considered D-optimal design of experiments for generalized linear models. However, little consideration has been given to assessing how these new designs perform in terms of statistical power for a given hypothesis test. Monte Carlo simulations and exact power calculations suggest that D optimal designs generally yield higher power than binary D-optimal designs, despite using logistic regression in the analysis after data have been collected....

Implementing Fast Flexible Space-Filling Designs in R

Modeling and simulation (M&S) can be a useful tool when testers and evaluators need to augment the data collected during a test event. When planning M&S, testers use experimental design techniques to determine how much and which types of data to collect, and they can use space-filling designs to spread out test points across the operational space. Fast flexible space-filling designs (FFSFDs) are a type of space-filling design useful for M&S because they work well in design spaces with disallowed combinations and permit the inclusion of categorical factors....

Introduction to Design of Experiments for Testers

This training provides details regarding the use of design of experiments, from choosing proper response variables, to identifying factors that could affect such responses, to determining the amount of data necessary to collect. The training also explains the benefits of using a Design of Experiments approach to testing and provides an overview of commonly used designs (e.g., factorial, optimal, and space-filling). The briefing illustrates the concepts discussed using several case studies....

Introduction to Design of Experiments in R- Generating and Evaluating Designs with Skpr

This workshop instructs attendees on how to run an end-to-end optimal Design of Experiments workflow in R using the open source skpr package. This workshop is split into two sections optimal design generation and design evaluation. The first half of the workshop provides basic instructions how to use R, as well as how to use skpr to create an optimal design for an experiment how to specify a model, create a candidate set of potential runs, remove disallowed combinations, and specify the design generation conditions to best suit an experimenter’s goals....

Statistical Methods Development Work for M&S Validation

We discuss four areas in which statistically rigorous methods contribute to modeling and simulation validation studies. These areas are statistical risk analysis, space-filling experimental designs, metamodel construction, and statistical validation. Taken together, these areas implement DOT&E guidance on model validation. In each area, IDA has contributed either research methods, user-friendly tools, or both. We point to our tools on testscience.org, and survey the research methods that we’ve contributed to the M&S validation literature...

Case Study on Applying Sequential Analyses in Operational Testing

Sequential analysis concerns statistical evaluation in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends on the information acquired during the investigation. Although sequential analysis originated in ballistics testing for the Department of Defense (DoD)and it is widely used in other disciplines, it is underutilized in the DoD. Expanding the use of sequential analysis may save money and reduce test time....

Thoughts on Applying Design of Experiments (DOE) to Cyber Testing

This briefing presented at Dataworks 2022 provides examples of potential ways in which Design of Experiments (DOE) could be applied to initially scope cyber assessments and, based on the results of those assessments, subsequently design in greater detail cyber tests. Suggested Citation Gilmore, James M, Kelly M Avery, Matthew R Girardi, and Rebecca M Medlin. Thoughts on Applying Design of Experiments (DOE) to Cyber Testing. IDA Document NS D-33023. Alexandria, VA: Institute for Defense Analyses, 2022....

Determining How Much Testing is Enough- An Exploration of Progress in the Department of Defense Test and Evaluation Community

This paper describes holistic progress in answering the question of “How much testing is enough?” It covers areas in which the T&E community has made progress, areas in which progress remains elusive, and issues that have emerged since 1994 that provide additional challenges. The selected case studies used to highlight progress are especially interesting examples, rather than a comprehensive look at all programs since 1994. Suggested Citation Medlin, Rebecca, Matthew R Avery, James R Simpson, and Heather M Wojton....

Space-Filling Designs for Modeling & Simulation

This document presents arguments and methods for using space-filling designs (SFDs) to plan modeling and simulation (M&S) data collection. Suggested Citation Avery, Kelly, John T Haman, Thomas Johnson, Curtis Miller, Dhruv Patel, and Han Yi. Test Design Challenges in Defense Testing. IDA Product ID 3002855. Alexandria, VA: Institute for Defense Analyses, 2024. Slides: Paper:

Why are Statistical Engineers Needed for Test & Evaluation?

The Department of Defense (DoD) develops and acquires some of the world’s most advanced and sophisticated systems. As new technologies emerge and are incorporated into systems, OSD/DOT&E faces the challenge of ensuring that these systems undergo adequate and efficient test and evaluation (T&E) prior to operational use. Statistical engineering is a collaborative, analytical approach to problem solving that integrates statistical thinking, methods, and tools with other relevant disciplines. The statistical engineering process provides better solutions to large, unstructured, real-world problems and supports rigorous decision-making....

A Review of Sequential Analysis

Sequential analysis concerns statistical evaluation in situations in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends upon the information acquired throughout the course of the investigation. Expanding the use of sequential analysis has the potential to save resources and reduce test time (National Research Council, 1998). This paper summarizes the literature on sequential analysis and offers fundamental information for providing recommendations for its use in DoD test and evaluation....

A Validation Case Study- The Environment Centric Weapons Analysis Facility (ECWAF)

Reliable modeling and simulation (M&S) allows the undersea warfare community to understand torpedo performance in scenarios that could never be created in live testing, and do so for a fraction of the cost of an in-water test. The Navy hopes to use the Environment Centric Weapons Analysis Facility (ECWAF), a hardware-in-the-loop simulation, to predict torpedo effectiveness and supplement live operational testing. In order to trust the model’s results, the T&E community has applied rigorous statistical design of experiments techniques to both live and simulation testing....

Circular Prediction Regions for Miss Distance Models under Heteroskedasticity

Circular prediction regions are used in ballistic testing to express the uncertainty in shot accuracy. We compare two modeling approaches for estimating circular prediction regions for the miss distance of a ballistic projectile. The miss distance response variable is bivariate normal and has a mean and variance that can change with one or more experimental factors. The first approach fits a heteroskedastic linear model using restricted maximum likelihood, and uses the Kenward-Roger statistic to estimate circular prediction regions....

D-Optimal as an Alternative to Full Factorial Designs- a Case Study

The use of Bayesian statistics and experimental design as tools to scope testing and analyze data related to defense has increased in recent years. Planning a test using experimental design will allow testers to cover the operational space while maximizing the information obtained from each run. Understanding which factors can affect a detector’s performance can influence military tactics, techniques and procedures, and improve a commander’s situational awareness when making decisions in an operational environment....

Designing Experiments for Model Validation- The Foundations for Uncertainty Quantification

Advances in computational power have allowed both greater fidelity and more extensive use of such models. Numerous complex military systems have a corresponding model that simulates its performance in the field. In response, the DoD needs defensible practices for validating these models. Design of Experiments and statistical analysis techniques are the foundational building blocks for validating the use of computer models and quantifying uncertainty in that validation. Recent developments in uncertainty quantification have the potential to benefit the DoD in using modeling and simulation to inform operational evaluations....

Handbook on Statistical Design & Analysis Techniques for Modeling & Simulation Validation

This handbook focuses on methods for data-driven validation to supplement the vast existing literature for Verification, Validation, and Accreditation (VV&A) and the emerging references on uncertainty quantification (UQ). The goal of this handbook is to aid the test and evaluation (T&E) community in developing test strategies that support model validation (both external validation and parametric analysis) and statistical UQ. Suggested Citation Wojton, Heather, Kelly M Avery, Laura J Freeman, Samuel H Parry, Gregory S Whittier, Thomas H Johnson, and Andrew C Flack....

Sample Size Determination Methods Using Acceptance Sampling by Variables

Acceptance Sampling by Variables (ASbV) is a statistical testing technique used in Personal Protective Equipment programs to determine the quality of the equipment in First Article and Lot Acceptance Tests. This article intends to remedy the lack of existing references that discuss the similarities between ASbV and certain techniques used in different sub-disciplines within statistics. Understanding ASbV from a statistical perspective allows testers to create customized test plans, beyond what is available in MIL-STD-414....

Use of Design of Experiments in Survivability Testing

The purpose of survivability testing is to provide decision makers with relevant, credible evidence about the survivability of an aircraft that is conveyed with some degree of certainty or inferential weight. In developing an experiment to accomplish this goal, a test planner faces numerous questions What critical issue or issues are being address? What data are needed to answer the critical issues? What test conditions should be varied? What is the most economical way of varying those conditions?...

Analysis of Split-Plot Reliability Experiments with Subsampling

Reliability experiments are important for determining which factors drive product reliability. The data collected in these experiments can be challenging to analyze. Often, the reliability or lifetime data collected follow distinctly nonnormal distributions and include censored observations. Additional challenges in the analysis arise when the experiment is executed with restrictions on randomization. The focus of this paper is on the proper analysis of reliability data collected from a nonrandomized reliability experiments....

Comparing M&S Output to Live Test Data- A Missile System Case Study

In the operational testing of DoD weapons systems, modeling and simulation (M&S) is often used to supplement live test data in order to support a more complete and rigorous evaluation. Before the output of the M&S is included in reports to decision makers, it must first be thoroughly verified and validated to show that it adequately represents the real world for the purposes of the intended use. Part of the validation process should include a statistical comparison of live data to M&S output....

JEDIS Briefing and Tutorial

Are you sick of having to manually iterate your way through sizing your design of experiments? Come learn about JEDIS, the new IDA-developed JMP Add-In for automating design of experiments power calculations. JEDIS builds multiple test designs in JMP over user-specified ranges of sample sizes, Signal-to-Noise Ratios (SNR), and alpha (1 -confidence) levels. It then automatically calculates the statistical power to detect an effect due to each factor and any specified interactions for each design....

Power Approximations for Reliability Test Designs

Reliability tests determine which factors drive system reliability. Often, the reliability or failure time data collected in these tests tend to follow distinctly non- normal distributions and include censored observations. The experimental design should accommodate the skewed nature of the response and allow for censored observations, which occur when systems under test do not fail within the allotted test time. To account for these design and analysis considerations, Monte Carlo simulations are frequently used to evaluate experimental design properties....

Scientific Test and Analysis Techniques

Abstract This document contains the technical content for the Scientific Test and Analysis Techniques (STAT) in Test and Evaluation (T&E) continuous learning module. The module provides a basic understanding of STAT in T&E. Topics coverec include design of experiments, observational studies, survey design and analysis, and statistical analysis. It is designed as a four hour online course, suitable for inclusion in the DAU T&E certification curriculum. Slides

Scientific Test and Analysis Techniques- Continuous Learning Module

This document contains the technical content for the Scientific Test and Analysis Techniques (STAT) in Test and Evaluation (T&E) continuous learning module. The module provides a basic understanding of STAT in T&E. Topics covered include design of experiments, observational studies, survey design and analysis, and statistical analysis. It is designed as a four hour online course, suitable for inclusion in the DAU T&E certification curriculum. Suggested Citation Pinelis, Yevgeniya, Laura J Freeman, Heather M Wojton, Denise J Edwards, Stephanie T Lane, and James R Simpson....

Testing Defense Systems

The complex, multifunctional nature of defense systems, along with the wide variety of system types, demands a structured but flexible analytical process for testing systems. This chapter summarizes commonly used techniques in defense system testing and specific challenges imposed by the nature of defense system testing. It highlights the core statistical methodologies that have proven useful in testing defense systems. Case studies illustrate the value of using statistical techniques in the design of tests and analysis of the resulting data....

On Scoping a Test that Addresses the Wrong Objective

Statistical literature refers to a type of error that is committed by giving the right answer to the wrong question. If a test design is adequately scoped to address an irrelevant objective, one could say that a Type III error occurs. In this paper, we focus on a specific Type III error that on some occasions test planners commit to reduce test size and resources. Suggested Citation Johnson, Thomas H., Rebecca M....

Perspectives on Operational Testing-Guest Lecture at Naval Postgraduate School

This document was prepared to support Dr. Lillard’s visit to the NavalPostgraduate School where he will provide a guest lecture to students in the T&Ecourse. The briefing covers three primary themes: 1) evaluation of military systemson the basis of requirements and KPPs alone is often insufficient to determineeffectiveness and suitability in combat conditions, 2) statistical methods are essentialfor developing defensible and rigorous test designs, 3) operational testing is often theonly means to discover critical performance shortcomings....

Power Approximations for Generalized Linear Models using the Signal-to-Noise Transformation Method

Statistical power is a useful measure for assessing the adequacy of anexperimental design prior to data collection. This paper proposes an approach referredto as the signal-to-noise transformation method (SNRx), to approximate power foreffects in a generalized linear model. The contribution of SNRx is that, with a coupleassumptions, it generates power approximations for generalized linear model effectsusing F-tests that are typically used in ANOVA for classical linear models.Additionally, SNRx follows Ohlert and Whitcomb’s unified approach for sizing aneffect, which allows for intuitive effect size definitions, and consistent estimates ofpower....

Statistical Methods for Defense Testing

In the increasingly complex and data‐limited world of military defense testing, statisticians play a valuable role in many applications. Before the DoD acquires any major new capability, that system must undergo realistic testing in its intended environment with military users. Although the typical test environment is highly variable and factors are often uncontrolled, design of experiments techniques can add objectivity, efficiency, and rigor to the process of test planning. Statistical analyses help system evaluators get the most information out of limited data sets....

Tutorial on Sensitivity Testing in Live Fire Test and Evaluation

A sensitivity experiment is a special type of experimental design that is used when the response variable is binary and the covariate is continuous. Armor protection and projectile lethality tests often use sensitivity experiments to characterize a projectile’s probability of penetrating the armor. In this mini-tutorial we illustrate the challenge of modeling a binary response with a limited sample size, and show how sensitivity experiments can mitigate this problem. We review eight different single covariate sensitivity experiments and present a comparison of these designs using simulation....

Estimating System Reliability from Heterogeneous Data

This briefing provides an example of some of the nuanced issues in reliability estimation in operational testing. The statistical models are motivated by an example of the Paladin Integrated Management (PIM). We demonstrate how to use a Bayesian approach to reliability estimation that uses data from all phases of testing. Suggested Citation Browning, Caleb, Laura Freeman, Alyson Wilson, Kassandra Fronczyk, and Rebecca Dickinson. “Estimating System Reliability from Heterogeneous Data.” Presented at the Conference on Applied Statistics in Defense, George Mason University, October 2015....

A Comparison of Ballistic Resistance Testing Techniques in the Department of Defense

This paper summarizes sensitivity test methods commonly employed in the Department of Defense. A comparison study shows that modern methods such as Neyer’s method and Three-Phase Optimal Design are improvements over historical methods. Suggested Citation Johnson, Thomas H., Laura Freeman, Janice Hester, and Jonathan L. Bell. “A Comparison of Ballistic Resistance Testing Techniques in the Department of Defense.” IEEE Access 2 (2014): 1442–55. https://doi.org/10.1109/ACCESS.2014.2377633. Paper:

Design of Experiments for in-Lab Operational Testing of the an/BQQ-10 Submarine Sonar System

Operational testing of the AN/BQQ-10 submarine sonar system has never been able to show significant improvements in software versions because of the high variability of at sea measurements. To mitigate this problem, in the most recent AN/BQQ-10 operational test, the Navy’s operational test agency (in consultation with IDA under the direction of Director, Operational Test and Evaluation) supplemented the at sea testing with an operationally focused in-lab comparison. This test used recorded real data played back on two different versions of the sonar system....

Power Analysis Tutorial for Experimental Design Software

This guide provides both a general explanation of power analysis and specific guidance to successfully interface with two software packages, JMP and Design Expert (DX). Suggested Citation Freeman, Laura J., Thomas H. Johnson, and James R. Simpson. “Power Analysis Tutorial for Experimental Design Software:” Fort Belvoir, VA: Defense Technical Information Center, November 1, 2014. https://doi.org/10.21236/ADA619843. Paper:

Taking the Next Step- Improving the Science of Test in DoD T&E

The current fiscal climate demands now, more than ever, that test and evaluation(T&E) provide relevant and credible characterization of system capabilities andshortfalls across all relevant operational conditions as efficiently as possible. Indetermining the answer to the question, “How much testing is enough?” it isimperative that we use a scientifically defensible methodology. Design ofExperiments (DOE) has a proven track record in Operational Test andEvaluation (OT&E) of not only quantifying how much testing is enough, but alsowhere in the operational space the test points should be placed....

A Tutorial on the Planning of Experiments

This tutorial outlines the basic procedures for planning experiments within the context of the scientific method. Too often quality practitioners fail to appreciate how subject-matter expertise must interact with statistical expertise to generate efficient and effective experimental programs. This tutorial guides the quality practitioner through the basic steps, demonstrated by extensive past experience, that consistently lead to successful results. This tutorial makes extensive use of flowcharts to illustrate the basic process....

Censored Data Analysis- A Statistical Tool for Efficient and Information-Rich Testing

Binomial metrics like probability-to-detect or probability-to-hit typically provide operationally meaningful and easy to interpret test outcomes. However, they are information-poor metrics and extremely expensive to test. The standard power calculations to size a test employ hypothesis tests, which typically result in many tens to hundreds of runs. In addition to being expensive, the test is most likely inadequate for characterizing performance over a variety of conditions due to the inherently large statistical uncertainties associated with binomial metrics....

Scientific Test and Analysis Techniques- Statistical Measures of Merit

Design of Experiments (DOE) provides a rigorous methodology for developing and evaluating test plans. Design excellence consists of having enough test points placed in the right locations in the operational envelope to answer the questions of interest for the test. The key aspects of a well-designed experiment include: the goal of the test, the response variables, the factors and levels, a method for strategically varying the factors across the operational envelope, and statistical measures of merit....

Statistically Based T&E Using Design of Experiments

This document outlines the charter for the Committee to Institutionalize Scientific Test Design and Rigor in Test and Evaluation. The charter defines the problem, identifies potential steps in a roadmap for accomplishing the goals of the committee and lists committeemembership. Once the committee is assembled, the members will revise this document as needed. The charter will be endorsed by DOT&E and DDT&E, once finalize. Suggested Citation Freeman, Laura. Statistically Based T&E Using Design of Experiments....

An Expository Paper on Optimal Design

There are many situations where the requirements of a standard experimental design do not fit the research requirements of the problem. Three such situations occur when the problem requires unusual resource restrictions, when there are constraints on the design region, and when a non-standard model is expected to be required to adequately explain the response. Suggested Citation Johnson, Rachel T., Douglas C. Montgomery, and Bradley A. Jones. “An Expository Paper on Optimal Design....

Design for Reliability using Robust Parameter Design

Recently, the principles of Design of Experiments (DOE) have been implemented as amethod of increasing the statistical rigor of operational tests. The focus has been on ensuringcoverage of the operational envelope in terms of system effectiveness. DOE is applicable inreliability analysis as well. A reliability standard, ANSI-0009, advocates the use Design forReliability (DfR) early in the product development cycle in order to design-in reliability. Robustparameter design (RPD) first used by Taguchi and then by the response surface communityprovides insights on how DOE can be used to make a products and processes invariant tochanges in factors....

Hybrid Designs- Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models

This tutorial provides an overview of experimental design for modeling and simulation. Pros and cons of each design methodology are discussed. Suggested Citation Silvestrini, Rachel Johnson. “Hybrid Designs: Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models.” Monterey, California, May 2011. Slides:

Use of Statistically Designed Experiments to Inform Decisions in a Resource Constrained Environment

There has been recent emphasis on the increased use of statistics, including the use of statistically designed experiments, to plan and execute tests that support Department of Defense (DoD) acquisition programs. The use of statistical methods, including experimental design, has shown great benefits in industry, especially when used in an integrated fashion; for example see the literature on Six Sigma. The structured approach of experimental design allows the user to determine what data need to be collected and how it should be analyzed to achieve specific decision making objectives....