Developing AI Trust: From Theory to Testing and the Myths in Between

This introductory work aims to provide members of the Test and Evaluation community with a clear understanding of trust and trustworthiness to support responsible and effective evaluation of AI systems. The paper provides a set of working definitions and works toward dispelling confusion and myths surrounding trust.

2024 · Yosef Razin, Kristen Alexander, John Haman

Meta-Analysis of the Effectiveness of the SALIANT Procedure for Assessing Team Situation Awareness

Many Department of Defense (DoD) systems aim to increase or maintain Situational Awareness (SA) at the individual or group level. In some cases, maintenance or enhancement of SA is listed as a primary function or requirement of the system. However, during test and evaluation, SA is examined inconsistently or is not measured at all. Situational Awareness Linked Indicators Adapted to Novel Tasks (SALIANT) is an empirically based methodology meant to measure SA at the team, or group, level....
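As a rough illustration of the kind of pooling such a meta-analysis involves (a generic sketch, not the authors' actual procedure or data), the Python snippet below combines hypothetical study effect sizes with fixed-effect, inverse-variance weighting and computes Cochran's Q as a heterogeneity check; every number is invented.

```python
import numpy as np

# Hypothetical per-study effect sizes (e.g., Fisher-z correlations) and
# their sampling variances -- illustrative values only, not SALIANT data.
effects = np.array([0.42, 0.55, 0.31, 0.48])
variances = np.array([0.020, 0.015, 0.030, 0.025])

# Fixed-effect meta-analysis: weight each study by its inverse variance.
weights = 1.0 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

# Cochran's Q flags between-study heterogeneity (compare to chi-square, k-1 df).
q = np.sum(weights * (effects - pooled) ** 2)

print(f"pooled effect = {pooled:.3f}, 95% CI +/- {1.96 * pooled_se:.3f}")
print(f"Cochran's Q = {q:.2f} on {len(effects) - 1} df")
```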

2024 · Sarah Shaffer, Miriam Armstrong

Operational T&E of AI-Supported Data Integration, Fusion, and Analysis Systems

AI will play an important role in future military systems. However, major questions remain about how to test AI systems, especially in operational settings. Here, we discuss an approach for the operational test and evaluation (OT&E) of AI-supported data integration, fusion, and analysis systems. We highlight new challenges posed by AI-supported systems and discuss new and existing OT&E methods for overcoming them. We demonstrate how to apply these OT&E methods via a notional test concept that focuses on evaluating an AI-supported data integration system in terms of its technical performance (how accurate is the AI output?...
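The technical-performance question raised above ("how accurate is the AI output?") usually reduces to scoring system output against ground truth. A minimal sketch, assuming a binary association task with invented labels (this is not the paper's notional test concept):

```python
import numpy as np

# Hypothetical ground-truth labels vs. AI output for fused records
# (1 = correct association, 0 = incorrect); values are invented.
truth = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
ai_out = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

tp = np.sum((ai_out == 1) & (truth == 1))  # true positives
fp = np.sum((ai_out == 1) & (truth == 0))  # false positives
fn = np.sum((ai_out == 0) & (truth == 1))  # false negatives
tn = np.sum((ai_out == 0) & (truth == 0))  # true negatives

accuracy = (tp + tn) / truth.size
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In an operational test these confusion-matrix counts would typically be sliced by operating condition, since aggregate accuracy can hide condition-specific failures.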

2024 · Adam Miller, Logan Ausman, John Haman, Keyla Pagan-Rivera, Sarah Shaffer, Brian Vickers

Statistical Advantages of Validated Surveys over Custom Surveys

Surveys play an important role in quantifying user opinion during test and evaluation (T&E). Current best practice is to use surveys that have been tested, or “validated,” to ensure that they produce reliable and accurate results. However, unvalidated (“custom”) surveys are still widely used in T&E, raising questions about how to determine sample sizes for—and interpret data from—T&E events that rely on custom surveys. In this presentation, I characterize the statistical properties of validated and custom survey responses using data from recent T&E events, and then I demonstrate how these properties affect test design, analysis, and interpretation....
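One concrete way survey properties drive test design is through sample size: noisier responses require more respondents to estimate a mean rating to a fixed precision. A minimal sketch using the standard normal-approximation formula n = (z * sigma / E)^2, with hypothetical standard deviations rather than values from the presentation:

```python
import math

def required_n(sigma: float, margin: float, z: float = 1.96) -> int:
    """Respondents needed to estimate a mean within +/- margin, ~95% confidence."""
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical response standard deviations on a 7-point scale: a validated
# scale with tighter psychometrics vs. a noisier custom survey.
print(required_n(sigma=1.2, margin=0.5))  # validated survey: 23 respondents
print(required_n(sigma=1.8, margin=0.5))  # custom survey:    50 respondents
```

Under these assumed values, the noisier custom instrument more than doubles the sample required for the same precision.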

2024 · Adam Miller

A Team-Centric Metric Framework for Testing and Evaluation of Human-Machine Teams

We propose and present a parallelized metric framework for evaluating human-machine teams that draws upon current knowledge of human-systems interfacing and integration but is rooted in team-centric concepts. Humans and machines working together as a team involves interactions that will only increase in complexity as machines become more intelligent, capable teammates. Assessing such teams will require explicit focus on not just the human-machine interfacing but the full spectrum of interactions between and among agents....

2023 · Wilkins, David Sparrow, Caitlan Fealing, Brian Vickers, Kristina Ferguson, Heather Wojton

AI + Autonomy T&E in DoD

Test and evaluation (T&E) of AI-enabled systems (AIES) often emphasizes algorithm accuracy over robust, holistic system performance. While this narrow focus may be adequate for some applications of AI, for many complex uses, T&E paradigms removed from operational realism are insufficient. However, leveraging traditional operational testing (OT) methods to evaluate AIESs can fail to capture novel sources of risk. This brief establishes a common AI vocabulary and highlights OT challenges posed by AIESs by answering the following questions...

2023 · Brian Vickers, Matthew Avery, Rachel Haga, Mark Herrera, Daniel Porter, Stuart Rodgers

Measuring Training Efficacy: Structural Validation of the Operational Assessment of Training Scale

Effective training of the broad set of users/operators of systems has downstream impacts on usability, workload, and ultimate system performance that are related to mission success. To measure training effectiveness, we designed a survey called the Operational Assessment of Training Scale (OATS) in partnership with the Army Test and Evaluation Center (ATEC). Two subscales were designed to assess the degrees to which training covered relevant content for real operations (Relevance subscale) and enabled self-rated ability to interact with systems effectively after training (Efficacy subscale)....
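Structural validation of a scale like OATS typically includes checking each subscale's internal consistency. The sketch below computes Cronbach's alpha on simulated Likert responses; it is a generic illustration, not the OATS analysis or its data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability; rows = respondents, columns = items."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Simulated 1-5 Likert responses to a four-item subscale: each respondent's
# items share a common level, so the items correlate and alpha comes out high.
rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=(30, 1))
responses = np.clip(base + rng.integers(-1, 2, size=(30, 4)), 1, 5)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```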

2022 · Brian Vickers, Rachel Haga, Daniel Porter, Heather Wojton

Demystifying the Black Box: A Test Strategy for Autonomy

The purpose of this briefing is to provide a high-level overview of how to frame the question of testing autonomous systems in a way that will enable development of successful test strategies. The brief outlines the challenges and broad-stroke reforms needed to get ready for the test challenges of the next century.

2019 · Heather Wojton, Daniel Porter

Users are Part of the System: How to Account for Human Factors when Designing Operational Tests for Software Systems

The goal of operational testing (OT) is to evaluate the effectiveness and suitability of military systems for use by trained military users in operationally realistic environments. Operators perform missions and make systems function. Thus, adequate OT must assess not only system performance and technical capability across the operational space, but also the quality of human-system interactions. Software systems in particular pose a unique challenge to testers. While some software systems may be inherently deterministic, once they are placed in their intended environment with error-prone humans and highly stochastic networks, outcomes often vary; tests therefore need to account for both finding “bugs” and characterizing variability....

2017 · Kelly Avery, Heather Wojton

Surveys in Operational Test and Evaluation

Recently, DOT&E signed out a memo providing Guidance on the Use and Design of Surveys in Operational Test and Evaluation. This guidance memo helps the Human Systems Integration (HSI) community ensure that useful and accurate HSI data are collected. Information about how HSI experts can leverage the guidance is presented. Specifically, the presentation will cover which HSI metrics can and cannot be answered by surveys.

2015 · Rebecca Grier, Laura Freeman