Psychological Science: The Theory of Test Reliability – Correcting 100 Year Old Mistakes (Norman Costa)
By
Norman Costa, Ph.D.
The Submarine Torpedoes That Would Not Explode
Non-exploding torpedoes may seem a strange way to open a discussion about the reliability of educational and psychological tests. This true story from the U.S. Navy in World War 2, however, is a perfect example of how to understand the concept of reliability. If submarine torpedoes seem an odd comparison to psychological tests, keep in mind that both are tools or instruments that are expected to perform in their own ways. We expect our tools or instruments to be reliable. So let's see what we mean by reliability.
Following the Japanese attack on Pearl Harbor in the Hawaiian Islands on December 7, 1941, the United States entered the Second World War. After the United States declared war upon Japan, Germany declared war upon the United States. Upon the declaration of war on Japan, the U.S. Navy's fleet of submarines was authorized to hunt and sink enemy warships and enemy ships of commerce. The submarine service immediately went on the offensive and prowled the waters for prey. They did a superb job of tracking down their targets, and firing their torpedoes at the hulls of enemy ships with great accuracy. For all of their stealth, bravery, and accuracy, none of the torpedoes exploded.
The failure of virtually all torpedoes in the submarine fleet at the outset of the war was kept a tightly guarded secret during and after the war. Today, few people know about the non-exploding torpedoes. Almost no one is aware of the cause of the problem. It was sabotage. German spies and agents had infiltrated the factories that produced the torpedoes for the U.S. Navy, even before the U.S. entered the war. In 1967, I met Richie Harris, one of the former workers at one of the torpedo factories. He told me that he was a maintenance worker and was sweeping the floor of the office of one of the managers. A man who looked like any other factory worker came into the office, flashed a badge and identified himself as an FBI agent, told him to stay in the office and not come out until told that he could do so. The FBI agent closed the door and locked Richie Harris in the office. That was the opening of a massive arrest of saboteurs in the factory.
About six months later a huge picture of an Italian battleship was hung on the wall of the factory floor. That battleship was sunk by one of their torpedoes.
I always loved that story, but now I have to get back to a discussion of reliability. If I were to ask someone to make a statement about the reliability of the torpedoes, it's a good bet that the torpedoes would be described as unreliable. It seems obvious that they were unreliable, and with the arrest of the saboteurs the torpedoes became reliable. From the perspective of the U.S. Navy, the ship-sinking weapon system did not perform according to requirements for submarine torpedoes. They did not function according to expectation; therefore, they were unreliable. From the perspective of the German saboteurs, the torpedoes (the same torpedoes that were unreliable for the U.S. Navy) performed according to requirements. The torpedoes functioned as a defensive system to protect Axis shipping. They functioned according to expectation; therefore, they were reliable.
The major lesson of this example is that reliability of a tool or instrument is determined by its purpose, or statement of performance requirements, or fulfilling a specific function. The reliability of the torpedoes was dependent upon the function they were to fulfill. The reliability of the torpedoes is undetermined until we know what they were supposed to do, and then observe their performance.
Pigeon Poop and the Noisy Antenna
In the 1960s, two engineers, Arno Penzias and Robert Woodrow Wilson, were using the Horn Antenna at Bell Laboratory in New Jersey to study radio waves that were bounced off echo balloons in the atmosphere. It was named the Horn Antenna because it looked like a horn. In order to do their research properly, they had to get as pure a signal as possible from the radio waves bouncing off the balloons. This meant they had to eliminate, or filter out, extraneous signals coming into the Horn Antenna. Extraneous signals could be coming from commercial radio or television broadcasts, military communications, malfunctioning transmitters in the neighborhood, airplane traffic, faulty components in the Horn Antenna, and the like.
After a period of time, and great frustration, Penzias and Wilson were unable to get the pure signal they desired with the Horn Antenna. There was still an unacceptable amount of extraneous signal. At one point, they thought the source was the accumulation of pigeon droppings inside the horn of the antenna. After unsuccessful attempts to clean the poop and banish the pigeons, Penzias got his shotgun and dispatched the pigeons, forthwith. It wasn't the pigeon poop. There was no decrease in the remaining extraneous signal.
The failure of the Horn Antenna to provide the pure radio signal they needed was more than offset when they realized that the extraneous signal was Cosmic Microwave Background Radiation (CMBR). They discovered a type of signal that was predicted to be a tell-tale remnant of the formation of the very early universe, and support for the Big Bang Theory. They were awarded a Nobel Prize for this incredible discovery.
Let's get back to the concept of reliability and leave pigeon poop behind. This example is especially helpful in understanding reliability. The Horn Antenna was not a reliable instrument for the purposes of studying pure radio waves bouncing off echo balloons – or at least not as reliable as they required. The Horn Antenna at Bell Laboratory did not perform according to the stated requirements of Penzias and Wilson; therefore, it was not a reliable tool or instrument for their needs. However, when they realized they had detected CMBR, the Horn Antenna was found to be a reliable measure for a different purpose. Reliability of a test, or instrument, or tool is dependent upon its purpose, or statement of performance requirement, or fulfilling a specific function.
There is another way of looking at the reliability of the Horn Antenna. The CMBR was considered to be noise when the purpose of the antenna was to produce pure radio waves bouncing off echo balloons. The noise had to be eliminated. When the purpose of the antenna was changed to measuring CMBR, what was once noise became the signal of interest. All other signals that used to be of interest were now noise. It is meaningless to speak of reliability of an instrument that is not associated with purpose, or requirements, or function.
The Theory of Test Reliability for Educational and Psychological Testing
The following is the definition of test reliability from the 1999 Standards for Educational and Psychological Testing. The first two sentences define a test, and the third and last sentence defines test reliability.
“A test, broadly defined, is a set of tasks designed to elicit or a scale to describe examinee behavior in a specified domain, or a system for collecting samples of an individual's work in a particular area. Coupled with the device is a scoring procedure that enables the examiner to quantify, evaluate, and interpret the behavior or work samples. Reliability refers to the consistency of such measurements when the testing procedure is repeated on a population of individuals or groups [emphasis mine].” p. 25.
The first thing to note is that the concept of measurement is defined as using a scoring procedure. There is nothing here about a comparison to a standard, as there is in all other science. It is consistent with the view of S. S. Stevens (1946) that measurement is defined by the procedure or rule that is used to assign numeric values to events or objects. There is no understanding in any official sense in the Standards, or in psychology, as to what constitutes a rule or procedure in any scientific sense. I discussed this in my prior articles, Psychological Science: Mathematical Argument and the Quest for Scientific Respectability – Part 1, and Part 2.
The second thing to note is that the reliability of a test is simply the consistency of assigning numbers. There is nothing in this definition of test reliability that requires a statement of the purpose to which the testing activity is aimed. The Standards begins with an implication that there is some intended purpose for the test. However, after the first two sentences, the definition of test reliability requires only that a testing and scoring procedure be repeated to produce one or more additional sets of scores. If the scores are the same for each individual, then you have a reliable test. The purpose of the test is irrelevant to the definition of reliability.
The reliability of a testing instrument as consistency of results, independent of purpose, does not exist anywhere else in science. Textbooks in psychology and published papers on psychological testing, for 100 years, ALL say that reliability is completely independent of the purpose, or validity, of a test. If we extend this idea of reliability as consistency of observed results, independent of purpose, to the examples of the non-exploding torpedoes and Bell Laboratory's Horn Antenna, we reach some interesting, if bizarre, conclusions. According to psychological science, the weapon system of non-exploding torpedoes is a very reliable system. In fact, it is equally reliable for the U.S. Submarine Service as it is for the German saboteurs. This makes sense for psychology, but for no other science in the world.
The same interpretation applies to the Horn Antenna. The measurements provided by the antenna were consistent whether used, without success, for producing pure radio signals bounced off an echo balloon, or used, successfully, to detect CMBR. It was useless for the original intended purpose, and successful for the later purpose. According to psychological science, the instrument was reliable when it failed its intended purpose, and reliable when it was successful for another purpose.
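The distinction between consistency and purpose can be shown with a small sketch. The numbers below are invented for illustration, and the names (first_run, second_run, criterion) are assumptions of mine, not anything taken from the Standards. The sketch computes an ordinary Pearson correlation twice: once between two administrations of the same hypothetical test (consistency), and once between the test and a hypothetical criterion it is supposed to predict (success for the intended purpose).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six examinees on two administrations of one test.
first_run = [12, 15, 9, 20, 17, 11]
second_run = [13, 15, 10, 19, 17, 12]

# Hypothetical criterion the test is supposed to predict (its stated purpose).
criterion = [5, 1, 3, 3, 4, 2]

# "Reliability" in the consistency-only sense: do repeated scores agree?
consistency = pearson(first_run, second_run)

# Success for the intended purpose: do the scores track the criterion?
validity = pearson(first_run, criterion)

print(f"consistency across administrations: {consistency:.2f}")
print(f"correlation with the criterion:     {validity:.2f}")
```

With these made-up scores the two administrations agree almost perfectly, while the scores have essentially no relationship to the criterion. Under a consistency-only definition, such a test would still be called reliable.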
Every time I discuss this with another scientist or engineer, the definition of reliability from psychological science makes no sense to them. It makes even less sense when I describe how psychological science justifies its view of instrument reliability. It goes like this:
- Reliability is independent of the intended purpose of the instrument producing the result.
- An instrument that produces results can be reliable, even if it doesn't function according to intended purpose.
- Finally, and this is the humdinger, an instrument that produces results that are consistent with its intended purpose is, by definition, a reliable instrument. Reliability does not assure results as intended, but you cannot get results for an intended purpose unless the instrument is reliable. In the parlance of educational and psychological testing: Test validity – performing as per an intended purpose – is contingent upon reliability. You cannot have a test that is valid without it being reliable.
I am not going to go into the logical and philosophical crimes that are committed by the definition of reliability as consistency of observed results, without a reference to purpose. The most difficult thing for psychologists to reject is the notion that you cannot have a valid test if it is not, first, reliable. What is difficult to accept is that this statement is true ONLY BECAUSE IT HAS BEEN DEFINED AS TRUE. There is, seemingly, an initial appeal to common sense in that statement. That the appeal is more apparent than real will make the task even more difficult for psychology and psychologists. Sooner or later, psychological science must come to terms with the fact that test reliability, as described in the Standards, is an illusion. And, that is only the beginning of the problems with test reliability in psychological testing. I will discuss these later.
So, What Is Reliability?
Here we need to get our definitions in order if psychological science wants to deal in scientific concepts that are consistent with the rest of the scientific world. There are two principal concepts: validity, and reliability. In addition, there are related ideas of accuracy, precision, and error, but these will fall into place very nicely once we've understood validity and reliability. The Standards understands validity very, very well. The following is from the first three paragraphs of the chapter on validity:
“Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself [my emphasis]. When test scores are used or interpreted in more than one way, each intended interpretation must be validated.
“Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation refers to the construct the test is intended to measure….
“….To support test development, the proposed interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, processes, or characteristics to be assessed….” page 9.
In short, validity is the demonstration that a measurement instrument was used, successfully, for an intended purpose. At the end of the chapter on validity, there are 24 distinct requirements, called Standards, for demonstrating successful use of a test for an intended purpose. They are explicit and demanding, as well they should be. A failure to document successful use for an intended purpose means a test cannot be considered valid.
Now, what about reliability? As I described in my prior articles on this subject, pioneers of scientific psychology learned from their colleagues in other sciences that a test instrument, or any tool, is reliable if it produces the same results every time. That is what they heard, and it was translated into consistency of obtained results, independent of purpose. What those early psychologists did not hear, or did not understand, was something that was implicit in the understanding of reliability among other scientists. What those scientists were saying was that an instrument had to produce the same VALID results every time it is used for its intended purpose. Reliability is not independent of validity. A valid instrument must produce consistent results when used for its intended purpose under identical circumstances. Reliability is the documentation of repeated, successful use for an intended purpose, AND the expectation of the same in the future. Validity is successful use for an intended purpose. Reliability is the expectation of validity based upon history of repeated use. Reliability is dependent upon validity. You cannot have a test that is reliable without it, first, being valid. For 100 years, psychological science got it backwards.
Let's take a look at some examples of definitions of reliability in the more successful sciences. First, other sciences do not use the term 'validity,' or not as frequently as psychology. Instead, they refer directly to the purpose of an instrument, or a system, and the demonstration of success. This is the same as validity for psychological science. Other sciences describe purpose in terms of performing to stated requirements, or usefulness in fulfilling a function, or achieving a desired goal. Here are a few typical definitions from a variety of disciplines:
Discipline: Engineering
Definition of Reliability: The expectation, from history, that an item (component, assembly, plant, or process) performed its intended function, without failure, for the required time duration when installed and operated correctly in a specified environment.
How Reliability is Measured: Compute the probability that a device, system, or process performed its prescribed duty without failure for a given time when operated correctly in a specified environment.
Discipline: Life science
Definition of Reliability: The expectation, from history, of the survival of organisms.
How Reliability is Measured: Compute the probability that an organism survived after a given interval of time, or survived to a given age.
Discipline: Military
Definition of Reliability: The expectation of failure-free performance under stated conditions. The expectation that an item can perform its intended function for a specified interval under stated conditions.
How Reliability is Measured: Compute the duration or probability that an item performed its prescribed duty without failure for a given time when operated correctly in a specified environment.
Discipline: Psychology, OLD 1999 Standards
Definition of Reliability: Consistency of observed results
How Reliability is Measured: Compute correlation coefficient on repeated measures from alternate test forms.
Discipline: Psychology, NEW (Norman Costa)
Definition of Reliability: The expectation of validity, successful use for an intended purpose, from the documented history of repeated use.
How Reliability is Measured: Compute the probabilities of ranges of validity coefficients that an item or test produced when used for an intended purpose.
Discipline: Mathematical Reliability Theory
Definition of Reliability: “Generally speaking, it is a body of ideas, mathematical models, and methods directed toward the solution of problems in predicting, estimating, or optimizing the probability of survival, mean life, or, more generally, life distribution of components or systems: other problems considered in reliability theory are those involving the probability of proper functioning of the system at either a specified or an arbitrary time, or the proportion of time the system is functioning properly. In a large class of reliability situations, maintenance, such as replacement, repair, or inspection, may be performed, so that the solution of the reliability problem may influence decisions concerning maintenance policies to be followed. …[R]eliability theory is mainly concerned with probabilities, mean values, probability distributions…[.]”
Ideas of validity and reliability are essentially the same for all other sciences. There are subtle differences in usage, however, compared to psychology. As mentioned above, most other disciplines refer directly to success for intended purpose, without using the term 'validity.' Psychological science tends to think of validity and reliability as continuums from perfect validity or reliability to no validity or reliability at all. Other disciplines see reliability, as an example, in binary terms. A tool that has a history of performing to requirements, and is expected to do the same in the future, is reliable. Anything short of that is unreliable.
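As a rough sketch only, here is one way the proposed measure ("compute the probabilities of ranges of validity coefficients") and the binary reading of reliability might be worked out from a documented history of use. Everything specific here is an assumption for illustration: the ten validity coefficients are invented, and the range cut points, the required validity of 0.5, and the 90 percent threshold are hypothetical choices, not values from the Standards or from any other source.

```python
def proportion_in_range(values, low, high):
    """Share of values v with low <= v < high (an empirical probability)."""
    hits = sum(1 for v in values if low <= v < high)
    return hits / len(values)


# Hypothetical validity coefficients from ten documented uses of one test
# for one intended purpose. These numbers are invented for illustration.
history = [0.62, 0.58, 0.65, 0.41, 0.60, 0.63, 0.57, 0.66, 0.59, 0.61]

# Hypothetical ranges of validity coefficients (assumed cut points).
ranges = [(0.0, 0.5), (0.5, 0.6), (0.6, 1.0)]
for low, high in ranges:
    share = proportion_in_range(history, low, high)
    print(f"{low:.1f} <= r < {high:.1f}: {share:.0%} of documented uses")

# Binary reading: did the test perform to a stated requirement often enough?
REQUIRED_VALIDITY = 0.5  # assumed performance requirement
MIN_SHARE = 0.9          # assumed share of uses that must meet it

share_meeting = sum(1 for v in history if v >= REQUIRED_VALIDITY) / len(history)
is_reliable = share_meeting >= MIN_SHARE
print(f"share of uses meeting the requirement: {share_meeting:.0%}")
print("reliable" if is_reliable else "unreliable")
```

The point of the sketch is the dependency, not the particular numbers: the calculation cannot even begin without validity evidence from actual use for a stated purpose, which is the opposite of a consistency-only measure.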
What's Next?
In my next articles on the theory of test reliability, I will discuss:
- The terms accuracy, precision, and errors of measurement and how they fit into the ideas of validity and reliability;
- How the 100-year-old mistakes were made, who made them, and why they persist to this day;
- That the present definition of reliability in the Standards makes it impossible to distinguish, unambiguously, between reliability and validity;
- That numerous anomalies arise from the present definition of reliability;
- The final, absolute, positive proof that reliability in the Standards is an illusion;
- The implications of throwing out the present definition and adopting the new definition that I propose;
- And, so much more.
Thank you for reading. Please contribute to the comments. Bouquets and brickbats welcome.