
Reliability: “the quality of being trustworthy or of performing consistently well”

The term reliability is very popular among PhD students and academics, and it can be quite confusing for newcomers to research. The reliability of any measure used in a study is very important for the quality of the overall research findings. In this post, I would like to share my lessons related to reliability.

Reliability in statistics refers to the overall consistency of a measure. A measure is said to be reliable if it produces similar results under consistent conditions. The terms consistency, reproducibility, repeatability, and agreement are often used interchangeably to define reliability. There are several classes of reliability estimates, such as inter-rater reliability, intra-rater reliability, and test-retest reliability. Understanding these different forms through a scientific example might be a little difficult, so I have used examples from my everyday life to explain them.

Intra-rater reliability assesses the degree to which test scores are consistent from one administration to the next when the test is administered by the same rater. Intra-rater reliability and test-retest reliability are often used interchangeably. Test-retest reliability assesses the degree to which test scores are consistent from one test administration to the next. In other words, give the same test twice to the same person at different times to see if the scores are consistent.
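To make this concrete, a common way to express test-retest reliability is to correlate scores from two administrations of the same test to the same people. A minimal sketch, with made-up scores (the numbers below are purely illustrative, not data from any real study):

```python
# Test-retest sketch: correlate scores from two administrations of the
# same test to the same five people. All scores here are hypothetical.

def pearson_r(x, y):
    """Pearson's correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 15, 9, 20, 17]   # first administration (hypothetical)
time2 = [13, 14, 10, 19, 18]  # same people, same test, a later date

r = pearson_r(time1, time2)
print(f"test-retest r = {r:.2f}")  # a value close to 1 means consistent scores
```

An r close to 1 would suggest the test gives consistent scores on repetition; an r near 0 would suggest it does not.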

For example, my three-year-old daughter (Myra) has developed a taste for Vegemite toast. Believe it or not, she can eat it for breakfast, lunch, dinner, and as a snack on the train or after her swim lesson. Not even once has she refused to eat it. So I claim that every time I offer her a Vegemite toast, she is gonna love it! Reliably! In other words, the test-retest reliability was 100%.

Inter-rater reliability assesses the degree of agreement between two or more raters in their appraisals. Understanding this concept through the same example: every time I prepare the toast, Myra eats it. I score the task as completed, but Gowtham (my husband) scores it differently, because she doesn’t finish the crust. Hence, in this situation, the same action is scored differently. I conclude that she likes the Vegemite toast, while Gowtham concludes that she kind of likes it. This difference in the observations is known as a measurement error (also called an observational error). In this example, my daughter’s behaviour depended on who was checking it. There was no inter-rater (between-rater) reliability.
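The simplest way to put a number on inter-rater reliability is percent agreement between the two raters across the same set of observations. A sketch with invented scores for the toast example (the 1/0 values below are made up to mirror the story, not real records):

```python
# Inter-rater sketch: two raters (mum and dad) score the same ten toast
# sessions as completed (1) or not (0). All scores are hypothetical.

mum = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # mum counts every toast as eaten
dad = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # dad docks sessions with leftover crust

agreements = sum(m == d for m, d in zip(mum, dad))
percent_agreement = agreements / len(mum)
print(f"inter-rater agreement = {percent_agreement:.0%}")  # prints 40%
```

Percent agreement is the crudest index; chance-corrected statistics such as Cohen’s kappa or the ICC mentioned later are usually preferred in practice.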

While it seemed like Myra was enjoying her Vegemite toast every time, there was a funny phenomenon: every time her dad offered her a toast, she wouldn’t finish it. This happened on three or four occasions in a row. Obviously, the toast and the Vegemite spread were the same, so what was different? Surely her preference is not reliable between the raters (her mum and dad)? Does that mean I (her mum) prepare it differently from her dad? Or does she finish her toast just to please her mum?

So every time I offered her toast, she would finish it, but no matter how many times Gowtham prepared her toast, she would not finish it. The test-retest reliability for each of us (me and Gowtham) was very high. So what went wrong?

To solve the Vegemite toast puzzle, I let Myra make her own toast. Not only did she enjoy the activity of spreading the Vegemite on her toast (what a sense of accomplishment!), she finished her toast as well. Let’s not talk about the mess on her face, clothes, and dining chair; I will save that to explain some other statistical tool.

Now, why were the results different depending on who was preparing the toast? The reason could be a measurement error.

Measurement error is the difference between a measured quantity and its true value. It includes random error (naturally occurring errors that are to be expected: Myra had her watermelon and then might refuse to eat the Vegemite toast) and systematic error (like a mis-calibrated instrument that affects all measurements: Myra was not eating Dad’s toast because he was spreading the Vegemite on toasted bread, which feels and tastes different, at least to Myra).

The most commonly adopted statistical procedures for assessing these different types of reliability are the intraclass correlation coefficient (ICC) and Pearson’s correlation coefficient (r). Measurement error can be quantified with the standard error of measurement (SEM), the coefficient of variation (CV), and the limits of agreement (LOA). If you are interested, detailed information about these statistical tests can be found here.
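As a rough sketch of how two of these error statistics fall out of a set of scores, here is the standard SEM formula (SD times the square root of one minus the reliability coefficient) and the CV as a percentage of the mean. Both the scores and the ICC value below are invented; in practice the ICC would come from a proper reliability analysis:

```python
import math

# Measurement-error sketch: SEM and CV from repeated scores plus a
# reliability estimate. Scores and ICC are hypothetical illustrations.

scores = [12, 15, 9, 20, 17]   # repeated measurements (made up)
icc = 0.90                     # reliability estimate, e.g. from an ICC analysis

n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))  # sample SD

sem = sd * math.sqrt(1 - icc)  # SEM = SD * sqrt(1 - reliability)
cv = sd / mean * 100           # CV expressed as a percentage of the mean

print(f"mean = {mean:.1f}, SD = {sd:.2f}")
print(f"SEM = {sem:.2f}, CV = {cv:.1f}%")
```

A smaller SEM means individual scores sit closer to their true values; a smaller CV means the spread is small relative to the typical score.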


Academic Reading:

Portney, L. G., & Watkins, M. P. (2007). Foundations of Clinical Research: Applications to Practice (3rd ed.). Upper Saddle River, NJ: Prentice Hall.