Wednesday, January 12, 2011

Reliability and Validity

Reliability and validity are two important characteristics of any measurement procedure. Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials. A measure is considered reliable if a person's score on the same test, given twice, is similar. Another way to think of reliability is to imagine a kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the same scale should register five pounds for the potatoes an hour later (unless they have been peeled and cooked). Likewise, instruments such as classroom tests and national standardized exams should be reliable: it should not make any difference whether a student takes the assessment in the morning or afternoon, on one day or the next.

Reliability is usually estimated in one of three ways:

1. Stability – A measure is stable if one can secure consistent results with repeated measurements of the same person using the same instrument.

Method: Test/Retest

The idea behind test/retest is that the same group of subjects should get roughly the same score on test 1 as they do on test 2, given on a separate day, weeks, or months later (preferably less than six months apart). The reliability coefficient is the correlation between the two sets of scores.
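
A minimal sketch of that computation in Python, using hypothetical scores (the data and variable names here are illustrative, not from any real administration):

```python
import numpy as np

# Hypothetical scores for the same five students on two administrations
# of the same test, given a few weeks apart.
test1 = np.array([78, 85, 62, 90, 71])
test2 = np.array([80, 83, 65, 92, 68])

# The test/retest (stability) coefficient is the Pearson correlation
# between the two administrations; values near 1.0 indicate stability.
r = np.corrcoef(test1, test2)[0, 1]
print(f"Test/retest reliability: r = {r:.3f}")
```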

2. Equivalence – Reliability is associated with the degree to which alternative forms of the same measure produce the same or similar results. This approach considers how much error may be introduced by different investigators (in observations) or by different samples of items (in questioning or scales). The difference between stability and equivalence is that stability is concerned with personal and situational fluctuations from one time to another, while equivalence is concerned with variations at one point in time among observers and samples of items. The major interest with equivalence is typically not how respondents differ from item to item but how well a given set of items will categorize an individual.

Method: Parallel forms

Parallel forms of a test may be administered to the same group of subjects simultaneously or with a delay, and the paired observations correlated. The equivalence coefficient is computed exactly as in the test/retest sketch above, with the two forms taking the place of the two administrations.

3. Internal Consistency
This uses only one administration of a test or an instrument in order to assess consistency or homogeneity among the items. Thus it does not involve a time interval, as the test/retest and parallel-forms methods do.

Methods:

A.      Split-Half Method. This method can be used when the measuring tool has many similar statements or questions to which the subject can respond. After the instrument is administered, the items are split into two halves (odd- versus even-numbered items, or randomly selected halves), each half is scored, and the two half-scores are correlated (see the first sketch after this list).
B.      Kuder-Richardson Methods, Formula 20 and 21. These methods measure the extent to which items within one form of the test have as much in common with one another as do the items in that one form with corresponding items in an equivalent form (see the second sketch after this list).
C.      Cronbach's Coefficient Alpha. This reliability coefficient is closely related to the K-R procedures. However, it has the advantage of being applicable to multiply scored tests: tests whose items are not scored right or wrong or according to some other all-or-none system (Likert-scale items, for example). See the third sketch after this list.
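
A minimal sketch of the split-half method in Python, on hypothetical right/wrong (1/0) item data: the odd- and even-numbered items are totaled separately, the two half-scores are correlated, and the half-test correlation is then stepped up to full test length with the Spearman-Brown formula (not named above, but the usual companion to this method):

```python
import numpy as np

# Hypothetical responses: rows are 6 students, columns are 8 items (1 = correct).
X = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1, 0],
])

# Total each student's score on the odd- and even-numbered items.
odd_half = X[:, 0::2].sum(axis=1)
even_half = X[:, 1::2].sum(axis=1)

# Correlate the two half-tests, then estimate full-length reliability
# with the Spearman-Brown formula: r_full = 2r / (1 + r).
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)
print(f"Half-test r = {r_half:.3f}, full-test estimate = {r_full:.3f}")
```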
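
Second, a sketch of the Kuder-Richardson formulas on the same kind of 1/0 data; KR-21 is the simpler variant, which assumes all items are equally difficult:

```python
import numpy as np

# Hypothetical responses: rows are students, columns are items scored 1/0.
X = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1, 0],
])

k = X.shape[1]                  # number of items
p = X.mean(axis=0)              # proportion passing each item
q = 1 - p                       # proportion failing each item
totals = X.sum(axis=1)          # each student's total score
var_total = totals.var(ddof=1)  # variance of the total scores

# KR-20: (k / (k - 1)) * (1 - sum(p * q) / variance of totals)
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / var_total)

# KR-21 replaces the item-by-item term with one based only on the mean
# total score, under the assumption of equally difficult items.
m = totals.mean()
kr21 = (k / (k - 1)) * (1 - m * (k - m) / (k * var_total))
print(f"KR-20 = {kr20:.3f}, KR-21 = {kr21:.3f}")
```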
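
Third, a sketch of Cronbach's coefficient alpha, which extends the K-R idea to multiply scored items such as 1-to-5 Likert ratings (again, the data are hypothetical):

```python
import numpy as np

# Hypothetical Likert-type responses (1-5): rows are 6 respondents,
# columns are 4 items intended to measure the same concept.
X = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
])

k = X.shape[1]
item_vars = X.var(axis=0, ddof=1)      # variance of each item
var_total = X.sum(axis=1).var(ddof=1)  # variance of the total scores

# Alpha: (k / (k - 1)) * (1 - sum of item variances / total variance).
# On 1/0 items, this formula reduces to KR-20.
alpha = (k / (k - 1)) * (1 - item_vars.sum() / var_total)
print(f"Cronbach's alpha = {alpha:.3f}")
```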

The primary difference between test/retest and internal consistency estimates of reliability is that test/retest involves two administrations of the measurement instrument, whereas the internal consistency method involves only one administration of that instrument.

Validity, on the other hand, means that the measuring instrument actually measures the property it is supposed to measure. How valid a test is depends on its purpose: a ruler, for example, may be a valid device for measuring length, but it is not a valid device for measuring volume.

Categories of Measurement Validity

Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably?

Content validity is based on the extent to which a measurement reflects the specific intended domain of content. It includes a broad sample of what is being tested, emphasizes important material, and requires appropriate skills. Is the full content of a concept’s definition included in the measure?

Criterion validity is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure that has already been demonstrated to be valid. Is the measure consistent with what we already know and what we expect? It has two categories, predictive and concurrent (a short sketch of the predictive case follows the list below):

        Predictive: the measure predicts something it should theoretically predict, i.e., a known association between the construct being measured and a criterion observed later.
        Concurrent: the measure agrees with a pre-existing indicator, something that already measures the same concept at the same point in time.
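
As a concrete, hypothetical illustration of the predictive case: the criterion-validity coefficient is simply the correlation between the measure and the criterion it is supposed to predict (all numbers below are made up):

```python
import numpy as np

# Hypothetical data: admission-test scores and the first-year GPA the
# test is supposed to predict (the criterion, observed later).
test_scores = np.array([520, 610, 480, 700, 560, 650])
first_year_gpa = np.array([2.8, 3.2, 2.5, 3.8, 3.0, 3.4])

# Predictive validity: correlate the measure with its criterion.
validity = np.corrcoef(test_scores, first_year_gpa)[0, 1]
print(f"Predictive validity: r = {validity:.3f}")
```

For concurrent validity the computation is the same, except that the criterion is an existing validated measure administered at the same time as the new one.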

Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, if we are using an Alcohol Abuse Inventory, even if there is no way to measure "abuse" itself, we can predict that serious abuse correlates with health, family, and legal problems. Its two sub-categories are convergent and discriminant validity. Convergent validity is actual general agreement among ratings, gathered independently of one another, where the measures should be theoretically related. Discriminant validity is the lack of a relationship among measures which theoretically should not be related.

Sources:

Prado, Nenita I., et al. (2010). Methods of Research. Cagayan de Oro City.
http://writing.colostate.edu/guides/rresearch/relval/pop2b.cfm
socialresearchmethods.net/…/lcoiosi2.htm
http://www1.georgetown.edu/departments/psychology/resources
experiment-resources.com/validity-and-…

Reflection

I would like to base my reflection on this situation.

What will you recommend given this scenario?

Your school district is looking for an assessment instrument to measure reading ability. There are two possibilities at hand. Test A provides data indicating that it has high validity, but there is no information about its reliability. Test B provides data indicating that it has high reliability, but there is no information about its validity (http://fcit.usf.edu/assessment/basic/basicc.html).

I would recommend using Test A. Validity is more important than reliability: if an instrument is not valid, that is, if it does not actually measure what it is supposed to measure, there is no reason to use it even if it yields consistent results. "Reliability of an instrument does not warranty its validity." (Murali D.)

1 comment:

  1. Yes, Jude, there is no use discussing reliability if the test is not valid in the first place.