Monday, January 17, 2011

TEST INTERPRETATION

Aptitude and ability tests are used to make inferences about an individual's competencies, capabilities, and likely future performance on the job. But what do the scores mean, and how are they interpreted?

There are two distinct methods of interpreting scores: criterion-referenced interpretation and norm-referenced interpretation. In criterion-referenced tests, the test score indicates the amount of skill or knowledge that a test taker has in a particular subject area. It involves comparing a student's score with a subjective standard of performance rather than with the performance of a norm group. Norm-referenced interpretation involves comparing a student's score with the scores other students obtained on the same test. How much a student knows is determined by the student's standing or rank within the reference group.
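To make the contrast concrete, here is a minimal Python sketch. The maximum score, the 75% cutoff, the norm-group scores, and the student's score are all invented for illustration: a criterion-referenced interpretation checks the percent correct against a preset standard, while a norm-referenced interpretation reports where the same score falls within the reference group.

# Hypothetical data: one student's raw score interpreted two ways.

def criterion_referenced(score, max_score, cutoff_pct=75):
    """Compare percent correct with a preset (criterion) standard."""
    pct_correct = 100.0 * score / max_score
    return pct_correct, pct_correct >= cutoff_pct

def percentile_rank(score, norm_group):
    """Percent of the norm group scoring below the examinee,
    counting half of any tied scores (one common convention)."""
    below = sum(1 for s in norm_group if s < score)
    tied = sum(1 for s in norm_group if s == score)
    return 100.0 * (below + 0.5 * tied) / len(norm_group)

norm_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]  # invented norm-group scores
student_score = 27

pct, mastered = criterion_referenced(student_score, max_score=40)
print(f"Criterion-referenced: {pct:.1f}% correct, standard reached: {mastered}")
print(f"Norm-referenced: percentile rank {percentile_rank(student_score, norm_group):.0f}")

On this made-up data, the same raw score of 27 falls short of the 75% mastery standard yet sits at about the 65th percentile of the group, which is exactly the distinction the two interpretations draw.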

Sources:
...uiowa.edu/itp/...itbs_interp_score.aspx
psychometric-success.com/.../interpretin
...wikipedia.org/...Standardized_testing

Reflection

Students' skills and knowledge are usually determined through testing. Test scores are interpreted to judge a student's performance, to see whether improvement has been made, or to compare the student's performance with that of other students taking the same test. Testing in schools is used not just to assess students' performance but also to rank schools against other schools, regions, or even other nations. This is the reason teachers are encouraged to hold review classes and to shape their classroom activities around the upcoming test, in the hope that their students will perform better than students in other schools. Indeed, we are living in an age of much testing and assessment, with increasing demands for teacher and school accountability and ever more rigorous expectations for improved student test performance (Wikipedia, Standardized testing).

Students and teachers feel the pressure put upon them, and this creates tension. It is very important, then, that the interpretation of test scores be done by trained persons who have adequate information and fully understand the purpose of the test, or what the test is designed to measure. There should be a check and recheck that interpretations of test scores accurately portray the test taker's performance; otherwise, it would be unfair to the students who are trying their best to get high scores and to the teachers whose teaching performance is also measured by their students' results on the test.

    

Wednesday, January 12, 2011

Reliability and Validity

Reliability and validity are two important characteristics of any measurement procedure. Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials. A measure is considered reliable if a person's scores on the same test given twice are similar. Another way to think of reliability is to imagine a kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the same scale should register five pounds for the same potatoes an hour later (unless they have been peeled and cooked). Likewise, instruments such as classroom tests and national standardized exams should be reliable – it should not make any difference whether a student takes the assessment in the morning or afternoon, one day or the next.

There are three approaches by which reliability is usually estimated:

1. Stability – a measure is stable if one can secure consistent results with repeated measurement of the same person with the same instrument.

Method: Test/Retest

The idea behind test/retest is that the same group of subjects should get the same score on test 1 as they do on test 2, given days, weeks, or months later (preferably less than six months apart). The reliability coefficient is determined by correlating the two sets of scores.
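As a rough sketch (the scores below are invented for illustration), the test-retest reliability estimate is simply the Pearson correlation between the two administrations; the same computation is used when two parallel forms are correlated.

from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

first_admin  = [55, 62, 70, 48, 81, 66, 74, 59]  # invented scores, first testing
second_admin = [58, 60, 73, 50, 79, 68, 71, 61]  # same examinees, weeks later

print(f"Test-retest reliability estimate: r = {pearson_r(first_admin, second_admin):.2f}")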

2. Equivalence – reliability is associated with the degree to which alternative forms of the same measure produce the same or similar results. This approach considers how much error may be introduced by different investigators (in observations) or by different samples of items being studied (in questioning or scales). The difference between stability and equivalence is that stability is concerned with personal and situational fluctuations from one time to another, while equivalence is concerned with variations at one point in time among observers and samples of items. The major interest with equivalence is typically not how respondents differ from item to item but how well a given set of items will categorize an individual.

Method: Parallel forms

Parallel forms of a test may be administered to the same group of subjects simultaneously or with a delay, and the paired observations may be correlated.

3. Internal Consistency
This uses only one administration of a test or an instrument in order to assess consistency or homogeneity among the items. Thus it does not involve a time interval as do the test-retest and parallel forms methods.

Methods:

A. Split-Half Method. This method can be used when the measuring tool has many similar statements or questions to which the subject can respond. After the administration of the instrument, the results are separated by item into even and odd numbers or into randomly selected halves, and the two half-test scores are correlated.
B. Kuder-Richardson Methods, Formula 20 and 21. These methods measure the extent to which items within one form of the test have as much in common with one another as do the items in that one form with corresponding items in an equivalent form.
C. Cronbach's Coefficient Alpha. This reliability coefficient is closely related to the K-R procedures. However, it has the advantage of being applicable to multiple-scored tests, that is, tests that are not scored right or wrong or according to some other all-or-none system.
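A minimal Python sketch of the three methods above, computed from a single administration. The eight examinees and six right/wrong items are invented for illustration; the Spearman-Brown step-up applied after the split-half correlation is the standard adjustment for the halved test length, and population variances (divide by N) are used throughout so item and total-score variances stay consistent.

from math import sqrt

data = [  # rows = examinees, columns = items (1 = correct, 0 = wrong); invented
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
]
n, k = len(data), len(data[0])

def pvar(xs):  # population variance
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

totals = [sum(row) for row in data]
var_total = pvar(totals)

# A. Split-half: correlate odd-item and even-item half scores, then step the
#    half-test correlation up to full length with the Spearman-Brown formula.
odd_half  = [sum(row[0::2]) for row in data]
even_half = [sum(row[1::2]) for row in data]
r_half = pearson_r(odd_half, even_half)
split_half = 2 * r_half / (1 + r_half)

# B. Kuder-Richardson Formula 20, and the simpler Formula 21 that needs only
#    the mean, the total-score variance, and the number of items.
p = [sum(row[j] for row in data) / n for j in range(k)]   # item difficulties
kr20 = (k / (k - 1)) * (1 - sum(pj * (1 - pj) for pj in p) / var_total)
mean_total = sum(totals) / n
kr21 = (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))

# C. Cronbach's alpha: item variances against the variance of the total score.
item_vars = [pvar([row[j] for row in data]) for j in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / var_total)

print(f"Split-half (Spearman-Brown): {split_half:.2f}")
print(f"KR-20: {kr20:.2f}   KR-21: {kr21:.2f}   Cronbach's alpha: {alpha:.2f}")

For right/wrong items like these, coefficient alpha reduces to KR-20, so the two printed values coinciding is a handy check on the computation.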

The primary difference between test/retest and internal consistency estimates of reliability is that test/retest involves two administrations of the measurement instrument, whereas the internal consistency method involves only one administration of that instrument.

Validity, on the other hand, means that the measuring instrument actually measures the property it is supposed to measure. A test is valid when it measures what it is supposed to measure. How valid a test is depends on its purpose – for example, a ruler may be a valid device for measuring length, but it is not very valid for measuring volume.

Categories of Measurement Validity

Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably?

Content validity is based on the extent to which a measurement reflects the specific intended domain of content. It includes a broad sample of what is being tested, emphasizes important material, and requires appropriate skills. Is the full content of a concept’s definition included in the measure?

Criterion validity is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure that has already been demonstrated to be valid. Is the measure consistent with what we already know and what we expect? There are two categories: predictive and concurrent.

        Predictive: predicts a known association between the construct you're measuring and something else.
        Concurrent: associated with pre-existing indicators, something that already measures the same concept.
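As a small illustration (all scores invented), a new reading test might be checked concurrently against an established reading test given at the same time, and predictively against end-of-year grades collected later; both designs boil down to correlating the measure with its criterion.

from math import sqrt

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

new_test     = [42, 55, 61, 38, 70, 49, 66, 58]  # invented scores on the new measure
established  = [40, 57, 60, 41, 72, 47, 63, 60]  # criterion given at the same time
later_grades = [75, 82, 85, 74, 90, 78, 88, 83]  # criterion collected months later

print(f"Concurrent validity estimate: r = {pearson_r(new_test, established):.2f}")
print(f"Predictive validity estimate: r = {pearson_r(new_test, later_grades):.2f}")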

Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, if we're using an Alcohol Abuse Inventory, even if there's no way to measure "abuse" itself, we can predict that serious abuse correlates with health, family, and legal problems. Two sub-categories are convergent validity, the actual general agreement among ratings gathered independently of one another, where the measures should be theoretically related; and discriminant validity, the lack of a relationship among measures which theoretically should not be related.

Sources:

Prado, Nenita I., et al. (2010). Methods of Research. Cagayan de Oro City.
http://writing.colostate.edu/guides/rresearch/relval/pop2b.cfm
socialresearchmethods.net/…/lcoiosi2.htm
http://www1.georgetown.edu/departments/psychology/resources 
experiment-resources.com/validity-and-…

Reflection

I would like to base my reflection on this situation.

What will you recommend given this scenario?

Your school district is looking for an assessment instrument to measure reading ability. There were two possibilities at hand. Test A provides data indicating that it has high validity, but there is no information about its reliability. Test B provides data indicating that it has high reliability, but there is no information about its validity (http://fcit.usf.edu/assessment/basic/basicc.html).

I would recommend using Test A. Validity is more important than reliability because if an instrument is not accurate, that is, if it does not actually measure what it is supposed to measure, there is no reason to use it even if it yields consistent results. "Reliability of an instrument does not warranty its validity." (Murali D.)

Sunday, January 2, 2011

CRITERION-REFERENCED TESTS AND NORM-REFERENCED TESTS

Many educators and members of the public fail to grasp the distinctions between criterion-referenced and norm-referenced testing. It is common to hear the two types of testing referred to as if they serve the same purposes or share the same characteristics. Much confusion can be eliminated if the basic differences are understood.

          There are two chief groups into which tests are categorized: criterion-referenced testing and norm-referenced testing.
          Criterion-referenced tests are tests that seek to determine whether an individual has mastered the knowledge or skills taught in a section of a course, in order to see if instruction was successful and to take remedial action. This type of test also serves to determine whether someone can be certified to begin work in a given profession. Introduced by Glaser (1962) and Popham and Husek (1969), these are also known as domain-referenced tests, competency tests, basic skills tests (http:www.education.com/definition/basic-skills/?-module=), mastery tests, performance tests or assessments, authentic assessments, standards-based tests, credentialing exams to determine persons qualified to receive a license or certificate, and more. What all these tests have in common is that they attempt to determine a candidate's level of performance in relation to a well-defined domain of content. Classroom teachers use them to monitor student performance in their day-to-day activities. These tests are useful for evaluating student performance and generating educational accountability information at the classroom, school, district, and state levels. The tests are based on the curricula, and the results provide a basis for determining how much students are learning and how well the educational system is producing the desired results. Criterion-referenced tests are also used in training programs to assess learning; typically, pretest-posttest designs with parallel forms of criterion-referenced tests are used.
          In contrast, norm-referenced tests seek to compare respondents with some other group. The interpretation of such tests consists of comparing an individual's score either with the scores of the other respondents in the same administration or with those of all others who have ever taken the test. These tests determine a candidate's level of the construct measured by the test in relation to a well-defined reference group of candidates, referred to as the norm group.
          The following is adapted from Popham, J.W. (1975). Educational Evaluation. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Dimension: Purpose
Criterion-Referenced Tests: To determine whether each student has achieved specific skills or concepts; to find out how much students know before instruction begins and after it has finished.
Norm-Referenced Tests: To rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Dimension: Content
Criterion-Referenced Tests: Measures specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
Norm-Referenced Tests: Measures broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Dimension: Item Characteristics
Criterion-Referenced Tests: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
Norm-Referenced Tests: Each skill is usually tested by fewer than four items. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Dimension: Score Interpretation
Criterion-Referenced Tests: Each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage. Student achievement is reported for individual skills.
Norm-Referenced Tests: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade-equivalent score, or a stanine. Student achievement is reported for broad skill areas, although some norm-referenced tests do report student achievement for individual skills.

References:
http://www.brighthub.com/education/special/articles/72677.aspx


Reflection

          If I were to choose between criterion-referenced testing and norm-referenced testing, I would prefer the former. Why? Because learners have different needs and levels of understanding, and they come from different backgrounds. So why compare students with other students? Instead, compare a student with his own previous performance to see whether improvement has been made; that way he will strive to perform better, and even his best. Further, this will prevent the student from being discouraged.
          This can be likened to a parent who says, "Your cousin always gets better grades than you..." It's unfair to be compared to someone else, isn't it? Each individual is different. Well, just a thought…