It is said that “assessment is the tail that wags the curriculum dog” (Hargreaves, 1989). This statement amply underscores the importance of assessment in any system of education. However, it also cautions us about the pitfalls that can occur when assessment is used improperly. When assessment is poorly conducted, students may view it as pointless, just another hurdle to jump over (McDowell & Mowl, 1995; Ramsden, 1997). Students focus on learning what is asked in the examination. As teachers, we can exploit this potential of assessment to give direction to student learning. Simply stated, if we ask questions based on factual recall, students will try to memorize facts; but if we frame questions requiring application of knowledge, they will learn to apply their knowledge.
In this chapter, we will introduce the basic concepts related to assessment of students and how we can maximize the effect of assessment in giving a desired shape to learning.
Terminology Used in Assessment
Let us first clarify the terminology. You may have read terms such as measurement, assessment, evaluation, etc. and seen them being used interchangeably. There are subtle differences between these terms. They need to be used in the right context with the right purpose, so that they convey the same meaning to everyone. Interestingly, these terms also tell the story of how educational testing has evolved over the years.
Measurement was the earliest technique used in educational testing. It meant assigning numbers to the competence exhibited by the students. For example, marking a multiple-choice question paper is a form of measurement. Since measurement is a physical term, it was presumed that it should be as precise and as accurate as possible. As a corollary, it also implied that anything which could not be measured (objectively!) should not form part of the assessment package. The entire emphasis was placed on objectivity and providing standard conditions so that the results represented only student learning (also called true score) and nothing else.
While such an approach may have been appropriate to measure physical properties (such as weight, length, temperature, etc.), it certainly did not capture the essence of educational attainment. There are several qualities which we want our students to develop, but which are not amenable to precise measurement. Can you think of some of these? You may have rightly thought of communication, ethics, professionalism, etc., which are as important as other skills and competencies, but which cannot be precisely measured.
Assessment has come to represent a much broader concept. It includes some attributes which can be measured precisely and others which cannot (Linn & Miller, 2005). Some aspects, such as scores on theory tests, are measured objectively, while others, such as clinical decision-making, are interpreted subjectively; combining these, a judgment is formed about the level of student achievement. Thus, viewing assessment as a combination of measurement and non-measurement gives a better perspective from the teachers’ point of view. Several experts favor this approach, defining assessment as “any formal or purported action to obtain information about the competence and performance of a student” (Schuwirth & van der Vleuten, 2019).
Evaluation is another term which is used almost synonymously with assessment. However, there are subtle differences. Though both terms involve passing a value judgment on learning, traditionally the term ‘assessment’ is used in the context of student learning. Evaluation, on the other hand, is used in the context of educational programs. So, you will assess the performance of students in a particular test, while you will evaluate the extent to which a particular course is equipping the students with the desired knowledge and skills. Assessment of students is a very important input (though not the only one) to judge the value of an educational program.
Let us also clarify some more terms that are often loosely used in the context of student assessment. “Test” and “tool” are two such terms. Conventionally, a “test” refers to a written instrument which is used to assess learning. A test can be paper-and-pencil-based or computer-based. On the other hand, a “tool” refers to an instrument used to observe skills or behavior to assess the extent of learning. The Objective Structured Clinical Examination (OSCE) and the mini-Clinical Evaluation Exercise (mini-CEX) are examples of assessment tools.
Why do we need to assess students?
The conventional answer given to this question is: so that we can categorize them as “pass” or “fail”. But more than making this decision, several other advantages accrue from assessment. Rank ordering the students (e.g., for selection), measuring improvement over a period of time, providing feedback to students and teachers about areas which have been learnt well and others which require more attention, and maintaining the quality of educational programs are some of the other important reasons for assessment (Table 1.1).
Assessment in medical education is especially important because we are certifying students as fit to deal with human lives. The actions of doctors have the potential to make a difference between life and death. This makes it even more important to use the most appropriate tools to assess their learning. You will also appreciate that medical students are required to learn a number of practical skills, many of which can be lifesaving. Assessment is also a means to ensure that all students learn these skills.
Types of Assessment
Assessment can be classified in many ways depending on the primary purpose for which it is being conducted. Some of the ways of classifying assessment are as follows:
- Formative and summative assessment
- Criterion- and norm-referenced testing.
1. Formative assessment and summative assessment
As discussed in the preceding paragraphs, assessment can be used not only for certification, but also to provide feedback to teachers and students. Based on this perspective, assessment can be classified as formative or summative.
Formative assessment is assessment conducted with the primary purpose of providing feedback to students and teachers. Since the purpose is diagnostic (and remedial), it should be able to reveal strengths and weaknesses in student learning. If students disguise their weaknesses and try to bluff the teacher, the purpose of formative assessment is lost. This feature has important implications for designing assessment for formative purposes. To be useful, formative assessment should happen as often as possible; in fact, experts suggest that it should be almost continuous. Remember, when we give formative feedback, we do not give students a single score but a complete profile of their strengths and weaknesses in different areas. Since the purpose is to help the student learn better, formative assessment is also called assessment for learning.
Formative assessment should not be used for final certification. This implies that certain assessment opportunities must be designated as formative only, so that teachers have an opportunity to identify the deficiencies of the students and undertake remedial action. A corollary is that not all assignments need to be graded, or that not all grades need to be counted towards the final scores. From this perspective, all assessments are de facto summative; they become formative only when they are used to provide feedback to the students to make learning better. Formative assessment has been discussed in more detail in Chapter 16.
Summative assessment, on the other hand, implies testing at the end of the unit, semester or course. Please note that summative does not refer only to the end-of-the-year University examinations. Assessment becomes summative when the results are going to be used to make educational decisions. Summative assessment is also called assessment of learning.
Summative assessment intends to test whether the students have attained the objectives laid down for a specified unit of activity. It is also used for certification and registration purposes (e.g., giving a license to practice medicine). Did you notice that we said “attainment of listed objectives”? This implies that students must be informed well in advance, right at the beginning of the course, about what is expected of them when they complete the course, so that they can plan the path of their learning accordingly. Most institutions ignore this part, leaving it to students to make their own interpretations based on inputs from various sources (mainly senior students). No wonder, then, that we often end up frustrated with the way the students learn.
The contemporary trend is towards blurring the boundary between formative and summative assessment. Purely formative assessment without any consequences will not be taken seriously by anybody. On the other hand, purely summative assessment has no learning value or opportunity for improvement. There is no reason why the same assessment cannot be used to provide feedback, as well as to calculate final scores. We will discuss this aspect in more detail in Chapter 18.
We strongly believe that every teacher can play a significant role in improving student learning by the judicious use of assessment for learning. Every teacher may not be involved in setting high-stakes question papers, but every teacher is involved in developing assessment locally to provide feedback to students. Throughout this book, you will find a tilt toward the formative function of assessment.
Sometimes, assessment itself can be used as a learning task, in which case, it is called assessment as learning.
2. Criterion-referenced and norm-referenced testing
Yet another purpose of assessment that we listed above was to rank order the students (e.g., for selection purposes). From this perspective, it is possible to classify assessment as criterion-referenced testing (CRT) and norm-referenced testing (NRT).
Criterion-referenced testing involves comparing the performance of the students against pre-determined criteria. This is particularly useful for term-end examinations or before awarding degrees to doctors, where we want to ensure that students have attained the minimum desired competencies for that course or unit of the course. Competency-based curricula largely require criterion-referenced testing.
The results of CRT can only be a pass or a fail. Let us take an example. If the objective is that the student should be able to perform cardiopulmonary resuscitation, then he must perform all the essential steps to be declared a pass. The student cannot pass if he performs only 60% of the steps! CRT requires the establishment of an absolute standard before the examination begins.
Norm-referenced testing, on the other hand, implies rank ordering the student. Here each student's results set the standard for those of others. NRT only tells us how the students did in relation to each other—it does not tell us “what” they did. There is no fixed absolute standard, and ranking can happen only after the examination has been conducted.
Again, there can be variations to this, and one of the commonly employed means is a two-stage approach, i.e., first use CRT to decide who should pass and then use NRT to rank order them. Traditionally, in India, we have been following this mixed approach. However, we do not seem to have established defensible standards of performance so far and often arbitrarily take 50% as the cut-off for pass/fail. This affects the validity of assessment. It is important to have defensible standards. Standard setting has been discussed in Chapter 23.
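To make the sequence of the two-stage approach concrete, here is a minimal sketch in Python; the pass mark and candidate scores are hypothetical and not drawn from this chapter. Candidates are first compared against an absolute, pre-determined standard (the CRT step), and only those who clear it are then rank ordered (the NRT step).

```python
# Illustrative sketch of the two-stage (CRT then NRT) approach.
# The pass mark and the scores below are hypothetical examples.

PASS_MARK = 50  # absolute standard, fixed before the examination (CRT)

scores = {"Asha": 72, "Bala": 48, "Chitra": 65, "Dev": 50, "Esha": 81}

# Stage 1 (criterion-referenced): compare each candidate with the standard.
passed = {name: score for name, score in scores.items() if score >= PASS_MARK}

# Stage 2 (norm-referenced): rank order only those who met the standard.
ranked = sorted(passed.items(), key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name} ({score})")
```

The point of the sketch is only the order of operations: the standard exists before the examination, whereas the ranking can be produced only after it.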
Attributes of Good Assessment
We have argued for the importance of assessment as an aid to learning. This is related to many factors. The provision of feedback (e.g., during formative assessment) improves learning (Burch et al., 2006; Rushton, 2005). Similarly, the test-driven nature of learning again speaks for the importance of assessment (Dochy & McDowell, 1997). What we would like to emphasize here is that the reverse is also true, i.e., when improperly used, assessment can distort learning. We are all aware of the adverse consequences for the learning of interns that occur when selection into postgraduate courses is based only on the results of one MCQ-based test.
There are several attributes that good assessment should possess. Rather than going into the plethora of attributes available in the literature, we will restrict ourselves to the five most important attributes of good assessment as listed by van der Vleuten and Schuwirth (2005). These include:
- Validity
- Reliability
- Feasibility
- Acceptability, and
- Educational impact.
Validity
Validity is the most important attribute of good assessment. Traditionally, it has been defined as “measuring what is intended to be measured” (Streiner & Norman, 1995). While this definition is correct, it requires considerable elaboration (Downing, 2003). Let us try to understand validity better.
The traditional view was that validity is of various types: content validity, criterion validity (further divided into predictive validity and concurrent validity), and construct validity (Crossley, Humphris & Jolly, 2002) (Fig. 1.1). This concept had the drawback of seeing assessment as being valid in one situation but not in another. With this approach, a test could cover all areas and have good content validity, but might not be valid when it comes to predicting future performance. Let us draw a parallel between validity and honesty as an attribute.
Just as it is not possible for a person to be honest in one situation and dishonest in another (then he would not be called honest!), the same is true of validity.
Validity is now seen as a unitary concept, which must be inferred from various sources of evidence (Fig. 1.2). Let us come back to the “honesty” example. When would you say that someone is honest? One would have to look at a person's behavior at work, at home, in a situation when he finds something expensive lying by the roadside, or how he pays his taxes, and only then make an inference about his honesty. Validity is a matter of inference, based on available evidence.
Validity refers to the interpretations that we make out of assessment data. Implied within this is the fact that validity does not refer to the tool or results—rather, it refers to the interpretations we make from the results obtained by use of that tool. From this viewpoint, it is pertinent to remember that no test or tool is inherently valid or invalid.
Let us explain this further. Suppose we use a 200-question MCQ test to select the best students for postgraduate medical courses. We could interpret that the highest scorers on the test have the best content knowledge. To do this, we would have to gather evidence to check whether all relevant portions of the syllabus had received adequate representation in the paper. We could also state that the students with the best aptitude have been selected for the course. For this, we would need to present evidence that the test was designed to assess aptitude as well. As you can see, validity is contextual. So here, it is not the tool (the MCQ) which is valid or invalid; what matters is the interpretations we draw from the results of our assessment.
Inferring validity requires empirical evidence. What are the different kinds of evidence that we can gather to determine if the interpretations we are making are appropriate and meaningful?
As Figure 1.2 shows, we need to gather evidence from different sources to support or refute the interpretations that we make from our assessment results. Depending on the situation, we might look for one or two types of evidence to interpret the validity of an assessment. But ideally, we would need to look for evidence in the following four categories (Waugh & Gronlund, 2012):
- Content-related evidence: Does the test adequately represent the entire domain of tasks that is to be assessed?
- Criterion-related evidence: Does the test predict future performance? Do the test results correspond to the results of another, simultaneously conducted test (this is explained in more detail below)?
- Construct-related evidence: Does this test measure the psychological or educational characteristics that we intended to measure?
- Consequence-related evidence: Did the test have a good impact on learning and avoid negative effects?
To do this, one has to be fully aware of: why we are performing a particular assessment; the exact nature of the construct being assessed; what we are expecting to obtain by conducting this exercise; what the assessment results are going to be used for; the exact criteria which are going to be used to make decisions on assessment results; and the consequences of using this assessment. Let us understand this in more detail.
Evidence Related to Content Validity
Generally, we look for content-related evidence to see if the test represents the entire domain of the content, competencies, and objectives set for a course. If an undergraduate student is expected to perform certain basic skills (e.g., giving an intramuscular injection or draining an abscess) and if these skills are not assessed in the examination, then content-related validity evidence is lacking. Similarly, if the number of questions is not proportional to the content [e.g., if 50% weightage is given to questions from the central nervous system (CNS) at the cost of anemia, which is a much more common problem], the assessment results might not be meaningful.
Thus, sampling is a key issue in assessment. Always ask yourself if the test is representative of the whole domain that you are assessing. For this, look at the learning outcomes, prepare a plan (blueprint) and prepare items which correspond to these specifications. More on this will be dealt with in Chapter 6.
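As a rough illustration of proportional sampling, the sketch below distributes a fixed number of items across a blueprint in proportion to the weight assigned to each content area; the areas and weightages here are invented for the example and do not come from this chapter.

```python
# Illustrative only: the content areas and weightages are hypothetical.
total_items = 100
blueprint_weights = {
    "Anemia": 0.30,
    "Infectious diseases": 0.25,
    "Cardiovascular system": 0.20,
    "Central nervous system": 0.15,
    "Endocrine disorders": 0.10,
}

# Allocate items in proportion to the weightage of each content area.
allocation = {area: round(total_items * weight)
              for area, weight in blueprint_weights.items()}

for area, n_items in allocation.items():
    print(f"{area}: {n_items} items")
```

However the numbers are chosen, the check remains the same: does the distribution of items mirror the relative importance of each content area in the course?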
Evidence Related to Criterion Validity
The reason we look for evidence related to criterion validity is to see whether the test scores correspond to the criterion which we seek to predict or estimate. Let us take an example. Suppose we conduct an entrance examination to select the best undergraduate students for a postgraduate surgical course. Here, the purpose of the test is to predict future performance. To infer that this test was appropriate, we would perhaps need to gather data about the students’ performance after they qualify as surgeons and see if the test results correspond to their performance. Here we are using future performance as the criterion. This is an example of how evidence about predictive validity can be gathered.
Now suppose we are introducing a new assessment method A and want to see how it works in comparison to an existing method B used for the same purpose in the same setting. To do this, we can compare the results obtained from the two methods in the same setting. Here we are concurrently judging the results of two methods to see if they are comparable. This is the concept of concurrent validity.
Evidence Related to Construct Validity
Validity also requires construct-related evidence. What do we understand by ‘construct’? The dictionary meaning of construct is “a complex idea resulting from a synthesis of simpler ideas”. A construct has also been defined in educational or psychological terms as “an intangible collection of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory” (Downing, 2003). Thus, a construct is a collection of inter-related components, which when grouped together as a whole gives a new meaning.
If we were to consider the construct ‘beauty’, we might use attributes such as physique, complexion, poise, and confidence to decide whether someone is beautiful. Similarly, in educational settings, subject knowledge, its application, data gathering, interpretation of data, and many other things go into deciding the construct ‘clinical competence’. In medicine, educational attainment, intelligence, aptitude, problem-solving, professionalism, and ethics are some other examples of constructs.
All assessment in education aims at assessing a construct. It is the theoretical framework which specifies the hypothetical qualities that we seek to measure. For instance, we are not interested in knowing whether students can enumerate five causes of hepatomegaly; we are interested in knowing whether they can take a relevant history based on those causes. In this context, construct-related evidence becomes the most important way to infer validity. Simply stated, the results of assessment will be more valid if they tell us about the problem-solving ability of a student, rather than about his ability to list five causes of each of the symptoms shown by the patient. As a corollary, it can also be said that if the construct is not fully represented (e.g., testing only presentation skills, but not physical examination skills, during a case presentation), validity is threatened. Messick (1989) calls this construct under-representation (CU).
While content and construct seem to be directly related to the course, the way a test is conducted can also influence its validity. A question may be included in an examination to test understanding of certain concepts, but if the papers are marked based on a student's handwriting, validity is threatened. If a test is conducted in a hot, humid, and noisy room, its validity becomes low, because then one is also implicitly testing candidates’ ability to concentrate in the presence of distractions rather than their educational attainment. Notice here that the construct we were assessing has changed. Similarly, if an MCQ is framed in complicated language and students must spend more time understanding the language than engaging with its content, validity is threatened; besides content, one is now testing vocabulary and reading comprehension. Leaked question papers, an incorrect key, equipment failure, etc. can also have a bearing on validity. Messick (1989) calls this construct-irrelevant variance (CIV).
Let us try to explain this concept in a different way. Let us say you conduct an essay-type test and try to assess knowledge, skills, and professionalism from the same paper. We would expect a low correlation between the scores on the three domains. On the other hand, if we conduct three different tests, say essays, MCQs, and an oral examination, all to assess knowledge, we would expect a high correlation between the scores. If we were to get just the opposite results, i.e., high correlation in the first setting and low in the second, construct-irrelevant variance would be said to exist. You can think of many common examples from your own settings which induce CIV in our assessment. Questions that are too difficult or too convoluted, complex language which is not understood by students, words which confuse the students, and “teaching to the test” are some of the factors which will induce CIV. Designing OSCE stations which test only analytical skills will result in invalid interpretation of the practical skills of a student by inducing CIV.
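The correlation logic described above can be made concrete with a small sketch; the score arrays below are invented purely for illustration. If essays, MCQs, and an oral examination all target the same knowledge construct, their scores should correlate highly, whereas scores for different constructs taken from a single essay paper should correlate weakly; the reversed pattern would point to construct-irrelevant variance.

```python
import numpy as np

# Hypothetical scores for six students (illustrative data only).
essay_knowledge = np.array([62, 70, 55, 80, 66, 74])
mcq_knowledge   = np.array([60, 72, 58, 78, 64, 76])
oral_knowledge  = np.array([65, 69, 52, 82, 63, 71])

# The same essay paper, but marked for a different construct (professionalism).
essay_professionalism = np.array([70, 55, 68, 60, 75, 58])

def r(x, y):
    """Pearson correlation between two score arrays."""
    return np.corrcoef(x, y)[0, 1]

# Expected pattern: high correlations across methods for the same construct...
print("essay vs MCQ (knowledge): ", round(r(essay_knowledge, mcq_knowledge), 2))
print("essay vs oral (knowledge):", round(r(essay_knowledge, oral_knowledge), 2))

# ...and a lower correlation between different constructs from one method.
print("knowledge vs professionalism (same essay):",
      round(r(essay_knowledge, essay_professionalism), 2))
```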
The contemporary concept of validity is that all validity is construct validity (Downing, 2003). It is the most important of all the evidence that we gather to determine validity.
Evidence Related to Consequential Validity
When we design an assessment, it is always pertinent to ask about the consequences of using that format. Did it motivate students to learn differently? Did it lower their motivation to study, or did it encourage poor study habits? Did it lead them to choose surface learning over deep learning? Did it make them think about application of the knowledge or did they resort to mere memorization of facts? Evidence about these effects needs to be collected.
These main concepts on validity have been summarized in Box 1.1.
Validity should be built in right from the stage of planning and preparation. Assessment should match the contents of the course and give proportional weightage to each of the contents. Blueprinting, with good sampling of content, is a very helpful tool to ensure content representation (see Chapter 6). Also implied is the need to let students know, right at the beginning, what is expected of them at the end of the course. Use questions which are neither too difficult nor too easy and which are worded appropriately for the level of the students. Validity also involves proper administration and scoring. Maintaining transparency, fairness, and confidentiality of the examinations are some methods of building validity. Similarly, the directions, scoring system, and test format all have a bearing on validity (Table 1.2).
We have long followed the dictum of “assessment drives learning,” which often results in extraneous factors distorting learning. A better way would be to let “learning drive assessment” so that validity is built into assessments. This concept of Programmatic Assessment has been discussed in detail in Chapter 18.
Similarly, assessment tools should aim to test broad constructs rather than individual competencies like knowledge or skills. It is often better to use multiple tools to get different pieces of information on which a judgment of student attainment can be made. It is also important to select tools which can test more than one competency at a time. There is no point in having one OSCE station to test history taking, another for skills, and yet another for professionalism. Each station should be able to test more than one competency. This not only provides an opportunity for wider sampling by having more competencies tested at each station but also builds validity.
Reliability
Let us now move to the second important attribute of assessment: reliability. Commonly, reliability refers to the reproducibility of scores. Again, as in the case of validity, this definition needs considerable elaboration (Downing, 2004).
A commonly used definition of reliability is obtaining the same results under similar conditions. This might be true of a biochemical test; however, it is not completely true of an educational test. Let us say that, during the final professional MBBS examination, we allot a long case to a student in a very conducive setting, where there is no noise or urgency and the patient is very cooperative. But we know that in actual practice this seldom happens. Similarly, no two patients with the same diagnosis will have a similar presentation. In the past, educationists have tried to make examinations more and more controlled and standardized (e.g., OSCE and standardized patients), so that the results represent only student attainment and nothing else. We argue that it might be better to work in reverse, i.e., conduct examinations in settings as close to actual ones as possible so that reproducibility can be ensured. This is the concept of workplace-based and authentic assessment.
We often tend to confuse the terms ‘objectivity’ and ‘reliability’. Objectivity refers to reproducibility of scores, so that anyone marking the test would mark it the same way. There are certain problems in equating reliability with objectivity in this way. For example, if the key to an item is wrongly marked in a test, everyone would mark the test the same way and generate identical scores. But are we happy with this situation? No, because it leads to faulty interpretation of the scores. Let us add another example. Suppose, at the end of the final professional MBBS, we were to give the students a test paper containing only 10 MCQs. The results will be very objective, but they will not be a reliable measure of the students’ knowledge. There is no doubt that objectivity is a useful attribute of any tool, but it is more important to have items (or questions) which are fairly representative of the universe of possible items in a subject area, and at the same time a sufficient number of items so that the results are generalizable. In other words, in addition to objectivity we also need an appropriate and adequate sample to get reliable results. This example also shows how reliability evidence contributes to validity.
We would also like to argue that objectivity is not the sine qua non of reliability. A subjective assessment can be very reliable if it is based on adequate content expertise. We all make subjective predictions about the potential of our students, and we rarely go wrong! The point that we are trying to make is that in educational testing there is always a degree of prediction involved. Will the student whom we have certified as being able to handle a case of mitral stenosis in the medical college be able to do so in practice? To us, reliability is therefore the degree of confidence that we can place in our results (try reading reliability as rely-ability).
A common reason for low reliability is the content specificity of the case. Many examiners prefer to have a neurological case in the final examination in medicine. It is presumed that a student who can satisfactorily present this case can also present a patient with anemia or malnutrition. This could not be farther from the truth. Herein lies the importance of including a variety of cases in the examination to make it representative of what the student is going to see in real life. You will recall from what we said earlier that representative and adequate sampling is also important to build validity.
Viewing the reliability of educational assessment differently from that of other tests has important implications. Let us suppose that we give a test of clinical skills to a final-year student. If we look at reliability merely as reproducibility, in other words getting the same results if the same case is given again to the student under the same conditions, then we will try to focus on precision of scores. However, if we conceptualize reliability as confidence in our interpretation, then we will want to examine the student under different conditions (outpatients, inpatients, emergency, community settings, etc., and by different examiners) and on different patients so that we can generalize our results. We might even like to add feedback from peers, patients, and other teachers to infer the competence of the student.
We often go by the idea that examiner variability can induce a lot of unreliability in the results. To some extent this may be true. While examiner training is one solution, it is equally useful to have multiple examiners. We have already discussed the need to include a variety of content in assessment. It may not be possible to use many assessment formats on one occasion, but this can happen when we carry out assessment on multiple occasions. The general agreement in educational assessment is that a single assessment, however perfect, is flawed for making educational decisions. Therefore, it is important to collect information on several occasions using a variety of tools. The key dictum to build reliability (and thereby validity) for any assessment is to have multiple tests on multiple content areas by multiple examiners using multiple tools in multiple settings. The concept of Programmatic Assessment discussed in Chapter 18 largely follows this approach.
Validity and reliability of a test are very intricately related. To be valid, a test should be reliable. Reliability evidence contributes to validity. A judge cannot form a valid inference if the witness who is being examined is unreliable. Thus, reliability is a precondition for validity. But let us caution you that it is not the only condition. Please also be aware that generally there is a trade-off between validity and reliability: the stronger the bases for validity, the weaker the bases for reliability (and vice-versa) (Fendler, 2016).
An application-oriented perspective on the validity and reliability of assessments has been discussed in Chapter 26.
Feasibility
The third important attribute of assessment is feasibility. We may like to assess every student by asking them to perform cardiopulmonary resuscitation on an actual patient, but it may not be logistically possible. The same is true of many other skills and competencies. In such situations, one needs to think of alternatives such as simulation, or tie up with other professional organizations for such assessments.
Acceptability
The next attribute of assessment is acceptability. Several assessment tools are available to us and sometimes we can have a variety of methods to fulfill the same objective.
Portfolios, for example, can provide as much information as rating scales. MCQs can provide as much information about knowledge as oral examinations. However, acceptability by students, raters, institutions, and society at large can play a significant role in the acceptance or rejection of a tool. MCQs, despite all their problems, are accepted as a tool for selecting students for postgraduate courses, while methods like portfolios, which provide more valid and reliable results, may not be. This is not to suggest that we should sacrifice good tools based on likes or dislikes, but to suggest that all stakeholders need to be involved in the decision-making process about the use of assessment tools.
Linked to the concept of acceptability is also the issue of feasibility. While we may have developed very good tools for assessing communication skills of our students, resource crunch may not allow us to use this tool on a large scale.
Educational Impact
The educational impact of assessment is a very significant issue. The impact of assessment can be seen in terms of student learning, consequences for the students, and consequences for society. We have already referred to the impact of MCQ-based selection tests on student learning. For students, a wrong assessment decision can act as a double-edged sword. A student who has been wrongly failed has to face the consequences in terms of time and money. On the other hand, if a student is wrongly passed when he does not deserve to pass, society must deal with the consequences of having an incompetent physician.
Assessments do not happen in a vacuum. They happen within the context of certain objectives. For each assessment, there is an expected use: it could be making pass/fail judgments, selecting students for an award, or simply providing feedback to teachers. Asking three questions brings a lot of clarity to the process and helps in selecting appropriate tools:
- Who is going to use this data?
- At what time? and
- For what purpose?
Utility of Assessment
Before we end this chapter, let us introduce you to the concept of the utility of assessment. Van der Vleuten (1996) has suggested a conceptual model for the utility of any assessment.
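One way the model is often summarized in the wider literature is sketched below; this is a notional reconstruction rather than a quotation from this chapter. The cost element corresponds to what this chapter calls feasibility, and the weights indicate that the importance of each element depends on the context and purpose of the assessment.

```latex
% Notional utility model attributed to van der Vleuten (1996); the weights w
% indicate that the relative importance of each element varies with context.
\[
\text{Utility} = w_{R}\,\text{Reliability} \times w_{V}\,\text{Validity} \times
                 w_{E}\,\text{Educational impact} \times w_{A}\,\text{Acceptability} \times
                 w_{C}\,\text{Cost (feasibility)}
\]
```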
This is not a mathematical formula but a notional one. The concept is especially important because it shows us how deficiencies in assessment tools can be compensated for by their strengths. Results of some tools may be low on reliability but can still be useful if they are high on educational impact. For example, results of MCQs have high reliability but little educational value. Results of the mini-clinical evaluation exercise (mini-CEX), on the other hand, may be lower on reliability but have a higher educational value due to the feedback component. Still, both are equally useful to assess students. Similarly, if a certain assessment has a negative value for any of the parameters (e.g., if it promotes unsound learning habits), then its utility may be zero or even negative.
The above five criteria contributing to the utility of assessment were accepted by consensus in 2010 as the criteria for good assessment, along with two additional criteria (Norcini et al., 2011). While we have retained the earlier nomenclature of “criteria,” there have been some modifications to it. Later, at the 2018 Ottawa consensus meeting, the nomenclature was changed from “criteria” to “framework” for good assessment, emphasizing the essential structure that these elements provide (Norcini et al., 2018). The alternative nomenclature of these seven elements was provided as: (1) Validity or coherence; (2) Reproducibility, reliability, or consistency; (3) Equivalence (the same assessment yields equivalent scores or decisions when administered across different institutions or cycles of testing); (4) Feasibility; (5) Educational effect; (6) Catalytic effect (the assessment provides results and feedback in a fashion that motivates all stakeholders to create, enhance, and support education; it drives future learning forward and improves overall program quality); and (7) Acceptability (Norcini et al., 2018). The same paper summarizes the relationship between these elements of the framework and the purpose of assessment (formative or summative) rather well. Validity is essential for both formative and summative purposes. While reliability and equivalence are more important for summative assessments, the educational and catalytic effects are key to formative use. Feasibility and acceptability considerations are a must for both. Whatever nomenclature we may adopt, assessment can never be viewed in terms of a single criterion, framework, or attribute.
Easing Assessment Stress
Assessments induce a lot of stress and anxiety among students (and teachers). Assessment should be like a moving ramp rather than a staircase with a block at each stage. Many approaches can be used to reduce examination stress. A COLE framework has been proposed (Siddiqui, 2017) to smooth out assessment problems. This stands for Communication to the stakeholders about the need for and purpose of a tool; Orientation to ensure that the tool is used as intended, by teachers and students alike; Learning orientation in the tool so that all assessments contribute to better learning; and Evaluation of the tool itself to see if it is serving the intended purpose.
The other approach is to reduce the stakes of individual assessments and take a collective decision based on multiple low-stakes assessments spread throughout the course. This is the basis of programmatic assessment and will be discussed in Chapter 18.
As we go through the subsequent chapters, we will discuss assessment methods and assessment design in greater detail.
REFERENCES
- Burch, V.C., Saggie, J.C., & Gary, N. (2006). Formative assessment promotes learning in undergraduate clinical clerkships. South African Medical Journal, 96, 430–33.
- Crossley, J., Humphris, G., & Jolly, B. (2002). Assessing health professionals. Medical Education, 36(9), 800–4.
- Dochy, F.J.R.C., & McDowell, L. (1997). Assessment as a tool for learning. Studies in Educational Evaluation, 23, 279–98.
- Downing, S.M. (2003). Validity: on the meaningful interpretation of assessment data. Medical Education, 37, 830–7.
- Downing, S.M. (2004). Reliability: on the reproducibility of assessment data. Medical Education, 38 (9), 1006–12.
- Downing, S.M., Park, Y.S., & Yudkowsky, R. (2019). Assessment in health professions education. (2nd ed.) New York: Routledge.
- Fendler, A. (2016). Ethical implications of validity-vs.-reliability trade-offs in educational research, Ethics & Education, 11: 2, 214–29.
- Hargreaves, A. (1989) Curriculum & Assessment Reform. Milton Keynes, UK: Open University Press.
- Linn, R.L., & Miller, M.D. (2005). Measurement & assessment in teaching. New Jersey: Prentice Hall.
- McDowell, L., & Mowl, G. (1995). Innovative assessment: Its impact on students. In G. Gibbs (Ed.), Improving student learning through assessment & evaluation. Oxford: The Oxford Centre for Staff Development.
- Messick, S. (1989). Validity. In R.L. Linn (Ed.). Educational measurement. New York: American Council on Education. pp. 13–104.
- Norcini, J., Anderson, M.B., Bollela, V., Burch, V., Costa, M.J., Duvivier, R., et al. (2011). Criteria for good assessment: consensus statement & recommendations from the Ottawa 2010 Conference. Medical Teacher, 33, 206–11.
- Ramsden, P. (1997). The context of learning in academic departments. In F. Marton, D. Hounsell, N. Entwistle (Eds.) The Experience of Learning: Implications for Teaching & Studying in Higher Education, (2nd ed.) Edinburgh: Scottish Academic Press.
- Rushton, A. (2005). Formative assessment: a key to deep learning? Medical Teacher, 27, 509–13.
- Schuwirth, L.W.T., & van der Vleuten, C.P.M. (2019). How to design a useful test: The principles of assessment. In: Swanwick, T., Forrest, K., O'Brien, B.C. (Ed.) Understanding medical education: evidence, theory & practice. West Sussex: Wiley-Blackwell.
- Siddiqui, Z.S. (2017). An effective assessment: From Rocky Roads to Silk Route. Pakistan Journal of Medical Sciences Online 32(2), 505–9.
- Streiner, D., & Norman, G. (1995). Health measurement scales. A practical guide to their development & use. (2nd ed.) New York: Oxford University Press.
- van der Vleuten, C.P.M., & Schuwirth, L.W.T. (2005). Assessing professional competence: From methods to programmes. Medical Education, 39, 309–17.
- Waugh, C.K., & Gronlund, N.E. (2012). Assessment of student achievement. 10th ed. New Jersey: Pearson.
FURTHER READING
- Black, P., & Wiliam, D. (1998). Assessment & classroom learning. Assessment in Education, 5, 7–74.
- Dent, J.A., Harden, R.M. & Hunt, D. (2017). A practical guide for medical teachers. (5th ed.), Edinburgh, Elsevier.
- Epstein, R.M., & Hundert, E.M. (2002). Defining & assessing professional competence. Journal of American Medical Association, 287, 226–35.
- Frederiksen, N. (1984). Influences of testing on teaching & learning. American Psychologist, 39, 193–202.
- Gibbs, G., & Simpson, C. (2004) Conditions under which assessment supports student learning. Learning & Teaching in Higher Education. 1, 3–31.
- Hawkins, R.E., & Holmboe, E.S. (2008). Practical guide to the evaluation of clinical competence. Philadelphia: Mosby-Elsevier.
- Jackson, N., Jamieson, A., & Khan, A. (2007). Assessment in medical education & training: A practical guide. New York: CRC Press.
- Miller, G.E. (1976). Continuous assessment. Medical Education, 10, 611–21.
- Norcini, J. (2003). Setting standard in educational tests. Medical Education, 37, 464–69.
- Singh, T., Gupta, P., & Singh, D. (2021). Principles of Medical Education. (5th ed.) New Delhi: Jaypee Brothers Medical Publishers.
- Singh, T., Anshu & Modi, J.N. (2012). The Quarter Model: A proposed approach to in-training assessment for undergraduate students in Indian Medical Schools. Indian Pediatrics, 49, 871–6.
- Swanwick, T., Forrest, K., & O'Brien, B.C. (Ed.) (2019). Understanding medical education: evidence, theory & practice. (3rd ed.) West Sussex: Wiley-Blackwell.
- Wass, V., Bowden, R., & Jackson, N. (2007). Principles of assessment design. In Jackson, N., Jamieson, A., Khan, A. (Eds.). Assessment in medical education & training: A practical guide. (1st ed.) Oxford: Radcliffe Publishing.