Principles of Assessment in Medical Education Tejinder Singh, Anshu
INDEX
360° Assessment (see Multisource feedback)
A
Admission procedures 248–251
Medical College Admission Test (MCAT) 249–251
multiple mini-interviews (MMI) 249, 251
National Eligibility-cum-Entrance Test (NEET) 252–255, 257
UK Clinical Aptitude Test (UKCAT) 249–250
AETCOM 69, 174–175, 290
Angoff method (see Standard setting)
Assessment tools
acute care assessment tool (ACAT) 126, 194–195
direct observation-based assessment of clinical skills 114–137
directly observed procedural skills (DOPS) 125–126, 185–186, 189, 225
ethics (see Ethics) 169–171
long case (see Long case) 83–90
mini-clinical evaluation exercise (see Mini-clinical evaluation exercise) 117–125, 185–186, 188–189, 225
mini-peer assessment tool (mini-PAT) 128–129, 185–186, 193
multiple choice questions (see Multiple choice questions) 47–64
multiple mini-interview (MMI) 111, 143–147, 249–251
multisource feedback (360° assessment) 127–129, 170–171, 185–186, 192–193
objective structured clinical examination (see Objective Structured Clinical Examination) 91–113
objective structured long examination record (OSLER) 87, 129–130
online assessment (see Online resources for assessment) 300–301, 371–383
oral examination (see Oral examination) 138–148
patient management problem 301–303
portfolios (see Portfolios) 151–164
professionalism (see Assessment of Professionalism) 169–171
selection type questions (See Selection type questions) 39–46
structured oral examination 142–143, 144–146
team assessment of behavior (TAB) 129, 185–186, 193–194
viva voce (see also Oral examination) 138–148
workplace-based assessment (WPBA) (see also Workplace-based assessment tools) 184–196
written assessment (see Written assessment) 32
Assessment
assessment as learning 20, 26–28, 263
assessment for learning (see Assessment for learning) 20, 25, 28, 233–246, 263, 266
assessment of learning 20–25, 234, 263
assessment versus evaluation 2
attributes of good assessment 6
basic concepts 1–17
clinical competence 18–29, 84, 117, 120, 179, 275, 301, 353, 359
community-based (see Community-based assessment) 221–232
competency-based (see Competency-based assessment) 206–220
COLE framework 16
criterion-referenced versus norm-referenced 5
difference between assessment of learning and assessment for learning 234
end of training 288
ethics 165–177
expert judgment 28, 122, 262, 267, 357, 359
for selection (see Admission procedures) 247–260
formative (see Assessment for learning) 236
in-training 288
measurement versus assessment 2
objective 352–353, 357
objective versus subjective 359–361
online (see Online assessment) 296–307
professionalism (see Professionalism) 165–177
programmatic (see Programmatic assessment) 243–244, 261–277
purposes of 3
reducing assessment stress 15
subjective (See Subjective expert judgment) 353, 357
summative assessment limitations 234–235, 279–280
summative versus formative 3–6
test versus tool 3
triangulation of data 267
types 3
utility 15, 262
Vleuten's formula 15, 23–24, 262–263 (see also Utility of assessment)
workplace-based assessment (WPBA) (see Workplace-based assessment) 178–205
written 30–39
Assessment as learning 5, 20, 26, 263, 297
Assessment for learning 20, 25, 28, 233–246, 263, 266
attributes 237–240
cycle 236
effect size of feedback 235
faculty development for 242–245
methods 240–241
strengths 235
SWOT analysis 242
Assessment of learning 20–25, 234, 263
B
Bloom's levels 33
taxonomy 33
MCQ writing 56–58
Blueprinting 24
OSCE 101–102, 117
question paper setting 71–72
C
Checklists versus global ratings 358–359
Clinical competence 18
Newble's model 31
COLE framework 16
Community-based assessment 221–232
4R model 223–226
clinical axis 223–224
evidence axis 223–224
personal axis 223–224
social axis 223–224
methods 224–226
direct observation of field skills 225
direct observation of professional skills 225, 230–231
directly observed procedural skills 225
family study assessment 225
logbook 225
mini-CEX 225
multisource feedback 225, 231
objective structured clinical examination 225
observation by community stakeholders 225
portfolios 225, 226
project assessment 225, 229–230
reflective writing 225, 228–229
assessment 228–229
rubrics 228–229
self-assessment of professional skills 225, 227
Paul Worley's framework 223–226
principles 222–223
Community-oriented medical education 221–223
Competency frameworks 19
Competency-based assessment 206–220
design 212–216
prerequisites 210
principles 210–212
Competency-based medical education (CBME) 206, 264
assessment, competency-based 206–220
Competency
core competencies 19
definition 206
Dreyfus and Dreyfus model 207
frameworks
ACGME competencies 19, 207
CanMEDS competencies 19, 207
five-star doctor 19
General Medical Council competencies 19, 207
Indian Medical Graduate (IMG) 19
Medical Council of India 19, 207
Tomorrow's Doctors 19, 207
ideal doctor 19
milestones 208
roles of Indian Medical Graduate 19
sub-competencies 207
Construct 9, 353
construct formulation 353
construct irrelevance variance 10, 353
construct underrepresentation 9, 353
Constructivism 267
Contrasting groups method (see Standard setting) 326–327
Cronbach's alpha (See Reliability)
D
Direct observation-based assessment of clinical skills 114–137
360° team assessment of behavior (TAB) 129
acute care assessment tool (ACAT) 126
directly observed procedural skills (DOPS) 125–126, 185–186, 189, 194
mini clinical evaluation exercise (mini-CEX) 117–125, 188–189, 225, 356
mini peer assessment tool (mini-PAT) 128–129
multisource feedback (360° assessment) 127–129
OSCE 90–113, 116, 131, 299, 356, 359
OSLER 129–130
professionalism mini evaluation exercise (PMEX) 127
tools (see Assessment tools) 115
Directly observed procedural skills (DOPS) 125–126, 185–186, 189, 225
Dreyfus and Dreyfus model 207–210, 213–216
E
Educational environment 239–240
Educational feedback (see Feedback to students)
Educational impact 14, 262
Educational system 236
Entrustable professional activities (EPA) 207, 213–219
designing EPAs 208, 213–217
EPA versus specific learning objectives (SLO) 210
stages of entrustment 213
Ethics
AETCOM 174–175
attributes 166
autonomy 166
beneficence 166
dignity 166
justice 166
non-maleficence 166
difference between professionalism and ethics 166
narratives 173–174
Evaluation 2
Evaluation of teaching (see Student ratings of teaching effectiveness) 342–351
Expert judgment (see Subjective expert judgment) 357, 359–361
F
Faculty development for better assessment 364–370
assessment for learning 242–245
for better assessment 364–370
formal approaches 367
informal approaches 367
model program for training 368–370
objective-structured clinical examination 100–101
transfer-oriented training 368
workplace-based assessment (WPBA) 199
Feasibility of assessment 14
Feedback (see also Feedback, educational; Feedback, from students; Feedback, to students) 25–26, 238–239
Feedback, educational 329–341
attributes 332–334
definition 329–330
descriptive 334–335
feedback loop 330–331
immediate feedback assessment technique 337
issues 338–339
models 335–336
feedback sandwich 335
PCP model 335
Pendleton model 335
reflective model 336
SET-GO model 336
STAR model 336
stop-start-continue model 336
opportunities 336–338
self-monitoring 337–338
strategies for improvement 339–340
types 332–333
benchmarking 332–333
correction 332–333
diagnosis 332–333
longitudinal development 332–333
reinforcement 332–333
Feedback, from students (see Student ratings of teaching effectiveness) 342–351
Feedback, to students (see Feedback, educational) 6, 15, 25, 26, 63, 86, 92, 103, 115, 117, 123, 130, 170, 180, 188, 211, 234, 237, 263, 266, 268, 282, 329–341
Feed forward 25–26
H
Hofstee method (see Standard setting) 327
I
Internal assessment 278–284
1997 MCI regulations 278–279
2019 MCI regulations 279
formative or summative 280–281
issues 286–288
principles 283
quarter model (see Quarter model) 281, 285–295
reliability 281–282
strengths 255–256, 278–284, 286
validity 282
Item analysis 308–318
item statistics 308–313
difficulty index 309–310
discrimination index 309–311
distractor efficiency 309, 311
facility value 309–310
point biserial correlation 311–313
test analysis 313–318
methods of estimating reliability 314–318
equivalent-forms reliability 314
internal consistency reliability 314–315
Cronbach's alpha 315–316
KR 20 formula 315
Kuder Richardson formula 315
split half method 315
standard error of measurement 317–318
parallel-forms reliability 314
test-retest reliability 314
reliability coefficient 284, 313–314
K
Knowledge
assessment of knowledge (see Written assessment) 30–46
free response type questions 30–39
multiple choice questions (see Multiple choice questions) 47–64
selection type questions (see Selection type questions) 39–46
type A (declarative) 31
type B (procedural) 31
Kolb's learning cycle 155, 330–331
Kuder Richardson formula (see Item analysis) 315
L
Logbook 185–187, 225
Long case 83–90
comparison with mini-CEX 131–133
comparison with OSCE 131–133
issues 84–85
OSLER 83, 260
process 83–84
strategies for improvement 85–89
M
Mentoring 26, 153, 244–245, 266, 271
Miller pyramid 20–23, 30–31, 92, 114–115
Mini-clinical evaluation exercise (mini-CEX) 117–125, 185–186, 188–189, 225
comparison with long case, 131–133
comparison with OSCE, 131–133
form 119–120
process 118–121
strengths 122
Mini-peer assessment tool (mini-PAT) 128–129, 185–186, 193
Modified essay questions (MEQ) 31, 35–36
Multiple choice questions (MCQs) 47–64
challenges of using MCQs 48–49
conducting MCQ tests 48
guidelines for writing MCQs 50–55
negative marking 60–61
optical mark reading scanners 59
scoring MCQs 58–60
standard setting 62
strengths of MCQs 48
structure of an MCQ 49
Multiple mini-interview (MMI) 111, 143–147, 249–251
Multisource feedback (360° assessment) 127–129, 170–171, 185–186, 192–193
N
Narratives 173–174
critical incident technique 173
portfolios 173–174
O
Objectification 357
Objective structured clinical examination (OSCE) 91–113
admission OSCE 111
blueprinting 101–102, 117
checklists versus global ratings 105, 116, 358–359
comparison with long case 131–133
comparison with mini-CEX 131–133
computer assisted OSCE (CA-OSCE) 110
examiner training 100–101
factors affecting utility 106–108
feasibility 106–107
group OSCE (GOSCE) 109
key features 93
modifications and innovations 109–111
multiple mini-interview (see Multiple mini-interview) 111
objectivity 107
reliability 108
remote OSCE (ReOSCE) 110
resources to conduct OSCE online 380–381
setup 97–103
simulated patients 100–101
standard setting (See Standard setting) 103–104
team OSCE (TOSCE) 109–110
telemedicine OSCE (TeleOSCE) 110
types of stations 93–97, 116
procedure stations 94, 95–97
question stations 94–95
rest station 97
validity 107–108
Objectivity 2, 108, 212, 282
reliability versus objectivity 352–363
Observable practice activities (OPA) 210
Online assessment 296–307
automation 298
cheating 305–306
consortia 306
designing 297–304
electronic patient management problem 301–303
implementation 304–307
methods 300–301
open-book exams 305
plagiarism 305
question formats 299–300
sharing resources 306
skill labs 306
take-home exams 305–306
triage 306–307
types of questions 300–301
Online resources for assessment 371–383
e-portfolios 379–380
for creating distributing and grading assessment 373
for high stakes examinations 381–382
for online collaboration 379
gamification apps 376–377
interactive tools for formative assessment 373–375
learning management systems 372–373
online security 381–382
proctor devices 381–382
quiz apps 376–377
to conduct online OSCE 380–381
to conduct online simulations 380–381
to create interactive videos 377–378
to create online polls 378–379
to create online surveys 378–379
to create rubrics 381
to enhance student engagement 373–375
Oral examination (viva voce) 138–148
cost-effectiveness 141–142
examiner training 147–148
flaws 139
halo effect 140
objectivity 139–140
reliability 140–141
strategies for improvement 142–148
strengths 139
structured oral examination 142–143, 144–146
validity 141
OSLER (see Direct observation-based assessment of clinical skills) 83, 129–130, 260
P
Patient management problem 301–303
Portfolios 151–164, 185–186
advantages 159–160
challenges 162–163
contents 152–154
definition 151–152
e-portfolios 379–380
for assessment 157–159, 173–174
for learning 152–157, 241
implementation 161–162
limitations 160–161
reflective writing 154–157
workplace-based assessment 185–186
Professionalism 165–177
AETCOM 174–175
altruism 166
assessment methods 169–171
principles 167–169
attributes 166
challenges 167–169
conscientious index 175
definition 165–167
difference between professionalism and ethics 166
multisource feedback 170–171
narratives 173–174
critical incident technique 173
portfolios 173–174
patient assessment 170, 171
peer assessment 170, 241
professional competence 166
professional identity formation 174–175
professionalism mini evaluation exercise (PMEX) 127, 172–173
self-assessment 169–170
supervisor ratings 170, 172
Professionalism mini-evaluation exercise (PMEX) 127, 172–173
Programmatic assessment 11, 26, 243–244, 261–277
CBME 264, 270
challenges 273–276
components 264–270
implementation 271–276
principles 264–267
rationale 262–264
traditional assessment versus programmatic assessment 268–270
triangulation of data 267
Q
Quarter Model 281, 285–295
format 289
implementation 289–293
Question banking 318–321
steps 319
uses 320–321
Question paper setting 65–82
blueprinting 71–72
determining weightage 68–70
item cards 73–76
limitations of conventional practices 66
moderation 77–78
steps for effective question paper setting 67–78
R
Reflections (see Reflective practice)
Reflective practice
for assessment for learning 26, 166, 173, 241, 266
models 155
reflective writing 154–157, 225, 228–229
rubrics 228–229
Reliability 12–13, 262, 354–356
equivalent-forms reliability 314
methods of estimating reliability 313–318
internal consistency reliability 314–315
Cronbach's alpha 315–316
KR 20 formula 315
Kuder Richardson formula 315
split-half method 315
Standard error of measurement 317–318
parallel-forms reliability 314
test-retest reliability 314
reliability coefficient 313–314
versus objectivity 352
S
Selection type questions 39–46
assertion-reason questions 41–42
computer-based objective forms 45
matching questions 43–44
key feature test 44
matching questions 42
multiple choice questions 40
multiple response questions 40
ranking questions 41
true-false questions 40
Self-monitoring 337–338
Self-directed learning 266
Short answer questions (SAQ) (see Written assessments) 31, 36
Simulated patients 100–101
Specific learning objectives 210
Standard error of measurement (see Item analysis) 317–318
Standard setting 24, 322–328
absolute standards 323–324
compensatory standards 324
conjunctive standards 324
criterion-referenced 323–324
effect on learning 324–325
MCQs 62
methods 325–328
for clinical skills 327–328
for knowledge tests 325–327
Angoff method 326
contrasting groups method 326–327
Hofstee method 327
relative method 325
need 323
norm-referenced 323–324
OSCE 103–104
relative standards 323–324
workplace-based assessment (WPBA) 184
Student ratings of teaching effectiveness 342–351
design of instrument 343–344
Dr Fox effect 347
generalizability 347
interpretation of data 345–346
logistics 344–345
misconceptions 342–343
misuses 342–343, 348
process 343–346
professional melancholia 346
purposes 348
reliability 346–347
validity 346–347
Subjective expert judgment 28, 122, 262, 267, 357, 359–361
T
Triage in medical education 306–307
U
Utility of assessment 15, 262
Vleuten's formula 15, 23–24, 262–263
V
Validity 6–11, 353–354
consequence-related evidence 8, 10, 262
construct-related evidence 8, 9–10, 353
content-related evidence 8
criterion-related evidence 8–9
factors which lower validity 11
Kane's arguments 354
key concepts 10
W
Web resources for assessment (see Online resources for assessment)
Workplace-based assessment (WPBA) 170–171, 178–205
difference from traditional assessment 180
faculty development 199
implementation steps 181–184
need 178–179
prerequisites to implementation 179–181
direct observation 180–181
feedback, 181
practice, 181
tasks 180
problem areas 199–202
quality parameters 197–199
role of assessors 196–197
role of trainee, 197
standard setting 184
strengths 182
tools 184–196
acute care assessment tool (ACAT) 194–195
assessment of performance, 194
case-based discussion (CbD) 185–186, 190–191
clinical encounter cards (CEC) 185–188
directly observed procedural skills (DOPS) 185–186, 189
discussion of correspondence (DOC) 185–186, 192
evaluation of clinical events (ECE) 185–186, 191–192
assessment tool (HAT) 195–196
LEADER case-based discussion (LEADER CbD) 195
logbook, 185–187
mini-clinical evaluation exercise 185–186, 188–189
mini-peer assessment tool (mini-PAT) 185–186, 193
multisource feedback (360° assessment) 185–186, 192–193, 225, 231
patient satisfaction questionnaire, 185–186
portfolio, 185–186, 225–226
procedure based assessment (PbA) 185–186, 189–190
safeguarding case-based discussion 196
Sheffield assessment instrument for letters (SAIL) 185–186, 192
supervised learning events 194
team assessment of behaviour (TAB) 185–186, 193–194
weaknesses 182
Written assessment 31–38
closed-ended questions 32
context-poor questions 32
context-rich questions 32
essay questions 31, 34–35
modified essay questions (MEQ) 31, 35–36
multiple choice questions 47–64
open-ended questions 32
short answer questions (SAQ) 31, 36
best response type, 37
completion type, 37
open SAQ, 37–38
structured essay questions (SEQ) 31


Chapter 1
Assessment: The Basics

Tejinder Singh
It is said that “assessment is the tail that wags the curriculum dog” (Hargreaves, 1989). This statement amply underscores the importance of assessment in any system of education. However, it also cautions us about the pitfalls that can occur when assessment is improperly used. When assessment is poorly conducted, students may view it as pointless and just another hurdle to jump over (McDowell & Mowl, 1995; Ramsden, 1997). Students focus on learning what is asked in the examination. As teachers, we can exploit this potential of assessment to give direction to student learning. Simply stated, if we ask questions based on factual recall, students will try to memorize facts; but if we frame questions requiring application of knowledge, they will learn to apply their knowledge.
In this chapter, we will introduce the basic concepts related to assessment of students and how we can maximize the effect of assessment in giving a desired shape to learning.
 
Terminology Used in Assessment
Let us first clarify the terminology. You may have read terms such as measurement, assessment, evaluation, etc. and seen them being used interchangeably. There are subtle differences between these terms. They need to be used in the right context with the right purpose, so that they convey the same meaning to everyone. Interestingly, these terms also tell the story of how educational testing has evolved over the years.
Measurement was the earliest technique used in educational testing. It meant assigning numbers to the competence exhibited by the students. For example, marking a multiple-choice question paper is a form of measurement. Since measurement is a physical term, it was presumed that it should be as precise and as accurate as possible. As a corollary, it also implied that anything which could not be measured (objectively!) should not form part of the assessment package. The entire emphasis was placed on objectivity and providing standard conditions so that the results represented only student learning (also called true score) and nothing else.
While such an approach may have been appropriate to measure physical properties (such as weight, length, temperature, etc.), it certainly did not capture the essence of educational attainment. There are several qualities which we want our students to develop, but which are not amenable to precise measurement. Can you think of some of these? You may have rightly thought of communication, ethics, professionalism, etc., which are as important as other skills and competencies, but which cannot be precisely measured.
Assessment has come to represent a much broader concept. It includes some attributes which can be measured precisely and others which cannot be measured so precisely (Linn & Miller, 2005). Some aspects such as scores of theoretical tests are objectively measured, while other aspects such as clinical-decision making are subjectively interpreted, and then combining these, a judgment is formed about the level of student achievement. Thus, viewing assessment as a combination of measurement and non-measurement gives a better perspective from teachers’ point of view. Several experts favor this approach, defining assessment as “any formal or purported action to obtain information about the competence and performance of a student” (Vleuten & Schuwirth, 2019).
Evaluation is another term which is used almost synonymously with assessment. However, there are subtle differences. Though both these terms involve passing a value judgment on learning, traditionally the term ‘assessment’ is used in the context of student learning. Evaluation, on the other hand, is used in the context of educational programs. So, you will assess the performance of students in a particular test, while you will evaluate the extent to which a particular course is equipping students with the desired knowledge and skills. Assessment of students is a very important input (though not the only one) to judge the value of an educational program.
Let us also clarify some more terms that are often loosely used in the context of student assessment. “Test” and “tool” are two such terms. Conventionally, a “test” refers to a written instrument which is used to assess learning. A test can be paper/pencil-based or computer-based. On the other hand, a “tool” refers to an instrument used to observe skills or behavior to assess the extent of learning. Objective Structured Clinical Examination (OSCE) and mini-Clinical Evaluation Exercise (mini-CEX) are examples of assessment tools.
Why do we need to assess students?
The conventional answer given to this question is: so that we can categorize them as “pass” or “fail”. But more than making this decision, several other advantages accrue from assessment. Rank ordering the students (e.g., for selection), measuring improvement over a period of time, providing feedback to students and teachers about areas which have been learnt well and others which require more attention, and maintaining the quality of educational programs are some of the other important reasons for assessment (Table 1.1).
Assessment in medical education is especially important because we are certifying students as fit to deal with human lives. The actions of doctors have the potential to make a difference between life and death. This makes it even more important to use the most appropriate tools to assess their learning. You will also appreciate that medical students are required to learn a number of practical skills, many of which can be lifesaving. Assessment is also a means to ensure that all students learn these skills.
Table 1.1   Purposes of assessment
Summative (to prove):
- To ensure that minimum required standard or competence has been attained
- For certification: as pass/fail, to award a degree
- Rank ordering for competitive selection
Formative (to improve):
- To give feedback about performance to students
- To give feedback about performance to teachers
- To evaluate the quality of an educational program
 
Types of Assessment
Assessment can be classified in many ways depending on the primary purpose for which it is being conducted. Some of the ways of classifying assessment are as follows:
  1. Formative and summative assessment
  2. Criterion- and norm-referenced testing.
1. Formative assessment and summative assessment
As discussed in the preceding paragraphs, assessment can be used not only for certification, but also to provide feedback to teachers and students. Based on this perspective, assessment can be classified as formative or summative.
Formative assessment is the assessment which is conducted with the primary purpose of providing feedback to students and teachers. Since the purpose is diagnostic (and remedial), it should be able to reveal strengths and weaknesses in student learning. If students disguise their weaknesses and try to bluff the teacher, the purpose of formative assessment is lost. This feature has important implications in designing assessment for formative purposes. To be useful, formative assessment should happen as often as possible—in fact, experts suggest that it should be almost continuous. Remember, when we give formative feedback, we do not give students a single score, but we give students a complete profile of their strengths and weaknesses in different areas. Since the purpose is to help the student learn better, formative assessment is also called assessment for learning.
Formative assessment should not be used for final certification. This implies that certain assessment opportunities must be designated as formative only, so that teachers have an opportunity to identify the deficiencies of the students and undertake remedial action. A corollary of this statement is that all assignments need not be graded, or that all grades need not be considered during the calculation of final scores. From this perspective, all assessments are de facto summative; they become formative only when they are used to provide feedback to the students to make learning better. Formative assessment has been discussed in more detail in Chapter 16.
Summative assessment, on the other hand, implies testing at the end of the unit, semester or course. Please note that summative does not refer only to the end-of-the-year University examinations. Assessment becomes summative when the results are going to be used to make educational decisions. Summative assessment is also called assessment of learning.
Summative assessment intends to test if the students have attained the objectives laid down for a specified unit of activity. It is also used for certification and registration purposes (e.g., giving a license to practice medicine). Did you notice that we said “attainment of listed objectives”? This implies that students must be informed well in advance, right at the beginning of the course, about what is expected from them when they complete the course, so that they can plan the path of their learning accordingly. Most institutions ignore this part, leaving it for students to make their own interpretations based on inputs from various sources (mainly from senior students). No wonder then, that we often end up frustrated with the way the students learn.
The contemporary trend is towards blurring the boundary between formative and summative assessment. Purely formative assessment without any consequences will not be taken seriously by anybody. On the other hand, purely summative assessment has no learning value or opportunity for improvement. There is no reason why the same assessment cannot be used to provide feedback, as well as to calculate final scores. We will discuss this aspect in more detail in Chapter 18.
We strongly believe that every teacher can play a significant role in improving student learning by the judicious use of assessment for learning. Every teacher may not be involved with setting high-stakes question papers, but every teacher is involved with developing assessment locally to provide feedback to the students. Throughout this book, you will find a tilt toward the formative function of assessment.
Sometimes, assessment itself can be used as a learning task, in which case, it is called assessment as learning.
2. Criterion-referenced and norm-referenced testing
Yet another purpose of assessment that we listed above was to rank order the students (e.g., for selection purposes). From this perspective, it is possible to classify assessment as criterion-referenced testing (CRT) and norm-referenced testing (NRT).
Criterion-referenced testing involves comparing the performance of the students against pre-determined criteria. This is particularly useful for term-end examinations or before awarding degrees to doctors, where we want to ensure that students have attained the minimum desired competencies for that course or unit of the course. Competency-based curricula largely require criterion-referenced testing.
Results of CRT can only be a pass or a fail. Let us take an example. If the objective is that the student should be able to perform cardiopulmonary resuscitation, then he must perform all the essential steps to be declared a pass. The student cannot pass if he performs only 60% of the steps! CRT requires establishment of an absolute standard before starting the examination.
Norm-referenced testing, on the other hand, implies rank ordering the student. Here each student's results set the standard for those of others. NRT only tells us how the students did in relation to each other—it does not tell us “what” they did. There is no fixed absolute standard, and ranking can happen only after the examination has been conducted.
Again, there can be variations to this, and one of the commonly employed means is a two-stage approach, i.e. first use CRT to decide who should pass and then use NRT to rank order them. Traditionally in India we have been following this mixed approach. However, we do not seem to have established defensible standards of performance so far and often arbitrarily take 50% as the cut-off for pass/fail. This affects the validity of assessment. It is important to have defensible standards. Standard setting has been discussed in Chapter 23.
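To make the two approaches concrete, here is a minimal sketch in Python of CRT, NRT, and the two-stage combination described above; the names, scores, and the cut-off of 50 are hypothetical values chosen only for illustration.

# Illustrative sketch: criterion-referenced vs norm-referenced decisions
# (hypothetical names, scores and cut-off; not taken from any real examination)
scores = {"Asha": 72, "Bilal": 48, "Chen": 65, "Devi": 55}
CUT_OFF = 50  # an assumed pre-determined absolute standard (CRT)

# Criterion-referenced: compare each student against the fixed standard
crt_result = {name: ("pass" if marks >= CUT_OFF else "fail") for name, marks in scores.items()}

# Norm-referenced: rank students against each other; no fixed standard exists
nrt_ranking = sorted(scores, key=scores.get, reverse=True)

# Two-stage approach: first apply CRT, then rank order only those who passed
merit_list = [name for name in nrt_ranking if crt_result[name] == "pass"]

print(crt_result)   # {'Asha': 'pass', 'Bilal': 'fail', 'Chen': 'pass', 'Devi': 'pass'}
print(nrt_ranking)  # ['Asha', 'Chen', 'Devi', 'Bilal']
print(merit_list)   # ['Asha', 'Chen', 'Devi']

Note that the criterion-referenced decision does not depend on how the rest of the cohort performed, whereas the norm-referenced ranking says nothing about whether anyone actually met the standard.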
 
Attributes of Good Assessment
We have argued for the importance of assessment as an aid to learning. This is related to many factors. The provision of feedback (e.g., during formative assessment) improves learning (Burch et al, 2006; Rushton, 2005). Similarly, the test-driven nature of learning again speaks for the importance of assessment (Dochy & McDowell, 1997). What we would like to emphasize here is that the reverse is also true, i.e. when improperly used, assessment can distort learning. We are all aware of the adverse consequences on the learning of interns that occur when selection into postgraduate courses is based only on the results of one MCQ-based test.
There are several attributes that good assessment should possess. Rather than going into the plethora of attributes available in literature, we will restrict ourselves to the five most important attributes of good assessment as listed by Vleuten & Schuwirth (2005). These include:
  1. Validity
  2. Reliability
  3. Feasibility
  4. Acceptability, and
  5. Educational impact.
 
Validity
Validity is the most important attribute of good assessment. Traditionally, it has been defined as “measuring what is intended to be measured” (Streiner & Norman, 1995). While this definition is correct, it requires a lot of elaboration (Downing, 2003). Let us try to understand validity better.
The traditional view was that validity is of various types: content validity, criterion validity (this was further divided into predictive validity & concurrent validity), and construct validity (Crossley, Humphris & Jolly, 2002) (Fig. 1.1). This concept had the drawback of seeing assessment as being valid in one situation but not in another. With this approach, a test could cover all areas and have good content validity; but may not be valid when it comes to predicting future performance. Let us draw a parallel between validity and honesty as an attribute.
Fig. 1.1: Earlier concept of validity.
Just as it is not possible for a person to be honest in one situation and dishonest in another (then he would not be called honest!), the same is true of validity.
Fig. 1.2: Contemporary concept of validity.
Validity is now seen as a unitary concept, which must be inferred from various sources of evidence (Fig. 1.2). Let us come back to the “honesty” example. When would you say that someone is honest? One would have to look at a person's behavior at work, at home, in a situation when he finds something expensive lying by the roadside, or how he pays his taxes, and only then make an inference about his honesty. Validity is a matter of inference, based on available evidence.
Validity refers to the interpretations that we make out of assessment data. Implied within this is the fact that validity does not refer to the tool or results—rather, it refers to the interpretations we make from the results obtained by use of that tool. From this viewpoint, it is pertinent to remember that no test or tool is inherently valid or invalid.
Let us explain this further. Suppose we use a 200-question MCQ test to select the best students to get into postgraduate medical courses. We could interpret that the highest scorers of the test have the best content knowledge. To do this, we would have to gather evidence to check if all relevant portions of the syllabus had received adequate representation in the paper. We could also state that the students with the best aptitude have been selected for the course. For this, we would need to present evidence that the test was designed to assess aptitude as well. As you can see, validity is contextual. So here, it is not the tool (MCQ) which is valid or invalid, but the interpretations that we infer from the results of our assessment which matter.
Inferring validity requires empirical evidence. What are the different kinds of evidence that we can gather to determine if the interpretations we are making are appropriate and meaningful?
As Figure 1.2 shows, we need to gather evidence from different sources to support or refute the interpretations that we make from our assessment results. Depending on the situation, we might look for one or two types of evidence to interpret the validity of an assessment. But ideally, we would need to look for evidence in the following four categories (Waugh & Gronlund, 2012):
  1. Content-related evidence: Does the test adequately represent the entire domain of tasks that is to be assessed?
  2. Criterion-related evidence: Does the test predict future performance? Do the test results compare to results of some other simultaneously conducted test (this has been explained below in more detail)?
  3. Construct-related evidence: Does this test measure the psychological or educational characteristics that we intended to measure?
  4. Consequence-related evidence: Did the test have a good impact on learning and avoid negative effects?
To do this, one has to be fully aware of: why we are performing a particular assessment; the exact nature of the construct being assessed; what we are expecting to obtain by conducting this exercise; what the assessment results are going to be used for; the exact criteria which are going to be used to make decisions on assessment results; and the consequences of using this assessment. Let's understand this in more detail.
 
Evidence Related to Content Validity
Generally, we look for content-related evidence to see if the test represents the entire domain of the content, competencies, and objectives set for a course. If an undergraduate student is expected to perform certain basic skills (e.g., giving an intramuscular injection or draining an abscess) and if these skills are not assessed in the examination, then content-related validity evidence is lacking. Similarly, if the number of questions is not proportional to the content [e.g., if 50% weightage is given to questions from the central nervous system (CNS) at the cost of anemia, which is a much more common problem], the assessment results might not be meaningful.
Thus, sampling is a key issue in assessment. Always ask yourself if the test is representative of the whole domain that you are assessing. For this, look at the learning outcomes, prepare a plan (blueprint) and prepare items which correspond to these specifications. More on this will be dealt with in Chapter 6.
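As a toy illustration of how a blueprint translates weightage into the number of items, the following Python sketch distributes items across content areas in proportion to assumed weightages; the areas, percentages, and paper length are invented for illustration and are not taken from any actual blueprint.

# Illustrative sketch of blueprinting: distributing items across content areas
# in proportion to their agreed weightage (hypothetical areas and weights)
blueprint = {
    "Anemia": 30,                    # weightage in percent
    "Respiratory infections": 25,
    "CNS disorders": 20,
    "Endocrine disorders": 15,
    "Miscellaneous": 10,
}
TOTAL_ITEMS = 40                     # assumed length of the question paper

for area, weightage in blueprint.items():
    n_items = round(TOTAL_ITEMS * weightage / 100)
    print(f"{area}: {n_items} items")

A test whose items roughly follow such a plan is far easier to defend on content-related grounds than one assembled ad hoc.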
 
Evidence Related to Criterion Validity
The reason why we look for evidence related to criterion validity is to see whether the test scores correspond to the criterion which we seek to predict or estimate. Let us take an example. Suppose we conduct an entrance examination to select the best undergraduate students into a postgraduate surgical course. Here, the purpose of the test is to predict future performance. To infer that this test was appropriate, we would perhaps need to gather data about the students’ performance after they qualify as surgeons and see if these test results correspond to their performance. Here we are using future performance as the criterion. This is an example of how evidence about predictive validity can be gathered.
Now suppose we are introducing a new assessment method A and we want to see how it works in comparison to an existing assessment method B for the same purpose and in the same setting. To do this we can compare the results obtained from both tools in the same setting to see how effective method A is in comparison to the previous method B. Here we are concurrently judging the results of two methods to see if they are comparable. This is the concept of concurrent validity.
 
Evidence Related to Construct Validity
Validity also requires construct-related evidence. What do we understand by ‘construct’? The dictionary meaning of construct is “a complex idea resulting from a synthesis of simpler ideas”. A construct has also been defined in educational or psychological terms as “an intangible collection of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory” (Downing, 2003). Thus, a construct is a collection of inter-related components, which when grouped together as a whole gives a new meaning.
If we were to consider the construct ‘beauty’, we might use attributes such as physique, complexion, poise, and confidence to decide if one is beautiful. Similarly, in educational settings, subject knowledge, its application, data gathering, interpretation of data and many other things go into deciding the construct ‘clinical competence’. In medicine, educational attainment, intelligence, aptitude, problem-solving, professionalism, and ethics are some other examples of constructs.
All assessment in education aims at assessing a construct. It is the theoretical framework which specifies the hypothetical qualities that we seek to measure. For instance, we are not interested in knowing if students can enumerate five causes of hepatomegaly. But we are interested in knowing if they can take a relevant history based on those causes. In this context, construct-related evidence becomes the most important way to infer validity. Simply stated, results of assessment will be more valid if they tell us about the problem-solving ability of a student, rather than about his ability to list five causes of each of the symptoms shown by the patient. As a corollary, it can also be said that if the construct is not fully represented (e.g., testing only presentation skills, but not physical examination skills during a case presentation), validity is threatened. Messick (1989) calls this construct under-representation (CU).
While content and construct seem to be directly related to the course, the way a test is conducted can also influence its validity. A question may be included in an examination to test understanding of certain concepts, but if the papers are marked based on a student's handwriting, validity is threatened. If a test is conducted in a hot, humid, and noisy room, its validity becomes low, because then one is also implicitly testing candidates’ ability to concentrate in the presence of distractions rather than their educational attainment. Notice here that the construct that we were assessing has changed. If an MCQ is framed in complicated language and students must spend more time understanding its complex wording than its content, validity is again threatened. Here, besides content, one is testing vocabulary and reading comprehension. Similarly, leaked question papers, an incorrect key, equipment failure, etc. can have a bearing on the validity. Messick (1989) calls this construct irrelevance variance (CIV).
Let us try to explain this concept in a different way. Let us say you conduct an essay type test and try to assess knowledge, skills, and professionalism from the same. We would expect that there would be low correlation between the scores on the three domains. On the other hand, if we conduct three different tests, say for example, essays, MCQs, and oral examination to assess knowledge, we would expect a high correlation between scores. If we were to get just the opposite results—i.e., high correlation in the first setting and low in the second, construct irrelevance variance would be said to exist. You can think of many common examples from your own settings, which induce CIV in our assessment. Too difficult or too complicated questions, use of complex language which is not understood by students, words which confuse the students and “teaching to the test” are some of the factors which will induce CIV. Designing OSCE stations which test only analytical skills will result in invalid interpretation about practical skills of a student by inducing CIV.
The contemporary concept of validity is that all validity is construct validity (Downing, 2003). It is the most important of all the evidences that we gather to determine validity.
 
Evidence Related to Consequential Validity
When we design an assessment, it is always pertinent to ask about the consequences of using that format. Did it motivate students to learn differently? Did it lower their motivation to study, or did it encourage poor study habits? Did it lead them to choose surface learning over deep learning? Did it make them think about application of the knowledge or did they resort to mere memorization of facts? Evidence about these effects needs to be collected.
These main concepts on validity have been summarized in Box 1.1.
How can we build in validity?
Validity should be built in right from the stage of planning and preparation. Assessment should match the contents of the course and provide proportional weightage to each of the contents. Blueprinting and good sampling of content are very helpful in ensuring content representation (see Chapter 6). Also implied is the need to let students know right in the beginning what is expected from them at the end of the course. Use questions which are neither too difficult nor too easy, and which are worded in a way appropriate to the level of the students. Validity also involves proper administration and scoring. Maintaining transparency, fairness, and confidentiality of the examinations are some methods of building validity. Similarly, the directions, scoring system, and test format all have a bearing on the validity (Table 1.2).
Table 1.2   Factors which lower validity and their remedies (factor → remedy)
- Too few items or cases → Increase the number of items/cases; increase the frequency of testing
- Unrepresentative or irrelevant content → Blueprinting; subject experts' feedback; moderation
- Too easy or too difficult questions → Better framing of questions; test and item analysis
- Items violating standard writing guidelines → Screening of items, faculty training
- Problems with test administration (leakages, wrong papers, wrong keys, administrative issues) → Appropriate administrative measures, monitoring mechanisms
- Problems with test construction, scoring, improper instructions → Faculty training, screening and monitoring mechanisms
We have long followed the dictum of “assessment drives learning,” which often results in extraneous factors distorting learning. A better way would be to let “learning drive assessment” so that validity is built into assessments. This concept of Programmatic Assessment has been discussed in detail in Chapter 18.
Similarly, the assessment tools should aim to test broad constructs rather than individual competencies like knowledge or skills. It is often better to use multiple tools to get different pieces of information on which a judgment of student attainment can be made. It is also important to select tools which can test more than one competency at a time. There is little use in having one OSCE station to test history taking, another for skills, and yet another for professionalism. Each station should be able to test more than one competency. This not only provides an opportunity for wider sampling by having more competencies tested at each station but also builds validity.
 
Reliability
Let us now move to the second important attribute of assessment—reliability. Commonly, reliability refers to reproducibility of the scores. Again, like in the case of validity, this definition needs a lot of elaboration (Downing, 2004).
A commonly used definition of reliability is obtaining the same results under similar conditions. This might be true of a biochemical test. However, it is not completely true of an educational test. Let us say, during the final professional MBBS examination, we allot a long case to a student in a very conducive setting, where there is no noise or urgency, and the patient is very cooperative. But we know that in actual practice, this seldom happens. Similarly, no two patients with the same diagnosis will have a similar presentation. In the past, educationists have tried to make examinations more and more controlled and standardized (e.g., OSCE and standardized patients), so that the results represent only student attainment and nothing else. We argue that it might be better to work in reverse, i.e., conduct examinations in settings as close to actual ones as possible so that reproducibility can be ensured. This is the concept of workplace-based and authentic assessment.
We often tend to confuse the terms ‘objectivity’ and ‘reliability’. Objectivity refers to reproducibility of the scores, so that anyone marking the test would mark it the same way. There are certain problems in equating reliability with objectivity in this way. For example, if the key to an item is wrongly marked in a test, everyone would mark the test similarly and generate identical scores. But are we happy with this situation? No, because it leads to faulty interpretation of the scores. Let us add some more examples. Suppose at the end of the final professional MBBS, we were to give the students a test paper containing only 10 MCQs. The results will be very objective, but they will not be a reliable measure of students’ knowledge. There is no doubt that objectivity is a useful attribute of any tool, but it is more important to have items (or questions) which are fairly representative of the universe of items which are possible in a subject area, and at the same time a sufficient number of items so that the results are generalizable. In other words, in addition to objectivity we also need an appropriate and adequate sample to get reliable results. This example also shows how reliability evidence contributes to validity.
We would also like to argue that objectivity is not the sine qua non of reliability. A subjective assessment can be very reliable if based on adequate content expertise. We all make subjective predictions about the potential of our students, and we rarely go wrong! The point that we are trying to make is that in educational testing there is always a degree of prediction involved. Will the student whom we have certified as being able to handle a case of mitral stenosis in the medical college be able to do so in practice? To us, reliability is therefore the degree of confidence that we can place in our results (try reading reliability as rely-ability).
A common reason for low reliability is the content specificity of the case. Many examiners will prefer to have a neurological case in the final examination in medicine. It is presumed that a student who can satisfactorily present this case can also present a patient with anemia or malnutrition. This could not be farther from the truth. Herein lies the importance of including a variety of cases in the examination to make them representative of what the student is going to see in real life. You will recall what we said earlier that a representative and adequate sampling is also important to build validity.
Viewing reliability of educational assessment differently from that of other tests has important implications. Let us suppose that we give a test of clinical skills to a final year student. If we look at reliability merely as reproducibility, in other words, getting the same results if the same case is given again to the student under the same conditions, then we will try to focus on precision of scores. However, if we conceptualize reliability as confidence in our interpretation, then we would like to examine the student under different conditions (outpatients, inpatients, emergency, community settings, etc., and by different examiners) and on different patients so that we can generalize our results. We might even like to add feedback from peers, patients, and other teachers to make inferences about the competence of the student.
We often go by the idea that examiner variability can induce a lot of unreliability in the results. To some extent this may be true. While examiner training is one solution, it is equally useful to have multiple examiners. We have already discussed the need to include a variety of content in assessment. It may not be possible to use many assessment formats on one occasion, but this can happen when we carry out assessment on multiple occasions. The general agreement in educational assessment is that a single assessment, howsoever perfect, is flawed for making educational decisions. Therefore, it is important to collect information on several occasions using a variety of tools. The key dictum to build reliability (and thereby validity) for any assessment is to have multiple tests on multiple content areas by multiple examiners using multiple tools in multiple settings. The concept of Programmatic Assessment discussed in Chapter 18 largely follows this approach.
Validity and reliability of a test are very intricately related. To be valid, a test should be reliable. Reliability evidence contributes to validity. A judge cannot form a valid inference if the witness who is being examined is unreliable. Thus, reliability is a precondition for validity. But let us caution you that it is not the only condition. Please also be aware that generally there is a trade-off between validity and reliability: the stronger the bases for validity, the weaker the bases for reliability (and vice-versa) (Fendler, 2016).
An application-oriented perspective on validity and reliability of assessments has been discussed in Chapter 26.
 
Feasibility
The third important attribute of assessment is feasibility. We may like to assess every student by asking them to perform a cardiopulmonary resuscitation on an actual patient, but it may not be logistically possible. The same is true of many other skills and competencies. In such situations, one needs to think of other alternatives like simulations, or to tie up with other professional organizations for such assessments.
 
Acceptability
The next attribute of assessment is acceptability. Several assessment tools are available to us and sometimes we can have a variety of methods to fulfill the same objective.
Portfolios, for example, can provide as much information as can be provided by rating scales. MCQs can provide as much information about knowledge as can be obtained by oral examinations. However, acceptability by students, raters, institutions and society at large can play a significant role in accepting or rejecting a tool. MCQs, despite all their problems, are accepted as a tool for selecting students for postgraduate courses, while methods like portfolios, which provide more valid and reliable results, may not be. This is not to suggest that we should sacrifice good tools based on likes or dislikes, but to suggest that all stakeholders need to be involved in the decision-making process about the use of assessment tools.
Linked to the concept of acceptability is also the issue of feasibility. While we may have developed very good tools for assessing the communication skills of our students, a resource crunch may not allow us to use these tools on a large scale.
 
Educational Impact
The educational impact of assessment is a very significant issue. The impact of assessment can be seen in terms of student learning, consequences for the students and consequences for society. We have already referred to the impact of MCQ-based selection tests on student learning. For students, a wrong assessment decision can act as a double-edged sword. A student who has wrongly been failed has to face the consequences in terms of time and money. On the other hand, if a student is wrongly passed when he does not deserve it, society must deal with the consequences of having an incompetent physician.
Assessments do not happen in a vacuum. They happen within the context of certain objectives. For each assessment, there is an expected use: it could be making pass/fail judgments, selecting students for an award, or simply providing feedback to teachers. Asking these three questions brings a lot of clarity to the process and helps in selecting appropriate tools.
  1. Who is going to use this data?
  2. At what time? and,
  3. For what purpose?
 
Utility of Assessment
Before we end this chapter, let us introduce you to the concept of utility of assessment. Vleuten (1996) has suggested a conceptual model for the utility of any assessment.
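In the weighted multiplicative form usually associated with van der Vleuten's model, and using the five attributes discussed in this chapter, the relationship can be sketched as

U = w_R R \times w_V V \times w_E E \times w_A A \times w_F F

where U is utility, R reliability, V validity, E educational impact, A acceptability, F feasibility, and each weight w reflects how much that attribute matters for the purpose of a particular assessment; the symbols and weights are the conventional notation for such a sketch rather than values prescribed in this book.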
This is not a mathematical formula but a notional one. This concept is especially important because it shows us how to compensate for deficiencies in assessment tools by their strengths. Results of some tools may be low on reliability but can still be useful if they are high on their educational impact. For example, results of MCQs have a high reliability, but little educational value. Results of the mini-clinical evaluation exercise (mini-CEX), on the other hand, may be low on reliability, but have a higher educational value due to the feedback component. Still, both are equally useful to assess students. Similarly, if a certain assessment has a negative value for any of the parameters (e.g., if an assessment promotes unsound learning habits), then its utility may be zero or even negative.
The above five criteria contributing to the utility of assessment were accepted by consensus in 2010 as the criteria for good assessment, along with two additional criteria (Norcini et al., 2011). While we have retained the earlier nomenclature of “criteria,” there have been some modifications to it. Later, at the 2018 Ottawa consensus meeting, the nomenclature was changed from “criteria” to “framework” for good assessment, emphasizing the essential structure that these elements provide (Norcini et al., 2018). The alternative nomenclature of these seven elements was provided as: (1) Validity or coherence; (2) Reproducibility, Reliability, or Consistency; (3) Equivalence (the same assessment yields equivalent scores or decisions when administered across different institutions or cycles of testing); (4) Feasibility; (5) Educational Effect; (6) Catalytic effect (the assessment provides results and feedback in a fashion that motivates all stakeholders to create, enhance, and support education; it drives future learning forward and improves overall program quality); (7) Acceptability (Norcini et al., 2018). The same paper summarizes the relationship between these elements of the framework and the purpose of assessment (formative or summative) rather well. Validity is essential for both the formative and summative purposes. While reliability and equivalence are more important for the summative assessments, the educational and catalytic effects are key to formative use. Feasibility and acceptability considerations are a must for both. Whatever nomenclature we may adopt, assessment can never be viewed in terms of a single criterion, framework or attribute.
 
Easing Assessment Stress
Assessments induce a lot of stress and anxiety amongst students (and teachers). Assessment should be like a moving ramp rather than a staircase with a block at each stage. Many approaches can be used to reduce examination stress. A COLE framework has been proposed (Siddiqui, 2017) to smooth out assessment problems. This stands for: Communication to the stakeholders about the need and purpose of a tool; Orientation to ensure that the tool is used as intended, by teachers and students alike; Learning orientation in the tool so that all assessments contribute to better learning; and Evaluation of the tool itself to see if it is serving the intended purpose.
The other approach is to reduce stakes on individual assessments and take a collective decision based on multiple low stake assessments, spread throughout the course. This is the basis of programmatic assessment and will be discussed in Chapter 18.
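A toy Python sketch of such a collective decision, aggregating hypothetical scores from several low-stakes encounters (the occasions, tools, and numbers are invented for illustration), could look like this:

# Illustrative sketch: one collective judgment from many low-stakes assessments
# instead of a single high-stakes test (all data below are hypothetical)
observations = [                       # (occasion, tool, score out of 10)
    ("Jan", "mini-CEX", 6), ("Feb", "MCQ test", 7), ("Mar", "DOPS", 5),
    ("Apr", "case discussion", 8), ("May", "mini-CEX", 7), ("Jun", "MSF", 6),
]

aggregate = sum(score for _, _, score in observations) / len(observations)
weak_spots = [(when, tool) for when, tool, score in observations if score < 6]

# The aggregate informs the eventual decision; the dips inform ongoing feedback
print(f"Aggregate across {len(observations)} encounters: {aggregate:.1f}/10")
print("Needs focused feedback on:", weak_spots)

Because no single encounter decides the outcome, each one can remain low-stakes and feedback-rich.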
As we go through the subsequent chapters, we will discuss assessment methods and assessment design in greater detail.
REFERENCES
  1. Burch, V.C., Saggie, J.C., & Gary, N. (2006). Formative assessment promotes learning in undergraduate clinical clerkships. South African Medical Journal, 96, 430–33.
  2. Crossley, J., Humphris, G., & Jolly, B. (2002). Assessing health professionals. Medical Education, 36(9), 800–4.
  3. Dochy, F.J.R.C., & McDowell, L. (1997). Assessment as a tool for learning. Studies in Educational Evaluation, 23, 279–98.
  4. Downing, S.M. (2003). Validity: on the meaningful interpretation of assessment data. Medical Education, 37, 830–7.
  5. Downing, S.M. (2004). Reliability: on the reproducibility of assessment data. Medical Education, 38(9), 1006–12.
  6. Downing, S.M., Park, Y.S., & Yudkowsky, R. (2019). Assessment in health professions education. (2nd ed.) New York: Routledge.
  7. Fendler, A. (2016). Ethical implications of validity-vs.-reliability trade-offs in educational research. Ethics & Education, 11(2), 214–29.
  8. Hargreaves, A. (1989). Curriculum & Assessment Reform. Milton Keynes, UK: Open University Press.
  9. Linn, R.L., & Miller, M.D. (2005). Measurement & assessment in teaching. New Jersey: Prentice Hall.
  10. McDowell, L., & Mowl, G. (1995). Innovative assessment: Its impact on students. In G. Gibbs (Ed.) Improving student learning through assessment & evaluation. Oxford: The Oxford Centre for Staff Development.
  11. Messick, S. (1989). Validity. In R.L. Linn (Ed.). Educational measurement. New York: American Council on Education. pp. 13–104.
  12. Norcini, J., Anderson, M.B., Bollela, V., Burch, V., Costa, M.J., Duvivier, R., et al. (2011). Criteria for good assessment: consensus statement & recommendations from the Ottawa 2010 Conference. Medical Teacher, 33, 206–11.
  13. Norcini, J., Anderson, M.B., Bollela, V., Burch, V., Costa, M.J., Duvivier, R., et al. (2018). 2018 consensus framework for good assessment. Medical Teacher, 40, 1102–9.
  14. Ramsden, P. (1997). The context of learning in academic departments. In F. Marton, D. Hounsell, N. Entwistle (Eds.) The Experience of Learning: Implications for Teaching & Studying in Higher Education. (2nd ed.) Edinburgh: Scottish Academic Press.
  15. Rushton, A. (2005). Formative assessment: a key to deep learning? Medical Teacher, 27, 509–13.
  16. Schuwirth, L.W.T., & van der Vleuten, C.P.M. (2019). How to design a useful test: The principles of assessment. In: Swanwick, T., Forrest, K., O'Brien, B.C. (Eds.) Understanding medical education: evidence, theory & practice. West Sussex: Wiley-Blackwell.
  17. Siddiqui, Z.S. (2017). An effective assessment: From Rocky Roads to Silk Route. Pakistan Journal of Medical Sciences Online, 32(2), 505–9.
  18. Streiner, D., & Norman, G. (1995). Health measurement scales: A practical guide to their development & use. (2nd ed.) New York: Oxford University Press.
  19. van der Vleuten, C.P.M., & Schuwirth, L.W.T. (2005). Assessing professional competence: From methods to programmes. Medical Education, 39, 309–17.
  20. Waugh, C.K., & Gronlund, N.E. (2012). Assessment of student achievement. (10th ed.) New Jersey: Pearson.
FURTHER READING
  1. Black, P., & Wiliam, D. (1998). Assessment & classroom learning. Assessment in Education, 5, 7–74.
  2. Dent, J.A., Harden, R.M., & Hunt, D. (2017). A practical guide for medical teachers. (5th ed.) Edinburgh: Elsevier.
  3. Epstein, R.M., & Hundert, E.M. (2002). Defining & assessing professional competence. Journal of American Medical Association, 287, 226–35.
  4. Fredriksen, N. (1984). Influences of testing on teaching & learning. American Psychologist, 39, 193–202.
  5. Gibbs, G., & Simpson, C. (2004). Conditions under which assessment supports student learning. Learning & Teaching in Higher Education, 1, 3–31.
  6. Hawkins, R.E., & Holmboe, E.S. (2008). Practical guide to the evaluation of clinical competence. Philadelphia: Mosby-Elsevier.
  7. Jackson, N., Jamieson, A., & Khan, A. (2007). Assessment in medical education & training: A practical guide. New York: CRC Press.
  8. Miller, G.E. (1976). Continuous assessment. Medical Education, 10, 611–21.
  9. Norcini, J. (2003). Setting standard in educational tests. Medical Education, 37, 464–69.
  10. Singh, T., Gupta, P., & Singh, D. (2021). Principles of Medical Education. (5th ed.) New Delhi: Jaypee Brothers Medical Publishers.
  11. Singh, T., Anshu, & Modi, J.N. (2012). The Quarter Model: A proposed approach to in-training assessment for undergraduate students in Indian Medical Schools. Indian Pediatrics, 49, 871–6.
  12. Swanwick, T., Forrest, K., & O'Brien, B.C. (Eds.) (2019). Understanding medical education: evidence, theory & practice. (3rd ed.) West Sussex: Wiley-Blackwell.
  13. Wass, V., Bowden, R., & Jackson, N. (2007). Principles of assessment design. In Jackson, N., Jamieson, A., Khan, A. (Eds.). Assessment in medical education & training: A practical guide. (1st ed.) Oxford: Radcliffe Publishing.