part of the test, the type of materials with which candidates will
have to engage, the source of such materials if authentic, the extent
to which authentic materials may be altered, the response format,
the test rubric, and how responses are to be scored. Test materials
are then written according to the specifications, which may of
course themselves be revised in the light of the writing process.
Test trials
The fourth stage is trialling or trying out the test materials and procedures prior to their use under operational conditions. This stage involves careful design of data collection to see how well the test is working. A trial population will have to be found, that is, a group of people who resemble in all relevant respects (age, learning background, general proficiency level, etc.) the target test population. With discrete point test items, a trial population of at least 100, and frequently far more than this, is required. Careful statistical analysis is carried out of responses to items to investigate their quality. Some of the procedures and concepts involved are explained in Chapter 6.
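To make the kind of item analysis just mentioned concrete, here is a minimal sketch of two classic statistics for discrete-point items: facility (the proportion of candidates answering correctly) and discrimination (how well the item separates stronger from weaker candidates). The response data and function names are invented for illustration and are not drawn from any actual trial.

```python
# A minimal sketch of classical item analysis for discrete-point items.
# Rows are trial candidates, columns are items; 1 = correct, 0 = incorrect.
# All data and names here are hypothetical illustrations.
from statistics import mean, pstdev

responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

def facility(item: int) -> float:
    """Proportion of candidates answering the item correctly."""
    return mean(row[item] for row in responses)

def discrimination(item: int) -> float:
    """Point-biserial correlation between item score and total score.
    (Uncorrected: the total here includes the item itself.)"""
    totals = [sum(row) for row in responses]
    scores = [row[item] for row in responses]
    m, t = mean(scores), mean(totals)
    cov = mean((s - m) * (x - t) for s, x in zip(scores, totals))
    sd_s, sd_t = pstdev(scores), pstdev(totals)
    return cov / (sd_s * sd_t) if sd_s and sd_t else 0.0

for i in range(4):
    print(f"item {i}: facility={facility(i):.2f}, "
          f"discrimination={discrimination(i):.2f}")
```

With a real trial population of at least 100, as suggested above, these figures become stable enough to flag items that are too easy, too hard, or poorly discriminating.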
Where subjective judgements of speaking and writing are involved, there is a need for training of those making the judgements, and investigation of the interpretability and workability of the criteria and rating scales to be used. These issues are dealt with in detail in Chapter 4.
In addition, test-taker feedback should be gathered from the
trial subjects, often by a simple questionnaire. This will
include
questions on perceptions of the level of difficulty of particular
questions, the clarity of the rubrics, and general attitude to the
materials and tasks. Subjects can quickly spot things that are
problematic about materials which test developers struggle to see.
Materials and procedures will be revised in the light of the
trials, in preparation for the operational use of the test. Data from
actual test performances needs to be systematically gathered and
analysed, to investigate the validity and usefulness of the test
under operational conditions. Periodically, the results of this data
gathering may lead to substantial revision of the test design, and
the testing cycle will recommence. In any case, all new versions of
the test need to be trialled, and monitored operationally. It is in
the context of the testing cycle that most research on language
testing is carried out.
Conclusion
In this chapter we have examined the process through which a testing procedure is conceptualized, developed, and put into operation. We have considered test content as an expression of test construct, and looked at how that content may be determined, especially through the procedures of job analysis. We have considered that the way in which candidates interact with test materials can also replicate real-world processes, and considered the issues of authenticity that arise. Often, in the interests of economy or manageability, particularly in large-scale tests, such replication is unaffordable, and more conventional response formats are the only option. We have considered a range of such formats here. In Chapter 4 we will consider in much greater detail the concepts and methods involved in judgements of performance in speaking and writing, and the issues of fairness that arise.
Throughout the testing cycle data for the investigation of test qualities are generated automatically in the form of responses to test items. The use of test data by researchers to question the fairness of the test takes us into the area of the validation of tests, which is the subject of Chapter 5.
The rating process
Making judgements about people is a common feature of everyday life. We are continually evaluating what others say and do, in comments called for or not, offering criticism and feedback informally to friends and colleagues about their behaviour. Formal, institutional judgements figure prominently in our lives too. People pass driving tests, survive the probationary period in a new job, get promotions at work, succeed at interviews, win Oscars for performances in a film, win medals in diving competitions, and are released from prison for good behaviour. The judgement will in most cases have direct consequences for the person judged, and so issues of fairness arise, which most public procedures try to take account of in some way. Regrettably, it is easy to become aware of the way in which the idiosyncrasies of the rater or the rating process can determine the outcome unfairly. In international sporting contests such as the Olympic Games and World Cup soccer, the nationality of judges, referees or umpires, and their presumed and sometimes real biases become an issue, and attempts are made to mitigate their effects. All of us can probably recount instances of the benign or damaging role of particular raters in examination processes in which we have been involved. Many people have anecdotes of bizarre procedures for reaching rating decisions in various contexts, for example in job selection.
This chapter will discuss rating procedures used in language
assessment. (The terms ratings and raters will be used to refer to
the judgements and those who make them.) We will discuss the
necessity for, and pitfalls of, a rater-mediated approach to the
assessment of language. First, we will look at the procedures used
in judging, then at how judgements may be reported, and finally at threats to the fairness of the procedures and how these may be avoided or at least mitigated. We will consider in some detail three aspects of the validation of rating procedures: the establishment of rating protocols; exploring differences between individual raters, and mitigating their effects; and understanding interactions between raters and other features of the rating process (for example, the reactions of individual raters to particular topics or to speakers from a particular language background).
Establishing a rating procedure
Rater-mediated assessment is becoming more and more central to language teaching and learning. As communicative language teaching has increasingly focused on communicative performance in context, so rating the impact of that communication has become the focus of language assessment. Rater-mediated language assessment is also in line with institutional demands for accountability in education, as outcomes of educational processes are often described in terms of demonstrable practical competence in the learner. This competence is then verified through assessment.
Where assessments meet institutional requirements, for example for certification, as with any bureaucratic procedure there are set methods for yielding the judgement in question. These methods typically have three main aspects.
First, there is agreement about the conditions (including the length of time) under which the person's performance or behaviour is elicited, and/or is attended to by the rater. This may take the form of a formal examination, with set tasks and fixed amounts of time for the performances. Alternatively, it may involve a period of observation during instruction, or while candidates carry out relevant tasks and roles in the actual target performance context.
Second, certain features of the performance are agreed to be critical; the criteria for judging these will be determined and agreed. Usually this will involve considering various components of competence: fluency, accuracy, organization, sociocultural appropriateness, and so on. The weighting of each of the components of assessment becomes an issue. So does their relevance: an
increasingly important question in the validation of performance assessments is how the relevant criteria for assessing the performance are to be decided. The heart of the test construct lies here.
Finally, raters who have been trained to an agreed understanding of the criteria characterize a performance by allocating a grade or rating. This assumes the prior development of descriptive rating categories of some kind: 'competent', 'not competent', 'ready to cope with a university course', and so on.
The problem with raters
Introducing the rater into the assessment process is both necessary and problematic. It is problematic because ratings are necessarily subjective. Another way of saying this is that the rating given to a candidate is a reflection, not only of the quality of the performance, but of the qualities as a rater of the person who has judged it. The assumption in most rating schemes is that if the rating category labels are clear and explicit, and the rater is trained carefully to interpret them in accordance with the intentions of the test designers, and concentrates while doing the rating, then the rating process can be made objective. In other words, rating is essentially reduced to a process of the recognition of objective signs, with classification following automatically. In this view rating would resemble the process of chicken sexing, in which young chicks are inspected for the external visible signs of their sex (apparent only to the trained eye when chicks are very young), and allocated to male and female categories accordingly.
But the reality is that rating remains intractably subjective. The allocation of individuals to categories is not a deterministic process, driven by the objective, recognizable characteristics of performances, external to the rater. Rather, rating always contains a significant degree of chance, associated with the rater and other factors. The influence of these factors can be explored by thinking of rating as a probabilistic phenomenon, that is, exploring the probabilities of certain rating outcomes with particular raters, particular tasks, and so on. We can easily show this by looking at the way in which even trained raters differ in their handling of the allocation of individual performances in borderline cases. Close comparison of the ratings given by different raters in such cases will typically show that one rater will be consistently
inclined to assign a lower category to candidates whom another
rater puts into a higher one. The obvious result of this is that
whether a candidate is judged as meeting a particular standard or
not depends fortuitously on which rater assesses their work.
Worse (because this is less predictable), raters may not even be
self-consistent from one assessed performance to the next, or
from one rating occasion to another. Researchers have sometimes
been dismayed to learn that there is as much variation among
raters as there is variation between candidates.
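One way to make this probabilistic view concrete is a small simulation. The sketch below, with invented numbers, models the chance that a rater awards a pass as a logistic function of candidate ability minus rater severity, loosely in the spirit of multi-facet Rasch models used in measurement research; it illustrates the idea rather than describing any actual operational analysis.

```python
# A minimal sketch of rating as a probabilistic process, loosely in the
# spirit of a multi-facet Rasch model. All numbers are invented.
import math

def p_pass(ability: float, severity: float) -> float:
    """Probability that a rater of given severity passes a candidate."""
    return 1.0 / (1.0 + math.exp(-(ability - severity)))

borderline = 0.0            # a candidate right at the notional standard
lenient, harsh = -0.5, 0.5  # two trained but non-identical raters

print(f"lenient rater: P(pass) = {p_pass(borderline, lenient):.2f}")  # ~0.62
print(f"harsh rater:   P(pass) = {p_pass(borderline, harsh):.2f}")    # ~0.38
```

The same borderline performance has markedly different chances of success depending on which rater it happens to meet, which is precisely the unfairness described above.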
In the 1950s and 1960s, when concerns for reliability dominated language assessment, rater-mediated assessment was discouraged because of the problem of subjectivity. This led to a tendency to avoid direct testing. Thus, writing skills were assessed indirectly through examination of control over the grammatical system and knowledge of vocabulary. But increasingly it was felt that so much was lost by this restriction on the scope of assessment that the problem of subjectivity was something that had to be faced and managed. Particularly with the advent of communicative language teaching, with its emphasis on how linguistic knowledge is actually put to use, understanding and managing the rating process became an urgent necessity.
Establishing a framework for making judgements
In establishing a rating procedure, we need to consider the criteria
by which performances at a given level will be recognized, and
then to decide how many different levels of performance we wish
to distinguish. The answers to these questions will determine the
basic framework or orientation for the rating process. Deciding which of these orientations best fits a particular assessment setting will depend on the context and purpose of the assessment.
It is useful to view achievement as a continuum. The assessment system may recognize a number of different levels of achievement, in which case we then think of it as representing a ladder or scale. In other contexts, only one point on the continuum is of relevance, and a simple 'enough/not enough' distinction is all that needs to be made. In this case the testing system can best be thought of in terms of a hurdle or cut-point. These two possibilities are not of course contradictory, but are a little like different settings on a camera or microscope. We can stand back and look at
the whole continuum, or we can zoom in on one part of it. Each level of the ladder may be thought of as requiring a 'yes/no' decision ('enough/not enough') for that level.
We can illustrate the distinction between the hurdle and ladder perspectives by reference to two very different kinds of performance. Consider the driving test. Most people, given adequate preparation, would assume they could pass it. Although not everybody who passes the test has equal competence as a driver, the function of the test is to make a simple distinction between those who are safe on the roads and those who are not, rather than to distinguish degrees of competence in driving skill. Often, in hurdle assessments, as in the driving test, the assessment system is not intended to permanently exclude. In other words, every competent person should pass, and it is assumed that most people with adequate preparation will be capable of a competent performance, and derive the benefits of certification accordingly. The aim of the certification is to protect other people from incompetence. The assessment is essentially not competitive.
Many systems of assessment try to combine the characteristics of access and competition. For example, in the system of certification for competence in piano playing, a number of grades of performance are established, with relevant criteria defining each, and over a number of years a learner of the piano may proceed through the examinations for the grades. As the levels become more demanding, fewer people have the necessary motivation or opportunity to prepare for performance at such a level, or indeed even the necessary skill. The final stages of certification involve fiercely contested piano competitions where only the most brilliant will succeed, so resembling the Olympic context. But at levels below this, the 'grade' system of certification involves a principle of access: at each step of competence, judged in a 'yes'/'no' manner ('competent at this level' vs. 'not competent'), those with adequate preparation are likely to pass. The function of the assessment at a given level is not to make distinctions between candidates, other than a binary distinction between those who meet the requirements of the level and those who do not.
Language testing has examples of each of these kinds of framework for making judgements about individuals. In judgements of competence to perform particular kinds of occupational roles,
for example to work as a medical practitioner through the medium of a second language, where the communicative demands of the work or study setting to which access is sought are high, then the form of the judgement will be 'ready' or 'not ready', as in the driving test. Even though the amount of preparation is much greater, and what is demanded is much higher, we nevertheless expect each of the medical professionals who present for such a test to succeed in the end. Its function is not usually to exclude permanently those who need to demonstrate competence in the language in order to practise their profession, although tests may of course be used as instruments of such exclusion, as we shall see later, in Chapter 7. In contrast, in contexts where only a small percentage of candidates can be selected, for example in the awarding of competitive prizes or scholarships, the higher levels of achievement will become important as they are used to distinguish the most able of candidates from the rest. This is the case in contexts of achievement, for example in school-based language learning, or in vocational and workplace training.
Rating scales
Most often, frameworks for rating are designed as scales, as this
allows the greatest flexibility to the users, who may want to use
the multiple distinctions available from a scale, or who may
choose to focus on only one cut-point or region of the scale. The
preparation of such a scale involves developing level descriptors,
that is, describing in words performances that illustrate each level
of competence defined on the scale. For example, in the driving
test, performance at a passing level might be described as 'Can
drive in normal traffic conditions for 20 minutes making a range
of normal movements and dealing with a range of typical eventualities; and can cope with a limited number of frequently encountered suddenly emerging situations on the road.' This description will necessarily be abstracted from the experience of those familiar with the setting and its demands, in this case experienced driving instructors, and will have to be vetted by a relevant authority entrusted with (in this case) issuing a licence to drive based on the test performance.
An ordered series of such descriptions is known as a rating scale. A number of distinctions are usually made: rating scales typically have between 3 and 9 levels. Figure 4.1 gives an example of a summary rating scale developed by the author to describe levels of performance on an advanced level test of English as a second language for speaking skills in clinical settings.
Aspect of performance considered: overall communicative effectiveness

1 elementary level of communicative effectiveness
2 clearly could not cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
3 just below minimum competence needed to cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
4 has minimum competence needed to cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
5 could easily cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
6 near native communicative effectiveness

FIGURE 4.1 Rating scale, Occupational English Test for health professionals
This rating scale is used as part of a screening procedure (used to determine if an overseas trained health professional has the necessary minimum language skills to be admitted under supervision to the clinical setting). In this particular case, as the focus of the discriminations made in the scale is around a single point of minimum competence, the other levels tend to be defined in terms of their distance from this point. Most rating scales do not have such a single point of reference, and ideally the definition of each level should be independent of the ones above and below it on the scale. In fact, however, given the continuous nature of the scale, wordings frequently involve comparative statements, with one level described relative to one or more others: for example, in terms of greater or less control of features of the grammatical system, or pronunciation, and so on.
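To make the cut-point logic of such a screening procedure concrete, here is a minimal sketch that encodes the six levels of Figure 4.1 and applies the single decision point at level 4. The data structure, names, and decision wording are hypothetical illustrations, not the actual Occupational English Test procedure.

```python
# A sketch of the Figure 4.1 scale used as a screening cut-point.
# Level texts are abbreviated; the decision logic is an illustrative
# assumption, not the actual Occupational English Test procedure.

SCALE = {
    1: "elementary level of communicative effectiveness",
    2: "clearly could not cope in a bridging programme",
    3: "just below minimum competence needed to cope",
    4: "has minimum competence needed to cope",
    5: "could easily cope in a bridging programme",
    6: "near native communicative effectiveness",
}

MINIMUM_COMPETENCE = 4  # the single cut-point the scale is built around

def screen(rating: int) -> str:
    """Report a screening decision for a rating on the 1-6 scale."""
    if rating not in SCALE:
        raise ValueError(f"rating must be 1-6, got {rating}")
    decision = ("admit to supervised clinical practice"
                if rating >= MINIMUM_COMPETENCE else "do not admit")
    return f"level {rating} ({SCALE[rating]}): {decision}"

print(screen(3))  # just below the cut-point: do not admit
print(screen(5))  # above the cut-point: admit
```

Notice that levels 2, 3, and 5 are only meaningful relative to level 4, which is exactly the single-reference-point character of this scale described above.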
An important aspect of a scale is the way in which performance at the top end of the scale is defined. There is frequently an unacknowledged problem here. Rating scales often make reference to what are assumed to be the typical performances of native speakers or expert users of the language at the top end of the scale. That is, it is assumed that the performance of native speakers will be fundamentally unlike the performances of non-native speakers, who will tend gradually to approximate native speaker performance as their own proficiency increases. However, claims about the uniformly superior performance of these idealized native speakers have rarely been supported empirically. In fact, the studies that have been carried out typically show the performance of native speakers as highly variable, related to educational level, and covering a range of positions on the scale. In spite of this, the idealized view of native speaker performance still hovers inappropriately at the top of many rating scales.
The number of levels on a rating scale is also an important matter to consider, although the questions raised here are more a matter of practical utility than of theoretical validity. There is no point in proliferating descriptions outside the range of ability of interest. Having too few distinctions within the range of such ability is also frustrating, and the revision of rating scales often involves the creation of more distinctions.
The failure of rating scales to make distinctions sufficiently fine to capture progress being made by students is a frequent problem. It arises because the purposes of users of a single assessment instrument may be at odds. Teachers have continuous exposure to their students' achievements in the normal course of learning. In the process, they receive ongoing informal confirmation of learner progress which may not be adequately reflected in a category difference as described by a scale. Imagine handing parents who are seeking evidence of their child's growth a measuring stick with marks on it only a foot (30 centimetres) apart, the measure not allowing any other distinction to be made. The parents can observe the growth of the child: they have independent evidence in the comments of relatives, or the fact that the child has grown out of a set of clothes. Yet in terms of the measuring stick no
growth can be recorded because the child has not passed the
magic cut-point into the next adjacent category of measurement.
Teachers restricted to reporting achievement only in terms of broad rating scale categories are in a similar position. Most rating scales used in public educational settings are imposed by government authorities for purposes of administrative efficiency and financial accountability, for which fine-grained distinctions are unnecessary. The scales are used to report the achievements of the educational system in terms of changes in the proficiency of large numbers of learners over relatively extended periods of time. The government needs the 'big picture' of learner (and teacher) achievement in order to satisfy itself that its educational budget is yielding results. Teachers working with these government-imposed, scale-based reporting mechanisms experience frustrations with the lack of fine distinctions on the scale. The coarse-grained character of the categories may hardly do justice to the teachers' sense of the growth and learning that has been achieved in a course. The purposes of the two groups (administrators, who are interested in financial accountability, and teachers, who are interested in the learning process) may be at odds in such a case.
The wording of rating scales may vary according to the purposes for which they are to be used. On the one hand, scales are used to guide and constrain the behaviour of raters, and on the other, they are used to report the outcome of a rating process to score users: teachers, employers, admission authorities, parents, and so on. As a result different versions of a rating scale are often created for different users.
Holistic and analytic ratings
Performances are complex. Judgement of performances involves balancing perceptions of a number of different features of the performance. In speaking, a person may be fluent, but hard to understand; another may be correct, but stilted. Thus rather than getting raters to record a single impression of the impact of the performance as a whole (holistic rating), an alternative approach involves getting raters to provide separate assessments for each of a number of aspects of performance. For example, in speaking, raters may be asked to provide separate assessments of fluency, appropriateness, pronunciation, control of formal resources of grammar, vocabulary, and the like. This latter approach is known as analytic rating, and requires the development of a number of separate rating scales for each aspect assessed. Even where analytic rating is carried out, it is usual to combine the scores for the separate aspects into a single overall score for reporting purposes. This single reporting scale may maintain its analytic orientation in that the overall characterization of a level description may consist of a weaving together of strands relating to separate aspects of performance.
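Combining analytic ratings into a single reported score is, at bottom, a weighted sum, and the weighting question raised earlier in this chapter becomes visible when it is written out. The sketch below uses invented aspect names, weights, and ratings purely for illustration; no actual scheme is being described.

```python
# A minimal sketch of combining analytic ratings into one overall score.
# Aspects, weights, and ratings are invented for illustration; a real
# scheme would fix these as part of the test specification.

analytic_ratings = {        # each aspect rated on, say, a 1-6 scale
    "fluency": 4,
    "appropriateness": 5,
    "pronunciation": 3,
    "grammar": 4,
    "vocabulary": 5,
}

weights = {                 # the weighting of components is itself a
    "fluency": 0.25,        # test-design decision, as noted earlier
    "appropriateness": 0.15,
    "pronunciation": 0.15,
    "grammar": 0.25,
    "vocabulary": 0.20,
}

assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights sum to one

overall = sum(analytic_ratings[a] * weights[a] for a in analytic_ratings)
print(f"overall score: {overall:.2f}")  # 4.20 with these invented numbers
```

Changing the weights changes who passes, which is one reason the relevance and weighting of criteria were described above as lying at the heart of the test construct.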
Rater training
An important way to improve the quality of rater-mediated assessment schemes is to provide initial and ongoing training to raters. This usually takes the form of a moderation meeting. At such a meeting, individual raters are each initially asked to provide independent ratings for a series of performances at different levels. They are then confronted with the differences between the ratings they have given and those given by the other raters in the group. Discrepancies are noted and are discussed in detail, with