part of the test, the type of materials with which candidates will
have to engage, the source of such materials if authentic, the extent
to which authentic materials may be altered, the response format,
the test rubric, and how responses are to be scored. Test materials
are then written according to the specifications, which may of
course themselves be revised in the light of the writing process.
Test trials
The fourth stage is trialling or trying out the test materials and procedures prior to their use under operational conditions. This stage involves careful design of data collection to see how well the test is working. A trial population will have to be found, that is, a group of people who resemble in all relevant respects (age, learning background, general proficiency level, etc.) the target test population. With discrete point test items, a trial population of at least 100, and frequently far more than this, is required. Careful statistical analysis is carried out of responses to items to investigate their quality. Some of the procedures and concepts involved are explained in Chapter 6.
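To make the kind of item analysis just mentioned concrete, here is a minimal sketch of two classic statistics for discrete-point items: facility (the proportion of candidates answering correctly) and discrimination (how well the item separates stronger from weaker candidates). The response data and function names are invented for illustration and are not drawn from any actual trial.

```python
# A minimal sketch of classical item analysis for discrete-point items.
# Rows are trial candidates, columns are items; 1 = correct, 0 = incorrect.
# All data and names here are hypothetical illustrations.
from statistics import mean, pstdev

responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

def facility(item: int) -> float:
    """Proportion of candidates answering the item correctly."""
    return mean(row[item] for row in responses)

def discrimination(item: int) -> float:
    """Point-biserial correlation between item score and total score.
    (Uncorrected: the total here includes the item itself.)"""
    totals = [sum(row) for row in responses]
    scores = [row[item] for row in responses]
    m, t = mean(scores), mean(totals)
    cov = mean((s - m) * (x - t) for s, x in zip(scores, totals))
    sd_s, sd_t = pstdev(scores), pstdev(totals)
    return cov / (sd_s * sd_t) if sd_s and sd_t else 0.0

for i in range(4):
    print(f"item {i}: facility={facility(i):.2f}, "
          f"discrimination={discrimination(i):.2f}")
```

With a real trial population of at least 100, as suggested above, these figures become stable enough to flag items that are too easy, too hard, or poorly discriminating.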
Where subjective judgements of speaking and writing are involved, there is a need for training of those making the judgements, and investigation of the interpretability and workability of the criteria and rating scales to be used. These issues are dealt with in detail in Chapter 4.
In addition, test-taker feedback should be gathered from the
trial subjects, often by a simple questionnaire. This will
include
questions on perceptions of the level of difficulty of particular
questions, the clarity of the rubrics, and general attitude to the
materials and tasks. Subjects can quickly spot things that are
problematic about materials which test developers struggle to see.
Materials and procedures will be revised in the light of the
trials, in preparation for the operational use of the test. Data from
actual test performances needs to be systematically gathered and
analysed, to investigate the validity and usefulness of the test
under operational conditions. Periodically, the results of this data
gathering may lead to substantial revision of the test design, and
the testing cycle will recommence. In any case, all new versions of
the test need to be trialled, and monitored operationally. It is in
the context of the testing cycle that most research on language
testing is carried out.
Conclusion
In this chapter we have examined the process through which a testing procedure is conceptualized, developed, and put into operation. We have considered test content as an expression of test construct, and looked at how that content may be determined, especially through the procedures of job analysis. We have considered that the way in which candidates interact with test materials can also replicate real-world processes, and considered the issues of authenticity that arise. Often, in the interests of economy or manageability, particularly in large-scale tests, such replication is unaffordable, and more conventional response formats are the only option. We have considered a range of such formats here. In Chapter 4 we will consider in much greater detail the concepts and methods involved in judgements of performance in speaking and writing, and the issues of fairness that arise.
Throughout the testing cycle data for the investigation of test qualities are generated automatically in the form of responses to test items. The use of test data by researchers to question the fairness of the test takes us into the area of the validation of tests, which is the subject of Chapter 5.
The rating process
Making judgements about people is a common feature of everyday life. We are continually evaluating what others say and do, in comments called for or not, offering criticism and feedback informally to friends and colleagues about their behaviour. Formal, institutional judgements figure prominently in our lives too. People pass driving tests, survive the probationary period in a new job, get promotions at work, succeed at interviews, win Oscars for performances in a film, win medals in diving competitions, and are released from prison for good behaviour. The judgement will in most cases have direct consequences for the person judged, and so issues of fairness arise, which most public procedures try to take account of in some way. Regrettably, it is easy to become aware of the way in which the idiosyncrasies of the rater or the rating process can determine the outcome unfairly. In international sporting contests such as the Olympic Games and World Cup soccer, the nationality of judges, referees or umpires, and their presumed and sometimes real biases become an issue, and attempts are made to mitigate their effects. All of us can probably recount instances of the benign or damaging role of particular raters in examination processes in which we have been involved. Many people have anecdotes of bizarre procedures for reaching rating decisions in various contexts, for example in job selection.
This chapter will discuss rating procedures used in language
assessment. (The terms ratings and raters will be used to refer to
the judgements and those who make them.) We will discuss the
necessity for, and pitfalls of, a rater-mediated approach to the
assessment of language. First, we will look at the procedures used
in judging, then at how judgements may be reported, and finally at threats to the fairness of the procedures and how these may be avoided or at least mitigated. We will consider in some detail three aspects of the validation of rating procedures: the establishment of rating protocols; exploring differences between individual raters, and mitigating their effects; and understanding interactions between raters and other features of the rating process (for example, the reactions of individual raters to particular topics or to speakers from a particular language background).
Establishing a rating procedure
Rater-mediated assessment is becoming more and more central to language teaching and learning. As communicative language teaching has increasingly focused on communicative performance in context, so rating the impact of that communication has become the focus of language assessment. Rater-mediated language assessment is also in line with institutional demands for accountability in education, as outcomes of educational processes are often described in terms of demonstrable practical competence in the learner. This competence is then verified through assessment.
Where assessments meet institutional requirements, for example for certification, as with any bureaucratic procedure there are set methods for yielding the judgement in question. These methods typically have three main aspects.
First, there is agreement about the conditions (including the length of time) under which the person's performance or behaviour is elicited, and/or is attended to by the rater. This may take the form of a formal examination, with set tasks and fixed amounts of time for the performances. Alternatively, it may involve a period of observation during instruction, or while candidates carry out relevant tasks and roles in the actual target performance context.
Second, certain features of the performance are agreed to be critical; the criteria for judging these will be determined and agreed. Usually this will involve considering various components of competence: fluency, accuracy, organization, sociocultural appropriateness, and so on. The weighting of each of the components of assessment becomes an issue. So does their relevance: an
increasingly important question in the validation of performance assessments is how the relevant criteria for assessing the performance are to be decided. The heart of the test construct lies here.
Finally, raters who have been trained to an agreed understanding of the criteria characterize a performance by allocating a grade or rating. This assumes the prior development of descriptive rating categories of some kind: 'competent', 'not competent', 'ready to cope with a university course', and so on.
The problem with raters
Introducing the rater into the assessment process is both necessary and problematic. It is problematic because ratings are necessarily subjective. Another way of saying this is that the rating given to a candidate is a reflection, not only of the quality of the performance, but of the qualities as a rater of the person who has judged it. The assumption in most rating schemes is that if the rating category labels are clear and explicit, and the rater is trained carefully to interpret them in accordance with the intentions of the test designers, and concentrates while doing the rating, then the rating process can be made objective. In other words, rating is essentially reduced to a process of the recognition of objective signs, with classification following automatically. In this view rating would resemble the process of chicken sexing, in which young chicks are inspected for the external visible signs of their sex (apparent only to the trained eye when chicks are very young), and allocated to male and female categories accordingly.
But the reality is that rating remains intractably subjective. The allocation of individuals to categories is not a deterministic process, driven by the objective, recognizable characteristics of performances, external to the rater. Rather, rating always contains a significant degree of chance, associated with the rater and other factors. The influence of these factors can be explored by thinking of rating as a probabilistic phenomenon, that is, exploring the probabilities of certain rating outcomes with particular raters, particular tasks, and so on. We can easily show this by looking at the way in which even trained raters differ in their handling of the allocation of individual performances in borderline cases. Close comparison of the ratings given by different raters in such cases will typically show that one rater will be consistently
inclined to assign a lower category to candidates whom another
rater puts into a higher one. The obvious result of this is that
whether a candidate is judged as meeting a particular standard or
not depends fortuitously on which rater assesses their work.
Worse (because this is less predictable), raters may not even be
self-consistent from one assessed performance to the next, or
from one rating occasion to another. Researchers have sometimes
been dismayed to learn that there is as much variation among
raters as there is variation between candidates.
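One way to make this probabilistic view concrete is a small simulation. The sketch below, with invented numbers, models the chance that a rater awards a pass as a logistic function of candidate ability minus rater severity, loosely in the spirit of multi-facet Rasch models used in measurement research; it illustrates the idea rather than describing any actual operational analysis.

```python
# A minimal sketch of rating as a probabilistic process, loosely in the
# spirit of a multi-facet Rasch model. All numbers are invented.
import math

def p_pass(ability: float, severity: float) -> float:
    """Probability that a rater of given severity passes a candidate."""
    return 1.0 / (1.0 + math.exp(-(ability - severity)))

borderline = 0.0            # a candidate right at the notional standard
lenient, harsh = -0.5, 0.5  # two trained but non-identical raters

print(f"lenient rater: P(pass) = {p_pass(borderline, lenient):.2f}")  # ~0.62
print(f"harsh rater:   P(pass) = {p_pass(borderline, harsh):.2f}")    # ~0.38
```

The same borderline performance has markedly different chances of success depending on which rater it happens to meet, which is precisely the unfairness described above.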
In the 1950s and 1960s, when concerns for reliability dominated language assessment, rater-mediated assessment was discouraged because of the problem of subjectivity. This led to a tendency to avoid direct testing. Thus, writing skills were assessed indirectly through examination of control over the grammatical system and knowledge of vocabulary. But increasingly it was felt that so much was lost by this restriction on the scope of assessment that the problem of subjectivity was something that had to be faced and managed. Particularly with the advent of communicative language teaching, with its emphasis on how linguistic knowledge is actually put to use, understanding and managing the rating process became an urgent necessity.
Establishing a framework for making judgements
In establishing a rating procedure, we need to consider the criteria
by which performances at a given level will be recognized, and
then to decide how many different levels of performance we wish
to distinguish. The answers to these questions will determine the
basic framework or orientation for the rating process. Deciding which of these orientations best fits a particular assessment setting will depend on the context and purpose of the assessment.
It is useful to view achievement as a continuum. The assessment system may recognize a number of different levels of achievement, in which case we then think of it as representing a ladder or scale. In other contexts, only one point on the continuum is of relevance, and a simple 'enough/not enough' distinction is all that needs to be made. In this case the testing system can best be thought of in terms of a hurdle or cut-point. These two possibilities are not of course contradictory, but are a little like different settings on a camera or microscope. We can stand back and look at
the whole continuum, or we can zoom in on one part of it. Each level of the ladder may be thought of as requiring a 'yes/no' decision ('enough/not enough') for that level.
We can illustrate the distinction between the hurdle and ladder perspectives by reference to two very different kinds of performance. Consider the driving test. Most people, given adequate preparation, would assume they could pass it. Although not everybody who passes the test has equal competence as a driver, the function of the test is to make a simple distinction between those who are safe on the roads and those who are not, rather than to distinguish degrees of competence in driving skill. Often, in hurdle assessments, as in the driving test, the assessment system is not intended to permanently exclude. In other words, every competent person should pass, and it is assumed that most people with adequate preparation will be capable of a competent performance, and derive the benefits of certification accordingly. The aim of the certification is to protect other people from incompetence. The assessment is essentially not competitive.
Many systems of assessment try to combine the characteristics of access and competition. For example, in the system of certification for competence in piano playing, a number of grades of performance are established, with relevant criteria defining each, and over a number of years a learner of the piano may proceed through the examinations for the grades. As the levels become more demanding, fewer people have the necessary motivation or opportunity to prepare for performance at such a level, or indeed even the necessary skill. The final stages of certification involve fiercely contested piano competitions where only the most brilliant will succeed, so resembling the Olympic context. But at levels below this, the 'grade' system of certification involves a principle of access: at each step of competence, judged in a 'yes'/'no' manner ('competent at this level' vs. 'not competent'), those with adequate preparation are likely to pass. The function of the assessment at a given level is not to make distinctions between candidates, other than a binary distinction between those who meet the requirements of the level and those who do not.
Language testing has examples of each of these kinds of framework for making judgements about individuals. In judgements of competence to perform particular kinds of occupational roles,
for example to work as a medical practitioner through the medium of a second language, where the communicative demands of the work or study setting to which access is sought are high, then the form of the judgement will be 'ready' or 'not ready', as in the driving test. Even though the amount of preparation is much greater, and what is demanded is much higher, we nevertheless expect each of the medical professionals who present for such a test to succeed in the end. Its function is not usually to exclude permanently those who need to demonstrate competence in the language in order to practise their profession, although tests may of course be used as instruments of such exclusion, as we shall see later, in Chapter 7. In contrast, in contexts where only a small percentage of candidates can be selected, for example in the awarding of competitive prizes or scholarships, the higher levels of achievement will become important as they are used to distinguish the most able of candidates from the rest. This is the case in contexts of achievement, for example in school-based language learning, or in vocational and workplace training.
Rating scales
Most often, frameworks for rating are designed as scales, as this
allows the greatest flexibility to the users, who may want to use
the multiple distinctions available from a scale, or who may
choose to focus on only one cut-point or region of the scale. The
preparation of such a scale involves developing level descriptors,
that is, describing in words performances that illustrate each level
of competence defined on the scale. For example, in the driving
test, performance at a passing level might be described as 'Can
drive in normal traffic conditions for 20 minutes making a range
of normal movements and dealing with a range of typical eventualities; and can cope with a limited number of frequently encountered suddenly emerging situations on the road.' This description will necessarily be abstracted from the experience of those familiar with the setting and its demands, in this case experienced driving instructors, and will have to be vetted by a relevant authority entrusted with (in this case) issuing a licence to drive based on the test performance.
An ordered series of such descriptions is known as a rating scale. A number of distinctions are usually made: rating scales typically have between 3 and 9 levels. Figure 4.1 gives an example of a summary rating scale developed by the author to describe levels of performance on an advanced level test of English as a second language for speaking skills in clinical settings.
Aspect of performance considered: overall communicative effectiveness

1 elementary level of communicative effectiveness
2 clearly could not cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
3 just below minimum competence needed to cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
4 has minimum competence needed to cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
5 could easily cope in a bridging programme in a clinical setting involving interactions with patients and colleagues
6 near native communicative effectiveness

FIGURE 4.1 Rating scale, Occupational English Test for health professionals
This rating scale is used as part of a screening procedure (used to determine if an overseas trained health professional has the necessary minimum language skills to be admitted under supervision to the clinical setting). In this particular case, as the focus of the discriminations made in the scale is around a single point of minimum competence, the other levels tend to be defined in terms of their distance from this point. Most rating scales do not have such a single point of reference, and ideally the definition of each level should be independent of the ones above and below it on the scale. In fact, however, given the continuous nature of the scale, wordings frequently involve comparative statements, with one level described relative to one or more others: for example, in terms of greater or less control of features of the grammatical system, or pronunciation, and so on.
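To make the cut-point logic of such a screening procedure concrete, here is a minimal sketch that encodes the six levels of Figure 4.1 and applies the single decision point at level 4. The data structure, names, and decision wording are hypothetical illustrations, not the actual Occupational English Test procedure.

```python
# A sketch of the Figure 4.1 scale used as a screening cut-point.
# Level texts are abbreviated; the decision logic is an illustrative
# assumption, not the actual Occupational English Test procedure.

SCALE = {
    1: "elementary level of communicative effectiveness",
    2: "clearly could not cope in a bridging programme",
    3: "just below minimum competence needed to cope",
    4: "has minimum competence needed to cope",
    5: "could easily cope in a bridging programme",
    6: "near native communicative effectiveness",
}

MINIMUM_COMPETENCE = 4  # the single cut-point the scale is built around

def screen(rating: int) -> str:
    """Report a screening decision for a rating on the 1-6 scale."""
    if rating not in SCALE:
        raise ValueError(f"rating must be 1-6, got {rating}")
    decision = ("admit to supervised clinical practice"
                if rating >= MINIMUM_COMPETENCE else "do not admit")
    return f"level {rating} ({SCALE[rating]}): {decision}"

print(screen(3))  # just below the cut-point: do not admit
print(screen(5))  # above the cut-point: admit
```

Notice that levels 2, 3, and 5 are only meaningful relative to level 4, which is exactly the single-reference-point character of this scale described above.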
An important aspect of a scale is the way in which performance at the top end of the scale is defined. There is frequently an unacknowledged problem here. Rating scales often make reference to what are assumed to be the typical performances of native speakers or expert users of the language at the top end of the scale. That is, it is assumed that the performance of native speakers will be fundamentally unlike the performances of non-native speakers, who will tend gradually to approximate native speaker performance as their own proficiency increases. However, claims about the uniformly superior performance of these idealized native speakers have rarely been supported empirically. In fact, the studies that have been carried out typically show the performance of native speakers as highly variable, related to educational level, and covering a range of positions on the scale. In spite of this, the idealized view of native speaker performance still hovers inappropriately at the top of many rating scales.
The number of levels on a rating scale is also an important matter to consider, although the questions raised here are more a matter of practical utility than of theoretical validity. There is no point in proliferating descriptions outside the range of ability of interest. Having too few distinctions within the range of such ability is also frustrating, and the revision of rating scales often involves the creation of more distinctions.
The failure of rating scales to make distinctions sufficiently fine to capture progress being made by students is a frequent problem. It arises because the purposes of users of a single assessment instrument may be at odds. Teachers have continuous exposure to their students' achievements in the normal course of learning. In the process, they receive ongoing informal confirmation of learner progress which may not be adequately reflected in a category difference as described by a scale. Imagine handing parents who are seeking evidence of their child's growth a measuring stick with marks on it only a foot (30 centimetres) apart, the measure not allowing any other distinction to be made. The parents can observe the growth of the child: they have independent evidence in the comments of relatives, or the fact that the child has grown out of a set of clothes. Yet in terms of the measuring stick no
growth can be recorded because the child has not passed the
magic cut-point into the next adjacent category of measurement.
Teachers restricted to reporting achievement only in terms of broad rating scale categories are in a similar position. Most rating scales used in public educational settings are imposed by government authorities for purposes of administrative efficiency and financial accountability, for which fine-grained distinctions are unnecessary. The scales are used to report the achievements of the educational system in terms of changes in the proficiency of large numbers of learners over relatively extended periods of time. The government needs the 'big picture' of learner (and teacher) achievement in order to satisfy itself that its educational budget is yielding results. Teachers working with these government-imposed, scale-based reporting mechanisms experience frustrations with the lack of fine distinctions on the scale. The coarse-grained character of the categories may hardly do justice to the teachers' sense of the growth and learning that has been achieved in a course. The purposes of the two groups (administrators, who are interested in financial accountability, and teachers, who are interested in the learning process) may be at odds in such a case.
The wording of rating scales may vary according to the purposes for which they are to be used. On the one hand, scales are used to guide and constrain the behaviour of raters, and on the other, they are used to report the outcome of a rating process to score users: teachers, employers, admission authorities, parents, and so on. As a result different versions of a rating scale are often created for different users.
Holistic and analytic ratings
Performances are complex. Judgement of performances involves balancing perceptions of a number of different features of the performance. In speaking, a person may be fluent, but hard to understand; another may be correct, but stilted. Thus rather than getting raters to record a single impression of the impact of the performance as a whole (holistic rating), an alternative approach involves getting raters to provide separate assessments for each of a number of aspects of performance. For example, in speaking, raters may be asked to provide separate assessments of fluency, appropriateness, pronunciation, control of formal resources of grammar, vocabulary, and the like. This latter approach is known as analytic rating, and requires the development of a number of separate rating scales for each aspect assessed. Even where analytic rating is carried out, it is usual to combine the scores for the separate aspects into a single overall score for reporting purposes. This single reporting scale may maintain its analytic orientation in that the overall characterization of a level description may consist of a weaving together of strands relating to separate aspects of performance.
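Combining analytic ratings into a single reported score is, at bottom, a weighted sum, and the weighting question raised earlier in this chapter becomes visible when it is written out. The sketch below uses invented aspect names, weights, and ratings purely for illustration; no actual scheme is being described.

```python
# A minimal sketch of combining analytic ratings into one overall score.
# Aspects, weights, and ratings are invented for illustration; a real
# scheme would fix these as part of the test specification.

analytic_ratings = {        # each aspect rated on, say, a 1-6 scale
    "fluency": 4,
    "appropriateness": 5,
    "pronunciation": 3,
    "grammar": 4,
    "vocabulary": 5,
}

weights = {                 # the weighting of components is itself a
    "fluency": 0.25,        # test-design decision, as noted earlier
    "appropriateness": 0.15,
    "pronunciation": 0.15,
    "grammar": 0.25,
    "vocabulary": 0.20,
}

assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights sum to one

overall = sum(analytic_ratings[a] * weights[a] for a in analytic_ratings)
print(f"overall score: {overall:.2f}")  # 4.20 with these invented numbers
```

Changing the weights changes who passes, which is one reason the relevance and weighting of criteria were described above as lying at the heart of the test construct.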
Rater training
An important way to improve the quality of rater-mediated assessment schemes is to provide initial and ongoing training to raters. This usually takes the form of a moderation meeting. At such a meeting, individual raters are each initially asked to provide independent ratings for a series of performances at different levels. They are then confronted with the differences between the ratings they have given and those given by the other raters in the group. Discrepancies are noted and are discussed in detail, with