part of a larger area of research on the impact of assessment procedures on teaching and learning, and more broadly on society as a whole. The social context of assessment will be considered in detail in Chapter 7.
Conclusion
In this chapter we have examined the need for questioning the bases for inferences about candidate abilities residing in test procedures, and the way in which these inferences may be at risk from aspects of test design and test method, or lack of clarity in our thinking about what we are measuring. Efforts to establish the validity of tests have generated much of what constitutes the field of language testing research. Such research involves two primary techniques: speculation and empiricism. Speculation here refers to reasoning and logical analysis about the nature of language and language use, and of the nature of performance, of the type that we outlined in Chapter 2. Empiricism means subjecting such theorizing and specific implications of particular testing practices to examination in the light of data from test trials and operational test administrations. Thus, as an outcome of the test development cycle, language testing research involves the formation of hypotheses about the nature of language ability, and putting such hypotheses to the test. In this way, language testing is rescued from being a merely technical activity and constitutes a site for research activity of a fundamental nature in applied linguistics.
6
Measurement
Introduction
Assessment usually involves allocating a score, an attractively
simple number. Gertrude Stein tells us that 'A rose is a rose is a
rose', but measurement people (in their unimaginative way) tell
us that a score is not a score is not a score: scores can be deceptive.
For example, when different raters give the same score, do they
mean the same thing? In tests with several parts and multiple
items, how consistent are score patterns across different parts of a
test? Can we add scores from the different parts, or across tests of
different sub-skills, or are they measuring such different things
that they are incommensurable, cannot be talked about in the
same breath? What do the scores on a test tell us about its quality,
and its suitability for its intended purpose? These are questions
addressed by measurement, the theoretical and empirical analysis
of scores and score meaning.
Often, when people think (if they do) about testing, they perceive it as a dauntingly technical field, and it is often the measurement aspect of the field that puts people off. 'Means', 'percentiles', 'standard deviations', statistics: these inspire a lack of confidence that one could ever (or indeed would ever want to) engage successfully with testing as an area of knowledge and expertise. Yet, curiously, concepts from the field of measurement can be found frequently in everyday conversation: 'She is of above average intelligence.' 'He topped his class.' 'It's like saying that these apples are not very good oranges.' 'He's not a reliable judge.' It is not so much, then, that people are not interested in the questions that measurement asks, as that they are daunted by the way it goes about answering them, by its procedures and language. The aim of this chapter is to give a brief introduction to a small selection of measurement concepts and procedures commonly used in language assessment, and in particular to make the reader feel that they are accessible and worth understanding.
Measurement
Measurement investigates the quality of the process of assessment by looking at scores. Two main steps are involved:
1 Quantification, that is, the assigning of numbers or scores to various outcomes of assessment. The set of scores available for analysis when data are gathered from a number of test-takers is known as a data matrix.
2 Checking for various kinds of mathematical and statistical patterning within the matrix in order to investigate the extent to which necessary properties (for example, consistency of performance by candidates, or by judges) are present in the assessment.
The aim of these procedures is to achieve quality control, that is, to improve the meaningfulness and fairness of the conclusions reached about individual candidates (the validity of the test). Measurement procedures have no rationale other than to underpin validity.
Quality control for raters
As an example of what measurement expertise can contribute to our understanding of language tests, and our ability to develop fair and meaningful tests, we will look at the question of quality control procedures for raters. To what extent is there agreement between raters, and where there is disagreement, what can be done about it?
As investigation of rater agreement depends on the comparison of ratings, the first step involves careful data collection. A rating design is prepared in which raters are asked to carry out a number of ratings, with overlap between raters so that they each independently rate the same performances. In this way the ratings of one rater can be compared with the ratings of others.
Imagine a rating system (suggestive of the fiction of Franz Kafka) in which the ratings which candidates get depend not at all on the quality of their performances, but entirely on the whim of the rater. Occasionally, the rating would (by chance) be fair, but mostly it would not, and one would never know which rating accidentally reflected the candidate's ability, and which did not. The ratings would be entirely unreliable. Looked at mathematically, the ratings of one rater for a set of performances would bear little relationship to those of another, and would not be predictable from them. The reason for this is that the only thing causing differences in scores is the whim of individual raters, not the quality of the performance, to which the rater is indifferent.
Imagine the opposite (and equally fanciful) case of the ideal rating system. In this case, the only thing driving the ratings is the quality of the performance, so it shouldn't matter who the judge is, as he/she will recognize that quality and allocate the performance to the appropriate rating category accordingly. Looked at mathematically, in such a situation the ratings of any individual rater for a set of performances would be perfectly predictable from knowledge of the ratings given to those performances by another rater; they would be identical. If I wanted to know how candidate Laura fared with Rater B, I need only find out how she fared with Rater A and I would know.
In reality, the situation will lie somewhere between these two
extremes. But exactly where? How much dependable information
on the quality of performances do scores from a rater contain,
and how much do they reflect the whim of that rater? Measurement
methods can help us tackle this question very precisely. They can
do so because they can draw on mathematical methods for
exploring the extent to which one set of measures is predictable
from another set for the same class of individuals or objects.
Such mathematical methods for establishing predictable numerical relations of this kind originated in the rather prosaic field of agriculture, in order to explore the predictive relationship between varying amounts of fertilizer and associated crop yields. But the methods apply equally well to human beings, for example, in working out the extent to which the weight of a set of adult males of a given age group is predictable from their height. A set of statistics or single summary figures has been developed to capture any such predictive relationship. One of these, the correlation coefficient r, is frequently used in language assessment. It expresses the extent to which one score set is knowable from another, and uses a scale from 0 (no correspondence between the score sets at all, as in the Kafkaesque situation) to 1 (perfect correspondence, as in the ideal rating system). When used to express the extent of predictability of ratings between raters, and hence inter-rater agreement, this coefficient is called a reliability coefficient and expresses inter-rater reliability. Let us say we calculate this coefficient for each pairing of raters who are taking part in the rating scheme, and come up with some figures on the 0 to 1 scale. How are we to interpret these figures? And what level of reliability as expressed by the statistic should we be demanding of these raters?
Benchmarks for minimum acceptable inter-rater agreement range from 0.7 to 0.9 on this scale, depending on what is at stake, and what other information about candidates we may have (for example, their scores on other parts of the test). 0.7 represents a rock-bottom minimum of acceptable agreement between raters: this value can be understood as representing about 50% agreement and 50% disagreement between a pair of raters, hardly impressive. 0.9 is a much more satisfactory level (representing about 80% agreement, 20% disagreement overall); but achieving this level among raters may involve careful attention to the clear wording of criteria and rigorous training of raters in their interpretation.
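To make the arithmetic concrete, here is a minimal sketch (not part of the original text) of how such an inter-rater correlation might be computed for two raters who have each scored the same set of performances. The ratings, and the resulting coefficient, are invented for illustration.

# Illustrative only: Pearson correlation between two raters' scores
# for the same set of performances. The ratings are invented.

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4]   # hypothetical ratings on a 1-5 scale
rater_b = [4, 2, 5, 2, 3, 3, 4, 1, 2, 5]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater reliability (r) = {r:.2f}")   # about 0.89 for these invented ratings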
Obviously, it is useful to have a commonly understood scale for expressing the degree of rater agreement in this way. It allows for ready communication about the extent to which one can depend on ratings from any assessment scheme involving raters, and to set standards. It also allows you to study the impact of rater training in improving rater reliability, to identify individual raters whose ratings are inconsistent with those of others, to provide certification for consistent raters, and to have confidence, once overall levels of agreement are high, in the workability of the rating scheme.
Correlation coefficients are not the only means of studying agreement between raters. When a single classification decision is to be made, a classification analysis can be carried out. This is a very simple procedure which can easily be done by hand. Imagine two raters (A and B) each of whom independently rates a set of performances from 30 candidates. They are required to say whether the performance demonstrates a required level of competence, or otherwise. A table is drawn up, setting out the rating categories ('Competent'/'Not Competent') available to each rater, and the frequency of agreement and disagreement between the raters, as in Figure 6.1.
                           Rater A
                           Competent      Not competent
Rater B   Competent        13             3
          Not competent    4              10

FIGURE 6.1  Classification analysis for two raters
The pairs of ratings for each candidate's performance are considered in turn. Where there is agreement that the performance demonstrates competence, a mark is made in the upper left hand cell in the table; where the raters agree that the performance fails to demonstrate competence, a mark is made in the lower right cell. The cases of disagreement are similarly noted. The number of marks in each cell is then totalled. In this case, the raters agreed about 23 of the 30 performances, and disagreed about 7. We can then report this as percentage agreement: 23/30 = 77%.
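The same tallying procedure can be sketched in a few lines of code. The data below are invented so that the cell totals match Figure 6.1, and the variable names are my own.

# Illustrative sketch of a classification analysis for two raters.
# Each entry is a (Rater A, Rater B) decision for one candidate;
# 'C' = Competent, 'N' = Not competent. The data are invented so that
# the cell totals match Figure 6.1 (13, 3, 4, 10).

from collections import Counter

decisions = [('C', 'C')] * 13 + [('N', 'C')] * 3 + [('C', 'N')] * 4 + [('N', 'N')] * 10

cells = Counter(decisions)             # frequencies for the 2 x 2 table
agreements = cells[('C', 'C')] + cells[('N', 'N')]
total = len(decisions)

print(cells)
print(f"percentage agreement = {100 * agreements / total:.0f}%")   # 77%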
Where more than two classification categories are available, the above kind of information on frequency of misclassification can be complemented by information on how far apart the raters were in particular instances: one level apart, two levels apart, and so on. This information can be used in rater training, rater certification, and research, as with the inter-rater reliability coefficients discussed above.
There is a range of further and more complex statistical analysis procedures for the investigation of ratings which we need not go into here: they can be taken as more elaborate variations on the same basic themes.
Investigating the properties of individual test items
While investigating rater characteristics is important in guaranteeing the meaningfulness and fairness of assessment in performance tests, other kinds of quality control procedures are necessary in paper-and-pencil tests (for this distinction, see Chapter 1; for item formats, see Chapter 4). In tests with a number of individual objectively scored test items, for example, in tests of language comprehension, or tests of knowledge of individual points of grammar or vocabulary, it is usual to carry out a procedure known as item analysis. This procedure involves the careful analysis of score patterns on each of the test items. The analysis tells us how well each item is working, that is, the contribution it is making to the overall picture of candidates' ability emerging from the test.
Item analysis is a normal part of test development. Before a test is introduced in its final format, a pilot version of the test is developed. This will contain a number of draft items (many more than are needed, so that only the best ones will survive the piloting), possibly in a variety of item formats of interest. This version is then taken by a group of individuals with the same learner profile as the ultimate test-takers (the number has to be sufficiently large for analyses of patterns of responses to items to be possible). This stage of test development is known as trialling or trying out. The effectiveness of items (and hence of formats) is evaluated using the item analysis procedures described later in this chapter, and the test revised before the operational version of the test (the version that will actually be used in test administrations with candidates) is finalized.
Item analysis usually provides two kinds of information on items: item facility, which helps us decide if test items are at the right level for the target group, and item discrimination, which allows us to see if individual items are providing information on candidates' abilities consistent with that provided by the other items on the test.
Item facility expresses the proportion of the people taking the test who got a given item right. (Item difficulty is sometimes used to express similar information, in this case the proportion who got an item wrong.) Where the test purpose is to make distinctions between candidates, to spread them out in terms of their performance on the test, the items should be neither too easy nor too difficult. If the items are too easy, then people with differing levels of ability or knowledge will all get them right, and the differences in ability or knowledge will not be revealed by the item. Similarly, if the items are too hard, then able and less able candidates alike will get them wrong, and the item won't help us in distinguishing between them. Item facility is expressed on a scale from 0 (no-one got the item right) to 1 (everybody got it right); for example, an item facility of 0.37 means that 37% of those who took the item got it right. Ideal item facility is 0.5, but of course it is hard to hit this target exactly, and a range of item facilities from 0.33 to 0.67 is usually accepted. Even though, as we have seen, items that are very easy (items with high item facility) don't distinguish between candidates, it may be useful to include some at the beginning of a test in order to ease candidates into the test and to allow them a chance to get over their nerves. It may also be worth including a few rather hard items near the end of the test in order to distinguish between the most able candidates, if that information is relevant, for example in deciding who shall get prizes in a competitive examination.
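As an informal illustration (not drawn from the text, and using invented responses), item facility is simply the proportion of correct answers to an item:

# Illustrative sketch: item facility as the proportion of test-takers
# answering an item correctly (1 = right, 0 = wrong). Data are invented.

item_responses = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]   # twelve hypothetical test-takers

facility = sum(item_responses) / len(item_responses)
print(f"item facility = {facility:.2f}")   # 0.58, inside the usually accepted 0.33 to 0.67 range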
Analysis of item discrimination addresses a different target: consistency of performance by candidates across items. The usual method for calculating item discrimination involves comparing performance on each item by different groups of test-takers: those who have done well on the test overall, and those who have done relatively poorly. For example, as items get harder, we would expect those who do best on the test overall to be the ones who in the main get them right. Poor item discrimination indices are a signal that an item deserves revision.
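A minimal sketch of the comparison just described, contrasting an item's results in the high-scoring and low-scoring groups; the figures, and the simple high-minus-low index used here, are for illustration only and are not prescribed by the text.

# Illustrative sketch: a simple discrimination index for one item,
# comparing the proportion correct in the top-scoring and bottom-scoring
# groups of test-takers. All figures are invented.

# 1 = answered the item correctly, 0 = answered it wrongly
top_group    = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # the ten highest total scores on the test
bottom_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # the ten lowest total scores on the test

discrimination = sum(top_group) / len(top_group) - sum(bottom_group) / len(bottom_group)
print(f"discrimination index = {discrimination:.2f}")   # 0.50: the item separates the groups well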
If there are a lot of items with problems of discrimination, the information coming out of the test is confusing, as it means that some items are suggesting certain candidates are relatively better, while others are indicating that other individuals are better; no clear picture of the candidates' abilities emerges from the test. (The scores, in other words, are misleading, and not reliable indicators of the underlying abilities of the candidates.) Such a test will need considerable revision. The overall capacity of a multi-item test such as a comprehension test or a test of grammar or vocabulary to define levels of knowledge or ability among candidates consistently is referred to as the reliability of the test. As with the rater-mediated assessment indices discussed above, a statistical index known as a reliability coefficient is available to express on a scale of 0 to 1 the extent to which the test overall is succeeding in these terms. This index is broadly interpretable in the same way as the inter-rater reliability indices discussed above. We normally look for reliabilities on comprehension tests, or on tests of grammar or vocabulary, of 0.9 or better. A reliability of 0.9 means that scores on the test are providing about 80% reliable information on candidates' abilities, with about 20% attributable to randomness or error.
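The chapter does not commit to a particular formula for this coefficient; one widely used internal-consistency coefficient is Cronbach's alpha, sketched below with an invented matrix of item scores.

# Illustrative sketch: Cronbach's alpha, one common internal-consistency
# reliability coefficient (the text itself does not name a formula).
# Rows are test-takers, columns are items; the data are invented.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    k = len(scores[0])                                   # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])   # variance of candidates' total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
print(f"reliability (alpha) = {cronbach_alpha(scores):.2f}")   # about 0.80 for these invented responses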
Norm-referenced and criterion-referenced measurement
Approaches to testing can be defined in terms of the broad measurement assumptions they make. Two approaches are particularly relevant within language testing: norm-referenced and criterion-referenced measurement.
Norm-referenced measurement adopts a framework of comparison between individuals for understanding the significance of any single score. Each score is seen in the light of other scores, particularly in terms of its frequency (how often such a score typically occurs in a much larger group of test-takers). In daily life we operate with an idea of typical frequencies of occurrence for particular values of height, weight, and so on. For example, you will hear people saying 'That little girl is tall for her age' or 'He's rather overweight' or 'She's average looking.' We have internalized a sense of how often we will see young men of a range of heights. Men of average height are so common as to be unremarkable; exceptionally tall men (for example, athletes in sports where height may be an advantage) are often the subject of comment. The typical distribution of height in this population of young men is well recognized, for example, by shopkeepers selling men's clothing, who will keep abundant stock of trousers with the most common leg measurements, but far fewer items of unusual size which would fit basketball champions or jockeys.
If we carefully measured the height of a large number of subjects from the population of interest, we could keep count of how frequently measurement within given ranges of height occurred. In other words, we could develop information on the distribution of these frequencies of occurrences of heights across the men we had measured. Statisticians interested in measurement have done just this for a number of biological attributes, and it turns out that the distribution in each case is broadly similar. Statisticians have attempted to capture these typical frequencies in an idealized format known as the normal distribution. The highest frequencies occur near the average (or mean), and known proportions occur at given distances either side of the mean, thus giving the curve of the distribution its well-known bell shape (cf. Figure 6.2). The mathematical character of the normal distribution has been intensively studied for decades, and has predictable properties which can then be applied in measurement.
FIGURE 6.2  The bell curve of the normal distribution
Norm-referenced approaches to measurement assume that test scores will be like height or other biological measures, that is, normally distributed across the population of interest. Most scores will be around the average, and the further away from the average a score is, the more unusual it is likely to be. Thus, in norm-referenced measurement, an individual performance is evaluated not in terms of its quality compared with some criterion performance ('Did it meet what was required?') but in terms of its typicality for the population in question ('How good was it compared with the performances of others?').
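A minimal sketch of how a single score might be located within a normal distribution of scores; the mean, standard deviation, and candidate score below are invented, and the percentile follows from the standard normal curve.

# Illustrative sketch: placing one candidate's score in a normal
# distribution of scores. The mean, standard deviation, and score are invented.
import math

mean, sd = 50.0, 10.0       # hypothetical test mean and standard deviation
score = 63.0                # one candidate's score

z = (score - mean) / sd                               # distance from the mean in SD units
percentile = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # area under the normal curve below z

print(f"z = {z:.2f}; about {percentile:.0%} of scores fall below this one")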
Norm-referenced measurement has several advantages. In contexts where this is appropriate it allows for distinct levels of performance to be defined, and allows for distinctions between individual performances to be made. In addition, the procedures for investigating the reliability and aspects of the validity of norm-referenced scores are well established and well known. However, from an educational point of view its dependence on comparisons across a population has been seen as being inappropriately competitive, and discouraging for the 'average' student.
An alternative approach which does not use a comparison between individuals as its frame of reference is known as criterion-referenced measurement. Here, individual performances are evaluated against a verbal description of a satisfactory performance at a given level. In this way, a series of performance goals can be set for individual learners and they can reach these at their own rate. In this way, motivation is maintained, and the striving is for a 'personal best' rather than against other learners. Of course, even here comparison may creep in, as learners will compare the levels they and others have reached. Raters, too, will inevitably have in their heads a reference map of the range of achievement they have come to expect as teachers or raters, and locate the current performance accordingly. Nevertheless, in principle it is useful to distinguish the two broad approaches to assessment. Because criterion-referenced measurement involves evaluation of performance against descriptors, it typically involves judgement as to how a performance should be classified. Thus, measurement procedures used in criterion-referenced approaches will include the indices of the quality of raters (inter-rater reliability indices, classification analysis, and so on) presented earlier in this chapter.
Norm-referenced approaches require a score distribution, whose frequencies can be modelled in terms of the expected frequencies of the normal distribution. A score distribution implies the existence of a range of possible scores. Language tests which involve multiple items (and hence a range of possible total scores) generate such distributions, and so norm-referenced approaches are more typically associated with comprehension tests, or tests of grammar and vocabulary.
New approaches to measurement
New measurement approaches continually emerge. The most significant of them is known by the general name of Item Response Theory (IRT). IRT represents a new approach to item analysis (see earlier discussion). This, on the face of it, unexciting characteristic has important practical implications. It greatly facilitates the formerly very difficult business of test equating (producing tests of equivalent difficulty). It also permits test linking, that is, using tests of differing but known relative difficulty to measure the growth of individuals over time. IRT also makes possible the development of computer adaptive tests, a form of computer-delivered test to be discussed in detail in Chapter 8. IRT has also made great strides in the analysis of data from performance assessments, particularly through the branch of IRT known as Rasch measurement. Readers wishing to learn more about these new developments are referred to the suggestions for further reading in Section 3 (References).
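By way of illustration only (the chapter does not give the formula), the simplest Rasch model expresses the probability of a correct response as a function of the gap between a person's ability and an item's difficulty. A sketch with invented values:

# Illustrative sketch: the basic (dichotomous) Rasch model, in which the
# probability of a correct answer depends only on the difference between
# person ability and item difficulty, both on the same logit scale.
# The ability and difficulty values below are invented.
import math

def p_correct(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

print(p_correct(1.0, 1.0))    # ability equals difficulty: probability 0.5
print(p_correct(2.0, 1.0))    # abler person, same item: about 0.73
print(p_correct(0.0, 1.0))    # less able person, same item: about 0.27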
Conclusion
In this chapter we have considered a number of ways in which concepts and practices from the field of educational measurement or psychometrics have had an impact on the area of language assessment. We distinguished different approaches to measurement, with different sets of assumptions, and some of the most common techniques associated with each for investigating the quality of language tests. We also drew attention to the new developments taking place in the field.
It has been argued recently that too obsessive a concern with measurement considerations can have a destructive effect educationally. For example, the move away from multiple-choice items in favour of assessment of integrated performances is in line with communicative approaches to language teaching and arguably therefore likely to have a beneficial impact on the curriculum and on classroom practice. But it is also more difficult to achieve acceptable levels of reliability in rater-mediated assessment than it is on multi-item multiple-choice tests. Which consideration, validity or reliability, should predominate in such a case? This brings up one of the central issues in testing, namely that one might test what is readily testable rather than what needs to be tested to provide a proper assessment of language ability. And the question of what counts as proper assessment involves a consideration of the social and educational responsibility of language assessment. These are matters to be taken up in the following chapter.
7
The social character of language tests
Introduction
At a moment of dramatic intensity in the theatre, the glare of a single spotlight can isolate an individual actor from his or her surroundings. The spotlight focuses the spectator's attention on the psychological state of the character being portrayed. Temporarily at least, the surroundings, including other actors present, are rendered invisible for the audience. Until fairly recently, thinking about language assessment was like this. It focused exclusively on the skills and abilities of the individual being assessed. Educational assessment has traditionally drawn its concepts and procedures primarily from the field of psychology, and more specifically from the branch of psychology known as psychometrics, that is, the measurement of individual cognitive abilities. But what does the bright spotlight of this individualizing perspective exclude? What lies behind, around? Imagine the spotlight going off to be replaced by normal stage lighting: the other actors on the stage are revealed. Now imagine the performance continuing, but the house lights coming up, so that the audience is revealed. Imagine finally the side curtains being pulled back and the stage set removed to expose all the personnel working behind the scenes. The individual performance is now exposed as forming