part of a larger area of research on the impact of assessment procedures on teaching and learning, and more broadly on society as a whole. The social context of assessment will be considered in detail in Chapter 7.
Conclusion
In this chapter we have examined the need for questioning the bases for inferences about candidate abilities residing in test procedures, and the way in which these inferences may be at risk from aspects of test design and test method, or lack of clarity in our thinking about what we are measuring. Efforts to establish the validity of tests have generated much of what constitutes the field of language testing research. Such research involves two primary techniques: speculation and empiricism. Speculation here refers to reasoning and logical analysis about the nature of language and language use, and of the nature of performance, of the type that we outlined in Chapter 2. Empiricism means subjecting such theorizing and specific implications of particular testing practices to examination in the light of data from test trials and operational test administrations. Thus, as an outcome of the test development cycle, language testing research involves the formation of hypotheses about the nature of language ability, and putting such hypotheses to the test. In this way, language testing is rescued from being a merely technical activity and constitutes a site for research activity of a fundamental nature in applied linguistics.
6
Measurement
Introduction
Assessment usually involves allocating a score, an attractively
simple number. Gertrude Stein tells us that 'A rose is a rose is a
rose', but measurement people (in their unimaginative way) tell
us that a score is not a score is not a score: scores can be deceptive.
For example, when different raters give the same score, do they
mean the same thing? In tests with several parts and multiple
items, how consistent are score patterns across different parts of a
test? Can we add scores from the different parts, or across tests of
different sub-skills, or are they measuring such different things
that they are incommensurable, cannot be talked about in the
same breath? What do the scores on a test tell us about its quality,
and its suitability for its intended purpose? These are questions
addressed by measurement, the theoretical and empirical analysis
of scores and score meaning.
Often, when people think (if they do) about testing, they perceive it as a dauntingly technical field, and it is often the measurement aspect of the field that puts people off. 'Means', 'percentiles', 'standard deviations', statistics: these inspire a lack of confidence that one could ever (or indeed would ever want to) engage successfully with testing as an area of knowledge and expertise. Yet, curiously, concepts from the field of measurement can be found frequently in everyday conversation: 'She is of above average intelligence.' 'He topped his class.' 'It's like saying that these apples are not very good oranges.' 'He's not a reliable judge.' It is not so much, then, that people are not interested in the questions that measurement asks, as that they are daunted by the way it goes about answering them, by its procedures and language. The aim of this chapter is to give a brief introduction to a small selection of measurement concepts and procedures commonly used in language assessment, and in particular to make the reader feel that they are accessible and worth understanding.
Measurement
Measurement investigates the quality of the process of assessment by looking at scores. Two main steps are involved:
1 Quantification, that is, the assigning of numbers or scores to various outcomes of assessment. The set of scores available for analysis when data are gathered from a number of test-takers is known as a data matrix.
2 Checking for various kinds of mathematical and statistical patterning within the matrix in order to investigate the extent to which necessary properties (for example, consistency of performance by candidates, or by judges) are present in the assessment.
The aim of these procedures is to achieve quality control, that is, to improve the meaningfulness and fairness of the conclusions reached about individual candidates (the validity of the test). Measurement procedures have no rationale other than to underpin validity.
Quality control for raters
As an example of what measurement expertise can contribute to our understanding of language tests, and our ability to develop fair and meaningful tests, we will look at the question of quality control procedures for raters. To what extent is there agreement between raters, and where there is disagreement, what can be done about it?
As investigation of rater agreement depends on the comparison of ratings, the first step involves careful data collection. A rating design is prepared in which raters are asked to carry out a number of ratings, with overlap between raters so that they each independently rate the same performances. In this way the ratings of one rater can be compared with the ratings of others.
Imagine a rating system (suggestive of the fiction of Franz Kafka) in which the ratings which candidates get depend not at all on the quality of their performances, but entirely on the whim of the rater. Occasionally, the rating would (by chance) be fair, but mostly it would not, and one would never know which rating accidentally reflected the candidate's ability, and which did not. The ratings would be entirely unreliable. Looked at mathematically, the ratings of one rater for a set of performances would bear little relationship to those of another, and would not be predictable from them. The reason for this is that the only thing causing differences in scores is the whim of individual raters, not the quality of the performance, to which the rater is indifferent.
Imagine the opposite (and equally fanciful) case of the ideal rating system. In this case, the only thing driving the ratings is the quality of the performance, so it shouldn't matter who the judge is, as he/she will recognize that quality and allocate the performance to the appropriate rating category accordingly. Looked at mathematically, in such a situation the ratings of any individual rater for a set of performances would be perfectly predictable from knowledge of the ratings given to those performances by another rater; they would be identical. If I wanted to know how candidate Laura fared with Rater B, I need only find out how she fared with Rater A and I would know.
In reality, the situation will lie somewhere between these two
extremes. But exactly where? How much dependable information
on the quality of performances do scores from a rater contain,
and how much do they reflect the whim of that rater? Measurement
methods can help us tackle this question very precisely. They can
do so because they can draw on mathematical methods for
exploring the extent to which one set of measures is predictable
from another set for the same class of individuals or objects.
Such mathematical methods for establishing predictable numerical relations of this kind originated in the rather prosaic field of agriculture, in order to explore the predictive relationship between varying amounts of fertilizer and associated crop yields. But the methods apply equally well to human beings, for example, in working out the extent to which the weight of a set of adult males of a given age group is predictable from their height. A set of statistics or single summary figures has been developed to capture any such predictive relationship. One of these, the correlation coefficient r, is frequently used in language assessment. It expresses the extent to which one score set is knowable from another, and uses a scale from 0 (no correspondence between the score sets at all, as in the Kafkaesque situation) to 1 (perfect correspondence, as in the ideal rating system). When used to express the extent of predictability of ratings between raters, and hence inter-rater agreement, this coefficient is called a reliability coefficient and expresses inter-rater reliability. Let us say we calculate this coefficient for each pairing of raters who are taking part in the rating scheme, and come up with some figures on the 0 to 1 scale. How are we to interpret these figures? And what level of reliability as expressed by the statistic should we be demanding of these raters?
Benchmarks for minimum acceptable inter-rater agreement range from 0.7 to 0.9 on this scale, depending on what is at stake, and what other information about candidates we may have (for example, their scores on other parts of the test). 0.7 represents a rock-bottom minimum of acceptable agreement between raters: this value can be understood as representing about 50% agreement and 50% disagreement between a pair of raters, hardly impressive. 0.9 is a much more satisfactory level (representing about 80% agreement, 20% disagreement overall); but achieving this level among raters may involve careful attention to the clear wording of criteria and rigorous training of raters in their interpretation.
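To make the arithmetic concrete, here is a minimal sketch (not part of the original text) of how such an inter-rater correlation might be computed for two raters who have each scored the same set of performances. The ratings, and the resulting coefficient, are invented for illustration.

# Illustrative only: Pearson correlation between two raters' scores
# for the same set of performances. The ratings are invented.

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4]   # hypothetical ratings on a 1-5 scale
rater_b = [4, 2, 5, 2, 3, 3, 4, 1, 2, 5]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater reliability (r) = {r:.2f}")   # about 0.89 for these invented ratings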
Obviously, it is useful to have a commonly understood scale for expressing the degree of rater agreement in this way. It allows for ready communication about the extent to which one can depend on ratings from any assessment scheme involving raters, and to set standards. It also allows you to study the impact of rater training in improving rater reliability, to identify individual raters whose ratings are inconsistent with those of others, to provide certification for consistent raters, and to have confidence, once overall levels of agreement are high, in the workability of the rating scheme.
Correlation coefficients are not the only means of studying agreement between raters. When a single classification decision is to be made, a classification analysis can be carried out. This is a very simple procedure which can easily be done by hand. Imagine two raters (A and B) each of whom independently rates a set of performances from 30 candidates. They are required to say whether the performance demonstrates a required level of competence, or otherwise. A table is drawn up, setting out the rating categories ('Competent'/'Not Competent') available to each rater, and the frequency of agreement and disagreement between the raters, as in Figure 6.1.
                           Rater A
                           Competent      Not competent
Rater B   Competent        13             3
          Not competent    4              10

FIGURE 6.1  Classification analysis for two raters
The pairs of ratings for each candidate's performance are considered in turn. Where there is agreement that the performance demonstrates competence, a mark is made in the upper left hand cell in the table; where the raters agree that the performance fails to demonstrate competence, a mark is made in the lower right cell. The cases of disagreement are similarly noted. The number of marks in each cell is then totalled. In this case, the raters agreed about 23 of the 30 performances, and disagreed about 7. We can then report this as percentage agreement: 23/30 = 77%.
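The same tallying procedure can be sketched in a few lines of code. The data below are invented so that the cell totals match Figure 6.1, and the variable names are my own.

# Illustrative sketch of a classification analysis for two raters.
# Each entry is a (Rater A, Rater B) decision for one candidate;
# 'C' = Competent, 'N' = Not competent. The data are invented so that
# the cell totals match Figure 6.1 (13, 3, 4, 10).

from collections import Counter

decisions = [('C', 'C')] * 13 + [('N', 'C')] * 3 + [('C', 'N')] * 4 + [('N', 'N')] * 10

cells = Counter(decisions)             # frequencies for the 2 x 2 table
agreements = cells[('C', 'C')] + cells[('N', 'N')]
total = len(decisions)

print(cells)
print(f"percentage agreement = {100 * agreements / total:.0f}%")   # 77%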
Where more than two classification categories are available, the above kind of information on frequency of misclassification can be complemented by information on how far apart the raters were in particular instances: one level apart, two levels apart, and so on. This information can be used in rater training, rater certification, and research, as with the inter-rater reliability coefficients discussed above.
There is a range of further and more complex statistical analysis procedures for the investigation of ratings which we need not go into here: they can be taken as more elaborate variations on the same basic themes.
Investigating the properties of individual test items
While investigating rater characteristics is important in guaranteeing the meaningfulness and fairness of assessment in performance tests, other kinds of quality control procedures are necessary in paper-and-pencil tests (for this distinction, see Chapter 1; for item formats, see Chapter 4). In tests with a number of individual objectively scored test items, for example, in tests of language comprehension, or tests of knowledge of individual points of grammar or vocabulary, it is usual to carry out a procedure known as item analysis. This procedure involves the careful analysis of score patterns on each of the test items. The analysis tells us how well each item is working, that is, the contribution it is making to the overall picture of candidates' ability emerging from the test.
Item analysis is a normal part of test development. Before a test is introduced in its final format, a pilot version of the test is developed. This will contain a number of draft items (many more than are needed, so that only the best ones will survive the piloting), possibly in a variety of item formats of interest. This version is then taken by a group of individuals with the same learner profile as the ultimate test-takers (the number has to be sufficiently large for analyses of patterns of responses to items to be possible). This stage of test development is known as trialling or trying out. The effectiveness of items (and hence of formats) is evaluated using the item analysis procedures described later in this chapter, and the test revised before the operational version of the test (the version that will actually be used in test administrations with candidates) is finalized.
Item analysis usually provides two kinds of information on items: item facility, which helps us decide if test items are at the right level for the target group, and item discrimination, which allows us to see if individual items are providing information on candidates' abilities consistent with that provided by the other items on the test.
Item facility expresses the proportion of the people taking the test who got a given item right. (Item difficulty is sometimes used to express similar information, in this case the proportion who got an item wrong.) Where the test purpose is to make distinctions between candidates, to spread them out in terms of their performance on the test, the items should be neither too easy nor too difficult. If the items are too easy, then people with differing levels of ability or knowledge will all get them right, and the differences in ability or knowledge will not be revealed by the item. Similarly, if the items are too hard, then able and less able candidates alike will get them wrong, and the item won't help us in distinguishing between them. Item facility is expressed on a scale from 0 (no-one got the item right) to 1 (everybody got it right); for example, an item facility of 0.37 means that 37% of those who took the item got it right. Ideal item facility is 0.5, but of course it is hard to hit this target exactly, and a range of item facilities from 0.33 to 0.67 is usually accepted. Even though, as we have seen, items that are very easy (items with high item facility) don't distinguish between candidates, it may be useful to include some at the beginning of a test in order to ease candidates into the test and to allow them a chance to get over their nerves. It may also be worth including a few rather hard items near the end of the test in order to distinguish between the most able candidates, if that information is relevant, for example in deciding who shall get prizes in a competitive examination.
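As an informal illustration (not drawn from the text, and using invented responses), item facility is simply the proportion of correct answers to an item:

# Illustrative sketch: item facility as the proportion of test-takers
# answering an item correctly (1 = right, 0 = wrong). Data are invented.

item_responses = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]   # twelve hypothetical test-takers

facility = sum(item_responses) / len(item_responses)
print(f"item facility = {facility:.2f}")   # 0.58, inside the usually accepted 0.33 to 0.67 range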
Analysis of item discrimination addresses a different target: consistency of performance by candidates across items. The usual method for calculating item discrimination involves comparing performance on each item by different groups of test-takers: those who have done well on the test overall, and those who have done relatively poorly. For example, as items get harder, we would expect those who do best on the test overall to be the ones who in the main get them right. Poor item discrimination indices are a signal that an item deserves revision.
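A minimal sketch of the comparison just described, contrasting an item's results in the high-scoring and low-scoring groups; the figures, and the simple high-minus-low index used here, are for illustration only and are not prescribed by the text.

# Illustrative sketch: a simple discrimination index for one item,
# comparing the proportion correct in the top-scoring and bottom-scoring
# groups of test-takers. All figures are invented.

# 1 = answered the item correctly, 0 = answered it wrongly
top_group    = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # the ten highest total scores on the test
bottom_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # the ten lowest total scores on the test

discrimination = sum(top_group) / len(top_group) - sum(bottom_group) / len(bottom_group)
print(f"discrimination index = {discrimination:.2f}")   # 0.50: the item separates the groups well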
If there are a lot of items with problems of discrimination, the information coming out of the test is confusing, as it means that some items are suggesting certain candidates are relatively better, while others are indicating that other individuals are better; no clear picture of the candidates' abilities emerges from the test. (The scores, in other words, are misleading, and not reliable indicators of the underlying abilities of the candidates.) Such a test will need considerable revision. The overall capacity of a multi-item test such as a comprehension test or a test of grammar or vocabulary to define levels of knowledge or ability among candidates consistently is referred to as the reliability of the test. As with the rater-mediated assessment indices discussed above, a statistical index known as a reliability coefficient is available to express on a scale of 0 to 1 the extent to which the test overall is succeeding in these terms. This index is broadly interpretable in the same way as the inter-rater reliability indices discussed above. We normally look for reliabilities on comprehension tests, or on tests of grammar or vocabulary, of 0.9 or better. A reliability of 0.9 means that scores on the test are providing about 80% reliable information on candidates' abilities, with about 20% attributable to randomness or error.
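The chapter does not commit to a particular formula for this coefficient; one widely used internal-consistency coefficient is Cronbach's alpha, sketched below with an invented matrix of item scores.

# Illustrative sketch: Cronbach's alpha, one common internal-consistency
# reliability coefficient (the text itself does not name a formula).
# Rows are test-takers, columns are items; the data are invented.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    k = len(scores[0])                                   # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])   # variance of candidates' total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
print(f"reliability (alpha) = {cronbach_alpha(scores):.2f}")   # about 0.80 for these invented responses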
Norm-referenced and criterion-referenced measurement
Approaches to testing can be defined in terms of the broad measurement assumptions they make. Two approaches are particularly relevant within language testing: norm-referenced and criterion-referenced measurement.
Norm-referenced measurement adopts a framework of comparison between individuals for understanding the significance of any single score. Each score is seen in the light of other scores, particularly in terms of its frequency (how often such a score typically occurs in a much larger group of test-takers). In daily life we operate with an idea of typical frequencies of occurrence for particular values of height, weight, and so on. For example, you will hear people saying 'That little girl is tall for her age' or 'He's rather overweight' or 'She's average looking.' We have internalized a sense of how often we will see young men of a range of heights. Men of average height are so common as to be unremarkable; exceptionally tall men (for example, athletes in sports where height may be an advantage) are often the subject of comment. The typical distribution of height in this population of young men is well recognized, for example, by shopkeepers selling men's clothing, who will keep abundant stock of trousers with the most common leg measurements, but far fewer items of unusual size which would fit basketball champions or jockeys.
If we carefully measured the height of a large number of subjects from the population of interest, we could keep count of how frequently measurement within given ranges of height occurred. In other words, we could develop information on the distribution of these frequencies of occurrences of heights across the men we had measured. Statisticians interested in measurement have done just this for a number of biological attributes, and it turns out that the distribution in each case is broadly similar. Statisticians have attempted to capture these typical frequencies in an idealized format known as the normal distribution. The highest frequencies occur near the average (or mean), and known proportions occur at given distances either side of the mean, thus giving the curve of the distribution its well-known bell shape (cf. Figure 6.2). The mathematical character of the normal distribution has been intensively studied for decades, and has predictable properties which can then be applied in measurement.
FIGURE 6.2  The bell curve of the normal distribution
Norm-referenced approaches to measurement assume that test scores will be like height or other biological measures, that is, normally distributed across the population of interest. Most scores will be around the average, and the further away from the average a score is, the more unusual it is likely to be. Thus, in norm-referenced measurement, an individual performance is evaluated not in terms of its quality compared with some criterion performance ('Did it meet what was required?') but in terms of its typicality for the population in question ('How good was it compared with the performances of others?').
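A minimal sketch of how a single score might be located within a normal distribution of scores; the mean, standard deviation, and candidate score below are invented, and the percentile follows from the standard normal curve.

# Illustrative sketch: placing one candidate's score in a normal
# distribution of scores. The mean, standard deviation, and score are invented.
import math

mean, sd = 50.0, 10.0       # hypothetical test mean and standard deviation
score = 63.0                # one candidate's score

z = (score - mean) / sd                               # distance from the mean in SD units
percentile = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # area under the normal curve below z

print(f"z = {z:.2f}; about {percentile:.0%} of scores fall below this one")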
Norm-referenced measurement has several advantages. In contexts where this is appropriate it allows for distinct levels of performance to be defined, and allows for distinctions between individual performances to be made. In addition, the procedures for investigating the reliability and aspects of the validity of norm-referenced scores are well established and well known. However, from an educational point of view its dependence on comparisons across a population has been seen as being inappropriately competitive, and discouraging for the 'average' student.
An alternative approach which does not use a comparison between individuals as its frame of reference is known as criterion-referenced measurement. Here, individual performances are evaluated against a verbal description of a satisfactory performance at a given level. In this way, a series of performance goals can be set for individual learners and they can reach these at their own rate. In this way, motivation is maintained, and the striving is for a 'personal best' rather than against other learners. Of course, even here comparison may creep in, as learners will compare the levels they and others have reached. Raters, too, will inevitably have in their heads a reference map of the range of achievement they have come to expect as teachers or raters, and locate the current performance accordingly. Nevertheless, in principle it is useful to distinguish the two broad approaches to assessment. Because criterion-referenced measurement involves evaluation of performance against descriptors, it typically involves judgement as to how a performance should be classified. Thus, measurement procedures used in criterion-referenced approaches will include the indices of the quality of raters (inter-rater reliability indices, classification analysis, and so on) presented earlier in this chapter.
Norm-referenced approaches require a score distribution, whose frequencies can be modelled in terms of the expected frequencies of the normal distribution. A score distribution implies the existence of a range of possible scores. Language tests which involve multiple items (and hence a range of possible total scores) generate such distributions, and so norm-referenced approaches are more typically associated with comprehension tests, or tests of grammar and vocabulary.
New approaches to measurement
New measurement approaches continually emerge. The most significant of them is known by the general name of Item Response Theory (IRT). IRT represents a new approach to item analysis (see earlier discussion). This, on the face of it, unexciting characteristic has important practical implications. It greatly facilitates the formerly very difficult business of test equating (producing tests of equivalent difficulty). It also permits test linking, that is, using tests of differing but known relative difficulty to measure the growth of individuals over time. IRT also makes possible the development of computer adaptive tests, a form of computer-delivered test to be discussed in detail in Chapter 8. IRT has also made great strides in the analysis of data from performance assessments, particularly through the branch of IRT known as Rasch measurement. Readers wishing to learn more about these new developments are referred to the suggestions for further reading in Section 3 (References).
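By way of illustration only (the chapter does not give the formula), the simplest Rasch model expresses the probability of a correct response as a function of the gap between a person's ability and an item's difficulty. A sketch with invented values:

# Illustrative sketch: the basic (dichotomous) Rasch model, in which the
# probability of a correct answer depends only on the difference between
# person ability and item difficulty, both on the same logit scale.
# The ability and difficulty values below are invented.
import math

def p_correct(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

print(p_correct(1.0, 1.0))    # ability equals difficulty: probability 0.5
print(p_correct(2.0, 1.0))    # abler person, same item: about 0.73
print(p_correct(0.0, 1.0))    # less able person, same item: about 0.27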
Conclusion
In this chapter we have considered a number of ways in which concepts and practices from the field of educational measurement or psychometrics have had an impact on the area of language assessment. We distinguished different approaches to measurement, with different sets of assumptions, and some of the most common techniques associated with each for investigating the quality of language tests. We also drew attention to the new developments taking place in the field.
It has been argued recently that too obsessive a concern with measurement considerations can have a destructive effect educationally. For example, the move away from multiple-choice items in favour of assessment of integrated performances is in line with communicative approaches to language teaching and arguably therefore likely to have a beneficial impact on the curriculum and on classroom practice. But it is also more difficult to achieve acceptable levels of reliability in rater-mediated assessment than it is on multi-item multiple-choice tests. Which consideration, validity or reliability, should predominate in such a case? This brings up one of the central issues in testing, namely that one might test what is readily testable rather than what needs to be tested to provide a proper assessment of language ability. And the question of what counts as proper assessment involves a consideration of the social and educational responsibility of language assessment. These are matters to be taken up in the following chapter.
7
The social character of language tests
Introduction
At a moment of dramatic intensity in the theatre, the glare of a single spotlight can isolate an individual actor from his or her surroundings. The spotlight focuses the spectator's attention on the psychological state of the character being portrayed. Temporarily at least, the surroundings, including other actors present, are rendered invisible for the audience. Until fairly recently, thinking about language assessment was like this. It focused exclusively on the skills and abilities of the individual being assessed. Educational assessment has traditionally drawn its concepts and procedures primarily from the field of psychology, and more specifically from the branch of psychology known as psychometrics, that is, the measurement of individual cognitive abilities. But what does the bright spotlight of this individualizing perspective exclude? What lies behind, around? Imagine the spotlight going off to be replaced by normal stage lighting: the other actors on the stage are revealed. Now imagine the performance continuing, but the house lights coming up, so that the audience is revealed. Imagine finally the side curtains being pulled back and the stage set removed to expose all the personnel working behind the scenes. The individual performance is now exposed as forming