Oxford Introductions to Language Study
Language Testing
Tim McNamara is Associate Professor
in the Department of Linguistics and
Applied Linguistics at the University
of Melbourne.
Published in this series:
Rod Ellis:
Second Language Acquisition
Claire Kramsch:
Language and Culture
Thomas Scovel:
Psycholinguistics
Bernard Spolsky:
Sociolinguistics
H. G. Widdowson:
Linguistics
George Yule:
Pragmatics
Oxford Introductions to Language Study
Series Editor H.G. Widdowson
Tim McNamara
OXFORD
UNIVERSITY PRESS
Great Clarendon Street, Oxford ox2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University's objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
OXFORD and OXFORD ENGLISH are registered trade marks of
Oxford University Press in the UK and in certain other countries
© Oxford University Press 2000
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First published 2000
2014 2013 2012 2011 2010
10 9 8 7
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press (with
the sole exception of photocopying carried out under the conditions stated
in the paragraph headed 'Photocopying'), or as expressly permitted by law, or
under terms agreed with the appropriate reprographics rights organization.
Enquiries concerning reproduction outside the scope of the above should
be sent to the ELT Rights Department, Oxford University Press, at the
address above
You must not circulate this book in any other binding or cover
and you must impose this same condition on any acquirer
Photocopying
The Publisher grants permission for the photocopying of those pages
marked 'photocopiable' according to the following conditions. Individual
purchasers may make copies for their own use or for use by classes that
they teach. School purchasers may make copies for use by staff and students,
but this permission does not extend to additional schools or branches
Under no circumstances may any part of this book be photocopied for resale
Any websites referred to in this publication are in the public domain and
their addresses are provided by Oxford University Press for information only.
Oxford University Press disclaims any responsibility for the content
ISBN-13: 978 0 19 437222 0
Printed in China
To Terry Quinn
Contents
Preface
Author's preface
SECTION 1
Survey 1

1 Testing, testing ... What is a language test? 3
  Understanding language testing 4
  Types of test 5
  Test purpose 6
  The criterion 7
  The test-criterion relationship 10
  Conclusion 11

2 Communication and the design of language tests 13
  Discrete point tests 13
  Integrative and pragmatic tests 14
  Communicative language tests 16
  Models of communicative ability 17
  Conclusion 21

3 The testing cycle 23
  Understanding the constraints 24
  Test content 25
  Test method 26
  Authenticity of response 27
  Fixed and constructed response formats 29
  Test specifications 31
  Test trials 32
  Conclusion 33

4 The rating process 35
  Establishing a rating procedure 36
  The problem with raters 37
  Establishing a framework for making judgements 38
  Rating scales 40
  Holistic and analytic ratings 43
  Rater training 44
  Conclusion 44

5 Validity: testing the test 47
  Threats to test validity 50
  Test content 50
  Test method and test construct 52
  The impact of tests 53
  Conclusion 54

6 Measurement 55
  Introduction 55
  Measurement 56
  Quality control for raters 56
  Investigating the properties of individual test items 59
  Norm-referenced and criterion-referenced measurement 62
  New approaches to measurement 64
  Conclusion 65

7 The social character of language tests 67
  Introduction 67
  The institutional character of assessment 68
  Assessment and social policy 68
  Assessment and educational policy 69
  The social responsibility of the language tester 70
  Ethical language testing 72
  Accountability 72
  Washback 73
  Test impact 74
  Codes of professional ethics for language testers 75
  Critical language testing 76
  Conclusion 77

8 New directions-and dilemmas? 79
  Computers and language testing 79
  Technology and the testing of speaking 81
  Dilemmas: whose performance? 83

SECTION 2
Readings 87

SECTION 3
References 121

SECTION 4
Glossary 131
Preface
Purpose
What justification might there be for a series of introductions to language study? After all, linguistics is already well served with introductory texts: expositions and explanations which are comprehensive, authoritative, and excellent in their way. Generally speaking, however, their way is the essentially academic one of providing a detailed initiation into the discipline of linguistics, and they tend to be lengthy and technical: appropriately so, given their purpose. But they can be quite daunting to the novice. There is also a need for a more general and gradual introduction to language: transitional texts which will ease people into an understanding of complex ideas. This series of introductions is designed to serve this need.
Their purpose, therefore, is not to supplant but to support the more academically oriented introductions to linguistics: to prepare the conceptual ground. They are based on the belief that it is an advantage to have a broad map of the terrain sketched out before one considers its more specific features on a smaller scale, a general context in reference to which the detail makes sense. It is sometimes the case that students are introduced to detail without it being made clear what it is a detail of. Clearly, a general understanding of ideas is not sufficient: there needs to be closer scrutiny. But equally, close scrutiny can be myopic and meaningless unless it is related to the larger view. Indeed it can be said that the precondition of more particular enquiry is an awareness of what, in general, the particulars are about. This series is designed to provide this large-scale view of different areas of language study. As such it can serve as preliminary to (and precondition for) the more specific and specialized enquiry which students of linguistics are required to undertake.
But the series is not only intended to be helpful to such students. There are many people who take an interest in language without being academically engaged in linguistics per se. Such people may recognize the importance of understanding language for their own lines of enquiry, or for their own practical purposes, or quite simply for making them aware of something which figures so centrally in their everyday lives. If linguistics has revealing and relevant things to say about language, this should presumably not be a privileged revelation, but one accessible to people other than linguists. These books have been so designed as to accommodate these broader interests too: they are meant to be introductions to language more generally as well as to linguistics as a discipline.
Design
The books in the series are all cut to the same basic pattern. There
are four parts: Survey, Readings, References, and Glossary.
Survey
This is a summary overview of the main features of the area of language study concerned: its scope and principles of enquiry, its basic concerns and key concepts. These are expressed and explained in ways which are intended to make them as accessible as possible to people who have no prior knowledge or expertise in the subject. The Survey is written to be readable and is uncluttered by the customary scholarly references. In this sense, it is simple. But it is not simplistic. Lack of specialist expertise does not imply an inability to understand or evaluate ideas. Ignorance means lack of knowledge, not lack of intelligence. The Survey, therefore, is meant to be challenging. It draws a map of the subject area in such a way as to stimulate thought and to invite a critical participation in the exploration of ideas. This kind of conceptual cartography has its dangers of course: the selection of what is significant, and the manner of its representation, will not be to the liking of everybody, particularly not, perhaps, to some of those inside the discipline. But these surveys are written in the belief that there must be an alternative to a technical account on the one hand, and an idiot's guide on the other if linguistics is to be made relevant to people in the wider world.
Readings
Some people will be content to read, and perhaps re-read, the
summary Survey. Others will want to pursue the subject and so
will use the Survey as the preliminary for more detailed study. The
Readings provide the necessary transition. For here the reader is
presented with texts extracted from the specialist literature. The
purpose of these Readings is quite different from the Survey. It is
to get readers to focus on the specifics of what is said, and how it
is said, in these source texts. Questions are provided to further
this purpose: they are designed to direct attention to points in
each text, how they compare across texts, and how they deal with
the issues discussed in the Survey. The idea is to give readers an
initial familiarity with the more specialist idiom of the linguistics
literature, where the issues might not be so readily accessible, and
to encourage them into close critical reading.
References
One way of moving into more detailed study is through the Readings. Another is through the annotated References in the third section of each book. Here there is a selection of works (books and articles) for further reading. Accompanying comments indicate how these deal in more detail with the issues discussed in the different chapters of the Survey.
Glossary
Certain terms in the Survey appear in bold. These are terms used in a special or technical sense in the discipline. Their meanings are made clear in the discussion, but they are also explained in the Glossary at the end of each book. The Glossary is cross-referenced to the Survey, and therefore serves at the same time as an index. This enables readers to locate the term and what it signifies in the more general discussion, thereby, in effect, using the Survey as a summary work of reference.
Use
The series has been designed so as to be flexible in use. Each title is separate and self-contained, with only the basic format in common. The four sections of the format, as described here, can be drawn upon and combined in different ways, as required by the needs, or interests, of different readers. Some may be content with the Survey and the Glossary and may not want to follow up the suggested References. Some may not wish to venture into the Readings. Again, the Survey might be considered as appropriate preliminary reading for a course in applied linguistics or teacher education, and the Readings more appropriate for seminar discussion during the course. In short, the notion of an introduction will mean different things to different people, but in all cases the concern is to provide access to specialist knowledge and stimulate an awareness of its significance. This series as a whole has been designed to provide this access and promote this awareness in respect to different areas of language study.
H. G. WIDDOWSON
Author's acknowledgements
Language testing is often thought of as an arcane and difficult field, and politically incorrect to boot. The opportunity to provide an introduction to the conceptual interest of the field and to some of its procedures has been an exciting one. The immediate genesis for this book came from an invitation from Henry Widdowson, who proved to be an illuminating and supportive editor throughout the process of the book's writing. It was an honour and a pleasure to work with him.
The real origins of the book lay further back, when over 15 years ago Terry Quinn of the University of Melbourne urged me to take up a consultancy on language testing at the Australian Language Centre in Jakarta. Terry has been an invaluable support and mentor throughout my career in applied linguistics, nowhere more so than in the field of language testing, which in his usual clear-sighted way he has always understood as being inherently political and social in character, a perspective which I am only now, after twelve years of research in the area, beginning to properly understand. I am also grateful to my other principal teachers about language testing, Alan Davies, Lyle Bachman, and Bernard Spolsky, and to my friend and colleague Elana Shohamy, from whom I have learnt so much in conversations long into the night about these and other matters. I also owe a deep debt to Sally Jacoby, a challenging thinker and great teacher, who has helped me frame and contextualize in new ways my work in this field. My colleagues at Melbourne, Brian Lynch and Alastair Pennycook, have dragged me kicking and screaming at least some way into the postmodern era. The Language Testing Research Centre at the University of Melbourne has been for over a decade the perfect environment within which thinking on language testing can flourish, and I am grateful to (again) Alan Davies and to Cathie Elder, and to all my other colleagues there. Whatever clarity the book may have is principally due to my dear friend and soulmate Lillian Nativ, who remains the most difficult and critical student I have had. Being a wonderful teacher herself she will never accept anything less than clear explanations. The students to whom I have taught language testing or whose research I have supervised over the years have also shaped this book in considerable ways. At OUP, I have had excellent help from Julia Sallabank and Belinda Penn.

On a more personal note I am grateful for the continuing support and friendship of Marie-Therese Jensen and the love of our son Daniel.
TIM McNAMARA
SECTION 1
Survey
Testing, testing ...
What is a language test?
Testing is a universal feature of social life. Throughout history people have been put to the test to prove their capabilities or to establish their credentials; this is the stuff of Homeric epic, of Arthurian legend. In modern societies such tests have proliferated rapidly. Testing for purposes of detection or to establish identity has become an accepted part of sport (drugs testing), the law (DNA tests, paternity tests, lie detection tests), medicine (blood tests, cancer screening tests, hearing and eye tests), and other fields. Tests to see how a person performs, particularly in relation to a threshold of performance, have become important social institutions and fulfil a gatekeeping function in that they control entry to many important social roles. These include the driving test and a range of tests in education and the workplace. Given the centrality of testing in social life, it is perhaps surprising that its practice is so little understood. In fact, as so often happens in the modern world, this process, which so much affects our lives, becomes the province of experts and we become dependent on them. The expertise of those involved in testing is seen as remote and obscure, and the tests they produce are typically associated with feelings of anxiety and powerlessness.

What is true of testing in general is true also of language testing, not a topic likely to quicken the pulse or excite much immediate interest. If it evokes any reaction, it will probably take the form of negative associations. For many, language tests may conjure up an image of an examination room, a test paper with questions, desperate scribbling against the clock. Or a chair outside the interview room and a nervous victim waiting with rehearsed phrases to be called into an inquisitional conversation with the examiners. But there is more to language testing than this.
To begin with, the very nature of testing has changed quite radically over the years to become less impositional, more humanistic, conceived not so much to catch people out on what they do not know, but as a more neutral assessment of what they do. Newer forms of language assessment may no longer involve the ordeal of a single test performance under time constraints. Learners may be required to build up a portfolio of written or recorded oral performances for assessment. They may be observed in their normal activities of communication in the language classroom on routine pedagogical tasks. They may be asked to carry out activities outside the classroom context and provide evidence of their performance. Pairs of learners may be asked to take part in role plays or in group discussions as part of oral assessment. Tests may be delivered by computer, which may tailor the form of the test to the particular abilities of individual candidates. Learners may be encouraged to assess aspects of their own abilities.

Clearly these assessment activities are very different from the solitary confinement and interrogation associated with traditional testing. The question arises, of course, as to how these different activities have developed, and what their principles of design might be. It is the purpose of this book to address these questions.
Understanding language testing
There are many reasons for developing a critical understanding of the principles and practice of language assessment. Obviously you will need to do so if you are actually responsible for language test development and claim expertise in this field. But many other people working in the field of language study more generally will want to be able to participate as necessary in the discourse of this field, for a number of reasons.

First, language tests play a powerful role in many people's lives, acting as gateways at important transitional moments in education, in employment, and in moving from one country to another. Since language tests are devices for the institutional control of individuals, it is clearly important that they should be understood, and subjected to scrutiny. Secondly, you may be working with language tests in your professional life as a teacher or administrator, teaching to a test, administering tests, or relying on information from tests to make decisions on the placement of students on particular courses.
Finally, if you are conducting research in language study you may need to have measures of the language proficiency of your subjects. For this you need either to choose an appropriate existing language test or design your own.

Thus, an understanding of language testing is relevant both for those actually involved in creating language tests, and also more generally for those involved in using tests or the information they provide, in practical and research contexts.
Types of test
Not all language tests are of the same kind. They differ with respect to how they are designed, and what they are for: in other words, in respect to test method and test purpose.

In terms of method, we can broadly distinguish traditional paper-and-pencil language tests from performance tests. Paper-and-pencil tests take the form of the familiar examination question paper. They are typically used for the assessment either of separate components of language knowledge (grammar, vocabulary, etc.) or of receptive understanding (listening and reading comprehension). Test items in such tests, particularly if they are professionally made standardized tests, will often be in fixed response format, in which a number of possible responses is presented from which the candidate is required to choose. There are several types of fixed response format, of which the most important is multiple choice format, as in the following example from a vocabulary test:
Select the most appropriate completion of the sentence.

I wonder what the newspaper says about the new play. I must read the _____.
(a) criticism
(b) opinion
(c) review *
(d) critic
Items in multiple choice format present a range of anticipated likely responses to the test-taker. Only one of the presented alternatives (the key, marked here with an asterisk) is correct; the others (the distractors) are based on typical confusions or misunderstandings seen in learners' attempts to answer the questions freely in try-outs of the test material, or on observation of errors made in the process of learning more generally. The candidate's task is simply to choose the best alternative among those presented. Scoring then follows automatically, and is indeed often done by machine. Such tests are thus efficient to administer and score, but since they only require picking out one item from a set of given alternatives, they are not much use in testing the productive skills of speaking and writing, except indirectly.
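The mechanics of such automatic scoring are simple enough to sketch in a few lines of code. The following is a minimal illustration only, not drawn from any actual testing system; the item numbers, answer key, and candidate responses are all invented.

```python
# A minimal sketch of machine scoring for fixed response items.
# The answer key and responses are invented for illustration;
# operational tests use dedicated scanning and scoring software.

ANSWER_KEY = {1: "c", 2: "a", 3: "d"}  # item number -> key option

def score(responses):
    """Score dichotomously: 1 if the chosen option is the key,
    0 if it is a distractor or the item was left blank."""
    return sum(1 for item, key in ANSWER_KEY.items()
               if responses.get(item, "").strip().lower() == key)

candidate = {1: "c", 2: "b", 3: "d"}   # key chosen for items 1 and 3
print(score(candidate))                # prints 2
```

The point the sketch makes is that once the key is fixed, no human judgement enters into the scoring at all, which is exactly why fixed response formats are so efficient, and also why they cannot directly assess production.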
In performance-based tests, language skills are assessed in an act of communication. Performance tests are most commonly tests of speaking and writing, in which a more or less extended sample of speech or writing is elicited from the test-taker, and judged by one or more trained raters using an agreed rating procedure. These samples are elicited in the context of simulations of real-world tasks in realistic contexts.
Test purpose
Language tests also differ according to their purpose. In fact, the same form of test may be used for differing purposes, although in other cases the purpose may affect the form. The most familiar distinction in terms of test purpose is that between achievement and proficiency tests.
Achievement tests are associated with the process of instruction. Examples would be: end of course tests, portfolio assessments, or observational procedures for recording progress on the basis of classroom work and participation. Achievement tests accumulate evidence during, or at the end of, a course of study in order to see whether and where progress has been made in terms of the goals of learning. Achievement tests should support the teaching to which they relate. Writers have been critical of the use of multiple choice standardized tests for this purpose, saying that they have a negative effect on classrooms as teachers teach to the test, and that there is often a mismatch between the test and the curriculum, for example where the latter emphasizes performance. An achievement test may be self-enclosed in the sense that it may not bear any direct relationship to language use in the world outside the classroom (it may focus on knowledge of particular points of grammar or vocabulary, for example). This will not be the case if the syllabus is itself concerned with the outside world, as the test will then automatically reflect that reality in the process of reflecting the syllabus. More commonly though, achievement tests are more easily able to be innovative, and to reflect progressive aspects of the curriculum, and are associated with some of the most interesting new developments in language assessment in the movement known as alternative assessment. This approach stresses the need for assessment to be integrated with the goals of the curriculum and to have a constructive relationship with teaching and learning. Standardized tests are seen as too often having a negative, restricting influence on progressive teaching. Instead, for example, learners may be encouraged to share in the responsibility for assessment, and be trained to evaluate their own capacities in performance in a range of settings in a process known as self-assessment.
Whereas achievement tests relate to the past in that they measure what language the students have learned as a result of teaching, proficiency tests look to the future situation of language use without necessarily any reference to the previous process of teaching. The future 'real life' language use is referred to as the criterion. In recent years tests have increasingly sought to include performance features in their design, whereby characteristics of the criterion setting are represented. For example, a test of the communicative abilities of health professionals in work settings will be based on representations of such workplace tasks as communicating with patients or other health professionals. Courses of study to prepare candidates for the test may grow up in the wake of its establishment, particularly if it has an important gatekeeping function, for example admission to an overseas university, or to an occupation requiring practical second language skills.
The criterion
Testing is about making inferences; this essential point is obscured by the fact that some testing procedures, particularly in performance assessment, appear to involve direct observation. Even where the test simulates real world behaviour (reading a newspaper, role playing a conversation with a patient, listening to a lecture), test performances are not valued in themselves, but only as indicators of how a person would perform similar, or related, tasks in the real world setting of interest. Understanding testing involves recognizing a distinction between the criterion (relevant communicative behaviour in the target situation) and the test. The distinction between test and criterion is set out for performance-based tests in Figure 1.1.
[Figure 1.1 Test and criterion. Test: a performance or series of performances, simulating, representing, or sampled from the criterion (observed). Criterion: a series of performances subsequent to the test; the target (unobservable). Characterization of the essential features of the criterion influences the design of the test, and test performances are used to make inferences about the criterion.]
Test performances are used as the basis for making inferences about criterion performances. Thus, for example, listening to a lecture in a test is used to infer how a person would cope with listening to lectures in the course of study he/she is aiming to enter. It is important to stress that although this criterion behaviour, as relevant to the appropriate communicative role (as nurse, for example, or student), is the real object of interest, it cannot be accounted for as such by the test. It remains elusive since it cannot be directly observed.

There has been a resistance among some proponents of direct testing to this idea. Surely test tasks can be authentic samples of behaviour? Sometimes it is true that the materials and tasks in language tests can be relatively realistic, but they can never be real. For example, an oral examination might include a conversation, or a role-play appropriate to the target destination. In a test of English for immigrant health professionals, this might be between a doctor and a patient. But even where performance test materials appear to be very realistic compared to traditional paper-and-pencil tests, it is clear that the test performance does not exist for its own sake. The test-taker is not really reading the newspaper provided in the test for the specific information within it; the test-taking doctor is not really advising the 'patient'. As one writer famously put it, everyone is aware that in a conversation used to assess oral ability 'this is a test, not a tea party'. The effect of test method on the realism of tests will be discussed further in Chapter 3.
There are a number of other limits to the authenticity of tests, which force us to recognize an inevitable gap between the test and the criterion. For one thing, even in those forms of direct performance assessment where the period in which behaviour is observed is quite extended (for example, a teacher's ability to use the target language in class may be observed on a series of lessons with real students), there comes a point at which we have to stop observing and reach our decision about the candidate, that is, make an inference about the candidate's probable behaviour in situations subsequent to the assessment period. While it may be likely that our conclusions based on the assessed lessons may be valid in relation to the subsequent unobserved teaching, differences in the conditions of performance may in fact jeopardize their validity (their generalizability). For example, factors such as the careful preparation of lessons when the teacher was under observation may not be replicated in the criterion, and the effect of this cannot be known in advance. The point is that observation of behaviour as part of the activity of assessment is naturally self-limiting, on logistical grounds if for no other reason. In fact, of course, most test situations allow only a very brief period of sampling of candidate behaviour, usually a couple of hours or so at most; oral tests may last only a few minutes. Another constraint on direct knowledge of the criterion is the testing equivalent of the Observer's Paradox: that is, the very act of observation may change the behaviour being observed. We all know how tense being assessed can make us, and conversely how easy it sometimes is to play to the camera, or the gallery.

In judging test performances then, we are not interested in the observed instances of actual use for their own sake; if we were, and that is all we were interested in, the sample performance would not be a test. Rather, we want to know what the particular performance reveals of the potential for subsequent performances in the criterion situation. We look, so to speak, underneath or through the test performance to those qualities in it which are indicative of what is held to underlie it.
If our inferences about subsequent candidate behaviour are wrong, this may have serious consequences for the candidate and others who have a stake in the decision. Investigating the defensibility of the inferences about candidates that have been made on the basis of test performance is known as test validation, and is the main focus of testing research.
The test-criterion relationship
The very practical activity of testing is inevitably underpinned by theoretical understanding of the relationship between the criterion and test performance. Tests are based on theories of the nature of language use in the target setting, and the way in which this is understood will be reflected in test design. Theories of language and language in use have of course developed in very different directions over the years, and tests will reflect a variety of theoretical orientations. For example, approaches which see performance in the criterion as an essentially cognitive activity will understand language use in terms of cognitive constructs such as knowledge, ability, and proficiency. On the other hand, approaches which conceive of criterion performance as a social and interactional achievement will emphasize social roles and interaction in test design. This will be explored in detail in Chapter 2.

However, it is not enough simply to accept the proposed relationship between criterion and test implicit in all test design. Testers need to check the empirical evidence for their position in the light of candidates' actual performance on test tasks. In other words, analysis of test data is called for, to put the theory of the test-criterion relationship itself to the test. For example, current models of communicative ability state that there are distinct aspects of that ability, which should be measured in tests. As a result, raters of speaking skills are sometimes required to fill in a grid where they record separate impressions of aspects of speaking such as pronunciation, appropriateness, grammatical accuracy, and the like. Using data (test scores) produced by such procedures, we will be in a position to examine empirically the relationship between scores given under the various categories. Are the categories indeed independent? Test validation thus involves two things. In the first place, it involves understanding how, in principle, performance on the test can be used to infer performance in the criterion. In the second place, it involves using empirical data from test performances to investigate the defensibility of that understanding and hence of the interpretations (the judgements about test-takers) that follow from it. These matters will be considered in detail in Chapter 5, on test validity.
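To give a concrete sense of the empirical side of this process, here is a minimal sketch of how one might begin to probe whether two analytic rating categories behave independently. The rating data are invented, and a real validation study would involve many more candidates, multiple raters, and techniques such as factor analysis rather than a single correlation; the sketch only illustrates the logic.

```python
# A minimal sketch: if scores on two rating categories correlate
# very highly across candidates, the categories may not be
# measuring distinct aspects of ability. The 1-6 scale scores
# below are invented for illustration.
from statistics import correlation  # available in Python 3.10+

pronunciation = [4, 5, 3, 6, 2, 4, 5, 3]
grammatical_accuracy = [4, 5, 2, 6, 3, 4, 6, 3]

r = correlation(pronunciation, grammatical_accuracy)
print(f"Pearson r = {r:.2f}")  # r near 1.0 suggests the categories overlap
```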
Conclusion
In this chapter we have looked at the nature of the test-criterion relationship. We have seen that a language test is a procedure for gathering evidence of general or specific language abilities from performance on tasks designed to provide a basis for predictions about an individual's use of those abilities in real world contexts. All such tests require us to make a distinction between the data of the learner's behaviour, the actual language that is produced in test performance, and what these data signify, that is to say what they count as in terms of evidence of 'proficiency', 'readiness for communicative roles in the real world', and so on. Testing thus necessarily involves interpretation of the data of test performance as evidence of knowledge or ability of one kind or another. Like the soothsayers of ancient Rome, who inspected the entrails of slain animals in order to make their interpretations and subsequent predictions of future events, testers need specialized knowledge of what signs to look for, and a theory of the relationship of those signs to events in the world. While language testing resembles other kinds of testing in that it conforms to general principles and practices of measurement, it is distinctive in that the signs and evidence it deals with have to do specifically with language. We need then to consider how views about the nature of language have had an impact on test design.
Communication and the design
of language tests
Essential to the activities of designing tests and interpreting the meaning of test scores is the view of language and language use embodied in the test. The term test construct refers to those aspects of knowledge or skill possessed by the candidate which are being measured. Although this term is taken from psychology, we should note that the knowledge or skill being assessed does not have to be defined in psychological terms. Thus some scholars have taken a social rather than psychological view of language performance and would define the test construct accordingly. Defining the test construct involves being clear about what knowledge of language consists of, and how that knowledge is deployed in actual performance (language use). Understanding what view the test takes of language use in the criterion is necessary for determining the link between test and criterion in performance testing. This is not just an academic matter. It has important practical implications, because according to what view the test takes, the 'look' of the test will be different, reporting of scores will change, and test performance will be interpreted differently. The difference of format between paper-and-pencil tests and performance tests is not just incidental; it reflects an implicit difference between views of language and language use.
Discrete point tests
Early theories of test performance, influenced by structuralist linguistics, saw knowledge of language as consisting of mastery of the features of the language as a system. This position was clearly articulated by Robert Lado in his highly influential book Language Testing, published in 1961. Testing focused on candidates' knowledge of the grammatical system, of vocabulary, and of aspects of pronunciation. There was a tendency to atomize and decontextualize the knowledge to be tested, and to test aspects of knowledge in isolation. Thus, the points of grammar chosen for assessment would be tested one at a time; and tests of grammar would be separate from tests of vocabulary. Material to be tested was presented with minimal context, for example in an isolated sentence. This practice of testing separate, individual points of knowledge, known as discrete point testing, was reinforced by theory and practice within psychometrics, the emerging science of the measurement of cognitive abilities. This stressed the need for certain properties of measurement, particularly reliability, or consistency of estimation of candidates' abilities. It was found that this could be best achieved through constructing a test consisting of many small items all directed at the same general target, say, grammatical structure or vocabulary knowledge. In order to test these individual points, item formats of the multiple choice question type were most suitable. While there was also realization among some writers that the integrated nature of performance needed to be reflected somewhere in a test battery, the usual way of handling this integration was at the level of skills testing, so that the four language macroskills of listening, reading, writing, and speaking were in various degrees tested (again, in strict isolation from one another) as a supplement to discrete point tests. This period of language testing has been called the psychometric-structuralist period and was in its heyday in the 1960s; but the practices adopted at that time have remained hugely influential.
Integrative and pragmatic tests
Within a decade, the necessity of assessing the practical language skills of foreign students wishing to study at universities in Britain and the US, together with the need within the communicative movement in teaching for tests which measured productive capacities for language, led to a demand for language tests which involved an integrated performance on the part of the language user. The discrete point tradition of testing was seen as focusing too exclusively on knowledge of the formal linguistic system for its own sake rather than on the way such knowledge is used to achieve communication. The new orientation resulted in the development of tests which integrated knowledge of relevant systematic features of language (pronunciation, grammar, vocabulary) with an understanding of context. As a result, a distinction was drawn between discrete point tests and integrative tests such as speaking in oral interviews, the composing of whole written texts, and tests involving comprehension of extended discourse (both spoken and written). The problem was that such integrative tests tended to be expensive, as they were time consuming and difficult to score, requiring trained raters; and in any case were potentially unreliable (that is, where judges were involved, the judges would disagree).
Research carried out by the American, John Oller, in the 1970s seemed to offer a solution. Oller offered a new view of language and language use underpinning tests, focusing less on knowledge of language and more on the psycholinguistic processing involved in language use. Language use was seen as involving two factors: (1) the on-line processing of language in real time (for example, in naturalistic speaking and listening activities), and (2) a 'pragmatic mapping' component, that is, the way formal knowledge of the systematic features of language was drawn on for the expression and understanding of meaning in context. A test of language use had to involve both of these features, neither of which was felt to be captured in the discrete point tradition of testing. Further, Oller proposed what came to be known as the Unitary Competence Hypothesis, that is, that performance on a whole range of tests (which he termed pragmatic tests) depended on the same underlying capacity in the learner: the ability to integrate grammatical, lexical, contextual, and pragmatic knowledge in test performance. He argued that certain kinds of more economical and efficient tests, particularly the cloze test (a gap-filling reading test), measured the same kinds of skills as those tested in productive tests of the types listed above. It was argued that a cloze test was an appropriate substitute for a test of productive skills because it required readers to integrate grammatical, lexical, contextual, and pragmatic knowledge in order to be able to supply the missing words. A cloze test was a reading test, consisting of a text of approximately 400 words in length. After an introductory sentence or two which was left intact, words were systematically removed (every 5th, 6th, or 7th word was a typical procedure) and replaced with a blank. The task was for the reader to supply the missing word. Various scoring methods (exact word replacement, any acceptable word replacement) were tried out and seemed to provide much the same information about the relative abilities of readers. Such tests were easy to construct, relatively easy to score, were based on a compelling theory of language use, and seemed an attractive alternative to more elaborate and expensive tests of the productive skills of speaking and writing. The cloze thus became a very popular form of test in the 1970s and early 1980s (and is still widely used today).
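The construction procedure just described is mechanical enough to automate. Below is a minimal sketch of a cloze generator along those lines (lead-in left intact, every nth word thereafter deleted); the function name and the crude word-count approximation of 'an introductory sentence or two' are my own choices, and the sample text is invented.

```python
# A minimal sketch of cloze test construction: keep a lead-in intact,
# then delete every nth word and replace it with a numbered blank.

def make_cloze(text, n=7, lead_in_words=15):
    """Return (gapped_text, deleted_words).

    `lead_in_words` crudely approximates 'an introductory sentence
    or two left intact'; a real test would split on sentence
    boundaries and use a text of around 400 words.
    """
    words = text.split()
    gapped, deleted = [], []
    for i, word in enumerate(words):
        # Delete the nth, 2nth, 3nth ... word after the lead-in.
        if i >= lead_in_words and (i - lead_in_words) % n == n - 1:
            deleted.append(word)
            gapped.append(f"({len(deleted)}) ______")
        else:
            gapped.append(word)
    return " ".join(gapped), deleted

sample = ("Testing is a universal feature of social life. Throughout "
          "history people have been put to the test to prove their "
          "capabilities or to establish their credentials, and in modern "
          "societies such tests have proliferated rapidly.")
gapped, answers = make_cloze(sample, n=6)
print(gapped)
print(answers)  # exact-word scoring compares responses to this list
```

The two scoring methods mentioned above differ only in the comparison step: exact word scoring checks a response against the deleted word verbatim, while acceptable word scoring checks it against a list of acceptable alternatives for each gap.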
Unfortunately, further work soon showed that cloze tests on the whole seemed mostly to be measuring the same kinds of things as discrete point tests of grammar and vocabulary. It seems that there are no short cuts in the testing of communicative skills.
Communicative language tests
From the early 1970s, a new theory of language and language use began to exert a significant influence on language teaching and potentially on language testing. This was Hymes's theory of communicative competence, which greatly expanded the scope of what was covered by an understanding of language and the ability to use language in context, particularly in terms of the social demands of performance. Hymes saw that knowing a language was more than knowing its rules of grammar. There were culturally specific rules of use which related the language used to features of the communicative context. For example, ways of speaking or writing appropriate to communication with close friends may not be the same as those used in communicating with strangers, or in professional contexts. Although the relevance of Hymes's theory to language testing was recognized more or less immediately on its appearance, it took a decade for its actual impact on practice to be felt, in the development of communicative language tests. Communicative language tests ultimately came to have two features:

1 They were performance tests, requiring assessment to be carried out when the learner or candidate was engaged in an extended act of communication, either receptive or productive, or both.

2 They paid attention to the social roles candidates were likely to assume in real world settings, and offered a means of specifying the demands of such roles in detail.

The second of these features distinguishes communicative language tests from the integrative/pragmatic testing tradition. The theory of communicative competence represented a profound shift from a psychological perspective on language, which sees language as an internal phenomenon, to a sociological one, focusing on the external, social functions of language.
Developments in Britain were particularly significant. The Royal Society of Arts developed influential examinations in English as a Foreign Language with innovative features such as the use of authentic texts and real world tasks; and the British Council and other authorities developed communicative tests of English as a Foreign Language for overseas students intending to study at British universities. These latter tests in some cases involved careful study of the communicative roles and tasks facing such students in Britain as the basis for test design; this stage of the process is known as a job analysis. This approach has continued to be used in the development of tests in occupational settings. For example, in the development of an Australian test of English as a second language for health professionals, those familiar with clinical situations in hospital settings were surveyed, and tasks such as communicating with patients, presenting cases to colleagues, and so on were identified and ranked according to criteria such as complexity, frequency, and importance as the basis for subsequent test task design. Test materials were then developed to simulate such roles and tasks where possible.
Models of communicative ability
The practical and imaginative response to the challenge of communicative language testing was matched by a continuing theoretical engagement with the idea of communicative competence and its implications for the performance requirement of communicative language testing. Various writers have tried to specify the components of communicative competence in second languages and their role in performance. This has been done in order to provide a comprehensive framework for test development and testing research, and a basis for the interpretation of test performance.
In their first form, such models specified the components of knowledge of language without dealing in detail with their role in performance. Various aspects of knowledge or competence were specified in the early 1980s by Michael Canale and Merrill Swain in Canada:

1 grammatical or formal competence, which covered the kind of knowledge (of systematic features of grammar, lexis, and phonology) familiar from the discrete point tradition of testing;

2 sociolinguistic competence, or knowledge of rules of language use in terms of what is appropriate to different types of interlocutors, in different settings, and on different topics;

3 strategic competence, or the ability to compensate in performance for incomplete or imperfect linguistic resources in a second language; and

4 discourse competence, or the ability to deal with extended use of language in context.

Note that strategic competence is oddly named as it is not a type of stored knowledge, as the first two aspects of competence appear to be, but a capacity for strategic behaviour in performance, which is likely to involve non-cognitive issues such as confidence, preparedness to take risks, and so on. Discourse competence similarly has elements of a general intellectual flexibility in negotiating meaning in discourse, in addition to a stored knowledge aspect, in this case, knowledge of the way in which links between different sentences or ideas in a text are explicitly marked, through the use of pronouns, conjunctions, and the like.
Further years of discussion and reflection on this framework have led to its more detailed reformulation. There has, to begin with, been a further specification of different components of knowledge that would appear to be included in communicative competence. Thus Lyle Bachman, for example, has identified subcategories of knowledge within the broader categories of grammatical, discourse, and sociolinguistic competencies. At the same time, strategic competence no longer features as a component of such knowledge. In fact, the notion of strategic competence remains crucial in understanding second language performance, but it has been reconceptualized. Instead of referring to a compensatory strategy for learners, it is seen as a more general phenomenon of language use. In this view, strategic competence is understood as a general reasoning ability which enables one to negotiate meaning in context.

This reworking of the idea of strategic competence has important implications for assessment. If strategic competence is not