part of a larger collective activity, one which is deliberate, constructed for a particular purpose. It involves the efforts of many
others in addition to the individual whose performance is 'in the
spotlight'.
This chapter presents a perspective on assessment which focuses
on the larger framing and social meaning of assessment. Such
a perspective has drawn on diverse fields including sociology,
political and cultural theory, and discourse analysis for its analytic tools and concepts, together with an expanded notion of test
validity.
The institutional character of assessment
The individualized and individualizing focus of traditional
approaches described so far is really rather surprising when we
consider the inherently institutional character of assessment.
When test reforms are introduced within the educational system,
they are likely to figure prominently in the press and become
matters of public concern. This is because they impinge directly
on people's lives. When an assessment is made, it is not done by
someone acting in a private capacity, motivated by personal
curiosity about the other individual, but in an institutional role,
and serving institutional purposes. These will typically involve
the fulfilment of policy objectives in education and other areas of
social policy. And social practice raises questions of social
responsibility.
Assessment and social policy
Language tests have a long history of use as instruments of social
and cultural exclusion. One of the earliest recorded instances is
the shibboleth test, mentioned in the Old Testament. Following a
decisive military battle between two neighbouring ethnic groups,
members of the vanquished group attempted to escape by blending in with their culturally and linguistically very similar victors.
The two groups spoke varieties of a single language and it was typically possible to distinguish between speakers of either variety by the way they pronounced words beginning with a sibilant
sound. The victors pronounced such words with an [sh] sound,
the vanquished with the sound [s]. So the word 'shibboleth' (meaning according to some authorities 'an ear of wheat', others 'a stream') was used as a single item language test by the victorious group in order to detect the enemy in their midst. Individuals
suspected of being members of the vanquished were asked to say
this word, and if they pronounced it 'sibboleth', they failed the
test. In this case, failure was fatal since the test-takers were immediately put to death. Poor performance on a test may have serious
consequences, though fortunately not usually as dire as this.
Notice that the test here is a test of authenticity of identity,
rather than of proficiency; a single instance is enough to betray
the identity which the test aims to detect. A more recent instance
of a detection test is the proposal in the 1960s, but never implemented, for a language test to be used by the Royal Canadian
Mounted Police to exclude homosexual recruits. Word lists
which included some items of homosexual slang (words such as
camp, cruise, fruit, and trade) would be presented to recruits, and
the sweatiness of their palms (a sign of nervousness) would be
measured electrically. It was assumed that only homosexuals
familiar with the subculture in which these terms were used, with
secondary slang meanings, would recognize and respond to the
ambiguity of the terms. They would become nervous, sweat, and
be detected. In this test, a perfect score was zero!
More conventional proficiency tests have also been used for
purposes of exclusion. Prior to the Second World War the
Australian Government used a language test as part of their policy to exclude immigrants other than those coming from the British Isles. Those applying to immigrate could be administered a dictation test in any language selected by the immigration officer. If the person passed the test in English, then any one of a
range of other languages could be used until the candidate failed.
In one notorious case, a Hungarian Jewish refugee from Hitler's
persecutions applied for immigrant status. He was a polyglot and
passed the test in a number of languages before finally failing in
Gaelic, thereby being refused entry and thus facing a tragic fate in
Europe. The blatancy of such a practice is not readily replicated
elsewhere, but it illustrates the possibility that language tests can
form part of a politically and morally objectionable policy.
Assessment and educational policy
Assessment serves policy functions in educational contexts, too.
One example is in the area of vocational education and training
for adults. Most industrialized countries have, in recent years,
responded to the need for the upgrading of the workforce in the
face of rapid technological change by developing more flexible
policies for the recognition and certification of specific work-related skills, each of which may be termed a competency. National competency frameworks, consisting of an ordered series of 'can-do' statements describing levels of performance on relevant job-related tasks, have been adopted. Language and literacy competency frameworks have been developed as part of these policies.
In international education, tests are used to control access to
educational opportunities. Typically, international students need
to meet a standard on a test of language for academic purposes
before they are admitted to the university of their choice. Is this
reasonable? Should access to educational opportunity be restricted
on the basis of a language test? If it is agreed that some assessment
of language ability is reasonable in this context, then questions
arise regarding the level of proficiency to be required, and how
this should be determined. Further, should the assessment of lan
guage proficiency be carried out within the context of perfor
mance on typical academic tasks? But then, does this not mean
that those who have had some experience of such tasks have the
advantage over those who do not? If this is so, then one might
question the fairness of such tasks as instruments for the testing of
language ability. One can also raise the question of how native
speakers might perform on such integrated tasks, and why, given
that they are admitted to the same courses of study, they should
not also be required to subject themselves to assessment.
The social responsibility of the language tester
The policies and practices discussed in the preceding two sections
throw up a host of questions about fairness, and about the policy
issues surrounding testing practice. They also raise the question
of the responsibilities of language testers. Recently, serious attention has been given to these issues for the first time, an overdue
development, one might say, given the essentially institutional
character of testing.
Imagine the following situation involving the use of language
tests within immigration policy. You live in an English-speaking
country which accepts substantial numbers of new settlers each
year. The current immigration policy distinguishes between categories of intending settlers. The claims of refugees are privileged in various ways, as are those of family members of local citizens (settled immigrants have the right to apply to bring into the country parents who are living in the country of origin). English language proficiency and knowledge of local cultural practices have
not been a criterion in selection in such cases. A further category of
individuals with no prior connection to the country, and who are
not refugees, may also apply for immigration; but the selection
process for them is much tougher-approximately only one in ten
who apply is granted permission to settle. Selection criteria for this
category of applicants include educational level, type of work
expertise, age, and proficiency in English, among other things.
English language proficiency is currently assessed informally by an
immigration officer at the time of interview. The immigration
authorities approach you to be part of a team commissioned with
the development of a specific test for the purpose of more accurately determining the proficiency of intending immigrants in this
category. What ethical issues do you face?
On the one hand, the advent of the new test might appear to
promote fairness. Obviously, as judgements in the current informal procedures are not made by trained language evaluators, and no quality control procedures are in place, there are inconsistencies in standards, and hence unfairness to individuals. A carefully constructed test, both more relevant in its content, and more reliable in its decisions, appears on the face of it to be fairer for the majority. On the other hand, the introduction of such an instrument raises worrying possibilities. Might not the authorities, once
it is in place, be tempted to use it on previously exempt categories,
for example refugees or family members? Who will be in charge
of interpreting scores on the test? Who will set cut-scores for
'passing' and 'failing'? In response to your inquiries on this point,
you are informed that cut-scores will vary according to the
requirements of immigration policy, higher when there is political
pressure to restrict immigration numbers, lower when there is a
labour shortage and immigrant numbers are set to rise. The political nature of the test is revealed by such facts-where does that
leave you as a socially responsible test developer? Should you
refuse to get involved?
Such cases raise issues of the ethics of language testing practice,
which are becoming a matter of considerable current debate. We
can distinguish two views, both of which acknowledge the social
and political role of tests. One holds that language testing practice
can be made ethical, and stresses the individual responsibility of testers to ensure that it is. The other sees tests as essentially
sociopolitical constructs, which, since they are designed as instruments of power and control, must therefore be subjected to the
same kind of critique as are all other political structures in society.
We may refer to the first view as ethical language testing; the latter
is usually termed critical language testing.
Ethical language testing
Those who argue that language testing can be an ethical activity
take either a broader or more restricted view of the ethics of
testing. We can call the former the social responsibility view, the
latter the traditional view.
Those who advocate the position of socially responsible language testing reject the view that language testing is merely a scientific and technical activity. They appeal to recent developments in thinking about validity, especially to the notion of consequential validity. In general, this means that evaluation of a
test's validity needs to take into account the wanted and
unwanted consequences that follow from the introduction of the
test. Some take the view that consequential validity, like validity
of other kinds (as discussed in Chapter 5 ), is the responsibility of
the test developer and needs to be taken into account, not only by
anticipating possible consequences in test design, but also by
monitoring its effects in implementation.
Generally, this expanded sense of responsibility sees ethical testing practice as involving test developers in taking responsibility for the effects of tests. There are three main areas of concern here. One of these is accountability. This has to do with a sense of responsibility to the people most immediately affected by the test, principally the test-takers, but also those who will use the information it provides. The test (and hence the test developer) needs to
be accountable to them. A second area relates to the influence that
testing has on teaching, the so-called washback effect. The third
involves a consideration of the effect of a test beyond the classroom, the ripples or waves it makes in the wider educational and
social world: what we can call the test impact.
Accountability
Ethical testing practice is seen as involving making tests accountable to test-takers. Test developers are typically more preoccupied
with satisfying the demands of those commissioning the test, and
with their own work of creating a workable test. Test-takers are
seldom represented on test development committees which supervise the work of test development, and represent the interests of stakeholders. Minimally, accountability would require test developers to provide test-takers with complete information on what is
expected of them in the test. Such information is often provided in
the form of a test users' handbook or manual, which provides
information on the rationale for the test and its structure, general
information on its content and the format of items, and sample
items.
More substantially, test developers should be required to
demonstrate that the test content and format are relevant to candidates, and that the testing practice is accountable to their needs and interests. Too often, traditional testing procedures and formats may be preferred even in situations where they are no longer
relevant. For example, British examinations originally developed
for the British secondary school system are still used in Africa,
despite the inappropriateness of their content and format.
An aspect of accountability is the question of determining the
norms of language behaviour which will act as a reference point
in the assessment. This will include issues such as the appropriate variety of the language to be tested. In an era where no single variety of English constitutes a norm everywhere, the question arises of how much of the variation among English speakers it is appropriate to include in a test.
Consider, for example, the TOEFL test, used primarily for selection of international students to universities in the United States. Given the diversity of varieties of English, both native and non-native, typically encountered in the academic environment there,
it might be argued that it is responsible to include examples of
those varieties in the test rather than to include only samples of
the standard variety.
Washback
The power of tests in determining the life chances of individuals
and in influencing the reputation of teachers and schools means
that they can have a strong influence on the curriculum. The effect
of tests on teaching and learning is known as test washback.
Ethical language testing practice, it is felt, should work to ensure positive washback from tests.
For example, it is sometimes argued that performance assessments have better washback than multiple choice test formats or other individual item formats, such as cloze, which focus on isolated elements of knowledge or skill. As performance assessments require integration of knowledge and skills in performance on realistic tasks, preparation for such assessments will presumably encourage teachers and students to spend time engaged in performance of such tasks as part of the teaching. In contrast, multiple choice format item tests of knowledge of grammar or vocabulary may inhibit communicative approaches to learning and teaching.
Authorities responsible for assessment sometimes use assessment
reform to drive curriculum reform, believing that the assessment
can be designed to have positive washback on the curriculum.
However, research both on the presumed negative washback of conservative test formats, and on the presumed positive washback of communicative assessment (assumed to be more progressive) has shown that washback is often rather unpredictable. Whether or not the desired effect is achieved will depend on local conditions in classrooms, the established traditions of teaching, the immediate motivation of learners, and the frequently unpredictable ways in which classroom interactions develop. These can only be established after the event, post hoc, on the basis of information collected once the reform has been introduced.
Test impact
Tests can also have effects beyond the classroom. The wider effect
of tests on the community as a whole, including the school, is
referred to as test impact. For example, the existence of tests such
as TOEFL, used as gatekeeping mechanisms for international education, and administered to huge numbers of candidates all over the world, has effects beyond the classroom, in terms of educational policy and the allocation of resources to education. In certain areas of the world, university selection is based directly on
performance in the assessments of the senior year of high school.
This has often led to the existence of tightly controlled formal
examinations, partly in order to make what tended to become a
very competitive assessment as psychometrically reliable as possible. However, in an era where most students are completing a
secondary education, such an assessment no longer meets the
needs of the majority of students. A curriculum and assessment
reform in favour of continuous assessment and the completion of
projects and assignments in such a case would have widespread
impact on families, universities, employers, and employment and
welfare services. In fact, in one such case, part of the impact of the
reform was to open the door to abuses of the assessment process
by wealthy families, who could afford to hire private tutors to
coach their children through the projects they had to complete in
order to gain the scores they needed to enter the university of their
choice. Test impact is likely to be complex and unpredictable.
Codes of professional ethics for language testers
In contrast to those advocating the direct social responsibility of
the tester, a more traditional approach involves limiting the social
responsibility of language testers to questions of the professional
ethics of their practice. In this view, the approach to the ethics of
language testing practice should be the same as that taken within
other areas of professional practice, such as medicine or law.
Professional bodies of language testers should formulate codes of
practice which will guide language testers in their work. The
emphasis is on good professional practice: that is, language testers should in general take responsibility for the development of quality language tests. The larger questions of the politics of
language testing fall not so much within the domain of the ethics
of language testing practice as such; instead they represent the
ethical questions that all citizens must face-for example, on
issues such as capital punishment, abortion and the like.
Those taking this view understand consequential validity as
concerning consequential impediments to the interpretability of
test scores. For example, in the case of the notorious Australian
dictation test discussed earlier, test developers were presumably
aware of the uses to which the test was to be put. But instead of
arguing that language testers have an ethical responsibility to
object to the policy behind the test in such a case, it may be sufficient (and arguably more effective) to oppose the test on the basis
of professional validity arguments. What is wrong with this test is
that there was only one acceptable inference possible from the
test: that the test-taker was unsuitable for acceptance into
Australia. Proficiency in the range of languages tested was not relevant to the question of the person's suitability for settlement in
Australia. The problem with the test, in this view, is that the test
construct is not meaningful or interpretable in this context. It is
not a valid test. The fact that it constitutes an offence against
social justice thus does not need to be addressed directly; rather,
the test is found wanting within an expanded theory of validity,
that is, one which includes consequential validity.
Critical language testing
A much more radical view of the social and political role of tests
is being formulated as part of the developing area known as critical
applied linguistics. This applies current social theory and critical
theory to issues within applied linguistics generally. Language testing, as a quintessentially institutional activity, is facing increasing
scrutiny from this perspective. The basic tenets of such a view are
that the principles and practices that have become established as
common sense or common knowledge are actually ideologically
loaded to favour those in power, and so need to be exposed as an
imposition on the powerless. In this view, there would be little
point in tinkering with existing institutional constructs, working
within the framework they determine. What is needed is a radical
reconstruction which changes the whole ideological foundations.
In this perspective the very concept of testing, of language or
anything else, gets redefined in socio-political terms. Critical
language testing is best understood as an intellectual project to
expose the role of tests in this exercise of power. For example, the
existence of language testing on a huge international scale-what
some have called industrialized language testing-is ripe for
critical analysis. There are hundreds of thousands of individual
administrations of the TOEFL test in any year, in a huge number
of countries; what are we to make of this phenomenon in critical terms?
From the perspective of critical language testing, the emphasis
in ethical language testing on the individual responsibility of the
language tester is misguided because it presupposes that this
would operate within the established institution of testing, and so
essentially accept the status quo and concede its legitimacy.
Critical language testing at its most radical is not reformist since
reform is a matter of modification not total replacement. At its
most radical indeed, it would not recognize testing as we know it
at all. Given this, it is perhaps unsurprising that language testers
themselves have found it difficult to articulate this critique, or
have interpreted it as implying the necessity for individual ethically responsible behaviour on the part of testers. The critique, if
and when it comes, may emerge most forcefully from outside the
field. Given the disciplinary borders of knowledge and influence
in the field, however, any criticism from outside may be heard
only with difficulty by practitioners within.
Conclusion
In this chapter we have examined the institutional character of
tests and the implications of this for understanding the nature of
language testing as a social practice, and the responsibility of language testers. Language testing, like language itself, cannot ultimately be isolated from wider social and political implications. It
is perhaps not surprising after all that the field has only belatedly
grasped this fact, and even now is uncertain about the extent to
which it is able or willing to articulate a thorough critique of its
practices. This may best be left to those not involved in language
testing. Language testers themselves meanwhile stand to benefit
from a greater awareness of language testing as a social practice.
It may lead to a more responsible exercise of the power of tests,
and a more deeply questioning approach to the questions of test
score meaning which lie at the heart of the validity of language
tests.
New directions - and dilemmas?
We live in a time of contradictions. The speed and impressiveness
of technological advance suggest an era of great certainty and
confidence. Yet at the same time current social theories under
mine our certainties, and have engendered a profound ques
tioning of existing assumptions about the self and its social
construction. Aspects of these contradictory trends also define
important points of change in language testing. The applications
of technological innovations in language testing remain for the
most part rooted in traditional modernist assumptions about the
nature of performance and the possibilities of measurement of
language ability. It is assumed, for example, that there is such a
thing as 'ability' which is located in the mind of the candidate,
which is, as it were, projected directly in performance; that the
individual candidate is solely responsible for his/her performance
in the test; and that ability can be measured more or less objectively. But it is these very individualizing modernist assumptions of testing practice which are now being challenged by new theories of performance. Language testing is a field in crisis, one which
is masked by the impressive appearance of technological advance.
Computers and language testing
Rapid developments in computer technology have had a major
impact on test delivery. Already, many important national and international language tests, including TOEFL, are moving to computer based testing (CBT). Stimulus texts and prompts are presented not in examination booklets but on the screen, with candidates being required to key in their responses. The advent of CBT has not necessarily involved any change in the test content,
which may remain quite conservative in its assumptions, but
often simply represents a change in test method.
The proponents of computer based testing can point to a number of advantages. First, scoring of fixed response items can be done automatically, and the candidate can be given a score immediately. Second, the computer can deliver tests that are tailored to
the particular abilities of the candidate. It seems inefficient for all
candidates to take all the questions on a test; clearly some are so
easy for some candidates that they provide little information on
their abilities; others are too hard to be of use. It makes sense to
use the very limited time available for testing to focus on those
items that are just within, and just beyond a candidate's threshold
of ability.
Computer adaptive tests do just this. At the beginning of the
test, a small number of common items are presented to all candidates. Depending on how an individual candidate performs on those items, he/she is subsequently presented only with items estimated to be within his/her likely ability range. The computer
updates its estimate of the candidate's ability after each response.
In this way, the test adapts itself to the candidate. Such tests
require the prior creation of an item bank, a large group of items
which have been thoroughly trialled, and whose likely difficulty
for candidates at given levels of ability has been estimated as precisely as possible.
Items are drawn from the item bank in response to the performance of the candidate on each item, until a point where a stable
and precise estimate of the candidate's ability is achieved. In this
way each candidate will receive a test consisting of a possibly
unique combination of items from the bank, a test suited precisely
to the candidate's ability. The existence of large item banks makes
possible a third advantage of computer based testing. Tests can be
provided on demand, because so many item combinations are possible that test security is not compromised. Computer adaptive tests
of grammar and vocabulary have long been available, but recently
similar tests of listening and reading skills have been developed.
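To make the logic of such an adaptive loop concrete, here is a minimal sketch of how a test of this kind might select items and update its estimate of a candidate's ability. It is an illustration only, not a description of any operational system: it assumes a simple one-parameter (Rasch) model in which each item in the bank is characterized solely by a difficulty value, and all the names in it (adaptive_test, answer_item, and so on) are hypothetical.

```python
import math
import random

def rasch_probability(ability, difficulty):
    """Probability of a correct response under a one-parameter (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, ability=0.0, iterations=20):
    """Newton-Raphson maximum-likelihood update of the ability estimate
    from (difficulty, correct) response pairs, clamped to a sensible range."""
    for _ in range(iterations):
        probs = [rasch_probability(ability, d) for d, _ in responses]
        gradient = sum((1.0 if c else 0.0) - p for (_, c), p in zip(responses, probs))
        curvature = sum(p * (1.0 - p) for p in probs)
        if curvature < 1e-6:
            break
        ability = max(-4.0, min(4.0, ability + gradient / curvature))
    return ability

def adaptive_test(item_bank, answer_item, max_items=20, precision=0.3):
    """Administer the unused item whose difficulty best matches the current
    ability estimate, stopping once the estimate has stabilized."""
    ability = 0.0
    responses = []
    remaining = list(item_bank)
    while remaining and len(responses) < max_items:
        item = min(remaining, key=lambda d: abs(d - ability))
        remaining.remove(item)
        correct = answer_item(item)          # administer the item
        responses.append((item, correct))
        new_ability = estimate_ability(responses, ability)
        stable = abs(new_ability - ability) < precision and len(responses) >= 5
        ability = new_ability
        if stable:
            break
    return ability

if __name__ == "__main__":
    # Simulate a candidate of true ability 1.0 on a bank of 100 trialled items.
    bank = [random.uniform(-3, 3) for _ in range(100)]
    simulate = lambda d: random.random() < rasch_probability(1.0, d)
    print(round(adaptive_test(bank, simulate), 2))
```

Operational systems rely on much larger calibrated banks and more sophisticated estimation and content-balancing rules, but the stopping logic, an estimate that no longer changes appreciably, is the same in spirit.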
The use of computers for the delivery of test materials raises
questions of validity, as we might expect. For example, different
levels of familiarity with computers will affect people's performance with them, and interaction with the computer may be a
stressful experience for some. Attempts are usually made to
reduce the impact of prior experience by the provision of an
extensive tutorial on relevant skills as part of the test (that is, before the test proper begins). Nevertheless, the question about
the impact of computer delivery still remains.
Questions about the importance of different kinds of presentation format are raised or exacerbated by the use of computers. In
a writing test, the written product will appear in typeface and will
not be handwritten; in a reading test, the text to be read will
appear on a screen, not on paper. Do raters react differentially to
printed versus handwritten texts? Is any inference we might draw
about a person's ability to read texts presented on computer
screens generalizable to that person's ability to read texts printed
on paper, and vice versa? In computerized tests of written composition, composing processes are likely to be different, because of word processing capacities available on the computer. Do such differences in aspects of test method result in different conclusions about a candidate's ability? A complex programme of
research is needed to answer these questions.
The ability of computers to carry out various kinds of automatic processes on spoken or written texts is having an impact on
testing. These will include the ability to do rapid counts of the
number of tokens of individual words, to analyse the grammar of
sentences, to count pauses, to calculate the range of vocabulary,
and to analyse features of pronunciation. Already these automatic
measures of pronunciation or writing quality are being used in
place of a second human rating of performances, and have been
found to contribute as much to overall reliability as a human rating.
Of course, such computer operations have limitations. For example,
in the testing of speaking, they are bound to be better at acoustic
than auditory aspects of pronunciation, and cannot readily identify intelligibility since this is a function of unpredictable contextual factors. Nevertheless, we can expect many further rapid
advances in these fields, with direct application to testing.
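As a rough illustration of the simplest of these automatic counts, the sketch below computes a token count, a type-token ratio as a crude index of vocabulary range, and mean sentence length for a written text. The function name and the choice of measures are hypothetical; operational scoring engines rely on far richer analyses (parsing, pause detection, acoustic modelling) than anything shown here.

```python
import re

def text_measures(text):
    """Very simple automatic measures of a written performance:
    token count, vocabulary range (type-token ratio), and mean
    sentence length."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }

if __name__ == "__main__":
    sample = "The test was long. The test was very long and very hard."
    print(text_measures(sample))
```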
Technology and the testing of speaking
While computers represent the most rapid point of technological
change, other less complex technologies, which have been in use
for some time, have led to similar validity questions.
Tape recorders can be used in the administration of speaking
tests. Candidates are presented with a prompt on tape, and are
asked to respond as if they were talking to a person, the response
being recorded on tape. This performance is then scored from
the tape. Such a test is called a semi-direct test of speaking, as
compared with a direct test format such as a live face-to-face
interview.
But not everybody likes speaking to tapes! We all know the difficulty many people experience in leaving messages on answering machines. Most test-takers prefer a direct rather than a semi-direct format if given the choice. But the question then arises as to whether these options are equivalent in testing terms. How far can you infer the same ability from performance on different formats? It is possible for somebody to be voluble in direct face-to-face interaction but tongue-tied when confronted with a machine, and vice versa. Research looking at the performance of the same candidates under each condition has shown that this is a complex issue, as not all candidates react in the same way (hardly surprising, of course). Some candidates prefer the tape, some prefer a live interlocutor, and performance generally improves in the condition that is preferred. But we must also add the interlocutor factor. Some candidates get on well with particular interlocutors,
others are inhibited by them. And there is the rater factor. Some
raters react negatively to tapes, and to particular interlocutors,
and may, without realizing it, either compensate or 'punish' the
candidate when giving their ratings.
Given such issues, why are semi-direct tests used? Cost considerations and the logistics of mass test administration are likely to
favour their use.
The semi-direct format is cheaper to administer, as a live interlocutor (the person who interacts with the candidate) does not have to be provided. On the other hand, the fact that the tape still has to be individually rated means that the test is by no means inexpensive; and in many face-to-face speaking tests the interlocutor and the rater are the same person, so that no real saving is achieved. In addition, the preparation of the tape and the supply of recording equipment is expensive. Nevertheless, in appropriate circumstances, considerable economies can be achieved. A further advantage is that in cases of languages where there are only a
small number of candidates presenting for assessment at any one
time, testing can be provided virtually on demand in any location.
This would not be possible if a trained interlocutor for that language had to be found. Finally, research has demonstrated that
the interlocutor you interact with may affect your score. Some
interlocutors elicit performances which trigger a favourable
impression of the candidate; others have the reverse effect. The
problem is that raters typically don't realize that it is the interlocutor's behaviour which is contributing to the impression generated-a classic case of 'blame the victim'. As a semi-direct test
removes the interlocutor variable-all candidates face the same
prompt, delivered by tape-it might be felt that the semi-direct
test has the potential to be a fairer test.
The issues raised by semi-direct tests of speaking are rapidly
becoming more urgent as pressure to make tests more communicative leads to an increased demand for speaking tests. But such
tests can often only feasibly be provided in a semi-direct format,
given huge numbers of candidates sitting for the test in a large
number of countries worldwide, as for example with a test such as
TOEFL. The issue here is a fundamental one. It illustrates the tension between the feasibility of tests (the need to design and administer them practically and cheaply if they are to be of any use at
all), and their validity. There are three basic critical dimensions of
tests (validity, reliability, and feasibility) whose demands need to
be balanced. The right balance will depend on the test context and
test purpose.
Dilemmas: whose performance?
The speed of technological advances affecting language testing
sometimes gives an impression of a field confidently moving
ahead, notwithstanding the issues of validity raised above. But
concomitantly the change in perspective from the individual to
the social nature of test performance has provoked something of
an intellectual crisis in the field. In Chapter 7 we looked at the social nature of test performance in a larger political and cultural sense; here we will examine the social character of performance at a more micro level, at the level of interaction. Developments in discourse analysis and pragmatics have revealed the essential interactivity of all communication. This is especially clear in relation
to the assessment of speaking. The problem is that of isolating the
contribution of a single individual (the candidate) in a joint communicative activity. As soon as you try to test use (as opposed to
usage) you cannot confine yourself to the single individual. So
whose performance are we assessing?
Take the following example. A Thai nurse working with
elderly patients in an American geriatric hospital setting is liked
and respected by her patients and supervising colleagues, and is
effective in her work despite glaring deficiencies in her grammar,
vocabulary and pronunciation in English. The people she communicates with expect to have to take some responsibility for the success of the communication, in view of her limited English proficiency. They contribute through the active process of drawing inferences from what she has said, checking that they have understood, and seeking clarification in various ways. All of these activities on their part contribute to successful communication with
her. Her professional knowledge of nursing is excellent, and this
helps in the framing of her communication, to make it relevant.
With her professional competence, pleasant personality, and the
need for her interlocutors to communicate with her, clinical communication seems to be successful; there is no reason to exclude her from the workplace, even though this might be suggested by a 'cold' assessment of her communication in the absence of an interlocutor, and in non-clinical contexts.
A contrasting example. A nurse from Hong Kong, a native speaker of Cantonese and a competent speaker of English by most standards, is at the centre of a controversy in a hospital in an English-speaking country. A sudden emergency with a patient in the ward requires the nurse to make a telephone call to the receptionist, a native speaker of English, for urgent help. The receptionist claims not to be able to understand the nurse, the message does not get through, and the patient dies. It turns out that the receptionist has a reputation for being racist. Is it possible that she in a sense refused to understand? Whatever the explanation, communication did not take place. Whom should we blame for this breakdown?
In each of these examples, it is not clear who is responsible for
the success or failure of the communication. It seems that success
or failure is a joint achievement: the communication is a co-construction. In assessment, should we not then take the interlocutor into account in our predictions of successful communication? But how can that be done? And how can this be made to fit the institutional need for a score about individual candidates on their own, not about individuals and their interlocutors? Is proficiency best understood as something that individuals carry round in their heads with them, or does it only exist in actual performances, which are never solo? Note that the issue of the joint
responsibility for communication raised here relates not only to
communication involving non-native speakers; it is equally relevant for communication between native speakers. What is at issue
here are general pragmatic conditions of normal communication,
and the difficulty of pinning them down in any testing procedure.
This is then another fundamental dilemma for language testing.
The issues raised here show the way in which language testing,
as in other fields of assessment, is crucially dependent on definitions of the test construct. It is thus, in a way, vulnerable to our
evolving understanding of language and communication, and
cannot be protected by its success in other aspects, for example
advances in the technical aspects of psychometrics or in the technology of assessment. The disconcerting aspect of the current situation is that a growing loss of confidence in the possibility or even desirability of locating competence in the individual, as illustrated in the examples presented above, seems to challenge the very adequacy of our current theories of measurement, with their promise of providing a single summary score as the basis for the reliable classification decision that we seek. Instead of the individual carrying a measurable proficiency round in his or her head, we
have a multiplicity of selves in interaction in a multiplicity of
interactional contexts. How can measurement do justice to this?
And in the dazzle of technological advance, we may need a continuing reminder of the nature of communication as a shared
human activity, and that the idea that one of the participants can
be replaced by a machine is really a technological fantasy.
Language testing remains a complex and perplexing activity.
While insights from evolving theories of communication may be
disconcerting, it is necessary to fully grasp them and the challenge
they pose if our assessments are to have any chance of having the
meaning we intend them to have. Language testing is an uncertain
and approximate business at the best of times, even if to the outsider this may be camouflaged by its impressive, even daunting,
technical (and technological) trappings, not to mention the
authority of the institutions whose goals tests serve. Every test is
vulnerable to good questions, about language and language use,
about measurement, about test procedures, and about the uses to
which the information in tests is to be put. In particular, a language test is only as good as the theory of language on which it is
based, and it is within this area of theoretical inquiry into the
essential nature of language and communication that we need to
develop our ability to ask the next question. And the next.
SECTION 2
Readings
Chapter 1
Testing, testing ... What is a language test?

Text 1
ALAN DAVIES: 'The construction of language tests' in J.P.B. Allen and Alan Davies (eds.): The Edinburgh Course in Applied Linguistics Volume 4: Testing and Experimental Methods. Oxford University Press 1977, pages 45-46
In this paper, Davies distinguishes four important uses or
functions of language tests: achievement, proficiency, aptitude, and diagnostic. In this extract he discusses the first two
of these.
Achievement
Achievement or attainment tests are concerned with assessing
what has been learned of a known syllabus. This may be within a
school or within a total educational system. Thus the typical
external school examinations ('Ordinary' level or 'Advanced'
level in England, 'Highers' in Scotland), the university degree
exams and so on are all examples of achievement tests. The use
being made of the measure is to find out just how much has been
learned of what has been taught (i.e., of the syllabus).
Achievement type tests end there. Although the primary interest is in the past, i.e. what has been learned, very often some further use
is made of the same test in order to make meaningful decisions
about the pupils' future. It would, presumably, be possible to be
interested entirely in the past of the pupils; Carroll's 'meaningful
decisions' then would refer to the syllabus, i.e., to any necessary
alterations to it that might be necessary or to the teaching method
to be used for the next group of students. But achievement tests
are almost always used for other purposes as well. It is important
to recognize this and to account for it in one's test construction.
But, as will be maintained later under validity, this is essentially a
function of the syllabus. All that an achievement test can do is to
indicate how much of a syllabus has been learned; it cannot make
predictions as to pupils' future performance unless the syllabus
has been expressly designed for this purpose.
▷
What are some of the functions of the examinations Davies
mentions (external school examinations, university degree
examinations) other than looking back over what has been
learned?
▷
What 'future performance' does the writer have in mind? In
what way can the design of a syllabus be used as the basis for
predictions as to pupils' future performance?
Proficiency
Proficiency tests, as we see it, are concerned with assessing what
has been learned of a known or an unknown syllabus. Here we see
the distinction between proficiency and achievement. In the non-language field we might consider, say, a driving test as a kind of proficiency test since there is the desire to apply a common standard to all who present themselves whatever their previous driving
experience, over which of course there has been no control at all.
In the language field there are several well-known proficiency
exams of the same journeyman kind: the Cambridge Proficiency
Exams, the Michigan Tests, the Test of English as a Foreign
Language (TOEFL) and English Proficiency Test Battery (EPTB).
These all imply that a common standard is being applied to all
comers. More sophisticated proficiency tests (more sophisticated
in use, not in design) may be constructed as research tools to determine just how much control over a language is needed for certain
purposes, for example medical studies in a second language.
▷
How does the fact that a proficiency test may relate to an
unknown syllabus serve as the basis for a distinction from
achievement tests?
▷
If syllabus content is absent as a basis for the content of a proficiency test, how can we decide what it should contain?
Chapter 2
Communication and the design of language tests

Text 2
ROBERT LADO: Language Testing: The Construction and Use of Foreign Language Tests. Longmans 1961, pages 22-24
Lado presents the case for basing language tests on a theory of
language description and a theory of learning, in particular on
the points of structural contrast between the learner's first language and the target language. His recommendations about testing dominated practice for nearly twenty years, and are still influential in powerful tests such as TOEFL.
The theory of language testing assumes that language is a system
of habits of communication. These habits permit the communicant to give his conscious attention to the over-all meaning he is
conveying or perceiving. These habits involve matters of form,
meaning and distribution at several levels of structure, namely
those of the sentence, clause, phrase, word, morpheme and
phoneme. Within these levels are structures of modification,
sequence, parts of sentences. Below them are habits of articulation, syllable type, and collocations. Associated with them and sometimes part of them are patterns of intonation, stress and rhythm ...
The individual is not aware that so much of what he does in
using language is done through a complex system of habits. When
he attempts to communicate in a foreign language that he knows