<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0123-4641</journal-id>
<journal-title><![CDATA[Colombian Applied Linguistics Journal]]></journal-title>
<abbrev-journal-title><![CDATA[Colomb. Appl. Linguist. J.]]></abbrev-journal-title>
<issn>0123-4641</issn>
<publisher>
<publisher-name><![CDATA[Facultad de Ciencias y Educación de la Universidad Distrital, Bogotá Colombia]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0123-46412014000200004</article-id>
<article-id pub-id-type="doi">10.14483/udistrital.jour.calj.2014.2.a03</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Classical test theory and item response theory: Two understandings of one high-stakes performance exam]]></article-title>
<article-title xml:lang="es"><![CDATA[Teoría clásica de la evaluación y teoría de respuesta al ítem: dos comprensiones de un examen avanzado de proficiencia]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Janssen]]></surname>
<given-names><![CDATA[Gerriet]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Meier]]></surname>
<given-names><![CDATA[Valerie]]></given-names>
</name>
<xref ref-type="aff" rid="A03"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Trace]]></surname>
<given-names><![CDATA[Jonathan]]></given-names>
</name>
<xref ref-type="aff" rid="A04"/>
</contrib>
</contrib-group>
<aff id="A02">
<institution><![CDATA[Universidad de los Andes]]></institution>
<addr-line><![CDATA[Bogotá ]]></addr-line>
<country>Colombia</country>
</aff>
<aff id="A03">
<institution><![CDATA[University of California]]></institution>
<addr-line><![CDATA[Santa Barbara ]]></addr-line>
<country>United States</country>
</aff>
<aff id="A04">
<institution><![CDATA[University of Hawai'i]]></institution>
<addr-line><![CDATA[Mānoa]]></addr-line>
<country>United States</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2014</year>
</pub-date>
<volume>16</volume>
<numero>2</numero>
<fpage>167</fpage>
<lpage>184</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.co/scielo.php?script=sci_arttext&amp;pid=S0123-46412014000200004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.co/scielo.php?script=sci_abstract&amp;pid=S0123-46412014000200004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.co/scielo.php?script=sci_pdf&amp;pid=S0123-46412014000200004&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[Language testing professionals and teacher educators have articulated the need for a broad variety stakeholders--including classroom teachers-- to develop assessment literacy. In this paper, we argue that when teachers are involved in local assessment development projects, they can expand their assessment knowledge and skills beyond what is necessary for conducting principled classroom assessments. We further claim that a particular analytic approach, Rasch analysis, should be considered as one possible element of this expanded assessment literacy. To this end, we use placement exam data from one Colombian university to illustrate how analyses from item response theory perspectives (Rasch analysis) differ from, and can usefully complement classical test theory.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[Evaluadores de lengua y formadores de maestros argumentan que los involucrados en el campo de la educación, incluyendo los maestros de aula, deben desarrollar un conocimiento profundo en el tema de la evaluación. Planteamos que los profesores, a la hora de estar involucrados en el desarrollo de proyectos de evaluación, puedan expandir sus conocimientos y habilidades para ir más allá de la evaluación tradicional del aula. Para alcanzar este fin, proponemos que la herramienta de análisis Rasch sea considerada como una parte de esta expansión de conocimiento. En este ensayo, a través de los datos obtenidos de un examen de clasificación de lengua aplicado en un contexto universitario colombiano, ilustramos cómo el análisis Rasch puede complementar la teoría clásica de la evaluación.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[assessment literacy]]></kwd>
<kwd lng="en"><![CDATA[classical test theory]]></kwd>
<kwd lng="en"><![CDATA[item response theory]]></kwd>
<kwd lng="en"><![CDATA[language testing]]></kwd>
<kwd lng="en"><![CDATA[Rasch analysis]]></kwd>
<kwd lng="es"><![CDATA[evaluación de alfabetización]]></kwd>
<kwd lng="es"><![CDATA[teoría de evaluación clásica]]></kwd>
<kwd lng="es"><![CDATA[teoría de respuesta al ítem]]></kwd>
<kwd lng="es"><![CDATA[evaluación de lenguas]]></kwd>
<kwd lng="es"><![CDATA[análisis Rasch]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[<font size="2" face="Verdana"> <p align="left">DOI: <a href="http://dx.doi.org/10.14483/udistrital.jour.calj.2014.2.a03" target="_blank">http://dx.doi.org/10.14483/udistrital.jour.calj.2014.2.a03</a></p> <p align="right"><b>Research Article</b></p> <p align="center"><font size="4" face="Verdana"><b>Classical test theory and item response theory: Two understandings of one high-stakes performance exam<sup>1</sup></b></font></p> <p align="center"><font size="3" face="Verdana"><b>Teor&iacute;a cl&aacute;sica de la evaluaci&oacute;n y teor&iacute;a de respuesta al &iacute;tem: dos comprensiones de un examen avanzado de proficiencia</b></font></p> <p align="left">&nbsp;</p> <p align="left"><b>Gerriet Janssen, M.A.<sup>2</sup>, Valerie Meier, M.A.<sup>3</sup>, and Jonathan Trace, M.A.<sup>4</sup></b></p> <p><sup>1</sup> The research presented in this paper was supported by funding from Universidad de Los Andes, Departamento de Lenguajes y Estudios Socioculturales, and the University of Hawai&#39;i, M&#257;noa, Graduate Student Organization. <br> <sup>2</sup> Universidad de los Andes, Bogot&aacute;, Colombia and University of Hawai&#39;i, M&#257;noa, Honolulu, United States. <a href="mailto:gjanssen@hawaii.edu">gjanssen@hawaii.edu</a> <br> <sup>3</sup> University of California, Santa Barbara, United States. <a href="mailto:valmeier@gmail.com">valmeier@gmail.com</a> ]]></body>
<body><![CDATA[<br> <sup>4</sup> University of Hawai&#39;i, M&#257;noa, Honolulu, United States. <a href="mailto:jtrace@hawaii.edu">jtrace@hawaii.edu</a></p> <p>Citation / Para citar este art&iacute;culo: Janssen, G., Meier, V., &amp; Trace, J. (2014). Classical Test Theory and Item Response Theory: Two understandings of one high-stakes performance exam. <i>Colombian Applied Linguistics Journal, 16</i>(2), 167-184.</p> <hr> <p>Received: 15-Nov-2013 / Accepted: 15-Jun-2014</p> <p><b><font size="3">Abstract</font></b></p> <p>Language testing professionals and teacher educators have articulated the need for a broad variety of stakeholders--including classroom teachers--to develop assessment literacy. In this paper, we argue that when teachers are involved in local assessment development projects, they can expand their assessment knowledge and skills beyond what is necessary for conducting principled classroom assessments. We further claim that a particular analytic approach, Rasch analysis, should be considered as one possible element of this expanded assessment literacy. To this end, we use placement exam data from one Colombian university to illustrate how analyses from item response theory perspectives (Rasch analysis) differ from, and can usefully complement, classical test theory. </p> <p><b>Keywords: </b>assessment literacy, classical test theory, item response theory, language testing, Rasch analysis</p> <hr> <p><b><font size="3">Resumen</font></b> <br> Evaluadores de lengua y formadores de maestros argumentan que los involucrados en el campo de la educaci&oacute;n, incluyendo los maestros de aula, deben desarrollar un conocimiento profundo en el tema de la evaluaci&oacute;n. Planteamos que los profesores, a la hora de estar involucrados en el desarrollo de proyectos de evaluaci&oacute;n, puedan expandir sus conocimientos y habilidades para ir m&aacute;s all&aacute; de la evaluaci&oacute;n tradicional del aula. Para alcanzar este fin, proponemos que la herramienta de an&aacute;lisis Rasch sea considerada como una parte de esta expansi&oacute;n de conocimiento. En este ensayo, a trav&eacute;s de los datos obtenidos de un examen de clasificaci&oacute;n de lengua aplicado en un contexto universitario colombiano, ilustramos c&oacute;mo el an&aacute;lisis Rasch puede complementar la teor&iacute;a cl&aacute;sica de la evaluaci&oacute;n.</p> <p><b>Palabras clave: </b>evaluaci&oacute;n de alfabetizaci&oacute;n, teor&iacute;a de evaluaci&oacute;n cl&aacute;sica, teor&iacute;a de respuesta al &iacute;tem, evaluaci&oacute;n de lenguas, an&aacute;lisis Rasch.</p> <hr> <p><b><font size="3">Introduction</font></b></p> ]]></body>
<body><![CDATA[<p>As high-stakes tests, including  language tests, become ever more ubiquitous and influential, language  assessment professionals have articulated the need for a broad variety of  stakeholders to develop assessment literacy. Taylor  (2009) describes assessment literacy as &quot;an  appropriate balance of technical know-how, practical skills, theoretical  knowledge, and understanding of principles &hellip; all firmly contextualized within a  sound understanding of the role and function of assessment within education and  society&quot; (p. 27). Teacher educators have recognized that appropriate assessment  practices are integral to teaching and learning, even though these practices  are often inadequately employed; this has prompted these educators to argue  that developing assessment literacy be a central goal of pre-service teacher education  and professional development (Popham, 2009). Popham  (2009) argues that teachers&#39; assessment  literacy must encompass the skills and knowledge necessary to make defensible  decisions about (a) the high-stakes tests increasingly (mis)used on behalf of  standards-based accountability movements, as well as (b) classroom-based  assessment that can be used to enhance teaching and learning. Regarding  assessment literacy, Colombian scholars L&oacute;pez  Mendoza and Bernal Arandia (2009) have  suggested that in order to support language learning, language teachers need to  develop the competencies necessary to &quot;develop, use, score, and interpret&quot;  classroom assessments. </p>     <p>While teachers  undoubtedly need to understand how their classroom can be impacted by the  summative uses of high-stakes testing and the formative uses of classroom  assessments, some or perhaps many teachers find themselves involved in contexts  that require an even greater range of knowledge and skills. This is  particularly the case when teachers become involved in developing  program-level, norm-referenced tests, such as placement exams. The  construction, principled use, and systematic evaluation of such tests often  require a more sophisticated set of conceptual and empirical tools than what is  typically needed when planning and implementing classroom assessments; the  responsibility for such tests often rests with local program insiders,  including classroom teachers, rather than external testing experts. </p>     <p>We argue that involvement in local  test development projects, particularly when they are a part of  internally-motivated accountability efforts, can be an excellent catalyst for  developing assessment literacy in terms of the knowledge Taylor  (2009) described above: &quot;technical know-how,  practical skills, theoretical knowledge, and understanding of principles.&quot; For  example, in the <i>15</i>(1) issue of CALJ, Janssen  and Meier (2013) described how local stakeholders made  gains in all of these areas when they participated in an &quot;iterative,  self-reflective, test development process &#91;that&#93; provide&#91;d&#93; opportunities for  professional development and deeper engagement in accountability projects&quot; (p.  100). Since most test development processes for placement exams employed at the  institutional level are necessarily iterative in that these processes typically  consist of phases of trialing and operationalization (Bachman  &amp; Palmer, 2010, pp. 
144-145; Kane, 2013), the multiple iterations of test development and analysis provided local stakeholders with repeated opportunities to make gains in their assessment literacy. </p> <p>In this paper, we would like to propose that projects concerning placement tests specifically provide language teachers the opportunity to further extend their assessment literacy because the high-stakes nature of placement tests requires items that both &quot;fit&quot; and &quot;function&quot; with the intended test taker population (see the section on CTT below)<sup>5</sup>. We propose here that item response theory analyses complement the basic CTT techniques presented in Janssen and Meier (2013): descriptive statistics, estimates of reliability, and other measures of classical test theory. Item response theory provides powerful analytical tools that, even in their most basic applications, can be a valuable option in the analysis of local, high-stakes tests. To this end, the present paper seeks (a) to provide readers with an introduction to test analyses from item response theory perspectives and (b) to answer the following research questions: How can basic item response theory analyses be used to evaluate the performance of a norm-referenced placement test, and how do the results of such an analysis compare with those of classical test theory on one Colombian high-stakes placement test? </p> <p>We begin this paper with a brief overview of the theories behind classical test theory (CTT) and item response theory (IRT) analyses and then address our research questions by using data from one Colombian high-stakes placement exam. We hope to share our enthusiasm for this approach with language teachers, administrators, and others involved in local test development and program evaluation efforts, so that they will be encouraged to apply item response theory analyses to their own projects and enhance their assessment literacy. </p> <p><b><font size="3">Literature Review: Classical Test Theory (CTT) and Item Response Theory (IRT)</font></b></p> <p><i>CTT and its Use in Test Analysis</i></p> <p>As the name would imply, Classical Test Theory (CTT) is one traditional way of understanding test scores. CTT is thought to be classical in that it is &quot;well-established, having resisted the erosion of time&quot; (Mu&ntilde;iz, 2003, p. 192), a quantitative approach that had its start in the early 20th century; still today, CTT&#39;s principles are &quot;alive and well&quot; in language assessment (Brown, 2013, p. 2)<sup>6</sup>. A central CTT concept is a test measurement&#39;s <i>reliability</i>: a measurement taken today should be nearly equivalent to one taken tomorrow, and there should be little <i>variance </i>or <i>error </i>in the scores. More specifically, CTT posits that underlying any <i>observed score </i>on a test is the test taker&#39;s <i>true score</i>. This true score would be very close to a test taker&#39;s average score if he or she could hypothetically take the same exam a very large number of times (obviously discounting any practice effects). The <i>true score variance </i>would be the variation in these true scores, variation that would exist even though true scores are conceived of as being free from measurement error.
Each observed score has its own variance (<i>observed score variance</i>), which is a cumulative result of problems in the environment, exam administration, scoring, poor test items, or examinee-related factors (Brown, 2013, p. 4). The difference between the observed score variance and the true score variance (i.e., how much the scores might vary when free of measurement error) is called the <i>error variance</i>. This relationship gives us equation (1) below, the cornerstone of CTT. </p> <blockquote>Observed Score Variance = True Score Variance + Error Variance. (1)</blockquote> <p>Given this basic relationship, CTT focuses on a variety of reliability measures that are available to language testers for assessing the consistency of their assessment instruments (cf. Cronbach&#39;s alpha, KR20, KR21, split-half reliability). These different reliability measures have been described in encyclopedia entries (cf. Brown, 2013, pp. 3-19; Sawaki, 2013) and are elegantly summarized in Brown&#39;s Table 1 (2013, p. 19). Further, in-depth coverage of these topics is offered in several canonical books on testing (cf. Bachman, 2004, pp. 153-170; Brown, 2005, pp. 177-181; Crocker &amp; Algina, 1986, pp. 105-152). </p> <p>Grounded in this understanding of reliability, CTT also provides measures for the analysis of individual test items. Two basic measures of item analysis are item facility (IF) and item discrimination (ID). IF&#8212;also called <i>item difficulty </i>and labeled as <i>p </i>(see Crocker &amp; Algina, 1986, p. 311)&#8212;is the percentage of students who answered a test item correctly: how easy the item was for the specific test population. In norm-referenced tests such as most placement tests, IF values should fall within the range of .30 (relatively difficult; 30% of the test takers answered the item correctly) to .70 (relatively easy; 70% of the test takers answered the item correctly) (Brown, 2005); the mean IF value should be approximately .50, so as to maximize the distribution of the test takers into different classifications. The IF statistic can be said to describe the degree to which a test &quot;fits&quot; the population it is being used with; IF values that differ widely from those suggested above would be evidence of a test not &quot;fitting&quot; the local population. </p> ]]></body>
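<body><![CDATA[<p>To make these quantities concrete, the short sketch below (our own illustration, not part of the original exam analysis) computes IF values and a KR-20 reliability estimate from a small, invented matrix of dichotomously scored responses; all data and variable names are hypothetical.</p> <pre>
# Minimal CTT sketch: item facility (IF) and KR-20 reliability for
# dichotomously scored (0/1) responses. All data are invented.

responses = [  # rows = test takers, columns = items
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]
n_persons = len(responses)
n_items = len(responses[0])

# IF: proportion of test takers answering each item correctly.
item_facility = [sum(row[i] for row in responses) / n_persons
                 for i in range(n_items)]

# KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores),
# where p is each item's IF and q = 1 - p.
totals = [sum(row) for row in responses]
mean_total = sum(totals) / n_persons
var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
sum_pq = sum(p * (1 - p) for p in item_facility)
kr20 = (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

print("IF values:", [round(p, 2) for p in item_facility])  # [0.8, 0.6, 0.2, 0.8]
print("KR-20:", round(kr20, 2))
</pre> ]]></body>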
<body><![CDATA[<p>The other test item statistic, ID, is a measurement of the degree to which an item separates the more proficient test takers from the less proficient test takers; ID values are a proxy for the degree to which a test item is &quot;functioning.&quot; Ideally, proficient test takers will answer an item correctly while less proficient test takers will not, which means that the assessment is functioning well. The ID statistic is calculated by subtracting the IF value for a predetermined percentage of the lowest performing test takers from the IF value for the same percentage of the top performing test takers (Bachman, 2004, p. 125; Brown, 2005, pp. 68-71). Crocker and Algina (1986) present different percentages for calculating ID values; we follow Brown (2005) and use the top third and lower third. Ebel (1979) has presented a set of guidelines for interpreting ID values; Ebel&#39;s guidelines are thought to be field standards and are reported in Brown (2005) as well as in Crocker and Algina (1986). ID values have a possible range between -1.0 and +1.0, with +1.0 indicating that the top group of test takers always answers the item correctly while the bottom group always answers the same item incorrectly. Conversely, a test item with an ID value of -1.0 would have the least able test takers always answering the item correctly while the most able test takers always answer the same item incorrectly&#8212;an item that truly is NOT functioning well for classification purposes! Ebel suggests that items with ID values of .40 and higher are excellent; .30-.39, reasonably good but potentially requiring modification; .20-.29, marginal and needing substantial revision; and .19 and below, poor and to be rejected or reworked. </p> <p><i>CTT in One Previous Study</i></p> <p>One study from the Colombian context that considers the use of CTT tools in the evaluation of an assessment instrument is Janssen and Meier&#39;s (2013). This article&#39;s appendix presents a selection of specific IF statistics for the exam these authors studied (p. 112). From the items that this appendix displays, one can calculate that the vocabulary (VO) and grammar (GR) items did not fit the test taker population well, as IF values ranged from 0.69-0.96, with the average IF value for these two sections being 0.81, notably above Brown&#39;s recommended ranges. However, the reading comprehension (RC) questions fit the population much better, with IF values ranging from 0.42-0.72 (with the exception of one outlying IF value of 0.23) and the average IF value for this section being 0.55.</p> <p>The appendix in Janssen and Meier (2013, p. 112) also presents a selection of specific ID statistics for the exam they studied. From the items that this appendix displays, one can calculate that the vocabulary (VO) and grammar (GR) items generally function adequately with this test taker population in terms of separating proficient from non-proficient test takers, as the ID values ranged from 0.05-0.55, with the average ID value for these two sections being 0.35. Here, it is worth noticing that the ID values for three items were quite low, which had the effect of lowering the average ID value to this still acceptably good level.
The reading comprehension (RC) questions functioned much better with this specific population; indeed, the ID values for the RC items ranged from 0.31-0.79 (with the exception of one outlying ID value of -0.17) and the average ID value for this section was 0.51. Janssen and Meier (2013) recommended that the four outlying items described above be considered for omission from the test item pool. </p> <p><i>CTT: Limitations</i></p> <p>Despite its usefulness, CTT has several important limitations that have led researchers to look for complementary approaches. Bachman (2004) describes five shortcomings of CTT, but here we are primarily concerned with one: item analysis from CTT perspectives &quot;is essentially sample-based descriptive statistics&quot; (Bachman, 2004, p. 139). This means that, for example, IF and ID values are only representative of the specific sample of examinees from which they were calculated, so that making generalizations across different groups of examinees&#8212;or across different test formats&#8212;may not be possible. Because of its dependence on a specific sample, it is difficult for CTT to handle the more complex assessment situations that occur with great regularity, such as measuring test taker performance at different points in time (pre/post); using different test forms which contain different items of different difficulty; or having raters assign scores to different elements of a performance exam. Still, CTT successfully completes the essential task of basic item analysis in a test development protocol for a homogenous population: it &quot;determine&#91;s&#93; flaws in test items &hellip; evaluate&#91;s&#93; the effectiveness of distracters &hellip; and determine&#91;s&#93; item statistics for use in subsequent test development work&quot; (Hambleton &amp; Dirir, 2003, p. 189). Though CTT provides a variety of easy-to-use tools which can be applied for a basic description of how a specific sample of test takers performed on a specific test, more complex analytic approaches are required for many language assessment situations. IRT-based analyses are one family of such analytical tools. </p> <p><i>Item Response Theory (IRT) and Its Use in Test Analysis</i></p> <p>At its core, Item Response Theory (IRT) addresses CTT&#39;s limitation of using descriptive units that are not comparable between different assessments or between different points within the same assessment. To examine this last point, consider what a 0.10 difference in IF values on one assessment represents: all one knows from comparing items with IF values of 0.45 and 0.55 is that 10% more test takers completed the second item correctly than the first, which is also the case for items with IF values of 0.10 and 0.20. What is not known is the relative difficulties of the items: it cannot be said that the first item is 10% more difficult than the second item. IRT analyses, however, do give us a way to exactly quantify the differences between item difficulties and even between test taker performances. </p> <p>To quantify the differences between two item difficulties (or two test taker performances), IRT uses as its metric a <i>derived measure</i>, a measure composed of two fundamental measurements. A derived measure that all readers should be somewhat familiar with is the concept of <i>density</i>, the combination of the fundamental measurements <i>mass </i>and <i>volume</i>.
In a similar way, IRT analyses use a  derived measure based on the probability that a test taker will correctly  answer an item of a certain level of difficulty. This derived metric is what  makes the family of IRT models so powerful, and it also allows for the  inclusion of many different relevant <i>facets </i>of the testing situation  into the statistical model. Among other things, facets such as item difficulty,  prompt difficulty, rubric category difficulty, test taker ability, or rater  severity can be included in one model and can be directly compared using a  single unit called a <i>logit</i>. </p>     <p>In this paper we  will address one of the simpler forms of IRT modeling, which can be easily used  with tests that produce dichotomous data (e.g., multiple choice items). This  basic analysis calculates the probability for a correct response based on the  relationship between an item&#39;s difficulty and a test taker&#39;s ability (Bond  &amp; Fox, 2007). In this model, test takers have a  50% chance of answering an item correctly when both their ability level and the  difficulty level of the item are equal. When changes occur in either item  difficulty or person ability, the probability shifts accordingly (i.e., less  person ability or more item difficulty will lead to a lower chance of success,  and vice-versa). Based on these probabilities, item difficulty and person  ability measures are calculated as logits and arranged along a true interval  scale. </p>     ]]></body>
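<body><![CDATA[<p>To illustrate this relationship, the following minimal sketch (our own illustration, with invented logit values) computes the dichotomous Rasch model&#39;s probability of a correct response, which equals .50 when ability and difficulty match and shifts as the two measures diverge.</p> <pre>
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of a correct response,
    given person ability and item difficulty in logits."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

# When ability equals item difficulty, the probability is exactly .50.
print(rasch_probability(1.0, 1.0))   # 0.5
# A person 1 logit above the item's difficulty succeeds ~73% of the time.
print(rasch_probability(2.0, 1.0))   # ~0.73
# A person 1 logit below it succeeds only ~27% of the time.
print(rasch_probability(0.0, 1.0))   # ~0.27
</pre> ]]></body>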
<body><![CDATA[<p>Using a probability model based on an interval scale makes it possible to understand how items perform independently of a specific sample, and one major benefit of IRT analyses is the ability to generalize findings about data that are considered to be unidimensional (Ellis &amp; Ross, 2014, p. 1270). For example, consider a test for which some of the items were working but others required revision. Using only CTT, we could identify problematic items, revise these items, and re-administer the test, but unless the new sample of test takers is nearly identical to the old one, we are likely to end up with different item statistics for those items that were not revised, making comparisons difficult if not impossible. Using IRT analyses&#8212;when the data are thought to be unidimensional&#8212;items can be directly compared within one exam administration or across multiple administrations, even when the two test forms do not share all of their items. Using IRT, it is possible to exactly quantify the difference in item difficulties in logit units. We can predict, for instance, how revised items might have performed for a sample of test takers who did not encounter them, as well as how items shared between the two groups performed in relation to the combined samples<sup>7</sup>. </p> <p>Because it constructs a probability-based model, IRT analyses require unidimensionality, as has been alluded to in the above paragraphs. What this means is that all of the items on a test or test section should measure a common factor, construct, or latent trait (e.g., reading comprehension, pragmatic competence) in order for assumptions about difficulty and ability to be justified. More complex IRT models, well beyond the scope of this paper (see van der Linden and Hambleton (1997) or Ostini and Nering (2006) for elaborate descriptions of these more complex models), permit measuring different latent traits simultaneously; nevertheless, each of these constructs in its own right should be unidimensional. This is to say that different constructs on an assessment instrument can each be unidimensional, even if the instrument includes multiple constructs (Henning, Hudson, &amp; Turner, 1985; Wright &amp; Linacre, 1989)<sup>8</sup>. As a final comment on unidimensionality, it is important to highlight that without unidimensionality for the latent trait(s) or construct(s) being measured, the probability model will fail. This is because without unidimensionality for each latent trait, we cannot meaningfully order ability and difficulty along the same scale. This is reflected in a high degree of misfit within the model, which is measured principally by a statistic called <i>infit mean square</i>. </p> <p><b><font size="3">Methodology</font></b></p> <p>So that readers can easily compare the CTT and IRT approaches, we reanalyzed the data presented in Janssen and Meier (2013) according to both CTT and IRT approaches. The university in question generously permitted us to study these data. To respect the test takers and protect their identities, only data that were released for research purposes were used in this study; all data were stripped of identifying information before their release and analysis by these researchers.
</p> <p><i>Participants</i></p> <p>Scores were collected from two placement test administrations for which dichotomously scored item-level data (i.e., 0/1, incorrect/correct) were available for individual test takers (<i>n </i>= 190). This is admittedly a convenience sample; however, as there was no noticeable difference in the total exam scores between this sub-group and the larger pool of PhD applicants who have taken this test, this sample was taken to adequately reflect the larger population of test takers at this university. </p> <p><i>Test Instruments</i></p> <p>The reading test comprised one section of a three-part placement exam given to incoming PhD students at one Colombian university<sup>9</sup>. The reading test is a computer-based, timed exam (70 minutes), consisting of 78 multiple-choice questions that target three language constructs: grammar, reading comprehension, and vocabulary. The reading comprehension questions are passage-based, while the grammar and vocabulary sections of the exam may or may not be contextualized within paragraph-length texts. The distribution of passage-based and independent items across test constructs is presented in <a href="#tab1">Table 1</a>, a modified version of Janssen and Meier&#39;s Table 2 (2013, p. 106). </p> <p align="center"><a name="tab1"></a><img src="img/revistas/calj/v16n2/v16n2a04tab1.jpg"></p> <p align="center"><a name="tab2"></a><img src="img/revistas/calj/v16n2/v16n2a04tab2.jpg"></p> ]]></body>
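<body><![CDATA[<p>To give a concrete picture of how such dichotomously scored data can be prepared, the short sketch below (our own illustration; the answer key, person IDs, and responses are all invented) scores raw multiple-choice answers as 0/1 and keeps blanks as &quot;N&quot;, the missing-data code used in the Rasch analysis described next.</p> <pre>
# Hypothetical sketch: score raw multiple-choice answers (A-D) as 0/1,
# keeping missing responses as "N" for the Rasch analysis described below.
# Key, person IDs, and responses are invented for illustration.

answer_key = "BADCB"  # correct option for each of 5 example items

raw_responses = {     # person ID -> raw answer string (" " = missing)
    "P001": "BADCB",
    "P002": "BACC ",
    "P003": "CADAB",
}

def score(raw, key):
    """Return a scored string: '1' correct, '0' incorrect, 'N' missing."""
    out = []
    for given, correct in zip(raw, key):
        if given == " ":
            out.append("N")
        else:
            out.append("1" if given == correct else "0")
    return "".join(out)

for person, raw in raw_responses.items():
    print(person, score(raw, answer_key))
# P001 11111
# P002 1101N
# P003 01101
</pre> ]]></body>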
<body><![CDATA[<p><i>Rasch Analysis</i></p> <p>Rasch analysis&#8212;one of the simplest analyses in the IRT family&#8212;was conducted on the reading subtest data using <i>Winsteps </i>v3.70.0.1 (Linacre, 2010). Within the data set, 62 instances of missing values were found for 15 examinees across various items. As Rasch analysis can account for missing data in its model without any adjustment, these values were retained and coded with the value of &quot;N&quot; within the input file so that <i>Winsteps </i>could recognize these as missing data as opposed to valid responses. To illustrate what a Rasch analysis input file looks like, we have included a condensed version of our input in <a href="#ape1">Appendix A</a>; for readers interested in pursuing Rasch analysis, Linacre (2012) includes myriad examples of input files that can be adapted to the needs of particular testing situations.</p> <p><b><font size="3">Results</font></b></p> <p>While even a basic Rasch analysis produces a vast amount of potentially useful information, in this section we concentrate on introducing three key results: summary statistics of person and item measures; individual person and item fit statistics; and the vertical ruler. These are not only central when understanding how well a test is performing but also are relatively easy to grasp for those new to Rasch analysis.</p> <p><i>Initial Analysis</i></p> <p>Summary statistics for the initial analysis are displayed in Table 2 for both test items and persons. Descriptive statistics&#8212;mean, standard deviation (SD), maximum (Max), and minimum (Min)&#8212;are displayed for each column, which, from left to right, show raw scores; logit measures produced by Rasch analysis; the standard error of these measures; and mean-square (MNSQ) and standardized z-score (ZSTD) values for the two types of fit statistics reported by the model, infit and outfit (described in more detail in the following paragraph). At the bottom of the table is the person-separation reliability estimate, which can be interpreted the same way as Cronbach&#39;s alpha. This reliability estimate is appropriately high (.93), signifying that the Rasch measures in the model that quantify test taker ability are separating test takers of different abilities 93% of the time. Table 2 also indicates ways in which parts of the test are not functioning optimally.</p> <p><b><i>Fit. </i></b>The question of model fit is one of utmost importance; without good model fit, no other statistics produced by the Rasch model are worth considering, as the model is thought not to work. Misfitting persons or items are those that do not conform to expected response patterns based on the model that the Rasch analysis has produced. Instances of misfitting persons might arise when a less-proficient test taker answers very difficult items correctly (due, for example, to lucky guessing or cheating) or when a more-proficient test taker answers very easy questions incorrectly (due, for example, to carelessness). Instances of misfitting items might arise due to problems with quality (i.e., items that are poorly worded) or multidimensionality (i.e., items that measure a different theoretical construct).
For both persons and items, misfit occurs when the observed response patterns vary from their expected patterns in such an erratic way that accurate predictions cannot be made&#8212;and the placement of the person or item within the model cannot be done accurately. Conversely, persons and items can also be identified as overfitting when they are too perfectly consistent, as Rasch models assume that there will naturally be some amount of variation. Overfitting items do not degrade measurement and so are typically retained, but misfitting items are more problematic. High numbers of misfitting persons (more than 5% of the sample) can suggest problems with the test as a whole; a small number of misfitting persons is acceptable, though additional sources of information about these test takers&#39; abilities should be sought if the test is used as the basis for high-stakes decisions. For the purpose of analysis, severely misfitting items should be removed so that they do not distort the Rasch model; for the purposes of test development, the content of misfitting items should be carefully scrutinized to determine if revision or removal is appropriate.</p> <p>Table 2 displays MNSQ and ZSTD values for both infit and outfit statistics. While both can be used to estimate fit, here we discuss only infit statistics, for reasons described in the previous paragraph<sup>10</sup>. Identifying misfit is accomplished in much the same way we might identify an outlier in other forms of analysis: we can look at the distance an item or person is from the mean MNSQ infit value and judge whether or not this distance represents an acceptable amount of variation or if it is an anomaly in the data. While there are no &quot;hard-and-fast rules&quot; for making these judgments (Bond &amp; Fox, 2007, p. 242), a number of useful guidelines do exist. One rule of thumb is that MNSQ values (in this study, for individual persons and individual items) greater than 1.30 or less than 0.70 signal misfit and overfit, respectively (Bond &amp; Fox, 2007). A second guideline is to look for MNSQ values greater than two standard deviations away from the mean MNSQ in either direction, which is perhaps the more applicable guideline as it relates directly to the distribution of the data (McNamara, 1996). Accordingly, in the current study, misfitting persons have infit MNSQs of greater than 1.24 and overfitting persons infit MNSQs of less than 0.72 (<i>M </i>= 0.98; <i>SD </i>= 0.13), while misfitting items have infit MNSQs of greater than 1.28 and overfitting items infit MNSQs of less than 0.72 (<i>M </i>= 1.00, <i>SD </i>= 0.14). With these ranges of fit values in mind, one next observes the fit statistics for each individual item and test taker (see Tables 3 and 4, respectively), and one can make the determination of whether each item and test taker fits within the model. Following a similar methodology, ZSTD scores can also be used to interpret fit; however, they are sensitive to sample size and might be less reliable in certain instances (Linacre, 2012).</p> <p><a href="#tab3">Table 3</a> displays a partial output of item fit statistics for the reading subtest, ordered according to descending infit MNSQ values. This table focuses on the items that fit less well, and for the sake of brevity we have omitted the vast majority of the reading test&#39;s items, which fit the model well.
The first column displays the item number, followed by the number of responses for this item in the data set (Count), the difficulty of each item measured in logits (Measure), the standard error of measurement (<i>SEM</i>), and infit statistics. Based on the criterion that misfitting items are those with MNSQ values more than two <i>SD</i>s from the mean (<i>M </i>= 1.00, <i>SD </i>= 0.14, from Table 2), three items have MNSQ values above 1.28, which indicates misfit (items 78, 69, and 10, with infit MNSQ values of 1.61, 1.39, and 1.37, respectively, shaded in grey in <a href="#tab3">Table 3</a>). According to the same criteria, one item has an MNSQ value below .72 and is overfitting (item 49, infit MNSQ = 0.66, also shaded in grey in <a href="#tab3">Table 3</a>). All other items appear to be functioning within the expectations of the model.</p> <p align="center"><a name="tab3"></a><img src="img/revistas/calj/v16n2/v16n2a04tab3.jpg" alt=""></p> ]]></body>
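<body><![CDATA[<p>As a minimal illustration of how these flagging rules can be applied, the sketch below (our own, not Winsteps output) takes the four flagged items&#39; infit MNSQ values reported above plus two invented well-fitting items and applies both the fixed 0.70-1.30 band and a mean plus-or-minus two SD band; note that the study&#39;s own two-SD band (0.72-1.28) is computed from all 78 items, not from this toy set.</p> <pre>
# Sketch: flag items whose infit MNSQ falls outside two common bands:
# (a) the fixed 0.70-1.30 rule of thumb (Bond & Fox, 2007), and
# (b) mean +/- 2 SD of the observed MNSQ values (McNamara, 1996).
# Items 78, 69, 10, 49 use the MNSQ values reported in the text;
# items 1 and 2 are invented well-fitting items for illustration.

infit_mnsq = {78: 1.61, 69: 1.39, 10: 1.37, 49: 0.66, 1: 0.98, 2: 1.05}

values = list(infit_mnsq.values())
mean = sum(values) / len(values)
sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
lo, hi = mean - 2 * sd, mean + 2 * sd

for item, mnsq in sorted(infit_mnsq.items()):
    fixed_flag = "misfit" if mnsq > 1.30 else "overfit" if mnsq < 0.70 else "ok"
    sd_flag = "misfit" if mnsq > hi else "overfit" if mnsq < lo else "ok"
    print(f"item {item}: MNSQ={mnsq:.2f}  fixed-band={fixed_flag}  2SD-band={sd_flag}")
</pre> ]]></body>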
<body><![CDATA[<p>One next should consider the actions to be taken in light of the above data. As stated earlier, overfitting items can be safely retained. The misfitting infit MNSQ values are not so large as to suggest misfit that would degrade measurement (i.e., values above 2.0; see Linacre, 2012, p. 553, for more details), yet it can be worthwhile to omit such items and re-run the analysis to see what effect this has on key results. Additionally, though it is beyond the scope of this paper, the content of the misfitting items should be examined to identify the source of misfit and determine what revisions, if any, would be appropriate.</p> <p>It is interesting to note that CTT analyses also signaled similar results for these misfitting test items: the one test item flagged by IRT as being most widely misfitting (78) was found in CTT analyses to be markedly more difficult than the test section&#39;s average difficulty (IF = 0.23 and 0.79, respectively); furthermore, the item discrimination value for this item was -0.17, indicating that less-proficient test-takers were slightly more able to answer this difficult question correctly than more-proficient test-takers.</p> <p>Person fit statistics are displayed in a similar way in <a href="#tab4">Table 4</a>, again ordered by descending infit MNSQ values. Person ID occupies the first column, followed by the number of observed responses for that person (Count), test-taker ability measured in logits (Measure), standard error of measurement (<i>SEM</i>), and infit statistics. Person fit can be interpreted in much the same way as was described for item fit, with values greater than two <i>SD</i>s above or below the mean signaling misfit or overfit, respectively. Based on these criteria, there are six test takers with MNSQ values above 1.24 (<i>M </i>= 0.98; <i>SD </i>= 0.13, from Table 2) who can be identified as misfitting, and one test taker with an MNSQ value below 0.72 who can be identified as overfitting. Unlike items, which can be relatively easily removed from a test when there is evidence of misfit, persons cannot be so simply excluded from a test. As the number of misfitting persons is comparatively small&#8212;only six out of 190, or about 3%&#8212;there is little reason to be concerned with their effect on the analysis at this stage.</p> <p align="center"><a name="tab4"></a><img src="img/revistas/calj/v16n2/v16n2a04tab4.jpg" alt=""></p> <p><b><i>Person and Item Measures</i></b>. While the spreads of both persons and items were normally distributed, as is expected in a placement exam, the distribution of test taker abilities is not well matched by the distribution of item difficulties. A study of Table 2 reveals that the mean person ability is noticeably higher (<i>M </i>= 1.86) than that of item difficulty, which is set at 0.00 by default in the model. This indicates that the test as a whole was relatively easy for examinees.
Had the test been well matched to the population, the mean estimate of person ability would have been closer to 0.00 (Bond &amp; Fox, 2007). Moreover, while the maximum value for person ability is 4.31 logits, the most difficult item on the test is only 3.71 logits (again, see Table 2). This result indicates that there are no items in this test section appropriately matched to the students at the highest ability levels.</p> <p>This discrepancy between person ability and item difficulty measures is perhaps better represented in the vertical ruler (<a href="#fig1">Figure 1</a>), a graphic visualization produced within Rasch analyses that presents the interval scale along which persons and items have been plotted according to their logit measure. If a test is well matched to the population, the range of test taker abilities will be complemented by items of commensurate difficulty, such that test takers and items line up along the length of the vertical ruler. In <a href="#fig1">Figure 1</a>, persons are shown on the left side of the axis by ability, with each &quot;X&quot; corresponding to one person, while items are arranged by difficulty along the right side by item number; the higher the position on the vertical ruler, the greater the test taker&#39;s ability or the item&#39;s difficulty. Again there is a clear mismatch between test taker ability and item difficulty, with the vast majority of test takers falling above the midpoint of the scale (0.00 logits) and items being spread more evenly around the mean. From <a href="#fig1">Figure 1</a>, it is easy to see that there is only one single item (78) above 2.50 logits, whereas there are a large number of persons with ability measures greater than 2.50. One can also see that many items have difficulty measures below -1.00 logits, yet there are no test takers whose ability measures are that low. The vertical rulers produced by Rasch analyses are one of its important benefits: a non-expert user can see at a glance the degree to which the test-taker ability levels match the items&#39; difficulty. This shows how an assessment instrument could be revised to include a variety of easier or more difficult items, depending on the test takers&#39; abilities.</p> <p align="center"><a name="fig1"></a><img src="img/revistas/calj/v16n2/v16n2a04fig1.jpg" alt=""></p> <p><i>Follow-Up Analysis, Misfitting Items Removed</i></p> <p>Based on these initial results, a second Rasch analysis was conducted with the three misfitting items removed (<i>k</i> = 75). This was done to confirm the degree to which the original analysis was affected by the misfitting items. Summary statistics for the revised model are shown in <a href="#tab5">Table 5</a>, along with a revised vertical ruler (<a href="#fig2">Figure 2</a>).</p> <p align="center"><a name="fig2"></a><img src="img/revistas/calj/v16n2/v16n2a04fig2.jpg" alt=""></p> ]]></body>
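<body><![CDATA[<p>For readers curious about how a vertical ruler is assembled, the toy sketch below (our own illustration; every measure is invented) prints persons as &quot;X&quot;s and item numbers side by side along a shared logit scale, in the spirit of Figures 1 and 2.</p> <pre>
# Toy "vertical ruler" (Wright map): persons as X's on the left,
# item numbers on the right, both plotted on the same logit scale.
# All measures below are invented for illustration.

person_measures = [3.1, 2.6, 2.4, 1.9, 1.8, 1.7, 1.2, 0.9, 0.4]
item_measures = {78: 2.7, 12: 1.1, 5: 0.3, 33: -0.2, 61: -1.1}

top, bottom, step = 3.5, -1.5, 0.5
level = top
while level >= bottom:
    # Count persons and list items whose measure falls in this half-logit band.
    persons = "X" * sum(level - step < m <= level for m in person_measures)
    items = " ".join(str(i) for i, m in item_measures.items()
                     if level - step < m <= level)
    print(f"{level:5.1f} | {persons:<10}| {items}")
    level -= step
</pre> ]]></body>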
<body><![CDATA[<p>&nbsp;</p> <p align="center"><a name="tab5"></a><img src="img/revistas/calj/v16n2/v16n2a04tab5.jpg" alt=""></p> <p>Removing the misfitting items from the initial analysis produced a slight increase in the person-separation reliability estimate, to .94. Not surprisingly, though, removing misfitting items did not noticeably affect the basic mismatch between test taker ability and item difficulty. In fact, because one of the items removed (78) was the most difficult item in the initial analysis, there is an even larger discrepancy between the number of examinees with high ability and the number of items of appropriate difficulty. In terms of misfit and overfit, infit MNSQ statistics indicate that a small number of items continue to misfit, which can be seen in the detailed item fit statistics output (see <a href="#tab6">Table 6</a>). It is worth noting that these misfitting items in this new analysis were nearly misfitting in the original analysis (compare Tables 3 and 6). Thus we can conclude that improving the performance of this test will require more than just the revision of a few misfitting items; it will involve a systematic replacement of easy items with more difficult items that correspond to the ability levels of the PhD students applying to this university.</p> <p align="center"><a name="tab6"></a><img src="img/revistas/calj/v16n2/v16n2a04tab6.jpg" alt=""></p> <p><b><font size="3">Discussion</font></b></p> <p>We began this paper emphasizing our belief in&#8212;and many scholars&#39; call for&#8212;promoting greater assessment literacy among program teachers and other stakeholders, and it is only fitting that we conclude by focusing on the same theme. More specifically, the &quot;know-how, practical skills, theoretical knowledge, and understanding of principles &hellip; all firmly contextualized within a sound understanding of the role and function of assessment within education and society&quot; that Taylor (2009, p. 27) calls for becomes vitally important when one considers the different uses that tests may have, and the fact that test developers may be the only line of defense test takers have in assuring that the tests being used to evaluate them are fair, appropriate, and valid. We hope we have highlighted the importance of two theoretical frameworks&#8212;CTT and IRT&#8212;that can be used to understand the quality of test items for the test taker populations being assessed, and we hope that program teachers begin to inform themselves from a variety of perspectives about the quality of the instruments they are designing and employing.</p> <p>Furthermore, in this paper we revealed how basic IRT analysis can be used to evaluate the performance of a norm-referenced placement test, and how these results compare with those of classical test theory methodologies. The results of the Rasch analysis presented in the previous section mirrored the most central findings reported by Janssen and Meier (2013) using CTT: the current version of the reading portion of the placement exam is not well matched to the prospective PhD students whose academic English reading ability it is designed to measure, though the items generally function well.
The results of both types of analysis additionally suggest that this test would benefit from the elimination of several items that are too easy and the inclusion of a greater number of more difficult items.</p> <p>Janssen and Meier (2013) based their conclusions on measures of central tendency and dispersion, which indicated that the reading test section was, broadly speaking, too easy for the population sample; they also used IF values to identify specific items which were and were not of suitable difficulty (p. 109) and ID values to ascertain which items were and were not effectively differentiating between more and less able test takers (p. 109). The conclusions in this current analysis were reached based on the discrepancy between item difficulty measures and test taker ability measures which, unlike descriptive statistics and item analysis values based on raw scores, are plotted on a single interval scale and thus are directly comparable. Importantly, while logit measures and standard errors for each individual item or test taker can be reported in tabular form (e.g., Tables 3, 4, 6), Rasch analysis handily produces a vertical ruler that efficiently summarizes the relationship between item difficulty and test taker ability. In our experience, we have found that non-specialist but interested stakeholders such as program administrators can more intuitively grasp the implications of a vertical ruler describing test items and test takers than they can the import of a table of IF and ID values, especially since values for item difficulty and test-taker ability are reported in interval units. Thus, although learning Rasch analysis requires a bit more initial effort, this effort is repaid when it comes time to share the findings of a test analysis. This is one of the benefits of using IRT analyses.</p> <p>Moreover, Rasch analysis provides additional information not available through CTT, such as fit statistics, which can flag test items and test takers that require additional review. Item fit statistics can alert test developers to problems with item quality, while person fit statistics can alert those responsible for making decisions about test takers when additional sources of information should be collected in order to make defensible inferences about test takers&#39; abilities (i.e., for misfitting test takers). Thus, while we are not suggesting that IRT analyses supplant CTT, we suggest that even the basic output presented in this paper makes important contributions to understanding and evaluating test performance.</p> <p><b><font size="3">Conclusion</font></b></p> <p>While the Rasch analysis results further confirm the CTT results, they also provide useful additional resources, including (a) a single graphic, the vertical ruler, which neatly captures the relationship between item difficulty and test taker ability and can be used to clearly and efficiently communicate these findings to other test stakeholders; and (b) the identification of misfitting items. Moreover, while we did not use Rasch analysis to compare the performance of test items across different groups of examinees, in the literature review we suggested that this was a major advantage of the sample-independent nature of the IRT approach. In this particular instance, the misfitting items we identified could be revised and their performance analyzed across groups of different examinees, provided two test forms were linked through a common set of anchor items.
While this next step is beyond the scope  of this paper, we hope this brief introduction to the possibilities of Rasch  analysis has demonstrated the value of this analytic approach and  perhaps inspired those involved in the  development of local, high-stakes exams to extend their  assessment literacy by delving more deeply into  the topic.</p>     ]]></body>
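<body><![CDATA[<p>Although such a linking study is beyond our scope, the core idea can be sketched briefly: when two test forms share a set of anchor items, the mean difference between the anchors&#39; difficulty estimates provides a shift constant that places the new form&#39;s calibrations on the old form&#39;s logit scale. The sketch below is our own minimal illustration with invented values, not output from the exam studied here.</p> <pre>
# Sketch of common-item (anchor) linking: shift Form B's item
# difficulties onto Form A's logit scale using items both forms share.
# All calibrations below are invented for illustration.

form_a = {"i01": -0.50, "i02": 0.20, "i03": 1.10}                # Form A anchors
form_b = {"i01": -0.80, "i02": -0.10, "i03": 0.75, "i99": 1.40}  # Form B calibration

anchors = [item for item in form_a if item in form_b]
# Mean difficulty difference on the anchors = linking (shift) constant.
shift = sum(form_a[a] - form_b[a] for a in anchors) / len(anchors)

# Re-express every Form B item on Form A's scale.
form_b_linked = {item: round(d + shift, 2) for item, d in form_b.items()}
print("shift constant:", round(shift, 2))   # ~0.32
print(form_b_linked)                        # i99 now comparable to Form A items
</pre> ]]></body>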
<body><![CDATA[<p><b><font size="3">References </font></b> </p>     <!-- ref --><p>Bachman, L. (2004). <i>Statistical analyses for  language assessment</i>. New York, NY: Cambridge University Press.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000084&pid=S0123-4641201400020000400001&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Bachman, L. &amp; Palmer, A. (2010). <i>Language  assessment in practice</i>. New York, NY: Oxford University Press.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000086&pid=S0123-4641201400020000400002&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Bond, T., &amp; Fox, C. (2007). <i>Applying the Rasch  model: Fundamental measurement in the human sciences</i>, (2nd ed.). New York,  NY: Routledge.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000088&pid=S0123-4641201400020000400003&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Brennan, R. L. (2001). <i>Generalizability theory</i>.  New York, NY: Springer-Verlag.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000090&pid=S0123-4641201400020000400004&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref -->  </p>     <!-- ref --><p>Brown, J. D. (2005). <i>Testing in language programs:  A comprehensive guide to English language assessment</i>. New York, NY:  McGraw-Hill.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000092&pid=S0123-4641201400020000400005&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Brown, J. D. (2013). Classical theory reliability. In  A. Kunnan (Ed.), <i>Companion to language assessment</i>, Vol. 3. Hoboken, NJ:  Wiley Blackwell.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000094&pid=S0123-4641201400020000400006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p><font face="Verdana" size="2">Chapelle, C. (2012). Validity argument for language assessment: The   framework is simple... <i>Language Testing, 29(1)</i>,19-27. doi:10.1177/0265532211417211.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000096&pid=S0123-4641201400020000400007&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </font></p>     <!-- ref --><p>Crocker, L., &amp; Algina, J. (1986). <i>Introduction  to classical and modern test theory</i>, (1st ed.). Belmont, CA: Wadsworth  Group/ Thomson Learning.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000098&pid=S0123-4641201400020000400008&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Ebel, R. L. (1979<i>). Essentials of educational  measurement</i>, 1st edition. Upper Saddle River, NJ: Prentice Hall.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000100&pid=S0123-4641201400020000400009&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Ellis, D., &amp; Ross, S. (2014). Item response theory  in language testing. In A. Kunnan (Ed.), <i>Companion to language assessment</i>,  Vol. 3. Hoboken, NJ: Wiley Blackwell.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000102&pid=S0123-4641201400020000400010&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Ferlazzo, F. (2003). Generalizability theory. In Fern&aacute;ndez-Ballesteros  (Ed.), <i>Encyclopedia of psychological assessment </i>(pp. 425-429). London,  UK: Sage Publications.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000104&pid=S0123-4641201400020000400011&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></p>     <!-- ref --><p>Hambleton, R., &amp;  Dirir, M. (2003). Classical and modern item analysis. In Fern&aacute;ndez-Ballesteros  (Ed.), <i>Encyclopedia of psychological assessment </i>(pp. 188-192). London,  UK: Sage Publications.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000106&pid=S0123-4641201400020000400012&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Henning, G., Hudson, T.,  &amp; Turner, J. (1985). Item response theory and the assumption of  unidimensionality for language tests. <i>Language Testing, 2(2)</i>, 141-154.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000108&pid=S0123-4641201400020000400013&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Janssen, G., &amp; Meier,  V. (2013). Establishing placement test fit and performance: Serving local  needs. <i>Colombian Applied Linguistics Journal, 15(1)</i>, 100-113.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000110&pid=S0123-4641201400020000400014&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Kane, M. (2006). Validation.  In R. Brennan (Ed.), <i>Educational measurement </i>(4th ed.) (pp. 17-64).  Westport, CT: American Council on Education / Praeger.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000112&pid=S0123-4641201400020000400015&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Kane, M. (2013).  Validating the interpretations and uses of test scores. <i>Journal of  Educational Measurement, 50(1)</i>, 1-73. doi:10.1111/jedm.12000.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000114&pid=S0123-4641201400020000400016&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Linacre, J. M. (2010). <i>Winsteps </i>(Version 3.70.0.1). Chicago, IL: MESA Press.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000116&pid=S0123-4641201400020000400017&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Linacre, J. M. (2012). <i>A  user&#39;s guide to Winsteps software manual</i>. Chicago, IL: MESA Press.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000118&pid=S0123-4641201400020000400018&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>L&oacute;pez Mendoza, A. A.,  &amp; Bernal Arandia, R. (2009). Language testing in Colombia: A call for  more teacher education and teacher training in language assessment. <i>PROFILE,  11(2)</i>, 55-70.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000120&pid=S0123-4641201400020000400019&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </p>     <!-- ref --><p>Marcoulides, G., &amp; Ing, M. (2013). The use of Generalizability  Theory in language assessment. In A. Kunnan, (Ed.), <i>The companion to  language assessment, Vol. 3 (pp. 1207-1223).</i><i>New York, NY: John Wiley &amp; Sons, Inc. DOI:</i> <i>10.1002/9781118411360.wbcla014</i> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000122&pid=S0123-4641201400020000400020&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p><i>McNamara, T. (1996). Measuring second language</i>   <i>performance. New York: Longman.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000123&pid=S0123-4641201400020000400021&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></i></p>     <!-- ref --><p><i>Mu&ntilde;iz, J. (2003). Classical test theory. In Fern&aacute;ndez-</i> <i>Ballesteros (Ed.), Encyclopedia of psychological</i> <i>assessment. London, UK: Sage Publications.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000125&pid=S0123-4641201400020000400022&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></i></p>     <!-- ref --><p><i>Ostini, R., &amp; Nering, M. (2006). Polytomous item</i> <i>response theory models. Thousand Oaks, CA: Sage</i> <i>Publications.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000127&pid=S0123-4641201400020000400023&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></i></p>     <!-- ref --><p><i>Popham, W. J. (2009). Assessment literacy for teachers:</i> <i>Faddish or fundamental? Theory Into Practice,</i> <i>48(1), 4-11. doi:10.1080/0040584080257753</i>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000129&pid=S0123-4641201400020000400024&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p><font face="Verdana" size="2"><i>Sawaki, Y. (2013). Classical test theory. In A. Kunnan (Ed.), Te companion to     language assessment. Vol.3. Hoboken, NJ: Wiley Blackwell</i>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000130&pid=S0123-4641201400020000400025&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><i>Shavelson, R., &amp; Webb, N. (1991). Generalizability</i> <i>theory: A primer. London, UK: Sage.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000132&pid=S0123-4641201400020000400026&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></i> </p>     <!-- ref --><p><i>Taylor, L. (2009). Developing assessment literacy.</i> <i>Annual Review of Applied Linguistics, 29, 21-36.</i> <i>doi:10.1017/S026719050909003</i>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000134&pid=S0123-4641201400020000400027&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p><i>Van der Linden, W., &amp; Hambleton, R. (1997).</i> <i>Handbook of modern item response theory. New</i> <i>York, NY: Springer-Verlag.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000135&pid=S0123-4641201400020000400028&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></i></p>     <!-- ref --><p><i>Wright, B. D., &amp; Linacre, J. M. (1989). Observations</i> <i>are always ordinal; Measurements, however must</i> <i>be interval (MESA Research Memorandum No.</i> <i>44). MESA Psychometric Laboratory. 
Retrieved</i> <i>from: </i><a href="http://www.rasch.org/memo44.htm" target="_blank"> <i>www.rasch.org/memo44.htm  </i></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000137&pid=S0123-4641201400020000400029&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><p><a name="(ape1)"><img src="img/revistas/calj/v16n2/v16n2a04ape1.jpg"></a></p> </font>      ]]></body><back>
<ref-list>
<ref id="B1">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bachman]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<source><![CDATA[Statistical analyses for language assessment]]></source>
<year>2004</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[Cambridge University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bachman]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Palmer]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Language assessment in practice]]></source>
<year>2010</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[Oxford University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bond]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Fox]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<source><![CDATA[Applying the Rasch model: Fundamental measurement in the human sciences]]></source>
<year>2007</year>
<edition>2</edition>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[Routledge]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brennan]]></surname>
<given-names><![CDATA[R. L.]]></given-names>
</name>
</person-group>
<source><![CDATA[Generalizability theory]]></source>
<year>2001</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[Springer-Verlag]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brown]]></surname>
<given-names><![CDATA[J. D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Testing in language programs: A comprehensive guide to English language assessment]]></source>
<year>2005</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[McGraw-Hill]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brown]]></surname>
<given-names><![CDATA[J. D.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Classical theory reliability]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Kunnan]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Companion to language assessment]]></source>
<year>2013</year>
<volume>3</volume>
<publisher-loc><![CDATA[Hoboken, NJ]]></publisher-loc>
<publisher-name><![CDATA[Wiley Blackwell]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Chapelle]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Validity argument for language assessment: The framework is simple...]]></article-title>
<source><![CDATA[Language Testing]]></source>
<year>2012</year>
<volume>29</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>19-27</page-range></nlm-citation>
</ref>
<ref id="B8">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Crocker]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Algina]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Introduction to classical and modern test theory]]></source>
<year>1986</year>
<edition>1</edition>
<publisher-loc><![CDATA[Belmont, CA]]></publisher-loc>
<publisher-name><![CDATA[Wadsworth Group/ Thomson Learning]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ebel]]></surname>
<given-names><![CDATA[R. L.]]></given-names>
</name>
</person-group>
<source><![CDATA[Essentials of educational measurement]]></source>
<year>1979</year>
<edition>1</edition>
<publisher-loc><![CDATA[Upper Saddle River, NJ]]></publisher-loc>
<publisher-name><![CDATA[Prentice Hall]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ellis]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Ross]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Item response theory in language testing]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Kunnan]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Companion to language assessment]]></source>
<year>2014</year>
<volume>3</volume>
<publisher-loc><![CDATA[Hoboken, NJ]]></publisher-loc>
<publisher-name><![CDATA[Wiley Blackwell]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B11">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ferlazzo]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Fernández-Ballesteros]]></surname>
</name>
</person-group>
<source><![CDATA[Encyclopedia of psychological assessment]]></source>
<year>2003</year>
<page-range>425-429</page-range><publisher-loc><![CDATA[London, UK]]></publisher-loc>
<publisher-name><![CDATA[Sage Publications]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hambleton]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Dirir]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Classical and modern item analysis]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Fernández-Ballesteros]]></surname>
</name>
</person-group>
<source><![CDATA[Encyclopedia of psychological assessment]]></source>
<year>2003</year>
<page-range>188-192</page-range><publisher-loc><![CDATA[London, UK]]></publisher-loc>
<publisher-name><![CDATA[Sage Publications]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B13">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Henning]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Hudson]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Turner]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Item response theory and the assumption of unidimensionality for language tests]]></article-title>
<source><![CDATA[Language Testing]]></source>
<year>1985</year>
<volume>2</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>141-154</page-range></nlm-citation>
</ref>
<ref id="B14">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Janssen]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Meier]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Establishing placement test fit and performance: Serving local needs]]></article-title>
<source><![CDATA[Colombian Applied Linguistics Journal]]></source>
<year>2013</year>
<volume>15</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>100-113</page-range></nlm-citation>
</ref>
<ref id="B15">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kane]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Validation]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Brennan]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<source><![CDATA[Educational measurement]]></source>
<year>2006</year>
<edition>4</edition>
<page-range>17-64</page-range><publisher-loc><![CDATA[Westport, CT]]></publisher-loc>
<publisher-name><![CDATA[American Council on Education / Praeger]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B16">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kane]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Validating the interpretations and uses of test scores]]></article-title>
<source><![CDATA[Journal of Educational Measurement]]></source>
<year>2013</year>
<volume>50</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>1-73</page-range></nlm-citation>
</ref>
<ref id="B17">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Linacre]]></surname>
<given-names><![CDATA[J. M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Winsteps (Version 3.70.0.1)]]></source>
<year>2010</year>
<publisher-loc><![CDATA[Chicago, IL]]></publisher-loc>
<publisher-name><![CDATA[MESA Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B18">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Linacre]]></surname>
<given-names><![CDATA[J. M.]]></given-names>
</name>
</person-group>
<source><![CDATA[A user's guide to Winsteps software manual]]></source>
<year>2012</year>
<publisher-loc><![CDATA[Chicago, IL]]></publisher-loc>
<publisher-name><![CDATA[MESA Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B19">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[López Mendoza]]></surname>
<given-names><![CDATA[A. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Bernal Arandia]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Language testing in Colombia: A call for more teacher education and teacher training in language assessment]]></article-title>
<source><![CDATA[PROFILE]]></source>
<year>2009</year>
<volume>11</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>55-70</page-range></nlm-citation>
</ref>
<ref id="B20">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Marcoulides]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Ing]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The use of Generalizability Theory in language assessment]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Kunnan]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[The companion to language assessment]]></source>
<year>2013</year>
<volume>3</volume>
<page-range>1207-1223</page-range><publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[John Wiley & Sons]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B21">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[McNamara]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<source><![CDATA[Measuring second language performance]]></source>
<year>1996</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[Longman]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B22">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Muñiz]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Classical test theory]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Fernández-Ballesteros]]></surname>
</name>
</person-group>
<source><![CDATA[Encyclopedia of psychological assessment]]></source>
<year>2003</year>
<publisher-loc><![CDATA[London, UK]]></publisher-loc>
<publisher-name><![CDATA[Sage Publications]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B23">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ostini]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Nering]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Polytomous item response theory models]]></source>
<year>2006</year>
<publisher-loc><![CDATA[Thousand Oaks, CA]]></publisher-loc>
<publisher-name><![CDATA[Sage Publications]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B24">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Popham]]></surname>
<given-names><![CDATA[W. J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Assessment literacy for teachers: Faddish or fundamental?]]></article-title>
<source><![CDATA[Theory Into Practice]]></source>
<year>2009</year>
<volume>48</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>4-11</page-range></nlm-citation>
</ref>
<ref id="B25">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sawaki]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Classical test theory]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Kunnan]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[The companion to language assessment]]></source>
<year>2013</year>
<volume>3</volume>
<publisher-loc><![CDATA[Hoboken, NJ]]></publisher-loc>
<publisher-name><![CDATA[Wiley Blackwell]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B26">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Shavelson]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Webb]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
</person-group>
<source><![CDATA[Generalizability theory: A primer]]></source>
<year>1991</year>
<publisher-loc><![CDATA[London, UK]]></publisher-loc>
<publisher-name><![CDATA[Sage]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B27">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Taylor]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Developing assessment literacy]]></article-title>
<source><![CDATA[Annual Review of Applied Linguistics]]></source>
<year>2009</year>
<volume>29</volume>
<page-range>21-36</page-range></nlm-citation>
</ref>
<ref id="B28">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Van der Linden]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
<name>
<surname><![CDATA[Hambleton]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<source><![CDATA[Handbook of modern item response theory]]></source>
<year>1997</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[Springer-Verlag]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B29">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wright]]></surname>
<given-names><![CDATA[B. D.]]></given-names>
</name>
<name>
<surname><![CDATA[Linacre]]></surname>
<given-names><![CDATA[J. M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Observations are always ordinal; measurements, however, must be interval]]></source>
<year>1989</year>
<volume>44</volume>
<publisher-name><![CDATA[MESA Psychometric Laboratory]]></publisher-name>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
