



From: Multiple-true-false questions reveal more thoroughly the complexity of student thinking than multiple-choice questions: a Bayesian item response model comparison

| Model | Structure | WAIC | ΔWAIC | WAIC SE | p_WAIC |
|-------|-----------|------|-------|---------|--------|
| A | Mastery, TTFF partial mastery, informed reasoning based on attractiveness, informed reasoning with double-T endorsement bias, individual student performance | 18,520.7 | 0 | 200.0 | 342.1 |
| B | − remove question-level mastery | 18,927.0 | −406.3 | 199.1 | 288.7 |
| C | − remove TTFF partial mastery | 18,566.0 | −45.3 | 200.2 | 328.9 |
| D | − remove TTFF partial mastery; + replace with TTFF-TFTF partial mastery | 18,534.3 | −13.6 | 199.7 | 341.8 |
| E | − remove TTFF partial mastery; + replace with TTFF-TFTF-TFFT partial mastery | 18,536.5 | −15.8 | 199.5 | 333.5 |
| F | − remove informed reasoning based on attractiveness; + replace with random guessing | 19,582.4 | −1061.7 | 203.4 | 268.4 |
| G | − remove double-T bias | 18,656.1 | −135.4 | 201.4 | 330.3 |
| H | − remove double-T bias; + replace with multi-T bias | 18,533.4 | −12.7 | 200.3 | 344.6 |
| I | − remove question-level double-T bias; + replace with global double-T bias for each student | 18,539.6 | −18.9 | 200.0 | 325.1 |
| J | + add random guessing for students not in mastery, partial mastery, or informed reasoning | 18,519.9 | +0.8 | 199.9 | 356.1 |
| K | − remove individual student performance | 19,448.5 | −927.8 | 196.7 | 180.2 |

Models B–K are described by the model components they remove from (−) or add to (+) model A.
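The WAIC columns summarize a Bayesian model comparison: lower WAIC indicates better estimated out-of-sample fit, ΔWAIC is the difference from model A, WAIC SE is the standard error of the WAIC estimate, and p_WAIC is the effective number of parameters. As a minimal sketch of how these quantities are defined, the function below computes WAIC on the deviance scale from a matrix of pointwise posterior log-likelihood draws; the NumPy implementation, array names, and matrix dimensions are illustrative assumptions, not the paper's Stan code.

```python
import numpy as np

def waic(log_lik):
    """Compute WAIC from a (draws x observations) matrix of pointwise
    log-likelihood values, using the standard definitions.

    Returns WAIC on the deviance scale (-2 * elpd), the effective number
    of parameters p_WAIC, and a standard error for WAIC.
    """
    n_draws, n_obs = log_lik.shape
    # lppd: log of the posterior-mean likelihood for each observation
    # (a log-sum-exp formulation is more stable for very negative values)
    lppd_i = np.log(np.mean(np.exp(log_lik), axis=0))
    # p_WAIC: posterior variance of the log-likelihood for each observation
    p_waic_i = np.var(log_lik, axis=0, ddof=1)
    elpd_i = lppd_i - p_waic_i
    waic_value = -2 * np.sum(elpd_i)
    waic_se = 2 * np.sqrt(n_obs * np.var(elpd_i, ddof=1))
    return waic_value, np.sum(p_waic_i), waic_se

# Illustrative call with a synthetic log-likelihood matrix
# (1,000 posterior draws, 200 student-question observations)
rng = np.random.default_rng(0)
fake_log_lik = rng.normal(loc=-1.0, scale=0.1, size=(1000, 200))
w, p, se = waic(fake_log_lik)
print(f"WAIC = {w:.1f}, p_WAIC = {p:.1f}, SE = {se:.1f}")
```

Given the pointwise log-likelihood draws from each fitted model, ΔWAIC relative to model A follows directly as a difference of the returned WAIC values.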