Research Article
My AI students: Evaluating the proficiency of three AI chatbots in completeness and accuracy
1 University of KwaZulu-Natal, Durban, SOUTH AFRICA
* Corresponding Author
Contemporary Educational Technology, 16(2), April 2024, ep509, https://doi.org/10.30935/cedtech/14564
Published: 26 April 2024
OPEN ACCESS
ABSTRACT
A new era of artificial intelligence (AI) has begun, one that could radically alter how humans interact with and profit from technology. The confluence of chat interfaces with large language models lets humans write a natural language inquiry and receive a natural language response from a machine. This experimental design study tests the completeness and accuracy of three popular AI chatbot services, referred to as my AI students: Microsoft Bing, Google Bard, and OpenAI ChatGPT. Completeness was rated on a three-point Likert scale and accuracy on a five-point Likert scale. Descriptive statistics and non-parametric tests were used to compare marks and scale ratings. The results show that the AI chatbots were awarded a score of 80.0% overall. However, they struggled to answer questions from the higher levels of Bloom’s taxonomy. The median completeness was 3.00 with a mean of 2.75, and the median accuracy was 5.00 with a mean of 4.48, across all Bloom’s taxonomy questions (n=128). Overall, the completeness of the solution was rated mostly incomplete due to limited response (76.2%), while accuracy was rated mostly correct (83.3%). In some cases, the generated text was verbose and disembodied, lacking perspective and coherence. Microsoft Bing ranked first among the three generative AI text tools in providing correct answers (92.0%). The Kruskal-Wallis test revealed a significant difference in completeness (asymp. sig.=0.037, p<0.05) and accuracy (asymp. sig.=0.006, p<0.05) among the three AI chatbots. A series of Mann-Whitney tests showed no significant difference between pairs of AI chatbots for completeness (all p-values>0.015 and 0<r<0.2), while a significant difference in accuracy was found between Google Bard and Microsoft Bing (asymp. sig.=0.002, p<0.05, r=0.3, medium effect).
The findings suggest that while AI chatbots can generate comprehensive and correct responses, they may have limits when dealing with more complicated cognitive tasks.
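The test sequence named in the abstract (a Kruskal-Wallis test across the three chatbots, followed by pairwise Mann-Whitney tests with an effect size r) can be illustrated with a minimal sketch. This is not the study's analysis script. The function names and the ratings below are hypothetical placeholders, not data from the article. The sketch assumes exactly three groups (so the chi-square p-value has the closed form exp(-H/2) for 2 degrees of freedom), a normal approximation without continuity correction for Mann-Whitney, the common convention r = |Z| / sqrt(N), and a Bonferroni-style threshold of 0.05/3 for the three pairwise comparisons, which the abstract does not state explicitly.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..N, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def tie_term(values):
    """Sum of t^3 - t over every group of t tied values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum(t ** 3 - t for t in counts.values())


def kruskal_wallis_3(a, b, c):
    """Tie-corrected H statistic and p-value for exactly three groups."""
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    h /= 1 - tie_term(pooled) / (n ** 3 - n)  # fails if every value is tied
    return h, math.exp(-h / 2)  # chi-square survival function, df = 2


def mann_whitney_r(x, y):
    """Two-sided p (normal approximation) and effect size r = |Z| / sqrt(N)."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = average_ranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term(pooled) / (n * (n - 1))))
    z = (u - n1 * n2 / 2) / sigma
    return math.erfc(abs(z) / math.sqrt(2)), abs(z) / math.sqrt(n)


# Hypothetical five-point accuracy ratings, invented for illustration only.
ratings = {
    "chatbot_a": [5, 5, 4, 5, 3, 5, 4, 5],
    "chatbot_b": [4, 3, 5, 2, 4, 3, 4, 3],
    "chatbot_c": [5, 4, 4, 5, 3, 4, 5, 4],
}

h, p = kruskal_wallis_3(*ratings.values())
print(f"Kruskal-Wallis: H={h:.3f}, p={p:.4f}")

alpha = 0.05 / 3  # assumed Bonferroni adjustment for three pairwise tests
for (name_x, x), (name_y, y) in combinations(ratings.items(), 2):
    p_pair, r = mann_whitney_r(x, y)
    verdict = "significant" if p_pair < alpha else "not significant"
    print(f"{name_x} vs {name_y}: p={p_pair:.4f}, r={r:.2f} ({verdict})")
```

With real data, a statistics package would normally be used instead of hand-rolled functions; the sketch only makes the order of the tests and the effect-size calculation concrete.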
CITATION (APA)
Govender, R. G. (2024). My AI students: Evaluating the proficiency of three AI chatbots in completeness and accuracy. Contemporary Educational Technology, 16(2), ep509. https://doi.org/10.30935/cedtech/14564