Research Article
Enhancing logical reasoning in language models: An investigation of the Capybara dataset
1 Facultad de Ingenierías, Universidad Tecnológica de Pereira, Pereira, COLOMBIA
2 Grupo Ingeniería y Sociedad, Facultad de Ingeniería, Universidad de Antioquia, Antioquia, COLOMBIA
* Corresponding Author
Contemporary Educational Technology, 17(3), July 2025, ep582, https://doi.org/10.30935/cedtech/16425
Published: 02 June 2025
ABSTRACT
Recent progress in conversational AI underscores the need for language models with robust logical reasoning and extrapolation capabilities. This study investigates how well the Capybara dataset can improve the reasoning abilities of language-based systems. Several state-of-the-art language models were fine-tuned on the Capybara corpus and then evaluated on standard benchmarks that demand sophisticated reasoning. Comparisons across evaluation methods show that fine-tuning improves the models' logical reasoning and strengthens their inference abilities. The article further considers what these results mean for developers seeking more human-like conversational intelligence, and suggests that the Capybara dataset could become a valuable resource for training reasoning-oriented language models.
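As a concrete illustration of the setup described above, the sketch below fine-tunes a small causal language model on a conversational corpus. It is a minimal sketch, not the article's actual pipeline: the dataset identifier ("LDJnr/Capybara" on the Hugging Face Hub), the per-turn "input"/"output" record schema, the placeholder "gpt2" model, and all hyperparameters are assumptions, since the article does not specify them here.

```python
# Minimal sketch (assumed details, see note above) of fine-tuning a causal LM
# on a Capybara-style conversational dataset with Hugging Face libraries.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder; the article does not name its models here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed dataset location and schema: each record holds a "conversation"
# list of {"input": ..., "output": ...} turns.
raw = load_dataset("LDJnr/Capybara", split="train")

def to_text(example):
    # Flatten a multi-turn conversation into one training string.
    turns = example["conversation"]
    text = "\n".join(f"User: {t['input']}\nAssistant: {t['output']}"
                     for t in turns)
    return {"text": text}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

dataset = (raw.map(to_text, remove_columns=raw.column_names)
              .map(tokenize, remove_columns=["text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="capybara-ft",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    # mlm=False yields standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The evaluation step the abstract mentions would then load the fine-tuned checkpoint from the output directory and score it on the chosen reasoning benchmarks; those benchmarks and metrics are not identified in the abstract, so they are not sketched here.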
CITATION (APA)
Muñoz Guerrero, L. E., Ceballos, Y. F., & Trejos Rojas, L. D. (2025). Enhancing logical reasoning in language models: An investigation of the Capybara dataset. Contemporary Educational Technology, 17(3), ep582. https://doi.org/10.30935/cedtech/16425
The articles published in this journal are licensed under the CC-BY Creative Commons Attribution International License.