Research Article

Enhancing logical reasoning in language models: An investigation of the Capybara dataset

Luis Eduardo Muñoz Guerrero 1 , Yony Fernando Ceballos 2 , Luis David Trejos Rojas 1 *
1 Facultad de Ingenierías, Universidad Tecnológica de Pereira, Pereira, COLOMBIA
2 Grupo Ingeniería y Sociedad, Facultad de Ingeniería, Universidad de Antioquia, Antioquia, COLOMBIA
* Corresponding Author
Contemporary Educational Technology, 17(3), July 2025, ep582, https://doi.org/10.30935/cedtech/16425
Published: 02 June 2025
OPEN ACCESS

ABSTRACT

Recent progress in conversational AI has underscored the need for language models with solid logical reasoning skills and the ability to extrapolate beyond their training data. This study investigates how effectively the Capybara dataset can improve the reasoning abilities of language-based systems. Several state-of-the-art language models were fine-tuned on the Capybara corpus and then evaluated on standard benchmarks that demand sophisticated reasoning. Comparisons across multiple evaluation methods show that the fine-tuned models' logical reasoning improves and their inference abilities are enhanced. The paper also discusses the implications of these findings for developers pursuing more human-like conversational intelligence, and suggests that the dataset could become a valuable resource for training reasoning-oriented language models.
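As a rough illustration of the kind of preprocessing such a study implies (not the authors' actual code), multi-turn conversational data like Capybara's is typically flattened into a single training string with role markers before fine-tuning. The turn format and the role tags below are assumptions chosen for the sketch:

```python
def to_training_text(conversation,
                     user_tag="### Human:",
                     model_tag="### Assistant:"):
    """Flatten a list of {'input': ..., 'output': ...} turns into one
    training string with alternating role markers (format assumed)."""
    parts = []
    for turn in conversation:
        parts.append(f"{user_tag} {turn['input']}")
        parts.append(f"{model_tag} {turn['output']}")
    return "\n".join(parts)

# Hypothetical two-turn reasoning dialogue in the assumed format:
convo = [
    {"input": "If all A are B and all B are C, are all A C?",
     "output": "Yes. Membership in A implies B, and B implies C."},
    {"input": "And if some B are not C?",
     "output": "Then we can no longer conclude that all A are C."},
]
print(to_training_text(convo))
```

The flattened strings would then be tokenized and fed to a standard causal language-modeling fine-tuning loop; the role tags only need to be consistent between training and inference.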

CITATION (APA)

Muñoz Guerrero, L. E., Ceballos, Y. F., & Trejos Rojas, L. D. (2025). Enhancing logical reasoning in language models: An investigation of the Capybara dataset. Contemporary Educational Technology, 17(3), ep582. https://doi.org/10.30935/cedtech/16425
