Automatic detection of machine-generated texts based on linguistic models and deep learning

Document Type: Original Article

Authors

1 Master's student at Imam Hussein (AS) Comprehensive University

2 Assistant Professor, Imam Hussein University

3 Researcher at Imam Hussein (AS) Comprehensive University

Abstract

With the rapid growth of artificial intelligence and its products, many opportunities and threats have emerged. One of the best-known products of artificial intelligence is generated text, also called machine text. This research introduces a new method that combines features learned from the text with its structural features to build an automatic discriminator between human-written and AI-generated text. The method consists of two parts. The first part combines a robustly optimized BERT (RoBERTa) model with a bidirectional long short-term memory (BiLSTM) model, improved with a fusion layer. The second part extracts the structural features of the text using a writing-style-based method. Finally, the outputs of the two parts are combined, allowing the model to distinguish human-written text from machine-generated text. The results show that the proposed method recognizes machine-generated text with 90% accuracy.
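As a rough illustration of the second part and the final combination step, the following minimal Python sketch computes a few writing-style features and concatenates them with a text representation. It is not the authors' implementation. The specific features (average sentence length, average word length, type-token ratio, punctuation ratio), the function names `style_features` and `fuse`, and the placeholder vector standing in for the RoBERTa and BiLSTM output are all assumptions made for illustration; the abstract does not specify which stylistic features or which fusion operation are used.

```python
import re
import string


def style_features(text: str) -> list[float]:
    """Return a small, illustrative vector of writing-style features.

    The feature set is an assumption; the abstract does not list the
    features used in the actual method.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return [0.0, 0.0, 0.0, 0.0]
    avg_sentence_len = len(words) / len(sentences)          # words per sentence
    avg_word_len = sum(len(w) for w in words) / len(words)  # characters per word
    type_token_ratio = len(set(words)) / len(words)         # lexical diversity
    punct_ratio = sum(ch in string.punctuation for ch in text) / len(text)
    return [avg_sentence_len, avg_word_len, type_token_ratio, punct_ratio]


def fuse(text_embedding: list[float], text: str) -> list[float]:
    """Combine by concatenation: learned representation + style features.

    Concatenation is one common choice, assumed here for illustration.
    """
    return list(text_embedding) + style_features(text)


# Hypothetical placeholder for the RoBERTa + BiLSTM representation of the text.
dummy_embedding = [0.1, -0.3, 0.7]
sample = "The cat sat. The cat ran!"
fused = fuse(dummy_embedding, sample)
print(len(fused))
```

In a full system, the fused vector would be passed to a classifier that outputs a human-versus-machine label; that classifier is omitted here.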
