PoSLemma: How Traditional Machine Learning and Linguistics Preprocessing Aid in Machine Generated Text Detection

Authors

Diana Jimenez Instituto Politécnico Nacional
Marco A. Cardoso-Moreno Instituto Politécnico Nacional
Fernando Aguilar-Canto Instituto Politécnico Nacional
Omar Juarez-Gambino Instituto Politécnico Nacional
Hiram Calvo Instituto Politécnico Nacional

DOI:

https://doi.org/10.13053/cys-27-4-4778

Keywords:

Generative text detection, text generation, AuTexTification, logistic regression, support vector machine (SVM), classification

Abstract

With the release of several Large Language Models (LLMs) to the public, concerns have emerged regarding their ethical implications and potential misuse. This paper proposes an approach to address the need for technologies that can distinguish between text sequences written by humans and those produced by LLMs. The proposed method combines traditional Natural Language Processing (NLP) feature extraction techniques, focused on linguistic properties, with traditional Machine Learning (ML) methods such as Logistic Regression and Support Vector Machines (SVMs). We also compare this approach with an ensemble of Long Short-Term Memory (LSTM) networks, each analyzing a different Part-of-Speech (PoS) tagging paradigm. Our traditional ML models achieved F1 scores of 0.80 and 0.72, respectively, on the two tasks analyzed.
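As a rough illustration of the kind of pipeline the abstract describes (linguistic features fed to a traditional classifier), the sketch below trains a small logistic regression on lemma and PoS counts. It is not the authors' implementation: the toy sentences, the labels, the `lemma_POS` token format, and the helper names `featurize`, `train`, and `predict_proba` are all assumptions made for this example, and the texts are assumed to have been lemmatized and PoS-tagged beforehand by some external tagger.

```python
import math
from collections import Counter

# Hypothetical toy data (invented for illustration): each text is assumed to be
# pre-processed into "lemma_POS" tokens. Label 1 = machine-generated, 0 = human.
TRAIN = [
    ("the_DET model_NOUN generate_VERB fluent_ADJ text_NOUN", 1),
    ("the_DET system_NOUN produce_VERB coherent_ADJ text_NOUN", 1),
    ("lol_INTJ i_PRON miss_VERB my_PRON bus_NOUN again_ADV", 0),
    ("omg_INTJ we_PRON win_VERB the_DET game_NOUN", 0),
]

def featurize(text):
    """Count each lemma_POS token and each bare PoS tag as sparse features."""
    feats = Counter()
    for tok in text.split():
        feats[tok] += 1.0
        feats["POS=" + tok.rsplit("_", 1)[-1]] += 1.0
    return feats

def predict_proba(weights, bias, feats):
    """Logistic model: probability that the text is machine-generated."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    examples = [(featurize(t), y) for t, y in data]
    for _ in range(epochs):
        for feats, y in examples:
            err = predict_proba(weights, bias, feats) - y  # gradient w.r.t. z
            bias -= lr * err
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) - lr * err * v
    return weights, bias

weights, bias = train(TRAIN)
for text, label in TRAIN:
    print(label, round(predict_proba(weights, bias, featurize(text)), 3))
```

In practice a library classifier and a real tagger would replace the hand-rolled pieces. The point here is only that PoS and lemma counts can serve directly as inputs to a linear model.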

Author Biographies

Diana Jimenez, Instituto Politécnico Nacional

Centro de Investigación en Computación

Marco A. Cardoso-Moreno, Instituto Politécnico Nacional

Centro de Investigación en Computación

Fernando Aguilar-Canto, Instituto Politécnico Nacional

Centro de Investigación en Computación

Omar Juarez-Gambino, Instituto Politécnico Nacional

Centro de Investigación en Computación

Hiram Calvo, Instituto Politécnico Nacional

Centro de Investigación en Computación

Published

2023-12-17

Issue

Vol. 27 No. 4 (2023): 27(4) 2023

Section

Articles

License

Hereby I transfer exclusively to the Journal "Computación y Sistemas", published by the Computing Research Center (CIC-IPN), the Copyright of the aforementioned paper. I also accept that these rights will not be transferred to any other publication, in any other format, language or other existing means of developing.

I certify that the paper has not been previously disclosed or simultaneously submitted to any other publication, and that it does not contain material whose publication would violate the Copyright or other proprietary rights of any person, company or institution. I certify that I have the permission from the institution or company where I work or study to publish this work.

The representative author accepts the responsibility for the publication of this paper on behalf of each and every one of the authors.

This transfer is subject to the following conditions:

The authors retain all ownership rights (such as patent rights) of this work, except for the publishing rights transferred to the CIC, through this document.
Authors retain the right to publish the work in whole or in part in any book of which they are the authors or publishers. They can also make use of this work in conferences, courses, personal web pages, and so on.
Authors may include the work as part of their thesis, for non-profit distribution only.