| Downloads | Citations | Reads |
| 115 | 0 | 284 |
Abstract: With advances in natural language generation, large language models can now produce complex text that resembles human writing in style and quality. The wide use of machine-generated text has in turn made it harder to identify the source of a text. Many supervised-learning methods exist for detecting machine-generated English text, but their applicability and accuracy in other languages still require investigation. To address this, a dataset is built by generating machine text corresponding to high-quality Chinese corpora, and a text classification model is proposed that fuses deep text features with multiple linguistic features. Linguistic features such as character and word commonness and discourse cohesion are extracted and incorporated to strengthen the model's generalization ability. Experimental results show that the proposed method effectively improves the model's generalization on test sets drawn from different sources.
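The abstract names two kinds of linguistic features (character and word commonness, discourse cohesion) that are fused with deep text features, but it does not give formulas. The following is a minimal sketch of how such a fusion could look, not the authors' implementation. Everything in it is an assumption made for illustration: the toy `COMMON_CHARS` list, the definition of commonness as a share of common characters, the definition of cohesion as character overlap between adjacent sentences, and the fusion by simple concatenation.

```python
import re
from typing import List, Sequence

# Hypothetical toy list of "common" characters. A real system would load a
# frequency lexicon instead; which lexicon to use is an assumption here.
COMMON_CHARS = set("的一是在不了有和人这")


def split_sentences(text: str) -> List[str]:
    """Split on Chinese and ASCII sentence-ending punctuation."""
    parts = re.split(r"[。!?!?]", text)
    return [p.strip() for p in parts if p.strip()]


def commonness(text: str) -> float:
    """Share of non-space characters that appear in COMMON_CHARS."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in COMMON_CHARS for c in chars) / len(chars)


def cohesion(text: str) -> float:
    """Mean Jaccard overlap of character sets between adjacent sentences."""
    sents = split_sentences(text)
    if len(sents) < 2:
        return 0.0
    scores = []
    for a, b in zip(sents, sents[1:]):
        sa, sb = set(a), set(b)
        scores.append(len(sa & sb) / len(sa | sb))
    return sum(scores) / len(scores)


def fuse(deep_features: Sequence[float], text: str) -> List[float]:
    """Concatenate a deep feature vector with the two linguistic features."""
    return list(deep_features) + [commonness(text), cohesion(text)]
```

In this sketch, `deep_features` stands in for whatever vector a pretrained encoder produces for the text; the fused vector would then be passed to a classifier. Concatenation is only one possible fusion strategy and is chosen here because it is the simplest to show.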
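Basic information:
DOI: 10.19573/j.issn2095-0926.202502009
CLC number: TP391.1
Citation:
[1] WU Chunhui (吴春晖), CHEN Jing (陈静), ZHANG Jiajia (张佳佳), et al. Research on machine-generated text detection fusing multiple linguistic features[J]. Journal of Tianjin University of Technology and Education, 2025, 35(02): 55-60. DOI: 10.19573/j.issn2095-0926.202502009.
Funding:
Scientific Research Program of the Tianjin Municipal Education Commission (2021KJ008)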
2025-06-28