Stopwords in technical language processing

Serhad Sarica*, Jianxi Luo

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

109 Citations (Scopus)
29 Downloads (CityUHK Scholars)

Abstract

Natural language processing techniques are increasingly applied to information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. Researchers typically use readily available stopword lists derived from non-technical resources, yet the technical jargon of engineering fields contains its own highly frequent and uninformative words, and no standard stopword list exists for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and by curating a stopwords dataset ready for technical language processing applications. © 2021 Sarica, Luo.
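As a rough illustration of the three measures named in the abstract, the following minimal sketch computes term frequency, inverse document frequency, and entropy over a toy tokenized corpus and ranks candidate stopwords. It is not the authors' procedure: the function names `term_statistics` and `candidate_stopwords`, the sample sentences, the exact formulas chosen (corpus-wide counts, `log(N / df)`, and Shannon entropy of a term's distribution over documents), and the ranking rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (tf, idf, entropy)} for a list of tokenized documents.

    tf      : total occurrences of the term in the corpus
    idf     : log(N / df), where df is the number of documents containing the term
    entropy : Shannon entropy (bits) of the term's distribution over documents;
              a high value means the term is spread evenly and so tells
              documents apart poorly
    """
    n_docs = len(docs)
    per_doc = [Counter(doc) for doc in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)

    stats = {}
    for term, tf in total.items():
        in_docs = [c[term] for c in per_doc if term in c]
        idf = math.log(n_docs / len(in_docs))
        entropy = -sum((c / tf) * math.log2(c / tf) for c in in_docs)
        stats[term] = (tf, idf, entropy)
    return stats


def candidate_stopwords(docs, top_k=4):
    """Rank terms by low idf, then high entropy, then high tf (an assumed rule)."""
    stats = term_statistics(docs)
    ranked = sorted(stats, key=lambda t: (stats[t][1], -stats[t][2], -stats[t][0]))
    return ranked[:top_k]


# Hypothetical patent-style sentences, invented for this example.
docs = [
    "the method comprises a valve and a sensor".split(),
    "the method comprises a rotor and a housing".split(),
    "the system comprises a battery and a controller".split(),
]

print(candidate_stopwords(docs))
```

On this toy corpus, "comprises" scores like "the" and "and" on all three measures, which is the kind of domain-specific but uninformative word a general-purpose stopword list would miss. A real application would need a much larger corpus and a principled way to combine and threshold the measures.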
Original language: English
Article number: e0254937
Journal: PLOS ONE
Volume: 16
Issue number: 8
Online published: 5 Aug 2021
Publication status: Published - 2021
Externally published: Yes

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
