TY - JOUR
T1 - A scalable framework for cross-lingual authorship identification
AU - Sarwar, Raheem
AU - Li, Qing
AU - Rakthanmanon, Thanawin
AU - Nutanong, Sarana
PY - 2018/10
Y1 - 2018/10
AB - Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low-resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments, where each fragment is further decomposed into fixed-size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources.
KW - Authorship identification
KW - Cross-lingual
KW - Cyber forensic
KW - Similarity search
KW - Stylometric features
KW - Writeprint
UR - http://www.scopus.com/inward/record.url?scp=85049986312&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85049986312&origin=recordpage
U2 - 10.1016/j.ins.2018.07.009
DO - 10.1016/j.ins.2018.07.009
M3 - RGC 21 - Publication in refereed journal
SN - 0020-0255
VL - 465
SP - 323
EP - 339
JO - Information Sciences
JF - Information Sciences
ER -