
A stacking model using URL and HTML features for phishing webpage detection

  • Yukun Li
  • Zhenguo Yang*
  • Xu Chen
  • Huaping Yuan
  • Wenyin Liu*

* Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

Abstract

In this paper, we present a stacking model that detects phishing webpages using URL and HTML features. We design lightweight URL and HTML features and introduce an HTML string embedding, none of which rely on third-party services, which makes real-time detection applications feasible. We then devise a stacking model that combines GBDT, XGBoost, and LightGBM in multiple layers so that the different models complement one another, improving performance on phishing webpage detection. For evaluation, we collect two real-world datasets, named 50K-PD and 50K-IPD. 50K-PD contains 49,947 webpages with URLs and HTML code. 50K-IPD contains 53,103 webpages with screenshots in addition to URLs and HTML code. The proposed approach outperforms a number of machine learning models on multiple metrics. On the 50K-PD dataset, it achieves 97.30% accuracy, a 4.46% missing alarm rate, and a 1.61% false alarm rate. On the 50K-IPD dataset, it achieves 98.60% accuracy, a 1.28% missing alarm rate, and a 1.54% false alarm rate.
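
The three metrics reported in the abstract can be illustrated with a minimal sketch. It assumes the common definitions, which the abstract does not spell out: the missing alarm rate is the fraction of phishing pages classified as legitimate (false negative rate), and the false alarm rate is the fraction of legitimate pages classified as phishing (false positive rate). The function name `evaluate`, the label encoding (1 = phishing, 0 = legitimate), and the sample labels are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DetectionMetrics:
    accuracy: float
    missing_alarm_rate: float  # assumed: FN / (TP + FN)
    false_alarm_rate: float    # assumed: FP / (FP + TN)


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> DetectionMetrics:
    """Compute detection metrics; labels are assumed to be 1 = phishing, 0 = legitimate."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed

    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both classes must be present in y_true")

    return DetectionMetrics(
        accuracy=(tp + tn) / len(pairs),
        missing_alarm_rate=fn / (tp + fn),
        false_alarm_rate=fp / (fp + tn),
    )


# Hypothetical labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Reporting the two error rates separately, rather than accuracy alone, shows whether a detector's mistakes fall mostly on missed phishing pages or on wrongly flagged legitimate pages.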
Original language: English
Pages (from-to): 27-39
Journal: Future Generation Computer Systems
Volume: 94
Online published: 9 Nov 2018
Publication status: Published - May 2019

Research Keywords

  • Anti-phishing
  • HTML string embedding
  • Machine learning
  • Stacking model
