LightLDA: Big topic models on modest computer clusters

Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu, Wei-Ying Ma

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

142 Citations (Scopus)

Abstract

When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens, a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in a frugal use of limited memory capacity and network bandwidth; 3) a differential data structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed. These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time cost reductions with increasing cluster size.
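The O(1) sampling cost claimed in contribution 1 rests on a standard building block: Walker's alias method, which draws from a K-outcome discrete distribution in O(1) time after O(K) table construction. LightLDA amortizes that construction by reusing stale alias tables as Metropolis-Hastings proposals and correcting the resulting bias with an accept/reject step. The Python below is a minimal illustrative sketch of that pattern, not the paper's implementation (LightLDA itself is C++ on Petuum); the function names and the generic true_prob/proposal_prob callables are hypothetical stand-ins for the paper's alternating doc- and word-proposal factorization.

    import random

    def build_alias_table(weights):
        # Walker's alias method: O(K) construction, O(1) sampling afterwards.
        K = len(weights)
        total = sum(weights)
        prob = [w * K / total for w in weights]
        alias = [0] * K
        small = [k for k, p in enumerate(prob) if p < 1.0]
        large = [k for k, p in enumerate(prob) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                      # overflow of l fills the slack of s
            prob[l] -= 1.0 - prob[s]
            (small if prob[l] < 1.0 else large).append(l)
        return prob, alias

    def alias_draw(prob, alias):
        # O(1) draw from the distribution the table was built for.
        k = random.randrange(len(prob))
        return k if random.random() < prob[k] else alias[k]

    def mh_step(current, true_prob, proposal_prob, prob, alias):
        # One independence Metropolis-Hastings step: propose a topic from a
        # (possibly stale) alias table, then accept/reject against the true
        # conditional so the chain still targets the correct distribution.
        candidate = alias_draw(prob, alias)
        ratio = (true_prob(candidate) * proposal_prob(current)) / (
            true_prob(current) * proposal_prob(candidate))
        return candidate if random.random() < min(1.0, ratio) else current

Contribution 3, the differential data structure, can be sketched the same way: topic-count rows for frequent words are stored as dense arrays (fast indexed access), while rows for rare words use a sparse map (small footprint, since a rare word is assigned to few topics). Again a hedged illustration; DifferentialRow and its frequency threshold are hypothetical, and the paper's actual storage layout is more compact than Python objects.

    class DifferentialRow:
        # Hybrid word-topic counts: dense list for hot words, dict for the long tail.
        def __init__(self, num_topics, word_freq, dense_threshold=10000):
            self.dense = word_freq >= dense_threshold
            self.counts = [0] * num_topics if self.dense else {}

        def incr(self, topic, delta=1):
            if self.dense:
                self.counts[topic] += delta
            else:
                new = self.counts.get(topic, 0) + delta
                if new == 0:
                    self.counts.pop(topic, None)  # keep the map truly sparse
                else:
                    self.counts[topic] = new

        def get(self, topic):
            return self.counts[topic] if self.dense else self.counts.get(topic, 0)

To see why the hybrid matters at the paper's scale: a fully dense 1-million-word by 1-million-topic count matrix at 4 bytes per entry would occupy roughly 4 TB, while the sparse long tail costs memory only in proportion to the nonzero counts actually observed.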
Original language: English
Title of host publication: WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
Publisher: Association for Computing Machinery
Pages: 1351-1361
ISBN (Print): 9781450334693
DOIs
Publication status: Published - 18 May 2015
Externally published: Yes
Event: 24th International Conference on World Wide Web, WWW 2015 - Florence, Italy
Duration: 18 May 2015 - 22 May 2015

Publication series

Name: WWW 2015 - Proceedings of the 24th International Conference on World Wide Web

Conference

Conference: 24th International Conference on World Wide Web, WWW 2015
Place: Italy
City: Florence
Period: 18/05/15 - 22/05/15

Bibliographical note

Publication details (e.g. title, author(s), publication statuses and dates) are captured on an “AS IS” and “AS AVAILABLE” basis at the time of record harvesting from the data source. Suggestions for further amendments or supplementary information can be sent to [email protected].

Research Keywords

  • Data Parallelism
  • Distributed Systems
  • Large Scale Machine Learning
  • Metropolis-Hastings
  • Model Scheduling
  • Parameter Server
  • Petuum
  • Topic Model
