TY - GEN
T1 - LightLDA: Big Topic Models on Modest Computer Clusters
T2 - 24th International Conference on World Wide Web, WWW 2015
AU - Yuan, Jinhui
AU - Gao, Fei
AU - Ho, Qirong
AU - Dai, Wei
AU - Wei, Jinliang
AU - Zheng, Xun
AU - Xing, Eric P.
AU - Liu, Tie-Yan
AU - Ma, Wei-Ying
PY - 2015/5/18
Y1 - 2015/5/18
AB - When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens, a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in a frugal use of limited memory capacity and network bandwidth; 3) a differential data structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed. These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time cost reductions with increasing cluster size.
KW - Data Parallelism
KW - Distributed Systems
KW - Large Scale Machine Learning
KW - Metropolis-Hastings
KW - Model Scheduling
KW - Parameter Server
KW - Petuum
KW - Topic Model
UR - http://www.scopus.com/inward/record.url?scp=84968736704&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-84968736704&origin=recordpage
U2 - 10.1145/2736277.2741115
DO - 10.1145/2736277.2741115
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9781450334693
T3 - WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
SP - 1351
EP - 1361
BT - WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
PB - Association for Computing Machinery
Y2 - 18 May 2015 through 22 May 2015
ER -