TY - GEN
T1 - Learning similarity measures in non-orthogonal space
AU - Liu, Ning
AU - Zhang, Benyu
AU - Yan, Jun
AU - Yang, Qiang
AU - Yan, Shuicheng
AU - Chen, Zheng
AU - Bai, Fengshan
AU - Ma, Wei-Ying
PY - 2004
Y1 - 2004
N2 - Many machine learning and data mining algorithms rely crucially on similarity metrics. Cosine similarity, which computes the inner product of two normalized feature vectors, is one of the most commonly used similarity measures. However, in many practical tasks such as text categorization and document clustering, cosine similarity is computed under the assumption that the input space is orthogonal, an assumption that usually cannot be satisfied due to synonymy and polysemy. Algorithms such as Latent Semantic Indexing (LSI) address this problem by projecting the original data into an orthogonal space, but LSI suffers from high computational cost and data sparseness, which increase computation time and storage requirements on large-scale realistic data. In this paper, we propose a novel and effective similarity metric in the non-orthogonal input space. The basic idea of the proposed metric is that the similarity of features should affect the similarity of objects, and vice versa. A novel iterative algorithm for computing non-orthogonal space similarity measures is then proposed. Experimental results on a synthetic data set, real MSN search click-through logs, and the 20NG dataset show that our algorithm outperforms the traditional cosine similarity and is superior to LSI. Copyright 2004 ACM.
AB - Many machine learning and data mining algorithms rely crucially on similarity metrics. Cosine similarity, which computes the inner product of two normalized feature vectors, is one of the most commonly used similarity measures. However, in many practical tasks such as text categorization and document clustering, cosine similarity is computed under the assumption that the input space is orthogonal, an assumption that usually cannot be satisfied due to synonymy and polysemy. Algorithms such as Latent Semantic Indexing (LSI) address this problem by projecting the original data into an orthogonal space, but LSI suffers from high computational cost and data sparseness, which increase computation time and storage requirements on large-scale realistic data. In this paper, we propose a novel and effective similarity metric in the non-orthogonal input space. The basic idea of the proposed metric is that the similarity of features should affect the similarity of objects, and vice versa. A novel iterative algorithm for computing non-orthogonal space similarity measures is then proposed. Experimental results on a synthetic data set, real MSN search click-through logs, and the 20NG dataset show that our algorithm outperforms the traditional cosine similarity and is superior to LSI. Copyright 2004 ACM.
KW - Latent Semantic Indexing (LSI)
KW - Non-Orthogonal Space (NOS)
KW - Similarity Measures (SM)
KW - Vector Space Model (VSM)
UR - http://www.scopus.com/inward/record.url?scp=18744372752&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-18744372752&origin=recordpage
U2 - 10.1145/1031171.1031240
DO - 10.1145/1031171.1031240
M3 - RGC 32 - Refereed conference paper (with host publication)
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 334
EP - 341
BT - CIKM 2004: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - Thirteenth ACM Conference on Information and Knowledge Management, CIKM 2004
Y2 - 8 November 2004 through 13 November 2004
ER -