Measuring Similarity between Documents Using TF-IDF Cosine Similarity Function
Keywords:
Term Frequency, Vector Space Model, Inverse Document Frequency, String-based Similarity
Abstract
A tremendous amount of information exists all over the world in the form of document databases, and these databases are expanding exponentially. Measuring similarity between documents is a difficult task, and various models and functions have been proposed to address it. Document similarity plays a crucial role in many research areas and applications: plagiarism detection, information retrieval, text-based research, and clustering all become possible once the similarity between documents has been measured. In this paper we compare documents and measure the similarity between them using the TF-IDF (term frequency-inverse document frequency) cosine similarity function.
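As an illustration of the approach named above, the following is a minimal sketch of TF-IDF weighting followed by cosine similarity. It is not the implementation used in the paper: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), the whitespace tokenizer, and the smoothed IDF formula log((1 + N) / (1 + df)) + 1 are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A fuller pipeline
    # would also strip punctuation and remove stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Assumption: smoothed IDF, so terms present in every document
    # still receive a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency is the raw count normalized by document length.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply today",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine_similarity(vecs[0], vecs[1]))  # shared terms: score between 0 and 1
    print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms: score of 0.0
```

Because every weight is non-negative, the score falls between 0 (no terms in common) and 1 (identical term distributions), which is what makes it convenient for ranking or clustering documents.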
License
Copyright (c) 2018 International Journal for Research Publication and Seminar
This work is licensed under a Creative Commons Attribution 4.0 International License.
Re-users must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. This license allows for redistribution, commercial and non-commercial, as long as the original work is properly credited.