Measuring Similarity between Documents Using TF-IDF Cosine Similarity Function

Authors

  • Manoj Chahal, Master of Technology (Computer Science and Engineering), Guru Jambheshwar University of Science and Technology, Hisar, Haryana, India

Keywords:

Term Frequency, Vector Space Model, Inverse Document Frequency, String-Based Similarity

Abstract

A tremendous amount of information exists worldwide in the form of document databases, and these databases are expanding exponentially. Measuring similarity between documents is a very difficult task. The field of similarity deals with the problem of document similarity, and various models and functions have been proposed for it. Measuring similarity between documents plays a crucial role in many areas of research and application: plagiarism detection, information retrieval, text-based research, clustering, and related methods all become possible once document similarity has been measured. In this paper we compare documents and measure the similarity between them using the TF (term frequency)-IDF (inverse document frequency) cosine similarity function.
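
As an illustration of the technique named in the abstract, the following is a minimal Python sketch of TF-IDF weighting followed by cosine similarity. It is not taken from the paper: the function names, the whitespace tokenizer, and the particular weighting variant (term count divided by document length, multiplied by log(N / document frequency)) are assumptions chosen for brevity, and the paper may use a different formulation.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


# Hypothetical example corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vectors = tf_idf_vectors(docs)
related = cosine_similarity(vectors[0], vectors[1])    # documents share terms
unrelated = cosine_similarity(vectors[0], vectors[2])  # no terms in common
```

With this variant, a term that occurs in every document receives an IDF of zero and so contributes nothing to the similarity; smoothed IDF variants avoid that behavior when it is not wanted.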

Published

31-03-2018

How to Cite

Manoj Chahal. (2018). Measuring Similarity between Documents Using TF-IDF Cosine Similarity Function. International Journal for Research Publication and Seminar, 9(1), 53–57. Retrieved from https://jrps.shodhsagar.com/index.php/j/article/view/1298

Section

Original Research Article