### Using TF-IDF Algorithm to Find Relevance Score in Document Queries

As the name implies, TF-IDF stands for term frequency-inverse document frequency. It is used to determine which words in a corpus of documents might be most useful in a query. TF-IDF calculates a value for each word in a document by combining how often the word occurs in that document with the inverse of the percentage of documents the word appears in. Words with high TF-IDF values have a strong relationship with the document they appear in, suggesting that if such a word were to appear in a query, the document could be of interest to the user.

The task of retrieving data from a user-defined query has become so common and natural in recent years that some might not give it a second thought. However, this growing use of query retrieval warrants continued research and enhancements to generate better solutions to the problem.

## Types of Term Frequency Algorithms

There have been many advances in the TF-IDF algorithm. Many researchers have contributed their own variants, which, although not used prominently, are still relevant. Several term-frequency weighting schemes are commonly referred to in the literature.
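As an illustration, a few commonly cited term-frequency weightings can be sketched in Python. This is a minimal sketch, not necessarily the set of algorithms the author had in mind: the function name `term_frequency_variants`, the variant labels, and the whitespace tokenization are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequency_variants(tokens, term):
    """Return several commonly cited TF weights for `term` in a tokenized document.

    `tokens` is a list of words; the variant names are illustrative labels.
    """
    counts = Counter(tokens)
    raw = counts[term]
    max_count = max(counts.values()) if counts else 0
    return {
        # Number of times the term occurs in the document.
        "raw_count": raw,
        # 1 if the term occurs at all, otherwise 0.
        "binary": 1 if raw > 0 else 0,
        # Raw count divided by document length.
        "relative": raw / len(tokens) if tokens else 0.0,
        # Dampens the effect of very frequent terms.
        "log_normalized": math.log(1 + raw),
        # Scales by the most frequent term in the document.
        "augmented": 0.5 + 0.5 * raw / max_count if max_count else 0.0,
    }


doc = "the cat sat on the mat".split()
weights = term_frequency_variants(doc, "the")
```

The relative and augmented forms divide by a property of the document itself, which is what lets scores from short and long documents be compared.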

## Types of Inverse Document Frequency

The inverse document frequency algorithm has also seen many advances. One limitation of simple IDF is that it cannot recognize the singular and plural forms of a word as the same term, so it treats them as two distinct terms and gives a less accurate result. Researchers have proposed several refinements in response.
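The singular/plural limitation, along with two commonly cited IDF weightings, can be sketched as follows. This is an illustrative sketch under stated assumptions: documents are represented as sets of tokens, and the names `document_frequency` and `idf_variants` are hypothetical. Note that these variants only change the weighting; merging singular and plural forms is usually handled earlier, by stemming or lemmatizing the tokens.

```python
import math


def document_frequency(docs, term):
    """Count how many documents (sets of tokens) contain `term`."""
    return sum(1 for doc in docs if term in doc)


def idf_variants(docs, term):
    """Return two commonly cited IDF weights for `term` over `docs`."""
    n_docs = len(docs)
    df = document_frequency(docs, term)
    variants = {
        # Smoothed form: stays defined even when the term is absent.
        "smooth": math.log(n_docs / (1 + df)) + 1,
    }
    if df > 0:
        # Plain form: log of (number of documents / documents containing the term).
        variants["plain"] = math.log(n_docs / df)
    return variants


docs = [{"cat", "sat"}, {"cats", "ran"}, {"dog", "ran"}]

# "cat" and "cats" are counted as unrelated terms by a simple IDF.
df_cat = document_frequency(docs, "cat")
df_cats = document_frequency(docs, "cats")
idf_cat = idf_variants(docs, "cat")
```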

## The Math behind TF-IDF

Essentially, TF-IDF works by determining the relative frequency of a word in a specific document compared to the inverse proportion of that word over the entire document corpus. Intuitively, this calculation determines how relevant a given word is in a particular document. Words that are common in a single document or a small group of documents tend to have higher TF-IDF values than words that are common everywhere, such as articles and prepositions. The TF-IDF weight of a word is the product of its TF weight and its IDF weight.
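One common way to write this product is given here as a sketch. The source does not state which TF and IDF variants it intends, so the symbols are assumptions: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $D$ is the corpus, and $N$ is the number of documents in $D$.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t,D) = \log\frac{N}{\left|\{d \in D : t \in d\}\right|}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)
$$

Under this formulation a word that appears in every document has $\mathrm{idf} = \log 1 = 0$, so its weight is zero no matter how often it occurs.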

TF-IDF is one of the best-known weighting schemes in information retrieval. The weight increases with the number of occurrences of a given word in the document, and it also increases with the rarity of the word in other documents. A word ‘W’ with a high weight in a document is said to have large discriminatory power. Therefore, when a query contains ‘W’, returning a document ‘d’ in which the weight of ‘W’ is large will very likely satisfy the user.
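The retrieval idea above can be sketched in a few lines of Python. This is a minimal sketch that assumes whitespace-tokenized documents, a length-normalized term frequency, and a plain logarithmic IDF. The names `tf_idf_scores` and `rank_documents` are hypothetical, and summing the weights of the query terms is only one simple way to score a document.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return one {term: weight} dict per document.

    `docs` is a list of token lists. TF is count / document length and
    IDF is log(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def rank_documents(query_tokens, docs):
    """Return document indices ordered by summed TF-IDF of the query terms."""
    scores = tf_idf_scores(docs)
    totals = [sum(s.get(t, 0.0) for t in query_tokens) for s in scores]
    return sorted(range(len(docs)), key=lambda i: totals[i], reverse=True)


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "the stock market fell".split(),
]
ranking = rank_documents(["the", "cat"], corpus)
```

In this toy corpus "the" occurs in every document, so it contributes nothing to any score, while "cat" pulls the first document to the top of the ranking.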

## Applications

This algorithm is useful when you have a document set, generally a large one, that needs to be categorized. It is especially easy to implement because you don't need to train a model ahead of time, and, when term frequencies are normalized by document length, it accounts for differences in the lengths of documents.

Suppose you have a blogging website where tens of thousands of users write blog posts, and the tags attached to each post determine where it appears on listing pages across the site. Although authors are able to tag their posts manually when they write them, in many cases they choose not to, and therefore many blog posts are not categorized. In practice, only a small fraction of users take the time to add tags manually and help categorize posts and reviews, which makes voluntary organization unsustainable. Such a document set is an excellent use case for TF-IDF, as it can generate tags for the blog posts and help display them in the right areas of your site. Best of all, no writer or blogger has to suffer through tagging posts by hand. A quick run of the algorithm goes through the document set and sorts through all the entries, eliminating a great deal of hassle.
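A rough sketch of this tagging use case follows. It assumes lowercase whitespace tokenization and the same simple TF-IDF formulation (length-normalized TF times logarithmic IDF); the function `suggest_tags`, its `top_k` parameter, and the sample posts are all hypothetical. A real site would likely also want stemming, punctuation handling, and a minimum weight threshold.

```python
import math
from collections import Counter


def suggest_tags(posts, top_k=3):
    """Suggest up to `top_k` tags per post by ranking words on TF-IDF weight."""
    tokenized = [post.lower().split() for post in posts]
    n_posts = len(tokenized)

    # Number of posts each word appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    suggestions = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: (count / len(tokens)) * math.log(n_posts / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        suggestions.append(ranked[:top_k])
    return suggestions


posts = [
    "python tips for python beginners",
    "travel tips for europe",
    "baking bread for beginners",
]
tags = suggest_tags(posts, top_k=1)
```

Words shared by every post, such as "for" here, receive a weight of zero and sink to the bottom of each ranking, so they are unlikely to be suggested as tags.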