Phrase-based indexing and retrieval is a system designed by Google that uses phrases to index, rank, describe and search documents on the internet. It examines how phrases are used across the web and decides whether each one is a good, or valid, phrase. The system also considers how frequently a phrase is used in order to gauge its significance, and it compares how common phrases are used in place of, or together with, other phrases. If a phrase is used very frequently, it can serve as a signal that web pages have related content and should appear for the same searches. The same technique can also be used to detect spam, based on how many times a phrase is used in a single document.
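The idea of deciding which phrases are valid from how often they are used can be sketched as follows. This is a minimal illustration, not Google's actual method: the function names, the whitespace tokenisation, the three-word limit and the `min_docs` threshold are all assumptions made for the example.

```python
from collections import Counter


def candidate_phrases(text, max_len=3):
    """Yield every run of 1..max_len consecutive words as a lowercase phrase."""
    words = text.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def valid_phrases(documents, min_docs=2, max_len=3):
    """Return phrases found in at least min_docs documents, with their document counts.

    Treating "appears in several documents" as "valid" is an assumption
    made for this sketch.
    """
    doc_freq = Counter()
    for text in documents:
        # A set counts each phrase once per document, however often it repeats.
        doc_freq.update(set(candidate_phrases(text, max_len)))
    return {phrase: count for phrase, count in doc_freq.items() if count >= min_docs}


docs = [
    "the president of the united states spoke",
    "the united states senate met today",
    "a small dog barked",
]
good = valid_phrases(docs, min_docs=2)
```

Here "united states" is kept because it occurs in two documents, while "small dog" occurs in only one and is dropped.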
The system is used mainly to catch spam documents that repeat many key phrases but contain very little useful content. Pages stuffed with keywords can easily be eliminated this way. The technique has been helpful for websites that contain many ‘honeypots’. These are phrases that advertisers use frequently and therefore pay website owners to incorporate in their pages. A page with too many of them offers no real information to internet users and is therefore considered spam. The phrase-based indexing and retrieval system has also been used to detect duplicate content and to rank pages by popularity, so that the pages users actually want are given more prominence.
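A rough sketch of this kind of spam check is shown below. It measures what share of a page's words belong to a list of known key phrases and flags the page when that share is too high. The function names, the density measure and the 0.5 threshold are hypothetical choices for illustration only. The source does not say how the real system scores a page.

```python
def phrase_density(text, phrases):
    """Return the fraction of words in text that are covered by the given phrases."""
    words = text.lower().split()
    if not words:
        return 0.0
    covered = 0
    for phrase in phrases:
        target = phrase.lower().split()
        n = len(target)
        # Count every position where the phrase occurs word for word.
        hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
        covered += hits * n
    return covered / len(words)


def looks_like_spam(text, phrases, max_density=0.5):
    """Flag a page whose key-phrase density exceeds max_density (an assumed cutoff)."""
    return phrase_density(text, phrases) > max_density


key_phrases = ["cheap flights", "car insurance"]
stuffed = "cheap flights cheap flights car insurance cheap flights now"
normal = "we compared cheap flights to lisbon across three airlines last spring"
```

With these inputs the stuffed page is flagged and the ordinary page is not, because only two of its eleven words belong to a key phrase.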
Another important role of the phrase-based indexing and retrieval system is to explore documents on the internet and group the phrases that appear together in the same documents. It looks for the exact keyword and for any other common phrases that appear in both or all of the documents. Search results then bring these documents together, because they are likely to contain related information that users can use. The technique also determines the order in which documents appear on the results page.
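Grouping phrases that appear together, and ordering results by them, might look like the sketch below. It assumes each document has already been reduced to a set of phrases. The names `cooccurrence`, `related_to` and `rank`, the `min_shared` threshold and the scoring rule are all assumptions for illustration, not the system's actual ranking formula.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(doc_phrases):
    """Count, for each pair of phrases, how many documents contain both."""
    counts = defaultdict(int)
    for phrases in doc_phrases.values():
        for a, b in combinations(sorted(phrases), 2):
            counts[(a, b)] += 1
    return counts


def related_to(query, doc_phrases, min_shared=2):
    """Return phrases that share at least min_shared documents with the query phrase."""
    related = set()
    for (a, b), count in cooccurrence(doc_phrases).items():
        if count >= min_shared:
            if a == query:
                related.add(b)
            elif b == query:
                related.add(a)
    return related


def rank(query, doc_phrases, min_shared=2):
    """Order documents containing the query by how many related phrases they also contain."""
    related = related_to(query, doc_phrases, min_shared)
    hits = [
        (len(phrases & related), doc_id)
        for doc_id, phrases in doc_phrases.items()
        if query in phrases
    ]
    # Highest score first; ties are broken by document id for a stable order.
    return [doc_id for score, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


doc_phrases = {
    "a": {"white house", "president", "press secretary"},
    "b": {"white house", "president"},
    "c": {"white house", "paint"},
    "d": {"president", "press secretary"},
    "e": {"white house", "press secretary"},
}
order = rank("white house", doc_phrases)
```

Document "a" comes first because it contains both phrases related to the query, while "c" comes last because "paint" appears with "white house" in only one document.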
The phrase-based indexing and retrieval system can run into ambiguity when a term is common in two or more different contexts. In this case, the search engine creates clusters, or groups, of related documents. All of the clusters appear on the search results page, ordered so that the most common context comes first. Each cluster is given its own name, and the documents in which the cluster's phrases are used most are displayed on the results page. Google then determines how the results are presented. This gives internet users a broader results page and helps them find what they are looking for. Google’s aim in designing this system was to identify and deal with spam documents and to ensure that users get the most out of their search results.
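The clustering of an ambiguous term can be illustrated with a small sketch. It assumes the contexts and their phrases are already known and simply assigns each document to the context it overlaps most. The function name `cluster_results`, the example contexts and the overlap score are hypothetical. How the real system forms and names clusters is not described here.

```python
def cluster_results(doc_phrases, contexts):
    """Group documents by best-matching context and order clusters by size.

    doc_phrases maps a document id to its set of phrases.
    contexts maps a cluster name to the set of phrases typical of that context.
    Returns a list of (cluster name, document ids), largest cluster first.
    """
    if not contexts:
        return []
    clusters = {name: [] for name in contexts}
    for doc_id, phrases in doc_phrases.items():
        overlap = {name: len(phrases & ctx) for name, ctx in contexts.items()}
        best = max(overlap, key=overlap.get)
        if overlap[best] > 0:  # skip documents that match no context at all
            clusters[best].append((overlap[best], doc_id))
    # The most common context (the cluster with the most documents) comes first.
    ordered = sorted(clusters.items(), key=lambda kv: -len(kv[1]))
    return [
        (name, [doc_id for _, doc_id in sorted(members, key=lambda m: (-m[0], m[1]))])
        for name, members in ordered
        if members
    ]


contexts = {
    "animal": {"big cat", "rainforest", "prey"},
    "car": {"sedan", "engine", "dealership"},
}
jaguar_docs = {
    "d1": {"jaguar", "sedan", "engine"},
    "d2": {"jaguar", "rainforest"},
    "d3": {"jaguar", "dealership"},
    "d4": {"jaguar", "engine", "dealership", "sedan"},
}
results = cluster_results(jaguar_docs, contexts)
```

In this example the "car" cluster is listed first because three of the four documents fall into it, and within it "d4" leads because it uses the most phrases from that context.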