Search Engine Algorithms

Welcome to my first blog.

OK, so let's start with the most basic question, i.e. how search engines like Google, Yahoo, or Bing work. They start with a process called crawling: web crawlers visit and download a page, then extract its links to discover additional pages. Crawling is usually done by bots called spiders or crawling bots.
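To make the "extract its links" step concrete, here is a minimal sketch using only Python's standard library. It works on an HTML string instead of downloading anything, and the names LinkExtractor and extract_links, the sample page, and the example.com URLs are all made up for illustration. A real crawler does much more (politeness rules, deduplication, and so on).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links like "/menu" into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used only for this example.
page = '<p>See <a href="/menu">menu</a> and <a href="https://example.org/about">about</a>.</p>'
links = extract_links(page, "https://example.com/")
print(links)
```

The crawler would then put each discovered link in a queue and repeat the same step for it.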

Next, the search engine indexes each page it has crawled (basically, it gives every webpage a unique number and records which words appear on it). So, when the user enters a query, the search engine starts by matching the query's keywords against the webpages (that's just one of the ways) and then shows the results. The procedure through which the search engine decides what to show is known as the Search Engine Algorithm (SEA), and believe me, they are pretty interesting. Let's cover them one by one.
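Here is a rough sketch of what indexing and keyword matching could look like, assuming the simplest possible structure: a dictionary that maps each word to the numbers of the pages containing it (often called an inverted index). The page texts and the function names build_index and search are invented for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase the text and split it into simple word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page numbers it appears on."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain at least one keyword of the query."""
    matches = set()
    for token in tokenize(query):
        matches |= index.get(token, set())
    return sorted(matches)


# Hypothetical pages, each identified by a unique number.
pages = {
    1: "We serve coffee and tea",
    2: "Fresh bread and butter",
    3: "Tea time snacks",
}
index = build_index(pages)
print(search(index, "Coffee Tea"))
```

Looking words up in the index is much faster than rescanning every page for every query, which is the whole point of indexing.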

True OR False:

This technique is pretty simple but is considered a bit primitive because of its several drawbacks. The technique itself works like a yes/no question: does the webpage contain a keyword from the query or not?

For example, consider the query Coffee and Tea. For this, the search engine will search the internet for 3 words: "Coffee", "and", "Tea". If a webpage contains any of these, it will appear in the results. Now for the drawback: since the query has 3 words, the search engine finds every webpage that contains any of them and orders the results by token score.

Token score is the total number of times the query's keywords appear on the webpage. So, suppose one document contains the line "We serve Coffee and Tea in our Cafe." and another contains "Facts: Brazil and Vietnam are the largest producers of Coffee and China is the largest producer of Tea." Let's see their token scores: the first document has a token score of 3 (coffee = 1, and = 1, tea = 1), while the second document has a token score of 4 (coffee = 1, and = 2, tea = 1).

So, document two gets top priority even though it might not be the more relevant one, because its token score was raised by an extra "and". This is a major drawback of this SEA. Also, in many cases documents end up with equal token scores even though their content differs hugely, so there is no definite way to order the results. One thing to note is that the search engine's matching is not case sensitive.
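The token score idea fits in a few lines of Python. This is only a sketch: the function names are made up, and the second sentence is an illustrative one chosen so that it contains "and" twice.

```python
import re


def tokenize(text):
    # Lowercasing makes the matching case-insensitive.
    return re.findall(r"[a-z]+", text.lower())


def token_score(query, document):
    """Count how many times the query's words appear in the document."""
    doc_tokens = tokenize(document)
    return sum(doc_tokens.count(term) for term in tokenize(query))


query = "Coffee and Tea"
doc1 = "We serve Coffee and Tea in our Cafe."
doc2 = ("Facts: Brazil and Vietnam are the largest producers of Coffee "
        "and China is the largest producer of Tea.")

scores = {"doc1": token_score(query, doc1), "doc2": token_score(query, doc2)}
# Highest token score first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Here doc2 comes out on top purely because of the extra "and", which is exactly the weakness described above.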

Zone Indexes

In this method, we divide the webpage into zones and then give each zone its own score. Each zone has a different weightage depending on its function. Obviously, the body of the page contains the most important info, so we give it a weightage of 0.5. Similarly, we give a weightage of 0.4 to the description and only 0.1 to the title, as people might try to clickbait traffic.

Also, other sections like the author and time of publication are ignored here. Consider this example:

Query: Coffee and Tea

Zone          Weightage
Title         0.1
Description   0.4
Body          0.5
Total         1.0

So, if a website has our keywords in both the title and the description, it will have a score of 0.5 (0.1 + 0.4), but if a site has them in the body and the title, it will have a score of 0.6 (0.5 + 0.1). So, the second website will be shown first. This method also has a drawback: many web pages can get the same score.
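As a small sketch of the zone scoring, assume a page is just a dictionary with a "title", "description", and "body", and that a zone counts only when it contains every keyword. Both of those are my simplifications, and the two sites are invented.

```python
# Zone weights taken from the example table.
ZONE_WEIGHTS = {"title": 0.1, "description": 0.4, "body": 0.5}


def zone_score(page, keywords):
    """Add up the weights of the zones that contain all the keywords."""
    score = 0.0
    for zone, weight in ZONE_WEIGHTS.items():
        text = page.get(zone, "").lower()
        # Plain substring check: good enough for a sketch, but it would
        # also match "tea" inside a word like "team".
        if all(keyword.lower() in text for keyword in keywords):
            score += weight
    return round(score, 2)


# Hypothetical pages for the query "Coffee and Tea".
site_a = {
    "title": "Coffee and Tea",
    "description": "Best coffee and tea in town",
    "body": "Welcome to our cafe.",
}
site_b = {
    "title": "Coffee and Tea",
    "description": "A cosy cafe",
    "body": "We serve coffee and tea all day.",
}

keywords = ["coffee", "tea"]
print(zone_score(site_a, keywords), zone_score(site_b, keywords))
```

Because there are only a handful of possible weight combinations, ties between pages are very common with this method.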

Tip for devs: try to keep a good text-to-code ratio on your pages, it helps the search engine :)

Term Frequency

Remember the problem from the first method, i.e. the True OR False method: one word (which is more or less useless) impacts the whole score and makes it biased. We have a solution for it here. But first, a definition: a keyword's frequency (the number of times it appears) on the webpage is its term frequency (TF).

The solution is called inverse document frequency (IDF), which is the opposite of document frequency (DF). Document frequency is the number of documents in which a term occurs. IDF, being the opposite, goes down as the number of pages containing the keyword goes up. In the simplest form, IDF = 1 / DF. Ex: the keyword "XYZ" appears on two web pages, so its IDF value will be 0.5. (The usual textbook version is logarithmic, log(N / DF), where N is the total number of documents, but the idea is the same.)

Now, the formula for the overall score (usually called the TF-IDF score) is: (frequency of keyword) * (IDF)
With this formula, the effect of words like "and", "or", and "is" is reduced significantly: their IDF is very low, so even when multiplied by their frequency they won't bias the overall score.
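Here is a tiny sketch of that formula, using the simplified IDF = 1 / DF from the "XYZ" example. The six mini documents and the function names are made up, and a real engine would work with far more pages and most likely a logarithmic IDF.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # set() so a word is counted once per document.
        df.update(set(tokenize(doc)))
    return df


def tf_idf(term, doc, df):
    """Term frequency times the simplified IDF (1 / DF)."""
    if df[term] == 0:
        return 0.0
    tf = tokenize(doc).count(term)
    return tf * (1 / df[term])


# Hypothetical mini corpus: "and" is everywhere, "coffee" is rare.
docs = [
    "We serve coffee and tea in our cafe",
    "Brazil and Vietnam grow coffee and China grows tea",
    "Bread and butter",
    "Salt and pepper",
    "Milk and sugar",
    "Pen and paper",
]
df = document_frequency(docs)

for term in ("coffee", "and", "tea"):
    print(term, tf_idf(term, docs[1], df))
```

Even though "and" appears twice in the second document, its score ends up below that of "coffee" and "tea", because it shows up in every document.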

Vector Model

The basic idea behind this model is to compare each page to the original query and rank the pages by how similar they are.

In this model, we use a bit of trigonometry to score the web pages: the query and each page are treated as vectors, and we use the cosine of the angle between them.

A glimpse: if two vectors point in the same direction, their cosine is 1. If they have no similarity, it is 0, and if they point in opposite directions, it is -1.
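The cosine property is easy to check with a few lines of Python. This is a plain-list sketch with a made-up function name, not how a production engine would store its vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2], [2, 4]))    # same direction
print(cosine_similarity([1, 0], [0, 1]))    # nothing in common
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction
```

Since term weights are normally not negative, scores for real pages usually land somewhere between 0 and 1.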

Rubrics:

  • TF = term frequency

  • DF = document frequency

  • IDF = inverse document frequency

  • Wt., q = weight for term in query

  • Wt., d = weight for term in document

  • Product = Wt., q * Wt., d

  • Score = sum of the products

Consider the query to be Order food. As we can see, it has two keywords, so we do the calculations for both of them.

        Query                                          Document
term    TF   DF            IDF           Wt., q        TF   WF    Wt., d    Product
Order   1    25,500,000    3.610493159   3.610493159   21   441   0.70711   2.55302
Food    1    118,000,000   2.945151332   2.9452        21   441   0.70711   2.08258
Score: 4.6356
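For anyone who wants to retrace the numbers, here is a small sketch that recomputes the score. It assumes my reading of the document columns (TF = 21 for both terms, WF = TF squared = 441) and that Wt., d is the term frequency divided by the length of the document vector. The IDF values are copied from the table as they are.

```python
import math

# Wt., q equals the IDF here because each term appears once in the query.
query_weights = {"order": 3.610493159, "food": 2.945151332}

# Assumed reading of the table: both terms appear 21 times in the document.
doc_tf = {"order": 21, "food": 21}

# Length of the document vector: sqrt(21^2 + 21^2) = sqrt(441 + 441).
length = math.sqrt(sum(tf ** 2 for tf in doc_tf.values()))

# Wt., d: each term frequency divided by the vector length.
doc_weights = {term: tf / length for term, tf in doc_tf.items()}

# Product = Wt., q * Wt., d, and the score is the sum of the products.
products = {term: query_weights[term] * doc_weights[term] for term in query_weights}
score = sum(products.values())

print(doc_weights)
print(products)
print(round(score, 3))
```

The result matches the table up to rounding, which suggests the score is a cosine-style comparison between the query and the page.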

If the above table goes over your head, not a problem. Even just remembering its takeaway can help you with simple Search Engine Optimization (SEO):

We can conclude that the number of times you use a term is not necessarily important. It is important to find the right balance for the terms you want to rank for.

To speed up the process, a search engine can stop comparing pages once it has found the top 10, or score only the top N most relevant pages and rank those.
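Picking only the top N pages does not require sorting everything. A minimal sketch with Python's heapq module, where the page names and scores are invented:

```python
import heapq


def top_n(scores, n=10):
    """Return the n best (page, score) pairs, highest score first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Hypothetical scores from one of the earlier ranking methods.
scores = {"page-a": 4.64, "page-b": 1.2, "page-c": 3.3, "page-d": 0.4}
print(top_n(scores, n=2))
```

With millions of candidate pages, keeping just the best few around is a lot cheaper than fully ordering all of them.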

Relevance Feedback

Basically, what happens here is probably what we have all been waiting for: the search engine now works out which words in your query deserve more weightage. Consider our previous example, "Coffee and Tea". Now the search engine will give high weightage or priority to "Coffee" and "Tea", while completely ignoring the "and" keyword.

The search engine does this by checking how relevant your query is to the top search results. The top search results here are identified using the click-through rate (CTR) and bounce rate of other users for similar queries. There is a formula for relevance feedback, but it is pretty irrelevant to most people.
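For the curious, one commonly used formulation is the Rocchio algorithm. I can't say it is the exact formula a given search engine uses, so treat this as a sketch: the new query is the old query, pulled towards the average of the relevant results and pushed away from the average of the non-relevant ones. The alpha, beta, and gamma values and all the vectors below are illustrative assumptions.

```python
def centroid(vectors, size):
    """Average of a list of vectors (all zeros if the list is empty)."""
    if not vectors:
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update: alpha*query + beta*relevant - gamma*non_relevant."""
    size = len(query)
    rel = centroid(relevant, size)
    non = centroid(non_relevant, size)
    updated = [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]
    # Negative weights make little sense for terms, so clip them to zero.
    return [max(0.0, weight) for weight in updated]


# Vector positions stand for the terms: coffee, and, tea (made-up weights).
query = [1.0, 1.0, 1.0]
relevant = [[2.0, 0.0, 1.0], [1.0, 0.0, 2.0]]   # results users clicked and stayed on
non_relevant = [[0.0, 3.0, 0.0]]                 # results users bounced from

new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

After the update, "coffee" and "tea" carry more weight than "and", which is the behaviour described above.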

The danger of this method is topic drift, where the adjusted query slowly moves away from what the user originally meant.

Conclusion

Now we have covered most of the basic algorithms used by popular search engines to produce results. But there are several more factors that are taken into account, like your previous searches, your location, and many more.

In the end, one tip from my side: for better SEO, it is important to find the right balance for the terms you want to rank for.

Also, now with AI in the picture, all this might change soon :) Thank you for your time.

Feedback link - https://d8uamy8rtn2.typeform.com/to/eNRVcfmb