Anyone who has spent some time on the internet has used a search engine. But we hardly spare a thought for how a search engine works, or, for that matter, how a search works. It really doesn’t matter to most of us as long as the likes of Google and Bing can dish out relevant results for our queries.
But this question matters to those who are interested in search marketing, to the inquisitive who would not shy away from some extra knowledge, and to those who would like to appreciate the incredible technology that goes into fetching those results. How does a search engine work? What happens when you enter a query? How does the search engine fetch the relevant results? And how are the results ranked? These are some of the questions that I attempt to answer in this article.
What is a search engine? A search engine is a program that methodically and automatically browses the World Wide Web, stores and indexes the data it finds, and then allows users to query that data, returning results that are as relevant as possible.
The entire search begins long before you have even thought of something to search for. It begins with creating an inventory of pages in the search index. The search index comprises all possible keywords, each mapped to the web pages that contain it. However, to save space, the index does not store the web page URLs themselves but unique document IDs, which identify those URLs in a separate database.
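To make the idea of a keyword-to-document-ID mapping concrete, here is a minimal sketch in Python. It is only an illustration under simple assumptions: the function name build_index, the example.com URLs and the whitespace tokenizer are all hypothetical, and a real search index is far more elaborate.

```python
from collections import defaultdict


def build_index(pages):
    """Build a toy inverted index from a {url: text} mapping.

    Returns (doc_urls, index), where doc_urls maps document ID -> URL
    (the "separate database") and index maps keyword -> sorted list of
    document IDs.
    """
    doc_urls = {}
    postings = defaultdict(set)
    for doc_id, (url, text) in enumerate(pages.items()):
        doc_urls[doc_id] = url
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for word in text.lower().split():
            postings[word].add(doc_id)
    index = {word: sorted(ids) for word, ids in postings.items()}
    return doc_urls, index


# Hypothetical pages, used only for illustration.
pages = {
    "http://example.com/a": "search engines crawl the web",
    "http://example.com/b": "spiders crawl links",
}
doc_urls, index = build_index(pages)
print(index["crawl"])                          # document IDs, not URLs
print([doc_urls[i] for i in index["crawl"]])   # IDs resolved back to URLs
```

Note that the index itself only ever holds small integers. The URLs live once in doc_urls, which is the space saving described above.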
The construction of this search index begins with a spider (or crawler, or bot). The spider starts by examining web pages in a seed list but then discovers sites on its own by following links. The spider identifies links by checking the HTML code of the web pages it visits. Thus, theoretically, given enough time, a spider can find every page on the web (or at least every page that is linked from at least one other page). But that is purely theoretical. Studies that have tried to measure how much of the web is actually indexed throw up widely divergent numbers, from 0.03% to 16% of the web.
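The seed-list-then-follow-links behaviour can be sketched in a few lines of Python. This is an assumption-laden toy: FAKE_WEB stands in for the real web so that the example runs without network access, and the names LinkExtractor and crawl are hypothetical. Real spiders also deal with politeness rules, robots.txt, duplicates and much more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch):
    """Breadth-first crawl. fetch(url) returns HTML, or None if missing."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A tiny in-memory "web" (hypothetical pages) so the sketch needs no network.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
    "http://example.com/orphan": "no page links here",
}
visited = crawl(["http://example.com/"], FAKE_WEB.get)
print(visited)
```

The orphan page is never visited because no other page links to it, which is exactly the limitation noted in the parenthesis above.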
While crawling is probably the most efficient way of discovering web pages, it is definitely not the most efficient way of discovering changes made to a web page. This is simply because there is no guarantee of when the spider will return to a site. By then a web page could have changed dramatically or even ceased to exist. Once the spider has found a web page and added it to the index, it is time for the search engine to analyze that page.
That is just about the simplest description of what a search engine does to build the search index. Crawling, indexing and analysis could very well be the topic of a dedicated article. But that is not the point of this one. So let’s move ahead to find out what happens when you actually enter a query.
Once you have typed in your query and clicked on the search button (or pressed the enter key), the search engine starts by matching the search query to pages in the search index. The first step in the process is to analyze the query. The search engine examines each word in the query to find the best web pages in the search index that match. Query analysis involves finding word variants, correcting spellings, detecting phrases and antiphrases (words such as ‘what’, ‘is’ and ‘the’), examining word order and processing search operators.
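A few of these analysis steps can be sketched in Python. Everything here is an assumption made for illustration: the ANTIPHRASES and VOCABULARY lists are tiny stand-ins, the operator syntax is a guess at the familiar word:value form, and spelling correction is approximated with the standard library's difflib rather than whatever a real engine uses.

```python
import difflib
import re

# Hypothetical, tiny word lists; a real engine would use far larger ones.
ANTIPHRASES = {"what", "is", "the", "a", "an", "of"}
VOCABULARY = ["search", "engine", "spider", "index", "mouse"]


def analyze_query(query):
    """Split a raw query into quoted phrases, operators and cleaned terms."""
    # 1. Detect quoted phrases and remove them from the remaining text.
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]+"', " ", query)
    # 2. Pull out operators of the form name:value (e.g. site:example.com).
    operators = dict(re.findall(r"(\w+):(\S+)", rest))
    rest = re.sub(r"\w+:\S+", " ", rest)
    # 3. Drop antiphrases and nudge near-misses towards known words.
    terms = []
    for word in re.findall(r"[a-z0-9]+", rest.lower()):
        if word in ANTIPHRASES:
            continue
        close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
        terms.append(close[0] if close else word)
    return {"terms": terms, "phrases": phrases, "operators": operators}


result = analyze_query('what is the serch "web crawler" site:example.com')
print(result)
```

In this sketch the misspelt ‘serch’ is mapped to ‘search’ only because ‘search’ happens to be in the toy vocabulary. Unknown words are passed through unchanged.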
Once the analysis is done, the next task for the search engine is to decide which results to present. With hundreds of thousands of possibilities, this is a tough task. This is where the search index comes into use. The search engine uses the index to locate matching pages based not only on the query as you entered it but also on any word variants (e.g. ‘mouse’ and ‘mice’) and on the words it has decided to ignore.
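As a rough sketch of this lookup step, the Python below expands each query term to its variants, collects the document IDs for every variant, and keeps only the documents that match all terms. The VARIANTS table, the toy index and the choice to require every term are assumptions for illustration, not a description of how any particular engine behaves.

```python
# Hypothetical variant table; real engines derive variants by stemming etc.
VARIANTS = {
    "mouse": {"mouse", "mice"},
    "mice": {"mouse", "mice"},
}


def lookup(index, terms):
    """Return sorted IDs of documents matching every term (or a variant)."""
    result = None
    for term in terms:
        ids = set()
        for variant in VARIANTS.get(term, {term}):
            ids |= set(index.get(variant, []))
        # Intersect across terms: a document must match all of them.
        result = ids if result is None else result & ids
    return sorted(result or [])


# Toy index: keyword -> document IDs.
index = {"mouse": [1, 4], "mice": [2], "trap": [2, 4, 7]}
matches = lookup(index, ["mouse", "trap"])
print(matches)
```

Document 2 contains only ‘mice’, yet it still matches the query term ‘mouse’ because of the variant expansion.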
Now comes the most interesting and challenging phase of the search engine’s job: ranking the matching pages. This is where the ranking algorithm comes into play (the most famous of which is Google’s PageRank™ algorithm). Ranking, very simply put, is just sorting by relevance. A variety of factors are considered while ranking the matching pages. These include keyword density, keyword proximity, keyword prominence and link popularity. Link popularity has emerged as the most widely used factor in ranking since it can act as a surrogate for quality and reliability.
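To show what ‘sorting by relevance’ might look like, here is a toy scoring function in Python that blends three of the factors named above. The formulas, the weights and the cap of 1000 inbound links are invented purely for illustration. No real search engine publishes its formula, and keyword proximity is left out to keep the sketch short.

```python
import math


def score(page, terms, weights=(0.4, 0.2, 0.4)):
    """Blend keyword density, keyword prominence and link popularity."""
    words = page["text"].lower().split()
    hits = [i for i, word in enumerate(words) if word in terms]
    if not hits:
        return 0.0
    # Density: share of the page's words that are query terms.
    density = len(hits) / len(words)
    # Prominence: the earlier the first match, the closer to 1.
    prominence = 1.0 - hits[0] / len(words)
    # Link popularity: log-scaled inbound link count, capped at 1.
    popularity = min(math.log1p(page["inbound_links"]) / math.log1p(1000), 1.0)
    w_density, w_prominence, w_links = weights
    return w_density * density + w_prominence * prominence + w_links * popularity


def rank(pages, terms):
    """Sort pages from most to least relevant."""
    return sorted(pages, key=lambda page: score(page, terms), reverse=True)


# Hypothetical pages for illustration.
pages = [
    {"url": "a", "text": "spider crawls the web", "inbound_links": 5},
    {"url": "b", "text": "the web has a spider", "inbound_links": 500},
    {"url": "c", "text": "cooking recipes", "inbound_links": 900},
]
ranked = rank(pages, {"spider"})
print([page["url"] for page in ranked])
```

With these made-up weights, page b outranks page a despite its weaker keyword signals, because its link popularity is much higher. Page c has many links but no matching keyword, so it scores zero.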
Sounds simple? This is what Google has to say about their PageRank™ [1]: “We use more than 200 signals, including our patented PageRank™ algorithm, to examine the entire link structure of the web and determine which pages are most important. We then conduct hypertext-matching analysis to determine which pages are relevant to the specific search being conducted. By combining overall importance and query specific relevance, we’re able to put the most relevant and reliable results first”.
So now you know what happens in those milliseconds between hitting the enter key and seeing the results. This article is an attempt to introduce as many readers as possible to the intricacies of a piece of technology that has become ubiquitous in our lives.
PS: About Google’s PageRank™: PageRank™ relies mainly on the ‘democratic nature’ of the web, using its vast link structure as an indicator of an individual page’s value. Important, high-quality sites receive a higher PageRank™. Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at a lot more than the sheer volume of links a page receives. For example, it also analyzes the page that casts the vote: votes cast by pages that are themselves important weigh more heavily and help to make other pages important. A site’s rate of link acquisition, the longevity of a link, the text used for the link, whether it is a ‘deep link’ or a link to the homepage, and whether anyone clicks on the link also seem to count.
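The ‘links as weighted votes’ idea can be illustrated with the simplified, textbook form of PageRank, computed by repeated iteration. This is a sketch under stated assumptions, not Google's production system: the damping factor of 0.85 is the value commonly quoted for the original published algorithm, the four-page link graph is hypothetical, and none of the extra signals mentioned above (link age, anchor text, clicks) are modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its vote equally among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, C and D all "vote" for B; B votes for C.
links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

B comes out on top because it collects the most votes. C receives only a single link, but that link comes from the important page B, so C ends up well ahead of A and D. This is the ‘votes by important pages weigh more’ effect in miniature.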
That’s about all that we know about PageRank™. The rest of the mystery is safely secured in Mountain View.
[1] http://www.google.com/corporate/tech.html