Friday 25 September 2009

How Search Engines Work

There are billions of pages on the internet. Possibly trillions, and maybe gazillions. And not all of them are indexed by the search engines (they can't possibly index and store it all; their servers just aren't that powerful).

So what they do is trawl through the web looking for material and then analyse what the material is about. They then have to decide how important the material is and whether to list it in their index or not.

Google, Bing and Yahoo have bots that are constantly out there looking for material. In the parlance, they are "spidering" the web, because like a spider they crawl from one page to another using the links on the page, and when they find a new page, they index it.
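
If you like to see things in code, here is a minimal sketch in Python of what a spider does. The seed URL and the 50-page cap are made up for the example, and a real crawler does far more - obeying robots.txt, throttling its requests, and handling errors carefully:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        # Collects the href of every <a> tag on the page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=50):
        # Breadth-first crawl: fetch a page, hand it to the indexer,
        # then follow every link it contains.
        queue, seen = deque([seed_url]), set()
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="ignore")
            except OSError:
                continue  # dead link - move on
            # ... a real spider would index the page contents here ...
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))  # resolve relative links
        return seen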

In my previous post I mentioned that social media does have its uses, it's just not about directly making money. Well, here we come to the first use of social media - getting your website noticed. The spiders and bots practically live on social media sites such as Digg and Reddit. So to get your site indexed for the first time, simply submit a page (or get someone else to submit a page) to one of the social sites. You just need to submit one page. That's it. You don't need to spam the social sites with all your material. The bot will crawl from the social media site to your page, index it, and then for good measure index your entire site, especially if you have laid it out nicely with dynamic linking (a lesson we will come to later).

Don't submit your page to the search engines (and certainly don't pay someone to submit your page for you). Let the spider find you. Why? Because when the spider finds you, you are just dealing with a bot. If you submit to the search engines, you will be dealing with a human, and a) it will take an age for them to get round to viewing and approving you, and b) the only people who submit sites are internet marketers - normal bloggers and webmasters don't - so this sends up a flag in the search engine to watch you closely.

What happens once the bot has found and indexed your page? The bot has to try to decide what your page is about. The title is a big clue (titles usually summarise the entire page). Therefore always choose your title carefully, with your main keyword towards the front. Don't be tempted to go for the jokey titles or puns that the tabloid newspapers love - the bots have no sense of humour and won't get the point, and your page won't get found as a result.
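
To make the point concrete, here is a toy Python illustration of why the keyword's position matters. The scoring rule is invented for the example - real engines are vastly more subtle - but it captures the idea that a pun without the keyword scores nothing:

    import re

    def title_keyword_score(title, keyword):
        # Toy rule: the nearer the keyword sits to the front of the
        # title, the higher the score; absent means no score at all.
        words = re.findall(r"[a-z']+", title.lower())
        if keyword.lower() not in words:
            return 0.0  # the bot doesn't get the joke
        return 1.0 / (words.index(keyword.lower()) + 1)

    print(title_keyword_score("Search Engines Explained", "search"))           # 1.0
    print(title_keyword_score("How Search Engines Work", "search"))            # 0.5
    print(title_keyword_score("A Fishy Business in the Engine Room", "search"))  # 0.0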

Next the bot will try to analyse your content. It builds a matrix listing every single word in your article and counts how many times each word comes up. It then eliminates the stop words - the prepositions, conjunctions and articles such as "the", "a" and "and" with which we build sentences - and concentrates instead on the nouns, verbs, adjectives and adverbs. In the olden days people used to keyword-stuff their pages (i.e. mention their main keyword over and over to make the point). This is not necessary anymore; the search engines have become more sophisticated. They now look for patterns of related words (for instance, a post about search engines will usually mention crawling, but won't mention running). You may inadvertently have several related patterns on your page - the biggest pattern wins, it is pegged as what your page is all about, and everything else is discarded. To make things easy for the search engines, always write your post to be tightly on topic. Don't confuse the bot by going off on tangents and talking about material that is not relevant to the title of your article.
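
Here is a simplified sketch of that first frequency pass. The stop-word list is deliberately tiny and illustrative - real engines use far larger ones:

    import re
    from collections import Counter

    # A tiny illustrative stop-word list; real engines use much larger ones.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
                  "on", "is", "it", "for", "with", "that", "this"}

    def keyword_profile(text, top_n=5):
        # Count every word, drop the stop words, and return the words
        # that dominate the page - the bot's first guess at the topic.
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS)
        return counts.most_common(top_n)

    page = ("Search engines crawl the web. A search engine sends a spider "
            "to crawl pages, and the spider follows links between pages.")
    print(keyword_profile(page))
    # [('search', 2), ('crawl', 2), ('spider', 2), ('pages', 2), ('engines', 1)]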

The next thing the bots do is check whether any other pages link to you. The anchor text (the set of words the hyperlink is anchored onto) gives them another big clue as to what your page is about. The reasoning is simple - the person linking to you usually anchors the link on a set of words that summarises your page. As you gradually accumulate links, the search engine builds up a picture of what your page is about, and therefore which search results pages it should list you on.
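
A sketch of how those anchor texts might be tallied (the pages and anchors below are, of course, made up):

    from collections import Counter

    # Hypothetical inbound links: (linking page, anchor text, target page).
    inbound_links = [
        ("blog-a.example.com", "how search engines work", "mysite.example.com/seo"),
        ("blog-b.example.com", "search engine basics",    "mysite.example.com/seo"),
        ("forum.example.com",  "how search engines work", "mysite.example.com/seo"),
    ]

    def anchor_profile(links, target):
        # Tally the anchor text of every link pointing at the target.
        # Recurring phrases are a strong hint at what the page is about.
        return Counter(anchor for _, anchor, page in links if page == target)

    print(anchor_profile(inbound_links, "mysite.example.com/seo"))
    # Counter({'how search engines work': 2, 'search engine basics': 1})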

But how do they rank you on the results page? The search engines are looking to deliver the most relevant pages, and also the best pages. So first they look for the pages that most closely match what the person has typed into the search engine (in terms of what is on the page, the title of the page, and the anchor text of links to the page); then, of all the pages that match closely, they try to work out which are best, and rank them accordingly in the results.
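
Put together, a toy relevance score might look like the sketch below. The weights are invented for the example (nobody outside the search engines knows the real ones), but they show how title, content and anchor text can be combined:

    def relevance_score(query, page):
        # Toy score: count how many query words appear in the title,
        # the body and the inbound anchor text, with invented weights.
        terms = set(query.lower().split())
        title_hits  = sum(t in page["title"].lower()   for t in terms)
        body_hits   = sum(t in page["body"].lower()    for t in terms)
        anchor_hits = sum(t in page["anchors"].lower() for t in terms)
        return 3 * title_hits + 1 * body_hits + 2 * anchor_hits

    page = {"title":   "How Search Engines Work",
            "body":    "Spiders crawl the web and index pages...",
            "anchors": "how search engines work"}
    print(relevance_score("search engines", page))  # 3*2 + 0 + 2*2 = 10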

How do they work out which is best? They simply use the wisdom of crowds: the pages with the most links to them are deemed the most popular. The reasoning is that real humans will have read the pages, and real humans will only link to pages they find worthwhile and wish to direct their readers to. Nobody willingly links out to rubbish. In fact, if you find a link to a poorly written page, chances are very high that the person who created the link also wrote the rubbishy page - and this is how Google engineers catch those who are trying to game the system. They just have to find one rubbish page and track it backwards via the links to it, and then they usually deindex the whole lot. This is also why blackhatting doesn't really work - it's easy for an experienced engineer to track.
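
In code, the crude version of that popularity measure is nothing more than counting inbound links, as in the sketch below (the link graph is made up; Google's actual PageRank refines the idea by also weighing how important each linking page is):

    from collections import Counter

    # A hypothetical link graph: (linking page, linked-to page) pairs.
    links = [
        ("a.example.com", "c.example.com"),
        ("b.example.com", "c.example.com"),
        ("a.example.com", "d.example.com"),
        ("b.example.com", "d.example.com"),
        ("c.example.com", "d.example.com"),
    ]

    def popularity(link_graph):
        # Rank pages by raw inbound-link count - the "wisdom of
        # crowds" measure described above.
        return Counter(target for _, target in link_graph).most_common()

    print(popularity(links))
    # [('d.example.com', 3), ('c.example.com', 2)]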

Google is actually a hand-built search engine, not a machine-built one like Yahoo - and this is key to its success. They have teams of engineers hand-checking the most searched-for terms and the biggest money niches where the blackhatters congregate, and they will deindex any spam they come across. So for all the talk about the algorithms that search engines use, always build your sites in a way a real human will find useful and valuable, and always remember that real humans include Google engineers, who may be inspecting your site and who have the power to deindex you.
