An exclusive look inside the Google search algorithm

The March 2010 issue of Wired features “an exclusive look at the algorithm that rules the Web.” It’s a fascinating three-page article that reveals much about Google’s tireless drive to improve itself. Most significant, though, are the hints about which hidden ‘signals’ determine which websites end up at the top of the search results, and the realization that Google’s system is ever-changing.

Wired Magazine – Exclusive: How Google’s Algorithm Rules the Web

Web search is a multipart process. First, Google crawls the Web to collect the contents of every accessible site. This data is broken down into an index (organized by word, just like the index of a textbook), a way of finding any page based on its content. Every time a user types a query, the index is combed for relevant pages, returning a list that commonly numbers in the hundreds of thousands, or millions. The trickiest part, though, is the ranking process — determining which of those pages belong at the top of the list.
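
To make the "index organized by word" idea concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the general technique, not Google's implementation; the function names `build_index` and `search` and the sample pages are made up for this example.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page IDs whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the IDs of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the pages matching the first word, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sample data standing in for crawled pages.
pages = {
    "a.html": "google crawls the web",
    "b.html": "the index of a textbook",
    "c.html": "ranking the web index",
}
index = build_index(pages)
matches = search(index, "the web")  # pages containing both words
```

The lookup only finds candidate pages. Deciding which of them belongs at the top is the separate ranking step the article goes on to describe.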

That’s where the contextual signals come in. All search engines incorporate them, but none has added as many or made use of them as skillfully as Google has. PageRank itself is a signal, an attribute of a Web page (in this case, its importance relative to the rest of the Web) that can be used to help determine relevance. Some of the signals now seem obvious.

Early on, Google’s algorithm gave special consideration to the title on a Web page — clearly an important signal for determining relevance. Another key technique exploited anchor text, the words that make up the actual hyperlink connecting one page to another. As a result, “when you did a search, the right page would come up, even if the page didn’t include the actual words you were searching for,” says Scott Hassan, an early Google architect who worked with Page and Brin at Stanford. “That was pretty cool.”

Later signals included attributes like freshness (for certain queries, pages created more recently may be more valuable than older ones) and location (Google knows the rough geographic coordinates of searchers and favors local results). The search engine currently uses more than 200 signals to help rank its results.
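
One simple way to picture how several signals could feed a single ranking is a weighted sum. The sketch below is an assumption for illustration only: the article does not say how Google combines its signals, and the signal formulas, weights, and field names (`title`, `anchor_text`, `age_days`, `pagerank`) are all hypothetical.

```python
def score(page, query_words, weights):
    """Combine a few per-page signals into one relevance score."""
    title_words = set(page["title"].lower().split())
    anchor_words = set(page["anchor_text"].lower().split())
    n = len(query_words)
    signals = {
        # Fraction of query words found in the page title.
        "title": len(query_words & title_words) / n,
        # Fraction of query words found in the text of links to the page.
        "anchor": len(query_words & anchor_words) / n,
        # Newer pages score closer to 1.0 (hypothetical decay formula).
        "freshness": 1.0 / (1.0 + page["age_days"] / 365.0),
        # Precomputed importance value, assumed to lie between 0 and 1.
        "pagerank": page["pagerank"],
    }
    return sum(weights[name] * value for name, value in signals.items())

def rank(pages, query, weights):
    """Return the pages sorted from highest to lowest score."""
    query_words = set(query.lower().split())
    if not query_words:
        return []
    return sorted(pages, key=lambda p: score(p, query_words, weights),
                  reverse=True)

# Hypothetical weights and pages, chosen only to show the mechanics.
weights = {"title": 3.0, "anchor": 2.0, "freshness": 1.0, "pagerank": 1.0}
pages = [
    {"url": "b", "title": "cooking recipes", "anchor_text": "food",
     "age_days": 730, "pagerank": 0.5},
    {"url": "a", "title": "python tutorial", "anchor_text": "learn python",
     "age_days": 0, "pagerank": 0.5},
]
ranked = rank(pages, "python tutorial", weights)
```

Adding a new signal in a scheme like this means adding one entry to `signals` and one weight, which gives a feel for why a signal-based system is easy to extend.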

And Google keeps improving. Recently, search engineer Maureen Heymans discovered a problem with “Cindy Louise Greenslade.” The algorithm figured out that it should look for a person — in this case a psychologist in Garden Grove, California — but it failed to place Greenslade’s homepage in the top 10 results. Heymans found that, in essence, Google had downgraded the relevance of her homepage because Greenslade used only her middle initial, not her full middle name as in the query. “We needed to be smarter than that,” Heymans says. So she added a signal that looks for middle initials. Now Greenslade’s homepage is the fifth result.
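
The article does not describe how the middle-initial signal works internally. As a rough sketch of the idea, the hypothetical function below (`name_matches`) treats a middle initial on either side as a match for a full middle name that starts with the same letter.

```python
def name_matches(query_name, page_name):
    """Return True if two names agree, allowing middle initials."""
    q = query_name.lower().replace(".", "").split()
    p = page_name.lower().replace(".", "").split()
    # This sketch requires the same number of name parts on both sides.
    if len(q) != len(p) or len(q) < 2:
        return False
    # First and last names must match exactly.
    if q[0] != p[0] or q[-1] != p[-1]:
        return False
    # Each middle name must match in full or by initial.
    for q_mid, p_mid in zip(q[1:-1], p[1:-1]):
        if q_mid == p_mid:
            continue
        if len(p_mid) == 1 and q_mid.startswith(p_mid):
            continue
        if len(q_mid) == 1 and p_mid.startswith(q_mid):
            continue
        return False
    return True

matched = name_matches("Cindy Louise Greenslade", "Cindy L. Greenslade")
```

A result like this could then be used as one more signal in the ranking, rather than as a hard filter on which pages appear.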

This flexibility — the ability to add signals, tweak the underlying code, and instantly test the results — is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months Google has made more than 200 improvements, some of which seem to mimic — even outdo — the offerings of its competitors.