Behind Google’s Search Algorithm
Okay folks, ever wondered how Google finds everything from Antarctica to the Arctic when you search? Read on. This is a very interesting tech article from Wired. I hope those of you who haven’t read it already will like it.
Want to know how Google is about to change your life? Stop by the Ouagadougou conference room on a Thursday morning. It is here, at the Mountain View, California, headquarters of the world’s most powerful Internet company, that a room filled with three dozen engineers, product managers, and executives figure out how to make their search engine even smarter. This year, Google will introduce 550 or so improvements to its fabled algorithm, and each will be determined at a gathering just like this one. The decisions made at the weekly Search Quality Launch Meeting will wind up affecting the results you get when you use Google’s search engine to look for anything — “Samsung SF-755p printer,” “Ed Hardy MySpace layouts,” or maybe even “capital Burkina Faso,” which just happens to share its name with this conference room. Udi Manber, Google’s head of search since 2006, leads the proceedings. One by one, potential modifications are introduced, along with the results of months of testing in various countries and multiple languages. A screen displays side-by-side results of sample queries before and after the change. Following one example — a search for “guitar center wah-wah” — Manber cries out, “I did that search!”
You might think that after a solid decade of search-market dominance, Google could relax. After all, it holds a commanding 65 percent market share and is still the only company whose name is synonymous with the verb search. But just as Google isn’t ready to rest on its laurels, its competitors aren’t ready to concede defeat. For years, the Silicon Valley monolith has used its mysterious, seemingly omniscient algorithm to, as its mission statement puts it, “organize the world’s information.” But over the past five years, a slew of companies have challenged Google’s central premise: that a single search engine, through technological wizardry and constant refinement, can satisfy any possible query. Facebook launched an early attack with its implication that some people would rather get information from their friends than from an anonymous formula. Twitter’s ability to parse its constant stream of updates introduced the concept of real-time search, a way of tapping into the latest chatter and conversation as it unfolds. Yelp helps people find restaurants, dry cleaners, and babysitters by crowdsourcing the ratings. None of these upstarts individually presents much of a threat, but together they hint at a wide-open, messier future of search — one that isn’t dominated by a single engine but rather incorporates a grab bag of services.
Still, the biggest threat to Google can be found 850 miles to the north: Bing. Microsoft’s revamped and rebranded search engine — with a name that evokes discovery, a famous crooner, or Tony Soprano’s strip joint — launched last June to surprisingly upbeat reviews. (The Wall Street Journal called it “more inviting than Google.”) The new look, along with a $100 million ad campaign, helped boost Microsoft’s share of the US search market from 8 percent to about 11 — a number that will more than double once regulators approve a deal to make Bing the search provider for Yahoo.
Team Bing has been focusing on unique instances where Google’s algorithms don’t always satisfy. For example, while Google does a great job of searching the public Web, it doesn’t have real-time access to the byzantine and constantly changing array of flight schedules and fares. So Microsoft purchased Farecast — a Web site that tracks airline fares over time and uses the data to predict when ticket prices will rise or fall — and incorporated its findings into Bing’s results. Microsoft made similar acquisitions in the health, reference, and shopping sectors, areas where it felt Google’s algorithm fell short.
Even the Bingers confess that, when it comes to the simple task of taking a search term and returning relevant results, Google is still miles ahead. But they also think that if they can come up with a few areas where Bing excels, people will get used to tapping a different search engine for some kinds of queries. “The algorithm is extremely important in search, but it’s not the only thing,” says Brian MacDonald, Microsoft’s VP of core search. “You buy a car for reasons beyond just the engine.”
Google’s response can be summed up in four words: mike siwek lawyer mi.
Amit Singhal types that koan into his company’s search box. Singhal, a gentle man in his forties, is a Google Fellow, an honorific bestowed upon him four years ago to reward his rewrite of the search engine in 2001. He jabs the Enter key. In a time span best measured in a hummingbird’s wing-flaps, a page of links appears. The top result connects to a listing for an attorney named Michael Siwek in Grand Rapids, Michigan. It’s a fairly innocuous search — the kind that Google’s servers handle billions of times a day — but it is deceptively complicated. Type those same words into Bing, for instance, and the first result is a page about the NFL draft that includes safety Lawyer Milloy. Several pages into the results, there’s no direct referral to Siwek.
The comparison demonstrates the power, even intelligence, of Google’s algorithm, honed over countless iterations. It possesses the seemingly magical ability to interpret searchers’ requests — no matter how awkward or misspelled. Google refers to that ability as search quality, and for years the company has closely guarded the process by which it delivers such accurate results. But now I am sitting with Singhal in the search giant’s Building 43, where the core search team works, because Google has offered to give me an unprecedented look at just how it attains search quality. The subtext is clear: You may think the algorithm is little more than an engine, but wait until you get under the hood and see what this baby can really do.
The story of Google’s algorithm begins with PageRank, the system invented in 1997 by cofounder Larry Page while he was a grad student at Stanford. Page’s now legendary insight was to rate pages based on the number and importance of links that pointed to them — to use the collective intelligence of the Web itself to determine which sites were most relevant. It was a simple and powerful concept, and — as Google quickly became the most successful search engine on the Web — Page and cofounder Sergey Brin credited PageRank as their company’s fundamental innovation.
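To make the link-counting idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is an illustration only, not Google's production code: the toy link graph, the function name, and the damping value of 0.85 (the figure commonly cited for the original algorithm) are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank on a tiny link graph.

    links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, "c" ends up on top because three pages point to it, and "a" comes second because its single inbound link comes from the important page "c". That is the "number and importance of links" idea in miniature.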
But that wasn’t the whole story. “People hold on to PageRank because it’s recognizable,” Manber says. “But there were many other things that improved the relevancy.” These involve the exploitation of certain signals, contextual clues that help the search engine rank the millions of possible results to any query, ensuring that the most useful ones float to the top.
Web search is a multipart process. First, Google crawls the Web to collect the contents of every accessible site. This data is broken down into an index (organized by word, just like the index of a textbook), a way of finding any page based on its content. Every time a user types a query, the index is combed for relevant pages, returning a list that commonly numbers in the hundreds of thousands, or millions. The trickiest part, though, is the ranking process — determining which of those pages belong at the top of the list.
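The index described above is what programmers usually call an inverted index. The sketch below is a toy version under simplifying assumptions: the three "pages" are invented, words are split on whitespace, and a page matches only if it contains every query word. Ranking, the hard part, is deliberately left out.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def lookup(index, query):
    """Return the ids of pages containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawl results: page id -> page text.
pages = {
    "p1": "capital of Burkina Faso",
    "p2": "Burkina Faso travel guide",
    "p3": "guitar center wah-wah pedal",
}
index = build_index(pages)
print(lookup(index, "burkina faso"))
print(lookup(index, "capital faso"))
```

Looking words up in the index is cheap; the list it returns is unordered, which is why the ranking step carries so much weight.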
That’s where the contextual signals come in. All search engines incorporate them, but none has added as many or made use of them as skillfully as Google has. PageRank itself is a signal, an attribute of a Web page (in this case, its importance relative to the rest of the Web) that can be used to help determine relevance. Some of the signals now seem obvious. Early on, Google’s algorithm gave special consideration to the title on a Web page — clearly an important signal for determining relevance. Another key technique exploited anchor text, the words that make up the actual hyperlink connecting one page to another. As a result, “when you did a search, the right page would come up, even if the page didn’t include the actual words you were searching for,” says Scott Hassan, an early Google architect who worked with Page and Brin at Stanford. “That was pretty cool.” Later signals included attributes like freshness (for certain queries, pages created more recently may be more valuable than older ones) and location (Google knows the rough geographic coordinates of searchers and favors local results). The search engine currently uses more than 200 signals to help rank its results.
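One simple way to picture how several signals combine is a weighted sum. The article does not say how Google actually blends its 200-plus signals, so everything in this sketch is an assumption: the three signals chosen, the weights, the page data, and the function name.

```python
def score(page, query_words, weights):
    """Combine a few illustrative signals into one relevance score."""
    title_words = page["title"].lower().split()
    anchor_words = page["anchor_text"].lower().split()
    signals = {
        # Fraction of query words found in the page title.
        "title": sum(w in title_words for w in query_words) / len(query_words),
        # Fraction of query words found in the text of links pointing at the page.
        "anchor": sum(w in anchor_words for w in query_words) / len(query_words),
        # Link-based importance, assumed to be precomputed.
        "pagerank": page["pagerank"],
    }
    return sum(weights[name] * value for name, value in signals.items())


# Hypothetical weights and pages, chosen only to illustrate the idea.
weights = {"title": 2.0, "anchor": 2.0, "pagerank": 1.0}
candidates = {
    "shop": {"title": "Acme Corp", "anchor_text": "best wah pedal shop", "pagerank": 0.3},
    "history": {"title": "Pedal history", "anchor_text": "history page", "pagerank": 0.5},
}
query_words = "wah pedal".split()
ranked = sorted(
    candidates,
    key=lambda name: score(candidates[name], query_words, weights),
    reverse=True,
)
print(ranked)
```

The "shop" page wins even though its own title never mentions the query, because the anchor text of links pointing to it does. That is the effect Hassan describes.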
Google’s engineers have discovered that some of the most important signals can come from Google itself. PageRank has been celebrated as instituting a measure of populism into search engines: the democracy of millions of people deciding what to link to on the Web. But Singhal notes that the engineers in Building 43 are exploiting another democracy: the hundreds of millions who search on Google. The data people generate when they search (what results they click on, what words they replace in the query when they’re unsatisfied, how their queries match with their physical locations) turns out to be an invaluable resource in discovering new signals and improving the relevance of results. The most direct example of this process is what Google calls personalized search, a feature that uses someone’s search history and location as signals to determine what kind of results they’ll find useful. But more generally, Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.
Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
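A rough sketch of how word swaps between consecutive queries could surface synonym candidates. The session data, the one-word-changed rule, and the minimum count are assumptions made for illustration; the article does not describe the actual mechanism at this level of detail.

```python
from collections import Counter


def synonym_candidates(sessions, min_count=2):
    """Find word pairs that users swap for each other in back-to-back queries.

    sessions is a list of (first_query, rewritten_query) pairs. Only rewrites
    that change exactly one word, leaving the rest in place, are counted.
    """
    swaps = Counter()
    for first, second in sessions:
        a, b = first.lower().split(), second.lower().split()
        if len(a) != len(b):
            continue
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:
            # Store the pair without regard to order.
            swaps[frozenset(diffs[0])] += 1
    return {tuple(sorted(pair)) for pair, count in swaps.items() if count >= min_count}


# Hypothetical query-rewrite log.
sessions = [
    ("pictures of dogs", "pictures of puppies"),
    ("cute puppies video", "cute dogs video"),
    ("boil water", "hot water"),
    ("cheap flights", "pictures of dogs"),
]
print(synonym_candidates(sessions))
```

With this data only the dogs/puppies pair clears the threshold, since it is seen twice. The boil/hot swap appears once and would need more evidence before being trusted.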
But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
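The context idea can be sketched with simple co-occurrence counts: tally which words appear in the same documents, then pick the reading of an ambiguous term that best fits the other words in the query. The four tiny "documents", the candidate list, and the function names are invented for illustration, and real systems are far more elaborate.

```python
from collections import Counter, defaultdict


def cooccurrence(documents):
    """Count how often each pair of distinct words shares a document."""
    counts = defaultdict(Counter)
    for doc in documents:
        words = set(doc.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    counts[word][other] += 1
    return counts


def expand(term, candidates, query, counts):
    """Pick the candidate meaning that co-occurs most with the rest of the query."""
    context = [w for w in query.lower().split() if w != term]

    def support(candidate):
        return sum(counts[candidate][w] for w in context)

    return max(candidates, key=support)


# Hypothetical crawled documents.
documents = [
    "gandhi biography early life",
    "biography of gandhi and nehru",
    "biological warfare agents",
    "biological weapons and warfare treaty",
]
counts = cooccurrence(documents)
print(expand("bio", ["biography", "biological"], "gandhi bio", counts))
print(expand("bio", ["biography", "biological"], "bio warfare", counts))
```

The same short token resolves differently depending on its neighbors, which is the behavior Singhal describes for "Gandhi bio" and "bio warfare."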
Throughout its history, Google has devised ways of adding more signals, all without disrupting its users’ core experience. Every couple of years there’s a major change in the system — sort of equivalent to a new version of Windows — that’s a big deal in Mountain View but not discussed publicly. “Our job is to basically change the engines on a plane that is flying at 1,000 kilometers an hour, 30,000 feet above Earth,” Singhal says. In 2001, to accommodate the rapid growth of the Web, Singhal essentially revised Page and Brin’s original algorithm completely, enabling the system to incorporate new signals quickly. (One of the first signals on the new system distinguished between commercial and noncommercial pages, providing better results for searchers who want to shop.) That same year, an engineer named Krishna Bharat, figuring that links from recognized authorities should carry more weight, devised a powerful signal that confers extra credibility to references from experts’ sites. (It would become Google’s first patent.) The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals.
Google is famously creative at encouraging these breakthroughs; every year, it holds an internal demo fair called CSI — Crazy Search Ideas — in an attempt to spark offbeat but productive approaches. But for the most part, the improvement process is a relentless slog, grinding through bad results to determine what isn’t working. One unsuccessful search became a legend: Sometime in 2001, Singhal learned of poor results when people typed the name “audrey fino” into the search box. Google kept returning Italian sites praising Audrey Hepburn. (Fino means fine in Italian.) “We realized that this is actually a person’s name,” Singhal says. “But we didn’t have the smarts in the system.”
The Audrey Fino failure led Singhal on a multiyear quest to improve the way the system deals with names — which account for 8 percent of all searches. To crack it, he had to master the black art of “bi-gram breakage” — that is, separating multiple words into discrete units. For instance, “new york” represents two words that go together (a bi-gram). But so would the three words in “new york times,” which clearly indicate a different kind of search. And everything changes when the query is “new york times square.” Humans can make these distinctions instantly, but Google does not have a Brazil-like back room with hundreds of thousands of cubicle jockeys. It relies on algorithms.
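The "new york times square" example can be sketched as a small dynamic-programming problem: try every way of cutting the query into pieces and keep the split whose known phrases score highest. The phrase scores below are made up, and this is a generic segmentation technique offered as an assumption, not a description of Google's code.

```python
def segment(query, phrase_scores):
    """Split a query into the highest-scoring sequence of words and known phrases.

    phrase_scores maps known phrases to a strength score. Single words are
    always allowed and score 0 unless listed.
    """
    words = query.lower().split()
    n = len(words)
    # best[i] holds (score, segments) for the first i words.
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        options = []
        for start in range(end):
            piece = " ".join(words[start:end])
            if end - start == 1:
                piece_score = phrase_scores.get(piece, 0.0)
            elif piece in phrase_scores:
                piece_score = phrase_scores[piece]
            else:
                continue  # unknown multiword piece: not a valid segment
            prev_score, prev_segments = best[start]
            options.append((prev_score + piece_score, prev_segments + [piece]))
        best[end] = max(options, key=lambda option: option[0])
    return best[n][1]


# Hypothetical phrase strengths, e.g. derived from how often words appear together.
phrase_scores = {"new york": 5.0, "new york times": 8.0, "times square": 6.0}
for q in ["new york", "new york times", "new york times square"]:
    print(q, "->", segment(q, phrase_scores))
```

A greedy longest-match approach would cut the last query into "new york times" plus "square". Scoring whole splits instead lets "new york" plus "times square" win.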
The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”
This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”
And Google keeps improving. Recently, search engineer Maureen Heymans discovered a problem with “Cindy Louise Greenslade.” The algorithm figured out that it should look for a person — in this case a psychologist in Garden Grove, California — but it failed to place Greenslade’s homepage in the top 10 results. Heymans found that, in essence, Google had downgraded the relevance of her homepage because Greenslade used only her middle initial, not her full middle name as in the query. “We needed to be smarter than that,” Heymans says. So she added a signal that looks for middle initials. Now Greenslade’s homepage is the fifth result.
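A middle-initial check of the kind Heymans describes might look roughly like the sketch below. The function name and matching rules are assumptions. For simplicity it only compares names with the same number of parts, so a name with no middle name at all is treated as a non-match.

```python
def name_matches(query_name, page_name):
    """Treat a middle initial as matching a full middle name, in either direction."""
    q = query_name.lower().replace(".", "").split()
    p = page_name.lower().replace(".", "").split()
    if len(q) != len(p) or len(q) < 2:
        return False
    # First and last names must match exactly.
    if q[0] != p[0] or q[-1] != p[-1]:
        return False
    for q_mid, p_mid in zip(q[1:-1], p[1:-1]):
        if q_mid == p_mid:
            continue
        if len(q_mid) == 1 and p_mid.startswith(q_mid):
            continue
        if len(p_mid) == 1 and q_mid.startswith(p_mid):
            continue
        return False
    return True


print(name_matches("Cindy Louise Greenslade", "Cindy L. Greenslade"))
print(name_matches("Cindy Louise Greenslade", "Cindy M. Greenslade"))
```

In a ranking system, a check like this would feed a score rather than act as a hard yes or no, so that a page using only the initial is no longer penalized.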
At any moment, dozens of these changes are going through a well-oiled testing process. Google employs hundreds of people around the world to sit at their home computer and judge results for various queries, marking whether the tweaks return better or worse results than before. But Google also has a larger army of testers — its billions of users, virtually all of whom are unwittingly participating in its constant quality experiments. Every time engineers want to test a tweak, they run the new algorithm on a tiny percentage of random users, letting the rest of the site’s searchers serve as a massive control group. There are so many changes to measure that Google has discarded the traditional scientific nostrum that only one experiment should be conducted at a time. “On most Google queries, you’re actually in multiple control or experimental groups simultaneously,” says search quality engineer Patrick Riley. Then he corrects himself. “Essentially,” he says, “all the queries are involved in some test.” In other words, just about every time you search on Google, you’re a lab rat.
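Running many experiments at once is commonly done by hashing the user and the experiment name together, so that each experiment slices the population independently. The sketch below shows that general technique; the experiment names, the fractions, and the choice of SHA-256 are assumptions, and the article does not say how Google assigns its groups.

```python
import hashlib


def in_treatment(user_id, experiment, fraction):
    """Deterministically place a user in an experiment's treatment group.

    Hashing the experiment name together with the user id gives each
    experiment its own independent slice of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # a value in [0, 1)
    return bucket < fraction


def assignments(user_id, experiments):
    """Return the group this user falls into for every running experiment."""
    return {
        name: "treatment" if in_treatment(user_id, name, fraction) else "control"
        for name, fraction in experiments.items()
    }


# Hypothetical experiments, each exposing the change to a small share of users.
experiments = {"middle-initial-signal": 0.01, "fresher-results": 0.02}
print(assignments("user-12345", experiments))
```

Every user lands in some group for every experiment, treatment or control, which matches Riley's remark that essentially all queries are involved in some test.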
This flexibility — the ability to add signals, tweak the underlying code, and instantly test the results — is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months Google has made more than 200 improvements, some of which seem to mimic — even outdo — the offerings of its competitors. (Google says this is just a coincidence and points out that it has been adding features routinely for years.) One is real-time search, eagerly awaited since Page opined some months ago that Google should be scanning the entire Web every second. When someone queries a subject of current interest, among the 10 blue links Google now puts a “latest results” box: a scrolling set of just-produced posts from news sources, blogs, or tweets. Once again, Google uses signals to ensure that only the most relevant tweets find their way into the real-time stream. “We look at what’s retweeted, how many people follow the person, and whether the tweet is organic or a bot,” Singhal says. “We know how to do this, because we’ve been doing it for a decade.”
Along with real-time search, Google has introduced other new features, including a service called Goggles, which treats images captured by users’ phones as search queries. It’s all part of the company’s relentless march toward search becoming an always-on, ubiquitous presence. With a camera and voice recognition, a smartphone becomes eyes and ears. If the right signals are found, anything can be query fodder.
Google’s massive computing power and bandwidth give the company an undeniable edge. Some observers say it’s an advantage that essentially prohibits startups from trying to compete. But Manber says it’s not infrastructure alone that makes Google the leader: “The very, very, very key ingredient in all of this is that we hired the right people.”
By all standards, Qi Lu qualifies as one of those people. “I have the highest regard for him,” says Manber, who worked with the 48-year-old computer scientist at Yahoo. But Lu joined Microsoft early last year to lead the Bing team. When asked about his mission, Lu, a diminutive man dressed in jeans and a Bing T-shirt, pauses, then softly recites a measured reply: “It’s extremely important to keep in mind that this is a long-term journey.” He has the same I’m-not-going-away look in his eye that Uma Thurman has in Kill Bill.
Indeed, the company that won last decade’s browser war has a best-served-cold approach to search, an eerie certainty that at some point, people are going to want more than what Google’s algorithm can provide. “If we don’t have a paradigm shift, it’s going to be very, very difficult to compete with the current winners,” says Harry Shum, Microsoft’s head of core search development. “But our view is that there will be a paradigm shift.”
Still, even if there is such a shift, Google’s algorithms will probably be able to incorporate that, too. That’s why Google is such a fearsome competitor; it has built a machine nimble enough to absorb almost any approach that threatens it — all while returning high-quality results that its competitors can’t match. Anyone can come up with a new way to buy plane tickets. But only Google knows how to find Mike Siwek.