[Updated with a view from a former Google search engineer below]
Earlier this week Google disclosed that it had run a sting operation on its main search page which demonstrated that Bing, Microsoft's competing search engine, was using Google's search results to tune its own. The furore over this has been all over the web ever since Danny Sullivan published a post at Search Engine Land titled "Google: Bing Is Cheating, Copying Our Search Results". Google claims Microsoft is effectively 'cheating'. Microsoft admits openly that it uses Google results as one of many factors in its ranking. Is this copacetic? I say 100% yes. In fact, it's quite clever too. Here's the background to what's really happening.
There are three main background issues.
- Google is now seen as the biggest target for anti-trust regulators because of its dominance in search and advertising.
- At the same time, Google's search results have deteriorated significantly of late.
- As a result, decent but distant competitors to Google have re-emerged in search.
Over the past decade, the tech world has moved away from a computer-centric world dominated by the Wintel duopoly to one in which the web is central. Google dominates web search with the overwhelming share in nearly every important market; China is a notable exception. As Google grows in scope and size, it is coming under increased regulatory scrutiny. As with the anti-trust investigations of Microsoft a decade ago, the European Union is the government body scrutinising Google most closely. Much of the mainstream press has focused on privacy issues arising from Google's Street View mapping project. But the real issues are in search.
For example, in 2010, EU regulators acted upon three separate anti-trust complaints: from Foundem, a UK price-comparison search engine; from Ejustice.fr, a French legal search engine; and from Ciao from Bing, a UK-based shopping search engine owned by Microsoft. The EU is now investigating whether Google demotes vertical search rivals in its results to drive down their traffic. The EU is also investigating whether Google prevents vertical search engines from displaying rival advertisements, abusing its dominant position in another market. Google is buying ITA, a travel software company, which arguably puts it in competition with 'search' competitors in the travel vertical. This acquisition is being scrutinised. There was also a kerfuffle in local search when Google enhanced its results with content from vertical competitor Yelp. Google removed the content in question. This is important because mobile search is an emerging market of huge potential.
In sum, Google is the dominant competitor in search. It has been portrayed by competitors both in general search and in search verticals as using its dominance to muscle out competitors. These claims are attracting regulatory scrutiny, particularly in Europe. A good comprehensive article on the anti-trust issues and the dominance of Google comes from VoxEU.
Meanwhile, Google’s search results are said to have deteriorated significantly. I have witnessed this personally and can testify to having switched to other search engines in order to get less ‘spammy’ results. The issue is content farms, content aggregators and spam sites that either publish low-quality content or do not publish original content. Often, these sites have gamed Google’s search engine algorithms to rank higher than more established sites. Content aggregators have learned how to scrape content from other reputable sites and rank higher in Google search results with that content than the original publishers.
For example, I wrote a post in early December called "Site Scrapers Find Free Money on the Web" on this topic.
This is how it works: site scrapers' entire business model involves copying every single post from leading news sources and re-publishing it in order to earn advertising dollars from Google and others. They accomplish this using WordPress plugins that automatically re-post the content of RSS feeds which other sites publish.
Blogs are especially vulnerable to this kind of copyright infringement because blogs generally publish full feeds in order to facilitate ease of use for readers. Most traditional publishers use short feeds, perhaps in order to increase site traffic or to prevent content scraping. The Guardian is one admirable exception.
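The mechanics are trivially simple, which is the point. Here is a minimal, hypothetical sketch of what such a scraper does with a full-text feed (the feed content, site names and URLs are all made up for illustration):

```python
import xml.etree.ElementTree as ET

# A sample full-text RSS feed, of the kind most blogs publish.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Some Blog</title>
  <item>
    <title>Post One</title>
    <link>http://example.com/post-one</link>
    <description>The complete text of the first post.</description>
  </item>
  <item>
    <title>Post Two</title>
    <link>http://example.com/post-two</link>
    <description>The complete text of the second post.</description>
  </item>
</channel></rss>"""

def scrape_feed(feed_xml):
    """Extract every entry from a full-text RSS feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # With a full feed, this is the entire post, ready to re-publish.
            "body": item.findtext("description"),
        })
    return posts

# A scraper site would now re-post each entry verbatim, wrapped in ads.
for post in scrape_feed(SAMPLE_FEED):
    print(post["title"], "->", post["link"])
```

Note that nothing here requires any access to the original site beyond its public feed, which is why full feeds are so easy to abuse and why short feeds blunt the attack.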
Ironically, it is because of high rankings in Google’s search algorithm that these sites are even able to garner traffic and earn dollars from Internet advertising … Apparently, all you have to do to make money on the Internet is set up a website, install some plugins to scrape good content elsewhere, make sure you optimize your site for search rankings, and contact Google to include you in their list of trusted sources. It’s as good as free money – and often Google is collecting the advertising money along with you.
In my view, it is just this kind of situation which will ultimately win Google more regulatory scrutiny because Google dominates both search and advertising online. I am surprised that they have not taken steps to eliminate this kind of situation…
Search Engine Land is now reporting that Google is not just overhauling its search algorithms but is also dropping some sites from inclusion in Google News and reviewing standards. Before posting this article, I had linked to another Search Engine Land article from yesterday which noted that Google was revamping its copyright protection rules. Clearly, the kinks in these new rules are still being worked out.
The main sites I am now using in addition to Google search are Bing and Blekko because their results are less spammy than Google's. I find Bing and Blekko consistently better for specific and targeted general search, but Google's results are still better for me for news search and obscure long-tail searches. When I wrote to Google News about this problem in early December, I e-mailed about the offending sites, writing: "I should point out that similar searches of Bing do not yield any results from these sites. So this may be a situation unique to Google's search algorithm." I never received a reply.
There has been a lot of talk about Google's deteriorating search quality. Now that Microsoft powers Yahoo's search results and users have been drawn to Bing, Bing is eroding Google's market share. A new search engine, Blekko, has also entered the fray to great fanfare. Google is fighting back by making changes. Search Engine Land has written a number of posts about this. See Blekko Launches Spam Clock To Keep Pressure On Google or Google: Spam Really Has Increased Lately. We're Fixing That, And Content Farms Are Next or Google May Let You Blacklist Domains To Fight Spam as examples.
What happened with Bing then? Basically, Bing uses many different inputs to produce a competent search result. One signal is 'clickstream data' that it obtains from the Bing Search Bar you can install in your browser, and from Internet Explorer. You search, you click on a result, and that data is recorded and fed into Microsoft's ranking algorithm for Bing. It doesn't matter whether the search was done on Bing, on Google or elsewhere. Microsoft writes:
We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.
To be clear, we learn from all of our customers. What we saw in today’s story was a spy-novelesque stunt to generate extreme outliers in tail query ranking. It was a creative tactic by a competitor, and we’ll take it as a back-handed compliment. But it doesn’t accurately portray how we use opt-in customer data as one of many inputs to help improve our user experience.
Google has competing products in the Google Search Bar and Google Chrome, my main browser. But Google says it would never use competitors' clickstream data to enhance its results. Bing is not breaking any laws here. It is using data that users of its products have consented to provide in order to enhance its search results. When installing the Bing toolbar, the terms read:
“improve your online experience with personalized content by allowing us to collect additional information about your system configuration, the searches you do, websites you visit, and how you use our software. We will also use this information to help improve our products and services.”
[emphasis added in Search Engine Land article]
One of the data inputs just happens to be the clickstream of users' searches on Google. I use both products all the time, up to 100 times daily. I can tell you that Bing's results are different enough from Google's that this is not a large factor in Bing's results. But I have also said that Bing's algorithm for long-tail search needs work. Bing apparently knows this, and the Google sting results demonstrate that when there are no other usable data inputs, the Google clickstream is used to enhance Bing's results. Is this a 'bad' thing from a user perspective? Not in the least. I think this is quite clever actually, and to the degree Bing gets good results this way it would make me want to use them more. For now, I still find Google's long-tail results better.
In Sullivan’s post on the ‘cheating incident’, he even opined "I think what’s happening right now is that there’s a perfect storm of various developments all coming together at the same time. And if that storm gets people focused on demanding better search quality, I’m happy." I agree with that sentiment.
Update 1700 ET: I ran across a very good response to this debate on Quora by Edmond Lau, who self-identifies as a former Google search quality engineer. Quora is yet another of the new search-type sites popping up in response to diminished search quality. The site's goal is to answer specific questions through crowdsourcing, what I have called collaborative filtering. Here is what Lau has written in response to the Quora question Did Bing intentionally copy Google's search results?:
Yes, but using click and visit data to rank results is a very reasonable and logical thing to do, and ignoring the data would have been silly. For Bing to use click and visit data from a competitor seems to be more clever than wrong given their lower market share and lower organic quality.
It’s pretty clear that any reasonable search engine would use click data on their own results to feed back into ranking to improve the quality of search results. Infrequently clicked results should drop toward the bottom because they’re less relevant, and frequently clicked results bubble toward the top. Building a feedback loop is a fairly obvious step forward in quality for both search and recommendations systems, and a smart search engine would incorporate the data. The actual mechanics of how click data is used is often proprietary, but Google makes it obvious that it uses click data with its patents on systems like "Rank-adjusted content items".
Both the Google toolbar and the Bing toolbar provide the ability to collect even more data by sending back information about queries and visited sites…
Most likely Bing started by first matching up queries to its own search engine through Bing toolbar with sites that users visited afterwards, and boosting those results in ranking, since that would have been the easiest to do.
Since that probably worked well, the next logical step would’ve been to incorporate site visit data through queries to other search engines like Google as well and to not only boost those results but to also incorporate them into the index if missing. Their implementation could even have been agnostic of Google initially and just incorporated the visit data after any query, though given how strong of a signal Google is, I’d be surprised if there weren’t at least some consideration of whether a result came from Google or not. It’d be a huge oversight not to use whether the source was Google as a ranking signal.
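The feedback loop Lau describes is easy to sketch. Here is a toy, purely illustrative re-ranker that counts which URLs toolbar users visited after a given query and boosts those results; every data structure, URL and weight here is an assumption for demonstration, not Bing's actual implementation:

```python
from collections import defaultdict

# Hypothetical clickstream records from a toolbar: (query, url_visited)
# pairs, regardless of which search engine served the query.
CLICKSTREAM = [
    ("cheap flights", "http://a.example/flights"),
    ("cheap flights", "http://a.example/flights"),
    ("cheap flights", "http://b.example/deals"),
    ("hiybbprqag", "http://honeypot.example/"),  # a synthetic sting query
]

def build_click_counts(clickstream):
    """Count how often each URL was visited after each query."""
    counts = defaultdict(lambda: defaultdict(int))
    for query, url in clickstream:
        counts[query][url] += 1
    return counts

def rerank(query, candidates, counts, weight=0.1):
    """Re-rank candidate URLs: base relevance score plus a click boost.

    If the engine has no organic candidates for a rare query, the
    clicked URLs themselves become the candidates -- which is exactly
    how a honeypot result for a nonsense query can surface.
    """
    clicks = counts[query]
    if not candidates:
        candidates = {url: 0.0 for url in clicks}
    scored = {url: base + weight * clicks[url]
              for url, base in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)

counts = build_click_counts(CLICKSTREAM)
# Click data outweighs a small organic-score gap:
print(rerank("cheap flights",
             {"http://b.example/deals": 0.5,
              "http://a.example/flights": 0.45}, counts))
# No organic candidates at all: the clickstream result wins by default.
print(rerank("hiybbprqag", {}, counts))
```

The second call is the sting scenario in miniature: with no other signal for the query, the lone clicked URL is the only thing the algorithm can return.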
This is the argument I made in the comments here. From Microsoft's perspective, having collected anonymous data on users' surfing behaviour, it must determine how to use that data and decide which parts should influence search result rankings. Microsoft would be kneecapping itself if it did not use the Google search data, since Google has roughly 70% of the search market. How do you strip that much data out and still have a competitive product? The issue here is that in one specific case, Google artificially created fake long-tail search results, purposely injecting spam results into its own rankings in order to affect Bing's.
Moreover, Google acts in a similar fashion when it is not the dominant data source in a particular market. For example, Yelp dominated local search content. Google wanted into that space, so it used not just clickstream data but actual Yelp content to beef up its own results. The only reason Google doesn't have to use others' clickstream data in general search is its dominant position there. Otherwise, when it needs data, it uses it.
I still maintain that, to the degree you want long-tail search, you should still go with Google. After all, long-tail search depends both on market share and on longevity in the search business: you need a huge number of queries to build a statistically significant dataset on long-tail items, and you can only do that by accumulating hundreds of millions of samples over time, a minuscule portion of which will involve these specific search terms. Clearly, if you have no other result for a specific query except forced clickstream matches created by gaming your search algorithm, you are going to return that match. As Lau writes:
if you saw the query [hiybbprqag] and had no other ranking signals other than that 20 users who searched for that query subsequently went to the exact same site, it’s highly likely that the site is the correct result for the query, and it’d be stupid not to algorithmically show that result given little to no other data, especially since Google was willing to show the result.
That’s what happened here. In any event, Google will continue to dominate this part of search because of its superior data set, unless someone does something algorithmically far superior.
Update 2/4/2011 1649 ET: Search Engine Land, the original source of the debate, has now written a new post with arguments from Bing which confirm the interpretation of the situation presented here. Google is just one of several search-type data sources that Microsoft uses. However, I would assume it is one of the largest. Stripping it out would significantly reduce search clickstream data points and probably reduce accuracy. See Bing: Why Google’s Wrong In Its Accusations.