Wednesday, February 9, 2011

 

Google vs. Bing: A technical solution for fair use of clickstream data

When Google engineers noticed that Bing unexpectedly returned the same result as Google for a misspelling of tarsorrhaphy, they concluded that Bing somehow factored Google’s search results into its own ranking. Danny Sullivan ran the story about Bing cheating and copying Google search results last week. (Also read his second article on this subject, Bing: Why Google’s Wrong In Its Accusations.)

Google decided to create a trap for Bing by returning results for about 100 bogus terms, as Amit Singhal, a Google Fellow who oversees the search engine’s ranking algorithm, explains:
To be clear, the synthetic query had no relationship with the inserted result we chose—the query didn’t appear on the webpage, and there were no links to the webpage with that query phrase. In other words, there was absolutely no reason for any search engine to return that webpage for that synthetic query. You can think of the synthetic queries with inserted results as the search engine equivalent of marked bills in a bank.
Running Internet Explorer 8 with the Bing toolbar installed and the “Suggested Sites” feature of IE8 enabled, Google engineers searched Google for these terms and clicked on the inserted results. They confirmed that a few of these results, including “delhipublicschool40 chdjob”, “hiybbprqag”, “indoswiftjobinproduction”, “jiudgefallon”, “juegosdeben1ogrande”, “mbzrxpgjys” and “ygyuuttuu hjhhiihhhu”, started appearing in Bing a few weeks later.



The experiment showed that Bing uses clickstream data to determine relevant content, a fact that Microsoft’s Harry Shum, Vice President of Bing, confirmed:
We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.
These clickstream data include click-throughs from Google search result pages. Bing considers these for its own results and consequently may show pages that otherwise wouldn’t appear in the results at all because they don’t contain the search term, or may rank results differently. Relying on a single signal made Bing susceptible to spamming, and the algorithms would need to be improved to weed out suspicious results, Shum acknowledged.

As an aside, Google has itself experienced how relying too heavily on a few signals allowed individuals to influence the ranking of particular pages for search terms such as “miserable failure”; despite improvements to the ranking algorithm, we continue to see successful Google bombs. (John Dozier's book about Google bombing explains how to protect yourself from online defamation.)

The experiment failed to validate if other sources are considered in the clickstream data. Outraged about the findings, Google accused Bing of stealing its data and claimed that “Bing results increasingly look like an incomplete, stale version of Google results—a cheap imitation”.

Whither clickstream data?

Privacy concerns aside (customers installing IE8 and the Bing toolbar, or most other toolbars for that matter, may not fully understand, and often do not care, how their behavior is tracked and shared with vendors), using clickstream data to determine relevant content for search results makes sense. Search engines have long considered click-throughs on their results pages in ranking algorithms, and specialized search engines or site search functions will often expose content that a general-purpose search engine crawler hasn’t found yet.

Google also collects large amounts of clickstream data from the Google toolbar and the popular Google Analytics service, but states that it does not use Google Analytics data for page ranking.

Using clickstream data from browsers and toolbars to discover additional pages and seed the crawler with those pages is different from using the referring information to determine relevant results for search terms. Microsoft Research recently published a paper, “Learning Phrase-Based Spelling Error Models from Clickthrough Data”, about how to improve spelling corrections by using click data from “other search engines”. While there is no evidence that the described techniques have been implemented in Bing, “targeting Google deliberately”, as Matt Cutts puts it, would undoubtedly go beyond fair use of clickstream data.

Google considers the use of clickstream data containing Google Search URLs to be plagiarism and doesn't want another search engine to use these data. With Google dominating the search market and handling the vast majority of searches, Bing's inclusion of results from a competitor remains questionable even without deliberate targeting, and dropping that signal from the algorithm would be a wise choice.

Should all clickstream data be dropped from the ranking algorithms, or just certain sources? Will the courts decide what constitutes fair use of clickstream data and who “owns” these data, or can we come up with a technical solution?

Robots Exclusion Protocol to the rescue

The Robots Exclusion Protocol provides an effective and scalable mechanism for selecting appropriate sources for resource discovery and ranking. Clickstream data sources and crawler results have a lot in common. Both provide information about pages for inclusion in the search index, and relevance information in the form of inbound links or referring pages, respectively.

| Dimension | Crawler | Clickstream |
|---|---|---|
| Source | Web page | Referring page |
| Target | Link | Followed link |
| Weight | Link count and equity | Click volume |
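
The parallel between the two sources can be made concrete by normalizing both kinds of observations into one record shape. The following is a minimal Python sketch under stated assumptions: the `Edge` type, the `to_edges` helper, and the sample URLs are hypothetical names introduced for illustration, and the weight is a plain count (link equity is not modeled).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One source -> target relation, from either a crawl or a clickstream."""
    source: str   # Web page (crawler) or referring page (clickstream)
    target: str   # linked page (crawler) or followed link (clickstream)
    weight: int   # link count (crawler) or click volume (clickstream)
    origin: str   # "crawler" or "clickstream"


def to_edges(pairs, origin):
    """Collapse repeated (source, target) pairs into weighted edges."""
    counts = Counter(pairs)
    return [Edge(src, dst, n, origin) for (src, dst), n in counts.items()]


# Hypothetical observations: two users followed the same result link.
clicks = [
    ("http://search.example/?q=widgets", "http://shop.example/widgets"),
    ("http://search.example/?q=widgets", "http://shop.example/widgets"),
]
click_edges = to_edges(clicks, "clickstream")
```

Because both sources end up in the same shape, a rule that filters crawler edges by their source page can in principle be applied to clickstream edges unchanged.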

Following the Robots Exclusion Protocol, search engines only index Web pages that are neither blocked in robots.txt nor marked non-indexable with a robots meta tag. Applying the protocol to clickstream data, search engines should only consider clicks from indexable referring pages in the ranking algorithms, and limit the use of clickstream data to resource discovery when the referring page cannot be indexed.

Search engines will still be able to use clickstream data from sites which allow access to local search results, for example the site search on amazon.com, whereas Google search results are marked as non-indexable in http://www.google.com/robots.txt and therefore excluded.
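
The proposed rule can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration, not a description of how any search engine works: the `ROBOTS` table, the `shop.example` host, the `examplebot` user agent, and the function names are assumptions made for the example, and the Google entry is a simplified excerpt that only reflects the search results path being disallowed. Robots meta tags are not handled here.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical, simplified robots.txt excerpts keyed by host.
# A real system would fetch and cache each site's robots.txt.
ROBOTS = {
    "www.google.com": "User-agent: *\nDisallow: /search\n",
    "shop.example": "User-agent: *\nDisallow: /checkout\n",
}


def referrer_is_indexable(referrer_url, user_agent="examplebot"):
    """Return True if robots.txt lets the crawler fetch the referring page."""
    host = urlsplit(referrer_url).netloc
    rules = ROBOTS.get(host)
    if rules is None:
        # Assumption: a host without an entry has no robots.txt,
        # so nothing on it is blocked.
        return True
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, referrer_url)


def classify_click(referrer_url, target_url):
    """Decide how a referrer -> target click may be used."""
    if referrer_is_indexable(referrer_url):
        return "ranking"          # may contribute a relevance signal
    return "discovery-only"       # target may only seed the crawler


google_click = classify_click(
    "http://www.google.com/search?q=hiybbprqag", "http://example.org/page"
)
shop_click = classify_click(
    "http://shop.example/s?k=widgets", "http://shop.example/item/1"
)
```

With these sample rules, `google_click` is `"discovery-only"` and `shop_click` is `"ranking"`: the target of a click from a blocked results page can still be crawled, but the click itself carries no ranking weight.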

Clear disclosure of how clickstream data are used and a choice to opt in or opt out put Web users in control of their clickstream data. Applying the Robots Exclusion Protocol to clickstream data would further allow Web site owners to control third-party use of their URL information.


Comments:
Great explanation of how it all went down between Google and Bing. Looking at the existing Google Robots.txt, aren't search results already excluded? If so, then Bing is clearly violating that. But then what is the "punishment" for violation of Robots Exclusions? Perhaps that's a rhetorical question.

One takeaway you mention here is that all of us using search toolbars are contributing to this clickstream data and should be aware of that. Reading the TOS should not be optional.
 
@witchtrivets The Robots Exclusion Protocol currently only covers Web retrieval and indexing, although it would probably make sense to expand this to clickstream data as well. While ignoring the robots rules has no immediate consequences, all search engines honor this de facto standard.
 
Is Bing A Better Search Engine?

We have created a logical test that shows which search engine provides better search results. Google or Bing? I will explain the test on this page.

First, I would like to make the test concept more clear with several examples:

Say we take a series of Titles to search on Google and Bing for comparison.

Here are several examples (all the tests are at www.rssfeedrss.com/index2.html):

Title 1) Patients are willing to undergo multiple tests for new cancer treatments

http://www.rssfeedrss.com/test2.html

Title 2) Conference on composite materials for structural performance: Towards higher limits

http://www.rssfeedrss.com/test4.html

Now I will explain how this test works.
Each title is about two or three main keywords.
For example Title 1 is about cancer treatment.
Title 2 is about composite material.

I propose a logical test that uses Google and Bing search results to extract the main keywords in a logical manner. The better search engine will provide a better and more relevant extraction based on this test. I would like to emphasize logic.

Now what is this logical test?
The better search engine provides search results that contain a higher number of the main keywords on the results page (usually in bold).

For example, if we take title 1 to either Google or Bing and make a search on the whole title and then count the number of times the main keywords appear in the search results (usually in Bold), the better search engine will give us cancer treatment and not other words. That means if you count the number of times the keywords cancer treatment appear in search results in both Google and Bing, Bing provides a higher quantity.
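
The counting step of this test can be sketched as follows. This is an illustrative Python sketch, not the Perl script mentioned in this comment; the function name and the sample snippets are made up for the example and are not real search output from any engine.

```python
import re


def count_keyword_hits(snippets, keywords):
    """Count whole-word, case-insensitive keyword occurrences in snippets."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return sum(len(pattern.findall(text)) for text in snippets)


# Made-up result snippets for two unnamed engines.
engine_a = ["New cancer treatment trial opens", "Cancer patients and tests"]
engine_b = ["Patients undergo multiple tests", "Treatment options explained"]
keywords = ["cancer", "treatment"]

hits_a = count_keyword_hits(engine_a, keywords)
hits_b = count_keyword_hits(engine_b, keywords)
```

Under the test as described, the engine with the larger count for the same query would be judged the more relevant one.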

I used both Google and Bing for the test on the page www.rssfeedrss.com/index2.html and Bing provided a better search. You can do this test in-house.

I will propose this test in search engine conferences. It is a valid test.
I can email you the perl file that performed the test. Call 949-500-8638 or email info@katir.com.

In fact, if you continue the test to second page results, it also shows which search engine provides better search results for the second page or third page or....

Why is this test valid?

It is not very complex to prove why this test is valid. If you type a sentence that contains several main keywords, you prefer more information about those main keywords. A higher quantity of those main keywords proves the page is more relevant and the search engine has delivered more relevant results.
 