a vanguard against confusion: 2/1/11

Let me start by saying I’m a Google fanboy and a half. I love that company and use all their products. My life revolves around my Nexus One and its deep, tight integration with Google’s stack, which is my extremely large clustered brain-annex, constantly available with a few taps of my touchscreen.

Google is awesome and great and I love them to death. That said, they are wrong about Bing copying their results.

See these first-party posts for details:

Bing uses Google’s Results — and denies it (Google Blog)

Why Microsoft is not in the wrong: Crowd Sourcing is not stealing

What Microsoft has done is create a genuinely useful process for improving the relevancy of the Bing search engine. It’s not a groundbreaking technique. Like most web-based applications in the world, they just monitor your behaviour. You type ‘foo’ into their search box and are presented with 10 results on your first screen. You click the #2 result. This is recorded. The same thing happens for 10,000 other users. Eventually Bing figures out that the #2 result really should be the #1 result, and ups its rank.
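
To make that concrete, here is a minimal sketch of click-based re-ranking. It is not Bing's actual implementation, which isn't public; the function name `rerank_by_clicks`, the example URLs, and the "most clicks wins" rule are all assumptions made up for illustration.

```python
from collections import Counter


def rerank_by_clicks(results, click_log):
    """Reorder results so that URLs with more recorded clicks come first.

    `results` is the current ranking (a list of URLs) and `click_log` is a
    list with one entry per recorded click. Both are hypothetical inputs.
    Ties keep their original relative order because sorted() is stable.
    """
    clicks = Counter(click_log)  # missing URLs count as 0 clicks
    return sorted(results, key=lambda url: -clicks[url])


# Hypothetical data: most users searching 'foo' clicked the #2 result.
results = ["a.example", "b.example", "c.example"]
click_log = ["b.example"] * 7 + ["a.example"] * 2 + ["c.example"]

print(rerank_by_clicks(results, click_log))
# ['b.example', 'a.example', 'c.example']
```

A real system would presumably weigh clicks against many other signals rather than sorting on raw counts, but the basic feedback loop is the same.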

This is no different than Amazon’s suggestion engine: ‘Users who viewed this product ultimately bought this other product’.

This process doesn’t depend on Google. This process would work purely as a way of improving rank within the Bing system, with no outside influences. It also doesn’t really require a special toolbar to make it happen. You could collect that data through their normal web interface just as easily.

This is not an issue of spyware, cheating, or copying. It’s just Bing using crowd-sourced data to determine relevancy. It’s very smart. It’s not dishonest. Get over it, Google.

How to improve search result relevancy?

I find this interesting because Google currently faces a huge challenge: reducing the amount of spam polluting its results. Bing clearly has a tactic for that, though it may or may not be completely effective. Click-through data is only as good as the judgement of the users who clicked. If the majority of people searching for “foo” clicked on, say, option #3 instead of #2 in our previous example, and #3 was a spam site, then #3 would get ranked up. Bad news!

Looks like Bing is trusting our judgement as a user community. We know what’s relevant and show it by clicking through. If we choose a spam site as our main result, c’est la vie!

But how can we improve search relevancy and reduce false positives in our result set? The answer so far seems to be either “curate the web” or, like Bing, use a kind of “mechanical turk”: click-stream, crowd-sourced relevancy, which assumes that people as a whole will express their preferences and that the correct answer will emerge over time.

Our solution: Contextual Search

My company has a different solution: working at a higher level of abstraction than words and documents. We have built a novel search tool that allows users to search for contexts, not documents, and to make decisions on those contexts.

Find me “license” where the document has “drivers license” in it, but not where it has “fishing license”, unless it also has something I want in it, like “drivers license”. Traditional boolean search, which operates against an inverted index mapping terms to documents (which is what both Bing and Google offer), does not provide for this kind of decision making. It’s impossible without changing how the data is indexed, and that’s not something these guys are going to do anytime soon. They have too much invested in their current methodology to change.
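
Here is a toy illustration of the difference. To be clear, this is not our product's implementation; the three documents, the function names, and the "word immediately before the term" definition of a context are all assumptions invented for this sketch. The boolean query runs against a plain term-to-document index, while the context query makes its decision per occurrence of the term.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus.
docs = {
    1: "renew your drivers license online",
    2: "buy a fishing license for the lake",
    3: "bring your drivers license to buy a fishing license",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_search(index):
    """license AND NOT fishing, answered at the document level."""
    return index["license"] - index["fishing"]


def contexts(text, term):
    """Return the word immediately before each occurrence of `term`."""
    tokens = tokenize(text)
    return [tokens[i - 1] for i, t in enumerate(tokens) if t == term and i > 0]


def context_search(docs, term, wanted):
    """Keep documents where at least one occurrence of `term` follows `wanted`."""
    return {d for d, text in docs.items() if wanted in contexts(text, term)}


index = build_inverted_index(docs)
print(sorted(boolean_search(index)))                       # [1]
print(sorted(context_search(docs, "license", "drivers")))  # [1, 3]
```

The boolean query throws away document 3 because “fishing” appears somewhere in it, even though it also contains the “drivers license” I asked for. The per-occurrence version keeps it.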

We’re hoping to launch a public search site sometime this year that presents our novel approach to improving relevancy in search. I look forward to seeing how it performs compared to Google and Bing.

More on that later, when it’s closer to reality.

What do you think, world at large?

Does anyone have any other ideas about search relevancy? What are some other tactics one might employ, beyond Curating or Crowdsourcing? How else to make the spam go away?

2011-02-02

Crowd-Sourced Relevancy Ranking in Bing or why Google is wrong