Thursday, February 17, 2011

Google and SEO

I sometimes hear people say “Remember when Google launched and the results were so good? Google didn’t have any spam back then. Man, I wish we could go back to those days.” I know where those people are coming from. I was in grad school in 1999, and I remember that Google’s quality blew me away after just a few searches.
But it’s a misconception that there was no spam on Google back then. Google in 2000 looked great in comparison with other engines at the time, but Google 2011 is much better than Google 2000. I know because back in October 2000 I sent 40,000+ queries to google.com and saved the results as a sort of search time capsule. Take a query like [buy domain name]. Google’s current search results aren’t perfect, but the page returns several good resources as well as some places to actually buy a domain name. Here’s what Google returned for that query in 2000:

URL_1:http://buy-domain-name.domain-searcher.com/domains/buy-domain-name.shtml
URL_2:http://buy-domain-name.domain-searcher.com/buy-domain-name.shtml
URL_3:http://buy-domain.domain-searcher.com/domains/buy-domain.shtml
URL_4:http://buy-domain.domain-searcher.com/Map3.shtml
URL_5:http://domain-name-broker.domain-searcher.com/domains/domain-name-broker.shtml
URL_6:http://users5.50megs.com/buydomain32/
URL_7:http://users4.50megs.com/buydomain02/
URL_8:http://domain-name-service.domain-searcher.com/domains/domain-name-service.shtml
URL_9:http://domain-name-service.domain-searcher.com/Map2.shtml
URL_10:http://dns-id.co.uk/
Seven of the top 10 results came from a single domain, and the URLs look a little… well, let’s say fishy. In 1999 and early 2000, search engines would often return 50 results from the same domain in the search results. One nice change that Google introduced in February 2000 was “host crowding,” which showed at most two results from each hostname (here’s what a hostname is). Suddenly, Google’s search results were much cleaner and more diverse! It was a really nice win; we even got email fan letters. Unfortunately, just a few months later people were creating multiple subdomains to get around host crowding, as the results above show. Google later added more robust code to prevent that sort of subdomain abuse and to ensure better diversity. That’s why it’s pretty much a wash now when deciding whether to use subdomains vs. subdirectories.
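Host crowding can be pictured as a simple post-ranking filter. The sketch below is a hypothetical illustration, not Google’s actual code, and it keys on the full hostname, which is exactly why per-subdomain spam like the 2000 results above slipped through:

```python
from collections import Counter
from urllib.parse import urlparse

def host_crowd(ranked_urls, max_per_host=2):
    """Keep at most max_per_host results per hostname, preserving rank order."""
    counts = Counter()
    kept = []
    for url in ranked_urls:
        host = urlparse(url).hostname
        if counts[host] < max_per_host:
            counts[host] += 1
            kept.append(url)
    return kept
```

Because buy-domain-name.domain-searcher.com and buy-domain.domain-searcher.com are distinct hostnames, a filter like this keeps both; a more robust version would crowd on the registrable domain instead of the hostname.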
Improving search quality is a process that never ends. I hope in another 10 years we look back and say “Wow, most queries were only a few words back then. And we had to type queries. How primitive!” Mostly I wanted to make the point that Google looked much cleaner compared to other search engines in 2000, but spam was absolutely an issue even back then. If someone harkens back to the golden, halcyon days when Google had no spam, take those memories with a grain of salt. :)
Earlier this week I was on a search panel with Harry Shum of Bing and Rich Skrenta of Blekko (moderated by Vivek Wadhwa), and the video is now live. It’s forty minutes long, but it covers a lot of ground:
One big point of discussion is whether Bing copies Google’s search results. I’m going to try to address this earnestly; if snarky is what you want, Stephen Colbert will oblige you.
First off, let me say that I respect all the people at Bing. From engineers to evangelists, everyone that I’ve met from Microsoft has been thoughtful and sincere, and I truly believe they want to make a great search engine too. I know that they work really hard, and the last thing I would want to do is imply that Bing is purely piggybacking Google. I don’t believe that.
That said, I didn’t expect that Microsoft would deny the claims so strongly. Yusuf Mehdi’s post says “We do not copy results from any of our competitors. Period. Full stop.”
Given the strength of the “We do not copy Google’s results” statements, I think it’s fair to line up screenshots of the results on Google that later showed up on Bing:
[Seven pairs of screenshots followed here, each showing a set of Google search results compared with the same results later appearing on Bing.]
I think if you asked a regular person about these screenshots, Microsoft’s “We do not copy Google’s results” statement wouldn’t ring completely true.
Something I’ve heard smart people say is that this could be due to generalized clickstream processing rather than code that targets Google specifically. I’d love it if Microsoft would clarify that, but at least one example has surfaced in which Microsoft was targeting Google’s URLs specifically. The paper is titled Learning Phrase-Based Spelling Error Models from Clickthrough Data, and here are some of the relevant parts:
“The clickthrough data of the second type consists of a set of query reformulation sessions extracted from 3 months of log files from a commercial Web browser [I assume this is Internet Explorer. --Matt] …. In our experiments, we “reverse-engineer” the parameters from the URLs of these [query formulation] sessions, and deduce how each search engine encodes both a query and the fact that a user arrived at a URL by clicking on the spelling suggestion of the query – an important indication that the spelling suggestion is desired. From these three months of query reformulation sessions, we extracted about 3 million query-correction pairs.”
This paper very much sounds like Microsoft reverse engineered which specific URL parameters on Google corresponded to a spelling correction. Figure 1 of that paper looks like Microsoft used specific Google URL parameters such as “&spell=1” to extract spell corrections from Google. Targeting Google deliberately is quite different from using lots of clicks from different places. This is at least one concrete example of Microsoft taking browser data and using it to mine data deliberately and specifically from Google (in this case, the efforts of Google’s spell correction team).
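The mining step the paper describes can be sketched roughly like this. This is a hypothetical reconstruction for illustration: the parameter names and session format are assumptions based on the paper’s description, not Microsoft’s actual code.

```python
from urllib.parse import urlparse, parse_qs

def extract_spell_pair(clicked_url, previous_query):
    """If clicked_url is a search-results URL whose query string flags a
    spelling suggestion (e.g. a parameter like spell=1), pair the corrected
    query it carries with the previous query in the session, yielding a
    (misspelling, correction) training example."""
    params = parse_qs(urlparse(clicked_url).query)
    if params.get("spell") == ["1"]:
        corrected = params.get("q", [""])[0]
        if corrected and previous_query and corrected != previous_query:
            return (previous_query, corrected)
    return None
```

Run over millions of logged sessions, a rule like this would harvest query-correction pairs that encode another engine’s spell-correction work, which is the crux of the concern.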
That brings me to an issue that I raised with Bing during the search panel and afterwards with Harry Shum: disclosure. A while ago, my copy of Windows XP was auto-updated to IE8. Here’s one of the dialog boxes:
[Screenshot: IE8’s “Suggested Sites” dialog box]
I don’t think an average consumer realizes that answering “yes, show me suggested sites” grants Microsoft permission to send their Google queries and clicks to Microsoft, which then uses that data in Bing’s rankings. I think my Mom would be surprised to learn that saying “Yes” to that dialog sends what she searches for on Google, and what she clicks on, to Microsoft. IE8’s disclosure simply isn’t clear and conspicuous enough for a reasonable consumer to make an informed choice.
One comment that I’ve heard is that “it’s whiny for Google to complain about this.” I agree that’s a risk, but at the same time I think it’s important to go on the record about this.
Another comment that I’ve heard is that this affects only long-tail queries. As we said in our blog post, the whole reason we ran this test was because we thought this practice was happening for lots and lots of different queries, not simply rare queries. Rare queries were just the easiest way to verify our hypothesis. To me, what the experiment proved was that clicks on Google are being incorporated into Bing’s rankings. Microsoft is the company best able to say how heavily clicks on Google figure into Bing’s rankings, and I hope they clarify how much of an impact those clicks have.
Unfortunately, most of the reply has been along the lines of “this is only one of 1000 signals.” Nate Silver does a good job of tackling this, so I’ll quote him:
Microsoft’s defense boils down to this: Google results are just one of the many ingredients that we use. For two reasons, this argument is not necessarily convincing.
First, not all of the inputs are necessarily equal. It could be, for instance, that the Google results are weighted so heavily that they are as important as the other 999 inputs combined.
And it may also be that an even larger fraction of what creates value for Bing users are Google’s results. Bing might consider hundreds of other variables, but these might produce little overall improvement in the quality of its search, or might actually detract from it. (Microsoft might or might not recognize this, since measuring relevance is tricky: it could be that features that they think are improving the relevance of their results actually aren’t helping very much.)
Second, it is problematic for Microsoft to describe Google results as just one of many “signals and features”. Google results are not any ordinary kind of input; instead, they are more of a finished (albeit ever-evolving) product.
Let’s take that thought to its conclusion. If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this? And if Microsoft is unwilling to stop incorporating Google’s clicks in Bing’s rankings, doesn’t that argue that Google’s clicks account for much more than 1/1000th of Bing’s rankings?
I really did try to be calm and constructive in this post, so I apologize if some frustration came through despite that; my feelings on the search panel were definitely not feigned. Since people at Microsoft might not like this post, I want to reiterate that I know the people (especially the engineers) at Bing work incredibly hard to compete with Google, and I have huge respect for that. It’s because of how hard those engineers work that I think Microsoft should stop using clicks on Google in Bing’s rankings. If Bing does better on a search query than Google does, that’s fantastic. But an asterisk that says “we don’t know how much of this win came from Google” does a disservice to everyone. I think Bing’s engineers deserve to know that when they beat Google on a query, it’s due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark.
If you want to dive into this topic even deeper, you can watch the full forty minute video above.
I just wanted to give a quick update on one thing I mentioned in my search engine spam post.
My post mentioned that “we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.” That change was approved at our weekly quality launch meeting last Thursday and launched earlier this week.
This was a pretty targeted launch: slightly over 2% of queries change in some way, but less than half a percent of search results change enough that someone might really notice. The net effect is that searchers are more likely to see the sites that wrote the original content rather than a site that scraped or copied the original site’s content.
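Google hasn’t published how this change works, but a classic building block for spotting copied content, and a reasonable mental model here, is shingle-based near-duplicate detection. This sketch is illustrative only, not the launched algorithm:

```python
def shingles(text, k=4):
    """Set of overlapping k-word windows (shingles) from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(doc_a, doc_b, k=4):
    """Jaccard similarity of the documents' shingle sets; values near 1.0
    suggest one page substantially copies the other."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Given a cluster of near-duplicate pages identified this way, the remaining (and harder) step is choosing which page is the original, using signals such as which URL was crawled first.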
Thanks to Jeff Atwood and the team at Stack Overflow for providing feedback to Google about this issue. I also mentioned the update over on Hacker News, because folks there had been discussing specific queries too.
(Just as a reminder: while I am a Google employee, the following post is my personal opinion.)
Recently I read a fascinating essay that I wanted to comment on. I found it via Ars Technica and it discusses “search neutrality” (PDF link, but I promise it’s worth it). It’s written by James Grimmelmann, an associate professor at New York Law School. The New York Times called Grimmelmann “one of the most vocal critics” of the proposed Google Books agreement, so I was curious to read what he had to say about search neutrality.
What I discovered was a clear, cogent essay that calmly dissects the idea of “search neutrality” that was proposed in a New York Times editorial. If you’re at all interested in search policies, how search engines should work, or what “search neutrality” means when people ask search engines for information, advice, and answers, I highly recommend it. Grimmelmann considers eight potential meanings for search neutrality throughout the article. As Grimmelmann says midway through the essay, “Search engines compete to give users relevant results; they exist at all only because they do. Telling a search engine to be more relevant is like telling a boxer to punch harder.” (emphasis mine)
On the notion of building a completely transparent search engine, Grimmelmann says
A fully public algorithm is one that the search engine’s competitors can copy wholesale. Worse, it is one that websites can use to create highly optimized search-engine spam. Writing in 2000, long before the full extent of search-engine spam was as clear as it is today, Introna and Nissenbaum thought that the “impact of these unethical practices would be severely dampened if both seekers and those wishing to be found were aware of the particular biases inherent in any given search engine.” That underestimates the scale of the problem. Imagine instead your inbox without a spam filter. You would doubtless be “aware of the particular biases” of the people trying to sell you fancy watches and penis pills–but that will do you little good if your inbox contains a thousand pieces of spam for every email you want to read. That is what will happen to search results if search algorithms are fully public; the spammers will win.
And Grimmelmann independently hits on the reason that Google is willing to take manual action on webspam:
Search-engine-optimization is an endless game of loopholing. …. Prohibiting local manipulation altogether would keep the search engine from closing loopholes quickly and punishing the loopholers–giving them a substantial leg up in the SEO wars. Search results pages would fill up with spam, and users would be the real losers.
I don’t believe all search engine optimization (SEO) is spam. Plenty of SEOs do a great job making their clients’ websites more accessible, relevant, useful, and fast. Of course, there are some bad apples in the SEO industry too.
Grimmelmann concludes
The web is a place where site owners compete fiercely, sometimes viciously, for viewers and users turn to intermediaries to defend them from the sometimes-abusive tactics of information providers. Taking the search engine out of the equation leaves users vulnerable to precisely the sorts of manipulation search neutrality aims to protect them from.
Really though, you owe it to yourself to read the entire essay. The title is “Some Skepticism About Search Neutrality.”
