Random thoughts
Tuesday, January 31, 2012
Google Browser Size: Is your content visible?
Have you ever wondered how much of your carefully designed Web page is actually visible to the people coming to your site?
Then take a look at Google Browser Size, an amazingly simple and effective tool for Web designers to see what percentage of users sees which content.
Of course we all know to place important content towards the top, above the fold; we have seen the heatmaps from eye tracking studies; and we all test at different screen sizes, right? Google Browser Size, launched back in December 2009, just makes the testing easier and brings this home with shocking immediacy (Mike Moran at Biznology).
The visualization is based on the browser window sizes of people who visit Google, not on the actual browser window sizes used when accessing a particular site. Depending on how closely your audience matches the average Google visitor, results may vary.
One caveat: as mentioned on the Browser Size website, the tool works best on web pages with a fixed layout aligned to the left. The visualization can be misleading for liquid or responsive pages that adjust to the available screen width, as well as for centered pages.
Labels: google, seo, usability, webdevelopment
Monday, October 24, 2011
Google encrypting searches: security, privacy and control
Google recently announced plans to make search more secure.
This effort includes encrypting search queries, which is especially important when using an unsecured Internet connection or accessing the Internet through intermediate devices which have the ability to log requests. Encrypting the search interface automatically blocks referrer information for unencrypted sites, and provides an incentive for companies to join the industry effort to use SSL/TLS encryption more widely.
But Google takes this a step further, hiding query information from encrypted searches. The click-through tracking link for unencrypted search includes the search term parameter “q”, which gets passed to the visited website:
http://www.google.com/url?sa=t&rct=j&q=example&url=http%3A%2F%2Fexample.com%2F
The link for encrypted search, however, leaves the parameter “q” empty. Interestingly click-throughs for encrypted searches are tracked on an unencrypted connection, thus revealing the visited site address to an eavesdropper:
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&frm=1&url=http%3A%2F%2Fexample.com%2F
With this change, the visited website receives no information about the search term. What enhances privacy for searchers withholds from website owners information that is important for optimizing their websites to best serve visitors.
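As an illustration, here is a minimal sketch (in Python, using the example URLs above as stand-ins for real referrers) of what a site can recover from the Referer header; for encrypted searches the q parameter comes back empty:

# Sketch only: the referrer values are the example click-through URLs from above.
from urllib.parse import urlsplit, parse_qs

def search_term(referrer):
    # Return the value of the q parameter, or '' if it is missing or empty.
    return parse_qs(urlsplit(referrer).query).get("q", [""])[0]

unencrypted = "http://www.google.com/url?sa=t&rct=j&q=example&url=http%3A%2F%2Fexample.com%2F"
encrypted = "http://www.google.com/url?sa=t&rct=j&q=&esrc=s&frm=1&url=http%3A%2F%2Fexample.com%2F"

print(search_term(unencrypted))  # example
print(search_term(encrypted))    # '' -- the search term is withheld

This kind of lookup is what keyword reports in web analytics packages are built on, which is why the empty parameter hurts.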
Browsers already provide mechanisms for controlling referrer information, for example the network.http.sendRefererHeader preference setting or the customizable RefControl extension for Firefox. Google’s privacy enhancement takes control away from the users by not passing referring information, period.
Google’s move has the potential to change the search engine marketing (SEM) landscape. Search terms in paid ads will remain trackable, unchanged. For organic search, the only way for website owners to get access to search terms (albeit delayed, aggregated, and limited to the top 1,000) is through the Google Webmaster Tools console, a very fine tool but not a replacement for an integrated web analytics solution.
The impact of this change goes beyond web analytics and search engine optimization (SEO): Sites often use the search terms that led visitors to the site for dynamic customization, offering related information and links. With encrypted search, visitors will no longer have access to these enhancements either.
Wednesday, February 9, 2011
Google vs. Bing: A technical solution for fair use of clickstream data
When Google engineers noticed that Bing unexpectedly returned the same result as Google for a misspelling of tarsorrhaphy, they concluded that somehow Bing considered Google’s search results for its own ranking. Danny Sullivan ran the story about Bing cheating and copying Google search results last week. (Also read his second article on this subject, Bing: Why Google’s Wrong In Its Accusations.)
Google decided to create a trap for Bing by returning results for about 100 bogus terms, as Amit Singhal, a Google Fellow who oversees the search engine’s ranking algorithm, explains:
To be clear, the synthetic query had no relationship with the inserted result we chose—the query didn’t appear on the webpage, and there were no links to the webpage with that query phrase. In other words, there was absolutely no reason for any search engine to return that webpage for that synthetic query. You can think of the synthetic queries with inserted results as the search engine equivalent of marked bills in a bank.
Running Internet Explorer 8 with the Bing toolbar installed and the “Suggested Sites” feature of IE8 enabled, Google engineers searched Google for these terms and clicked on the inserted results, and confirmed that a few of these results, including “delhipublicschool40 chdjob”, “hiybbprqag”, “indoswiftjobinproduction”, “jiudgefallon”, “juegosdeben1ogrande”, “mbzrxpgjys” and “ygyuuttuu hjhhiihhhu”, started appearing in Bing a few weeks later.
The experiment showed that Bing uses clickstream data to determine relevant content, a fact that Microsoft’s Harry Shum, Vice President Bing, confirmed:
We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.
These clickstream data include Google search results, more specifically the click-throughs from Google search result pages. Bing considers these for its own results and consequently may show pages which otherwise wouldn’t appear in the results at all, since they don’t contain the search term, or may rank results differently. Relying on a single signal made Bing susceptible to spamming, and the algorithms would need to be improved to weed out suspicious results, Shum acknowledged.
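For illustration only (a hypothetical log format, not Bing's actual pipeline), this is roughly how such a click record could be turned into a ranking signal: read the search term from the q parameter of the results-page URL and pair it with the clicked page, weighted by click volume.

# Hypothetical toolbar log format; not Bing's actual pipeline.
from urllib.parse import urlsplit, parse_qs
from collections import Counter

def query_of(serp_url):
    # Extract the search term from a results-page URL via its q parameter.
    return parse_qs(urlsplit(serp_url).query).get("q", [""])[0]

# Each log entry pairs the page the user was on with the link they followed.
click_log = [
    ("http://www.google.com/search?q=hiybbprqag", "http://example.com/landing"),
    ("http://www.google.com/search?q=hiybbprqag", "http://example.com/landing"),
]

# Aggregate click volume per (query, clicked URL) pair as a relevance signal.
signals = Counter((query_of(referrer), clicked) for referrer, clicked in click_log)
print(signals.most_common())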
As an aside, Google had also experienced in the past how relying too heavily on a few signals allowed individuals to influence the ranking of particular pages for search terms such as “miserable failure”; despite improvements to the ranking algorithm we continue to see successful Google bombs. (John Dozier's book about Google bombing nicely explains how to protect yourself from online defamation.)
The experiment did not establish whether other sources are considered in the clickstream data as well. Outraged about the findings, Google accused Bing of stealing its data and claimed that “Bing results increasingly look like an incomplete, stale version of Google results—a cheap imitation”.
Whither clickstream data?
Privacy concerns aside—customers installing IE8 and the Bing toolbar, or most other toolbars for that matter, may not fully understand and often do not care how their behavior is tracked and shared with vendors—using clickstream data to determine relevant content for search results makes sense. Search engines have long considered click-throughs on their results pages in ranking algorithms, and specialized search engines or site search functions will often expose content that a general purpose search engine crawler hasn’t found yet.
Google also collects loads of clickstream data from the Google toolbar and the popular Google Analytics service, but claims that it does not consider Google Analytics for page ranking.
Using clickstream data from browsers and toolbars to discover additional pages and seeding the crawler with those pages is different from using the referring information to determine relevant results for search terms. Microsoft Research recently published a paper Learning Phrase-Based Spelling Error Models from Clickthrough Data about how to improve the spelling corrections by using click data from “other search engines”. While there is no evidence that the described techniques have been implemented in Bing, “targeting Google deliberately” as Matt Cutts puts it would undoubtedly go beyond fair use of clickstream data.
Google considers the use of clickstream data that contains Google Search URLs plagiarism and doesn't want another search engine to use this data. With Google dominating the search market and handling the vast majority of searches, Bing's inclusion of results from a competitor remains questionable even without targeting, and dropping that signal from the algorithm would be a wise choice.
Should all clickstream data be dropped from the ranking algorithms, or just certain sources? Will the courts decide what constitutes fair use of clickstream data and who “owns” these data, or can we come up with a technical solution?
Robots Exclusion Protocol to the rescue
The Robots Exclusion Protocol provides an effective and scalable mechanism for selecting appropriate sources for resource discovery and ranking. Clickstream data sources and crawler results have a lot in common: both provide information about pages for inclusion in the search index, and relevance information in the form of inbound links or referring pages, respectively.

| Dimension | Crawler | Clickstream |
|---|---|---|
| Source | Web page | Referring page |
| Target | Link | Followed link |
| Weight | Link count and equity | Click volume |
Following the Robots Exclusion Protocol, search engines only index Web pages which are not blocked in robots.txt, and not marked non-indexable with a robots meta tag. Applying the protocol to clickstream data, search engines should only consider indexable pages in the ranking algorithms, and limit the use of clickstream data to resource discovery when the referring page cannot be indexed.
Search engines will still be able to use clickstream data from sites which allow access to local search results, for example the site search on amazon.com, whereas Google search results are marked as non-indexable in http://www.google.com/robots.txt and therefore excluded.
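A minimal sketch of that rule, assuming Python's standard robots.txt parser (it checks robots.txt only; a real implementation would also honor robots meta tags and cache robots.txt per host):

import urllib.robotparser
from urllib.parse import urlsplit

def indexable(url, user_agent="*"):
    # Check whether the referring page may be indexed according to its robots.txt.
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def process_click(referrer, clicked_url, ranking_signals, discovery_queue):
    if indexable(referrer):
        # Indexable referrer, e.g. a site search page that is not blocked:
        # the click may count as a relevance signal for ranking.
        ranking_signals.append((referrer, clicked_url))
    else:
        # Excluded referrer, e.g. Google's /search pages: use the clicked URL
        # for resource discovery only.
        discovery_queue.append(clicked_url)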
Clear disclosure of how clickstream data are used, and a choice to opt in or opt out, put Web users in control of their clickstream data. Applying the Robots Exclusion Protocol to clickstream data will further allow Web site owners to control third-party use of their URL information.
Labels: bing, google, microsoft, seo, technology, webdevelopment
Monday, May 31, 2010
Blogger on your site
Google recently announced that Blogger will no longer support FTP publishing after May 1, 2010, citing low usage and the drain on engineering resources as the reasons. The announcement also cited reasons why people wanted their blogs published on their own site rather than going with the hosted solution.
If you are one of the 0.5% of bloggers who for whatever reason published via FTP or the more secure SFTP, you were left with the choice of moving your blog to blogspot.com or a custom domain name, or moving to another blogging platform. Importing your blog into WordPress is easy, WordPress has some nifty features that Blogger lacks, and you will easily find professionally designed WordPress themes, too. But switching to WordPress means going with the hosted solution on wordpress.com or installing and maintaining the WordPress code on your own server.
For those who want to stay with Blogger and have Blogger integrated into the Website there are two options, both requiring some hacking and configuration:
- Use the Blogger Data API to retrieve the blog in XML format and perform the rendering locally, most likely by processing the XML with XSLT stylesheets. While very flexible, this means losing the Blogger template capabilities.
- Build a reverse proxy that translates requests for blog resources to the corresponding URL on Google's servers. The proxy solution gives flexibility with URL formats and also allows for tweaking the generated HTML code before sending it to the browser.
The Blogger proxy solution
Here is how it works:
- Create backup copies of your blog in Blogger and on your server. The migration tool will update all previously published pages with a notice that your blog has moved, so you want to save the state of your blog first.
- Create a secret hostname for your blog in a domain you control, say secretname.example.com, and CNAME this to ghs.google.com. Don't limit your creativity, although the name really doesn't matter much. The migration tool checks that secretname.example.com is CNAMEd to ghs.google.com during the migration.
- Use the Blogger migration tool to move your blog to the new domain. At this point the blog will be up and running at secretname.example.com.
- Install a proxy script on your site which intercepts requests, rewrites the request as needed and sets a Host: secretname.example.com header, sends the modified request to ghs.google.com, rewrites the response to correct absolute links, and optionally tweaks the generated HTML code before sending the response to the browser (see the sketch after this list).
- Configure the Webserver to invoke the script when no local content is available, for example in Apache:
RewriteEngine On
# Map requests for the site root to a local index.html
RewriteRule ^$ index.html
# Hand everything that is not an existing file or directory to the proxy script
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /bloggerproxy.php [L]
- Google will eventually attempt to index your blog under secretname.example.com. To ensure a consistent appearance of the blog on your site, as the last step point secretname.example.com back to your Webserver and forward requests with that server name to your proxied blog using a 301 redirect.
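For illustration, here is a hypothetical minimal stand-in for such a proxy script. The rewrite rule above points at a PHP script (bloggerproxy.php), which is not shown here; the standalone Python sketch below only demonstrates the request and response handling, with hostnames as placeholders and error handling omitted.

# Hypothetical stand-in for the proxy script; not the original bloggerproxy.php.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

BLOG_HOST = "secretname.example.com"   # the secret Blogger hostname
PUBLIC_HOST = "www.example.com"        # the public name of your site

class BloggerProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the request to Google's servers, asking for the secret hostname.
        upstream = Request("http://ghs.google.com" + self.path,
                           headers={"Host": BLOG_HOST})
        with urlopen(upstream) as resp:
            status = resp.status
            content_type = resp.headers.get("Content-Type", "text/html")
            body = resp.read()
        # Rewrite absolute links so the blog appears under the public hostname.
        body = body.replace(BLOG_HOST.encode(), PUBLIC_HOST.encode())
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), BloggerProxy).serve_forever()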
Disclaimer: This solution is not for the faint of heart. It requires more changes and configuration than simply switching to a custom domain name, and isn't blessed or supported by Google. Use at your own risk.
Labels: google, web2.0, webdevelopment
Saturday, January 31, 2009
Google: This site may harm your computer
Google generally does a pretty good job warning users about suspicious Web sites assumed to contain malware, but their algorithm seems to have gone overboard now. This morning every search result shows a warning that the site may harm my computer.
Labels: google, technology
Monday, January 14, 2008
Blogger
Choosing a hosted service for blogging was a matter of a few minutes, and it didn't involve working through feature lists and comparison charts.
I started playing with Blogger and within minutes had a basic template and publishing to my Web server working. The template language looked sufficiently flexible, and the backing by search giant Google made this an attractive choice too.
WordPress would have been next on my review list. The hosted options are probably comparable, with WordPress offering some advanced features for a fee. Anita Campbell has published a great article about moving a blog from Blogger to WordPress, citing a number of good reasons why the latter is a much better option, although Blogger was “simple to set up and use”. Good enough for me.
One minor limitation I noticed is that Blogger only creates a single XML feed but no category feeds, which can be created easily using the rich Blogger data API.
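As a sketch of that idea (the feed URL and label below are placeholders), a category feed can be approximated by fetching the blog's single Atom feed and filtering its entries by their category elements; the Blogger data API may offer more direct ways, which I won't assume here.

# Sketch: approximate a category feed by filtering the main Atom feed locally.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED_URL = "http://example.blogspot.com/feeds/posts/default"  # placeholder blog

def entries_for_label(feed_url, label):
    # Yield (title, id) of entries carrying the given label as an Atom category.
    tree = ET.parse(urlopen(feed_url))
    for entry in tree.findall(ATOM + "entry"):
        terms = [c.get("term") for c in entry.findall(ATOM + "category")]
        if label in terms:
            yield entry.findtext(ATOM + "title"), entry.findtext(ATOM + "id")

for title, entry_id in entries_for_label(FEED_URL, "google"):
    print(title, entry_id)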
The only complaint I have about Blogger is the incorrect rendering of ampersands and angle brackets:
- Ampersand: &
- Angle bracket open: <
- Angle bracket close: >
They are represented correctly as entities in the XML feed, but rendered as plain characters in the HTML version. This looks like a bug that should be easy enough to fix.
Labels: google, technology
Wednesday, July 4, 2007
Google dropped my site
Google's Matt Cutts asked for feedback on the webmaster guidelines and I gladly shared my experience there:
Recently, Google sent me an email entitled "Entfernung Ihrer Webseite sitename aus dem Google Index" ("Removal of your website sitename from the Google index"), notifying me that one of my sites had been dropped from the Google index for violating the content quality guidelines. Now that site certainly deserved to be dropped for various reasons, not the least being that the content was old and not highly relevant, and I am quite happy to see that site dropped.
I couldn’t find anything related to the specific issue highlighted. In addition, the issue highlighted doesn’t exist (or I don’t understand what the notice is trying to say; maybe something got lost in the translation, since the mail was in German):
Wir haben auf Ihren Seiten insbesondere die Verwendung folgender Techniken festgestellt:
*Seiten wie z. B. example.com, die zu Seiten wie z. B. http://www.example.com/index.htm mit Hilfe eines Redirects weiterleiten, der nicht mit unseren Richtlinien konform ist.
Translation: In particular we have noticed the use of the following techniques on your pages: * Pages such as example.com, which redirect to pages such as http://www.example.com/index.htm using a redirect that is not compliant with our guidelines.
Now since when does Google consider redirects within a site evil? Plus, the referenced domain example.com does not even exist, nor does the homepage use a redirect.
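For what it's worth, a quick way to check that claim (a sketch; www.example.com stands in for the domain named in the notice) is to look at the raw response for the homepage and see whether, and where, it redirects:

# Sketch: inspect whether a homepage answers with a redirect and where it points.
import http.client

conn = http.client.HTTPConnection("www.example.com", timeout=10)
conn.request("HEAD", "/")
resp = conn.getresponse()
print(resp.status, resp.reason)
if 300 <= resp.status < 400:
    print("redirects to", resp.getheader("Location"))
conn.close()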
I couldn't care less about this particular site. What worries me though is that I haven't been able to identify how the site violates the guidelines, even after reading the guidelines more than once, and chances are that I have used the same techniques on other sites where I do care.
Labels: google
Wednesday, May 16, 2007
.net special issue about Google
The .net magazine has a special issue about Google. It looks pretty interesting judging from Matt Cutts' totally unbiased comments about it :-) and I haven't read .net for a while; the only problem is that this issue has sold out! I had tried to order a copy from www.myfavouritemagazines.co.uk (nice and easy to type URL!) but the Website was acting strangely and insisted that I had placed an order for the wrong continent, and when I tried again -- gone. Sooooo, if anyone happens to have a spare copy of the May 2007 issue of the .net magazine ...
Labels: google