Sunday, August 21, 2011

TagCloud and SEO in Plone: problem solved and lesson learned

This article is an analysis of how a minor bug in a Plone product we maintain caused problems on a production site, how fixing it was not enough to undo the damage, and what lesson I learned.

The environment
The Plone site I'm talking about is an old Plone 3.3 installation running on old hardware. It worked without problems for years, then suddenly it started to be slow (sometimes really slow).
In front of this Plone installation there is also a Varnish installation that caches HTML for anonymous users as well (a standard setup for us).

So what could have gone wrong?!

When I started checking the problem I found that it wasn't a memory consumption problem (there was a lot of free memory, thanks to the use of plone.app.blob even on this Plone 3.3 installation), but only a CPU usage problem (unluckily we have only a single core for the Plone site).

The next step was checking what the site was doing to keep the CPU so busy. Before installing products like zope.healthwatcher, I spent some minutes simply checking the HTTP log and I found that the site is... "really popular". I mean it is visited very often by web crawlers, mainly Googlebot. This is not strange: it is the main site of a well-known public agency, and it is updated really often.

The Vaporisation problem found
Is this enough to make a Plone site "slow"? Obviously not! The problem was not Plone but the TagCloud portlet.

One of the best features of collective.vaporisation (maybe the main feature that convinced us to take over the project and maintain it) is the joint navigation.
When it is activated, clicking on one of the links inside the cloud will not simply display a search result, but a search result where you can also navigate through the found items using additional related terms (something like a faceted navigation).

The first problem was that the customer's site uses a lot of different keywords, applied widely across a lot of site contents.
This makes the joint navigation complex. Imagine a web crawler that starts to scan the tag cloud results page: it can follow a lot of additional links that lead to a huge number of different combinations of search results (let's call this big number N).

The second problem (the real one) was a bug in the way Vaporisation created links from the cloud to the results page (note that version 1.2.0 of the TagCloud fixed this problem, so recent releases will not show this behavior): older releases called the cloud_search template on the current context.
This means that when a visitor followed the cloud links from the home page, Plone was called as http://thesite.com/cloud_search, but when visiting another page, the URL became http://thesite.com/path/to/document/cloud_search.

This is a disaster for the cache that Varnish is trying to build for our site: the two URLs are different from the cache's point of view, but in fact on Plone they generate the same result. This is really bad (also for page rank).

This also raises the number of possible cloud search pages from N to N×M (where M is the number of documents in the site). Terrible!

The fix was simple: change the way the URL to the cloud search is created, so that users are always given the http://thesite.com/cloud_search version.
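To illustrate why the context-relative links hurt the cache, here is a minimal Python sketch. It is not the Vaporisation code: the function names and the "tags" query parameter are hypothetical, chosen only for this example. It builds the same tag search from three different contexts and counts how many distinct URLs (that is, cache keys) each strategy produces.

```python
from urllib.parse import urlencode

PORTAL_URL = "http://thesite.com"  # site root used throughout the article


def context_relative_url(context_url, tags):
    """Old behavior as described above: link built on the current context.

    The 'tags' parameter name is an assumption made for this sketch.
    """
    return "%s/cloud_search?%s" % (context_url, urlencode({"tags": tags}, doseq=True))


def root_url(portal_url, tags):
    """Fixed behavior: the link is always built on the site root."""
    return "%s/cloud_search?%s" % (portal_url, urlencode({"tags": tags}, doseq=True))


# Three pages from which a visitor (or a crawler) can reach the cloud.
contexts = [
    PORTAL_URL,
    PORTAL_URL + "/path/to/document",
    PORTAL_URL + "/news/item",
]

# Distinct URLs generated for one and the same search.
old_keys = {context_relative_url(c, ["plone"]) for c in contexts}
new_keys = {root_url(PORTAL_URL, ["plone"]) for _ in contexts}

print(len(old_keys), "cache entries with context-relative links")
print(len(new_keys), "cache entry with root-based links")
```

With M contexts the first strategy yields M cache entries per search, while the second always yields one, which is the N×M versus N difference described above.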

Google: the elephant
When I released this fix I made another mistake (not very lucky...).
Always remember that in Plone the context can be important. If you create a view that must be called on the site root, it is better to define it as callable only on that context. A common error is to define the view as callable everywhere (like old CMF skin templates).

So even if all URLs were now generated in the right way, the view was still defined as follows:

<browser:page
	 for="*"
	 name="cloud_search"
	 class=".search.CloudSearch"
	 template="search.pt"
	 permission="zope2.View"
	 />

This way, calling http://thesite.com/path/to/document/cloud_search (manually) will still work.

Why is this still a problem? Haven't you fixed all the wrong links?
Because of Google's long memory.

Even though we removed all links to the useless cloud_search calls, Google had already indexed them, so it continued to request all those URLs and made the Varnish cache useless.
Maybe with time those calls would decrease (as the search engines find no more links to these pages), but right after applying the fix we got no benefit.

Version 1.2.2 fixed this by also defining the right context for the view:

<browser:page
	 for="Products.CMFPlone.interfaces.siteroot.IPloneSiteRoot"
	 name="cloud_search"
	 class=".search.CloudSearch"
	 template="search.pt"
	 permission="zope2.View"
	 />

After this change all calls to anything other than http://thesite.com/cloud_search will simply generate a NotFound error page. This is good: Plone is very fast at generating this page, and pages of this kind (if no links lead to them) will rapidly disappear from the index.

Some SEO enhancement can help
Version 1.2.2 helped a lot, but in the meantime I was also reading a book about SEO and talking with coworkers (real SEO experts) about this kind of problem.

Must Google be aware of the cloud_search page? Is it a good thing for this page to be indexed? Also, the site still uses the joint navigation, so a large number of pages are still requested; but in any case, do those pages give any useful result to Web users who perform searches?

I think there isn't a universal answer (I also have doubts about the plain Plone SERP... must it be indexed by Google? Maybe Plone itself should think about this), but in this case the answer is no: letting search engines index the results pages of searches performed through the tag cloud is useless.

Version 1.2.3 makes the Vaporisation product more "SEO friendly".

First of all, we can suggest that search engines not follow a link when scanning our site by putting a rel='nofollow' attribute on it.

The new version of Vaporisation puts this rel nofollow value on every tag cloud link. I read recently that this attribute is only a hint, but it can help.

The second and last change in this version is the use of a meta tag in the cloud SERP:

<meta name="robots" content="noindex, nofollow" />

This tells the search engine not to index this page (the noindex value) and also not to follow any links from it (the nofollow value, less important here).
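A quick way to check that a rendered page really carries these directives is to parse its HTML. The following is only a sketch using Python's standard html.parser; the helper names are hypothetical, and the sample page is a reduced stand-in for the real cloud_search output.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Reduced stand-in for the cloud search results page.
SAMPLE_PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow" />'
    "</head><body></body></html>"
)

print(sorted(robots_directives(SAMPLE_PAGE)))
```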

This way Google quickly stopped indexing the huge number of possible joint navigation results pages as well.

Damn Google Reader!
After every fix I applied, the site quickly became faster.
After all those fixes I still saw heavy Google traffic to the site on a lot of different pages: this time to the search_rss page (used by Plone to generate RSS feeds from search results), commonly called with a URL like http://thesite.com/search_rss.
The site was no longer so slow, but I wanted to continue the investigation.

The problem was still the first one I analyzed above: the search_rss URL was linked from the cloud_search page and so, in older releases, it was built on the context and not on the portal root.
So while a version lower than 1.2.0 was active, Google indexed a lot of cloud_search variants and, for each one, also a useless search_rss page.

Again, having fixed the problem for the "master" cloud_search page will not stop Google from requesting all the RSS URLs.

This time we can't rely on a well-placed meta tag, as RSS isn't HTML! Where could I put the tag?!

We have two alternative ways.

The quicker one is to block every web crawler's access to the search_rss page, simply using a robots.txt file in the site root (but will this work with query parameters?):

User-agent: *
    Disallow: /search_rss
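One way to explore the query parameter question is to feed these rules to Python's standard urllib.robotparser and ask it about a few URLs. This is only a sketch: real crawlers may implement matching differently (some support wildcards that this parser does not), and the "Subject" query parameter below is just an example value.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, written flush-left.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search_rss
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked(url):
    """True when a generic crawler ("*") is not allowed to fetch the URL."""
    return not parser.can_fetch("*", url)


root_blocked = blocked("http://thesite.com/search_rss")
query_blocked = blocked("http://thesite.com/search_rss?Subject=plone")
nested_blocked = blocked("http://thesite.com/path/to/document/search_rss")

print("root:", root_blocked)
print("root with query string:", query_blocked)
print("called on a document:", nested_blocked)
```

With this parser the rule behaves as a prefix match from the site root: the query string variant is blocked too, while the context-relative URL is not covered by the rule. If other crawlers behave the same way, the old /path/to/document/search_rss URLs need a separate rule or one of the fixes described below.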

The other way, which also leaves the search_rss page usable by services like Google Reader, is to make the search_rss template callable only from the site root as well.

This needs some simple Plone customization (outside the Vaporisation product itself). Maybe Plone 5 (or 6?) will no longer use CMF templates, but right now we have a lot of them all around Plone.
One of the problems with old-style templates is that they can be called on every possible context. And obviously the search_rss template is a very old template...

So the fix is not as elegant as the cloud_search one. We can:
  • configure Apache to disallow any search_rss call outside the site root
  • manually check whether the current context is the site root and, if not, raise a NotFound error.
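The second option can be sketched as follows. To keep the example runnable outside Zope, NotFound and FakeContent are local stand-ins (in a real site the exception would be zExceptions.NotFound and the portal would typically come from the portal_url tool); the function name is hypothetical. Only getPhysicalPath mirrors a method that real Zope objects provide.

```python
class NotFound(Exception):
    """Stand-in for zExceptions.NotFound, so the sketch runs without Zope."""


class FakeContent:
    """Minimal stand-in for a Zope object: it only knows its physical path."""

    def __init__(self, path):
        self._path = tuple(path)

    def getPhysicalPath(self):
        return self._path


def ensure_called_on_site_root(context, portal):
    """Raise NotFound unless the template is being called on the site root."""
    if context.getPhysicalPath() != portal.getPhysicalPath():
        raise NotFound("search_rss is only available on the site root")


portal = FakeContent(("", "Plone"))
document = FakeContent(("", "Plone", "path", "to", "document"))

# Called on the site root: nothing happens and rendering can go on.
ensure_called_on_site_root(portal, portal)

# Called on a document: the request is rejected.
try:
    ensure_called_on_site_root(document, portal)
    rejected = False
except NotFound:
    rejected = True

print("document request rejected:", rejected)
```

In a customized search_rss this check would run before the feed is rendered, so crawlers requesting the old context-relative URLs get the cheap NotFound page instead of a full catalog search.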

Conclusion
I found the resolution of this somewhat "funny".

Other Plone sites that installed Vaporisation starting directly from version 1.2.0 didn't suffer any of the problems described here, because the cloud_search page was always called in the "most correct way": on the site root.
So in their case Google did not index all the other useless alternative ways to call the same page (this also means there was no need to fix the basic Plone search_rss).

One last thing: if you create a Plone template meant to be called on a specific context, be sure to register it only for that context. If the template is an old CMF one, keep an eye on how you create links to it.