Sunday, October 30, 2011

Debugging a slow site using Apache response time and TinyLogAnalyzer

We met recently a new customer, a company that need help with a slow application composed by:
  • a Plone CMS
  • a set of external processes (a daemon, an e-mail scan system, ...)
  • an complex integration with an external database engine
  • an integration with two kind of Visual Basic based client desktop application
With the grow of the company bussines they suffer in years a gradually decay of performance.
Some architectural changes were needed, but the company also asked for some quick fix, to get back a faster application while major changes are planned.

Tune up this application was not simple because the whole system is Windows based, so a lot of well-know software we commonly use are not applicable. The company contacted us because the core of the application seems the Plone CMS and the underlying Zope, but we met a customized version of a very old Plone installation. More or less we were facing something that was not Plone.

We decided to start to look at the code (everyone always hope to find some app-go-slow=True flag activated!), but meanwhile we enabled the response time log on the Apache front-end (and it works also on the Windows version!)

How long does it take to serve a request?
You can change the default Apache HTTP log configuration from something like this:
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined this:
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T/%D" combined

Suddently you log format change to this:
    [31/Jan/2008:14:19:07 +0000] "GET / HTTP/1.1" 200 7918 "" "..." 0/95491

...adding the time (in seconds and microseconds) needed to handle the request.

Please note that in a complex installation this will measure the total time needed for the request (if you put a reverse-cache proxy and a load balancer behind your Apache, this will raise the total time) but their Windows installation is very simple: Apache and Zope behind.

How to analyze results?
After a month of collected log data in the new format we wanted to see some statistics. The problem now was: how to read this log to learn something?

After some Google search I was a little worried. There are some real-time monitor tool for log analysis, but we are really scared to try them on a production Windows environment.

The only log analyzer found was apache-response-time: the projects description is really promising, however I was not able to use it: I installed it on a Linux and MacOS system but both I get a segmentation fault error.
Neither a stackoverflow question helped me to find something more. 

So I started to write some few Python lines preparing a custom analyzer. This job became interesting, so I spent a pair of week-ends for code refactoring and to make initial script is something reusable.

The result was TinyLogAnalyzer, a command line HTTP log analyzer that take a log file with response time enabled and extract two kind of statistics: the top total time and the top average time.

The top total time give you an idea of what is the most time consuming part of the application. In out case we found that more than 20% of the total time was spent performing searches and a special call performed by an external daemon.
Some optimization and changes to this code make the customer a little happier.
Also we find that a lot of requests were done for download external JavaScript resources, never cached.

The top average time instead give you an idea of the slower part of the application. With this I found that object creation and traversing in a special part of the application was really slow due to use of basic (not BTree) Plone folders.

All other release of the log analyzer product you can find are bug-fixing and new features still done for the same customer's problem (meanwhile we collected two more months of log).
For example, I found that some of the external tasks run also during the weekend, so collect data during Saturday and Sunday was wrong. The same for the night time.

The script is all but perfect. For example it keep all data in memory (so it's not the good tool for handle gigabytes-long-file) and read the log from the first line to the last (so if you need to skip some months it is still really slow).

However: it helped me a lot, giving some time to us and some breath to the customer, making things going better while major refactoring are planned.

Apart of the script itself, I think that enabling the Apache response time can be a great source of information when things goes wrong.