Thursday, April 10, 2014

lessons learned from a two dollar load test

Recently, I had the chance to work with one of our engineering groups and lead load testing efforts for one of our web redesign projects. The objective was simple: introduce stress into the system and observe how the application and the infrastructure around it behave and react under various degrees of stress.

The first logical step in planning for a load test is understanding what information the engineering team deems important. The objective usually gives you a good overview of what they need, but you shouldn't stop there. Part of what you need to know is the data story: what path does the data take from curation until it gets served to the user? You need to understand what happens in between.

The next step is to understand the architecture behind the application. These are some of the questions that I usually ask. Are you using CDNs? What are your caching solutions? What are these caches' respective TTLs? Do we have multiple servers behind a load balancer? What is the load balancer algorithm? Is there a way for me to hit an origin server directly? What are the declared settings in httpd.conf? And so on. The answers will most likely give you insight into how complicated the data flow scenarios will be.

Speaking of data flow scenarios, these usually follow a rather standard set. How does the system behave during a greenfield request? (A greenfield request is one where none of the caches are set; this usually takes the longest.) How does the system behave when the request is fully cached? And finally, every boolean combination in between. A bonus case depends on the caching mechanism that's used. memcached, for example, evicts least-recently-used items once it reaches its memory limit, and on a host that is short on memory the operating system can start swapping memcached's pages to disk. The response times in this particular case can be almost as slow as a greenfield request, and most of the time this is a culprit that's a bit tricky to troubleshoot if you don't have the proper server instrumentation.
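To make "every boolean combination in between" concrete, here is a minimal sketch that enumerates the warm/cold states for a set of cache layers. The layer names (cdn, reverse_proxy, memcached) are my own placeholders, not the actual stack from this project; swap in whatever your architecture questions turned up.

```python
from itertools import product

# Hypothetical cache layers, ordered from the user back toward the origin.
CACHE_LAYERS = ["cdn", "reverse_proxy", "memcached"]


def cache_scenarios(layers):
    """Return every warm (True) / cold (False) combination of the layers."""
    return [dict(zip(layers, states))
            for states in product([False, True], repeat=len(layers))]


def label(scenario):
    """Give a scenario a human-readable name for the test plan."""
    if not any(scenario.values()):
        return "greenfield"
    if all(scenario.values()):
        return "fully cached"
    warm = [name for name, is_warm in scenario.items() if is_warm]
    return "warm: " + ", ".join(warm)


if __name__ == "__main__":
    for scenario in cache_scenarios(CACHE_LAYERS):
        print(label(scenario))
```

Three layers give eight scenarios, so the list grows quickly; in practice you would probably prune the combinations that can't occur in your setup.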

I usually liken most data flow diagrams to a series of rivers, waterfalls and dams. The end user is at the end of the line and content creation is at the source. The network is the path where the water flows, the size of the path defines the bandwidth, the dams represent the caches and, of course, the water is the data that a person eventually consumes.

Finally, we deal with the testing approaches, which is really where the good stuff happens. In the years that I've performed load testing, no one has summed up the vantage points from which you can perform load testing better than Scott Barber. The vantage points Scott pointed out in his Web Load Testing for Dummies book (I can't seem to find a link to buy this anywhere, help?) fall under three categories: behind the firewall, within the cloud, and from the user's perspective.

Each of the above vantage points provides different types of information. I will not list all of them, but I will provide what I think were my major takeaways. Behind the firewall, or Load Testing 1.0, enables you to test component performance. Within the cloud, aka outside the firewall or Load Testing 1.5, provides information about performance when there are multiple components and how these components relate to each other. Lastly, there is the user's perspective, aka Load Testing 2.0. This approach is rather new and was not an option until 2009 or 2010. It gives you the ability to load real browsers from different areas of the world and swarm a particular URL. Think DDoS, but used for good (not evil). The information you get out of this will be very close to your users' actual response times and will also give you a sense of how your third-party content really affects your load times from that perspective.
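For the behind the firewall case, a component test can start as a very small script. Below is a minimal sketch, not what we actually ran: it steps up the number of concurrent workers and records latencies at each step. The fake_request function is a stand-in I made up so the example runs anywhere; you would replace it with a real call against the component under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request():
    """Stand-in for a real HTTP call; sleeps briefly to simulate work."""
    time.sleep(0.001)
    return 200


def timed_call(request_fn):
    """Run one request and return (status, elapsed seconds)."""
    start = time.perf_counter()
    status = request_fn()
    return status, time.perf_counter() - start


def run_step(request_fn, concurrency, requests_per_worker):
    """Fire concurrency * requests_per_worker requests and summarize them."""
    total = concurrency * requests_per_worker
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(request_fn), range(total)))
    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "concurrency": concurrency,
        "requests": total,
        "errors": sum(1 for status, _ in results if status >= 400),
        "median_s": latencies[len(latencies) // 2],  # upper median
        "max_s": latencies[-1],
    }


def ramp(request_fn, steps=(1, 5, 10), requests_per_worker=5):
    """Increase the degree of stress step by step."""
    return [run_step(request_fn, c, requests_per_worker) for c in steps]


if __name__ == "__main__":
    for row in ramp(fake_request):
        print(row)
```

Threads on a single machine only tell you about the component itself. They say nothing about geography, real browsers or third-party content, which is exactly the gap the other two vantage points fill.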

Early this week, we had our first dry run of the user's perspective approach using Neustar's Web Performance Tool (aka BrowserMob). Within 10 minutes of the test, we found an issue with the load balancer algorithm, seemingly brought about by a VIP misconfiguration. To me, this is proof that it's never too early to perform load testing, since it would have been disastrous if this had been opened up to the world. The other cool thing was that we didn't have to pay a hefty licensing sum, as we would have with LoadRunner, or go through a complicated setup process for our own JMeter-based load generator infrastructure (which we are considering).

Based on the number of concurrent users and the length of the test, it cost us (literally) $2.25 to identify problems that would otherwise have cost us money, lost personnel time and, above all, a tarnished reputation (yeah, it's possible).

How do you load test?
