Performance testing server-side applications is crucial to understanding how an application behaves under load. It helps software teams fine-tune their applications to get the best performance while keeping infrastructure costs low. Performance testing answers several important questions, such as:
- Is the application ready to handle the traffic that’s going to hit it?
- What do average response times and latencies look like under normal and peak loads?
- Can the application be scaled out?
- What are the bottlenecks? (could be CPU, memory, an external service or a database server)
- How many instances are needed for supporting the estimated traffic (i.e. max RPS)?
- What type of instances are needed? Does the application require an instance with a higher CPU-to-memory ratio? Or does it need an instance type that supports high network utilization?
- Does the application slowly degrade in performance under load? Is it slowly leaking a resource that eventually crashes it after a few hours or days?
A lesson I learned recently is that performance testing should not be an afterthought. Software teams should start performance testing early in the release cycle, not wait until the end to do it. I once worked on a team that built a backend service that passed all unit, integration and end-to-end tests with flying colors. QA engineers didn’t find any bugs in the application’s logic. However, the performance was just terrible when we ran load tests on it. On a single m4.large instance, the application supported 80% fewer requests than the team had estimated! The main bottleneck turned out to be the 2-core CPU, which was utilized to its maximum capacity as the application issued several queries to the database and applied complex algorithms to build a graph. Investigation by the developers revealed that reducing the amount of work the CPU was doing would require significant design changes. But it was already too late - the deadline was just weeks away. We decided to proceed with the release - albeit by over-provisioning the hardware and overrunning our cost estimates by a factor of three.
Performance testing is a broad topic. Teams I work with run load and soak tests to measure performance metrics such as throughput, latency, resource utilization, etc. using a wide variety of tools. At Glu, we build REST services in Java and use the following tools for our performance tests:
- YourKit Profiler to profile CPU and memory usage at a fine-grained level.
- Apache JMeter to generate load (in practice, BlazeMeter or distributed JMeter).
- Amazon CloudWatch to monitor resource utilization, since we deploy our services on the Amazon cloud.
- Hosted Graphite to observe custom metrics emitted by the application that we are interested in (see the sketch after this list).
- Kibana dashboards to look at the logs, errors, etc.
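To illustrate the custom-metrics piece, here is a minimal sketch of publishing application metrics to a Graphite backend using the Dropwizard Metrics library. The hostname, port, prefix and metric name are placeholders, and this is not necessarily how our services are wired up - it simply shows the general shape of emitting a latency metric you could then watch on a dashboard during a load test.

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class GraphiteMetricsSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Ship all registered metrics to a Graphite endpoint every 30 seconds.
        // The host and port are placeholders; Hosted Graphite provides its own endpoint.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("chat-service")              // hypothetical service name
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(30, TimeUnit.SECONDS);

        // Time a hot code path so latency percentiles show up on the dashboard under load.
        Timer timer = registry.timer("buildGraph.latency");
        try (Timer.Context ignored = timer.time()) {
            // ... the work being measured ...
        }
    }
}
```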
Before I wrap this post up, there are a few other important lessons I’d like to share:
- Run performance tests on a production-like environment. I have seen teams run performance tests on their MacBook Pros with 8-core CPUs and get drastically different results than they would from the actual cloud instances with puny, virtualized hardware.
- Create a good load test plan, which requires careful thought. The goal is to emulate the load that real users would generate; otherwise, you might spend a lot of time chasing ghosts and fixing issues that are unlikely to happen in production. For example, I was once investigating an issue with the performance of a chat service under load. After looking at the load test script, I found that it grouped tens of thousands of users together, whereas in reality a group has on average about 10-50 people. The fix was to update the load test script to use a random group-id, or a group-id drawn from a pool, instead of reusing the same id for every request (a sketch follows below).
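The fix amounted to something like the following. This is a simplified Java sketch under my own assumptions - the actual change lived in the JMeter test script, and the group ids here are made up - but it captures the idea of spreading simulated traffic across realistically sized groups.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class GroupIdPool {
    // Hypothetical pool of group ids; in production a group averages 10-50 members,
    // so spreading simulated requests across many groups mirrors real usage.
    private static final List<String> GROUP_IDS = List.of(
            "group-0001", "group-0002", "group-0003", "group-0004", "group-0005");

    // Pick a different group id for each simulated request instead of reusing a single id.
    static String nextGroupId() {
        return GROUP_IDS.get(ThreadLocalRandom.current().nextInt(GROUP_IDS.size()));
    }

    public static void main(String[] args) {
        // Each simulated request targets a randomly chosen group from the pool.
        for (int i = 0; i < 5; i++) {
            System.out.println("POST /groups/" + nextGroupId() + "/messages");
        }
    }
}
```

In JMeter itself, the equivalent is typically a CSV Data Set Config or a random function feeding the group-id parameter of the sampled request.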
I hope this post was helpful. I’d love to hear your thoughts, and the tools and approach you take for performance testing. Till next time.