Sunday 26 October 2014

Capacity Planning: Working within operational capacity


 
The final objective of all performance engineering, validation, and related work is to deliver performance that meets service requirements in production and gives the customer a smooth experience. One aspect is engineering software that has high capacity and low response time and makes optimal use of resources (CPU, memory, storage, and network). Another aspect is managing operations so that the software works optimally. Even well-engineered software will give a very poor user experience if it is used beyond its capacity.

Providing performance in the production environment is more important than just demonstrating it in performance labs.

Software uses hardware infrastructure to function and provide service. Saturation of the infrastructure (which is easy to detect) implies saturation of the software.
In the world of distributed software, modules and components use each other's services to provide service. Any one module that gets saturated will saturate the entire system or a major subsystem.


Smooth operations implies that all the subsystems work within their operational capacity.
Problem detection implies that monitoring is in place for the user experience, so that problems are detected well before the service levels become completely unacceptable.
Diagnostics implies that we can identify the subsystem (infrastructure + software service) that has saturated and take corrective action.
Predictive analytics implies that we can foresee capacity issues before they arise.


The general response time vs. workload graph is below. The major characteristics of this curve are the same for a hardware or a software service.
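
To get a feel for the shape of such a curve, here is a minimal sketch. It assumes a single-server M/M/1 queue, which is my choice of illustration and not necessarily the model behind the graph; the 10 ms service time is a hypothetical number. Response time stays nearly flat at low workload and climbs steeply as utilization approaches 100%.

```python
def response_time(arrival_rate, service_time):
    """Mean response time of an M/M/1 queue: R = S / (1 - U), where U = arrival_rate * S."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")  # saturated: the queue grows without bound
    return service_time / (1.0 - utilization)


# Hypothetical service: 10 ms per request, so it saturates at 100 requests/s.
SERVICE_TIME = 0.010

for rate in (10, 50, 80, 90, 95, 99):
    r_ms = response_time(rate, SERVICE_TIME) * 1000
    print(f"{rate:3d} req/s  utilization {rate * SERVICE_TIME:4.0%}  response {r_ms:7.1f} ms")
```

In this simplified model, going from 50% to 90% utilization multiplies the response time by five, which is why working close to saturation is so risky.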
 
If the capacity and utilization of each subsystem are clear, we have the decision-support information to ensure smooth operations.

Each subsystem should operate within its operational capacity; otherwise performance will suffer.

The subsystem with the highest utilization is the one that will saturate first at higher workload and become the bottleneck.

The headroom shown in the graph is the additional workload that the system can endure before it saturates. Depending on risk tolerance, additional capacity can be added while operational headroom still remains.
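
As a rough illustration of these ideas, the sketch below computes utilization and headroom per subsystem and picks the one most likely to become the bottleneck. The `Subsystem` class, the subsystem names, and all the numbers are hypothetical, and it assumes every subsystem's workload can be expressed in the same unit (requests/s).

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    capacity: float  # operational capacity in requests/s (hypothetical unit)
    workload: float  # current workload in the same unit

    @property
    def utilization(self) -> float:
        return self.workload / self.capacity

    @property
    def headroom(self) -> float:
        # Additional workload the subsystem can take before reaching its capacity.
        return max(self.capacity - self.workload, 0.0)


def bottleneck(subsystems):
    """Return the subsystem with the highest utilization."""
    return max(subsystems, key=lambda s: s.utilization)


# Made-up numbers for illustration only.
subsystems = [
    Subsystem("web tier", capacity=1000, workload=400),
    Subsystem("app tier", capacity=800, workload=560),
    Subsystem("database", capacity=600, workload=510),
]

for s in subsystems:
    print(f"{s.name:10s} utilization {s.utilization:5.1%}  headroom {s.headroom:6.1f} req/s")

print("likely bottleneck:", bottleneck(subsystems).name)
```

The hard part in practice is not this arithmetic but getting trustworthy capacity and workload numbers for every subsystem.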

 
Sounds simple! What are the challenges?

 
In a distributed solution there are too many subsystems: let's say thousands of pieces of hardware and a similar number of software components.
  • Are you monitoring all these subsystems?
  • What is the workload on these subsystems?
  • Do you know the capacity of these subsystems?
  • Do you know the utilization of these subsystems?
  • What is the headroom in these subsystems to endure more load?

Another challenge is the workload itself, which depends on the type of business. Businesses such as eCommerce and Internet services inherently see sudden surges in end-user activity. Some of these events can be anticipated: in the festive season you can expect more online shopping, and a big marketing initiative with discounts can put a heavy workload on the eCommerce software services. Systems that normally work within operational capacity then get saturated and give a poor end-user experience in the critical business period. Typically, government websites for tax or other online submissions become unresponsive around the last dates. In trading systems a lot of activity is driven by market volumes, and critical events such as corporate results or interest rate movements can trigger higher activity.

  • Do you have historical workload data for your services and an operational baseline of workload?
  • Do you know the workload trends (regular + seasonal)?
  • Can you forecast the workload for an important business event?
  • Is your capacity elastic, i.e. if you know you need twice the capacity, can you add it in time (more hardware + more software services + load distribution)?
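
The last question can be turned into simple arithmetic once you have a forecast. The sketch below estimates how many service instances a forecast peak would need. It assumes capacity scales linearly with the number of instances behind a load distributor, which real systems only approximate; the function name, the 70% target utilization, and the workload figures are all hypothetical.

```python
import math


def instances_needed(peak_workload, capacity_per_instance, target_utilization=0.7):
    """Instances required so that each one stays at or below the target utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_capacity = capacity_per_instance * target_utilization
    return math.ceil(peak_workload / usable_capacity)


# Hypothetical numbers: baseline of 2000 req/s, festive-season forecast of twice that,
# and each instance saturating at 500 req/s.
baseline = 2000
forecast_peak = 2 * baseline

print("instances for baseline:", instances_needed(baseline, 500))
print("instances for forecast peak:", instances_needed(forecast_peak, 500))
```

Keeping the target utilization below 100% leaves some headroom in case the forecast turns out to be too low.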
The next challenge: do you understand the relationship between service utilization and workload?
  • Do you have the required models?
  • Are you capturing the data required for these models, in all subsystems?
    • The OS provides monitoring information about the hardware
    • What about the middleware?
    • What about the application software?
    • What about the DB?
  • Do you know the highest load you can drive the system to while response time still meets the service requirements, not in the lab but in production?
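
One of the simplest models of this relationship is a straight line, in the spirit of the Utilization Law (utilization = throughput × service demand). The sketch below fits such a line to monitoring samples with ordinary least squares and then estimates the workload at which utilization would reach a chosen limit. It is only an illustration under the assumption that utilization grows linearly with workload; the function names, the samples, and the 75% limit are made up.

```python
def fit_line(workloads, utilizations):
    """Least-squares fit of utilization = intercept + slope * workload."""
    n = len(workloads)
    mean_x = sum(workloads) / n
    mean_y = sum(utilizations) / n
    sxx = sum((x - mean_x) ** 2 for x in workloads)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(workloads, utilizations))
    slope = sxy / sxx  # utilization added per unit of workload (service demand)
    intercept = mean_y - slope * mean_x  # utilization at zero workload (background load)
    return slope, intercept


def max_workload(slope, intercept, utilization_limit):
    """Workload at which the fitted line reaches the given utilization limit."""
    return (utilization_limit - intercept) / slope


# Made-up samples: (requests/s, CPU utilization as a fraction).
workloads = [100, 200, 300, 400]
cpu_utilization = [0.15, 0.25, 0.35, 0.45]

slope, intercept = fit_line(workloads, cpu_utilization)
print(f"service demand ~ {slope * 1000:.2f} ms of CPU per request")
print(f"estimated workload at 75% CPU ~ {max_workload(slope, intercept, 0.75):.0f} req/s")
```

A fit like this is only as good as the data behind it, and extrapolating far beyond the measured range should be treated with caution.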
In future posts we will discuss various models that solve these problems and the data you need to build them.
