
Friday, 17 July 2015

The 2nd Annual Conference of CMG India

The 2nd Annual Conference of CMG India will be held at M.S. Ramaiah Institute of Technology in Bangalore, on 27th and 28th November 2015. CMG India is a community that networks performance engineers across IT companies.

Call for Papers & Tutorials
CMG India 2015 calls for original submissions of real-life experiences, (research) work in progress, and tutorials in the following areas:
  • Performance Engineering of IT Systems: Design, Development, and Production System Management
  • Performance & Capacity Modelling
  • Big Data Performance Engineering
  • Cloud Performance Engineering
  • Mobile Systems Performance
  • In Memory Computing: Design and Optimization
  • High Performance Computing
  • Large System Architecture & Design for Performance

For details, visit the CMG India website.

Saturday, 1 November 2014

What other products can learn from Oracle database performance management

Whenever I diagnose and tune Oracle performance issues, I am filled with a deep sense of respect for the engineers who designed and implemented the diagnostic capabilities of the Oracle database.

The levels of performance management can be described as:
  • Descriptive
    • What happened? i.e. monitoring
  • Diagnostic
    • Why did it happen?
  • Prescriptive
    • What should be done?
Oracle does an excellent job at all three levels.

Oracle is a complex product, with a more complex memory and process architecture than a general server-side application, and this makes performance management more challenging. On top of that, characterizing the workload of a general server application is simple, while in the Oracle DB even the SELECTs alone come in many types, with very different resource and time footprints. All this makes performance management of the Oracle DB very challenging, and it makes the DB diagnostic solutions provided by Oracle even more admirable.

Oracle provides tuning parameters for each subsystem, be it the checkpoint (CKPT) process, the log writer (LGWR) process, etc., or the size of the shared pool, buffer cache, SGA, PGA, etc. Oracle's diagnostic data (AWR reports) provides rich information to detect inefficiencies in any of the process or memory subsystems, or shortfalls in hardware capacity. The AWR report contains the top foreground wait events, which help identify which subsystem is becoming the bottleneck and needs to be tuned or given more capacity. The report also identifies the high-load SQLs that are candidates for tuning and optimization.
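To show how this kind of wait-event data can also be pulled programmatically, here is a minimal sketch (not from the original post) that uses the python-oracledb driver to list the top non-idle wait events from V$SYSTEM_EVENT. Note that V$SYSTEM_EVENT holds cumulative counters since instance startup; a proper AWR analysis works over a snapshot interval via awrrpt.sql or the DBA_HIST_* views. The connection details below are placeholders.

```python
# Minimal sketch (assumption: python-oracledb driver and SELECT access to V$SYSTEM_EVENT).
# Connection details are placeholders, not real credentials.
import oracledb

def top_wait_events(dsn, user, password, limit=10):
    """Return the top non-idle wait events by cumulative wait time."""
    sql = """
        SELECT event, wait_class, total_waits, time_waited_micro
          FROM v$system_event
         WHERE wait_class <> 'Idle'
         ORDER BY time_waited_micro DESC
         FETCH FIRST :lim ROWS ONLY
    """
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(sql, lim=limit)
            return cur.fetchall()

if __name__ == "__main__":
    for event, wait_class, waits, micros in top_wait_events(
            dsn="dbhost/orclpdb", user="perfmon", password="secret"):
        print(f"{event:<40} {wait_class:<12} waits={waits:>10} time={micros/1e6:10.1f}s")
```

Ranking by time waited mirrors the "top foreground waits" idea: the wait class at the top of the list points at the subsystem to investigate first.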

Oracle goes beyond diagnostics to prescribing solutions as well. ADDM can be used to analyse the AWR data, and it will provide actionable recommendations to optimize the system. The SQL Tuning Advisor can be used to get tuning recommendations for a SQL query.

Oracle has achieved all three levels (monitoring, diagnostics, and prescription) in a way that a capable customer can do it all by himself without needing support from Oracle, greatly reducing the mean time to respond to performance issues as well as providing rich information for proactive performance management.

Hats off to the architects and engineers of Oracle Diagnostics!

In future posts we will discuss how to do root cause analysis of Oracle issues using AWR reports.


Thursday, 30 October 2014

CMG India 2014: 1st Annual Conference

Computer Measurement Group India is holding its first annual conference, CMG India 2014, on Performance Engineering and Capacity Management in Pune this December 2014.

Dates : Fri Dec 12th 2014 (9am to 5:30pm + Dinner) & Sat Dec 13th 2014 (9am to 4:30pm)

Venue: The conference will be co-located at Persistent Systems and Infosys, Phase I, Rajeev Gandhi Infotech Park, Hinjewadi, Pune 411057. (The offices are opposite each other.)

The conference has three tracks.


Sunday, 26 October 2014

Capacity Planning: Working within operational capacity

The final objective of all performance engineering and validation is to deliver performance that meets the service requirements in production, providing a smooth experience to the customer. One aspect is engineering software that has high capacity and low response time and that optimally utilizes its resources (CPU, memory, storage and network); another aspect is managing operations so that the software works optimally. Even well-engineered software, if used beyond its capacity, will provide a very poor user experience.

Delivering performance in the production environment is more important than just demonstrating it in performance labs.

Software utilizes hardware infrastructure to function and provide service. Saturation in the infrastructure (easy to detect) implies saturation of the software.
In the world of distributed software, software modules/components take services from each other to provide a service. Any one module that gets saturated will saturate the entire system or a major subsystem.

Smooth operations implies that all subsystems work within their operational capacity.
Problem detection implies that monitoring of the user experience is in place, so that problems are detected well before service levels become completely unacceptable.
Diagnostics implies that we can identify the subsystem (infrastructure + software service) that has saturated and take corrective action.
Predictive analytics implies that we can foresee capacity issues before they arise.

A general response time vs. workload graph is shown below; the major characteristics of this curve are the same for a hardware or a software service.
If the capacity and utilization of each subsystem are clear, we have the decision support information to ensure smooth operations.

Each subsystem should operate within its operational capacity, otherwise performance will suffer.

The subsystem with the highest utilization is the one that will saturate first at higher workload and become the bottleneck.

The headroom shown in the graph is the additional workload that the system can endure before it saturates. Depending on the risk tolerance, additional capacity can be added while operational headroom still remains.
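To make the curve and the headroom concrete, here is an illustrative sketch (my assumption, not from the post) that approximates a single subsystem as an M/M/1-style queue with a fixed mean service time: response time climbs sharply as utilization approaches 1, and headroom is the extra arrival rate that still keeps response time under a target.

```python
# Illustrative M/M/1-style approximation of the response-time-vs-workload curve.
# Assumption (not from the post): a single queue with a fixed mean service time.

def response_time(arrival_rate, service_time):
    """Mean response time R = S / (1 - U), with U = arrival_rate * service_time."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")          # beyond saturation the queue grows without bound
    return service_time / (1.0 - utilization)

def headroom(current_rate, service_time, max_response_time):
    """Extra workload (requests/s) the subsystem can absorb before breaching the target."""
    # Invert R = S / (1 - lambda * S) to get the highest arrival rate meeting the target.
    max_rate = (1.0 - service_time / max_response_time) / service_time
    return max(0.0, max_rate - current_rate)

if __name__ == "__main__":
    S = 0.010                         # 10 ms mean service time
    for lam in (20, 50, 80, 95):      # arrival rates in requests per second
        print(f"load={lam:>3}/s  U={lam*S:4.2f}  R={response_time(lam, S)*1000:6.1f} ms")
    print("headroom at 50/s with a 50 ms target:",
          round(headroom(50, S, 0.050), 1), "req/s")
```

With a 10 ms service time the subsystem saturates at 100 requests/s; at 50 requests/s and a 50 ms response-time target, the sketch reports about 30 requests/s of remaining headroom.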

Sounds simple! What are the challenges?

In a distributed, deployed solution there are many subsystems: say thousands of pieces of hardware equipment and a similar number of software components (a small inventory sketch follows this list).
  • Are you monitoring all these subsystems?
  • What is the workload on these subsystems?
  • Do you know the capacity of these subsystems?
  • Do you know the utilization of these subsystems?
  • What is the headroom in these subsystems to endure more load?
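A minimal inventory sketch of the kind hinted at above, with hypothetical subsystem names and numbers: compute utilization and headroom for each subsystem and rank them, since the subsystem with the highest utilization is the likely bottleneck.

```python
# Hypothetical inventory of subsystems: measured workload vs. estimated capacity.
# All names and numbers are illustrative, not from the post.
subsystems = {
    # name:            (current workload, capacity)  -- both in requests/s
    "web-tier":        (1200, 3000),
    "app-tier":        (1150, 1500),
    "order-db":        ( 400,  450),
    "payment-gateway": ( 390, 2000),
}

def utilization_report(inventory):
    rows = []
    for name, (load, capacity) in inventory.items():
        utilization = load / capacity
        rows.append((utilization, name, load, capacity, capacity - load))
    # The subsystem with the highest utilization saturates first as workload grows.
    return sorted(rows, reverse=True)

for util, name, load, cap, room in utilization_report(subsystems):
    print(f"{name:<16} load={load:>5}/s cap={cap:>5}/s "
          f"util={util:5.1%} headroom={room:>5}/s")
```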

Another challenge is the workload itself, which depends on the type of business. In businesses like eCommerce and Internet services there can be sudden surges in end-user activity. Some of these events can be anticipated: in the festive season you can expect more online shopping, and after a big marketing initiative with discounts there can be a big surge of workload on the eCommerce software services. So systems that were working within their operational capacity get saturated and provide a poor end-user experience during the critical business period. Typically, government websites for tax or online submissions become unresponsive around the last dates. In trading systems a lot of activity is driven by market volumes, and critical events like corporate results or interest rate movements can trigger higher activity.

  • Do you have the historical workload for your services and an operational baseline of the workload?
  • Do you know the workload trends (regular + seasonal)?
  • Can you forecast the workload for an important business event? (A rough forecasting sketch follows this list.)
  • Is your capacity elastic, i.e. if you know you need twice the capacity, can you add it in time? More hardware + more software service instances + load distribution.
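As a rough illustration of baselining and forecasting (the history and the weekly seasonality below are synthetic assumptions, not real data), this sketch fits a linear trend to daily peak workload and applies a day-of-week factor:

```python
# Rough workload forecast sketch: linear trend + day-of-week seasonal factors.
# The history below is synthetic; real forecasting would use actual monitoring data.
import numpy as np

history = np.array([  # daily peak requests/s over four weeks (synthetic)
    900, 950, 960, 940, 1000, 700, 650,
    930, 980, 990, 970, 1030, 720, 670,
    960, 1010, 1020, 1000, 1060, 740, 690,
    990, 1040, 1050, 1030, 1090, 760, 710,
])

days = np.arange(len(history))
slope, intercept = np.polyfit(days, history, 1)   # long-term linear trend
trend = slope * days + intercept
seasonal = np.array([                             # average ratio to trend per weekday
    (history[d::7] / trend[d::7]).mean() for d in range(7)
])

def forecast(day_index):
    """Forecast peak requests/s for a future day index (0 = first day of history)."""
    return (slope * day_index + intercept) * seasonal[day_index % 7]

# Forecast the busiest weekday of the week after the history ends.
print(round(forecast(32), 1), "requests/s expected")
```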
The next challenge is: do you understand the relationship between service utilization and the workload? (A small model sketch follows this list.)
  • Do you have the required models?
  • Are you capturing the data required for these models? In all subsystems?
    • The OS provides monitoring information about the hardware
    • What about the middleware?
    • What about the application software?
    • What about the DB?
  • Do you know the highest load to which you can drive the system while the response time still meets the service requirements, not in the lab but in production?
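One simple model connecting workload to utilization is the Utilization Law, U = X * S, where X is throughput and S is the per-request service demand on a resource. The sketch below (with invented measurements) estimates service demands from one production observation and then predicts per-resource utilization at a forecast workload; the resource with the largest service demand bounds the maximum achievable throughput.

```python
# Utilization Law sketch: U = X * S (throughput times service demand per request).
# Measurements below are invented for illustration.

measured_throughput = 500.0            # requests/s observed in production
measured_utilization = {               # busy fraction of each resource at that load
    "app-cpu": 0.40,
    "db-cpu":  0.55,
    "db-disk": 0.30,
}

# Estimate per-request service demand on each resource: S = U / X.
service_demand = {r: u / measured_throughput for r, u in measured_utilization.items()}

def predicted_utilization(throughput):
    """Predict per-resource utilization at a forecast throughput (requests/s)."""
    return {r: throughput * s for r, s in service_demand.items()}

forecast_x = 1000.0                    # expected peak workload
for resource, util in predicted_utilization(forecast_x).items():
    flag = "  <-- will saturate" if util >= 1.0 else ""
    print(f"{resource:<8} predicted utilization at {forecast_x:.0f}/s: {util:5.1%}{flag}")
```

In this made-up example the DB CPU has the largest service demand, so it is the resource that will saturate first as the workload grows.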
We will discuss the various models that solve these problems, and the data you need for them, in future posts.


Friday, 24 October 2014

Little's Law

The long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or, expressed algebraically: L = λW.
Stable system:
  1. Steady state
  2. Number of arrivals equals number of departures
  3. System is not beyond saturation
λ - arrival rate:
  1. As arrivals equal departures, this is also equivalent to throughput
W - average time a customer spends in the system:
  1. This includes the time spent waiting as well as being serviced, so it is the response time
L - average number of customers in a stable system:
  1. This is the work in progress
  2. Transactions in the queue as well as being serviced
  3. Requests queued and in progress in the system
  4. Concurrent users in the system

Requests in queue + in progress = Mean response time × Throughput

In a test configured for 10,000 requests/transactions per second, if the mean response time is 10 milliseconds, then at any moment the pending + in-progress requests number 10,000 × 0.01 = 100.

While trying different configurations of a server component, the configurations in which the response time is slower for the same throughput also show higher memory consumption. This is expected from Little's law: at the same throughput, if the response time is larger, then there will be more requests in the queue and in progress, and hence higher memory consumption.

Mean response time = (Requests in queue + in progress) / Throughput

In systems where the in-queue/in-progress requests and the throughput are known, the mean response time can be calculated.

Little's law is also used to validate a performance test: if all three indicators are tracked in the test, then the relationship must hold.
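A small sketch of that validation check, assuming the three metrics have been collected from the test harness (the numbers below are illustrative):

```python
# Consistency check based on Little's Law: L = lambda * W.
# The measurements below are illustrative; in practice they come from the load test.

def littles_law_check(throughput, mean_response_time, mean_in_system, tolerance=0.05):
    """Return whether measured concurrency agrees with throughput * response time."""
    expected = throughput * mean_response_time
    deviation = abs(mean_in_system - expected) / expected
    return deviation <= tolerance, expected, deviation

# Example: 10,000 requests/s at a 10 ms mean response time should show ~100 in flight.
ok, expected, dev = littles_law_check(
    throughput=10_000,            # requests per second
    mean_response_time=0.010,     # seconds
    mean_in_system=104,           # measured concurrent requests (queued + in service)
)
print(f"expected L = {expected:.0f}, deviation = {dev:.1%}, test {'valid' if ok else 'suspect'}")
```

A large deviation usually points at a measurement problem, e.g. the load generator saturating, think times being counted into response time, or the test never reaching steady state.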
