HPE Nimble Storage Tech Blog

Understanding your individual Application's DNA

rfenton4

If you have attended a recent Nimble presentation or had a discussion with one of our technical team, you'll probably have heard us talk about Application-Centric Data Services. This is much more than an industry buzzword or a marketing phrase to communicate the use of application profiles (a concept that has been in Nimble OS since its very first release). Application-Centric Data Services is focussed on understanding the unique requirements of each application and ensuring that they are met or exceeded, so that the demands of the business are met in the most efficient manner. Some of the attributes we consider when thinking about Application-Centric Data Services are:

  • Performance and latency characteristics, and how these may change over time, both in the short term (I/O- and latency-intensive during the day, bandwidth-intensive in the evening) and in the long term (seasonality of the business or growth over time).
  • Data protection and recovery objectives, and the levels of granularity at which data may need to be restored.
  • Security.
  • Scaling for growth (or even contraction).
  • Efficiency to optimise the delivery of the application.

A DNA profile describes a human's (or animal's) genetic make-up: what makes them individual with regard to hair and eye colour, temperament, height, weight, etc.

Like humans, every application is different, and not only in the services it may consume. An application's I/O pattern realistically describes that application's DNA: its individual genetic make-up and personality.

The IT industry has for a long time made generalisations, partly because managing to the lowest common denominator is a way of scaling, but also because vendors sometimes generalise in order to optimise their products for particular workloads or solutions. Those who follow corporate storage blogs will know there has been a long-running argument over how best to understand and determine application block sizes (specifically when considering performance benchmarks for competitive purposes), a topic that was firmly put to bed by Nimble Data Scientist David Adamson's rather excellent blog on application I/O and our InfoSight studies of real customer data.

What if we could understand each application's performance DNA rather than generalising on gut feel or averages? Its type of I/O; how its latency and block size change over time; how that potentially changes over the course of a production day, a week, or a year of seasonality; what the I/O pattern looks like generically for a set or type of applications; or a detailed analysis of one specific application. Understanding such personalities would give you many insights to assist with sizing, scaling, optimising, troubleshooting and, most importantly, meeting the demands of the business. Just as two siblings may share similarities in their genetic make-up yet differ, two similar SQL applications may show very different personalities when explored in greater depth. This is possible today! Understanding this level of detail doesn't require deploying any software, expensive data collection or agents: Nimble customers can understand their application's DNA simply through InfoSight (Nimble's cloud-based predictive analytics platform).

Please allow me to introduce InfoSight Labs! Those familiar with InfoSight will know that it takes telemetry from our deployed storage arrays and uses that telemetry to give our customers unique insights into their applications and environments. It is much more than pretty graphs using a linear regression to tell you when your capacity will be full! InfoSight Labs is a new feature that lets our customers experiment and obtain unique insights and reports on many items of interest, including:

[Screenshot: InfoSight Labs report options]

Volume Performance is the area I have found fascinating, especially for understanding Application DNA. Please note that the volume used in the following example is a production volume.

First, I select the time range (remember that this data has already been collected; it's just a new perspective, so there's no waiting for data to build up) and the array (or pool) I am interested in. Next is the option to look at an individual volume/application, or to look generically at a collection of volumes or applications so we can verify similarities and averages across a set of applications.

[Screenshot: time range, array/pool and volume selection]

Then I choose the granularity and the metrics I am interested in:

[Screenshot: granularity and metric selection]

Here I've selected Operation Heatmap (DNA) and compared it with I/O Latency. The result is an overview of the application's I/O profile over a period of time (a specific SQL Server in this instance): its unique pattern of I/O and its relationship to latency:

[Screenshot: Operation Heatmap (DNA) and I/O Latency for a SQL Server volume]

Hovering over the graph provides more detail on individual data points:

[Screenshot: data-point detail shown on hover]

One can see that the histogram shows the I/O type over the course of a week (in this instance); this specific week happened to be month-end for a billing platform.
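InfoSight builds the Operation Heatmap from array telemetry, and its internal method isn't described here. As a rough illustration of the idea only, the sketch below assumes hypothetical per-operation records (timestamp, size, latency), which is not necessarily the form InfoSight collects. It bins operations into time windows and power-of-two size buckets and tracks mean latency per window. The `IORecord` type, the bucketing scheme and the one-hour default window are all illustrative choices.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class IORecord:
    """Hypothetical per-operation sample (not an InfoSight data format)."""
    timestamp_s: int    # seconds since the start of the observation period
    size_bytes: int     # transfer size of the operation
    latency_ms: float   # service time of the operation


def size_bucket_kib(size_bytes: int) -> int:
    """Round an I/O size up to the next power-of-two bucket in KiB (minimum 1 KiB)."""
    kib = max(1, -(-size_bytes // 1024))  # ceiling division to whole KiB
    return 1 << (kib - 1).bit_length()


def operation_heatmap(records, window_s: int = 3600):
    """Count operations per (time window, size bucket) and average latency per window."""
    counts = defaultdict(Counter)
    latency_total = defaultdict(float)
    for r in records:
        window = r.timestamp_s // window_s * window_s  # start of the window this op falls in
        counts[window][size_bucket_kib(r.size_bytes)] += 1
        latency_total[window] += r.latency_ms
    mean_latency = {w: latency_total[w] / sum(c.values()) for w, c in counts.items()}
    return dict(counts), mean_latency


# Made-up sample: two 8 KiB ops in the first hour, one 256 KiB op in the second.
records = [IORecord(10, 8192, 0.5), IORecord(20, 8192, 0.7), IORecord(3700, 262144, 2.0)]
heatmap, latency = operation_heatmap(records)
```

Each window's `Counter` corresponds to one column of a heatmap, with size buckets on the vertical axis and operation counts as the colour intensity.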

We can see the split of I/O between large sequential blocks and small transactional I/O. This particular application tends to write in small 8-16K I/Os and larger 128-256K blocks. From this view you can probably see why I think of this as Application DNA: the sample looks very similar to a human DNA profile.
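The small-versus-large split can be expressed as a simple calculation over bucketed operation counts. A minimal sketch, assuming a hypothetical `{size_kib: op_count}` mapping for one time window, with cut-offs chosen to match the 8-16K and 128-256K bands above (they are not InfoSight thresholds):

```python
def size_class_split(bucket_counts, small_max_kib=16, large_min_kib=128):
    """Fraction of operations that are small, medium or large.

    bucket_counts maps an I/O size bucket in KiB to an operation count.
    The default cut-offs are illustrative, chosen to match the 8-16K and
    128-256K bands described in the text; they are not InfoSight thresholds.
    """
    total = sum(bucket_counts.values())
    split = {"small": 0, "medium": 0, "large": 0}
    for size_kib, count in bucket_counts.items():
        if size_kib <= small_max_kib:
            split["small"] += count
        elif size_kib >= large_min_kib:
            split["large"] += count
        else:
            split["medium"] += count
    # Avoid dividing by zero for an idle volume.
    return {k: (v / total if total else 0.0) for k, v in split.items()}


# Made-up counts for one time window.
print(size_class_split({8: 600, 16: 200, 64: 50, 128: 100, 256: 50}))
# {'small': 0.8, 'medium': 0.05, 'large': 0.15}
```

Tracking these fractions window by window would show the kind of shift seen at month-end, for example a daytime transactional mix giving way to larger sequential I/O.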

Here's a view of a VDI datastore; as you can see, the profile is very different:

[Screenshot: Operation Heatmap for a VDI datastore]

And in fact, if I look generically across all VDI volumes in this environment, the workload above is fairly representative of the average:

[Screenshot: average profile across all VDI volumes]
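One plausible way to check that a single volume is "fairly representative of the average" is to normalise each volume's size profile, average the results, and measure how similar one volume is to that average. The sketch below does this with cosine similarity; both the approach and the `{size_kib: op_count}` input format are assumptions for illustration, not how InfoSight computes its averages.

```python
import math


def normalise(profile):
    """Scale a {size_kib: op_count} profile so its values sum to 1."""
    total = sum(profile.values())
    return {k: v / total for k, v in profile.items()} if total else {}


def average_profile(profiles):
    """Mean of normalised profiles, so busy volumes don't outweigh quiet ones."""
    avg = {}
    for p in (normalise(x) for x in profiles):
        for k, v in p.items():
            avg[k] = avg.get(k, 0.0) + v / len(profiles)
    return avg


def cosine_similarity(a, b):
    """1.0 means identical shape, 0.0 means no I/O sizes in common."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up profiles for three VDI datastores.
vdi = [{4: 70, 32: 30}, {4: 60, 32: 40}, {4: 80, 32: 20}]
avg = average_profile(vdi)
score = cosine_similarity(normalise(vdi[0]), avg)
```

Normalising first compares the shape of each workload rather than its size; what score counts as "representative" is a judgement call for each environment.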

InfoSight Labs will roll out to our customer install base over the coming weeks. Please DO NOT ask Nimble Support to enable this feature!

Send us an email to request InfoSight Labs access

If you have any specific questions, please contact me or write a comment below!

Thanks

Comments
davidbaril127

Hi Rich,

Good post.  I very much agree with your assertion about understanding your Application's DNA, but I want to add a couple of twists to the topic.

Nimble collects large amounts of performance telemetry, and sometimes this sheer bulk can lend a degree of credibility that may not exist. Yes, the data may show correlations across multiple systems, reinforcing its credibility, but it could also be reinforcing that multiple systems are doing the same types of things "poorly", perhaps because they are all taking the "defaults", for example.

The big thing that back-end statistics alone can't discern is the difference between "constructive" and "unconstructive" work in the performance metrics. If you get the opportunity to profile a "well running" application and establish a profile or footprint of that "well running" behavior, the Nimble tools can easily identify when current behavior differs from the "well running" profile, and perhaps help deduce what is misconfigured.
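The baseline idea can be sketched by comparing the shape of a current I/O size profile against one captured while the application was known to be running well. This is a minimal illustration, not a Nimble or InfoSight feature: it assumes hypothetical `{size_kib: op_count}` profiles and uses total variation distance with an arbitrary 0.2 threshold.

```python
def normalise(profile):
    """Scale a {size_kib: op_count} profile so its values sum to 1."""
    total = sum(profile.values())
    return {k: v / total for k, v in profile.items()} if total else {}


def total_variation_distance(p, q):
    """Half the L1 distance between two normalised profiles: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def deviates_from_baseline(current, baseline, threshold=0.2):
    """True if the current I/O size mix has drifted from a known well-running baseline.

    Profiles are normalised so only the shape of the workload is compared, not
    its volume. The 0.2 threshold is an arbitrary illustrative value.
    """
    return total_variation_distance(normalise(current), normalise(baseline)) > threshold


baseline = {8: 900, 64: 100}          # captured while the application ran well (made up)
same_shape = {8: 9000, 64: 1000}      # ten times busier, same mix
drifted = {4: 500, 8: 400, 64: 100}   # many small 4 KiB ops have appeared
print(deviates_from_baseline(same_shape, baseline))  # False
print(deviates_from_baseline(drifted, baseline))     # True
```

A flag like this only says that behavior has changed; working out which layer of the stack is responsible still needs investigation.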

Host file systems, buffer pool managers, page-flushing daemons, block IO schedulers and "optimizers" can all distort the IO profile leaving the host, which is what Nimble captures telemetry on. I have seen a case where, due to layered misconfigurations across multiple levels of an IO stack and the hypervisor's IO stack, a single IO from a virtualized Exchange 2010 environment was generating 28 IOs on the back-end storage. Over many months and multiple escalations across multiple hardware and software vendors, no one asked whether the amount of disk activity was reasonable for the amount of "application" workload. In this case, almost every "best practice" guideline had not been implemented, and in several cases the "default" was the polar opposite and further amplified the poor behavior.

Please note: Nimble was NOT one of the vendors involved. Hopefully, Nimble has gained enough insight to identify more quickly the "unusual" IO patterns that likely reflect a host-side misconfiguration somewhere in the IO stack. For example, Nimble already reports several alerts about host IO misalignment.

In fact, the readily available performance telemetry from Nimble systems can provide a very easy way to validate that a host IS configured efficiently, if it can be tied to the amount of constructive work being accomplished by the application.
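Tying storage telemetry to constructive work amounts to a ratio: storage IOs per unit of application work. A minimal sketch, assuming you can obtain both counts for the same interval. The function names and the per-unit budget are hypothetical, and a sensible budget would come from a well-running baseline.

```python
def io_amplification(storage_ios: int, work_units: int) -> float:
    """Storage IOs per unit of application work (per host IO, message, transaction...)."""
    if work_units <= 0:
        raise ValueError("work_units must be positive")
    return storage_ios / work_units


def within_budget(storage_ios: int, work_units: int, max_ios_per_unit: float) -> bool:
    """Hypothetical check: is disk activity reasonable for the work being done?

    max_ios_per_unit should come from a known well-running baseline; there is
    no universal correct value.
    """
    return io_amplification(storage_ios, work_units) <= max_ios_per_unit


# The Exchange case described above: 28 back-end IOs per host IO.
print(io_amplification(28_000, 1_000))                     # 28.0
print(within_budget(28_000, 1_000, max_ios_per_unit=4.0))  # False (4.0 is an arbitrary example budget)
```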

Thanks for another good post, Rich.  Keep them coming.

Dave B

rfenton4

Great insights - thanks Dave