MSA Storage

 
System Management Homep
Occasional Advisor

Incorrect Data Queue Depth Statistics

Hi all,

We deployed several small MSA configurations (for a dedicated critical application) for a customer: two disk groups (4 x SSDs) are used, one per pool. The firmware version is VL270P003-01.

Looking at the performance at different levels (from pool down to disk), we observed peaks of I/Os (not linked to the application) and a surprisingly high queue depth (the graph showed a curve at about 8,000,000!).

We decided to disable the background disk-group and disk scrub jobs and to reset all statistics (because a disk that no longer belonged to any disk group was still reporting I/Os).

Now the I/Os at disk level look normal (no more peaks at 2000/3000 I/Os caused by the scrub job), but the queue depth is strange: for some disks it is about 100 (which already seems high given the application activity and the fact that these are SSDs), and for others it is still 8,000,000. Our confidence in the performance reporting is low.

With the CLI it is not possible to get a live queue depth, while the SMU does report a queue depth (in the live performance area), and the number shown there seems correct (<3 and often 0).

Have you experienced the same issues with the queue depth reports?

Regards,

 

Max

 

Navaneetha1
HPE Pro

Re: Incorrect Data Queue Depth Statistics

Hello Max,

I understand you are looking at performance concerns and are referring to the queue depth and IOPS reported on the disks.

While a scrub is ongoing it is normal to see high IOPS on the drives.
It is not advised to disable scrub, as it is important for checking the disks for defects:
it fixes parity mismatches for RAID 5 and 6 and mirror mismatches for RAID 1 and 10.

Queue depth shows the average number of pending I/Os waiting to be processed.
However, we would require more data to investigate the reported values.

Details that are required:
> Host OS
> Whether the hosts are direct-attached or connected via a switch
> Since when these high numbers have been seen in the stats, and when exactly the issue started
> Apart from the numbers and stats you see on the MSA, are you experiencing latency on the hosts?

> Collect MSA logs and run the below commands

show controller-statistics both
show host-port-statistics
show vdisk-statistics
show volume-statistics
show disk-statistics

Save the PuTTY session output.
Run this set of commands five times in total, with a five-minute gap between passes: run the set once, wait five minutes, then run it again, and so on until you have five instances. A scripted sketch of this is shown below.
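
If it is easier, the five passes can also be scripted from a management host instead of pasting the commands in PuTTY. This is only a rough sketch; it assumes the controller answers SSH as the default 'manage' CLI user and that 'msa-a' is the controller address, so adjust both to your environment:

import subprocess, time
from datetime import datetime

HOST = "manage@msa-a"   # assumption: default CLI user and an example controller address
COMMANDS = [
    "show controller-statistics both",
    "show host-port-statistics",
    "show vdisk-statistics",
    "show volume-statistics",
    "show disk-statistics",
]

with open("msa-stats.log", "a") as log:
    for i in range(5):                              # five passes in total
        log.write("===== pass %d at %s =====\n" % (i + 1, datetime.now().isoformat()))
        for cmd in COMMANDS:
            result = subprocess.run(["ssh", HOST, cmd], capture_output=True, text=True)
            log.write("--- %s ---\n%s\n" % (cmd, result.stdout))
        if i < 4:
            time.sleep(300)                         # five-minute gap between passes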

Please log a case with HPE support with all the above details

 

I am an HPE employee
Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise




System Management Homep
Occasional Advisor

Re: Incorrect Data Queue Depth Statistics

Thank you for your reply.

We are not actually experiencing any performance issues; we are just trying to better understand how the MSA is working. With the scrub jobs disabled, we are fairly sure the MSA is only handling the I/O coming from our OpenVMS server (attached to 2 SAN switches).

Our objective is to compare the statistics provided by the MSA with the statistics provided by OpenVMS (on the OpenVMS side we are of course looking at physical I/O). Over the last 24 hours an average of 25 I/Os per disk was reported (with a peak of 200 IO/s every 4 hours tied to a backup task), whichever graph I use (disk or disk-group), with an average response time of 0.5 ms (500 µs). But when we look at the queue depth for the same period, the displayed value is about 5000 million.

So the question is: where do these figures (queue depth) come from? Is it a calculation?
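
As a rough sanity check on my side (assuming the reported value is meant to be the average number of outstanding I/Os, in which case Little's Law should roughly apply; I am not sure that is how the MSA computes it), our own numbers would give something tiny:

# Little's Law: average outstanding I/Os ~= arrival rate x average service time
iops = 25            # average I/Os per disk, assuming the 25 figure is per second
latency_s = 0.0005   # 0.5 ms average response time
print(iops * latency_s)   # ~0.0125 outstanding I/Os on average, nowhere near millions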

Concerning the scrub jobs, one improvement would be a kind of cron table to schedule them at a desired date and time. Even if it is a background activity which normally does not impact other I/O, for reporting purposes it is better to know when this task runs so that the associated period can be excluded from our customized report; a rough scheduling idea is sketched below.
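
In the meantime, the workaround we are considering is to leave the background scrub disabled and start a manual scrub from a cron job on a management host at a known time. This is only an idea, not something we have validated: the 'scrub disk-groups' command name, the 'manage' user and the disk-group names below are assumptions to verify against the CLI reference for this firmware.

# scrub_kickoff.py: run from cron on a management host (e.g. every Sunday at 02:00)
import subprocess

HOST = "manage@msa-a"                 # assumption: CLI user and controller address
DISK_GROUPS = ["dgA01", "dgB01"]      # placeholder disk-group names

for dg in DISK_GROUPS:
    # 'scrub disk-groups' is assumed to start a manual scrub; check the CLI reference
    subprocess.run(["ssh", HOST, "scrub disk-groups %s" % dg])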

I will ask the customer to open a case about Queue Depth.

Thank you

 

 

Re: Incorrect Data Queue Depth Statistics

@System Management Homep 

Yes, you are correct. I have also noticed random drives showing a queue depth of 8000, which does not look like a valid value. This is already covered in the release notes under the "Issue/Workaround" section, not specifically as a "Queue Depth" value, but like this:

"Issue: After adding disk-groups, the performance charts displayed in the SMU show incorrect data.

Workaround: Gather disk statistics using the CLI or reset the performance statistics.

"

This is mainly seen when you check the historical disk statistics values.

I would suggest always checking live data when you are measuring performance. It can be per disk, or as part of a disk-group when you are checking individual drives.

It is better to check the live queue depth in the host-port statistics as well; see the small polling sketch below.
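
For that live check, something like the small loop below can be run from a host to watch the host-port output over a few minutes (a sketch only; 'manage@msa-a' is an example address and CLI user, and the interval is arbitrary):

import subprocess, time

HOST = "manage@msa-a"   # example only; replace with your controller address and CLI user

# print the host-port statistics every 10 seconds for about 5 minutes
for _ in range(30):
    result = subprocess.run(["ssh", HOST, "show host-port-statistics"],
                            capture_output=True, text=True)
    print(result.stdout)
    time.sleep(10)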

 

Hope this helps!
Regards
Subhajit

I am an HPE employee

System Management Homep
Occasional Advisor

Re: Incorrect Data Queue Depth Statistics

Thank you Subhajit for your answer!

I also saw this issue mentioned when I read the release notes for the new firmware 'VL270P004' (we are still using the previous one), as I was looking for a potential fix.

Unfortunately, I reset all the statistics (all categories) last weekend, but the issue is still there. Live statistics via the CLI do not show the queue depth (I only see it in the SMU).

Since you have also observed such high numbers (on my side it is about 8000 million, not 8000; I checked using the CLI, but I assume you only looked at the scale, and there should be an '(M)' for million in the graph title), I am not alone with this issue.

For now, we do not have a performance issue, as the number of I/Os is very low (an average of 25 I/Os per disk). The goal is just to verify that the MSA is working correctly, and such a number does not satisfy the customer (if this number is wrong, what about the other indicators?).

I would prefer that this metric not be reported by the SMU at all, otherwise the customer raises a lot of questions (for instance, disk errors are reported without any real disk issue; another mystery).

Thanks again.

 

Regards

 

 

Re: Incorrect Data Queue Depth Statistics

@System Management Homep 

You can raise an HPE Support case; through the proper channel it will be validated and a bug will be raised to get the fix.

We generally look at the IOPS parameter, mainly at drive level, but if you have a specific requirement to show historical queue depth values for individual drives, then it is better to get it fixed; otherwise live data will be enough for performance analysis.

 

Hope this helps!
Regards
Subhajit

I am an HPE employee

System Management Homep
Occasional Advisor

Re: Incorrect Data Queue Depth Statistics

Thank you Subhajit,

I will ask the customer to open a case to get a fix for this bug or to have this metric removed.

Regards