#StackBounty: #aix #cpu-usage #sar Making sense of the output from sar on AIX

Bounty: 100

I’m trying to understand some data that has been pulled from SAR. I have three main questions about this. Ultimately, I’d like to determine how many CPUs were idle at each sampling interval across a cluster of servers.

  1. Many of the CPUs are not showing up in every entry. Is this expected, and what exactly does it mean? Is it related to #2?
  2. There are “unused” lines (CPU = U). The documentation says “U indicates the system-wide Unused capacity”. I can’t find a precise definition of “system-wide Unused capacity”, or really any definition at all. I’m not sure how to interpret a line that says something like “the unused capacity was 70% idle.”
  3. Lastly, I’m unsure how the - or all line is calculated. I would expect it to be the average of all the CPUs, but when I do the math across all CPUs, I get a vastly different answer from what is on that line. Can anyone tell me exactly what goes into that calculation? Looking closely at this related question about SAR, it appears that the system-wide idle percentage is the sum of the products of each CPU’s idle percentage and its ‘physc’ value. Unfortunately, I don’t have the physc or entc% values (assuming there is an entc%), so I can’t verify this with my own data. If that’s correct, does it mean I need the physc values to truly understand the idle percentage?
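
To make the hypothesis in #3 concrete, here is a minimal sketch of a physc-weighted idle percentage. The function name `weighted_idle`, the normalization by total physc, and the sample numbers are all assumptions made for illustration; none of them come from sar output or from AIX documentation.

```python
def weighted_idle(idle_pct, physc):
    """Weight each logical CPU's idle % by its physc value.

    Assumption: the result is normalized by the total physc so that it
    stays in the 0-100 range. This is a guess at the calculation, not
    a documented sar formula.
    """
    total = sum(physc)
    if total == 0:
        raise ValueError("total physc is zero; nothing to weight by")
    return sum(i * p for i, p in zip(idle_pct, physc)) / total


# Hypothetical values: a busy CPU that consumed most of the physical
# capacity and a nearly idle CPU that consumed very little.
idle = [10, 90]
physc = [0.9, 0.1]

print(sum(idle) / len(idle))       # plain average of the two idle values
print(weighted_idle(idle, physc))  # physc-weighted idle
```

If the weighting works this way, a mostly idle logical CPU that consumed almost no physical capacity barely moves the overall figure, which could explain why the plain average and the all line disagree so much.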

Here are some examples of what I’m seeing. These are all from the same day.

CPU | Idle    CPU | Idle    CPU | Idle
----------    ----------    ----------
0   | 8       0   | 15      0   | 17
1   | 25      1   | 94      1   | 32
2   | 79      2   | 100     2   | 97
3   | 62      3   | 99      3   | 71
4   | 5       4   | 13      4   | 5
5   | 7       5   | 13      5   | 23
6   | 6       6   | 99      6   | 71
7   | 7       7   | 44      7   | 98
8   | 11      8   | 12      8   | 48
9   | 17      9   |         9   |
10  | 33      10  |         10  |
11  | 64      11  |         11  |
12  | 6       12  | 0       12  | 38
13  | 6       13  |         13  |
14  | 6       14  |         14  |
15  | 6       15  |         15  |
16  | 12      16  | 12      16  | 37
17  | 15      17  |         17  |
18  | 62      18  |         18  |
19  | 69      19  |         19  |
20  | 7       20  | 3       20  | 42
21  | 7       21  |         21  |
22  | 6       22  |         22  |
23  | 7       23  |         23  |

case 1: average:    22, U: 80, all: 15
case 2: avg(known): 42, U: 95, all: 15
case 3: avg(known): 48, U: 97, all: 85

If I assume that the missing values are all zero for case 2, the average is 21, which seems somewhat consistent with case 1. However, if I make the same assumption for case 3, I get 24%, which is completely at odds with the 85% value that sar gives for the overall CPU idle.
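
The averages quoted above can be reproduced with a short script. This is only a sketch of the arithmetic: the idle values are copied from the table, `None` marks a CPU that was not reported, and the helper names are made up for illustration.

```python
# Idle percentages per logical CPU (0-23), copied from the table above.
# None marks a CPU that sar did not report for that interval.
M = None
case1 = [8, 25, 79, 62, 5, 7, 6, 7, 11, 17, 33, 64,
         6, 6, 6, 6, 12, 15, 62, 69, 7, 7, 6, 7]
case2 = [15, 94, 100, 99, 13, 13, 99, 44, 12, M, M, M,
         0, M, M, M, 12, M, M, M, 3, M, M, M]
case3 = [17, 32, 97, 71, 5, 23, 71, 98, 48, M, M, M,
         38, M, M, M, 37, M, M, M, 42, M, M, M]


def avg_known(values):
    """Average over the CPUs that were actually reported."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known)


def avg_missing_as_zero(values):
    """Average over all CPUs, treating unreported ones as 0% idle."""
    return sum(v or 0 for v in values) / len(values)


for name, case in (("case 1", case1), ("case 2", case2), ("case 3", case3)):
    print(name, round(avg_known(case)), round(avg_missing_as_zero(case)))
```

Neither way of averaging gets close to the all value of 85 in case 3, which is the discrepancy described above.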

I don’t fully understand why some CPUs are not reported at every sampling point, and the missing ones are not evenly distributed, as the examples above show. Also, from reading this redbook, I take it that these must be logical CPUs, and I suspect that without the physc numbers there’s not much I can do with these values. I’ve tried to use the U value in various equations, but I haven’t found anything sensible. It’s not even clear to me that the overall idle percentage can be taken at face value.

NOTE: “There is something wrong with the capture of this data from sar” is a completely valid answer for #1, if it is the case that every CPU should always be reported.

