1.2 Profile Terminology

As mentioned in the previous section, while collecting full trace data results in a more accurate picture of a program's runtime behavior, profile data can be collected more efficiently and managed much more easily. Since profile data is essentially a statistical description of a program's runtime performance, a natural question to ask is

How exactly is this performance data being summarized?

While there are many different methods one can use to summarize performance data, a few popular methods show up across different tools, and we describe those in this section. We feel that it is important to understand these terms so that the profile data reported by PPW can be correctly interpreted.

Where possible, we've used the same terms found in the literature to describe the concepts in this section, although some terms do vary slightly from author to author.

1.2.1 Flat and Full Profiles

Within the category of profiling tools, there are variations in how profile data is collected with respect to a program's callstack. Traditionally, profile data is tracked with respect to the topmost entry on the callstack, which gives a flat profile. Flat profiles keep track of time spent in each function, but do not keep track of the relationships between functions. For instance, a flat profile will be able to tell you that your program spent 25.2 seconds executing function ‘A’, but will not be able to tell you that ‘A’ ran for 10.5 seconds when called from ‘B’ and 14.7 seconds when called from ‘C’. In other words, a flat profile tallies time spent with respect to functions rather than function callstacks.

A path profile (also known as a callpath profile) is generated by another method of collecting profile data, in which statistical information is kept with respect to entire function callstacks. A path profile tracks time spent in each function path rather than just time spent in each function. It is important to point out that a flat profile can be constructed from a path profile, but not vice versa. Path profiles contain much more useful information, at the cost of higher implementation complexity and more storage space for the profile data.

A good way to think of the difference between a flat profile and a path profile is to logically envision how data is recorded under each scenario. Assume we have the following C program:

     #include <unistd.h>  /* for sleep() */
     
     void B() {
       sleep(1);
     }
     
     void A() {
       sleep(1);
       B();
     }
     
     int main() {
       A();        /* calls A, which in turn calls B */
       sleep(1);
       B();        /* first call to B from main */
       B();        /* second call to B from main */
       return 0;
     }

In the program above, we see that ‘main’ calls ‘A’, which calls ‘B’. ‘main’ then calls ‘B’ twice, and finally finishes executing.

If we were constructing a flat profile for the above program, we would keep a timer associated with each function, starting the timer when the function began executing and stopping the timer when the function returned. Therefore, we would have a total of three timers: a timer for ‘main’, a timer for ‘A’, and a timer for ‘B’. It is also important to note here that since we are creating an execution profile, we do not create a new timer for each function each time it executes; rather, we continue tallying with our existing timer if a function is executed more than once.

If we were constructing a path profile for the above program, we would choose which timer to use based on all of the functions on the callstack rather than just the currently-executing function. Following the execution path above, we would end up with four timers instead of three: ‘main’, ‘main - A’, ‘main - A - B’, and ‘main - B’. There will be one timer for each distinct callstack, and since we are generating profile data, our timers are reused, as with flat profiles.

In a similar vein, one can get even more detailed performance information by tracking data with respect to a sequence of functions plus callsite information, rather than a sequence of function names alone. Profiles based on callsites are sometimes called callsite profiles. Continuing with our example above, we would end up with five timers: ‘main’ with no callsite, ‘main - A’ with a callsite in ‘main’, ‘main - A - B’ with a callsite in ‘A’, ‘main - B’ with one callsite in ‘main’, and ‘main - B’ with a second callsite in ‘main’. In short, we end up with nearly the same group of timers as with a path profile, except that we gain an additional timer for ‘main - B’ because it is called from two different lines of code within ‘main’.
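
To make this bookkeeping concrete, below is a minimal sketch in C of how a profiler might accumulate time into keyed timers. It is purely illustrative and is not how PPW's measurement code is implemented; the choice of key determines the kind of profile, with the function name alone yielding a flat profile, the full callstack a path profile, and the callstack plus callsite a callsite profile.

     #include <stdio.h>
     #include <string.h>
     
     struct timer { char key[64]; double seconds; };
     static struct timer timers[64];
     static int num_timers;
     
     /* Charge secs to the accumulating timer for a key, creating the
        timer on first use; timers are reused when a key repeats. */
     static void tally(const char *key, double secs) {
       for (int i = 0; i < num_timers; i++)
         if (strcmp(timers[i].key, key) == 0) { timers[i].seconds += secs; return; }
       strcpy(timers[num_timers].key, key);
       timers[num_timers++].seconds = secs;
     }
     
     int main(void) {
       /* Replay the example program, charging each sleep(1) second
          to the full callstack (a path profile). */
       tally("main - A", 1.0);       /* sleep inside A          */
       tally("main - A - B", 1.0);   /* B called from A         */
       tally("main", 1.0);           /* sleep inside main       */
       tally("main - B", 1.0);       /* first call from main    */
       tally("main - B", 1.0);       /* second call: same timer */
     
       for (int i = 0; i < num_timers; i++)
         printf("%-14s %4.1f s\n", timers[i].key, timers[i].seconds);
       /* Keying on the function name alone ("A", "B", ...) would give
          a flat profile; appending the callsite to the key would split
          "main - B" into two timers, one per calling line. */
       return 0;
     }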

Note that the TAU performance tool framework uses a slightly different definition of the term “callpath profile”. In TAU's version of callpath profiles, timers are differentiated by looking at a maximum of N entries of the callstack, starting from the currently-executing function and working back toward the root. A TAU callpath profile with a depth of two for the example given above would have the following timers: ‘main’, ‘main - A’, ‘main - B’, and ‘A - B’. TAU uses the term calldepth profile to refer to PPW's path profiles, which are really just a special case of TAU callpath profiles with an infinite depth.

PPW's measurement code always collects full path profiles rather than flat profiles, and those path profiles include callsite information (i.e., they are callsite profiles).

In PPW, the profile table visualization shows a flat profile and the tree table visualization shows path profiles. The flat profile information is calculated from the full callpath profile, and timers are grouped together by region where appropriate (when they have no subcalls).
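
As a hedged sketch of how a flat view can be derived from a path profile (again our own illustration, not PPW's code), one can collapse each callpath key to its final region and sum the matching exclusive times:

     #include <stdio.h>
     #include <string.h>
     
     /* Collapse a path key like "main - A - B" to its last region "B". */
     static const char *leaf(const char *path) {
       const char *p = strrchr(path, '-');
       return p ? p + 2 : path;   /* skip the "- " separator if present */
     }
     
     int main(void) {
       /* Path-profile exclusive times from the earlier example. */
       const char *paths[] = { "main", "main - A", "main - A - B", "main - B" };
       double      excl[]  = { 1.0, 1.0, 1.0, 2.0 };
     
       /* Sum exclusive time per leaf region to obtain the flat view. */
       const char *names[8]; double flat[8]; int n = 0;
       for (int i = 0; i < 4; i++) {
         int j = 0;
         while (j < n && strcmp(names[j], leaf(paths[i])) != 0) j++;
         if (j == n) { names[n] = leaf(paths[i]); flat[n++] = 0.0; }
         flat[j] += excl[i];
       }
       for (int i = 0; i < n; i++)
         printf("%-4s %.1f s\n", names[i], flat[i]);   /* main 1.0, A 1.0, B 3.0 */
       return 0;
     }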

1.2.2 Phases and Regions

There are many definitions of the term program phase, but for the purposes of this manual we use the term to describe a time interval in which a program is performing a particular activity. For a linear algebra application, example phases might include initializing a matrix, computing eigenvalues, computing a matrix-vector product, collecting results from all nodes, formatting output, and performing disk I/O to write the results of all computations to disk. Each program phase generally has different performance characteristics, and for this reason it is often useful to treat each phase as a separate entity during the performance tuning process.

The idea of keeping track of timers for each function can be extended to track arbitrary sections of program code. A program region, also called a region, is a generalization of the function concept that may include loops and sections of functions. Additionally, regions may span groups of functions. The concept of a region is useful for attributing performance information to particular phases of program execution.

When working with regions, it is possible to have a region that contains other regions, such as a ‘for’ loop within a function. These regions are referred to as subregions, because they are regions contained within another region.

In most cases, the terms region and function can be used interchangeably. PPW and the rest of this manual use the more general term region instead of function; feel free to mentally substitute “function” for “region” and “function call” for “subregion call” when reading this manual.

To track phase data and arbitrary regions of code, PPW exposes a user-level measurement API (see API Reference for details on how to use this API within your programs).
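
As a purely hypothetical illustration of what user-level measurement looks like in general, the sketch below brackets a section of code with begin/end calls. The function names here are invented stand-ins, not PPW's real API; consult the API Reference for the actual function names.

     #include <stdio.h>
     
     /* Hypothetical begin/end calls; these names are invented for
        illustration and are NOT PPW's actual API. See the API
        Reference for the real measurement functions. */
     static void example_region_begin(const char *name) { printf("begin %s\n", name); }
     static void example_region_end(const char *name)   { printf("end   %s\n", name); }
     
     #define N 1024
     static double row[N];
     
     int main(void) {
       /* A region need not be an entire function; here it spans a loop. */
       example_region_begin("matrix init");
       for (int i = 0; i < N; i++)
         row[i] = 0.0;
       example_region_end("matrix init");
       return 0;
     }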

When compiling with the --inst-functions option to ‘ppwcc’, ‘ppwshmemcc’, or ‘ppwupcc’, PPW will automatically instrument your program to track function entry and exit, provided the underlying compiler supports this (the --inst-functions option is not supported with all compilers). In this case, regions representing the functions in your program will be created automatically at runtime by PPW's measurement code. See ppwcc, ppwshmemcc, and ppwupcc for more information on those commands.

PPW always creates a toplevel region named ‘Application’ that keeps track of the total execution time of the program.

1.2.3 Inclusive and Exclusive Times

Profile data may also differentiate between time spent executing code within a region itself and time spent in calls made from that region to other regions. Time spent executing code in the region itself is referred to as exclusive time or self time. Time spent within the region and any subregion calls (i.e., function calls) is referred to as inclusive time or total time. The inclusive/exclusive terms can be easily differentiated with the following sentence:

Exclusive time for function ‘A’ is the time spent executing statements in the body of ‘A’, while inclusive time is the total time spent executing ‘A’ including any subroutine calls.

PPW uses the self/total terms because they are easier to remember: self time is only the time taken by the region itself, and total refers to all the time taken by a region including any subregions or calls to other regions.
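
The relationship between the two quantities is simple: self (exclusive) time equals total (inclusive) time minus the inclusive time of every direct subregion call. Below is a minimal sketch using a call-tree node type of our own invention, not a PPW data structure:

     #include <stdio.h>
     
     /* Illustrative call-tree node (not a PPW data structure). */
     struct node {
       const char  *name;
       double       inclusive;    /* total time, children included */
       int          nchildren;
       struct node *children[8];
     };
     
     /* self time = total time minus the total time of direct children */
     static double exclusive(const struct node *n) {
       double t = n->inclusive;
       for (int i = 0; i < n->nchildren; i++)
         t -= n->children[i]->inclusive;
       return t;
     }
     
     int main(void) {
       struct node b = { "B", 1.0, 0, { 0 } };
       struct node a = { "A", 2.0, 1, { &b } };   /* A's own sleep plus B */
       printf("%s: inclusive %.1f s, exclusive %.1f s\n",
              a.name, a.inclusive, exclusive(&a)); /* 2.0 and 1.0 */
       return 0;
     }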

1.2.4 Other Profile Statistics

Sometimes it is useful to know how many times a particular region was executed, or how many times a region made calls to other regions or executed subregions within that region. Such statistics are useful in identifying functions that might benefit from inlining. These terms are usually known as calls and sub calls, although some other tools use the term count instead.

Many times, when troubleshooting a load-balancing problem in which a region of code has input-sensitive execution time, it is useful to know the minimum and maximum time spent executing a particular region. Tracking min time and max time can be done using either inclusive or exclusive time, but most tools track min and max statistics for inclusive time since it is generally easier to interpret.

In addition to calls and min/max time, other summary statistics about program execution can also be collected, including standard deviation of inclusive times and average exclusive or inclusive time (which can be derived from other statistics).
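
As a rough sketch (the struct below is our own illustration, not PPW's internal layout), all of these statistics fit in a small per-timer record that is updated each time a region exits, with the average derived afterward from the sum and the count:

     #include <float.h>
     #include <stdio.h>
     
     /* Illustrative per-timer statistics record (not PPW's layout). */
     struct stats {
       unsigned long calls;      /* times the region executed     */
       unsigned long subcalls;   /* calls made from the region    */
       double sum, min, max;     /* inclusive times, in seconds   */
     };
     
     /* Update the record when the region exits with inclusive time t. */
     static void on_region_exit(struct stats *s, double t) {
       s->calls++;
       s->sum += t;
       if (t < s->min) s->min = t;
       if (t > s->max) s->max = t;
     }
     
     int main(void) {
       struct stats s = { 0, 0, 0.0, DBL_MAX, 0.0 };
       on_region_exit(&s, 2.5);
       on_region_exit(&s, 7.5);
       printf("calls=%lu min=%.1f max=%.1f avg=%.1f\n",
              s.calls, s.min, s.max, s.sum / s.calls);
       return 0;
     }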

PPW does keep track of call and sub call counts, in addition to min and max time. However, for overhead management reasons, PPW does not track any other statistics. If you'd like to see PPW track other statistics, please file a bug report for a feature enhancement using the Bugzilla website.

1.2.5 Aggregating Profile Data

Armed with the terms above, we can now discuss one of the stranger topics relating to profile data: how to interpret profile data spanning different nodes. While tools can simply display profile data for each node separately, this becomes impractical after only a few nodes. Instead, most tools choose to aggregate the data in some manner, combining it using one of several techniques.

The most straightforward method of aggregating data from different nodes is to simply sum together timers that have the same callpath. When summing profile data in this manner, the resulting profile gives you a good overall picture of how time (or whatever metric was collected) was spent in your application across every node. Interpreting summed profile data is fairly straightforward, as it will show any regions of code that contributed a significant amount to overall runtime. In addition, summed profile data will also reveal any costly synchronization constructs that sap program efficiency.

Other aggregation methods include taking the min, max, or average metric value across all timers with the same callpath. These techniques give performance data that is representative of a single node in the system, instead of a summary of data across all nodes. While aggregating the data using these techniques can give you a little more insight into the distribution of values among regions in your program, the resulting data can often look slightly strange.
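
The following sketch shows all four techniques applied to one callpath's metric values collected from several nodes; it is illustrative only, not PPW's implementation:

     #include <stdio.h>
     
     enum how { SUM, MIN, MAX, AVG };
     
     /* Aggregate one metric across nodes for a single callpath. */
     static double aggregate(const double *per_node, int n, enum how h) {
       double r = per_node[0];
       for (int i = 1; i < n; i++) {
         if (h == SUM || h == AVG) r += per_node[i];
         else if (h == MIN && per_node[i] < r) r = per_node[i];
         else if (h == MAX && per_node[i] > r) r = per_node[i];
       }
       return h == AVG ? r / n : r;
     }
     
     int main(void) {
       /* Exclusive times for region 'A' from the two-node example below. */
       double a[] = { 7.5, 2.5 };
       printf("sum=%.1f min=%.1f max=%.1f avg=%.1f\n",
              aggregate(a, 2, SUM), aggregate(a, 2, MIN),
              aggregate(a, 2, MAX), aggregate(a, 2, AVG));
       return 0;
     }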

For example, let's assume you have a simple program with three functions ‘main’, ‘A’, and ‘B’. In this example, ‘main’ makes a single call to both ‘A’ and ‘B’ and does not do anything aside from calling ‘A’ and ‘B’. A flat profile for this example might look like this (with times reported in seconds):

     Node     Region      Inclusive time   Exclusive time
     ----------------------------------------------------
     1        main        10.0             0.0
     1        A            7.5             7.5
     1        B            2.5             2.5
     
     2        main        10.0             0.0
     2        A            2.5             2.5
     2        B            7.5             7.5

If we aggregate using summing, the resulting profile would look like this:

     Region      Inclusive time   Exclusive time
     --------------------------------------------
     main        20.0              0.0
     A           10.0             10.0
     B           10.0             10.0

which makes sense, although it glosses over the fact that ‘A’ and ‘B’ took different amounts of time to execute on different nodes. By aggregating data together we always lose some of these details, although tools that provide a per-node breakdown of an aggregated metric will let you reconstruct this information.

Now let's look at what the data will look like if we aggregate using the max values:

     Region      Inclusive time   Exclusive time
     --------------------------------------------
     main        10.0              0.0
     A            7.5              7.5
     B            7.5              7.5

This data set definitely looks much stranger, especially if you consider that it is telling you that the sum of time spent in ‘A’ and ‘B’ is greater than all time spent in ‘main’. However, this data set also tells us that both ‘A’ and ‘B’ took a maximum of 7.5 seconds to execute on at least one node, which is useful to know: the time can be treated as a “worst-case” time across all nodes.

A similar thing happens when we aggregate using min values:

     Region      Inclusive time   Exclusive time
     --------------------------------------------
     main        10.0              0.0
     A            2.5              2.5
     B            2.5              2.5

As with the max aggregation example, the aggregated data looks somewhat unintuitive, but the min values give us the “best-case” time for executing each region across all nodes.

When aggregating data using path profiles rather than flat profiles, these oddities make the resulting data set even harder to interpret properly. Since a function hierarchy can be reconstructed from path profile information, a tool can feasibly “fix” the aggregation by recalculating inclusive times in a bottom-up fashion based on the new exclusive timing information. However, after “fixing” this data, one could argue that the data set is no longer representative of the original program run.
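
Below is a short sketch of that bottom-up recalculation, again using an invented node type rather than anything from PPW; each node's inclusive time is recomputed as its aggregated exclusive time plus the recomputed inclusive times of its children:

     #include <stdio.h>
     
     /* Illustrative profile-tree node (not a PPW data structure). */
     struct pnode {
       double exclusive;          /* aggregated exclusive time */
       double inclusive;          /* recomputed below          */
       int nchildren;
       struct pnode *children[8];
     };
     
     /* Post-order walk: inclusive = exclusive + children's inclusive. */
     static double fix_inclusive(struct pnode *n) {
       n->inclusive = n->exclusive;
       for (int i = 0; i < n->nchildren; i++)
         n->inclusive += fix_inclusive(n->children[i]);
       return n->inclusive;
     }
     
     int main(void) {
       /* Max-aggregated exclusive times from the earlier tables. */
       struct pnode a = { 7.5, 0.0, 0, { 0 } };
       struct pnode b = { 7.5, 0.0, 0, { 0 } };
       struct pnode m = { 0.0, 0.0, 2, { &a, &b } };
       printf("fixed inclusive time for main: %.1f s\n", fix_inclusive(&m));
       /* Prints 15.0; internally consistent, but main never actually
          ran for 15 seconds on any single node. */
       return 0;
     }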

To summarize, the summing aggregation technique is the most useful because the resulting data is simply a summary of all node data in the system. The min and max aggregation techniques can be used to get an idea of the best- and worst-case performance data that could be expected from any node in the system, and the averaging technique can be used to get an idea of nominal performance data for any given node in the system.

As mentioned before, given a path profile, we can use aggregation techniques to derive a flat profile. In this case, it only makes sense to use a summing aggregation, as the min/max/average techniques make the resulting data set nonsensical.

PPW offers all four aggregation techniques described here, but uses summing as the default aggregation method since it is the easiest to make sense of. For the profile table visualization, PPW uses the summing aggregation technique on single-node path profile data. Additionally, when aggregating other profile statistics such as calls and max inclusive time, PPW uses the expected method (summing counts, taking the absolute min of all minimum times and the absolute max of all maximum times, etc.). Also, when aggregating path profiles, PPW does not attempt to “fix” inclusive times and instead shows the inclusive times generated by the aggregation method itself.