Code Profiling

Intel VTune Amplifier

The Intel VTune Amplifier XE performance profiler is an analysis and tuning tool that provides predefined analysis configurations, called experiments, to address various performance questions. Among them, the hotspots experiment helps you identify the most time-consuming parts of your code and provides call stack information down to the level of individual source lines.

Complete these steps to prepare for profiling your code:

  1. Add the -g option to your usual set of compiler flags in order to generate a symbol table, which VTune uses during analysis. Keep the same optimization level that you intend to use in production (see the compile example after this list).
  2. Start a PBS interactive session or submit a PBS batch job.
  3. Load a VTune module in the interactive PBS session or PBS script. Depending on the installation available on your system, this is either:
    module load vtune/2015_u1
    or:
    module load intel-tools-16
    . /opt/intel/impi/5.1/vtune_amplifier_xe_2016.1.1.434111/amplxe-vars.sh
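For step 1, a typical compile line might look like the following; the compiler (ifort), the -O2 level, and the file names are only illustrative, so keep whatever flags you use for your production builds:

ifort -O2 -g -o a.out myprog.f90    # -g adds the symbol table; the optimization level is unchanged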

You can now run an analysis, as described in the next section.

Running a Hotspots Analysis

Run the amplxe-cl command line that is appropriate for your code, as listed below. Use the -collect (or -c) option to run the hotspots collection and the -result-dir (or -r) option to specify a directory.

Note: If you do not use -result-dir (or -r) to specify a directory, the analysis results are stored in a default directory named r000hs, r001hs, and so on.

Running a Hotspots Analysis on Serial or OpenMP Code
To profile a serial or OpenMP application, run:

amplxe-cl -collect hotspots -result-dir <result_dir> ./a.out
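For an OpenMP build of the application, remember to set the thread count before collecting data; the value of 20 threads below is only an example:

export OMP_NUM_THREADS=20
amplxe-cl -collect hotspots -result-dir <result_dir> ./a.out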

Viewing the Results

Using the VTune Graphical User Interface (GUI)
You can use the VTune GUI, amplxe-gui, to profile your application interactively or to analyze previously collected results. For example, you can collect data with amplxe-cl during a PBS session and then analyze the results on a login node with amplxe-gui:

cd go/to/work/directory
amplxe-gui r000hs
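If you would rather stay on the command line (for example, when no X display is available), amplxe-cl can also print a text report from an existing result directory; r000hs below is the default result directory name used in the example above:

amplxe-cl -report hotspots -result-dir r000hs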

Perfcatcher

perfcatch runs an MPI or SHMEM program with a wrapper profiling library that writes profiling information about communication and synchronization calls to a summary file when the program completes.
By default, the MPI profiling result file, MPI_PROFILING_STATS, is created in the current working directory.

To use perfcatch with an SGI Message Passing Toolkit (MPT) MPI program, insert the perfcatch command in front of the executable name.

  • See the manual page (man perfcatch) for more documentation.
[homer@nbody_particle]$ qsub -I -q express -l walltime=00:60:00 -l select=1:ncpus=20:mpiprocs=20 # start an interactive job with 20 CPUs
[homer@nbody_particle]$ module add mpt/2.10 perfcatcher # load the MPT and perfcatcher modules
[homer@nbody_particle]$ mpirun -n 10 perfcatch ./nbody 30000 # small test on one node; use mpiexec_mpt inside a PBS script for a production run
[homer@nbody_particle]$ cat MPI_PROFILING_STATS # view the statistics
============================================================
PERFCATCHER version 25
(C) Copyright SGI.  This library may only be used
on SGI hardware platforms. See LICENSE file for
details.
============================================================
MPI/SHMEM program profiling information
Job profile recorded:         Mon Jan 18 22:16:45 2016
Program command line:         ./nbody 30000
Total MPI/SHMEM processes:    10

Total MPI/SHMEM job time, avg per rank                      17.9475 sec
Profiled job time, avg per rank                             17.9475 sec
Percent job time profiled, avg per rank                     100%

Total user time, avg per rank                               17.958 sec
Percent user time, avg per rank                             100.059%
Total system time, avg per rank                             0 sec

Time in all profiled MPI/SHMEM functions, avg per rank      0.0658407 sec
Percent time in profiled MPI/SHMEM functions, avg per rank  0.366853%

Rank-by-Rank Summary Statistics
-------------------------------

Rank-by-Rank: Percent in Profiled MPI/SHMEM functions
    Rank:Percent
    0:0.00812736%    1:0.221985%    2:0.351557%    3:0.428641%
    4:0.457613%    5:0.419803%    6:0.415641%    7:0.466786%
    8:0.451781%    9:0.446592%
  Least:  Rank 0      0.00812736%
  Most:   Rank 7      0.466786%
  Load Imbalance:  0.229878%
...

MPInside

MPInside is an MPI performance data collection tool that provides many options for profiling MPI/OpenMP code.

By default (with no environment variables set), MPInside creates at least one file, named mpinside_stats. This file contains five sets of columns that can easily be imported into a spreadsheet such as Excel:

  • Set 1: time outside MPI plus the timing of all MPI functions
  • Sets 2-3: named Ch_send-R_send; the amount of characters transmitted plus the number of requests with the Send attribute
  • Sets 4-5: named ch_recv-R_recv; same as Sets 2-3 but with the Recv attribute
  • See the manual page (man MPInside) for more documentation.
[homer@nbody_particle]$ qsub -I -q express -l walltime=00:60:00 -l select=1:ncpus=20:mpiprocs=20 # start an interactive job with 20 CPUs
[homer@nbody_particle]$ module load mpt MPInside/3.6.6 # load the MPT and MPInside modules
[homer@nbody_particle]$ mpirun -n 10 MPInside ./nbody 30000 # small test; use mpiexec_mpt inside a PBS script for a production run
[homer@nbody_particle]$ more mpinside_stats # view the results
MPInside 3.6.6 standard(Apr  3 2015 02:28:42) Input variables:

 >>> column meanings <<<<
Init:    MPI_Init
Waitall:    MPI_Waitall
Isend:    MPI_Isend
Irecv:    MPI_Irecv
Allreduce:    Calls sending data+=comm_sz;Bytes received+=count,Calls receiving data++
Allgather:    Bytes sent+=sendcount,Calls sending data+=comm_sz;Bytes received+=recvcount,Calls receiving data++
Cart_create:    MPI_Cart_create
Cart_shift:    MPI_Cart_shift
mpinside_overhead:    General MPInside internal overhead

>>>> Communication time totals (s) 0 1<<<<
CPU    Compute    Init    Waitall    Isend    Irecv    Allreduce    Allgather    Cart_create    Cart_shift    mpinside_overhead
---    ------    General    Completion    Point-to-point    Point-to-point    Collective    Collective    General    General    None
0000    17.9576     0.0001     0.0040     0.0002     0.0010     0.0133     0.0012     0.0133     0.0000     0.0002
0001    17.9196     0.0001     0.0013     0.0002     0.0002     0.0548     0.0012     0.0298     0.0000     0.0001
0002    17.8894     0.0001     0.0013     0.0002     0.0004     0.0849     0.0012     0.0276     0.0000     0.0001
0003    17.8655     0.0001     0.0020     0.0001     0.0003     0.1082     0.0012     0.0130     0.0000     0.0001
0004    17.8668     0.0001     0.0007     0.0002     0.0008     0.1077     0.0012     0.0129     0.0000     0.0001
0005    17.8907     0.0001     0.0009     0.0002     0.0011     0.0833     0.0012     0.0001     0.0000     0.0001
0006    17.8697     0.0001     0.0026     0.0002     0.0005     0.1032     0.0012     0.0261     0.0000     0.0001
0007    17.8689     0.0001     0.0018     0.0002     0.0005     0.1048     0.0012     0.0132     0.0000     0.0001
0008    17.8739     0.0001     0.0009     0.0002     0.0007     0.1004     0.0012     0.0146     0.0000     0.0001
0009    17.8659     0.0001     0.0037     0.0001     0.0008     0.1056     0.0012     0.0126     0.0000     0.0001

>>>> Mbytes sent <<<<
CPU    Compute    Init    Waitall    Isend    Irecv    Allreduce    Allgather    Cart_create    Cart_shift    mpinside_overhead
0000    ------          0          0          8          0          0          0          0          0          0
0001    ------          0          0          8          0          0          0          0          0          0
0002    ------          0          0          8          0          0          0          0          0          0
0003    ------          0          0          8          0          0          0          0          0          0
0004    ------          0          0          8          0          0          0          0          0          0
0005    ------          0          0          8          0          0          0          0          0          0