OProfile manual

John Levon


Table of Contents

1. Introduction
1. Applications of OProfile
2. System requirements
3. Internet resources
4. Installation
5. Uninstalling OProfile
2. Overview
1. Getting started
2. Tools summary
3. Controlling the profiler
1. Using opcontrol
1.1. Examples
1.1.1. Intel performance counter setup
1.1.2. RTC mode
1.1.3. Starting the daemon separately
1.1.4. Separate profiles for libraries and the kernel
1.1.5. Profiling sessions
1.2. Specifying performance counter events
2. Using oprof_start
3. Configuration details
3.1. Hardware performance counters
3.2. OProfile in RTC mode
3.3. OProfile in timer interrupt mode
3.4. Pentium 4 support
3.5. Intel Itanium 2 support
3.6. PowerPC64 support
3.7. Dangerous counter settings
4. Obtaining results
1. Profile specifications
1.1. Examples
1.2. Profile specification parameters
1.3. Locating and managing binary images
1.4. What to do when you don't get any results
2. Image summaries and symbol summaries (opreport)
2.1. Merging separate profiles
2.2. Side-by-side multiple results
2.3. Callgraph output
2.4. Differential profiles with opreport
2.5. Anonymous executable mappings
2.6. Options for opreport
3. Outputting annotated source (opannotate)
3.1. Locating source files
3.2. Usage of opannotate
4. gprof-compatible output (opgprof)
4.1. Usage of opgprof
5. Archiving measurements (oparchive)
5.1. Usage of oparchive
6. Converting sample database files (opimport)
6.1. Usage of opimport
5. Interpreting profiling results
1. Profiling interrupt latency
2. Kernel profiling
2.1. Interrupt masking
2.2. Idle time
2.3. Profiling kernel modules
3. Interpreting call-graph profiles
4. Inaccuracies in annotated source
4.1. Side effects of optimizations
4.2. Prologues and epilogues
4.3. Inlined functions
4.4. Inaccuracy in line number information
5. Assembly functions
6. Other discrepancies
6. Acknowledgments

This manual applies to OProfile version 0.9.2. OProfile is a profiling system for Linux 2.2/2.4/2.6 systems on a number of architectures. It is capable of profiling all parts of a running system, from the kernel (including modules and interrupt handlers) to shared libraries to binaries. It runs transparently in the background, collecting information with low overhead. These features make it ideal for profiling entire systems to determine bottlenecks in real-world systems.

Many CPUs provide "performance counters", hardware registers that can count "events"; for example, cache misses or CPU cycles. OProfile provides profiles of code based on the number of these events occurring: every time a certain (configurable) number of events has occurred, the PC value is recorded. This information is aggregated into profiles for each binary image.

Some hardware setups do not allow OProfile to use performance counters: in these cases, no events are available, and OProfile operates in timer/RTC mode, as described in later chapters.

Linux kernel 2.2/2.4/2.6

OProfile uses a kernel module that can be compiled for 2.2.11 or later and 2.4. 2.4.10 or above is required if you use the boot-time kernel option nosmp. 2.6 kernels are supported with the in-kernel OProfile driver. Note that only 32-bit x86 and IA64 are supported on 2.2/2.4 kernels.

2.6 kernels are strongly recommended. Under 2.4, OProfile may cause system crashes if power management is used, or the BIOS does not correctly deal with local APICs.

PPC64 processors (Power4/Power5/PPC970) require a recent (> 2.6.5) kernel with the line #define PV_970 present in include/asm-ppc64/processor.h.

modutils 2.4.6 or above

You should have modutils 2.4.6 or higher installed (in practice, earlier versions work well in almost all cases).

Supported architecture

For Intel IA32, a CPU with either a P6 generation or Pentium 4 core is required. In marketing terms this translates to anything between an Intel Pentium Pro (not Pentium Classics) and a Pentium 4 / Xeon, including all Celerons. The AMD Athlon and Duron CPUs are also supported. Other IA32 CPU types only support the RTC mode of OProfile; please see later in this manual for details. Hyper-threaded Pentium IVs are not supported in 2.4. For 2.4 kernels, the Intel IA-64 CPUs are also supported. For 2.6 kernels, there is additionally support for Alpha processors, MIPS, ARM, x86-64, sparc64, ppc64, and, in timer mode, PA-RISC and s390.

Uniprocessor or SMP

SMP machines are fully supported.

Required libraries

These libraries are required: popt, bfd, liberty (Debian users: libiberty is provided in the binutils-dev package), dl, plus the standard C++ libraries.

OProfile GUI

The use of the GUI to start the profiler requires the Qt 2 library. Qt 3 should also work.

ELF

Probably not too strenuous a requirement, but older a.out binaries/libraries are not supported.

K&R coding style

OK, so it's not really a requirement, but I wish it was...

Web page

There is a web page (which you may be reading now) at http://oprofile.sf.net/.

Download

You can download a source tarball or get anonymous CVS at the sourceforge page, http://sf.net/projects/oprofile/.

Mailing list

There is a low-traffic OProfile-specific mailing list, details at http://sf.net/mail/?group_id=16191.

Bug tracker

There is a bug tracker for OProfile at SourceForge, http://sf.net/tracker/?group_id=16191&atid=116191.

IRC channel

Several OProfile developers and users sometimes hang out on channel #oprofile on the freenode network.

First you need to build OProfile and install it. ./configure, make, make install is often all you need, but note the following caveats:

  • For 2.4 kernels, you need configured kernel source for the running kernel in order to build the module. Since distributions ship many different kernels, it is unlikely that the running kernel matches the configured source you installed; the safest approach is to compile your own kernel, run it, and then compile OProfile against it.

  • If you have a uniprocessor machine, it is also recommended that you enable local APIC / IO_APIC support in your kernel (this is automatically enabled for SMP kernels). With many BIOSes, a kernel >= 2.6.9, and a UP kernel, enabling the local APIC is not sufficient: you must also turn it on explicitly at boot time by passing the "lapic" option to the kernel.

  • On machines with power management, such as laptops, the power management must be turned off when using OProfile with 2.4 kernels. The power management software in the BIOS cannot handle the non-maskable interrupts (NMIs) used by OProfile for data collection.

  • If you use the NMI watchdog, be aware that the watchdog is disabled when profiling starts, and not re-enabled until the OProfile module is removed (or, in 2.6, until OProfile stops running).

  • If you compile OProfile for a 2.2 kernel, you must be root to compile the module.

  • If you are using a 2.6 or later kernel, you do not need kernel source, as long as the OProfile driver is enabled; additionally, you should not need to disable power management.

Please note that you must save or have available the vmlinux file generated during a kernel compile, as OProfile needs it (you can use --no-vmlinux, but this will prevent kernel profiling).

Before you can use OProfile, you must set it up. The minimum setup required for this is to tell OProfile where the vmlinux file corresponding to the running kernel is, for example:

opcontrol --vmlinux=/boot/vmlinux-`uname -r`

If you don't want to profile the kernel itself, you can tell OProfile you don't have a vmlinux file:

opcontrol --no-vmlinux

Now we are ready to start the daemon (oprofiled), which collects the profile data:

opcontrol --start

When I want to stop profiling, I can do so with:

opcontrol --shutdown

Note that unlike gprof, no instrumentation (-pg and -a options to gcc) is necessary.

Periodically (or on opcontrol --shutdown or opcontrol --dump) the profile data is written out into the /var/lib/oprofile/samples directory. These profile files cover shared libraries, applications, the kernel (vmlinux), and kernel modules. You can clear the profile data (at any time) with opcontrol --reset.

You can get summaries of this data in a number of ways at any time. To get a summary of data across the entire system for all of these profiles, you can do:

opreport

Or, to get a more detailed summary for a particular image, you can do something like:

opreport -l /boot/vmlinux-`uname -r`

There are also a number of other ways of presenting the data, as described later in this manual. Note that OProfile will choose a default profiling setup for you. However, there are a number of options you can pass to opcontrol if you need to change something, also detailed later.
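
To recap, a complete minimal session might look like the following sketch (the workload binary ./myprog is purely illustrative):

opcontrol --vmlinux=/boot/vmlinux-`uname -r`
opcontrol --start
./myprog
opcontrol --dump
opreport -l ./myprog
opcontrol --shutdown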

This section gives a brief description of the available OProfile utilities and their purpose.

ophelp

This utility lists the available events and short descriptions.

opcontrol

Used for controlling the OProfile data collection, discussed in Chapter 3.

opreport

This is the main tool for retrieving useful profile data, described in Section 2.

opannotate

This utility can be used to produce annotated source, assembly or mixed source/assembly. Source level annotation is available only if the application was compiled with debugging symbols. See Section 3.

opgprof

This utility can output gprof-style data files for a binary, for use with gprof -p. See Section 4.

oparchive

This utility can be used to collect executables, debuginfo, and sample files and copy the files into an archive. The archive is self-contained and can be moved to another machine for further analysis. See Section 5.

opimport

This utility converts sample database files from a foreign binary format (abi) to the native format. This is useful only when moving sample files between hosts, for analysis on platforms other than the one used for collection. See Section 6.

In this section we describe the configuration and control of the profiling system with opcontrol in more depth. The opcontrol script has a default setup, but you can alter this with the options given below. In particular, if your hardware supports performance counters, you can configure them. There are a number of counters (for example, counter 0 and counter 1 on the Pentium III). Each of these counters can be programmed with an event to count, such as cache misses or MMX operations. The event chosen for each counter is reflected in the profile data collected by OProfile: functions and binaries at the top of the profiles reflect that most of the chosen events happened within that code.

Additionally, each counter has a "count" value: this corresponds to how detailed the profile is. The lower the value, the more frequently profile samples are taken. A counter can be configured to sample only kernel code, only user-space code, or both (both is the default). Finally, some events have a "unit mask" - this is a value that further restricts the types of event that are counted. The event types and unit masks for your CPU are listed by opcontrol --list-events.

The opcontrol script provides the following actions:

--init

Loads the OProfile module if required and makes the OProfile driver interface available.

--setup

Followed by a list of arguments for profiling setup. The list of arguments is saved in /root/.oprofile/daemonrc. Giving this option is not necessary; you can just directly pass one of the setup options, e.g. opcontrol --no-vmlinux.

--status

Show configuration information.

--start-daemon

Start the oprofile daemon without starting actual profiling. The profiling can then be started using --start. This is useful for avoiding measuring the cost of daemon startup, as --start is a simple write to a file in oprofilefs. Not available in 2.2/2.4 kernels.

--start

Start data collection with either the arguments provided by --setup or the information saved in /root/.oprofile/daemonrc. Specifying the additional option --verbose makes the daemon generate lots of debug data whilst it is running.

--dump

Force a flush of the collected profiling data to the daemon.

--stop

Stop data collection (this separate step is not possible with 2.2 or 2.4 kernels).

--shutdown

Stop data collection and kill the daemon.

--reset

Clear out data from the current session, but leave saved sessions.

--save=session_name

Save data from current session to session_name.

--deinit

Shut down the daemon. Unload the OProfile module and oprofilefs.

--list-events

List event types and unit masks.

--help

Generate usage messages.
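
As a sketch, a session-management sequence using some of these actions might look like this (the session name "baseline" is illustrative):

opcontrol --start
# ... run the first workload ...
opcontrol --save=baseline
# ... run the second workload ...
opcontrol --shutdown

The saved data can later be selected with a session: specification, e.g. opreport session:baseline.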

There are a number of possible settings, of which only --vmlinux (or --no-vmlinux) is required. These settings are stored in ~/.oprofile/daemonrc.

--buffer-size=num

Number of samples in the kernel buffer. When using a 2.6 kernel, the buffer watershed needs to be tweaked when changing this value.

--buffer-watershed=num

Set the kernel buffer watershed to num samples (2.6 only). When only buffer-size - buffer-watershed free entries remain in the kernel buffer, the buffered data is flushed to the daemon. The most useful values are in the range [0.25, 0.5] * buffer-size.

--cpu-buffer-size=num

Number of samples in the kernel per-CPU buffer (2.6 only). If you profile at a high rate, it can help to increase this value if the log file shows an excessive number of samples lost to CPU buffer overflow.

--event=[eventspec]

Use the given performance counter event to profile. See Section 1.2 below.

--separate=[none,lib,kernel,thread,cpu,all]

By default, every profile is stored in a single file. Thus, for example, samples in the C library are all credited to the /lib/libc.so profile. However, you can choose to create separate sample files by specifying one of the options below.

none    No profile separation (default)
lib     Create per-application profiles for libraries
kernel  Create per-application profiles for the kernel and kernel modules
thread  Create profiles for each thread and each task
cpu     Create profiles for each CPU
all     All of the above options

Note that --separate=kernel also turns on --separate=lib. When using --separate=kernel, samples in hardware interrupts, soft-irqs, and other asynchronous kernel contexts are credited to the task currently running. This means you may see seemingly nonsensical profiles, such as /bin/bash showing samples for the PPP modules, etc.

On 2.2/2.4, only kernel threads already started when profiling begins are correctly profiled; samples from newly started kernel threads are credited to the vmlinux (kernel) profile.

Using --separate=thread creates a lot of sample files if you leave OProfile running for a while; it's most useful when used for short sessions, or when using image filtering.
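
For example, a common setup is to break out library and kernel samples on a per-application basis before starting a session; a minimal sketch:

opcontrol --separate=kernel

Since --separate=kernel also turns on --separate=lib, this yields per-application profiles for both shared libraries and kernel code.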

--callgraph=#depth

Enable call-graph sample collection with a maximum depth. Use 0 to disable callgraph profiling. NOTE: Callgraph support is available on a limited number of platforms at this time; for example:

  • x86 with recent 2.6 kernel

  • ARM with recent 2.6 kernel

  • PowerPC with 2.6.17 kernel
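
As an illustration (the depth value 6 is an arbitrary choice, not a recommendation):

opcontrol --callgraph=6

Setting the depth back to 0 disables call-graph collection again.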

--image=image,[images]|"all"

Image filtering. If you specify one or more absolute paths to binaries, OProfile will only produce profile results for those binary images. This is useful for restricting the sometimes voluminous output you may get otherwise, especially with --separate=thread. Note that if you are using --separate=lib or --separate=kernel, then if you specify an application binary, its shared libraries and kernel code are included. Specify the value "all" to profile everything (the default).

--vmlinux=file

vmlinux kernel image.

--no-vmlinux

Use this when you don't have a kernel vmlinux file, and you don't want to profile the kernel. This still counts the total number of kernel samples, but can't give symbol-based results for the kernel or any modules.

The --event option to opcontrol takes a specification that indicates how the details of each hardware performance counter should be set up. If you want to revert to OProfile's default setting (--event is strictly optional), use --event=default.

You can pass multiple event specifications. OProfile will allocate hardware counters as necessary. Note that some combinations are not allowed by the CPU; running opcontrol --list-events gives the details of each event. The event specification is a colon-separated string of the form name:count:unitmask:kernel:user as described in this table:

Note

For the PowerPC platforms, all events specified must be in the same group; i.e., the group number appended to the event name (e.g. <some-event-name>_GRP9) must be the same.

name      The symbolic event name, e.g. CPU_CLK_UNHALTED
count     The counter reset value, e.g. 100000
unitmask  The unit mask, as given in the events list, e.g. 0x0f
kernel    Whether to profile kernel code
user      Whether to profile userspace code

The last three values are optional; if you omit them (e.g. --event=DATA_MEM_REFS:30000), they will be set to the default values (a unit mask of 0, and profiling of both kernel and userspace code). Note that some events require a unit mask.
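
For example, given the defaults just described, the following two specifications should be equivalent (CPU_CLK_UNHALTED is a P6-family event name; use opcontrol --list-events to find the right name for your CPU):

opcontrol --event=CPU_CLK_UNHALTED:100000:0:1:1
opcontrol --event=CPU_CLK_UNHALTED:100000

Both take a sample every 100000 unhalted clock cycles, with a unit mask of 0, for both kernel and userspace code.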

If OProfile is using RTC mode, and you want to alter the default counter value, you can use something like --event=RTC_INTERRUPTS:2048. Note the last three values here are ignored. If OProfile is using timer-interrupt mode, there is no configuration possible.

The table below lists the events selected by default (--event=default) for the various computer architectures:

Processor              cpu_type          Default event
Alpha EV4              alpha/ev4         CYCLES:100000:0:1:1
Alpha EV5              alpha/ev5         CYCLES:100000:0:1:1
Alpha PCA56            alpha/pca56       CYCLES:100000:0:1:1
Alpha EV6              alpha/ev6         CYCLES:100000:0:1:1
Alpha EV67             alpha/ev67        CYCLES:100000:0:1:1
ARM/XScale PMU1        arm/xscale1       CPU_CYCLES:100000:0:1:1
ARM/XScale PMU2        arm/xscale2       CPU_CYCLES:100000:0:1:1
Athlon                 i386/athlon       CPU_CLK_UNHALTED:100000:0:1:1
Pentium Pro            i386/ppro         CPU_CLK_UNHALTED:100000:0:1:1
Pentium II             i386/pii          CPU_CLK_UNHALTED:100000:0:1:1
Pentium III            i386/piii         CPU_CLK_UNHALTED:100000:0:1:1
Pentium M (P6 core)    i386/p6_mobile    CPU_CLK_UNHALTED:100000:0:1:1
Pentium 4 (non-HT)     i386/p4           GLOBAL_POWER_EVENTS:100000:1:1:1
Pentium 4 (HT)         i386/p4-ht        GLOBAL_POWER_EVENTS:100000:1:1:1
Hammer                 x86-64/hammer     CPU_CLK_UNHALTED:100000:0:1:1
Itanium                ia64/itanium      CPU_CYCLES:100000:0:1:1
Itanium 2              ia64/itanium2     CPU_CYCLES:100000:0:1:1
TIMER_INT              timer             None selectable
IBM iSeries            PowerPC 4/5/970   CYCLES:10000:0:1:1
IBM pSeries            PowerPC 4/5/970   CYCLES:10000:0:1:1
IBM s390               timer             None selectable
IBM s390x              timer             None selectable

The oprof_start application provides a convenient way to start the profiler. Note that oprof_start is just a wrapper around the opcontrol script, so it does not provide more services than the script itself.

After oprof_start is started you can select the event type for each counter; the sampling rate and other related parameters are explained in Section 1. The "Configuration" section allows you to set general parameters such as the buffer size, kernel filename etc. The counter setup interface should be self-explanatory; Section 3.1 and related links contain information on using unit masks.

A status line shows the current status of the profiler: how long it has been running, and the average number of interrupts received per second and the total, over all processors. Note that quitting oprof_start does not stop the profiler.

Your configuration is saved in the same file as opcontrol uses; that is, ~/.oprofile/daemonrc.

Note

Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in RTC mode in 2.4 (see Section 3.2), or timer mode in 2.6 (see Section 3.3). You do not really need to read this section unless you are interested in using events other than the default event chosen by OProfile.

The Intel hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/. The AMD Athlon/Duron implementation is detailed in http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf. For PowerPC64 processors in IBM iSeries and pSeries systems, processor documentation is available at http://www-306.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC. (For example, the specific publication containing information on the performance monitor unit for the PowerPC970 is "IBM PowerPC 970FX RISC Microprocessor User's Manual.")

These processors are capable of delivering an interrupt when a counter overflows. This is the basic mechanism on which OProfile is based. The delivery mode is NMI, so blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called, the current PC value and the current task are recorded into the profiling structure. This allows the overflow event to be attached to a specific assembly instruction in a binary image. The daemon receives this data from the kernel, and writes it to the sample files.

If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED (GLOBAL_POWER_EVENTS or INSTR_RETIRED, respectively, on the Pentium 4), we can use the overflow counts as an estimate of actual time spent in each part of code. Alternatively we can profile interesting data such as the cache behaviour of routines with the other available counters.

However there are several caveats. First, there are the issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale - this means you cannot rely on the profiles being perfectly accurate at the instruction level. If you are using an "event-mode" counter such as the cache counters, a count registered against an instruction doesn't mean that instruction is responsible for the event; it implies only that the counter overflowed in the dynamic vicinity of that instruction, to within a few instructions. Further details on this problem can be found in Chapter 5 and also in the Digital paper "ProfileMe: A Hardware Performance Counter".

Each counter has several configuration parameters. First, there is the unit mask: this simply further specifies what to count. Second, there is the counter value, discussed below. Third, there is a parameter controlling whether to increment counts whilst in kernel space, in user space, or both. You can configure these separately for each counter.

After each overflow event, the counter will be re-initialized such that another overflow will occur after this many events have been counted. Thus, higher values mean less-detailed profiling, and lower values mean more detail, but higher overhead. Picking a good value for this parameter is, unfortunately, somewhat of a black art. It is of course dependent on the event you have chosen. Specifying too large a value will mean not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer). Specifying too small a value can lead to higher performance overhead.
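
As a sketch of this trade-off (the event name and count values are illustrative; check opcontrol --list-events for your CPU), one might pick either of:

opcontrol --event=CPU_CLK_UNHALTED:500000   # coarser profile, lower overhead
opcontrol --event=CPU_CLK_UNHALTED:50000    # more detail, higher overhead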

OK, so the profiler has been running, but it's not much use unless we can get some data out. Fairly often, OProfile does a little too good a job of keeping overhead low, and no data reaches the profiler. This can happen on lightly-loaded machines. Remember you can force a dump at any time with:

opcontrol --dump

Remember to do this before complaining there is no profiling data! Now that we've got some data, it has to be processed. That's the job of opreport, opannotate, or opgprof.

All of the analysis tools take a profile specification. This is a set of definitions that describe which actual profiles should be examined. The simplest profile specification is empty: this will match all the available profile files for the current session (this is what happens when you do opreport).

Specification parameters are of the form name:value[,value]. For example, if I wanted to get a combined symbol summary for /bin/myprog and /bin/myprog2, I could do opreport -l image:/bin/myprog,/bin/myprog2. As a special case, you don't actually need to specify the image: part here: anything left on the command line is assumed to be an image: name. Similarly, if no session: is specified, then session:current is assumed ("current" is a special name of the current / last profiling session).

In addition to the comma-separated list shown above, some of the specification parameters can take glob-style values. For example, if I want to see image summaries for all binaries profiled in /usr/bin/, I could do opreport image:/usr/bin/\*. Note the necessity to escape the special character from the shell.

For opreport, profile specifications can be used to define two profiles, giving differential output. This is done by enclosing each of the two specifications within curly braces, as shown in the examples below. Any specifications outside of curly braces are shared across both.

archive: archivepath

A path to an archive made with oparchive. Unlike with other tags, absence of this tag means "the current system", equivalent to specifying "archive:".

session: sessionlist

A comma-separated list of session names to resolve in. Unlike with other tags, absence of this tag means "the current session", equivalent to specifying "session:current".

session-exclude: sessionlist

A comma-separated list of sessions to exclude.

image: imagelist

A comma-separated list of image names to resolve. Each entry may be a relative path, a glob-style name, or a full path, e.g.

opreport 'image:/usr/bin/oprofiled,*op*,./opreport'

image-exclude: imagelist

Same as image:, but the matching images are excluded.

lib-image: imagelist

Same as image:, but only for images that are for a particular primary binary image (namely, an application). This only makes sense to use if you're using --separate. This includes kernel modules and the kernel when using --separate=kernel.

lib-image-exclude: imagelist

Same as lib-image:, but the matching images are excluded.

event: eventlist

The symbolic event name to match on, e.g. event:DATA_MEM_REFS. You can pass a list of events for side-by-side comparison with opreport. When using the timer interrupt, the event is always "TIMER".

count: eventcountlist

The event count to match on, e.g. event:DATA_MEM_REFS count:30000. Note that this value refers to the setting used for opcontrol only, and has nothing to do with the sample counts in the profile data itself. You can pass a list of events for side-by-side comparison with opreport. When using the timer interrupt, the count is always 0 (indicating it cannot be set).

unit-mask: masklist

The unit mask value of the event to match on, e.g. unit-mask:1. You can pass a list of events for side-by-side comparison with opreport.

cpu: cpulist

Only consider profiles for the given numbered CPU (starting from zero). This is only useful when using CPU profile separation.

tgid: pidlist

Only consider profiles for the given task groups. Unless some program is using threads, the task group ID of a process is the same as its process ID. This option corresponds to the POSIX notion of a thread group. This is only useful when using per-process profile separation.

tid: tidlist

Only consider profiles for the given threads. When using recent thread libraries, all threads in a process share the same task group ID, but have different thread IDs. You can use this option in combination with tgid: to restrict the results to particular threads within a process. This is only useful when using per-process profile separation.
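
For example, to restrict a symbol summary to one task group of a profiled program (the tgid value and binary name are illustrative, and per-process separation must have been used during collection):

opreport -l tgid:3433 image:/bin/myprog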

When attempting to get output, you may see the error:

error: no sample files found: profile specification too strict ?

What this is saying is that the profile specification you passed in, when matched against the available sample files, resulted in no matches. There are a number of reasons this might happen:

spelling

You specified a binary name, but spelt it wrongly. Check your spelling!

profiler wasn't running

Make very sure that OProfile was actually up and running when you ran the binary.

binary didn't run long enough

Remember OProfile is a statistical profiler - you're not guaranteed to get samples for short-running programs. You can mitigate this by using a lower count for the performance counter, so that many more samples are taken per second.

binary spent most of its time in libraries

Similarly, if the binary spends little time in the main binary image itself, with most of it spent in shared libraries it uses, you might not see any samples for the binary image itself. You can check this by using opcontrol --separate=lib before the profiling session, so opreport and friends show the library profiles on a per-application basis.

specification was really too strict

For example, you specified something like tgid:3433, but no task with that group ID ever ran the code.

binary didn't generate any events

If you're using a particular event counter, for example counting MMX operations, the code might simply have not generated any events in the first place. Verify the code you're profiling does what you expect it to.

you didn't specify kernel module name correctly

If you're using 2.6 kernels, and trying to get reports for a kernel module, make sure to use the -p option, and specify the module name with the .ko extension. Check if the module is one loaded from initrd.

The opreport utility is the primary utility you will use for getting formatted data out of OProfile. It produces two types of data: image summaries and symbol summaries. An image summary lists the number of samples for individual binary images such as libraries or applications. Symbol summaries provide per-symbol profile data. In the following example, we're getting an image summary for the whole system:

$ opreport --long-filenames
CPU: PIII, speed 863.195 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 23150
   905898 59.7415 /usr/lib/gcc-lib/i386-redhat-linux/3.2/cc1plus
   214320 14.1338 /boot/2.6.0/vmlinux
   103450  6.8222 /lib/i686/libc-2.3.2.so
    60160  3.9674 /usr/local/bin/madplay
    31769  2.0951 /usr/local/oprofile-pp/bin/oprofiled
    26550  1.7509 /usr/lib/libartsflow.so.1.0.0
    23906  1.5765 /usr/bin/as
    18770  1.2378 /oprofile
    15528  1.0240 /usr/lib/qt-3.0.5/lib/libqt-mt.so.3.0.5
    11979  0.7900 /usr/X11R6/bin/XFree86
    11328  0.7471 /bin/bash
    ...

If we had specified --symbols in the previous command, we would have gotten a symbol summary of all the images across the entire system. We can restrict this to only part of the system profile; for example, below is a symbol summary of the OProfile daemon. Note that as we used opcontrol --separate=kernel, symbols from images that oprofiled has used are also shown.

$ opreport -l `which oprofiled` 2>/dev/null | more
CPU: PIII, speed 863.195 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 23150
vma      samples  %           image name               symbol name
0804be10 14971    28.1993     oprofiled                odb_insert
0804afdc 7144     13.4564     oprofiled                pop_buffer_value
c01daea0 6113     11.5144     vmlinux                  __copy_to_user_ll
0804b060 2816      5.3042     oprofiled                opd_put_sample
0804b4a0 2147      4.0441     oprofiled                opd_process_samples
0804acf4 1855      3.4941     oprofiled                opd_put_image_sample
0804ad84 1766      3.3264     oprofiled                opd_find_image
0804a5ec 1084      2.0418     oprofiled                opd_find_module
0804ba5c 741       1.3957     oprofiled                odb_hash_add_node
...

These are the two basic reports you are most likely to use regularly, but opreport can do a lot more than that, as described below.

When using the opcontrol --callgraph option, you can see what functions are calling other functions in the output. Consider the following program:

#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define SIZE 500000

static int compare(const void *s1, const void *s2)
{
        /* qsort passes pointers to the array elements (char *) */
        return strcmp(*(char * const *)s1, *(char * const *)s2);
}

static void repeat(void)
{
        int i;
        char *strings[SIZE];
        char str[] = "abcdefghijklmnopqrstuvwxyz";

        for (i = 0; i < SIZE; ++i) {
                strings[i] = strdup(str);
                strfry(strings[i]);
        }

        qsort(strings, SIZE, sizeof(char *), compare);
}

int main()
{
        while (1)
                repeat();
}

When running with the call-graph option, OProfile will record the function stack every time it takes a sample. opreport --callgraph outputs an entry for each function, where each entry looks similar to:

samples  %        image name               symbol name
  197       0.1548  cg                       main
  127036   99.8452  cg                       repeat
84590    42.5084  libc-2.3.2.so            strfry
  84590    66.4838  libc-2.3.2.so            strfry [self]
  39169    30.7850  libc-2.3.2.so            random_r
  3475      2.7312  libc-2.3.2.so            __i686.get_pc_thunk.bx
-------------------------------------------------------------------------------

Here the non-indented line is the function we're focussing upon (strfry()). This line is the same as you'd get from a normal opreport output.

Above the non-indented line we find the functions that called this function (for example, repeat() calls strfry()). The samples and percentage values here refer to the number of times we took a sample where this call was found in the stack; the percentage is relative to all other callers of the function we're focussing on. Note that these values are not call counts; they only reflect the call stack every time a sample is taken; that is, if a call is found in the stack at the time of a sample, it is recorded in this count.

Below the line are the functions called by strfry() (its callees). It's clear here that strfry() calls random_r(). We also see a special entry with a "[self]" marker. This records the normal samples for the function, but the percentage becomes relative to all callees. This allows you to compare time spent in the function itself against time spent in the functions it calls. Note that if a function calls itself, it will appear in the list of its own callees, but without the "[self]" marker; so recursive calls are still clearly separable.

You may have noticed that the output lists main() as calling strfry(), but it's clear from the source that this doesn't actually happen. See Section 3 for an explanation.

Often, we'd like to be able to compare two profiles. For example, when analysing the performance of an application, we'd like to make code changes and examine the effect of the change. This is supported in opreport by giving a profile specification that identifies two different profiles. The general form is:

$ opreport <shared-spec> { <first-profile> } { <second-profile> }

For each of the two profiles, the shared section is prepended to the sub-specification, and then the combined specification is analysed. The usual parameters work both within the shared section and in the sub-specifications within the curly braces.

A typical way to use this feature is with archives created with oparchive. Let's look at an example:

$ ./a
$ oparchive -o orig ./a
$ opcontrol --reset
  # edit and recompile a
$ ./a
  # now compare the current profile of a with the archived profile
$ opreport -xl ./a { archive:./orig } { }
CPU: PIII, speed 863.233 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        diff %    symbol name
92435    48.5366  +0.4999   a
54226    ---      ---       c
49222    25.8459  +++       d
48787    25.6175  -2.2e-01  b

Note that we specified an empty second profile in the curly braces, as we wanted to use the current session; alternatively, we could have specified another archive, or a tgid etc. We specified the binary a in the shared section, so we matched that in both the profiles we're diffing.

As in the normal output, the results are sorted by the number of samples, and the percentage field represents the relative percentage of the symbol's samples in the second profile.

Notice the new column in the output. This value represents the percentage change of the relative percent between the first and the second profile: roughly, "how much more important this symbol is". Looking at the symbol a(), we can see that it took roughly the same amount of the total profile in both the first and the second profile. The function c() was not in the new profile, so has been marked with ---. Note that the sample value is the number of samples in the first profile; since we're displaying results for the second profile, we don't list a percentage value for it, as it would be meaningless. d() is new in the second profile, and consequently marked with +++.

When comparing profiles between different binaries, it should be clear that functions can change in terms of VMA and size. To avoid this problem, opreport considers a symbol to be the same if the symbol name, image name, and owning application name all match; any other factors are ignored. Note that the check for application name means that trying to compare library profiles between two different applications will not work as you might expect: each symbol will be considered different.

--accumulated / -a

Accumulate sample and percentage counts in the symbol list.

--callgraph / -c

Show callgraph information.

--debug-info / -g

Show source file and line for each symbol.

--demangle / -D none|normal|smart

none: no demangling. normal: use the default demangler (default). smart: use pattern-matching to make C++ symbol demangling more readable.

--details / -d

Show per-instruction details for all selected symbols. Note that, for binaries without symbol information, the VMA values shown are raw file offsets for the image binary.

--exclude-dependent / -x

Do not include application-specific images for libraries, kernel modules and the kernel. This option only makes sense if the profile session used --separate.

--exclude-symbols / -e [symbols]

Exclude all the symbols in the given comma-separated list.

--global-percent / -%

Make all percentages relative to the whole profile.

--help / -? / --usage

Show help message.

--image-path / -p [paths]

Comma-separated list of additional paths to search for binaries. This is needed to find modules in kernels 2.6 and upwards.

--include-symbols / -i [symbols]

Only include symbols in the given comma-separated list.

--long-filenames

Output full paths instead of basenames.

--merge / -m [lib,cpu,tid,tgid,unitmask,all]

Merge any profiles separated in a --separate session.

--no-header

Don't output a header detailing profiling parameters.

--output-file / -o [file]

Output to the given file instead of stdout.

--reverse-sort / -r

Reverse the sort from the default.

--show-address / -w

Show the VMA address of each symbol (off by default).

--sort / -s [vma,sample,symbol,debug,image]

Sort the list of symbols by, respectively, symbol address, number of samples, symbol name, debug filename and line number, binary image filename.

--symbols / -l

List per-symbol information instead of a binary image summary.

--threshold / -t [percentage]

Only output data for symbols that have more than the given percentage of total samples.

--verbose / -V [options]

Give verbose debugging output.

--version / -v

Show version.
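
As a hedged example combining several of these options (the binary name is illustrative), the following produces a symbol summary with source file and line information, showing only symbols with more than 2% of the total samples:

opreport --symbols --debug-info --threshold=2 ./myprog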

The opannotate utility generates annotated source files or assembly listings, optionally mixed with source. If you want to see the source file, the profiled application needs to have debug information, and the source must be available through this debug information. For GCC, you must use the -g option when you are compiling. If the binary doesn't contain sufficient debug information, you can still use opannotate --assembly to get annotated assembly.

Note that, for the reasons explained in Section 3.1, the results can be inaccurate. The debug information itself can add other problems; for example, the line number for a symbol can be incorrect. Assembly instructions can be re-ordered and moved by the compiler, and this can lead to crediting source lines with samples not really "owned" by that line. Also see Chapter 5.

You can output the annotation to one single file, containing all the source found, using the --source option. You can use this in conjunction with --assembly to get combined source/assembly output.
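
For example, to write a combined source/assembly annotation for a single binary to a file (the binary and output names are illustrative):

opannotate --source --assembly ./myprog > annotated.txt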

You can also output a directory of annotated source files that maintains the structure of the original sources. Each line in the annotated source is prepended with the samples for that line. Additionally, each symbol is annotated giving details for the symbol as a whole. An example:

$ opannotate --source --output-dir=annotated /usr/local/oprofile-pp/bin/oprofiled
$ ls annotated/home/moz/src/oprofile-pp/daemon/
opd_cookie.h  opd_image.c  opd_kernel.c  opd_sample_files.c  oprofiled.c

Line numbers are maintained in the source files, but each file has a footer appended describing the profiling details. The actual annotation looks something like this:

...
               :static uint64_t pop_buffer_value(struct transient * trans)
 11510  1.9661 :{ /* pop_buffer_value total:  89901 15.3566 */
               :        uint64_t val;
               :
 10227  1.7469 :        if (!trans->remaining) {
               :                fprintf(stderr, "BUG: popping empty buffer !\n");
               :                exit(EXIT_FAILURE);
               :        }
               :
               :        val = get_buffer_value(trans->buffer, 0);
  2281  0.3896 :        trans->remaining--;
  2296  0.3922 :        trans->buffer += kernel_pointer_size;
               :        return val;
 10454  1.7857 :}
...

The first number on each line is the number of samples, whilst the second is the relative percentage of total samples.

--assembly / -a

Output annotated assembly. If this is combined with --source, then mixed source / assembly annotations are output.

--base-dirs / -b [paths]

Comma-separated list of path prefixes. This can be used to point OProfile to a different location for source files when the debug information specifies an absolute path that does not exist on your system. The prefix is stripped from the debug source file paths, and the remainder is searched for in the directories specified by --search-dirs.

--demangle / -D none|normal|smart

none: no demangling. normal: use the default demangler (default). smart: use pattern-matching to make C++ symbol demangling more readable.

--exclude-dependent / -x

Do not include application-specific images for libraries, kernel modules and the kernel. This option only makes sense if the profile session used --separate.

--exclude-file [files]

Exclude all files in the given comma-separated list of glob patterns.

--exclude-symbols / -e [symbols]

Exclude all the symbols in the given comma-separated list.

--help / -? / --usage

Show help message.

--image-path / -p [paths]

Comma-separated list of additional paths to search for binaries. This is needed to find modules in kernels 2.6 and upwards.

--include-file [files]

Only include files in the given comma-separated list of glob patterns.

--include-symbols / -i [symbols]

Only include symbols in the given comma-separated list.

--objdump-params [params]

Pass the given parameters as extra values when calling objdump.

--output-dir / -o [dir]

Output directory. This makes opannotate output one annotated file for each source file. This option can't be used in conjunction with --assembly.

--search-dirs / -d [paths]

Comma-separated list of paths to search for source files. This is useful to find source files when the debug information only contains relative paths.

--source / -s

Output annotated source. This requires debugging information to be available for the binaries.

--threshold / -t [percentage]

Only output data for symbols that have more than the given percentage of total samples.

--verbose / -V [options]

Give verbose debugging output.

--version / -v

Show version.
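
As a sketch combining the path-related options above (all paths are illustrative), for a binary whose debug information records a build directory that no longer exists:

opannotate --source --base-dirs=/home/build/project --search-dirs=/usr/src/project ./myprog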

The standard caveats of profiling apply in interpreting the results from OProfile: profile realistic situations, profile different scenarios, profile for as long a time as possible, avoid system-specific artifacts, and don't trust the profile data too much. Also bear in mind the comments on the performance counters above - you cannot rely on totally accurate instruction-level profiling. However, for almost all circumstances the data can be useful. Ideally a utility such as Intel's VTUNE would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;)

This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.

double fun(double a, double b, double c)
{
 double result = 0;
 for (int i = 0 ; i < 10000; ++i) {
  result += a;
  result *= b;
  result /= c;
 }
 return result;
}

Here the last instruction of the loop is very costly, and you would expect the results to reflect that - but (cutting the instructions inside the loop):

$ opannotate -a -t 10 ./a.out

     88 15.38% : 8048337:       fadd   %st(3),%st
     48 8.391% : 8048339:       fmul   %st(2),%st
     68 11.88% : 804833b:       fdiv   %st(1),%st
    368 64.33% : 804833d:       inc    %eax
               : 804833e:       cmp    $0x270f,%eax
               : 8048343:       jle    8048337

The problem comes from the x86 hardware: when the counter overflows, the IRQ is asserted, but the hardware has features that can delay the NMI interrupt. x86 hardware is synchronous (i.e. it cannot interrupt in the middle of an instruction); there is also a latency when the IRQ is asserted, and the multiple execution units and out-of-order model of modern x86 CPUs also cause problems. This is the same function, with source annotation:

$ opannotate -s -t 10 ./a.out

               :double fun(double a, double b, double c)
               :{ /* _Z3funddd total:     572 100.0% */
               : double result = 0;
    368 64.33% : for (int i = 0 ; i < 10000; ++i) {
     88 15.38% :  result += a;
     48 8.391% :  result *= b;
     68 11.88% :  result /= c;
               : }
               : return result;
               :}

The conclusion: don't trust samples coming at the end of a loop, particularly if the last instruction generated by the compiler is costly. This case can also occur for branches. Always bear in mind that samples can be delayed by a few cycles from their real position. That's a hardware problem and OProfile can do nothing about it.

You may see that a function is credited with a certain number of samples, but the listing does not add up to the correct total. To pick a real example:

               :internal_sk_buff_alloc_security(struct sk_buff *skb)
 353 2.342%    :{ /* internal_sk_buff_alloc_security total: 1882 12.48% */
               :
               :        sk_buff_security_t *sksec;
  15 0.0995%   :        int rc = 0;
               :
  10 0.06633%  :        sksec = skb->lsm_security;
 468 3.104%    :        if (sksec && sksec->magic == DSI_MAGIC) {
               :                goto out;
               :        }
               :
               :        sksec = (sk_buff_security_t *) get_sk_buff_memory(skb);
   3 0.0199%   :        if (!sksec) {
  38 0.2521%   :                rc = -ENOMEM;
               :                goto out;
  10 0.06633%  :        }
               :        memset(sksec, 0, sizeof (sk_buff_security_t));
  44 0.2919%   :        sksec->magic = DSI_MAGIC;
  32 0.2123%   :        sksec->skb = skb;
  45 0.2985%   :        sksec->sid = DSI_SID_NORMAL;
  31 0.2056%   :        skb->lsm_security = sksec;
               :
               :      out:
               :
 146 0.9685%   :        return rc;
               :
  98 0.6501%   :}

Here, the function is credited with 1,882 samples, but the annotations below do not account for this. This is usually because of inline functions - the compiler marks such code with debug entries for the inline function definition, and this is where opannotate annotates such samples. In the case above, memset is the most likely candidate for this problem. Examining the mixed source/assembly output can help identify such results.

This problem is more visible when no source file is available. In the following example, it is plain that the sum of the individual symbols' samples is less than the number of samples for the file as a whole; the difference must be attributed to inlined functions.

/*
 * Total samples for file : "arch/i386/kernel/process.c"
 *
 *    109  2.4616
 */

 /* default_idle total:     84  1.8970 */
 /* cpu_idle total:         21  0.4743 */
 /* flush_thread total:      1  0.0226 */
 /* prepare_to_copy total:   1  0.0226 */
 /* __switch_to total:      18  0.4065 */

The missing samples are not lost; they are credited to the source location where the inlined function is defined. Samples for an inlined function are merged from all of its call sites into that one place in the annotated source file, so there is no way to tell which call site the samples for an inlined function came from.

When running opannotate, you may get a warning "some functions compiled without debug information may have incorrect source line attributions". In some rare cases, OProfile is not able to verify that the derived source line is correct (when some parts of the binary image are compiled without debugging information). Be wary of results if this warning appears.

Furthermore, for some languages the compiler can implicitly generate functions, such as default copy constructors. Such functions are labelled by the compiler as having a line number of 0, which means the source annotation can be confusing.

Depending on your compiler, you can run into the following problem:

struct big_object { int a[500]; };

int main()
{
	big_object a, b;
	for (int i = 0 ; i != 1000 * 1000; ++i)
		b = a;
	return 0;
}

Compiled with gcc 3.0.4, the annotated source is clearly inaccurate:

               :int main()
               :{  /* main total: 7871 100% */
               :        big_object a, b;
               :        for (int i = 0 ; i != 1000 * 1000; ++i)
               :                b = a;
 7871 100%     :        return 0;
               :}

The problem here is distinct from the IRQ latency problem; the debug line number information is not precise enough. Again, looking at the output of opannotate -as can help.

               :int main()
               :{
               :        big_object a, b;
               :        for (int i = 0 ; i != 1000 * 1000; ++i)
               : 80484c0:       push   %ebp
               : 80484c1:       mov    %esp,%ebp
               : 80484c3:       sub    $0xfac,%esp
               : 80484c9:       push   %edi
               : 80484ca:       push   %esi
               : 80484cb:       push   %ebx
               :                b = a;
               : 80484cc:       lea    0xfffff060(%ebp),%edx
               : 80484d2:       lea    0xfffff830(%ebp),%eax
               : 80484d8:       mov    $0xf423f,%ebx
               : 80484dd:       lea    0x0(%esi),%esi
               :        return 0;
    3 0.03811% : 80484e0:       mov    %edx,%edi
               : 80484e2:       mov    %eax,%esi
    1 0.0127%  : 80484e4:       cld
    8 0.1016%  : 80484e5:       mov    $0x1f4,%ecx
 7850 99.73%   : 80484ea:       repz movsl %ds:(%esi),%es:(%edi)
    9 0.1143%  : 80484ec:       dec    %ebx
               : 80484ed:       jns    80484e0
               : 80484ef:       xor    %eax,%eax
               : 80484f1:       pop    %ebx
               : 80484f2:       pop    %esi
               : 80484f3:       pop    %edi
               : 80484f4:       leave
               : 80484f5:       ret

So here it's clear that the copying is correctly credited with almost all of the samples, but the line number information is misplaced. objdump -dS exposes the same problem. Note that maintaining accurate debug information for compilers when optimizing is difficult, so this problem is not surprising. The accuracy of the debug information also depends on the binutils version used; some BFD library versions contain a work-around for known gcc problems, some others do not. This is unfortunate, but we must live with it, since profiling is pointless when you disable optimisation (which would give better debugging entries).

Thanks to (in no particular order) : Arjan van de Ven, Rik van Riel, Juan Quintela, Philippe Elie, Phillipp Rumpf, Tigran Aivazian, Alex Brown, Alisdair Rawsthorne, Bob Montgomery, Ray Bryant, H.J. Lu, Jeff Esper, Will Cohen, Graydon Hoare, Cliff Woolley, Alex Tsariounov, Al Stone, Jason Yeh, Randolph Chung, Anton Blanchard, Richard Henderson, Andries Brouwer, Bryan Rittmeyer, Maynard P. Johnson, Richard Reich (rreich@rdrtech.com), Zwane Mwaikambo, Dave Jones, Charles Filtness; and finally Pulp, for "Intro".