Overview
--------

Pct is a program counter/instruction pointer sampling-based profiling system. It allows low-overhead, instruction-level granularity accounting of where programs spend their time. No recompiling or code instrumentation is required. Some information can be gleaned even for stripped binaries, though reports become more detailed as more debugging information is available in object files. A variety of output formats are available. On some systems LD_PRELOAD can even elide the re-linking step for dynamically linked executables.

Statistical Profiling Background
--------------------------------

Statistical profiling is amongst the oldest of program profiling techniques. The basic idea is simply to let hardware timer interrupts temporarily suspend program execution, and then to observe where the OS must resume execution. At this observation point, a counter corresponding to the instruction pointer (aka program counter) is incremented. In any time-sharing OS, these timer interrupts already happen as a matter of course. Hence the only overhead is the memory access to increment the observation counter. If the program spends a lot of its time at particular spots, then statistically one will observe high counts in that portion of this "execution interruption histogram".

The basic tradeoff of sampling-based profiling vs. code instrumentation is that the former allows instruction-level timing, but does not allow reconstruction of program control flow transfer graphs. If you really need the latter, you can compile all your objects with -pg and use gprof. At least for me, profiling data of such high semantic content is only necessary very rarely, while often I need to track down basic-block or even instruction-level timing. There are also occasionally strange interactions between compiler optimizations and code instrumentation, causing errant behavior or even compiler crashes.

There is also an issue of "Heisenberg observation interference".
Code which has been instrumented to collect data on its execution is really not the same as the original code, and data inaccuracies can result. Since timer interrupts happen anyway, the only extra overhead is the main memory access and cache contention for the buffer of counters. This is typically a truly tiny error, since timer interrupts are not usually configured to go off more often than once per something like 100,000 main memory access latencies. In any case, this data accuracy issue is not usually very important in practice.

This performance hit could be much worse if the profile data file is stored on a network file system, since network latency can easily be comparable to timer-interrupt periods. Care should be taken as to the placement of profiling files.

There is a final issue of statistical accuracy. The basic assumption of statistical profiling is that the interruption probability at a given location is proportional to how long it takes to execute the part of the program at that location. This assumes at least that timer interrupts are not synchronized with paths through a program. The basic task of statistical profiling amounts to estimating the multinomial probability density of interruptions over all interruptible program locations. One cannot make this estimate without error. How big is the error? In the typical situation with many possible locations, a per-location binomial approximation to the multinomial becomes valid. When the probability of being at any one location is small, per-location Poisson approximations become valid. This is the common case in profiling any reasonably complex program. In this case, one can use the fact that the variance of a Poisson distribution equals its mean, and that an efficient estimator of the mean is simply the observed count. All this basically amounts to the fact that if n counts out of a total of N samples are seen at location X, then the variance of that count is n itself, i.e. its root variance (standard deviation) is sqrt(n).
The upshot of this is that caution should be exercised in concluding that a count of 4 at some location and a count of 1 at another really amounts to much. It is true that the estimate of the difference in time the program spends at the two spots will be 3. But the sum of two independent Poisson random variables is also Poisson, and for either a sum or a difference the combined variance equals the sum of the summand variances. So for this example, while the estimate of the difference is 3, the variance on that estimate is 5, for a root variance of sqrt(5) = 2.2. Treating the chance difference as roughly Poisson with mean 2.2, the probability of it being 3 or more is 1 - P(diff < 3) = 1 - [P(diff=0) + P(diff=1) + P(diff=2)] = 1 - (1 + 2.2 + 2.2*2.2/2)*exp(-2.2) = 0.38. Hence, simply by chance one will see differences of this size or greater about 38% of the time. The difference is thus not statistically significant.

Notice, however, that if one count were 10 and the other 40, the difference would be 30 with a variance of 50, or a root variance of about 7.1. The difference is then more than four root variances, and the analogous tail probability is far below 1%. Hence that difference is significant with very high confidence.

All this detailed probability is just to ward one away from making grandiose claims about relative timings. For the most part, any parts of the program that are worth optimizing will have a significant number of counts.

In sum, recompiling or maintaining separate profiling libraries and programs is seldom warranted. A much easier policy to adhere to is to *always* compile with debugging, or at least symbols, enabled on any development system. In addition to enabling good interpretation of any profile data PCT might collect, this also has the virtue of making core dumps more meaningful. Proprietary system software, of course, often limits how much can occur here, but one does what one can. Disk space is really very cheap these days.

Platform Dependencies and System Architecture
---------------------------------------------

The PCT architecture is very portable and very modular.
It should work on almost any variant of Unix. Minimal system requirements are something like: some way to sample PCs (the profil(2) system call or virtual interval-timer signals), some way to get pct_on() called before much happens, and some way to interpret program counters (nm, disassembly, the GNU addr2line). Notice that there are usually multiple methods for each requirement, and it isn't usually hard to get at least one of them to work. This is another part of the appeal of statistical profiling: the system tools needed are usually present.

The profil(2) system call seems to date back nearly to AT&T Version 7 Unix. It is trivial to implement and useful. I have not yet found an OS without it. Virtual interval timers allow arbitrary signal handlers to be run at timer interrupt time. The point of program resumption is almost always available to signal handlers. Usually they can grab the PC off of a prior stack frame. In latter-day POSIX there is even a siginfo_t structure that exports this information. One benefit of interval timers is that they allow 1-byte rather than 2-byte sampling granularity; an interface bug in profil(2) prevents this. This only ever matters on instruction sets where an instruction can be an odd number of bytes (e.g. x86), and even there it usually creates only a very small error. In general the great portability of profil(2) makes it attractive, at least for tracking time spent in the main program text.

Ideally pct_on() can be called automagically. Any system on which g++ works even a little bit should allow the gcc extension __attribute__((constructor)) to work, at least in a static linking context. This relies on the collect2 program to cull together global constructor code. Many ELF and prior executable formats have ".ini" or ".init" object initializer sections, which may work for static or shared objects or both. __asm__ or asm directives can probably place pct_on() into the appropriate section.
On Unix systems with fully functional shared library support, profiling can also be activated with an LD_PRELOAD environment variable. On other Unix systems one can toggle whether profiling is used via the $PCT environment variable; if it is not set, profiling will not occur. Finally, in a pinch, pct_on() can always be called manually from main(). This might also be necessary for reliable argv[0] determination. Since profiling is not activated unless $PCT is set, calling pct_on() does not imply that every program invocation will result in profiling data, so it is safe to just leave the call in your source. pct_on() is also careful not to re-run itself, so calling it twice is fine, in case you link with a shared library whose initializer calls it before your call in main().

Translating program counters to relevant source text coordinates is a low-level activity. Most systems provide at least some tools for this. Pct provides a variety of report formats, available as composable Unix filters. This allows a lot of flexibility in how data are generated and combined. It also lets PCT yield incrementally useful reports according to however many of the system's tools actually work or are available: the more tools that work, the more detailed PCT's reports can be. This choice also lets the filters be very simple programs, since they are decoupled from most of the system-specific complexity.

Portability Obstacles
---------------------

One choice that Pct makes that might break is using, by default, argv[0] as a path prefix for the profil count data. This choice is motivated by making it possible and easy to profile multiple programs whose invocations are all coordinated by shell scripts. Since in many interesting cases pct_on() is called from outside of main(), an inferential technique is used to deduce argv[0] from the global variable 'environ'. The implementation of this _argv() function is described in the NOTES file.
Any pitfalls associated with bogus inferred argv[0]s can be avoided by setting $PCT appropriately to direct output to some named file, or to a default "pc.pct" file in the working directory at program startup time.

Chuck Blake
5/14/2000
cb@mit.edu