_argv implementation notes
--------------------------

As far as I know this is a very reliable routine on Linux, OpenBSD,
FreeBSD, SunOS 4, Solaris 2+, OSF/Digital/Tru64 Unix, Irix 6, and
HP-UX.  I'd also like to know if it works on AIX.

We rely on the fact that the vector of environment pointers is stored
in memory immediately after the vector of argument pointers.  This is
the case in almost every Unix-like runtime executable layout.  It
tells us that environ - 2 is actually the same as argv + (argc - 1).

Now it is nearly universal that argv[argc - 1] points into the same
compact string data area that argv[0] points into.  So environ[-2] is
a safe high memory fencepost: argv[x] will not point above this area
for any x.  The string data area always sits at higher addresses than
the pointer vectors, and the environment vector comes after argv.

The end of environ provides a low memory fencepost.  This one is more
approximate.  While some formats put the argv[] string data right
after the pointer data, others (Solaris, Linux, etc.) leave 40 or 50
bytes between these two regions.  Hence there is some chance that the
data just below argv will mimic a pointer into this region and we will
be fooled into thinking we are still in argv.  We should never
segfault, though, since this is valid memory.  We just might pick up
an extra fictional element or two of argv off the bottom end.

The memory looks like this:

    --- HIGH ADDRESS
    environ[envc - 1]
      ...
    environ[0]
    --- this gap is usually very small, almost always zero
    exact bnd_hi
    argv[argc - 1]
      ...
    argv[0]
    --- this gap is usually quite small, occasionally zero
    approx bnd_lo-4
    environ + envc   (*==NULL)
      ...
    environ
    --- this gap is almost always of zero size; the technique depends on it being zero
    argv + argc      (*==NULL)
    argv + argc - 1  (*==high fencepost)
      ...
    argv
    --- LOW ADDRESS

Notice that if the lower fencepost heuristic fails, we will never
segfault.
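Concretely, the fencepost scan just described might be sketched like
this.  This is a hypothetical sketch assuming the common Linux-style
layout (argv vector, NULL, environ vector, NULL, string data above,
and environ still pointing into the original stack, i.e. no setenv()
has reallocated it); none of these names are the routine's actual
ones.

```c
/* Illustrative sketch of the _argv fencepost technique, not the
 * library's real implementation. */
#include <stddef.h>

extern char **environ;

static char **infer_argv(int *argcp)
{
    size_t envc = 0;
    while (environ[envc] != NULL)
        envc++;

    char *hi = environ[-2];                  /* == argv[argc-1]: exact high fencepost */
    char *lo = (char *)(environ + envc + 1); /* just past environ's NULL: approx low fencepost */
    char **p = environ - 2;                  /* == argv + argc - 1 */

    /* Walk down while the word below still looks like a pointer into
     * the fenced string area; this may overshoot by an element or two
     * if the memory just below argv mimics such a pointer. */
    while (p[-1] >= lo && p[-1] <= hi)
        p--;

    *argcp = (int)((environ - 1) - p);       /* environ - 1 == argv + argc */
    return p;                                /* inferred argv (possibly a bit low) */
}
```

Note that the inferred vector can only gain fictional elements at the
bottom; its top, and hence argv[argc - 1], is exact.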
It just means that memory just below the true beginning of argv looks
like more elements of argv and points into a valid data area.  We
might get a crazy-looking string, but it will be some string, and the
distance to the NUL character will be reasonably short if the gap size
is small.  Note that if the non-string portion of our fenced area can
contain NUL bytes, then we can double-check that the string following
the NUL byte after our inferred argv[0] is our inferred argv[1].  Note
that strings are packed back-to-back on likely every Unix-like OS ever
written.

Occasionally the order of the pointer and string data areas is
reversed.  For OS layouts like these the same basic strategy applies,
since environ[-1] is still argv[argc].  The low-memory loop
termination needs to be slightly different, but a reasonably tight
lower bound does exist, namely the end of the string data area,
environ[last] + strlen(environ[last]).

profil.o/libprofil.so implementation notes
------------------------------------------

itimer.o/libitimer.so implementation notes
------------------------------------------

Circular files are 0-terminated/separated log files.  Basically,
always write out a pointer and a zero pointer atomically, but then
seek backward one pointer's worth.  With a limit of, say, 4 values
imposed, the file ends up looking like:

    x0  xx0  xxx0  y0xx  yy0x  yyy0

Here the x's are first-pass values and the y's are second-pass values.
A simple scan for the null value gives the old/new partition or, if it
occurs at the end, lets you know there was no wraparound.

addr2func implementation notes
------------------------------

The basic algorithm is an approximate binary search where the data is
subdivided by byte offset rather than by record number.  It scans
backward or forward to the nearest newline to bound the new record.
It assumes the name list is already sorted in ascending order by
address, as nm -n might produce.  The search could be an exact binary
search if the name list lines were padded with blanks to the right.
awk '{printf "%8s%2s %-21.21s\n",$1,$2,$3}' would do the trick for
this, but it either makes the file larger or truncates symbols.  The
approximate method seems to be pretty darn fast, though.

signal handler implementation notes
-----------------------------------

Assuming about 4 pointers per cache line and a common case of only
4..8 regions, it takes only 2..3 loads to get the whole array into L1
cache (one of them is for the pointer itself).  In the worst case it
might take 4 reads (32/4, not lg 32), and then two extra reads for
ctrbuf and whatever index is used.  So that is 6 reads, plus of course
the one read and one write for the counter variable itself.  The grand
total worst case is then 7 reads and 1 write, or 8 main-memory
accesses.  For 4 or fewer regions in the list, the cost is just
2R+1R+1R+1W, or 4R+1W.

Assuming 100 ns or better for RAM, our time overhead is roughly
between 500 and 800 ns per sample, and the sample period is never less
than 1/1024 sec.  Thus a good worst-case overhead estimate is "under
.08%".

If our region array data stays in higher-level caches (which do much
better than 100 ns latency), things are obviously much, much better.
However, CPU-intensive programs are often also memory-intensive
programs, and 0.001 seconds is an opportunity for over a million
accesses.  That could well wipe out the cache presence of our region
data.  But we do what we can: our main-memory accesses may well be
dwarfed by the kernel's, and 100 ns is a pretty conservative estimate
for main-memory latency.

region_discover Linux notes
---------------------------

region_discover populates region[] from the memory areas marked
executable in /proc/PID/maps.  Linux guarantees this list is sorted by
address.  The routine forces its caller to pass in buf so that file
buffering can be stack allocated.

PC translation IO notes
-----------------------

It would be nicest to simply have a pct-print program which dumps out
count-PC pairs in an easily manipulable format.
Unfortunately, nice translator programs, e.g. addr2line, do not expect
anything besides a series of PCs, so the pipeline gets horrifically
complex.  Specifically, you need something like:

    pct-print --just_pcs foo.pct | addr2line -e foo > tmp_file
    pct-print --just_cts foo.pct | paste tmp_file -

This has the ugly property of making multiple passes over all the pct
files and creating temporary files, neither of which is really
necessary.  It might be prettier if pct-print sent output to a pair of
file descriptors, separately buffered:

    pct-print -F counts_fifo foo.pct | addr2line -e foo | paste counts_fifo -

Users *are* intended to understand these pipelines, though, and this
strays far enough from typical usage to require explanation even for
systems types.  To keep the burden on users simple, we make pct-print
much more complex.  To avoid funky pipeline argument splitting, we
formulate a trivial translator-program protocol patterned after
addr2line: any PC translator must read addresses from its stdin and
write translated coordinates to its stdout.  pct-print does the proper
IO flow control to manage these translator programs as co-processes.

Another issue is that various 'types' of binary 'foo' might need
different PC translation programs, depending on how much debugging
data is available in the object file.  This could be inferred somehow
and placed in a header field of the pct file.  This sort of
structuring would still allow pct-print to take multiple files in its
argument list and smartly adapt the translation program on a per-pct
file basis.

num2str notes
-------------

Care is taken to batch output.  This makes an enormous (10X)
difference for binary.  Also, we need to make sure that % and / always
have compile-time constant second arguments.  This allows gcc (and
probably other compilers) to turn the divide into a multiply.

***TODO XXX Should time a floating point reciprocal implementation, too.
***TODO XXX Complete formatting/padding enhancements.
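The batching and constant-divisor points above might look like this in
a sketch (base 10 shown; the function name and exact buffering are
illustrative, not the library's own):

```c
/* Sketch of batched numeric output with compile-time constant
 * divisors.  Digits are produced backward into a small local buffer
 * and copied out once, so output is batched; '% 10' and '/ 10' have
 * constant second arguments, letting the compiler apply the
 * divide -> multiply trick. */
#include <string.h>

static size_t u2str10(unsigned long v, char *out)
{
    char tmp[24];                      /* enough for 64-bit decimal */
    char *p = tmp + sizeof tmp;

    do {
        *--p = (char)('0' + v % 10);   /* constant divisor */
        v /= 10;
    } while (v != 0);

    size_t n = (size_t)(tmp + sizeof tmp - p);
    memcpy(out, p, n);                 /* single batched copy */
    out[n] = '\0';
    return n;
}
```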
str2num notes
-------------

Pretty cookie-cutter.  We use a whole byte-value array to speed up
parsing.

ndigits notes
-------------

There are a couple of different unrollings of the basic binary search
loop.  It turns out that the one with branches (see test/ndigit.c) is
quite a bit faster than the one without (on x86 and Alpha ev56).
Presumably this is because CPUs are much better at speculating
branches than at speculating memory loads.

defunct ptrace.c notes
----------------------

Things are similar in this program to itimer.c.  The main difference
is that inside our timer signal handler we use ptrace() to grab data
for a target process instead of for ourselves.  Options that we might
ordinarily have taken from PCT we can take from the command line.  We
can call getenv() on the target process to infer environ, and from
environ get argv[0] via _argv().  If the target process is dynamically
linked, we can call putenv() in its context to activate profiling for
exec()d children.  fork()ed children are a hole; we need to trace
fork() calls in the target process to catch them.

OK.  So there is really a list (in preferred order) of post-startup
methods, depending on the kind of executable we are attaching to.
This list is also roughly in order of increasing implementation
complexity:

  already linked/LD_PRELOADed (controller dies)
      attach + invoke putenv(PCT), ?pct_on
  dynamically linked, PCT unaware (controller dies)
      attach + invoke putenv(PCT), putenv(LD_PRELOAD), dlopen(itimer), ??pct_on
  statically linked, PCT aware (controller dies)
      attach + putenv(PCT), putenv(LD_PRELOAD), pct_on()
  statically linked, PCT unaware (controller dies)
      attach + putenv(PCT), putenv(LD_PRELOAD), implement our own
      dlopen() just for the code we need to run pct_on()/signal
      processing
  statically linked, PCT unaware (controller hangs around)
      attach + putenv(PCT), putenv(LD_PRELOAD), install remote timer,
      local sighdl
  statically linked, fork()ing (controller hangs around)
      all of the above, but also trace through fork() and exec()
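Circling back to the ndigits notes above, the branch-based unrolling
might look roughly like this for 32-bit values (only an illustrative
sketch; the measured versions live in test/ndigit.c):

```c
/* Branchy binary-search digit count, in the spirit of the unrolled
 * loop compared in test/ndigit.c.  The branches partition [0, 2^32)
 * by powers of ten; there are no table loads to speculate through. */
#include <stdint.h>

static int ndigits32(uint32_t v)
{
    if (v < 100000u) {                       /* 1..5 digits */
        if (v < 1000u)
            return v < 10u ? 1 : (v < 100u ? 2 : 3);
        return v < 10000u ? 4 : 5;
    }
    if (v < 10000000u)                       /* 6..7 digits */
        return v < 1000000u ? 6 : 7;
    if (v < 1000000000u)                     /* 8..9 digits */
        return v < 100000000u ? 8 : 9;
    return 10;
}
```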