_argv implementation notes
--------------------------

As far as I know this is a very reliable routine on Linux, OpenBSD,
FreeBSD, SunOS 4, Solaris 2+, OSF/Digital/Tru64 Unix, Irix 6, and
HP-UX.  I'd also like to know if it works on AIX.

We rely on the fact that the vector of environment pointers is stored
in memory immediately after the vector of argument pointers.  This is
the case in almost every Unix-like runtime executable layout.  It
tells us that environ - 2 is actually the same as argv + (argc - 1).

Now it is nearly universal that argv[argc - 1] points into the same
compact string data area that argv[0] points into.  So environ[-2] is
a safe high memory fencepost: argv[x] will not point above this area
for any x.  The string data area always sits at higher addresses than
the pointer vectors, and the environment vector comes after argv.

The end of environ provides a low memory fencepost.  This one is more
approximate.  While some formats put the argv[] string data right
after the pointer data, others (Solaris, Linux, etc.) leave 40 or 50
bytes between these two regions.  Hence there is some chance that the
data just below argv will mimic a pointer into this region and we will
be fooled into thinking we are still in argv.  We should never
segfault, though, since this is valid memory.  We just might pick up
an extra fictional element or two of argv off the bottom end.

The memory looks like this:

    --- HIGH ADDRESS
    environ[envc - 1]
      ...
    environ[0]
    --- this gap is usually very small, almost always zero
    exact bnd_hi
    argv[argc - 1]
      ...
    argv[0]
    --- this gap is usually quite small, occasionally zero
    approx bnd_lo-4
    environ + envc   (*==NULL)
      ...
    environ
    --- this gap is almost always of zero size; the technique depends on it being zero
    argv + argc      (*==NULL)
    argv + argc - 1  (*==high fencepost)
      ...
    argv
    --- LOW ADDRESS

Notice that if the lower fencepost heuristic fails, we will never
segfault.
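Concretely, the fencepost scan just described might be sketched like
this.  This is a hypothetical sketch assuming the common Linux-style
layout (argv vector, NULL, environ vector, NULL, string data above,
and environ still pointing into the original stack, i.e. no setenv()
has reallocated it); none of these names are the routine's actual
ones.

```c
/* Illustrative sketch of the _argv fencepost technique, not the
 * library's real implementation. */
#include <stddef.h>

extern char **environ;

static char **infer_argv(int *argcp)
{
    size_t envc = 0;
    while (environ[envc] != NULL)
        envc++;

    char *hi = environ[-2];                  /* == argv[argc-1]: exact high fencepost */
    char *lo = (char *)(environ + envc + 1); /* just past environ's NULL: approx low fencepost */
    char **p = environ - 2;                  /* == argv + argc - 1 */

    /* Walk down while the word below still looks like a pointer into
     * the fenced string area; this may overshoot by an element or two
     * if the memory just below argv mimics such a pointer. */
    while (p[-1] >= lo && p[-1] <= hi)
        p--;

    *argcp = (int)((environ - 1) - p);       /* environ - 1 == argv + argc */
    return p;                                /* inferred argv (possibly a bit low) */
}
```

Note that the inferred vector can only gain fictional elements at the
bottom; its top, and hence argv[argc - 1], is exact.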
It just means that memory just below the true beginning of argv looks
like more elements of argv and points into a valid data area.  We
might get a crazy-looking string, but it will be some string, and the
distance to the NUL character will be reasonably short if the gap size
is small.  Note that if the non-string portion of our fenced area can
contain NUL bytes, then we can double-check that the string following
the NUL byte after our inferred argv[0] is our inferred argv[1].  Note
that strings are packed back-to-back on likely every Unix-like OS ever
written.

Occasionally the order of the pointer and string data areas is
reversed.  For OS layouts like these the same basic strategy applies,
since environ[-1] is still argv[argc].  The low-memory loop
termination needs to be slightly different, but a reasonably tight
lower bound does exist, namely the end of the string data area,
environ[last] + strlen(environ[last]).

profil.o/libprofil.so implementation notes
------------------------------------------

itimer.o/libitimer.so implementation notes
------------------------------------------

Circular files are 0-terminated/separated log files.  Basically,
always write out a pointer and a zero pointer atomically, but then
seek backward one pointer's worth.  With a limit of, say, 4 values
imposed, the file ends up looking like:

    x0  xx0  xxx0  y0xx  yy0x  yyy0

Here the x's are first-pass values and the y's are second-pass values.
A simple scan for the null value gives the old/new partition or, if it
occurs at the end, lets you know there was no wraparound.

addr2func implementation notes
------------------------------

The basic algorithm is an approximate binary search where the data is
subdivided by byte offset rather than by record number.  It scans
backward or forward to the nearest newline to bound the new record.
It assumes the name list is already sorted in ascending order by
address, as nm -n might produce.  The search could be an exact binary
search if the name list lines were padded with blanks to the right.
awk '{printf "%8s%2s %-21.21s\n",$1,$2,$3}' would do the trick for
this, but it either makes the file larger or truncates symbols.  The
approximate method seems to be pretty darn fast, though.

signal handler implementation notes
-----------------------------------

Assuming about 4 pointers per cache line and a common case of only
4..8 regions, it takes only 2..3 loads to get the whole array into L1
cache (one of them is for the pointer itself).  In the worst case it
might take 4 reads (32/4, not lg 32), and then two extra reads for
ctrbuf and whatever index is used.  So that is 6 reads, plus of course
the one read and one write for the counter variable itself.  The grand
total worst case is then 7 reads and 1 write, or 8 main-memory
accesses.  For 4 or fewer regions in the list, the cost is just
2R+1R+1R+1W, or 4R+1W.

Assuming 100 ns or better for RAM, our time overhead is roughly
between 500 and 800 ns per sample, and the sample period is never less
than 1/1024 sec.  Thus a good worst-case overhead estimate is "under
.08%".

If our region array data stays in higher-level caches (which do much
better than 100 ns latency), things are obviously much, much better.
However, CPU-intensive programs are often also memory-intensive
programs, and 0.001 seconds is an opportunity for over a million
accesses.  That could well wipe out the cache presence of our region
data.  But we do what we can: our main-memory accesses may well be
dwarfed by the kernel's, and 100 ns is a pretty conservative estimate
for main-memory latency.

region_discover Linux notes
---------------------------

region_discover populates region[] from the memory areas marked
executable in /proc/PID/maps.  Linux guarantees this list is sorted by
address.  The routine forces its caller to pass in buf so that file
buffering can be stack allocated.

PC translation IO notes
-----------------------

It would be nicest to simply have a pct-print program which dumps out
count-PC pairs in an easily manipulable format.
Unfortunately, nice translator programs, e.g. addr2line, do not expect
anything besides a series of PCs, so the pipeline gets horrifically
complex.  Specifically, you need something like:

    pct-print --just_pcs foo.pct | addr2line -e foo > tmp_file
    pct-print --just_cts foo.pct | paste tmp_file -

This has the ugly property of making multiple passes over all the pct
files and creating temporary files, neither of which is really
necessary.  It might be prettier if pct-print sent output to a pair of
file descriptors, separately buffered:

    pct-print -F counts_fifo foo.pct | addr2line -e foo | paste counts_fifo -

Users *are* intended to understand these pipelines, though, and this
strays far enough from typical usage to require explanation even for
systems types.  To keep the burden on users simple, we make pct-print
much more complex.  To avoid funky pipeline argument splitting, we
formulate a trivial translator-program protocol patterned after
addr2line: any PC translator must read addresses from its stdin and
write translated coordinates to its stdout.  pct-print does the proper
IO flow control to manage these translator programs as co-processes.

Another issue is that various 'types' of binary 'foo' might need
different PC translation programs, depending on how much debugging
data is available in the object file.  This could be inferred somehow
and placed in a header field of the pct file.  This sort of
structuring would still allow pct-print to take multiple files in its
argument list and smartly adapt the translation program on a per-pct
file basis.

num2str notes
-------------

Care is taken to batch output.  This makes an enormous (10X)
difference for binary.  Also, we need to make sure that % and / always
have compile-time constant second arguments.  This allows gcc (and
probably other compilers) to turn the divide into a multiply.

***TODO XXX Should time a floating point reciprocal implementation, too.
***TODO XXX Complete formatting/padding enhancements.
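The batching and constant-divisor points above might look like this in
a sketch (base 10 shown; the function name and exact buffering are
illustrative, not the library's own):

```c
/* Sketch of batched numeric output with compile-time constant
 * divisors.  Digits are produced backward into a small local buffer
 * and copied out once, so output is batched; '% 10' and '/ 10' have
 * constant second arguments, letting the compiler apply the
 * divide -> multiply trick. */
#include <string.h>

static size_t u2str10(unsigned long v, char *out)
{
    char tmp[24];                      /* enough for 64-bit decimal */
    char *p = tmp + sizeof tmp;

    do {
        *--p = (char)('0' + v % 10);   /* constant divisor */
        v /= 10;
    } while (v != 0);

    size_t n = (size_t)(tmp + sizeof tmp - p);
    memcpy(out, p, n);                 /* single batched copy */
    out[n] = '\0';
    return n;
}
```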
str2num notes
-------------

Pretty cookie-cutter.  We use a whole byte-value array to speed up
parsing.

ndigits notes
-------------

There are a couple of different unrollings of the basic binary search
loop.  It turns out that the one with branches (see test/ndigit.c) is
quite a bit faster than the one without (on x86 and Alpha ev56).
Presumably this is because CPUs are much better at speculating
branches than at speculating memory loads.

defunct ptrace.c notes
----------------------

Things are similar in this program to itimer.c.  The main difference
is that inside our timer signal handler we use ptrace() to grab data
for a target process instead of for ourselves.  Options that we might
ordinarily have taken from PCT we can take from the command line.  We
can call getenv() on the target process to infer environ, and from
environ get argv[0] via _argv().  If the target process is dynamically
linked, we can call putenv() in its context to activate profiling for
exec()d children.  fork()ed children are a hole; we need to trace
fork() calls in the target process to catch them.

OK.  So there is really a list (in preferred order) of post-startup
methods, depending on the kind of executable we are attaching to.
This list is also roughly in order of increasing implementation
complexity:

  already linked/LD_PRELOADed (controller dies)
      attach + invoke putenv(PCT), ?pct_on
  dynamically linked, PCT unaware (controller dies)
      attach + invoke putenv(PCT), putenv(LD_PRELOAD), dlopen(itimer), ??pct_on
  statically linked, PCT aware (controller dies)
      attach + putenv(PCT), putenv(LD_PRELOAD), pct_on()
  statically linked, PCT unaware (controller dies)
      attach + putenv(PCT), putenv(LD_PRELOAD), implement our own
      dlopen() just for the code we need to run pct_on()/signal
      processing
  statically linked, PCT unaware (controller hangs around)
      attach + putenv(PCT), putenv(LD_PRELOAD), install remote timer,
      local sighdl
  statically linked, fork()ing (controller hangs around)
      all of the above, but also trace through fork() and exec()
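Circling back to the ndigits notes above, the branch-based unrolling
might look roughly like this for 32-bit values (only an illustrative
sketch; the measured versions live in test/ndigit.c):

```c
/* Branchy binary-search digit count, in the spirit of the unrolled
 * loop compared in test/ndigit.c.  The branches partition [0, 2^32)
 * by powers of ten; there are no table loads to speculate through. */
#include <stdint.h>

static int ndigits32(uint32_t v)
{
    if (v < 100000u) {                       /* 1..5 digits */
        if (v < 1000u)
            return v < 10u ? 1 : (v < 100u ? 2 : 3);
        return v < 10000u ? 4 : 5;
    }
    if (v < 10000000u)                       /* 6..7 digits */
        return v < 1000000u ? 6 : 7;
    if (v < 1000000000u)                     /* 8..9 digits */
        return v < 100000000u ? 8 : 9;
    return 10;
}
```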