Compute Distribution Index Freeware

Jun  6 2006 naddi.c ASCII C program text checksum manpage (nroff) manpage (html)

The index computed by this program is a single number. It says something only about distribution and nothing about absolute values, assumes no distribution shape whatsoever, and takes all values into account. See the source. A naddi of 100% is a completely equal distribution and a naddi of ~0% is a completely unequal one. N.A.D.D.I. stands for `normalized average data distribution index', an arbitrary name derived from the way the index is computed. It can be used to compare and track data distribution issues (such as income distribution).

 Compiling on GNU/Linux (link with math library):
 
> gcc naddi.c -o naddi -lm
Example runs:
The program reads pairs of numbers: the first number in a pair is a value, the second is the frequency of that value, the third is the value of the next pair, and so on. Numbers are separated by any white space and are read until end-of-file. The order of the pairs matters only when requesting the median and related properties; otherwise any order is fine.
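The input format described above can be sketched in a few lines of C. This is only an illustration, not the code from naddi.c: the names `struct pair` and `read_pairs` are hypothetical, and stopping at the first non-numeric token is an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* One input pair: a value and how often that value occurs. */
struct pair {
    double value;
    double frequency;
};

/*
 * Read whitespace-separated "value frequency" pairs from `in` until
 * end-of-file or the first token that is not a number. At most `max`
 * pairs are stored in `out`; a trailing value without a frequency is
 * not counted. Returns the number of complete pairs stored.
 */
size_t read_pairs(FILE *in, struct pair *out, size_t max)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (n < max &&
           fscanf(in, "%lf %lf", &out[n].value, &out[n].frequency) == 2)
        n++;
    return n;
}
```

Because `%lf` skips any leading white space, spaces, tabs and newlines all work as separators, and exponent notation such as `1e6` is accepted.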

This index is useful because other indexes, such as the Gini coefficient, require a certain data distribution shape. Other indicators, such as a comparison of the highest with the lowest value, do not take all data into account. When comparing income distributions, it is useful and more objective not to depend on a certain income distribution shape.

The way this index is computed should be its defense. It first computes an `element index' for every value. The element index measures how far a value lies from the average, but it warps very high values so they approach 1 and very low values so they approach 0.

Money that is bottled up with the rich produces less material happiness in the rich than it would in the relatively poor, or more accurately, than it would in the rich if they were poor. Hence the more disparity, the less absolute material happiness is extracted from the same amount of money. If someone says he likes to be poor, then making this person rich presumably means he throws his money away, but will he throw away the final 10-unit bill that buys him food for tonight? Not likely, which shows that the more money a person has, the less it is worth to him per unit. The same goes for rich people. It is an individual rule that the formula standardizes, assuming that different reactions to the same amount of wealth cancel each other out.

The formula simply calls everyone who possesses average wealth `50% material-happy.' It assumes that some people would really reach only 20% of their potential 100% material happiness at average wealth while others would reach 80%, and that these cancel each other out, resulting in a 50% average for the population. It is not important to this formula that some people might be 100 times as happy with one third of average wealth as someone else who has triple average wealth. Such issues are beyond the capacity of this tool, and might also cancel each other out. People on double the average wealth are assumed to be, on average, at 70% of their 100% material happiness, while people on half the average wealth are assumed to be at 25%, and so on. People on 100 times average wealth are at 99%, people on one fifth of average wealth are at 3%, and people on 1/100th of average wealth have a material happiness of ~0%. The naddi index adds all these assumed material happiness indexes together and computes their average.
Since that average would always be 50% or less (50% for absolute equality, with everyone on the average), it is multiplied by two, normalizing the naddi index to a 0%-100% range. An index that maxes out at 50% seems odd and unusual (though it has some good points: "fifty-fifty"). The formula works because the larger the sum total, the larger the average will be. But if the bulk of the sum is bottled up raising one person's material happiness from 99.94% to 99.95%, rather than raising someone else's from 35% to 55% for instance, a lot of height in the index disappears into the rich man's hole, as it were. Masses of money add only .01% to one person, but because they raise the average significantly for everyone, many other persons drop from, for instance, 55% to just 35%, because all element indexes are computed relative to that average. Computing happiness relative to the average reflects a psychological effect: people often experience their material happiness relative to the wealth around them.

Because the index says nothing about absolute values, it hangs a bit in the air when you see it. It should really be compared to other naddi indexes for different distribution sets, such as sets from a different time or place. After some usage on a known problem, it should start to mean something on its own. The naddi index has only relative meaning, though if you see through the formula there is perhaps some objective meaning as well. Quite different distribution shapes can result in the same index, which is particularly misleading when a distribution shape is out of the ordinary. When distribution shapes are equal, the comparison should be reasonably objective. If one data set has an unusual distribution shape, it would probably have to be marked as such. A `normal' (ant hill) distribution shape could have naddi 80%, but a distribution shape that consists of separate blobs of data of unequal size could also have naddi 80%, yet the two are very different. Most data around `value 18' and a little around `value 2' could equal one wider data cloud, such as around `value 17-19.' The small cloud around `value 2' can make the distribution just as unequal as the wider ant-hill cloud does. One can still compute naddi for all of them; it is just that 80% naddi for a `normal' distribution says something different than it does for a `separate blobs' distribution, an exponential distribution, etc. For fairness' sake, one would need to say what the shape is. Such are the limits of one number; there is no way around it. At least one can compute the index for any distribution, and in any case ``tightly around 18 and some at 2'' can equal ``widely around 18'' in "distribution," if one really wanted to compare them.

The program also computes other figures that say something about distribution: how many times the smallest value fits in the highest, the median, the totals and average for the data, etc. It can also compute the amount of value contained within the upper or lower N percent of frequencies, where you specify N (sort(1) the input first, see the manual). You can specify the precision. It does not compute standard deviation or Gini. There is obviously no reason why the index would not work for things other than money. The index depends on each value's ratio to the average, so it is sensitive to the absolute level of the values: the same spread around `990-1010' gives a different result around `90-110'. You can use the offset function to adjust this.
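The "share" figure (the amount of value held by the first N percent of frequencies) could be computed along the following lines. This is a guess at the method, inferred from the sample runs rather than read from naddi.c: it walks the pairs in input order and allows the last frequency to be split fractionally. The name `share_up_to` and its interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fraction of the total value held by the first `divide` fraction of
 * elements (0.0 to 1.0), walking the pairs in input order. Sort the
 * input first, e.g. highest value first for a "top N%" figure.
 * Assumption of this sketch: a frequency may be split, so 10% of 11
 * elements means 1.1 elements.
 */
double share_up_to(const double *value, const double *frequency, size_t n,
                   double divide)
{
    double total = 0.0;   /* sum of value * frequency */
    double count = 0.0;   /* sum of frequencies       */
    double share = 0.0;   /* value collected so far   */
    double remaining;     /* elements still to take   */
    size_t i;

    assert(divide >= 0.0 && divide <= 1.0);
    for (i = 0; i < n; i++) {
        total += value[i] * frequency[i];
        count += frequency[i];
    }
    assert(total > 0.0);

    remaining = divide * count;
    for (i = 0; i < n && remaining > 0.0; i++) {
        double take = frequency[i] < remaining ? frequency[i] : remaining;
        share += value[i] * take;
        remaining -= take;
    }
    return share / total;
}
```

For the pairs `10 1 1 10` and a divide of 10%, this yields 0.505, which rounds to the 51% reported by `naddi --share --divide=10% --fizz` in the example runs.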

 
> naddi 8 1 16 1 32 1 <^D> 87.4%
> naddi 600 1 300 1 2000 1 <^D> 76.7%
> naddi 1000 1 1000 1 1000 1 <^D> 100.0%
> naddi 1000000 1 1 1 3 3 9 6 2 9 <^D> 9.7%
> naddi 1000 1 1 1 3 3 9 6 2 9 <^D> 10.6%
> naddi 100 1 1 1 3 3 9 6 2 9 <^D> 46.7%
> naddi 10 1 1 1 3 3 9 6 2 9 <^D> 78.4%
> naddi 10 1 1 1 3 3 9 6 10 9 <^D> 93.1%
> naddi -v 10 1 1 100 <^D>
10 factor ( maximum ( value ) / minimum ( value ) )
94.9% distribution ( naddi )
> naddi -v2 1e6 1e0 1e4 5e2 <^D>
6e+06 sum ( value )
501 elements ( frequency )
11976 average ( value )
100 factor ( maximum ( value ) / minimum ( value ) )
10000 minimum ( value )
10000 middle ( value )
1e+06 maximum ( value )
58% share ( value% up to middle )
44% minimum ( element index )
44% middle ( element index )
99% maximum ( element index )
87.4% distribution ( naddi )
> naddi -v3 1000000 1 100000 10 10000 500 <^D>
value frequency element-distribution ( 0% - 100% )
1e+06 1 99%
100000 10 91%
10000 500 39%
7e+06 sum ( value )
511 elements ( frequency )
13698.6 average ( value )
100 factor ( maximum ( value ) / minimum ( value ) )
10000 minimum ( value )
10000 middle ( value )
1e+06 maximum ( value )
4.445e+06 share ( value up to middle )
64% share ( value% up to middle )
39% minimum ( element index )
39% middle ( element index )
99% maximum ( element index )
79.7% distribution ( naddi )
> naddi --fraction 1000000 1 100000 10 10000 500 <^D> 0.7966681475
> naddi --share --divide=10% --fizz 10 1 1 10 <^D> The top 10% corresponds to 51% of the value.
> naddi --help
Usage: naddi [OPTIONS ...] [INFILE [OUTFILE]]
Compute distribution index.
 -a      --average          sum, frequency and average of element values
 -dF[%]  --divide=F[%]      -e, -E and -s 'middle' Fraction (default 50%)
 -e      --extreme          minimum, 'middle', maximum element values
 -E      --extreme-index    minimum, 'middle', maximum element index
 -f      --fraction         distribution as fraction
         --formula          print the distribution formula
         --fizz [--fizz]    print --share verbosely
 -h      --help             this
 -oV     --offset=V         offset V for values for index computation
 -pN     --precision=N      precision N for calculations and output
 -s      --share            fraction of value until 'middle'
 -v[N]   --verbose[=N]      verbosity N (max 4)
         --version          print program version
Reads (sorted) pairs of numbers separated by whitespace.
First number: a value, second number: its frequency.
> naddi --formula
                       Average-Data / Element-Data
naddi = 2 * Sum { .5 ^                             } / Total-Elements

Average-Data   = Total-Data / Total-elements
Total-Data     = The total of all data corresponding to all elements.
Element-Data   = The data corresponding to element N = 1, 2, 3, ..., N+1.
Total-elements = The total number of elements.

Element-Index *) = 0.5 ^ ( Average-Data / Element-Data )
Sum { ... }      = The Element-indexes for each element, added together.
Average-Index    = Sum { Element-Indices } / Total-elements
Normalized-Average-Data-Distribution-Index = 100% * 2 * Average-Index

( ^ means `power' )
*) Printed in verbose mode.
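
For readers who prefer conventional notation, the formula printed by `naddi --formula` can be written as below. The symbols are assumptions introduced here and do not appear in the program output: N is the total number of elements, x_i is the data of element i, \bar{x} is the average and e_i is the element index. The second line works through the first example run (values 8, 16 and 32, each with frequency 1).

```latex
\[
  \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i ,
  \qquad
  e_i = 0.5^{\,\bar{x}/x_i} ,
  \qquad
  \mathrm{naddi} = 100\% \cdot \frac{2}{N}\sum_{i=1}^{N} e_i
\]
\[
  \bar{x} = \tfrac{56}{3} \approx 18.67 ,
  \qquad
  (e_1, e_2, e_3) \approx (0.198,\ 0.445,\ 0.667) ,
  \qquad
  \mathrm{naddi} \approx 100\% \cdot \tfrac{2}{3} \cdot 1.311 \approx 87.4\%
\]
```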