Data Hacks is a new library we have developed at bit.ly: a set of command line tools to assist in data analysis.
We love the beauty of command line tools that read from stdin and write to stdout. These utilities do exactly that, and they help explore large data sets.
Included: a tool to calculate 95th percentile values, a histogram display, a tool to sample a percentage of stdin, and a tool to pass stdin to stdout for a set time period.
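To give a feel for what two of these tools do, here is a minimal sketch of line sampling and a 95th percentile calculation. It is not the code from data_hacks: the function names `sample_lines` and `percentile` are hypothetical, and the nearest-rank percentile method is an assumption chosen for simplicity.

```python
import random


def sample_lines(lines, percent, rng=random):
    """Yield each line with probability percent/100 (hypothetical helper)."""
    threshold = percent / 100.0
    for line in lines:
        # rng.random() returns a float in [0.0, 1.0)
        if rng.random() < threshold:
            yield line


def percentile(values, pct=95):
    """Nearest-rank percentile of a non-empty list; pct is assumed to be an integer."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped so the rank is at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]
```

Passing `sys.stdin` as `lines` turns `sample_lines` into a filter in the same spirit as the tools above. Sampling each line independently keeps memory use constant, at the cost of the output size only being approximately the requested percentage.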
For example, you can now run the following on the fly to get a histogram of request response times over a 30-second period. (In my case, awk '{print $NF}'
extracts the last column of an access log, which holds the response time.)
$ tail -f access.log | awk '{print $NF}' | run_for.py 30s | sample.py 10% | histogram.py --min=0 --max=0.05 --buckets=20
# NumSamples = 6809; Min = 0.00; Max = 0.05
# 313 values outside of min/max
# Mean = 0.014075; Variance = 0.001441; SD = 0.037954
# each * represents a count of 34
0.0000 - 0.0025 [ 404]: ***********
0.0025 - 0.0050 [ 2595]: ****************************************************************************
0.0050 - 0.0075 [ 1099]: ********************************
0.0075 - 0.0100 [ 1056]: *******************************
0.0100 - 0.0125 [ 476]: **************
0.0125 - 0.0150 [ 403]: ***********
0.0150 - 0.0175 [ 122]: ***
0.0175 - 0.0200 [ 81]: **
0.0200 - 0.0225 [ 37]: *
0.0225 - 0.0250 [ 32]:
0.0250 - 0.0275 [ 25]:
0.0275 - 0.0300 [ 26]:
0.0300 - 0.0325 [ 6]:
0.0325 - 0.0350 [ 29]:
0.0350 - 0.0375 [ 12]:
0.0375 - 0.0400 [ 25]:
0.0400 - 0.0425 [ 10]:
0.0425 - 0.0450 [ 28]:
0.0450 - 0.0475 [ 13]:
0.0475 - 0.0500 [ 17]:
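The bucketing behind a display like this can be sketched in a few lines. This is only an illustration under stated assumptions, not the histogram.py source: `bucket_counts` and `render` are hypothetical names, buckets are assumed to be equal width, and the bar scaling rule (tallest bar fits in 75 characters) is a guess.

```python
def bucket_counts(values, minimum, maximum, buckets):
    """Count values per equal-width bucket; also count values outside the range."""
    width = (maximum - minimum) / buckets
    counts = [0] * buckets
    outside = 0
    for v in values:
        if v < minimum or v > maximum:
            outside += 1
            continue
        # A value equal to maximum falls into the last bucket.
        index = min(int((v - minimum) / width), buckets - 1)
        counts[index] += 1
    return counts, outside


def render(counts, minimum, maximum, max_width=75):
    """Return the histogram as a list of text lines, one per bucket."""
    width = (maximum - minimum) / len(counts)
    # Ceiling division so the tallest bar is at most max_width stars.
    scale = max(1, -(-max(counts) // max_width))
    lines = ["# each * represents a count of %d" % scale]
    for i, count in enumerate(counts):
        low = minimum + i * width
        bar = "*" * (count // scale)
        lines.append("%.4f - %.4f [%6d]: %s" % (low, low + width, count, bar))
    return lines
```

Because bars are drawn with integer division, buckets with fewer samples than one star's worth show a count but no bar, as in the lower rows of the output above.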
For more information and examples see http://github.com/bitly/data_hacks
Update 2010/10/20: I’ve also added a utility to generate an ASCII bar chart.