Tuesday, December 17, 2013

HistogramTools 0.3

Version 0.3 of HistogramTools is now on CRAN. HistogramTools provides a number of methods for manipulating histograms, measuring the distance between histograms, calculating the information loss due to binning aggregate data sets, and other tools useful for statistical analysis of binned/histogram data. It also uses RProtoBuf to provide a protocol buffer representation of the default R histogram class, so that histograms over large data sets can be computed and manipulated in a MapReduce environment with tools written in other languages.

The full list of updates includes:

  • Moved 'Hmisc' from Depends to Imports.
  • Improved introduction vignette significantly.
  • Added ScaleHistograms function.
  • Added PlotRelativeFrequency function to plot relative frequency histograms.
  • Added minkowski.dist, intersect.dist, kl.divergence, jeffrey.divergence measures for two histograms.
  • Added PreBinnedHistogram for creating histogram objects from an already binned dataset (e.g. just a vector of bins and counts).
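To give a feel for what the new distance measures compute, here is a rough sketch in Python rather than R. It uses textbook definitions on plain vectors of bin counts and assumes both histograms share identical bin breaks. The function names only mirror the package's names for readability; this is not the package's implementation, and the exact normalisation used in HistogramTools may differ.

```python
import math


def _normalize(counts):
    """Convert raw bin counts into relative frequencies."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one observation")
    return [c / total for c in counts]


def _check(c1, c2):
    """Both histograms are assumed to share the same bin breaks."""
    if len(c1) != len(c2):
        raise ValueError("histograms must have the same number of bins")


def minkowski_dist(c1, c2, p=2):
    """Minkowski distance of order p between two count vectors."""
    _check(c1, c2)
    return sum(abs(a - b) ** p for a, b in zip(c1, c2)) ** (1.0 / p)


def intersect_dist(c1, c2):
    """One common form of intersection distance: 1 minus the overlap,
    normalised by the size of the second histogram (an assumption here)."""
    _check(c1, c2)
    return 1.0 - sum(min(a, b) for a, b in zip(c1, c2)) / sum(c2)


def kl_divergence(c1, c2):
    """Kullback-Leibler divergence of the relative frequencies.
    Infinite when c2 has an empty bin where c1 does not."""
    _check(c1, c2)
    total = 0.0
    for p, q in zip(_normalize(c1), _normalize(c2)):
        if p == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if q == 0:
            return math.inf
        total += p * math.log(p / q)
    return total


def jeffrey_divergence(c1, c2):
    """Symmetric, smoothed variant of KL that compares each histogram
    against the bin-wise mean of the two."""
    _check(c1, c2)
    total = 0.0
    for p, q in zip(_normalize(c1), _normalize(c2)):
        m = (p + q) / 2.0
        if p > 0:
            total += p * math.log(p / m)
        if q > 0:
            total += q * math.log(q / m)
    return total
```

For two histograms with counts such as [1, 2, 3] and [3, 2, 1] over the same bins, each function returns a single number; identical histograms give zero under all four measures.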

Dirk's CRANberries service provides a diff to the previous release 0.2. More information is at the HistogramTools page on CRAN, which includes the 18-page package vignette and 1-page Quick Reference Guide. Please mail me directly with any questions or suggestions about this package.

Friday, June 7, 2013

New Work on Flash Provisioning at USENIX ATC on June 26

Later this month I'll be at the USENIX Annual Technical Conference in San Jose with some coauthors on the Storage Analytics and Colossus teams at Google to present some of our recent work on optimizing flash provisioning for cloud storage workloads. Our paper is titled "Janus: Optimal Flash Provisioning for Cloud Storage Workloads", and a pre-print is available from Google Research.

This work is about using statistical samples of I/O patterns from a large distributed filesystem to formulate and solve an optimization problem that helps us allocate flash better in our datacenters. I'm looking forward to returning to USENIX ATC as it has been nearly 10 years since I've been to this conference. Shoot me a mail if you will be there and want to meet up.

Sunday, October 14, 2012

Two Recent Short Papers

My group at Google continues to grow, and we had the opportunity to publish two short workshop papers this year about some of the areas we've investigated.

The first paper describes some of the work we've done on forecasting storage growth in datacenters for capacity planning purposes using ensemble forecasting methods and trend-change detection. It builds on some of the earlier work we did for websearch traffic forecasting and, to a lesser extent, building a market economy for datacenter resources.

The second paper, to which I made only minor contributions, is a more mathematical description of a method of quantifying the uncertainty in aggregate metrics from a sampled RPC tracing system for large-scale distributed systems (e.g. Dapper).

Both papers address problems that usually come up in very large-scale distributed systems, and the applicability is somewhat limited in smaller contexts, but I would be very interested in feedback regardless.

Thursday, September 6, 2012

Cycling 320+ Miles Next Week for Charity

Next week I will be cycling from Eureka to San Francisco for the California Climate Ride. I'll be mostly out of touch for the week, but will try to post pictures and check in via email and mobile phone when possible. Please consider donating towards my fundraising goal to support the San Francisco Bicycle Coalition.

Thursday, October 28, 2010

What I've been up to..

It's been nearly a year since I posted here and much has changed. The obvious and most important change is a second new addition to our family, which I've been blogging about elsewhere. On the work front I was able to publish a paper last year, "Availability in Globally Distributed Storage Systems", about some of my work at Google. This is an exciting space given the growth of cloud-based storage services and sophisticated distributed storage software.

I've been blogging a little more regularly about work-related topics on Google company blogs, with four posts so far this year.

As those posts show, I've been working on data analysis, distributed cloud storage, and open source, along with some other projects I'm not yet able to talk about. I'll try to post more updates about some of my interests and side projects in the remainder of the year.

Sunday, January 10, 2010

Fun with Amazon Web Services

Amazon has been doing a really great job at selling excess compute capacity in their datacenters through products such as Amazon Elastic Compute Cloud (EC2), Elastic MapReduce, and their simple and structured distributed storage products. The economics of this kind of model, as represented in the two graphs here, are clearly compelling. Instead of buying large numbers of computers to mostly sit idle, new start-up companies, researchers, and individuals can rent the excess capacity from Amazon. Last year I worked on some related ideas for internal pricing and provisioning of resources at work. This was my first direct experience with the Amazon consumer offerings, however, and I was impressed. It took less than half an hour last night to sign up, start a few basic Linux instances, copy some application code over, compile it, and begin running it on the Linux Xen instances.

Not everything is so easily scaled to run on more computers. Some tasks are more feasibly done with human involvement, and I've also been experimenting with Amazon Mechanical Turk. This service is named after the 18th-century fake chess-playing machine that actually used a hidden human operator to control the device. I have used this service recently to improve the captions for FreeBSD technical conference videos that I am involved with, and the results have been stunning.

The results of cheap on-demand distributed computer clusters and a global English-language workforce that can be paid by the task almost engender too many business ideas to contemplate... If only there were more hours in the day...

Sunday, June 7, 2009

Support Simon Singh and Scientific Debate

Simon Singh has been sued for libel by the British Chiropractic Association. Simon is an author, journalist, and TV producer who works to popularize math and science. I had the opportunity to hear Simon speak about an earlier book on the Big Bang at Keble College, Oxford. Simon wrote a more recent book on alternative medicine and suggests that there is no evidence for the efficacy of chiropractic treatments for asthma, ear infections, and other infant conditions. British libel laws are stricter than those in the U.S., and this scientific debate has, unbelievably, been construed as a form of libel. Read more about the dispute and sign the petition here.