The current performance infrastructure is set up to measure an average of multiple values. However, the average captures only one aspect of a random variable: two variables with the same average can still be very different. This appears to be the case for various time measurements in the current infrastructure as well, which is why we suggest adding another characteristic of the random variable: the standard deviation.
Currently, each value for a build+OS combination is taken from three or nine measurements. From these measurements the standard deviation should be computed as well; we then get a random variable representing measurements done on one OS and one build, and we know both its average and its standard deviation.
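As a minimal sketch of this step (the measurement values below are made up for illustration), the per-build+OS average and standard deviation can be computed with the standard library:

```python
import statistics

# Hypothetical raw data: nine timing runs (seconds) of one benchmark
# on one OS and one build.
measurements = [102.3, 98.7, 101.1, 99.8, 100.4, 103.0, 97.9, 100.9, 99.5]

mean = statistics.mean(measurements)
stdev = statistics.stdev(measurements)  # sample standard deviation (n - 1)

print(f"mean = {mean:.2f}, stdev = {stdev:.2f}")
```

With only three or nine samples the sample standard deviation (`n - 1` denominator) is the appropriate estimator, since we are estimating the mean from the same data.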
For each OS, the standard deviation should be stable from day to day. If it is not, then we are using the wrong kind of measurement; but let's suppose it is stable, for now.
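One way to check this stability assumption is to look at how much the daily standard deviations themselves fluctuate. The data and the 25% threshold below are illustrative assumptions, not part of the proposal:

```python
import statistics

# Hypothetical: sample standard deviations of daily runs over one week
# for a single OS+build combination.
daily_stdevs = [1.6, 1.5, 1.7, 1.6, 1.8, 1.5, 1.6]

# Flag instability if the stddevs fluctuate too much relative to their
# own mean (coefficient of variation); 0.25 is an arbitrary cutoff.
spread = statistics.stdev(daily_stdevs) / statistics.mean(daily_stdevs)
stable = spread < 0.25
print(f"spread = {spread:.3f}, stable = {stable}")
```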
As our time measurements are likely a random variable with a normal distribution, we can expect that if one day's result differs from the long-term average by more than three times the standard deviation, there is less than a 1% chance (about 0.3% for a normal distribution) that the deviation is mere noise, so we are most likely facing a regression (or improvement).
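The three-sigma check itself is a one-liner; the numbers in the example are hypothetical:

```python
def is_significant(today_mean, long_term_mean, long_term_stdev, k=3.0):
    """Three-sigma rule: flag a daily result that lies more than k
    standard deviations from the long-term average as a likely
    regression (or improvement)."""
    return abs(today_mean - long_term_mean) > k * long_term_stdev

# Hypothetical numbers: long-term average 100.4 s with stdev 1.6 s.
print(is_significant(106.1, 100.4, 1.6))  # |5.7| > 4.8 -> flagged
print(is_significant(101.9, 100.4, 1.6))  # |1.5| < 4.8 -> within noise
```

The threshold `k` could be lowered (e.g. to 2, roughly 95% confidence) if we prefer more sensitivity at the cost of more false alarms.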
This way we will be able to distinguish daily noise in the results from important regressions in our codebase.