The infrastructure we have now is a very good start. It contains a fair number of tests and can assure that the main performance/responsiveness criteria will be fulfilled in a given release. However, it does not yet seem able to catch a reasonable share of regressions soon enough. We should extend it so that any performance improvement made in the codebase is durable. Currently even a huge effort put into a performance improvement vanishes soon after it is done, because the time or memory gained is usually consumed immediately by some other, less efficient part of the IDE. This should not happen. Another issue is that we cannot detect precisely enough where and when a given regression occurs. This issue may seem less important than it really is. Without proper detection almost all the issues end up in the Core, Editor, or Other/Performance team category, which leaves these teams overloaded with performance tasks. Performance-breaking changes from other teams have to be detected, and sometimes even fixed, by the core developers. Even if we decided that this is somehow OK, it will not work later, when we want separate repositories for teams with semiautomatic pushes to the main repository after the tests (including performance tests) pass.
The biggest problem we have now seems to be high variance in the results. This may have several causes and several solutions:
Looking at the current methodology documentation, it is not possible to find out the actual setup of the computers. Do the tests set up the whole environment before running? Is it virtualized somehow? What is the computer doing besides running the tests? Is it connected to the network? Does it really always start fresh? Is the filesystem fresh each time, or does it accumulate changes from previous setups and test runs? If everything is OK, fine. If not, can we change it somehow?
Current UI measurement methods are based on measuring differences between INPUT/Paint events. This may or may not be reliable, depending on conditions. A more precise method of measuring time is to add reporting hooks directly into the code (which already happens to some extent) and use these hooks in the tests.
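A reporting hook of this kind could look roughly like the sketch below. It is only an illustration: the PerfHook class, its Listener interface and the measure method are hypothetical names, not an existing API of the IDE. The code under measurement wraps an operation in measure, and a test registers a listener to collect the durations.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical reporting hook; an assumption for illustration, not an existing IDE API. */
final class PerfHook {
    /** Receives one record per finished operation. */
    interface Listener {
        void finished(String operation, long nanos);
    }

    // Copy-on-write list so listeners can be added or removed while an operation is running.
    private static final List<Listener> LISTENERS = new CopyOnWriteArrayList<>();

    private PerfHook() {}

    static void addListener(Listener l) {
        LISTENERS.add(l);
    }

    static void removeListener(Listener l) {
        LISTENERS.remove(l);
    }

    /** Runs the operation and reports its wall-clock duration to all listeners. */
    static void measure(String operation, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            // Report even if the operation throws, so failed runs are still visible.
            long elapsed = System.nanoTime() - start;
            for (Listener l : LISTENERS) {
                l.finished(operation, elapsed);
            }
        }
    }
}
```

A test would call PerfHook.addListener before triggering the action and read the reported durations afterwards, instead of inferring the duration from INPUT/Paint event timestamps.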
Try to make the measurements relative to some plain operation rather than an absolute number of milliseconds. This is not to say that the absolute number has no value. It is useful for checking whether an operation fits the given performance/responsiveness criteria (the current tables with the red/yellow/green fields), and it should continue to be measured, maybe with some improvements. Looking at the results of the pe
However, the absolute number has less value for detecting regressions. For this case we should make the measurement relative to some plain operation. E.g.:
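One possible way to express a relative measurement is sketched below. The choice of the "plain operation" is an assumption here (a fixed CPU-bound loop); the RelativeTiming class and its method names are hypothetical. The idea is to time the baseline and the measured operation on the same machine in the same run and to report only their ratio.

```java
import java.util.Arrays;

/** Hypothetical helper; the baseline operation is an assumption, not a prescribed choice. */
final class RelativeTiming {
    // Written by baseline() so the JIT cannot discard the loop as dead code.
    static volatile long sink;

    private RelativeTiming() {}

    /** Assumed "plain operation": a fixed amount of CPU-bound work. */
    static void baseline() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i * 31L;
        }
        sink = sum;
    }

    /** Times the operation 'runs' times and returns the (upper) median in nanoseconds. */
    static long medianNanos(Runnable op, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            op.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    /** Duration of the operation expressed in multiples of the baseline duration. */
    static double relative(long operationNanos, long baselineNanos) {
        if (baselineNanos <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (double) operationNanos / baselineNanos;
    }
}
```

A test would then store relative(medianNanos(operation, n), medianNanos(RelativeTiming::baseline, n)) in the history. A slower or busier test machine should shift both timings in the same direction, so the ratio ought to vary less than the raw milliseconds, although this depends on how well the baseline resembles the measured operation.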
There does not seem to be any Apple test
It would be useful to compare the relative values for different operating systems. Such a comparison could point developers to performance differences between the operating systems and help avoid the usual mistakes in the future. A nice example is JarEntry.inputStream, which returns a BufferedInputStream on Linux and a plain InputStream on Windows.
Test the platform itself
The testing infrastructure is as prone to errors as any other system. Failing disk sectors or a broken DIMM can skew the numbers a lot. Therefore, from time to time run some empty tests several times on the infrastructure and the hardware used. Keep the history, compute the variance, and keep it minimal. If a bug occurs in these tests, fix it immediately, as it would impact all the measurements.
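The variance check on the empty-test history could be computed along these lines. This is a minimal sketch: the RunStability class, the use of the coefficient of variation as the stability measure, and the threshold parameter are assumptions, not part of the existing infrastructure.

```java
/** Hypothetical stability check for repeated runs of an empty test. */
final class RunStability {
    private RunStability() {}

    /** Arithmetic mean of the samples. */
    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /** Unbiased sample variance (divides by n - 1); needs at least two samples. */
    static double variance(double[] samples) {
        if (samples.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double m = mean(samples);
        double squares = 0;
        for (double s : samples) {
            squares += (s - m) * (s - m);
        }
        return squares / (samples.length - 1);
    }

    /** Standard deviation divided by the mean, so machines of different speed are comparable. */
    static double coefficientOfVariation(double[] samples) {
        return Math.sqrt(variance(samples)) / mean(samples);
    }

    /** True if the spread of the empty-test timings stays within the assumed threshold. */
    static boolean isStable(double[] samples, double maxCv) {
        return coefficientOfVariation(samples) <= maxCv;
    }
}
```

If isStable starts returning false for a machine whose history used to be stable, that points at the infrastructure or the hardware rather than at the IDE code.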
Create a web page that summarizes the improvements and regressions. Cluster the regressions correctly in order to be able to guess what happened. This is necessary in order to be able to use the test infrastructure for filing regression bugs.
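One simple way to cluster regressions is to group the tests by the build in which each of them first regressed, on the assumption that tests that got slower in the same build probably share a cause. The sketch below illustrates that idea only; the RegressionClusters class, the use of the first build as the reference value, and the tolerance parameter are all assumptions, and a real implementation would likely need a more robust reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical clustering of regressions by the build in which they first appear. */
final class RegressionClusters {
    private RegressionClusters() {}

    /**
     * Returns the index of the first build whose value exceeds the value of
     * build 0 by more than the given fraction, or -1 if there is none.
     */
    static int firstRegressedBuild(double[] series, double tolerance) {
        if (series.length == 0) {
            return -1;
        }
        double limit = series[0] * (1.0 + tolerance);
        for (int i = 1; i < series.length; i++) {
            if (series[i] > limit) {
                return i;
            }
        }
        return -1;
    }

    /** Maps a build index to the names of the tests that first regressed in it, sorted by name. */
    static Map<Integer, List<String>> clusterByBuild(Map<String, double[]> results, double tolerance) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        // Iterate over a sorted copy so the test names in each cluster come out in a stable order.
        for (Map.Entry<String, double[]> e : new TreeMap<>(results).entrySet()) {
            int build = firstRegressedBuild(e.getValue(), tolerance);
            if (build >= 0) {
                clusters.computeIfAbsent(build, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return clusters;
    }
}
```

The summary page could then list each cluster next to the changes that went into that build, which narrows down whom to file the regression bug against.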
Besides being able to detect regressions soon(ish), other details are also very important: