QE Performance as part of the Fitness project

Current State vs. Desired State

The infrastructure we have now is a very good start. It contains a fair amount of tests and is capable of assuring that the main performance/responsiveness criteria will be fulfilled in a given release. However, it still does not seem to be in a state to catch a reasonable number of regressions soon enough. We should extend it so that any performance improvement made in the codebase will be durable. Currently, even a huge effort put into a performance improvement vanishes soon after it is done, as the extra time or memory gained is usually consumed immediately by some other, less efficient part of the IDE. This should preferably not happen.

Another issue is that we are not able to detect precisely enough where and when a given regression occurs. This issue may seem less important than it really is. Without proper detection, almost all the issues end up in either the Core, Editor, or Other/Performance team category, which leads to a situation where these teams are overloaded with performance tasks. Performance-breaking changes from other teams have to be detected, and sometimes even fixed, by the core developers. Even if we decided that this is somehow OK, it will not work later, when we will want to have separate repositories for teams and do semiautomatic pushes to the main repository after tests (including performance tests) pass.

Memory Leaks



The biggest problem we have now seems to be high variance in the results. This might have several reasons and several solutions:

Hardware/Software setup (virtualization, using images, disconnecting from network)

Looking at the current methodology documentation, it is not possible to find out what the actual setup of the computers is. Do the tests set up the whole environment before running? Is it virtualized somehow? What is the computer doing besides running the tests? Is it connected to the network? Does it always really start fresh? Is the filesystem fresh each time, or does it accumulate changes from previous setups and test runs? If everything is OK, fine. If not, can we change it somehow?

Methods of measurement (putting hooks directly into the code, disabling GC)

Current UI measurement methods are based on measuring differences between INPUT/Paint events. This may or may not be reliable, depending on conditions. A more precise method of measuring time is to add reporting hooks directly into the code (which already happens to some extent) and use these hooks in the tests.

  • Use structured logging to send well-defined messages to the testing infrastructure
  • Generate UI Gestures Collector info messages when something goes wrong with the performance, and send them to the gestures server so we can analyse them.
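As a sketch of the hook-based approach, the measured code could emit a structured start/finish record that the test harness parses, instead of inferring durations from INPUT/Paint events. The class name, logger name, and message format below are illustrative assumptions, not existing NetBeans APIs:

```java
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical reporting hook: wraps a measured operation and logs a
// well-defined record that the testing infrastructure can pick up.
public class PerfHook {
    private static final Logger PERF = Logger.getLogger("org.netbeans.perf");

    // Runs the work, returns elapsed nanoseconds, and emits a structured record.
    public static long timed(String action, Runnable work) {
        long start = System.nanoTime();
        work.run();
        long elapsed = System.nanoTime() - start;
        LogRecord rec = new LogRecord(Level.FINE, "ACTION_FINISHED");
        rec.setParameters(new Object[] { action, elapsed });
        PERF.log(rec);
        return elapsed;
    }

    public static void main(String[] args) {
        long ns = timed("open-editor", () -> { /* measured work goes here */ });
        System.out.println("elapsed >= 0: " + (ns >= 0));
    }
}
```

A test could then attach its own `Handler` to the logger and assert on the parameters of the records it receives, rather than timing UI events from the outside.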

Another (Flaska's) idea: keep memory dumps and compare them for different builds. Zezula's idea: use DTrace on Solaris for some core measurements.

Differences in scale (where it makes sense, ensure at least three measurements with different amounts of data)

  • Make all the tests work on different amounts of data (Small, Medium, Large)
  • length of the file
  • number of files
  • number of projects
  • number of modules loaded in the IDE
  • depth of the file structure
  • size of the Java classes closure
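Running every measurement at the three scales could look like the following sketch. The `Scale` sizes and the measured operation are placeholder assumptions; a real test would open that many files, projects, or modules:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.IntToLongFunction;

// Hypothetical sketch: run the same measurement at Small/Medium/Large scale,
// so a regression that only shows up at scale is not missed.
public class ScaleMatrix {
    enum Scale {
        SMALL(10), MEDIUM(100), LARGE(1000);
        final int units;
        Scale(int units) { this.units = units; }
    }

    // Runs the measured operation once per scale, recording elapsed nanoseconds.
    static Map<Scale, Long> measureAtAllScales(IntToLongFunction op) {
        Map<Scale, Long> results = new EnumMap<>(Scale.class);
        for (Scale s : Scale.values()) {
            long start = System.nanoTime();
            op.applyAsLong(s.units);            // e.g. open s.units files
            results.put(s, System.nanoTime() - start);
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Scale, Long> r = measureAtAllScales(n -> {
            long sum = 0;
            for (int i = 0; i < n; i++) sum += i;  // stand-in for real work
            return sum;
        });
        System.out.println(r.size()); // one result per scale
    }
}
```

Plotting the three numbers against each other also hints at the complexity of the operation (constant, linear, worse), which a single data point cannot show.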

Measuring absolute values (relativization of the values)

Try to make the measurements relative to some plain operation rather than an absolute number of milliseconds. This is not to say that the absolute number has no value. It is useful for checking whether an operation fits into the given performance/responsiveness criteria (the current tables with the red/yellow/green fields), and it should continue to be measured, maybe with some improvements.

However, it possesses less value for detecting regressions. For this case we should make the measurement relative to some plain operation. E.g.:

  • Typing in Java editor vs. typing in plain editor
  • Expanding a plain JTree and expanding nodes in explorer
  • Switching tabs in a Swing tab pane vs. switching editor tabs
  • Displaying code completion vs. displaying a plain popup with a list box
  • ...

This is necessary to do in order to be able to fulfill the regression detection and trend signalization goal. It will also be of importance if we decide to include some basic performance testing as a criterion for pushing changes into the main HG repository.
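A minimal sketch of such a relative measurement, with dummy workloads standing in for the real editor operations (the class and method names are illustrative assumptions):

```java
// Hypothetical sketch: report the ratio of the measured operation to a plain
// baseline operation, so results survive hardware changes better than raw
// millisecond counts do.
public class RelativeMeasure {
    static long time(Runnable r) {
        long start = System.nanoTime();
        r.run();
        return Math.max(1L, System.nanoTime() - start); // avoid division by zero
    }

    // e.g. typing in the Java editor (measured) vs. a plain editor (baseline)
    static double ratio(Runnable measured, Runnable baseline) {
        return (double) time(measured) / (double) time(baseline);
    }

    public static void main(String[] args) {
        double r = ratio(
            () -> { // stand-in for the measured operation
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 1000; i++) sb.append('x');
            },
            () -> { // stand-in for the plain baseline operation
                char[] buf = new char[1000];
                java.util.Arrays.fill(buf, 'x');
            });
        System.out.println("ratio > 0: " + (r > 0));
    }
}
```

A regression then shows up as the ratio drifting over builds, largely independent of which machine ran the test.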

Differences between the operating systems (use relative values to compare the operating systems)

There do not seem to be any tests run on Apple hardware.

It would be useful to compare the relative values for different operating systems. Such a comparison could point developers to performance differences between the operating systems and help to avoid the usual mistakes in the future. A nice example is JarEntry.inputStream, which returns a BufferedInputStream on Linux and a plain InputStream on Windows.

Test the platform itself

The testing infrastructure is prone to errors like any other system. Failing sectors on a disk or a broken DIMM can skew the numbers a lot. Therefore, from time to time, run some empty tests several times on the infrastructure and the hardware used. Keep the history, compute the variance, and keep it minimal. If a bug occurs in these tests, fix it immediately, as it would impact all the measurements.
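Computing the variance of such empty-test runs is straightforward; this sketch keeps a short in-memory history, though the real infrastructure would persist it across builds:

```java
// Hypothetical sketch: run an empty (no-op) test several times and compute the
// variance of the timings. A sudden jump hints at broken hardware or
// infrastructure rather than a product regression.
public class BaselineVariance {
    // Population variance of the sample array.
    static double variance(double[] samples) {
        double mean = 0;
        for (double s : samples) mean += s;
        mean /= samples.length;
        double var = 0;
        for (double s : samples) var += (s - mean) * (s - mean);
        return var / samples.length;
    }

    public static void main(String[] args) {
        double[] history = new double[10];
        for (int i = 0; i < history.length; i++) {
            long start = System.nanoTime();
            // empty test body: measures only infrastructure overhead
            history[i] = (System.nanoTime() - start) / 1e6; // milliseconds
        }
        System.out.printf("baseline variance = %.6f ms^2%n", variance(history));
    }
}
```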

Other Areas of Interest

Better Trend Signalization

Create a web page which would show the improvements and regressions summarized. Cluster the regressions correctly in order to be able to guess what happened. This is necessary in order to be able to use the test infrastructure for filing regression bugs.

Detection of Regressions

Besides being able to detect regressions soon(ish), other details are also very important:

  • Which module
  • Which operating system
  • Which scalability level
  • Which change in the CVS

© 2012, Oracle Corporation and/or its affiliates.