HgParallelProjectIntegration

Parallel Project Integration with Mercurial

This page describes an implementation of distributed development and integration for NetBeans. It reflects the current state and explains how the scheme guarantees the success of daily production builds, while allowing projects to continuously validate their work and integrate only when it is known to be stable.

A newer proposal not yet adopted is in HgPerDeveloperBranch.

Motto

Are you tired of waiting for ages before every integration for the following set of targets to finish?

ant clean
ant build
ant commit-validation

Switch to a project repository, integrate immediately, work on other things and let the project builder verify that your integration breaks nothing essential!


How does it work?

The solution is based on creating a pre-tested integration repository that is populated automatically and in parallel from each registered project's repository (like main, core-main, web-main, ergonomics and many more). But a picture is worth a thousand words. This is a view of the system from a single developer's point of view:

Image:parallel-repositories_HgParallelProjectIntegration.png

Work in your local repository, make changes, and verify they are sort of OK (no need to build everything or test everything). When you feel your work deserves integration: push. When you would like to see work done by other teams (tested and verified), or by your team members (untested, but you trust them, don't you?): pull.

As soon as your changes are in your project repository (like main, core-main, web-main, ergonomics and many more), your dedicated builder picks them up, verifies them and validates them (this is customizable; it is your builder, so tell it which tests it shall run). Failures on this level are OK and acceptable. It is your team's responsibility to deal with them (or at least evaluate them and report a bug in case the changes infiltrated from somewhere else).

If everything is OK, then once in a while the integration server picks your project's last known stable changes and merges them with already approved changes done by others. After verifying that the merge results in sane bits that compile and run, it pushes the change to the main-silver repository. Failures on this level shall be rare. If they happen, it is a sign of instability of the system or of the previous builder having done a bad job. Your team is supposed to evaluate the failure and (very likely) report a bug against the infrastructure.

After an integration server's successful build the verified changes are distributed back to each project's repository (the green arrow).

Twice a day the bits in the main-silver repository are taken by the production builder. It builds various targets not verified sooner (jnlp, javadoc, nbms, etc.) and, if things are OK, it publishes the bits and integrates the changes into main-golden for use by those who want absolutely stable revisions only. It also marks the latest stable revision with the "golden" tag (use hg log -r golden to see which revision is stable in your repository).

The automated pushes made on the various levels guarantee that basic quality criteria are fulfilled, while individual developers do not need to waste time on validation themselves. Btw. we have many tens of developers and about ten teams, so the above picture is a simplified slice; to get the whole image just multiply the project builders by ten and the individual developers by a hundred.

Precise algorithm

There is a dedicated integration server responsible for the integration. It holds a list of jobs and executes them continuously, one by one, populating the main-silver repository with all changesets known to meet the quality criteria for a production build. The daily build uses bits from the main-silver repository (known to be correct), and as such every run succeeds in creating the highly desirable production bits.

The current version of the integration algorithm can be seen in nbbuild/hudson/round-robin-push. From a high-level point of view, the Hudson integration server does the following:

  1. contact project build server and get last successful revision
  2. update from main-silver
  3. merge changesets from project repository up to the successful revision
  4. run build
  5. run validation
  6. if (ok) push to main-silver
  7. else fail the job
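
The seven steps above can be sketched roughly as follows. This is a minimal Python illustration, not the actual nbbuild/hudson/round-robin-push script: the IntegrationSteps type and every callable in it are hypothetical names introduced only for this example, and each one stands in for real Mercurial or Ant work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class IntegrationSteps:
    """Hypothetical hooks; each one stands for a step of the list above."""
    last_successful_revision: Callable[[str], str]  # step 1
    update_from_silver: Callable[[], None]          # step 2
    merge_up_to: Callable[[str, str], None]         # step 3
    build: Callable[[], bool]                       # step 4
    validate: Callable[[], bool]                    # step 5
    push_to_silver: Callable[[], None]              # step 6


def integrate(project: str, steps: IntegrationSteps) -> bool:
    """Run one integration job; return True if main-silver was updated."""
    revision = steps.last_successful_revision(project)
    steps.update_from_silver()
    steps.merge_up_to(project, revision)
    if steps.build() and steps.validate():
        steps.push_to_silver()
        return True
    return False  # step 7: the job fails and main-silver is left untouched


def round_robin(projects: List[str], steps: IntegrationSteps) -> Dict[str, bool]:
    """Process the registered projects one by one, in the given order."""
    return {project: integrate(project, steps) for project in projects}
```

The point of the sketch is the ordering: nothing is pushed to main-silver unless both the build and the validation of the merged result succeed.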

This process can be started either manually on request, or be triggered periodically, as is common in Hudson. The setting is individual to each project, e.g. main can have a different integration trigger than the ergonomics project. Please see BuildMachines for the actual topology of the build machines and the systems that they power.

Ramifications

Synchronization of project repositories

The project repositories need to be synchronized with main-golden from time to time. This can be done either by a human or automatically by a dedicated Hudson task, as is already the case for many repositories. The main repository will be populated automatically with changesets from main-golden after a successful production build.

Most projects would likely prefer to have their repositories hosted on hg.netbeans.org for visibility and consistency of developer access control, though a project repository could be hosted in a geographically closer location if necessary. hg.netbeans.org should also be backed up consistently, though for Mercurial this is not much of an advantage.

The project repository should use the same pre-push hooks as main does, so that for example a developer accidentally pushing a file using CRLF line endings will be blocked immediately at the project level.
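
As an illustration of the kind of check such a hook performs, here is a small Python sketch that detects CRLF line endings. It is not the actual NetBeans hook: the function names and the idea of passing a mapping from paths to raw contents are assumptions made for this example, and wiring the check into Mercurial's hook mechanism is left out.

```python
from typing import Dict, List


def has_crlf(data: bytes) -> bool:
    """Return True if the content contains a Windows (CRLF) line ending."""
    return b"\r\n" in data


def files_with_crlf(files: Dict[str, bytes]) -> List[str]:
    """Return the sorted paths whose content contains CRLF line endings.

    `files` maps a path to its raw content (hypothetical input shape).
    """
    return sorted(path for path, data in files.items() if has_crlf(data))


def check_push(files: Dict[str, bytes]) -> bool:
    """Return True if the push may proceed, False if it should be blocked."""
    offenders = files_with_crlf(files)
    for path in offenders:
        print(f"rejected: {path} uses CRLF line endings")
    return not offenders
```

A real hook would probably also need to skip binary files, where the byte pair can occur legitimately.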

Independent Project Build Servers

To keep the main-silver integration Hudson server from doing useless work, it is important to offload part of its load to the projects' own build servers. These do the initial build and evaluation of project changes, and only if everything succeeds does the main server try to integrate their changes.

These project build servers are supposed to run a whole build of the NetBeans IDE, the commit validation tests and any tests in their own area that are critical for the project and its developers. The produced build artifacts include the nbbuild/build/build_info file. The URL of this file is used by the main-silver Hudson integration server to identify the last successfully built revision of the given project repository.
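
How the integration server might read the revision out of that file can be sketched as below. The format of build_info is not described on this page, so the sketch simply assumes that the file consists of "Key: value" lines and that one key (here the hypothetical "Hg ID") carries the changeset identifier. Both function names are made up for this example, and downloading the text from the project builder's URL is left out.

```python
from typing import Dict, Optional


def parse_build_info(text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines into a dict; lines without a colon are ignored."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def last_successful_revision(text: str, key: str = "Hg ID") -> Optional[str]:
    """Return the revision stored under `key`, or None if absent or empty.

    The default key name is an assumption, not a documented field.
    """
    return parse_build_info(text).get(key) or None
```

Returning None for a missing revision lets the caller skip the integration instead of merging an unknown changeset.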

Full-test builds

The merge server is not intended to replace automated builds which run more complete test suites. In particular, its builds must be fairly quick, and fully reproducible.

QE will likely still want to run other automated builds which go through comprehensive test suites and report on the status of individual tests across builds, as well as across different platforms and operating systems.

Tests in a full suite might not be 100% reproducible. GUI mode tests are usually not completely reliable. Even some unit tests rely on thread or garbage collection semantics.

Examples of quick and repeatable tests suited for the merge server:

  1. Compilability of all IDE modules (of course).
  2. Compilability of stable AU modules (also available in the main repo).
  3. Core commit validation (no GUI): layer parsing clean, etc.
  4. All files in clusters tracked by NBMs.
  5. No binaries missing licenses.
  6. All tests in standard test config compilable.
  7. Selected fast unit tests from important modules all pass.
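
One way to picture such a suite is as an ordered list of named checks that stops at the first failure, which keeps failing runs short. The Python sketch below is only illustrative: the run_checks helper is made up for this example, and the lambdas stand in for real build or test invocations.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_checks(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first failure, or None."""
    for name, check in checks:
        if not check():
            return name
    return None


# Placeholder checks; a real job would invoke the build and test targets.
checks = [
    ("compile IDE modules", lambda: True),
    ("core commit validation", lambda: True),
    ("binaries have licenses", lambda: False),
    ("fast unit tests", lambda: True),
]
failed = run_checks(checks)
print("OK" if failed is None else f"FAILED: {failed}")
```

Because every check is deterministic and cheap, rerunning the same list on the same revision gives the same answer, which is the property the merge server depends on.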

Availability of fixes after push to project repositories

If a project keeps its repository in good condition, fixes pushed to it should propagate to main-silver within a few hours. This, together with 100% success of production builds, guarantees that project fixes will be propagated in less than a day. That is roughly on par with the current state when main is OK, and much faster when main gets broken by a bad integration.

My push-* job fails. What can I do?

Because the builders in our topology run on different machines, it may happen that some tests pass on the project builder but fail on the push machine. What can be done in this situation?

  1. If you know that your build is unlikely to pass, you can kill it. It is quite safe to kill a job waiting in the queue. It should be OK to kill even a running job.
  2. If you have a business need to get your changes into the main-golden repository sooner than others, you may also kill foreign jobs waiting in the build queue. Killing a foreign job that is already running, however, is not considered best practice (unless the job is known to fail).

The above can be done manually. Use with care! Btw. in critical situations it may even be possible to temporarily disable some of the push- jobs. Contact mzlamal at netbeans org to do it for you.

Resolution of merge conflicts and merge-induced failures

In the rare situation that there is a conflict between repositories, a project member is asked to take manual action: merge the content of main-golden or main-silver into the project repository and resolve the conflicts by hand. As all these repositories are expected to be found on hg.netbeans.org, the merge can be done locally on any computer. If one already has the project repository available, it should be a matter of a few minutes.
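
The manual merge described above amounts to a short sequence of standard Mercurial commands run inside a clone of the project repository. The Python sketch below only builds and prints that sequence, without executing anything. The manual_merge_plan helper, the commit message and the repository URL are assumptions made for this example; substitute the real location of main-golden or main-silver.

```python
from typing import List


def manual_merge_plan(upstream_url: str, message: str) -> List[List[str]]:
    """Return the hg commands to run inside a clone of the project repository."""
    return [
        ["hg", "pull", upstream_url],     # fetch changesets from main-golden or main-silver
        ["hg", "merge"],                  # merge the pulled head; resolve conflicts by hand here
        ["hg", "commit", "-m", message],  # record the merge changeset
        ["hg", "push"],                   # publish to the project repository (default path)
    ]


# The URL below is a placeholder assumption, not a verified location.
for command in manual_merge_plan("https://hg.netbeans.org/main-golden", "Merge main-golden"):
    print(" ".join(command))
```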

Branching

Branching for a milestone or during high resistance mode shall be done from main-golden. This guarantees the bits are verified (so there is no need to do a new build to verify that). On the other hand, it is necessary to wait until the desired changes from all teams propagate to main-golden. It is the responsibility of the projects having their own repositories to make sure their changes are propagated, with one exception: it is the responsibility of the person doing the clone to make sure that main is pushed by its builder. In case the build is broken, a warning shall be sent to broken_builds@netbeans.org and the creation of the clone delayed.

Where is main-silver?

Actually, the main-silver repository is more a concept than an actual place to watch and see the integrations. It is an implementation detail, and nobody really needs to care about it. In case you really believe you want to, the current main-silver is here.

Why did we accept this system?

There have been many attempts to improve our integration policies, including HgParallelTeamIntegration, HgPerDeveloperBranch and HgTeamIntegration. All of these require major changes to the developers' workflow. In contrast, this proposal builds on and reuses current coding best practices and, with the introduction of a minimal enhancement, gives us reliability, predictability, fewer regressions and teamwork scalability.

No production builds

If there is anything wrong with the main repository, the daily production builds fail. This creates a complete crisis, as everyone on the whole team is supposed to pay attention to it and solve it. However, as this crisis happens almost all the time, nobody really cares. As a result, the quality of our work constantly regresses and having a broken build feels sort of normal.

Completely broken builds

It is lamentably common for someone to not only commit a bogus change which includes uncompilable source code, but to immediately push it as well, breaking the repository for everyone. Furthermore, most people apparently fail to pay any attention to broken build emails, and often leave work right after pushing, leaving it to someone else to fix or back out their bad change.

Stop the world commits

Broken builds basically stop the work of all our developers and prevent everyone from delivering their improvements and fixes to our users. This is completely unnecessary, as almost all the time only part of the source base is broken and the rest is OK. It is caused by the "single threaded" nature of our integration, which allows everyone to push untested changes and thus block valid integrations.

Heavy merge contention

The contention is caused in part by people being too eager to push after committing, but also in part by there being too many people pushing into one central location.

Failing tests

Similarly, it is common for changes to cause some tests to fail. While not as severe as a broken build, an unstable build is of questionable quality. Some of the most important tests even cause the production build to fail. Again, this is implemented as a "stop the world" blocker, which is unnecessary and results in the problems described above.

In some cases it is hard to avoid the problems on a single developer's side, since the continuous builders run a lot of tests that developers do not, including some that require nontrivial setup (e.g. Glassfish, WTK). However, it is undesirable for breakages caused by one project to have a negative impact on other projects.
