How the duplicates algorithm works in the exceptions reporter

The input of the algorithm is a new Throwable, and the output is the Exception object in the DB that the Throwable is a duplicate of. The algorithm consists of two filters.

The first filter is based on hashcodes. It computes a hashcode of the stacktrace and looks for the same hashcode in the DB. If the same hashcode is found, all lines of the stacktrace are verified to be the same, and the Throwable is marked as a duplicate of the issue assigned to that hashcode in the DB. If no matching hashcode is found, or the lines differ, the second filter is started.
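The first filter can be sketched as follows. This is a minimal illustration, not the reporter's actual code: the class HashFilter, the nested StoredReport type, and the in-memory map standing in for the DB are all hypothetical, and the hashcode is simply the hashcode of the list of stacktrace lines.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch of the hashcode filter; names are not from the reporter.
final class HashFilter {
    // Assumed shape of a stored report: issue id plus its stacktrace lines.
    static final class StoredReport {
        final int issueId;
        final List<String> lines;

        StoredReport(int issueId, List<String> lines) {
            this.issueId = issueId;
            this.lines = lines;
        }
    }

    // In-memory stand-in for the DB index: stacktrace hashcode -> report.
    // Simplification: one report per hashcode, later entries overwrite.
    private final Map<Integer, StoredReport> byHash = new HashMap<>();

    // Turns a Throwable's stacktrace into a list of text lines.
    static List<String> linesOf(Throwable t) {
        return Arrays.stream(t.getStackTrace())
                .map(StackTraceElement::toString)
                .collect(Collectors.toList());
    }

    void store(int issueId, List<String> lines) {
        byHash.put(lines.hashCode(), new StoredReport(issueId, lines));
    }

    // Returns the issue id of an exact duplicate, or empty if the second
    // filter has to take over.
    Optional<Integer> findExactDuplicate(List<String> lines) {
        StoredReport candidate = byHash.get(lines.hashCode());
        // Equal hashcodes do not guarantee equal stacktraces, so every
        // line is compared before the duplicate is accepted.
        if (candidate != null && candidate.lines.equals(lines)) {
            return Optional.of(candidate.issueId);
        }
        return Optional.empty();
    }
}
```

The line-by-line check after the lookup is what guards against hashcode collisions; the hashcode only makes the lookup fast.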

The second filter is based on line comparison. The first line of the new stacktrace is looked up in the DB, and all Exceptions with the same first line become candidates for a duplicate. Exceptions whose class (NPE, IOE, ...) differs from that of the new one are removed from this group of candidates; in some situations (for example ClassNotFoundException) the exception message is checked as well. In the next step the second line of the stacktrace is looked up in the DB, and the group of candidates is reduced to the Exceptions with the same first two lines. The reduction continues with the third, fourth, ... line until the last matching line is found. If the number of matching lines is higher than three (a configurable value), the new exception is marked as a duplicate of that exception.
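A minimal sketch of the second filter is given here, under stated assumptions. The names LineFilter, Report, commonPrefix and findDuplicate are hypothetical, the stored reports are held in a plain list instead of the DB, and the message check for cases such as ClassNotFoundException is left out. Instead of narrowing the candidate group line by line, the sketch computes the number of matching leading lines for each candidate and keeps the best one, which selects a candidate that survives the narrowing longest.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the line comparison filter.
final class LineFilter {
    // Assumed shape of a stored report.
    static final class Report {
        final int issueId;
        final String exceptionClass;
        final List<String> lines;

        Report(int issueId, String exceptionClass, String... lines) {
            this.issueId = issueId;
            this.exceptionClass = exceptionClass;
            this.lines = Arrays.asList(lines);
        }
    }

    // Number of leading stacktrace lines that are equal in both lists.
    static int commonPrefix(List<String> a, List<String> b) {
        int limit = Math.min(a.size(), b.size());
        int i = 0;
        while (i < limit && a.get(i).equals(b.get(i))) {
            i++;
        }
        return i;
    }

    // Returns the stored report sharing the most leading lines with the new
    // stacktrace, provided that count is higher than the threshold.
    static Optional<Report> findDuplicate(List<Report> stored,
                                          String exceptionClass,
                                          List<String> lines,
                                          int threshold) {
        Report best = null;
        int bestLength = 0;
        for (Report candidate : stored) {
            // Candidates with a different exception class are dropped.
            if (!candidate.exceptionClass.equals(exceptionClass)) {
                continue;
            }
            int length = commonPrefix(candidate.lines, lines);
            if (length > bestLength) {
                best = candidate;
                bestLength = length;
            }
        }
        return bestLength > threshold ? Optional.of(best) : Optional.empty();
    }
}
```

With a threshold of three, a new stacktrace that shares its first four lines with a stored report of the same class is reported as a duplicate, while one that shares only three lines is not.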

The purpose of the first filter is to find exact duplicates as fast as possible, thanks to the hashcode. The purpose of the second filter is to find the most similar stacktrace among all the reports and to decide whether it is a duplicate or not.

It's hard to judge the effectiveness of this algorithm. In situations where duplication shows up in the stacktrace, this algorithm should work well, but there are duplicates where just one line of the stacktrace is the same, and there the algorithm fails. The current design is open to new ideas and changes: anyone can write their own filter and insert it into the filtration queue. We would also need some metric for the quality of this algorithm, but not even a human is always able to decide whether two reports are duplicates, so it is unclear how to measure this quality.
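One way such a filtration queue could look is sketched here. The interface DuplicateFilter and the class FilterQueue are hypothetical names chosen for illustration, and the assumption is that each filter either returns the id of the issue it considers a duplicate or returns nothing, in which case the next filter in the queue is asked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical contract for a single filter: return the id of the issue
// the Throwable duplicates, or empty if this filter cannot decide.
interface DuplicateFilter {
    Optional<Integer> findDuplicate(Throwable t);
}

// Hypothetical queue that asks the filters in insertion order.
final class FilterQueue {
    private final List<DuplicateFilter> filters = new ArrayList<>();

    // Appends a filter to the end of the queue; returns this for chaining.
    FilterQueue add(DuplicateFilter filter) {
        filters.add(filter);
        return this;
    }

    // The first filter that reports a duplicate wins; later filters are
    // only consulted when earlier ones found nothing.
    Optional<Integer> findDuplicate(Throwable t) {
        for (DuplicateFilter filter : filters) {
            Optional<Integer> hit = filter.findDuplicate(t);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }
}
```

In this arrangement the hashcode filter would be added first and the line comparison filter second, and a custom filter could be appended without changing either of them.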
