ParsingAPIReview

Parsing API Inception Review

Contents


Submitter:

  • Jan Jancura (JJ)
  • Tomas Zezula (TZ)

Reviewers:

  • Tor Norbye (TN) (voting)
  • Marek Fukala (MF) (voting)
  • Petr Pisl (PP) (voting)
  • David Strupl (JavaFX)
  • Vladimir Kvashin (C++)
  • Vladimir Voskresenskyj
  • Frank Kieviet (JCAPS)


Review material

Talks during the review

TN01

Task/UserTask/MultiUserTask:

There's a run() method. Shouldn't there be a cancel() method? I assume some of the tasks can be slow, so when you for example switch away from this task it ought to be cancelled. A task which isn't capable of cancelling itself can simply ignore the cancel() method, but there should be a way for the infrastructure to take advantage of cooperative tasks willing to cancel when the result is known to be obsolete.

The cancel method should perhaps pass a context object passing some information about why the task was cancelled. File edited, Document Closed, Document Hidden are some possible reasons, and I can see tasks choosing to do different things based on these. Most tasks will probably just abort no matter what when they see a cancel, but for example a task which updates the persistent store (something like a Lucene index) will probably want to update itself for closed or hidden documents, but maybe not edited.

JJ Re TN01

UserTask and MultiUserTask both represents so called user tasks. User tasks are not cancelable. All other tasks extends SchedulerTask class, so they are cancelable.

Storage updater task should extends SchedulerTask too. There will be some special friend contract for this task, probably. Tomas already has some implementation for Java module.

TZ Re TN01

Task/UserTask/MultiUserTask - are user Tasks which have highest priority, they are not the factories task. Like in retouche they are not cancellable, the reason is that it is impossible to say which user task can interrupt other user task.


TN02:

Parser.parse(...), Parser.getResult(...)

There should be a way to register errors and warnings encountered during compilation. I see that there is a ParserException class - but I don't think that's the right way to do it. First, a compiler can often generate hundreds of warnings for a file, and certainly more than one. Throwing an exception during parsing (to indicate an error) means that you cannot register multiple errors without breaking control flow, and besides, exceptions should be used for unexpected conditions - and inside an IDE (where parsers are called while the user is typing source) will very often have to deal with errors.

Thus, I think there should be a way to register errors and warnings during parsing. It should probably be stored on the Parser.Result object such that a client can look at a parser result and see if there were any fatal or non fatal errors or warnings associated with the compilation.

JJ Re TN02

ParseException should be used when there is some unrecoverable error only (IOError, OutOfMemory...). It should not be used for regular errors and warnings in user code. Such errors should be collected in in parser result implementation. I do not want to add such support to Parsing API, because Parsing API has no support for visualization of errors / warnings.

TZ Re TN02

The ParseException is not designed for warning or error in source code, it's rather a parser abort, parser fatal error from which the parser cannot heal. The warning and errors should be collected in the returned Parser.Result

TN Re JJ

Jan wrote: "TN02: ParseException should be used when there is some unrecoverable error only (IOError, OutOfMemory...). It should not be used for regular errors and warnings in user code. Such errors should be collected in in parser result implementation. I do not want to add such support to Parsing API, because Parsing API has no support for visualization of errors / warnings."

How would somebody write a visualization of errors/warnings without it? For example, GSF wants to automatically show errors in the editor, and in the tasklist. If the information isn't in the parsing result, it would need to define its own ParserErrorResult interface and require all clients to implement that in their parser result classes.

TN03:

Parser.Result

What does the invalidate() method do? What does it mean that a parser result is invalid, that its source has changed so the result is no longer accurate?

One area I'm a bit confused about (which may be why I have questions above) is really what the lifetime of a Parser is. It looks from the ParserFactory like a Parser is created for a specific set of sources. The "getResult" and "invalidate" methods make it look like it can keep calling getResult and invalidate and after each invalidate call, the sources should be reparsed. This seems a bit cumbersome to me. If the results are really to be invalidated, why don't we just have a single parse method and call it with the list of sources each time they need to be parsed? Or if we need to keep track of the list of sources as a unit (such that a parser implementation can share state between parse jobs) perhaps we can have a ParserContext object that we pass around?


JJ Re TN03:

Parser is created for one, or more Snapshots. When some Source (snapshots.getSource ()) is changed, parser.parse method is called once and parser.getResult () method is called for each task registered for current mimeType. parser.getResult should always return new instance of result. result.invalidate () is called when task is executed. So it looks like:

parser.parse ();
for (Task task : tasks) {
    Result result = parser.getResult (task);
    task.run (result);
    result.invalidate ();
}

TZ Re TN03

Parser.Result.invalidate - comes from retouche NB 6.1, should be protected - not sure if it's not public now. It's used to verify confinement of the Parser.Result passed to the client (caller of the parsing API). The goal is to prevent clients to hijack the Result outside the critical section and prevent race conditions. The invalidate has nothing to do with editor changes.

TN04:

ResultIterator

I thought this would be an actual Iterator (e.g. hasNext, next, hasPrevious, previous). It should probably have a different name. I'm slightly confused about what it is. It seems like some sort of result collection, but with an unusual way to iterate it.

It seems to me that what you really want is to obtain the parser result for a particular mime type. I looked at the code in java.source and it was doing a breadth first search of results, looking for the Java mime type. Wouldn't it be better if you just had a MultiParserResult object with a Parser.Result getResult(String mimeType) method on it which let you directly access the parser result(s) you are interested in?


JJ Re TN04

ResultIterator

You are right, the name of this class may be confusing. I will look at it. Result iterator allow you to "iterate" over the tree of embeddings. You can start parser for current level, or you can iterate over all embedded blocks of code. It should not be a collection of results, because you sometimes do not need to parse everything. Parser.Result MultiParserResult.getResult(String mimeType) is not sufficient, I think. You may have more Embeddings with the same mime type in one file, for example (they can be merged to one Embedding, but separate blocks should be supported too). And MultiParserResult.getResult(String mimeType) does not give you full info. about all embeddings. Thats why I prefer current solution. But we can add MultiParserResult.getFirstResult(String mimeType) as "utility" method, of course.

TZ Re TN04

Renaming ResultIterator to something else is good idea. +1

TN05:

EmbeddingProvider, and language embedding in general

In general, I'm a bit confused about the Language Embedding story - and I couldn't really find any examples of this being used in the jet-parsing-api sources. Several of the javadocs pointed to an example for how to create an embedding or list of sources with some different mimetypes, but it would be great to see a more complete example.

Specifically: I assume EmbeddingProviders are the

translators capable of letting say the Java parser get a "Java view" of a JSP such that it can parse it using normal Javac. Let's say I have many EmbeddingProviders in the system, some handling JavaScript, some handling Java, some handling Ruby, etc. How are the Embedding Providers chosen by the system? I see there is a getPriority() method - what is that used for - task priority or choosing between EmbeddingProviders or something else?

I realize that EmbeddingProviders are registered in a mime folder under Editors, so there is some sort of mime type associated with them, but I expected there to be both "input" and "output" mimetypes for these. Here's what the similar class to EmbeddingProvider looks like in GSF:

 public interface EmbeddingModel {
     /** This model creates target source code of the given mime type */
     @NonNull
     String getTargetMimeType();
     /** This model is capable of translating to the {@link #getTargetMimeType}
      * from the following source mime types.
      */
     @NonNull
     Set<String> getSourceMimeTypes();
     /** Produce the source code (of mime type {@link #getTargetMimeType}
      * from the given document (which must have outer mime type of one
      * of the given {@link #getSourceMimeTypes}.
      */
     @NonNull
     Collection<? extends TranslatedSource> translate(Document doc);
 }

The point here is that an embedding model is capable of -creating- a model of a certain mime type. But it can only do that for certain -types- of documents.

For example, the JavaScript embedding provider in GSF has a target mime type of text/javascript, but it can handle input mime types for HTML, JSP, RHTML and PHP. When GSF needs to parse a certain language section, it looks at the target mime type, and looks at the source it has, and then does a table lookup among the source providers it has to produce an appropriate embedding. In theory this could involve multiple steps. Let's say there was no direct GSP (Grails) to JavaScript translator, but there is one which takes GSP to HTML. GSF could then first invoke the GSP to HTML translator, and then HTML to JavaScript.

I'm assuming the Parsing API will need to do this as well, but with the current APIs I don't see how the selection can be made.

JJ Re TN05

EmbeddingProvider, and language embedding in general

I think that our two models are equivalent. In parsing API you just implement HTMLToJavaScriptEmbeddingProvider, and register it to text/html. Then you should have PHPToHTMLEmbeddingProvider registerred in text/php, and RubyToHTMLEmbeddingProvider registered in text/ruby. HTMLToJavaScriptEmbeddingProvider is able to recognize blocks of JavaScript code embedded in HTML, and PHPToHTMLEmbeddingProvider returns blocks of HTML code in RHTML.

But you can implement PHPToJavaScriptEmbeddingProvider and register it to text/php too, if you need to do some special computations there (PHP>HTML>JavaScript).

Or you can register PHPToJavaScriptEmbeddingProvider to text/html/text/php.


EmbeddingProvider.getPriority ()

is used to sort embedding provider together with other tasks. Than Parsing API client can exactly define, what should be done sooner:

  • parsing of embedded language or top level one?
  • execution of tasks for inner language, or outer one?
  • etc..

TN06

Incremental Parsing

The other thing I'm thinking about is how to handle incremental parsing. This is something I've been thinking I need to add to GSF now. I've added some type analysis for Ruby and JavaScript, but I'm really running into problems where it's just too expensive to do full-file analysis of types. In order to go to the next level (e.g. looking up return types for all the function calls I see in a file) I really need to do incremental parsing - so that's one of the things I'd really like to address in GSF - and obviously in the Parsing API as well.

So, do you think we can add this to the parsing API? I think Tomas must have a lot of experience with this after adding incremental parsing support in Retouche for 6.1, right?

I haven't started this work for GSF, but my thinking was that on the implementation side, I would keep track of the "damaged areas" since the most recent parse, and pass that to the parser. I would ALSO pass it to all the user tasks, who can use the same information (and extra information from their incremental parser) to limit the work they have to do to update the corresponding data structures (e.g. semantic highlighting only needs to be redone for the local function - unless there are reasons it needs to go looking outside the current function, etc.)

TZ Re TN06

Yes, currently JavaParser handles partial reparse itself, but this is just a workaround to get it work somehow. The Parser.parse method was extended by SchedulerEvent (not sure about correctness of the name) by which the infrastructure sends hints to the parser about changes, currently there is an TokenHierarchySchedulerEvent which is passed to the parser in case of changes in editor and provides information about added, removed, changed tokens. JavacParser does not use it yet, but I will rewrite it soon.


VK01:

Need for multiple parsing threads

For C/C++, in the case machine has multiple CPUs or cores, parallel parsing in two or several threads is used now. This is essential for C/C++ support, because projects often are quite large (tens thousands of files) and C/C++ parsing is complex. As I understand, the proposed model is single threaded. Is this correct?

TZ Re VK01

The generic Parsing API cannot be multithreaded since most of the parsers are not thread safe,same of them are not even reentrant. So the Parsing API protects the SPI (Parser) to be called from multiple thread at given time. But you as an implementor of the Parser SPI, if you have a thread save parser which parses files in parallel, can still use thread pool to do the parsing using simple fork join pattern. The parsing API ensures that the your Parser.parse will be called in a thread save way. In the parse method you can fork the work into thread pool and await the result from the workers by the thread calling the Parser.parse.


VK Re TZ Re VK01

We don't use several threads to parse a single file. But parsing in parallel is useful when a lot of files should be parsed (which occurs mostly after user opens the project or changes some parsing-relevant properties). I don't see how your proposal work in this case.

JJ Re VK01

What to do when user opens project...

Current version of Parsing API does not have direct support for this usecase (so you should probably use some proprietary hack, like Java impl. and GSF). Tomas would like to design some new API for this purpose in the future (not for next release).

There is some possibility how to do it even in the current Parsing API. You can implement special TaskScheduler listening on Repository changes, and register some special SchedulerTask that will store result of parsing to some persistent storage. But this solution is not tested / supported, so I can not recommend it.


VK02

Who is responsible for identifying the list of documents to be parsed?

One more C/C++ specific is #including headers. A header file, which is included via #include directive, may not be in the project file list.

Nevertheless it should be treated as source; for example, it should be reparsed when changed, etc. But nobody except the parser itself nows about this.

How can this situation be solved in the model you propose?

JJ Re VK02

Who is responsible for identifying the list of documents to be parsed?

Parser. Parser should compute dependencies and parse them. Parser should listen on these files and fire changes (Parser.addChangeListener).

Y01

Provide reasonable usecases - I have a feeling that the first part of the wiki page describes the constraints, and the second part solution (at least it talks about Parser Request). I do not want to talk about the code, neither read it at this stage. I'd like some description of what is the usecase and how it will be handled on a general level. Like "people will get an instance of blablabla, and send a callback interface to it. It will be executed synchronously/asynchronously, blablabla". I miss this kind of information in the provided documents.

JJ Re Y01

Usecases are described in arch.xml. I can add more usecases there...

Y02

Specify dependencies on other projects. Give me list of modules that you want to introduce, their dependencies among themselves, and the stability of each API that will export. Which cluster these modules belong? Highlevel answer is enough.

JJ Re Y02

Dependencies are specified in the same document.

Y03

Compatibility. Do you have some statement with respect to compatibility? Like, clients of Java infrastructure will not be affected at all, including binary and functional compatibility. Can you say the same about clients of 6.1's GSF?

JJ Re Y03 I am not able to specify compatibility of Java module and GSF. Its question for Tomas and Tor. As far as I know Tomas is trying to do fully compatible bridge for Java module APIs. GSF API is not public/official/stable, so changes are possible, I think. But it depends on Tor, if he wont to create bridge or just remove some parts of GSF API.

JB01

Which modules will use this API in 6.5 time frame? (Java, GSF, who else?)

JB02

Which modules will use this API in post 6.5 (7.0)?

JB03

Will be the Parsing API exposed by GSF or just used

TN07

This is just a follow-up to your answer to TN03: So the point of "invalidate" is to say that the Parser Result is

no longer used - e.g. free any resources you've stored in the parser result? (A C++ destructor or sorts, or something like a "close" method on various resources).

TN08

The documentation (and the design of the result iterator) mentions the possibility of nested embeddings. Can you

give some examples of where this need arises?


TN09

As a meta comment, I think it's really unfortunate that the parsing API doesn't address indexing. I think most

language supports needing parsing will also need indexing - after all, it's vital for support cross-file operations like go to declaration, code completion, etc. The big problem I have trying to retrofit GSF to use the Parsing API is that the Parsing API only addresses a -part- of what GSF (the Retouche fork part) does. I can't just replace it, I'll have to do a lot of surgery in there, and that makes me worried and I still end up with a lot of code to maintain. It's also unfortunate since a big point of the parsing API is to avoid code duplication, and if there's no indexing in the parsing api, then the clients (java, C/C++, GSF, others?) will all duplicate the indexing code.


TN10

I still don't understand embedding. I took a look at your latest javadoc, but I still don't see a concrete example

of how this will work. Would you mind writing up a much more detailed example? I will write one for how GSF works here to illustrate the kind of detail I'm looking for. I'd like to see the same thing expressed using the Parsing API.

Say you have this source RHTML file (could be JSP, but I don't know JSPs very well - but I think the <% %> syntax to insert server side Ruby/Java calls are the same)

<p>
<script type="text/javascript" src="javascripts/effects.js"></script>
<style>body { background: red }</style>
<script>var x = <% foo() %>;</script>
<span>Hello</span>
<script>
function f() { alert(x.toString()); }
</script>
<input type="submit" onclick="f()"/>

What does the embedding scenario look like for this in the Parsing API? What are the Sources, Snapshots, Embeddings?
Offsets?

In GSF, you will involve several source models here. It will create a separate "virtual source" for each of these
languages: CSS, JavaScript, Ruby

The JavaScript virtual source (which is passed to the JavaScript parser) looks something like this:

'''netbeans_import'''('javascripts/effects.js');
var x = '''UNDEFINED''';
function f() { alert(x.toString()); }
;function(){
f()
}

The virtual source translator from RHTML to JavaScript replaced the server call <% foo %> with __UNDEFINED, and the onclick handlers have been translated into anonymous functions (such that writing "return this" etc. in an event handler makes sense from an ast perspective, you can only return inside functions).

Similarly, the Ruby embedding model (RHTML to Ruby) will create this:

class ActionView::Base
_buf='';;_buf << '<p>
<script type="text/javascript" src="javascripts/effects.js"></script>
<style>body { background: red }</style>
<script>var x = ';
 foo() ;_buf << ';</script>
<span>Hello</span>
<script>
function f() { alert(x.toString()); }
</script>
<input type="submit" onclick="f()"/>
';
end

In both cases, some of the code is coming from the original source, some is generated (e.g. the class ActionView::Base part above, and the netbeans_import to record JavaScript file inclusion and functions to contain event handlers). Some source is removed - typically embedded source that isn't relevant for this language -- such as all the markup.

These blocks of code are maintained by each embedding model, which has a set of code blocks with corresponding offsets in the generated source and in the original source. You get a map like this:

codeBlocks=
[CodeBlockData[
  RHTML(0,0)="",
  RUBY(0,31)="class ActionView::Base
_buf='';"], CodeBlockData[
  RHTML(0,130)="<p>
<script type="text/javascript" src="javascripts/effects.js"></script>
<style>body { background: red }</style>
<script>var x = ",
  RUBY(31,174)=";_buf << '<p>
<script type="text/javascript" src="javascripts/effects.js"></script>
<style>body { background: red }</style>
<script>var x = ';
"], CodeBlockData[
  RHTML(132,139)=" foo() ",
  RUBY(174,181)=" foo() "], CodeBlockData[
  RHTML(141,266)=";</script>
<span>Hello</span>
<script>
function f() { alert(x.toString()); }
</script>
<input type="submit" onclick="f()"/>

",
  RUBY(181,319)=";_buf << ';</script>
<span>Hello</span>
<script>
function f() { alert(x.toString()); }
</script>
<input type="submit" onclick="f()"/>

';
"]]

So, can you show us what this will look like with the Parsing API? The only example I could find was one which constructs an Embedding consisting of three fragments of Java, but I'd like to understand how the virtual source for each language is constructed, how nesting of embeddings etc. come into play, and so on.

(In GSF, a virtual source translator can return a collection of models which is used when you don't create a single "synthesized" model like we do here - e.g. the multiple <script> blocks in JavaScript get merged into one. For CSS, the theory is that each style="foo:bar" block would generate its own model - these are all completely independent and don't have to (and shouldn't) be parsed together. Thus, for CSS you get a series of models - at least I believe that's the case; Marek wrote the CSS embedding model so he'd know the details.)

Voting

  • Approved with TCRs for implementation.

TCRs

  • Indexing API: Do as much as you can to support Indexing, at least for other languages than Java.
  • Incremental parsing: Make generic, for any language.
  • Caching, naming: Cache something, later find out what it is. Language clients shall not worry for infrastructure things.

TCAs

Not logged in. Log in, Register

By use of this website, you agree to the NetBeans Policies and Terms of Use. © 2012, Oracle Corporation and/or its affiliates. Sponsored by Oracle logo