EmbeddingLanguagesFormatting

Version: 0.3 Draft

Plan: implementation of this proposal is tracked here

This is a specification for the formatting of embedded languages in the NetBeans editor. The primary languages of interest are CSS, HTML, JSP and JavaScript. However, the intent of this specification is to be generic enough to apply to any other language.

Background

To establish a shared understanding of the domain, let's briefly reiterate the basic concepts and facts.

What is meant by "formatting"?

Formatting actually covers two different features:

  • indentation
  • formatting

Indentation needs to be performed when the user presses Enter and the caret is placed on the next line. It also needs to happen when the user closes a language block, for example by typing "}" in Java or "</table>" in HTML. In the example below

    if (condition) {
        call();
        |    

the caret is indented by default for coding within the "if" block, but if the user closes the block by typing "}" then the line needs to be reindented so that "}" is aligned with "if".
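
The two cases above can be sketched using nothing but the text of the relevant lines. This is a minimal illustration, not NetBeans code: the class and method names are hypothetical, and braces inside strings or comments are deliberately not recognised.

```java
// Hypothetical helper, not a NetBeans API: brace-driven indentation from line text only.
final class BraceIndentSketch {

    /** Number of leading space characters on the line. */
    static int leadingSpaces(String line) {
        int count = 0;
        while (count < line.length() && line.charAt(count) == ' ') {
            count++;
        }
        return count;
    }

    /** Indent for the caret after Enter is pressed at the end of previousLine. */
    static int indentAfterEnter(String previousLine, int indentSize) {
        int indent = leadingSpaces(previousLine);
        // A line ending with "{" opens a block, so the next line goes one level deeper.
        return previousLine.trim().endsWith("{") ? indent + indentSize : indent;
    }

    /** Re-indents a line consisting of "}" so that it aligns with the line that opened the block. */
    static String reindentClosingBrace(String blockOpeningLine) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < leadingSpaces(blockOpeningLine); i++) {
            line.append(' ');
        }
        return line.append('}').toString();
    }
}
```

With an indent size of 4, pressing Enter after `    if (condition) {` would place the caret at column 8, and typing "}" would move the brace back to column 4.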

Formatting is indentation and prettifying of source code, for example

if (   condition   )      {call  (   param  ); }

can be reformatted as

    if (condition) {
        call(param);
    }

It is important to distinguish indentation from formatting because indentation needs to be performed as quickly as possible - it should never block users from typing their code. In that respect it is better to provide fast-and-sometimes-wrong indentation rather than an always-correct-but-slow one. After all, it is only a typing aid which, if wrong, is easy for the user to fix as they type. Formatting, on the other hand, is usually performed on a whole file or block of code and must always be correct.

From implementation point of view this implies:

  • indentation should rely only on lexical information
  • formatting should rely on abstract syntax tree information

A further argument supporting this is that indentation happens while code is being typed, and thus there is a high likelihood that the code is syntactically erroneous. Formatting, on the other hand, is executed by the user after they have finished coding and believe everything is correct. It is also acceptable for the formatting action to be more time consuming - it is an action triggered by the user after they have finished a task, and they do not mind being blocked for a couple of hundred milliseconds.

Both indentation and formatting must share the same configuration parameters (sub-indent of multiline statements, number of spaces per tab, etc.)

What is meant by formatting of "embedded languages"?

There are two types of language embeddings:

  • native embedding
  • templating

Native embedding is when a language specification says how other languages can be embedded in it, for example HTML and its <script> tag containing JavaScript code, or HTML's style attribute containing CSS code.

Templating is when a regular language file is turned into a template and interspersed with templating language code, for example a JSP page, which is an HTML page with JSP templating code, or Ruby HTML, which is HTML with Ruby templating code.

It is templating which is problematic, because parts of the language file are generated dynamically and without them the language file might be syntactically incomplete.

How does formatting of embedded languages work today?

The Editor Indentation library collects all languages available in a file and sorts them according to the length of their MIME types. This is the order in which the formatters of individual languages are called and asked to perform formatting.
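
To make the consequence of this ordering concrete, the sketch below sorts a set of MIME types by name length. It is an illustration only: the class name is hypothetical, the MIME types are assumed for a JSP page, and ascending order is an assumption since the direction is not stated here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only, not the library's code: order MIME types by the length of their names.
final class MimeLengthOrderSketch {

    /** Returns a copy of the MIME types ordered from shortest to longest name (assumed direction). */
    static List<String> order(List<String> mimeTypes) {
        List<String> sorted = new ArrayList<>(mimeTypes);
        // List.sort is stable, so MIME types of equal length keep their relative order.
        sorted.sort(Comparator.comparingInt(String::length));
        return sorted;
    }
}
```

Note that an ordering of this kind depends only on the names and says nothing about which language is the host format and which is the templating language.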

There is no other support in the Editor Indentation library.

There exists a TagBasedLexerFormatter in the editor.structure module which provides a mechanism for sharing formatting data between different formatters. Unfortunately, the formatter in its current state can be reused only for tag-based languages.

Current Problems

  • there is no infrastructure or recommended way to handle templating and native embedding
  • editor indentation API limits: the ordering of formatters is not configurable and has hardcoded exceptions; there is no API for passing indentation state between formatters
  • all formatters (apart from the ones based on TagBasedLexerFormatter) are written ad hoc and repeat non-trivial code which could be handled by the formatting infrastructure

Proposal

Handling of Templating and Native Embedding

As explained earlier, native embedding is defined in the language specification and its handling is straightforward. It is templating which is difficult to support. Let's illustrate it with an example. Let's say we want to format the JavaScript in the following JSP document:

<html>
    <script type="text/javascript">
        function total() {
            <% for (int i=0; i<10; i++) { %>
                total += <%=i%>;
            <% } %>
            alert(total);
        }
    </script>
</html>

The document is split into several language token sequences (HTML, JavaScript, JSP, Java):

<html>

   <script type="text/javascript">

function total() {

            <%   for (int i=0; i<10; i++) {   %>

total += <%= i %> ;

            <%   }  %> 

alert(total);

       }

</script> </html>

JavaScript within HTML is an example of native embedding - anything within the <script> tag in this instance is one block of JavaScript code. Within the JavaScript are JSP token sequences, which is an example of templating. What's important is that native embedding tells us that all four token sequences of JavaScript code are related and should be treated together.

If we try to merge the four JavaScript token sequences we end up with:

        function total() {
                total +=    ;
            alert(total);
        }

which, depending on the error tolerance of the JavaScript formatter, may be formattable code.

Both the GSF API and the Parsing API have a concept of virtual source (TranslatedSource in GSF and Embedding in the Parsing API) which can be created for embedded languages. The objective of a virtual source is to extract pieces of a language from a file and merge them into a syntactically correct form. In our JavaScript example the virtual source for JavaScript could look like:

        function total() {
                total +=  GENERATED;
            alert(total);
        }

where GENERATED is an attempt to substitute the templated parts of the JavaScript with something syntactically correct in terms of JavaScript grammar.
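
A minimal sketch of how such a virtual source might be assembled from language segments follows. It is not the TranslatedSource or Embedding implementation: the Segment type and method names are hypothetical, the input is assumed to contain only the segments inside the <script> block, and recognising a JSP expression by its "<%=" prefix is a simplification.

```java
import java.util.List;

// Hypothetical sketch, not the GSF or Parsing API: build a virtual source for one language.
final class VirtualSourceSketch {

    /** One contiguous piece of the document written in a single language. */
    static final class Segment {
        final String language;
        final String text;

        Segment(String language, String text) {
            this.language = language;
            this.text = text;
        }
    }

    /** Joins the segments of the given language and fills expression gaps with the placeholder. */
    static String virtualSource(List<Segment> segments, String language, String placeholder) {
        StringBuilder source = new StringBuilder();
        for (Segment segment : segments) {
            if (segment.language.equals(language)) {
                source.append(segment.text);
            } else if (segment.text.startsWith("<%=")) {
                // A templating expression produces a value at run time; substitute a placeholder.
                source.append(placeholder);
            }
            // Any other foreign segment (for example a scriptlet) contributes no text.
        }
        return source.toString();
    }
}
```

Applied to the JavaScript example with the placeholder GENERATED, the scriptlets around the loop disappear and `<%=i%>` becomes GENERATED, which gives text equivalent to the virtual source shown above, apart from whitespace.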

So far this concept has been used only by parsers to enable them to create a valid abstract syntax tree. This proposal would like to reuse the same concept for formatting. When a formatter is started it is given an indent.spi.Context from which it gets the Document and its TokenHierarchy. There should be a helper method which, from the TokenHierarchy:

  • collects all token sequences of a given language
  • identifies templating gaps and uses virtual source to replace them with "virtual" token sequences
  • joins real and virtual token sequences into a JoinedTokenSequence

The returned joined token sequence is what formatters would iterate over to perform formatting. That would isolate the formatter from the complexities of embedded languages. The formatter would have to be aware that some of the tokens in the joined token sequence might be virtual and that such tokens should be ignored apart from updating the formatter's internal state.
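
The treatment of virtual tokens can be sketched as follows. The JoinedToken type below is a hypothetical stand-in, not the proposed JoinedTokenSequence, and brace nesting depth stands in for whatever internal state a real formatter keeps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: iterate a joined sequence in which some tokens are virtual.
final class JoinedSequenceSketch {

    static final class JoinedToken {
        final String text;
        final boolean virtual; // true if the token was substituted for a templating gap

        JoinedToken(String text, boolean virtual) {
            this.text = text;
            this.virtual = virtual;
        }
    }

    /** Returns the brace nesting depth at each real token; virtual tokens only update state. */
    static List<Integer> depthsOfRealTokens(List<JoinedToken> joined) {
        List<Integer> depths = new ArrayList<>();
        int depth = 0;
        for (JoinedToken token : joined) {
            if (token.text.equals("}")) {
                depth--; // a closing brace belongs to the enclosing level
            }
            if (!token.virtual) {
                depths.add(depth); // only real tokens exist in the document and can be formatted
            }
            if (token.text.equals("{")) {
                depth++;
            }
        }
        return depths;
    }
}
```

The state update runs for every token, while a result is recorded only for real ones, so a virtual token never produces a change in the document.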


One precondition for using virtual source for formatting (and especially indentation) is that the provider of the virtual source is as quick as possible, which means it uses only lexical information and not the abstract syntax tree. In the case of the Parsing API (EmbeddingProvider) this is already required by the Javadoc contract, which satisfies this precondition.

Editor Indentation API enhancements

The current editor approach is to call the individual formatters of all languages present in a file one by one. Formatters are called in some order and each formatter is given the whole file. An alternative would be a single top-down pass in which formatters are executed according to the language of the token being processed. This approach would work well for native embedding. But in the case of templating, where a single block of one language is potentially interspersed with the templating language, the situation gets complicated. In such a case the current approach - a language formatter processes all its language tokens at once - seems better.

The current formatting approach needs to be slightly enhanced:

  • order of formatters needs to be configurable per MIME type; and
  • there needs to be a mechanism of passing an indentation context between formatters

Let's look again at our JSP example. If formatters were run according to their priority then:

  • HTML would have priority 0 - it should always run first as it is the format of the file
  • JSP would have priority 10000 - it is the templating language and should always run last
  • all other formatters would have a priority somewhere in between

The unformatted file would look like:
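
The priority scheme in the list above can be sketched as a sort over MIME types. The class name, the map of priorities and the default value of 5000 for "somewhere in between" are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: decide the run order of formatters from per-MIME-type priorities.
final class FormatterPrioritySketch {

    /** Assumed priority for languages without an explicit entry. */
    static final int DEFAULT_PRIORITY = 5000;

    /** Returns the MIME types in the order in which their formatters would run. */
    static List<String> runOrder(List<String> mimeTypes, Map<String, Integer> priorities) {
        List<String> ordered = new ArrayList<>(mimeTypes);
        // Lower priority value runs earlier; unlisted languages fall in between.
        ordered.sort(Comparator.comparingInt(
                (String mime) -> priorities.getOrDefault(mime, DEFAULT_PRIORITY)));
        return ordered;
    }
}
```

With text/html mapped to 0 and text/x-jsp mapped to 10000, a file containing HTML, JavaScript and JSP would be processed in exactly that order.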

<html>
<script type="text/javascript">
function total() {
<% for (int i=0; i<10; i++) { %>
total += <%=i%>;
<% } %>
alert(total);
}
</script>
</html>

After the HTML formatter it would look like:

<html>
    <script type="text/javascript">
function total() {
<% for (int i=0; i<10; i++) { %>
total += <%=i%>;
<% } %>
alert(total);
}
    </script>
</html>

The HTML formatter formatted all HTML language tokens but indented only lines which started with HTML tokens (ignoring whitespace). After the JavaScript formatter:

<html>
    <script type="text/javascript">
        function total() {
<% for (int i=0; i<10; i++) { %>
            total += <%=i%>;
<% } %>
            alert(total);
        }
    </script>
</html>

and finally after the JSP formatter, which not only formats JSP code but also sub-indents any other code within a templating block, for example line 5 which lies within the templating loop:

<html>
    <script type="text/javascript">
        function total() {
            <% for (int i=0; i<10; i++) { %>
                total += <%=i%>;
            <% } %>
            alert(total);
        }
    </script>
</html>
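
The first pass of this walk-through, in which the HTML formatter indents only the lines that start with an HTML token, can be sketched roughly as below. This is a hypothetical simplification rather than the actual HTML formatter: line ownership is guessed from the first characters of the line, and every opening tag is assumed to be closed on a later line (no void or self-closing tags).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not NetBeans code: an HTML pass that re-indents only its own lines.
final class HtmlPassSketch {

    /** Lexical guess: the line starts with an HTML tag rather than a JSP "<%" block. */
    static boolean startsWithHtmlToken(String line) {
        String text = line.trim();
        return text.startsWith("<") && !text.startsWith("<%");
    }

    static List<String> htmlPass(List<String> lines, int indentSize) {
        List<String> result = new ArrayList<>();
        int depth = 0; // number of currently open tags
        for (String line : lines) {
            if (!startsWithHtmlToken(line)) {
                result.add(line); // owned by another formatter: left untouched
                continue;
            }
            String text = line.trim();
            boolean closing = text.startsWith("</");
            if (closing) {
                depth--; // a closing tag aligns with its opening tag
            }
            StringBuilder indented = new StringBuilder();
            for (int i = 0; i < depth * indentSize; i++) {
                indented.append(' ');
            }
            result.add(indented.append(text).toString());
            if (!closing) {
                depth++; // simplification: each opening tag is closed on a later line
            }
        }
        return result;
    }
}
```

Run on the unformatted file with an indent size of 4, this indents the <script> and </script> lines by four spaces and leaves the JavaScript and JSP lines as they were, which matches the state shown after the HTML formatter.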

Subsequent formatters need to follow the indentation set by previous formatters. For example, when the JavaScript formatter is about to start formatting on

    <script type="text/javascript">
function total() {

it needs to know the initial indent. In this case it could simply be deduced from the previous line, but not always. For example, in this case:

    <script foo="bar"
            type="text/javascript">
function total() {

the indent cannot be deduced from the previous line because it is part of a multiline statement, and it is the line on which the multiline statement starts which is relevant.

A solution for ordering could perhaps be as simple as adding an SPI interface:

package org.netbeans.modules.editor.indent.spi;

import java.util.Comparator;
import org.netbeans.api.lexer.LanguagePath;

/**
 * To control the order in which formatters are run for a MIME type,
 * implement this interface and register an instance in the MIME type's
 * folder in the NetBeans module layer.
 */
public interface LanguagePathFormattingComparator extends Comparator<LanguagePath> {
}

and letting clients implement it and register it in the layer for the main document MIME type.

A solution for sharing the initial line indent could be to enhance org.netbeans.modules.editor.indent.spi.Context with two new methods:

/**
 * Returns initial indentation of a line as suggested by previously run 
 * formatter. If none was set then indentation of previous line is returned.
 * Each formatter should call this method when formatting first line of a code
 * and use it as initial indentation of block being formatted.
 */
public int getLineInitialIndent(int lineIndex) {...}

/**
 * Formatter should call this method at the end of each line it processed
 * and set initial indentation of next line. For native embedded languages
 * this could be called only at the end of language block but because of
 * templating it is recommended to call this at the end of each line.
 */
public void setLineInitialIndent(int lineIndex, int initialIndent) {...}

Implementations of these methods should avoid storing line indexes, as these may change during formatting. Translating a line index into the first token on that line and keeping that token as a key in a map, with the value being the initial indent, might be a better solution.
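
A minimal sketch of such a token-keyed store is shown below. It is not the proposed Context implementation: LineStartToken is a hypothetical stand-in for a lexer token, the translation from line index to first token is left out, and the fallback value stands for the indentation of the previous line.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch: remember initial indents by first token of a line, not by line index.
final class InitialIndentStoreSketch {

    /** Stand-in for the first token of a line; real code would use a lexer token. */
    static final class LineStartToken {
        final String text;

        LineStartToken(String text) {
            this.text = text;
        }
    }

    // Identity-based keys: tokens with equal text on different lines stay distinct.
    private final Map<LineStartToken, Integer> indents = new IdentityHashMap<>();

    void setLineInitialIndent(LineStartToken firstTokenOfLine, int initialIndent) {
        indents.put(firstTokenOfLine, initialIndent);
    }

    /** Returns the stored indent, or the fallback if no formatter has set one for this line. */
    int getLineInitialIndent(LineStartToken firstTokenOfLine, int fallbackIndent) {
        Integer stored = indents.get(firstTokenOfLine);
        return stored != null ? stored : fallbackIndent;
    }
}
```

Because the key is the token object itself, an entry remains valid when lines are inserted or removed above it, whereas a stored line index would silently point at the wrong line.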

Abstract Formatter

TBD

Appendix A - Embedding Examples

Please feel free to extend this part of the document with your own real-life examples. If you have a code snippet of language embedding which you think is not handled by this specification, ADD IT here! Do not feel restricted to the set of primary languages.

Case C1 (#151393)

JSP document containing HTML and CSS:

<tr bgcolor="#f2f2f2" style="height:<%= Common.INTRA_GROUP_GAP %>px"><td></td></tr>

Case C2 (#151393)

JSP document containing HTML and JavaScript:

<script type="text/javascript">
<!--//
function doDocSearch(id) {
    document.docSearchForm.<%= DocumentSearch.DOC_ID_PREFIX %>0.value = id;
    document.docSearchForm.submit();
}
// -->
</script>

Case C3 (#151393)

JSP document containing HTML and JavaScript:

<script type="text/javascript">
    function recalcualte() {
        var numTotal = 0.0; <%
            for (i = 1;i < quantity; i++)
            {
                String suffix = act.getAuthorityCertificateID() + "_" + i;
        %>
        numTotal = numTotal + parseFloat(document.now_step.subTotal_<%= suffix %>.value); <%
            }
        %>
    }
</script>

Case C4 (#151393)

JSP document containing HTML and JavaScript:


<script type="text/javascript">
    var HINTS = {
        'top' : 5,
        'left' : 0
    },
    HINTS_ITEMS = {<%
            for (int idx = 0; idx < passwordList.size(); idx++)
            {
        %>
        '<%= idx %>' : 'some textual value <%= passwordList.get(idx).smth() %>', <%
            }
        %>
        'x':'x'
    };
    
</script>

Case C5 (Test4U harness sources)

JSP document containing HTML:

<table class="main" cellpadding="2" cellspacing="1">
  <%List<String[]> list = userSession.getJavaVersions(pageContext, seqid);
    for (int i=0;i < list.size()-1;i++) {
      String[] lst=list.get(i);%>
  <tr><td>
      <a href="serverstate?<%=Const.A%>=<%=lst[0]%>&<%=Const.B%>=<%=seqid%>">
        <%=lst[0]%>
      </a></td><td>
      <%for (int j=1;j < lst.length;j++) {%>
        <%=lst[j]%><%=((j < lst.length-1) ? ", ":"")%>
      <%}%>
    </td></tr>
  <%}%>
</table>

Case C6 (Test4U harness sources)

JSP document containing HTML:

<input type="checkbox" <%=((testRun.hasMachine(server.getName()))?"checked=\"checked\" ":"")%> />

Case C7 (Embedding levels interplay)

Sometimes the parent pushes up the kid, sometimes kid pushes up the parent...

Example 1:

<jsp:tag>
   <htmltag>
      <jsp:tag/>
   </htmltag>
</jsp:tag>

Example 2:

<? if ($foo){ ?>
   <p><?
        echo $foo ;
   ?></p>
<? } ?>

Case C8 (Two kinds of unformattable areas in embedded language)

Case 1 (absolute):

<jsp:tag>
   <textarea>
Indentation here needs to be strictly preserved
or it will affect the content of the page
    </textarea>
</jsp:tag>

Case 2 (relative):

<jsp:tag>
    <!--
       Formatting within HTML comment should be preserved
          * like this bullet point
          * but the whole block should be moved according to surrounding content
    -->
</jsp:tag>

Case C9 (Two formatting styles in PHP)

1. JSP style:

<table>
   <tr>
      <td>
          <? 
              if ($foo)
              {
                 echo $foo;
              }
          ?>
      </td>
   </tr>
</table>

2. Flat HTML:

<table>
<tr>
<td>
<? 
    if ($foo)
    {
        echo $foo;
    }
?>
</td>
</tr>
</table>