High-level programming guidelines for handling encoding

Information provided by Ken Frank, Tools I18N Quality and Testing

Contact kfrank@netbeans.org for more information or with comments or questions about these pages.

NOTE - NetBeans 6.0 introduces a new project encoding property and enhanced handling of project and file encoding. The overall name for these features is the File Encoding Query (FEQ).

See the NetBeans API Javadoc for more information at the implementation level: FileEncodingQuery and FileEncodingQueryImplementation.

These NetBeans wiki FAQ documents give more details:

http://wiki.netbeans.org/wiki/view/FaqI18nProjectEncoding
http://wiki.netbeans.org/wiki/view/FaqI18nChangeProjectEncodingImpact
http://wiki.netbeans.org/wiki/view/FaqI18nFileEncodingQueryObject
http://wiki.netbeans.org/DevFaqI18nFileEncodingQueryObject

These documents tell more about encoding handling and how to test it.

The information in those documents and APIs may supersede some of the details below where specific APIs are mentioned, but the principles of encoding handling and how to test it are the same.

System locale encoding = the default encoding of the locale the user is currently in

utf-8 = UTF-8 encoding

mbyte = multibyte

Guidelines and Requirements

* Use the FEQ mentioned above when implementing encoding handling related to files and projects.

   * Some of the information below might not be valid in the context of the FEQ implementation mentioned.

* Use UTF-8 encoding for processing internal data and information within the module. (It is not certain whether this applies to other modules in the same IDE.)

Make sure that if files are created from this data, or if this data can be edited via properties or elsewhere in the module, those files are also opened and displayed using UTF-8; otherwise (i.e., if the system default encoding is used) they will not display properly.

* Use UTF-8 for processing, reading, and writing XML files.

This likely also applies to deployment-descriptor kinds of files and data.
   * XML uses UTF-8 as its default, so a generated or other XML file does not need an encoding declaration if UTF-8 is used.
   * Use UTF-8 if a generated file will be executed without any modification; issues arise when the user might modify a file using data in another encoding (or when the IDE modifies the file based on other user input).
   * Because the default encoding of XML files is UTF-8, a declaration like <?xml version="1.0"?> implies UTF-8; such XML should be handled as UTF-8 and generated with that declaration.
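A minimal sketch of the XML guideline above: always encode the generated bytes as UTF-8 regardless of the system locale, and state the encoding explicitly in the declaration to avoid ambiguity. The class and method names here are illustrative, not an existing NetBeans API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class XmlUtf8Writer {
    // Build a minimal XML document. The encoding pseudo-attribute is
    // optional when UTF-8 is used, but stating it explicitly is clearer.
    static String buildXml(String rootText) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<root>" + rootText + "</root>\n";
    }

    static void writeXml(Path target, String rootText) throws IOException {
        // Always encode the bytes as UTF-8, never with the locale default.
        Files.write(target, buildXml(rootText).getBytes(StandardCharsets.UTF_8));
    }
}
```

Writing the bytes through an explicit `StandardCharsets.UTF_8` means the file content is identical no matter what locale the user started the IDE in.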

* Application files

For application files, like Java sources, whether created by the user or created initially by the module for the user, the default encoding should be that of the locale the user is in when the file is created.

This applies to java, html and text files, for example.
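A sketch of the rule above, assuming the JRE's default charset reflects the locale the user started the product in (which is the usual behavior; class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocaleDefaultWriter {
    // The encoding of the locale the user is currently in; the JRE
    // initializes the default charset from the locale at startup.
    static Charset localeEncoding() {
        return Charset.defaultCharset();
    }

    static void writeApplicationFile(Path target, String content) throws IOException {
        // New .java/.html/.txt application files use the locale default,
        // matching what javac and other tools expect when no -encoding
        // option is given.
        Files.write(target, content.getBytes(localeEncoding()));
    }
}
```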

* text files

         * Text files carry no encoding property of their own, so it can be better not to use them for configuration or data files, since multibyte characters or information in other encodings might be used.
         * The encoding of the locale the user is in (system locale) should be used for text files; but since the user might later be in another locale, there is no way to know the original encoding - which is another good reason not to use text files for configuration or control data.
         * Don't use UTF-8 when writing plain text files that will be interpreted in the host OS's default encoding.

* java related files (.java, .jar, .class, etc)

         * The system default encoding should be used for Java files; the javac compiler uses that encoding unless the user passes the -encoding option, which users usually do not.
         * Handle encoding correctly for generated code as well - if the user sets a property value containing multibyte characters in the property editor, make sure the encoding is handled correctly in the generated Java or other code.

* html files

Currently, two approaches are used:

        1.  Generate .html as UTF-8

Because the user does not have to modify generated HTML files, the encoding can always be UTF-8 so the files run correctly in all supported locales. Another reason to specify UTF-8 is that it is difficult to determine an appropriate charset name such as euc-jp or gb2312.

By default Java API behavior, the runtime encoding name that could be used as the charset value does not match what a user would use -- e.g. the API returns "PCK" where the user would expect "Shift_JIS". In the current implementation, generated UTF-8 files work correctly, but they should not be edited, since the IDE opens those files in the system default encoding.

        2.   Generate .html in the system default encoding, providing a meta charset tag whose default value is the encoding of the locale the user is in.

The empty .html in this case would always need to be modified by the user, who might also need to modify the charset tag.
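The second approach can be sketched as follows. Note the caveat from the text above: the runtime charset name may not be the name a browser expects (e.g. "PCK" vs "Shift_JIS"), so a real implementation would need a mapping from runtime names to standard charset names. Class and method names here are hypothetical.

```java
import java.nio.charset.Charset;

public class HtmlTemplate {
    // Approach 2: embed a meta charset tag carrying the user's locale
    // encoding. Charset.name() may not always be the name the user or
    // browser expects, so mapping to a standard charset name may be needed.
    static String emptyPage(Charset cs) {
        return "<html><head>\n"
             + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
             + cs.name() + "\">\n"
             + "</head><body></body></html>\n";
    }
}
```

A caller would typically pass `Charset.defaultCharset()` so the tag reflects the locale the user is in.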

* Where else should the system default encoding be used?

   * If the file format does not allow the user to specify a charset or encoding value, it is better to generate files using the system default encoding.
   * When the user might modify generated files such as HTML, the system default encoding is better to use; if the user manually adds encoding or charset tags, those tags should govern further encoding handling (we assume the user will enter characters appropriate to the charset they are using).
   * When files have no encoding declaration, like plain text, the system default should be used; if the IDE has an encoding property for text files, that property should be consulted, and if it is not set, assume the system default encoding.

* JSP Files

         * For JSP, the default encoding per the spec is ISO-8859-1, and the user is expected to change it to another encoding if needed, e.g. <%@page contentType="text/html; charset=EUC-JP" %>.
         * But since UTF-8 is helpful, the suggestion is to generate JSP files with a UTF-8 encoding tag in them - the user can always change it if needed.
         * If JSP files are generated without an encoding declaration, make sure the user knows they need to change it; otherwise ISO-8859-1 will be used no matter what locale they are in.
         * Handle encoding correctly for generated code as well - if the user sets a property value containing multibyte characters in the property editor, make sure the encoding is handled correctly in the generated code.
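The JSP suggestion above can be sketched as a small generator that always emits the UTF-8 page directive, so the container does not fall back to ISO-8859-1 (class and method names are illustrative):

```java
public class JspTemplate {
    // Generate a JSP skeleton that declares UTF-8 explicitly; without the
    // directive the container assumes ISO-8859-1 per the JSP spec. The
    // user can still edit the charset value later if needed.
    static String skeleton(String title) {
        return "<%@page contentType=\"text/html; charset=UTF-8\" pageEncoding=\"UTF-8\" %>\n"
             + "<html><head><title>" + title + "</title></head>\n"
             + "<body></body></html>\n";
    }
}
```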

Other topics

   * Assume that the encoding used in a file or other data being converted may not be the encoding of the locale the user is currently in.

The file might have been created while the user was in another locale.

That is, the user might have used IDE options and/or directives in the file to change the encoding, so detection of this encoding is needed.

   * It is not easy to determine the encoding of a plain text, HTML, or Java file that was not generated by the IDE itself but comes from an external source.

   * For determining the encoding of external files or data from outside the IDE that are not XML or JSP, special additional detection would be needed, unless the encoding used by that external data is known, which might be true for databases or for application server log data.
   * If a choice of encodings is offered to the user, for example in a drop-down box, make sure UTF-8 and Asian encodings are included.
   * Sometimes using native2ascii to process characters is a good approach - a previous jse web service application used it, as does the NetBeans Swing application wizard.

Using native2ascii can reduce character corruption issues, but it can make it difficult for the user to view the \uXXXX sequences it creates.

So if the user does not need to modify or view a file, native2ascii can be used; if they do, the system default encoding should be used.
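The transformation native2ascii applies can be sketched in a few lines: every character outside the ASCII range is replaced by its \uXXXX escape, so the result is safe in any encoding (this is a simplified illustration of the tool's behavior, not a substitute for it):

```java
public class NativeToAscii {
    // Escape every character outside the ASCII range as \ uXXXX, the same
    // idea the JDK native2ascii tool applies to .properties sources.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", c));
            }
        }
        return out.toString();
    }
}
```

The output is pure ASCII and thus immune to encoding mismatches, but, as noted above, the escaped form is hard for a user to read or edit by hand.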

   * For data sent to or received from the operating system or JRE, the encoding used for that communication is that of the locale the user is in when they start the product.
 The same applies to data shown in a browser, and to data shown in the UI, whether from files or not - the same principles of encoding handling above apply.
   * For data sent by the module outside the product, proper encoding handling is needed.
   * Don't assume all characters will be in the ASCII range; handle comparisons and other processing with that in mind.

   *  Be careful when using the java.io.Reader and java.io.Writer classes for data conversion or manipulation (file creation, reading, writing, processing), because they do not convert all bytes to characters, and characters to bytes, correctly when some multibyte encodings are used. This can lead to data loss or incorrect data.

Instead, use the java.nio.* package and APIs for these conversions.
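One concrete advantage of the java.nio APIs is control over error handling: a CharsetDecoder can be configured to report malformed multibyte input instead of silently substituting replacement characters. A minimal sketch (class and method names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecoder {
    // Decode bytes with java.nio, configured to REPORT malformed or
    // unmappable input instead of silently substituting characters, so
    // corrupt multibyte data is detected rather than quietly damaged.
    static String decode(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

With this configuration, decoding invalid byte sequences throws a CharacterCodingException, which is preferable to producing incorrect data.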


© 2012, Oracle Corporation and/or its affiliates.