How XML deals with different character sets
It is possible (and preferable) to specify the encoding of an XML document in the XML declaration, for example:
<?xml version="1.0" encoding="iso-8859-1"?>
You will notice that many XML documents do not contain any encoding information. In fact the XML declaration itself is optional.
According to the XML 1.0 specification, all processors are required to automatically support (and detect) the UTF-8 and UTF-16 encodings.
UTF-8 and UTF-16 are encodings of Unicode. If you save your documents in one of these two encodings, you do not need to specify the encoding in the XML declaration, as it will be recognized automatically by the XML processor.
This automatic recognition relies on the BOM (Byte Order Mark) at the beginning of your file, which can be seen if you view the file as binary in a hex editor. The possible BOM values are listed below, followed by a short detection sketch.
0xFF 0xFE = UTF-16 little endian/Unicode
0xFE 0xFF = UTF-16 big endian
0xEF 0xBB 0xBF = UTF-8
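As a rough illustration, the following Python sketch reads the first bytes of a file and reports which of the BOMs listed above it finds; the file name sample.xml is only a placeholder.

def detect_bom(path):
    # Read just enough bytes to cover the longest BOM (3 bytes for UTF-8).
    with open(path, "rb") as f:
        head = f.read(3)
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (with BOM)"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16 little endian"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16 big endian"
    return "no BOM - the file will be treated as ANSI"

print(detect_bom("sample.xml"))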
If this BOM is missing from the beginning of your file, Catalyst will always treat the file as ANSI.
If that happens, the file will be extracted from the TTK with an ANSI encoding. To avoid this, you must resave the file as UTF-8 or Unicode (UTF-16) in Notepad.
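If you prefer to script the conversion rather than resave the file in Notepad, a minimal Python sketch along these lines does the same job. It assumes the ANSI code page is windows-1252, and the file names are placeholders.

# Read the file using the assumed ANSI code page (windows-1252 here).
with open("source_ansi.xml", "r", encoding="windows-1252") as f:
    text = f.read()

# "utf-8-sig" writes the 0xEF 0xBB 0xBF BOM at the start of the output file.
with open("source_utf8.xml", "w", encoding="utf-8-sig") as f:
    f.write(text)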
If you type an XML document into Notepad, you can choose one of several supported character encodings, including ANSI, UTF-8, or UTF-16.
If you save an XML file using an encoding other than UTF-8/UTF-16, then you must use an XML declaration to specify the actual encoding used.
Examples of correctly specified character encodings in an XML declaration are shown below.
<?xml version="1.0" encoding="windows-1252"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
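To check that the declared encoding and the actual bytes agree, here is a small sketch using Python's standard library; the element name and file name are purely illustrative.

import xml.etree.ElementTree as ET

root = ET.Element("greeting")
root.text = "café"  # contains an extended (non-ASCII) character

# Writing with an explicit encoding adds a matching XML declaration.
tree = ET.ElementTree(root)
tree.write("greeting.xml", encoding="ISO-8859-1", xml_declaration=True)

# The parser reads the declaration and decodes the file correctly.
print(ET.parse("greeting.xml").getroot().text)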
Depending on the Target Language in your TTK, you will need to specify different character encodings in Catalyst in order to display extended and double-byte characters correctly.