Extracted file not in UTF8 encoding

Issue

The translated files extracted from Alchemy Catalyst are not saved in UTF-8 encoding.

Reason

Alchemy Catalyst auto-detects the encoding of a file when it is inserted into a TTK and retains that encoding when the translated file is extracted.

Solution

Encoding respected on extraction

A file inserted as Unicode or ANSI into Alchemy Catalyst will always be extracted as Unicode or ANSI.

Alchemy Catalyst automatically detects if the file is Unicode or ANSI on insertion and will then extract to this same format. It is not possible for the user to change the format of the file from Unicode to ANSI in Alchemy Catalyst.

This automatic recognition of the file's encoding relies on the BOM (Byte Order Mark) at the beginning of the file, which you can see by viewing the file as binary in a hex editor, as in the screenshot below.

0xFF 0xFE         =   UTF-16 little endian/Unicode
0xFE 0xFF         =   UTF-16 big endian
0xEF 0xBB 0xBF    =   UTF-8

If this BOM is missing from the beginning of your file, Alchemy Catalyst will always treat the file as ANSI. When the BOM is present, Alchemy Catalyst knows to treat the file as Unicode.
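To check a file before inserting it into a TTK, the rule above can be approximated with a few lines of Python. This is only a sketch of the behavior described in this article, not Alchemy Catalyst's actual detection code; the function names classify_by_bom and classify_file are hypothetical.

```python
import codecs

# BOM signatures listed in this article, paired with a descriptive label.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                   # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 little endian"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 big endian"),     # FE FF
]


def classify_by_bom(data: bytes) -> str:
    """Return the encoding implied by the leading bytes, or "ANSI" if no BOM."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    # Assumption taken from the article: no BOM means the file is treated as ANSI.
    return "ANSI"


def classify_file(path: str) -> str:
    """Read only the first bytes of a file and classify it."""
    with open(path, "rb") as handle:
        return classify_by_bom(handle.read(4))
```

Calling classify_file on a source file that returns "ANSI" suggests the file should be resaved with a BOM before insertion.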

If the file is inserted into Alchemy Catalyst without a BOM, it will be extracted from the TTK with ANSI encoding. To avoid this, resave the file as UTF-8 or Unicode in Notepad before inserting it.
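When many files need resaving, the Notepad step could be scripted instead. The following is a minimal sketch, assuming the source file is currently in Windows-1252 (cp1252); that default is an assumption and should be changed to match the real ANSI code page. The function name resave_with_bom is hypothetical.

```python
def resave_with_bom(src_path: str, dst_path: str,
                    source_encoding: str = "cp1252") -> None:
    """Rewrite a text file as UTF-8 with a leading BOM (EF BB BF)."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    # Python's "utf-8-sig" codec writes the UTF-8 BOM before the content.
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)
```

Passing encoding="utf-16" in the second open call would instead produce a UTF-16 file, for which Python also writes a BOM.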

Save your source file specifying the encoding

If you open an XML, Text or HTML document in Notepad, you can choose from one of several supported character encodings including ANSI, UTF-8, or UTF-16 as shown in the screenshot below.

XML files encoding

It is possible (and preferable) to specify the encoding of an XML document in the XML declaration, e.g. <?xml version="1.0" encoding="iso-8859-1"?>

Many XML documents do not contain any encoding information. In fact the XML declaration itself is optional. According to the XML 1.0 specification, all processors are required to automatically support (and detect) the UTF-8 and UTF-16 encodings. UTF-8 and UTF-16 are forms of Unicode.

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a special Unicode marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.

First bytes	Encoding assumed
EF BB BF	UTF-8
FE FF	UTF-16 (big-endian)
FF FE	UTF-16 (little-endian)
00 00 FE FF	UTF-32 (big-endian)
FF FE 00 00	UTF-32 (little-endian)
None of the above	UTF-8

Note that the encoding of an XML document is never iso-8859-1 by default.
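The table above can be expressed as a small lookup. One detail is easy to miss: the UTF-32 little-endian signature FF FE 00 00 begins with the UTF-16 little-endian signature FF FE, so the four-byte signatures have to be tested first. This is an illustrative sketch of the table only (it does not read an XML declaration), and the name assumed_xml_encoding is hypothetical.

```python
# Four-byte signatures come first because FF FE 00 00 (UTF-32 LE)
# would otherwise be mistaken for FF FE (UTF-16 LE).
XML_BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "UTF-32 (big-endian)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (little-endian)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]


def assumed_xml_encoding(first_bytes: bytes) -> str:
    """Return the encoding assumed for an XML document with no declaration."""
    for signature, encoding in XML_BOM_TABLE:
        if first_bytes.startswith(signature):
            return encoding
    # "None of the above" row: the default is UTF-8, never iso-8859-1.
    return "UTF-8"
```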

One of the most common mistakes when editing an XML document is to add extended characters and forget to set the encoding declaration at the top of the document.

If the BOM is missing from the beginning of your XML file, Alchemy Catalyst will always treat the file as ANSI, and the file will be extracted from the TTK with ANSI encoding. To avoid this, resave the file as UTF-8 or Unicode in Notepad.

If you type an XML document into Notepad, you can choose from one of several supported character encodings including ANSI, UTF-8, or UTF-16 as shown in the screenshot above.

If you save an XML file using an encoding other than UTF-8/UTF-16, then you must use an XML declaration to specify the actual encoding used.

Examples of correctly specified character encoding in an XML declaration:

<?xml version="1.0" encoding="windows-1252"?>
or
<?xml version="1.0" encoding="ISO-8859-1"?>
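If the XML is generated by a script rather than edited by hand, letting the serializer write the declaration keeps it consistent with the bytes on disk. The sketch below uses Python's standard xml.etree.ElementTree module (the xml_declaration argument of tostring requires Python 3.8 or later); the element name and the helper names are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_sample() -> ET.Element:
    """Build a tiny document containing one extended character (e with acute)."""
    root = ET.Element("greeting")
    root.text = "caf\u00e9"
    return root


def serialize_with_declaration(root: ET.Element, encoding: str) -> bytes:
    """Serialize to bytes in the given encoding, with a matching XML declaration."""
    return ET.tostring(root, encoding=encoding, xml_declaration=True)
```

Calling serialize_with_declaration(build_sample(), "windows-1252") returns bytes that start with an XML declaration naming windows-1252, and the extended character is stored as the single byte E9. ElementTree writes the declaration with single quotes, which is equally valid XML.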

Depending on the Target Language in your TTK, you will need to specify different character encodings in Alchemy Catalyst in order to display extended and double-byte characters correctly.

Products or Versions Affected

Alchemy CATALYST 6.0 and greater