Segmentation refers to the rules used to break a paragraph into individual sentences. Segmentation rules vary from language to language. For example, a full stop usually marks the end of a sentence in English and German, but not in Japanese. Alchemy CATALYST has predefined segmentation rules, but these can be replaced or modified in HOME > Segmentation.

  
 
In this dialog, you choose between Sentence based parsing and Paragraph based parsing. Put simply, you choose whether or not to segment your project files.
 
With Sentence based parsing, segmentation is applied when a file is inserted into your CATALYST project.

 

Segmentation rules are always applied to the following file formats:

and are optional for the following file formats/resource types:

Exporting and Importing Segmentation rules

Use the buttons to Export the current segmentation rules to an .ini file. This allows you to share or safeguard the rules.

Importing Segmentation rules will overwrite the current settings.

Segmentation rules are also part of the Settings export and import.

 

Sentence Based Segmentation

With Sentence based parsing, Alchemy CATALYST applies the segmentation rules for the matching language and detects sentence boundaries when a file is inserted into the project (i.e. when the file is parsed).

This is the most accurate way of building TMs as translation units will generally consist of single sentences. This mode thus produces more accurate TM re-use and discovery.

 

Take the following example. A source project file includes the following localizable string:

String_ID251=The cat is black. He is sleeping on the chair. He slept for two hours already!

With Sentence based segmentation, inserting the file into a CATALYST project produces the following string list view:

  • The cat is black.
  • He is sleeping on the chair.
  • He slept for two hours already!

With Paragraph based segmentation, the string list view shows:

  • The cat is black. He is sleeping on the chair. He slept for two hours already!

 

Let's assume this is now translated in your TTK project.

The source file is then updated with a small change: "chair" becomes "sofa". The string now reads:

String_ID251=The cat is black. He is sleeping on the sofa. He slept for two hours already!

With Sentence based segmentation, the impact is far smaller: after leveraging your existing translations, only one string in the parsed project needs to be reviewed:

  • The cat is black.
  • He is sleeping on the sofa.   <= String needs review
  • He slept for two hours already!

Without segmentation, the entire paragraph needs to be reviewed:

  • The cat is black. He is sleeping on the sofa. He slept for two hours already!   <= String needs review

 

The same applies when leveraging translations from TMs that are not directly related to the current project files: comparing individual sentences yields more high-percentage matches than comparing full paragraphs.
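
The effect of segmentation on leverage can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: split_sentences and segments_needing_review are hypothetical helpers, the splitter is a naive regular expression rather than CATALYST's segmentation engine, and only exact matches are considered.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segments_needing_review(old_text, new_text, sentence_based):
    """Return the new segments that have no exact match among the old segments."""
    if sentence_based:
        old_segments = set(split_sentences(old_text))
        new_segments = split_sentences(new_text)
    else:
        # Paragraph based: the whole string is a single segment.
        old_segments = {old_text}
        new_segments = [new_text]
    return [seg for seg in new_segments if seg not in old_segments]


old = "The cat is black. He is sleeping on the chair. He slept for two hours already!"
new = "The cat is black. He is sleeping on the sofa. He slept for two hours already!"

sentence_review = segments_needing_review(old, new, sentence_based=True)
paragraph_review = segments_needing_review(old, new, sentence_based=False)

print(sentence_review)   # ['He is sleeping on the sofa.']
print(paragraph_review)  # the full updated paragraph
```

In this model, only the changed sentence loses its exact match, while the paragraph based variant loses the match for the entire string.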

Configuring segmentation delimiters

By default, rules listed under Language Neutral will apply to ALL files parsed into your projects, regardless of the source language set. You can Add a language to the dropdown and set different segmentation rules applicable to that source language only. Those rules will only apply to projects with the matching source language.

A Sentence delimiter is a sequence of characters that triggers segmentation when found within a parsed localizable string. In other words, it splits a paragraph string into multiple sentences.

For instance, the basic segmentation rule in most languages is a full stop followed by a space, indicating the end of a sentence. This rule is part of the default rules in CATALYST.

The delimiter can be refined further using the associated Ignore options. Consider a sentence containing a person's initial: without the "Ignore when preceded by uppercase letter" option, the delimiter alone would segment in the wrong place.

The book was written by J. Doe. He is a great novelist.  <= Wrong segmentation
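
The interaction between a delimiter and this Ignore option can be modeled roughly as follows. This is a minimal sketch, not CATALYST's actual implementation: the segment function and its ignore_after_uppercase parameter are hypothetical names chosen for illustration.

```python
def segment(text, delimiter=". ", ignore_after_uppercase=True):
    """Split text after each delimiter, optionally skipping delimiters
    that directly follow an uppercase letter (e.g. an initial)."""
    segments = []
    start = 0
    pos = text.find(delimiter)
    while pos != -1:
        preceded_by_upper = pos > 0 and text[pos - 1].isupper()
        if not (ignore_after_uppercase and preceded_by_upper):
            end = pos + len(delimiter)
            segments.append(text[start:end].strip())
            start = end
        pos = text.find(delimiter, pos + len(delimiter))
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


sample = "The book was written by J. Doe. He is a great novelist."

with_ignore = segment(sample, ignore_after_uppercase=True)
without_ignore = segment(sample, ignore_after_uppercase=False)

print(with_ignore)     # ['The book was written by J. Doe.', 'He is a great novelist.']
print(without_ignore)  # ['The book was written by J.', 'Doe.', 'He is a great novelist.']
```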

Adding inline tags as segment delimiters

Inline tags can be used as part of sentence delimiters using the syntax [TAG:tagname] where tagname is the label on the desired tag.

For example:

To segment on an inline tag labeled br, enter [TAG:br] as a delimiter.

To segment on that inline tag only if it is preceded by a colon, enter :[TAG:br] as a delimiter.

To segment on an inline tag whose label contains a space, such as x lb, enter [TAG:x[SPACE]lb] as a delimiter.
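
One way to picture how such delimiters behave is to translate them into regular expressions. The sketch below is purely illustrative and rests on several assumptions that are not confirmed by CATALYST: that inline tags appear in the string as <label> or <label/>, that [SPACE] stands for a literal space, and that the delimiter stays attached to the end of the preceding segment. The names delimiter_to_pattern and segment_on are hypothetical.

```python
import re

TAG_TOKEN = re.compile(r"\[TAG:([^\]]+)\]")


def delimiter_to_pattern(delimiter):
    """Build a regex for a delimiter such as ':[TAG:br]' (illustrative model only)."""
    # Assumption: [SPACE] denotes a literal space inside the delimiter.
    spec = delimiter.replace("[SPACE]", " ")
    pattern = ""
    last = 0
    for match in TAG_TOKEN.finditer(spec):
        pattern += re.escape(spec[last:match.start()])
        # Assumption: the tag appears in the string as <label> or <label/>.
        pattern += "<" + re.escape(match.group(1)) + r"\s*/?>"
        last = match.end()
    pattern += re.escape(spec[last:])
    return re.compile(pattern)


def segment_on(text, delimiter):
    """Split text after every occurrence of the delimiter."""
    pattern = delimiter_to_pattern(delimiter)
    segments, start = [], 0
    for match in pattern.finditer(text):
        segments.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        segments.append(text[start:])
    return segments


text = "Note:<br/>Save first.<br/>Then close."
print(segment_on(text, "[TAG:br]"))   # ['Note:<br/>', 'Save first.<br/>', 'Then close.']
print(segment_on(text, ":[TAG:br]"))  # ['Note:<br/>', 'Save first.<br/>Then close.']
```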

 

If the delimiter is an inline tag with an opening and closing pair, such as <b></b> or <span></span>, the segment will include the text enclosed by the tags.

For example:

Defining the sentence delimiter [TAG:b], the segmentation will look like

 

Consume following newline

By default, this option is selected, meaning that newline characters become part of the segmented localizable strings: the sentence delimiter consumes any newline characters that follow it, along with any spaces that are present.

 

Take the following XML example:

<element>The cat is black.

He is sleeping on the chair.  

He slept for two hours already!   

</element>

With Consume following newline selected, inserting the file into a CATALYST project produces the following string list view. Newline characters that follow a sentence delimiter are consumed.

  • The cat is black.
  • He is sleeping on the chair.
  • He slept for two hours already!   

With Consume following newline deselected, the string list view shows the following. The newline characters are not consumed and therefore start the next segment.

  • The cat is black.

  • He is sleeping on the chair.


  • He slept for two hours already!   

This offers control over how newlines are parsed. For instance, suppose you want a segment to end at every carriage return in the paragraph. Add [RETURN] to the Sentence delimiter list and deselect the Consume following newline option. Using the same example, this results in:

  • The cat is black.
  • He is sleeping on the chair.
  • He slept for two hours already!   

With Consume following newline selected, the result is again:

  • The cat is black.
  • He is sleeping on the chair.
  • He slept for two hours already!   
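
The difference between the two settings is easier to see when the whitespace is made visible. The following is a rough model only, assuming that sentence-ending punctuation acts as the delimiter and that trailing whitespace left at the end of the string stays with the last segment. The segment function and its consume_following_newline parameter are hypothetical names.

```python
import re


def segment(text, consume_following_newline=True):
    """Split after '.', '!' or '?'; optionally absorb trailing spaces and newlines."""
    if consume_following_newline:
        # The delimiter swallows any spaces and newlines that follow it.
        pattern = re.compile(r"[.!?][ \t\r\n]*")
    else:
        pattern = re.compile(r"[.!?]")
    segments, start = [], 0
    for match in pattern.finditer(text):
        segments.append(text[start:match.end()])
        start = match.end()
    tail = text[start:]
    if tail.strip():
        segments.append(tail)
    elif tail and segments:
        # Assumption: whitespace left over at the end joins the last segment.
        segments[-1] += tail
    return segments


text = "The cat is black.\nHe is sleeping on the chair.  \nHe slept for two hours already!   \n"

consumed = segment(text, consume_following_newline=True)
kept = segment(text, consume_following_newline=False)

for seg in consumed:
    print(repr(seg))  # each segment ends with its trailing spaces and newline
for seg in kept:
    print(repr(seg))  # the second and third segments start with the leftover whitespace
```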

 

Handling Segmentation Exceptions

Abbreviations or acronyms may occasionally cause the segmentation engine to misinterpret a sentence boundary. This can be avoided by specifying any sequence of characters that is to be ignored when applying segmentation rules.

Just as specific syntax can be entered to trigger segmentation, syntax can be entered in the exception list to prevent it.
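
As a rough illustration of how an exception list suppresses a boundary, consider the sketch below. It assumes a simple model in which a candidate boundary is skipped whenever the text up to that point ends with a listed exception. The segment function, its exceptions parameter, and the sample abbreviation are hypothetical and do not reflect CATALYST's internal logic.

```python
def segment(text, delimiter=". ", exceptions=()):
    """Split text after each delimiter unless the text before the
    boundary ends with one of the listed exceptions."""
    segments, start = [], 0
    pos = text.find(delimiter)
    while pos != -1:
        end = pos + len(delimiter)
        candidate = text[start:end].rstrip()
        # Skip this boundary when the candidate ends with a listed exception.
        if not any(candidate.endswith(exc) for exc in exceptions):
            segments.append(candidate)
            start = end
        pos = text.find(delimiter, end)
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


sample = "Contact Dr. Smith for details. He replies quickly."

no_exceptions = segment(sample)
with_exceptions = segment(sample, exceptions=("Dr.",))

print(no_exceptions)    # ['Contact Dr.', 'Smith for details.', 'He replies quickly.']
print(with_exceptions)  # ['Contact Dr. Smith for details.', 'He replies quickly.']
```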

 

Joining Segmented strings in your project

For files which have been segmented on insertion into the project, selecting a segmented string in the project workspace window also highlights (in yellow) its associated segments. In other words, you see all the segments that are part of the original parsed string. This highlighting can be turned off, or its color changed, in the settings: HOME > Settings > Colors

In the following example, the selected segment (in blue) is part of a paragraph which has been segmented into 2 segments. The MARK-UP column is another indication of the segmentation applied with the opening tag on the first segment and the closing tag on the second.

Right-click a segmented string to open the context menu, then select either Join with the next segment or Join all segments.

In the following example, right-clicking the selected segment offers the option to Join with the next segment. To join the second and third segments instead, select the second segment, right-click and select Join with the next segment.

This ability to Join Segments is only available for untranslated segments.

 

If a segment is translated, it is assumed you no longer need to Join the segments.

 

Splitting strings in your project

To split a string into two or more segments (in other words, to segment it manually), select the string, place the cursor in the translation field where you wish to split it, then right-click and select Split segment.

 

Once split, the string appears as two segments in the string list.

This can be repeated any number of times, splitting segments which have already been split.

To revert the split, select either of the segments, right-click and select Join All Segments.

 

This ability to Split Segments is only available for untranslated strings.

 

If a string is translated, it is assumed you no longer need to Split into segments.

 

 

Paragraph Based Segmentation

In this mode, Alchemy CATALYST will ignore sentence boundaries and store complete paragraph objects in its TM. This may be useful when aligning two translations that have significant differences in structure and format.