The Term Harvest Expert generates terminology databases by analyzing Alchemy projects and identifying the terms within them.  It combines statistical methods, word analysis, content examination and optional manual override to identify terms accurately.

Projects containing user interface segments are typically rich in term candidates.  Combined with accurate detection in other Catalyst content such as documentation and help systems, this makes Alchemy CATALYST well placed to create high-quality termbases.  The job of the Term Harvest Expert is to separate the wheat from the chaff.  The Expert takes a statistical approach to identifying terminology and generates a database containing those terms.

 

Note: Term Harvest generates *.tbx and *.xlsx as well as other formats.

Term Harvest examines every piece of content in a Catalyst project and breaks it down to the smallest unit.  Each of these units is a potential term and is examined in detail, scored and ranked against all other term candidates.  The ranked list is then displayed for refinement before the final termbase is generated.
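The idea of breaking content into units and ranking them can be illustrated with a minimal sketch.  It is not Catalyst's actual scoring method, which is not described here: it assumes raw frequency as the score, and the function name candidate_counts and the sample segments are hypothetical.

```python
import re
from collections import Counter

def candidate_counts(segments, max_words=3):
    """Count every sequence of 1..max_words words across all segments.

    Assumption: raw frequency stands in for the Expert's real scoring.
    """
    counts = Counter()
    for segment in segments:
        words = re.findall(r"[A-Za-z0-9']+", segment.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical UI segments, for illustration only.
segments = ["Open the project file", "Save the project file", "Close the project"]
counts = candidate_counts(segments)
ranked = counts.most_common()  # highest frequency first
```

Here "project file" is counted twice and "the project" three times, which also shows why common words need to be filtered out before a ranked list is useful.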

Launch the expert from the EXPERT ribbon.

 

Files Tab

Use the Files Tab to supply input files and specify the desired output.

Input Files

The process takes Catalyst project files (*.ttk) containing any content.

 

Term Harvest Generation Options

Output Excluded Terms File

The Term Harvest Expert can optionally store the segments and terms that it excluded.  To generate a file containing the excluded terms, select the Export Excluded Terms File check box.  The filename is automatically generated based on the terminology database filename, with the word _excluded appended.

Output User Settings File

The options used to generate the terminology database may also be saved by the Term Harvest Expert.  Select the Export User Settings File option to store the settings used during the term analysis.  The filename is automatically generated based on the terminology database filename - the word _settings is appended and the file has an *.ini extension.
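The naming of these two companion files can be sketched as follows.  This is an illustration only: the function name companion_files is hypothetical, and the sketch assumes the excluded terms file keeps the extension of the terminology database, which is not stated above.

```python
from pathlib import Path

def companion_files(termbase_path):
    """Derive the excluded-terms and settings filenames from the termbase name."""
    termbase = Path(termbase_path)
    # Assumption: the excluded file reuses the termbase extension.
    excluded = termbase.with_name(termbase.stem + "_excluded" + termbase.suffix)
    # The settings file always has an .ini extension.
    settings = termbase.with_name(termbase.stem + "_settings.ini")
    return excluded, settings

excluded, settings = companion_files("glossary.tbx")
```

With a termbase named glossary.tbx, this yields glossary_excluded.tbx and glossary_settings.ini.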

Copy source terms to target column where no translation present

Term Harvest examines source and target words in a Catalyst project and outputs a bilingual termbase.  If no translation is available for a term, it is extracted into the .tbx output with a blank target entry.  Choose this option to output the source text instead when no translation is present in the analyzed projects.
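The effect of this option can be shown with a short sketch.  The function fill_missing_targets and the (source, target) pair representation are hypothetical and chosen only for illustration.

```python
def fill_missing_targets(entries, copy_source=True):
    """entries is a list of (source, target) pairs; target may be '' or None."""
    filled = []
    for source, target in entries:
        if not target and copy_source:
            target = source  # option selected: reuse the source text
        filled.append((source, target or ""))  # option cleared: leave blank
    return filled

entries = [("File", "Datei"), ("Ribbon", "")]
with_copy = fill_missing_targets(entries, copy_source=True)
without_copy = fill_missing_targets(entries, copy_source=False)
```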

 

Options Tab

You can control how Term Harvest identifies terms using the following options.

 

Content Control

Description

Stop words file

Term Harvest has a built-in list of stop words.  These are common words, such as 'the', 'an' and 'a', that should be excluded from termbases.  You can add to this list by specifying your own stop words file.

Exclude terms found in termbase

If you have a previously existing termbase containing already approved terms, you should specify that here to avoid having to re-approve terms if they are found again.

Exclude Locked / Frozen strings

Use this option to ignore any locked strings in your project.

Extract nouns only

Term Harvest is a natural language processor and can identify parts of speech.  Users frequently wish to consider only nouns as terms.  Use this option to extract nouns only.  Note: the part-of-speech (POS) tagger can be less accurate with short segments such as software UI strings.

Exclude terms with non-dictionary words

Misspelled words or words not found in a dictionary can be excluded from your termbases by selecting this option.  It could have the side effect of excluding product or feature names such as ACME LaunchPad.

Extract term translations

With this option selected, any translation available in the harvested project(s) will be listed along with the identified terms. They are displayed in the Target column in the Candidates tab.
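Two of the content controls above, the stop words list and the exclusion of terms already in a termbase, amount to simple set lookups.  The sketch below is a rough illustration, not the Expert's implementation: the name apply_content_filters is hypothetical, the built-in list shown is only the three words quoted above, and matching is assumed to be case-insensitive and on whole candidates only.

```python
# Illustrative subset; the real built-in list is not reproduced here.
BUILT_IN_STOP_WORDS = {"the", "an", "a"}

def apply_content_filters(candidates, extra_stop_words=(), approved_terms=()):
    """Drop candidates that are stop words or already-approved terms."""
    stop_words = BUILT_IN_STOP_WORDS | {w.lower() for w in extra_stop_words}
    approved = {t.lower() for t in approved_terms}
    kept = []
    for term in candidates:
        key = term.lower()  # assumption: case-insensitive comparison
        if key in stop_words or key in approved:
            continue
        kept.append(term)
    return kept

remaining = apply_content_filters(
    ["the", "Project", "ribbon", "Expert"],
    extra_stop_words=["ribbon"],   # user-supplied stop words
    approved_terms=["project"],    # terms from an existing termbase
)
```

Only "Expert" survives in this example, because the other three candidates are a built-in stop word, a user stop word and a previously approved term.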

 

Statistics Control

Description

Minimum Frequency

It may be that you consider a term candidate unlikely to be a term unless it appears within your content a certain number of times.  The Catalyst default is that a term must be present at least three times before it will be considered a term candidate.  You can set this value here.

Maximum number of words in term

The longer a term candidate, the less reusable and more segment-like it becomes.  You can control the number of words above which a candidate is excluded.

Minimum number of words in term

Should you wish to apply a minimum number of words required to be considered a term, you can do so here.

Maximum number of terms to output

Depending on your requirements, you may wish to define the overall size of the term output.  For example, if you just want to examine the top 300 terms, set this value to 300.  The number of terms written to the termbase is defined here.

Maximum number of term contexts to output

Each term is written to the term candidate list with context - an example segment that contained the term.  This helps the user refine the list.  Three is a good default, but you can determine the number of context segments to include.
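How the statistics controls interact can be sketched as a single filtering pass.  This is an illustration under stated assumptions rather than the Expert's own logic: the Candidate class and apply_statistics_filters are hypothetical, the word limits are treated as inclusive, candidates are ranked by frequency alone, and the min_words and max_words defaults are invented for the example.  The defaults of 3 for frequency and contexts and the value 300 come from the descriptions above.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    term: str
    frequency: int
    contexts: list = field(default_factory=list)

def apply_statistics_filters(candidates, min_frequency=3, min_words=1,
                             max_words=4, max_terms=300, max_contexts=3):
    """Apply frequency, length, output-size and context limits."""
    kept = []
    for c in candidates:
        n_words = len(c.term.split())
        if c.frequency < min_frequency:
            continue  # too rare to be a term candidate
        if not (min_words <= n_words <= max_words):
            continue  # too short or too segment-like
        # Keep only the first max_contexts example segments.
        kept.append(Candidate(c.term, c.frequency, c.contexts[:max_contexts]))
    # Assumption: frequency is the ranking score.
    kept.sort(key=lambda c: c.frequency, reverse=True)
    return kept[:max_terms]

candidates = [
    Candidate("project file", 5, ["a", "b", "c", "d"]),
    Candidate("ribbon", 2, ["x"]),
    Candidate("term harvest expert options tab", 9, ["y"]),
    Candidate("termbase", 7, ["z"]),
]
result = apply_statistics_filters(candidates)
```

In this example "ribbon" is dropped for appearing only twice, the five-word candidate is dropped for length, and "project file" keeps three of its four contexts.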

 

Candidates Tab

The Candidates tab facilitates accurate identification of terminology from the term candidate list.  Read more information.