How to run terminology check in large TMs?
Thread poster: Lais Lewicki
Lais Lewicki
Brazil
Local time: 13:33
English to Portuguese
Mar 21, 2022

Hello!

I have been tasked with "cleaning up" our (very large) TMs.

Our goals are to:

Remove duplicates/inconsistent translations
Run a number check
Run a spellcheck
Check terminology using our termbase for that specific TM

I've successfully used Heartsome Editor to remove duplicates and inconsistent translations, but I'm stuck on how I could best carry out the remaining tasks.

Usually, we use Verifika to run quality checks on translation projects. But when I tried to run it for this particular TM (over 300MB in size), the process ran for over 12 hours and it still did not finish. That seems unfeasible to me.

Can you give me any pointers on what I could do or software I could use for this task?

Thanks in advance!


 
Samuel Murray  Identity Verified
Netherlands
Local time: 18:33
Member (2006)
English to Afrikaans
Deselect checks Mar 21, 2022

Lais Lewicki wrote:
Usually, we use Verifika to run quality checks on translation projects. But when I tried to run it for this particular TM (over 300MB in size), the process ran for over 12 hours and it still did not finish. That seems unfeasible to me.

I'm not familiar with Verifika, but... could it be that Verifika ran so slowly because it was checking for too many types of errors at once? Try selecting *just one* type of error at a time.


 
Charles Peng  Identity Verified
China
Local time: 00:33
Member (2022)
English to Chinese
Try Xbench 3.0 Mar 21, 2022

You can try Xbench 3.0, which can quickly check the issues you mentioned: export the TM in TMX format (*.tmx), then load it into Xbench.

And as @Samuel Murray suggested, you can check one error type at a time.

[Edited at 2022-03-21 16:01 GMT]


 
Stepan Konev  Identity Verified
Russian Federation
Local time: 19:33
English to Russian
QA Distiller Mar 21, 2022

You can also try QA Distiller (free software). It checks all the items in your list plus many others.

Also, QA Distiller supports regex, and you can use it to clean number-only segments, for example:

^\P{L}*\d\P{L}*$
Examples of content to be cleaned: 1-22, [23], (3), 4+, !2, 3-3/3, ~2, 6.6.3, ^3*5…, 4:5, etc.

^\P{L}*\d\P{L}* .+$
Examples of content to be cleaned: 4 ÷ 12mA, 245 rpm, 0 ÷ 100 % C.C.W. – passline, etc.
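
For anyone who wants to see how these two patterns behave before running them on a real TM, here is a small Python sketch. It is not QA Distiller's implementation. Python's built-in `re` module does not support `\P{L}`, so the sketch approximates "not a letter" with `[\W\d_]`, and the function name `classify` is made up for illustration.

```python
import re

# \P{L} ("any character that is not a letter") is not available in Python's
# built-in re module, so this sketch approximates it with [\W\d_]:
# non-word characters, digits and the underscore.
NON_LETTER = r"[\W\d_]"

# Counterpart of ^\P{L}*\d\P{L}*$ : no letters at all, at least one digit.
NUMBER_ONLY = re.compile(rf"^{NON_LETTER}*\d{NON_LETTER}*$")

# Counterpart of ^\P{L}*\d\P{L}* .+$ : a letter-free prefix containing a digit,
# then a space, then anything.
NUMBER_THEN_TEXT = re.compile(rf"^{NON_LETTER}*\d{NON_LETTER}* .+$")


def classify(segment):
    """Return which of the two patterns (if any) a segment matches."""
    if NUMBER_ONLY.match(segment):
        return "number-only"
    if NUMBER_THEN_TEXT.match(segment):
        return "number-then-text"
    return "keep"


for text in ["1-22", "[23]", "6.6.3", "245 rpm", "4 ÷ 12mA", "Press 2 keys"]:
    print(f"{text!r}: {classify(text)}")
```

Segments that start with a letter, such as "Press 2 keys", fall through to "keep", so only segments that begin with a letter-free prefix containing a digit are candidates for cleaning.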

I have just processed a 221,320 KB TMX file. It took QA Distiller 15 minutes to complete the task with all of your checks.

[Edited at 2022-03-21 16:35 GMT]


 
Pablo Bouvier  Identity Verified
Local time: 18:33
German to Spanish
Crosscheck online Mar 22, 2022

This is by far the best quality control tool I know. For a long time it was free, but today it is pay-per-use:

https://www.idioma.com/crosscheck


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
CafeTran on Mac Studio M1 Ultra Mar 22, 2022

You can speed the QA process up with CafeTran on a Mac Studio M1 Ultra. Just open the TMX as a project.

 
Samuel Murray  Identity Verified
Netherlands
Local time: 18:33
Member (2006)
English to Afrikaans
Translate Toolkit Mar 22, 2022

Lais Lewicki wrote:
Remove duplicates/inconsistent translations
Run a number check
Run a spellcheck
Check terminology using our termbase for that specific TM


I haven't used the Translate Toolkit in a number of years, but their most recent update is from this year, so the project still appears to be alive. The Translate Toolkit works like this: it exports the matching segments to a separate file, the user corrects whatever segments need correcting in that file, and the corrected export is then imported back into the original file. The advantage is that you're always working on a file that contains only the smaller subset of segments that match the particular check or search string. However, it's quite basic and requires a bit of command-line skill, and the entire process is done with PO files (so you need to convert to PO, and you need to edit the files in a PO editor or a text editor).

http://docs.translatehouse.org/projects/translate-toolkit/en/latest/installation.html

Step 1 is to convert your TMX file to CSV (using a tool of your choosing). Then use the csv2po.py script to convert it to a PO file. At the very end, use po2tmx.py to generate a TMX file again. Unfortunately there is no tmx2po.py script.
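
If no ready-made tool is at hand for the TMX-to-CSV step, a short Python script can do it. The following is only a sketch under a few assumptions: the function names `tmx_rows` and `write_csv` are made up, language codes are matched exactly (case-insensitively), inline tags inside segments are dropped with only their text kept, and the three-column order (location, source, target) is my understanding of what csv2po reads by default, so check it against the Translate Toolkit documentation first. The file is parsed as a stream, which matters for a TM of several hundred MB.

```python
import csv
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_rows(source, src_lang, tgt_lang):
    """Yield (location, source, target) tuples from a TMX path or binary file object."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    count = 0
    # iterparse streams the file, so the whole TMX is never held in memory at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() keeps the text of inline elements but drops the tags.
                texts[lang] = "".join(seg.itertext())
        if src_lang in texts and tgt_lang in texts:
            count += 1
            yield (f"tu{count}", texts[src_lang], texts[tgt_lang])
        elem.clear()  # drop this unit's children to keep memory use low


def write_csv(rows, out):
    """Write the rows to an open text file object as plain CSV."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)


# Hypothetical usage (file names and language codes are placeholders):
# with open("bigmomma.csv", "w", newline="", encoding="utf-8") as out:
#     write_csv(tmx_rows("bigmomma.tmx", "en-US", "pt-BR"), out)
```

Opening the output with `encoding="utf-8"` writes the CSV without a BOM.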

pofilter: this script exports segments based on quality check filters, e.g. number check, punctuation check, etc.
pomerge: this script imports the export file back into the original file.
pogrep: this script exports segments based on a search string (multiple search strings in a single query are possible).

So, if I remember correctly, an example of a command would be:
pofilter.py -t startpunc "bigmomma.tmx.po" "export.po"
which would create a file named export.po with all segments where the start punctuation of the source is different from the target.
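
The same "export only the failing segments" idea can be sketched in a few lines of Python for the number check. This is not how pofilter is implemented; it is a deliberately crude stand-in that compares runs of digits, so that "1,000.5" and "1.000,5" count as the same number. The function names and the (location, source, target) row layout are assumptions made for the example.

```python
import re
from collections import Counter

DIGIT_RUNS = re.compile(r"\d+")


def numbers_differ(source, target):
    """True if source and target do not contain the same runs of digits."""
    # Comparing digit runs (not whole numbers) tolerates swapped decimal and
    # thousands separators, e.g. English "1,000.5" vs Portuguese "1.000,5".
    return Counter(DIGIT_RUNS.findall(source)) != Counter(DIGIT_RUNS.findall(target))


def failing_rows(rows):
    """Keep only the (location, source, target) rows that fail the number check."""
    return [row for row in rows if numbers_differ(row[1], row[2])]


rows = [
    ("tu1", "Add 2 eggs", "Adicione 2 ovos"),
    ("tu2", "Wait 10 minutes", "Aguarde 15 minutos"),
    ("tu3", "1,000.5 kg", "1.000,5 kg"),
]
for location, source, target in failing_rows(rows):
    print(location, source, target, sep=" | ")
```

Only the second row is reported here. A check this simple will still flag legitimate differences, such as numbers spelled out in one language or reformatted dates, so the export needs a human review.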

Before you use it on Windows, you must install some extra stuff (see installation guide), and create an environment where the EXE files will end up, e.g. C:\Users\YourName\Envs\myenvironment\Scripts.

In a quick test, it refused to work on my CSV file that I exported directly from Excel. Also, it exports UTF8 without BOM (and doesn't tolerate a file if it contains a UTF8 BOM), so there is that too.
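
If the BOM turns out to be the problem, it can be removed before conversion. A minimal sketch, assuming the file fits in memory; the function name `strip_utf8_bom` and the file name in the usage comment are placeholders:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical usage on a CSV exported from Excel:
# with open("export.csv", "rb") as f:
#     cleaned = strip_utf8_bom(f.read())
# with open("export.csv", "wb") as f:
#     f.write(cleaned)
```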

[Edited at 2022-03-22 10:09 GMT]


 
Hema Gupta
India
How to run terminology check in large TMs? Apr 14, 2022

If you have a big TM, it is better to distribute the terms over several TMs. Usually, you can run the terminology check on separate elements within one TM, for example a separate website. If you have split your TM into several parts, there are two ways to perform terminology checking in large TMs:

 

