I have manually translated 2 million words, no CAT tools. Best software to take advantage of it?
Thread poster: arnoldpredator
arnoldpredator
arnoldpredator
Local time: 15:56
English to Spanish
+ ...
May 29, 2022

Hello guys,

I have been working for 3 years and I have done all my translations manually.
The content is hosted on 2 websites, one in Spanish and the other in English.
I want to find software that lets me upload the English version and the Spanish version to "train" it; I mean to train a neural network or machine learning tool.

I have found AWS Translation, but the problem is that it asks for memory files or CSV, and I don't have those; I only have plain text.

What is the best way to take advantage of all the hard work I have done in the past? Any ideas?

Thank you guys!


Rolf Keller
 
Tony M
Tony M
France
Local time: 16:56
Member
French to English
+ ...
SITE LOCALIZER
Alignment May 29, 2022

I'm no expert, but when I have a matching pair of documents in a language pair like this, I usually look for an alignment tool to create a parallel bilingual file, from which you can then extract some kind of terminology database to exploit within a CAT tool.
I'm sure utilities already exist for doing this, and for 2 million words in particular, you may well be glad of some help handling the sheer volume. I don't usually work with such large documents (in one go), so as I work in Wordfast, I just use the PlusTools 'alignment' tool, which works very well for me... but does require manual intervention.


arnoldpredator
Mark Fessenden
 
Stepan Konev
Stepan Konev  Identity Verified
Russian Federation
Local time: 17:56
English to Russian
BAT in action May 29, 2022

Now you have to align those two million words with AlignFactory, ABBYY Aligner or the built-in aligners available in CAT tools such as memoQ. Once you have your translations aligned, you will be able to use the TMX memory for MT training (for AWS, OPUS CAT, PROMT, etc.).
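
To illustrate that last step: once an aligner has given you source/target pairs, writing them out as TMX takes little code. Below is a minimal Python sketch using only the standard library. The function name build_tmx, the tool name in the header and the sample sentences are placeholders of mine, not something the tools above produce, and a given MT service may expect additional header fields.

```python
import xml.etree.ElementTree as ET

# Namespaced attribute name; ElementTree serializes it as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Build a minimal TMX tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


# Sample pairs are made up for illustration.
tree = build_tmx([("Hello", "Hola"), ("Thank you", "Gracias")])
# To save it: tree.write("memory.tmx", encoding="utf-8", xml_declaration=True)
```

The file name memory.tmx in the comment is only an example; check what your MT service expects before uploading.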

arnoldpredator
Rita Translator
 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Download the files first May 29, 2022

Download them with a download tool like this:

HTTrack
HTTrack is an extremely popular website downloader that lets users download a website from the Internet with all its media files, HTML, etc. Just copy the URL of the website and paste it into the downloader’s ‘URL’ bar. Select the parts of the website you wish to download, such as media files, text or HTML, choose the files you want to exclude, select the location where the downloaded website will be saved, and click the “Download” button to begin downloading the entire website for offline reading.


Weed out the files you don't need and make pairs of matching HTML files.

Auto-align the pairs of html files with https://autoaligner.freetm.com/

Use the aligned Excel or TMX files to train your private MT system.
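
If you would rather feed an aligner plain text than raw HTML (for example a tool where you paste source and target), the downloaded pages can be reduced to paragraphs first. Here is a minimal Python sketch with the standard library only. The class and function names are illustrative, and real pages will probably need extra rules for menus, footers and other repeated page elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, treating block-level tags as paragraph breaks."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Ignore anything inside <script> or <style>.
        if not self._skip_depth:
            self._chunks.append(data)

    def paragraphs(self):
        text = "".join(self._chunks)
        # Collapse runs of whitespace and drop empty lines.
        return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


def html_to_paragraphs(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs()
```

Running html_to_paragraphs on the English page and on its Spanish counterpart gives two paragraph lists that can be compared or pasted into an aligner.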



[Edited at 2022-05-30 06:41 GMT]


arnoldpredator
Gennady Lapardin
 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 15:56
English to French
+ ...
WF aligner May 29, 2022

arnoldpredator wrote:



As mentioned before, even if you want to train an MT engine, it is better to have bilingual (parallel) data.
So, the chore is to take your files (find the source file and the target file) and use an alignment solution.
My favorite is http://wordfast.net/?go=align: it is free and has smart features to weed out what seems to be misaligned (which happens very often). You can get a TXT file out of it (along with other formats), which can be converted to CSV etc., as you mentioned.
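
As a rough idea of the TXT-to-CSV conversion, here is a small Python sketch. It assumes the export is tab-separated with the source in the first column and the target in the second, which may not match the aligner's actual output, so check the file first. The function name and the "en"/"es" header row are also assumptions; as far as I know, services that take parallel data as CSV want a header row of language codes, but verify that in their documentation.

```python
import csv
import io


def tsv_to_csv(tsv_text, src_lang="en", tgt_lang="es"):
    """Convert two-column tab-separated text into CSV with a language header."""
    # QUOTE_NONE: treat quotation marks in the text as ordinary characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([src_lang, tgt_lang])
    for row in reader:
        # Skip blank lines and rows where either side is missing.
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            writer.writerow([row[0].strip(), row[1].strip()])
    return out.getvalue()


# Made-up sample lines for illustration.
sample = 'Hello, world\tHola, mundo\n\nGood "day"\tBuen "día"\nonly source\t\n'
csv_text = tsv_to_csv(sample)
```

The csv module takes care of quoting commas and quotation marks inside segments, which is where hand-made conversions usually go wrong.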


arnoldpredator
Mark Fessenden
 
arnoldpredator
arnoldpredator
Local time: 15:56
English to Spanish
+ ...
TOPIC STARTER
Thanks May 29, 2022

Thank you very much guys, I am going to investigate all the software you mentioned in the thread.

It looks promising.


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
What's in a name? May 30, 2022

Philippe Locquet wrote:

My favorite is http://wordfast.net/?go=align it is free


What's the actual difference between the solution mentioned by you and the one mentioned by me?

Does either one of them perform better?

EDIT: I think that I can answer that question myself.

Your solution is simpler, but you have to paste your source and target text. I think this will potentially lead to less accurate alignment results (since paragraph styles like header, footnote etc. are lost). Will test.

EDIT 2: First attempt failed. There is some cryptic text in the input boxes:

Screen Shot 2022-05-30 at 08.50.08

Screen Shot 2022-05-30 at 08.50.13

I got this error message:

3493 source characters found; 3564 target characters found.
Both source and target text should have more than 4,000 characters (~ 2 pages) - try again!


Does this mean that the documents need to have at least 4K chars? Why on earth???



[Edited at 2022-05-30 06:51 GMT]


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Words per day May 30, 2022

arnoldpredator wrote:

I have been working for 3 years


I just had to make the calculation 😊

Three years is equal to approximately 740 working days.

2,000,000 words in 740 working days comes to about 2,703 words per working day.

Assuming a working day of a little under 8 hours, that's about 350 words per hour: the assumed/required output when I started translating in 1988, with source documents on paper (often faxes).
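
The same back-of-the-envelope calculation as a small Python sketch. The 740 working days and 350 words per hour are the rough figures from this post, not measured data, and the function name is just illustrative.

```python
def daily_output(total_words, working_days, words_per_hour):
    """Return (words per working day, hours per day implied by the hourly rate)."""
    words_per_day = total_words / working_days
    hours_per_day = words_per_day / words_per_hour
    return words_per_day, hours_per_day


per_day, hours = daily_output(2_000_000, 740, 350)
print(f"{per_day:.0f} words per day, about {hours:.1f} hours per day")
# prints "2703 words per day, about 7.7 hours per day"
```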



[Edited at 2022-05-30 07:34 GMT]


Philippe Locquet
 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 15:56
English to French
+ ...
Filters May 30, 2022

Hans Lenting wrote:

What's the actual difference between the solution mentioned by you and the one mentioned by me?

Does either one of them perform better?



I have used both and just stated my preference; if you like Auto Aligner better, there is nothing wrong with that. The one on wordfast.net has active smart filters, and copy-pasting forces the user to actually open the file, which tends to help find potential issues. It also lets you select only portions of a file and leave out large tables that some may not want in their TM. In any case, alignment always ends up being time-consuming if you want to get something clean.


Cécile A.-C.
 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Clean is overrated May 30, 2022

Philippe Locquet wrote:

In any case alignment always end up being time consuming if you want to get something clean.


I don’t waste much time on cleaning up: I only use aligned TMs for concordancing during HT.


 
Gennady Lapardin
Gennady Lapardin  Identity Verified
Russian Federation
Local time: 17:56
Italian to Russian
+ ...
a proposal of an article May 30, 2022

Stepan Konev wrote:

Now you have to align those two million words with AlignFactory, ABBYY Aligner or built-in aligners available with such CAT tools as memoQ, etc. once you have your translations aligned, you will be able to use tmx memory for MT training (for AWS, OPUS CAT, PROMT, etc.).


Hello Stepan,

A while back I read your very interesting post on FB about TMX cleaning. Proposal: could you please post a short article on the subject here on ProZ for permanent reference?


Stepan Konev
 
jyuan_us
jyuan_us  Identity Verified
United States
Local time: 10:56
Member (2005)
English to Chinese
+ ...
Potential use of TM May 30, 2022

arnoldpredator wrote:

Thank you very much guys, I am going to investigate all the software you mentioned in the thread.

It looks promising.


If you are sure the same clients will come back and ask you to translate the same topics, it might be worthwhile to align your translations with their source files. If what you translate going forward is on totally new topics, the TM you create by alignment will be of little use.


Jorge Payan
 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Concordancing only May 30, 2022

jyuan_us wrote:

If what you would translate going forward would be totally new topics, the TM you would create by alignment would have little use.


Exactly. You could only use them for concordancing. So don't spend too much time on aligning ...


Jorge Payan
 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 15:56
English to French
+ ...
Clean is a must May 31, 2022

Hans Lenting wrote:

"Clean is overrated"

I don’t waste much time on cleaning up: I only use aligned TMs for concordancing during HT.


As mentioned, it seems that a proper article on the theme is sorely needed; I look forward to Stepan’s article if he gets around to writing one.
The truth is that alignment is not always used to its full potential, although it can provide tangible ROI in the long run.

Historically, alignment has been perceived as yielding poor-quality TMs, and so aligned TMs usually carry an alignment attribute that earns them a high percentage penalty in CAT tools. This makes sense when the alignment was machine-performed with little to no human revision. Such TMs are only good for concordancing, true.

Now modern real-life applications of alignment differ from that model. Here are two scenarios:
_An agency takes on a new customer that has no TM but has existing translations which they are happy with. Creating a proper alignment, good enough for matching and pre-translating, will help provide translations that the customer will be familiar with, giving the LSP immediate customer satisfaction and fast turnarounds.
_You want to train an MT engine (as Stepan mentioned): your MT will only be as good as the data you train it with. Misaligned segments, or segments where the target language is not the desired one (which happens often), will hurt MT performance.
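
For the MT scenario, a first automated pass can catch the most obvious bad pairs before any human review. Here is a minimal Python sketch; the function names, the 2.5 length ratio and the 20-character cut-off are arbitrary assumptions of mine for illustration, not established thresholds, and the sample pairs are made up. It is no substitute for actually looking at the data.

```python
def looks_clean(src, tgt, max_ratio=2.5, min_len_for_ratio=20):
    """Heuristic check for one aligned pair; thresholds are assumptions."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False  # one side is empty
    if src.casefold() == tgt.casefold():
        return False  # target identical to source: probably left untranslated
    longer = max(len(src), len(tgt))
    shorter = min(len(src), len(tgt))
    if longer < min_len_for_ratio:
        return True  # very short segments: length ratio is not meaningful
    # A large length mismatch often signals a misaligned pair.
    return longer / shorter <= max_ratio


def filter_pairs(pairs, max_ratio=2.5):
    """Keep only the pairs that pass the heuristic check."""
    return [(s, t) for s, t in pairs if looks_clean(s, t, max_ratio)]


pairs = [
    ("The invoice is attached to this message.", "La factura se adjunta a este mensaje."),
    ("Contact us", "Contact us"),
    ("Terms", ""),
    ("Read the full privacy policy before you continue.", "Sí"),
]
clean = filter_pairs(pairs)  # only the first pair survives
```

Note that the identical-text check will also throw out legitimately unchanged segments such as brand names, so treat the rejected pairs as candidates for review rather than as garbage.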

Past translations performed by accurate translators with good writing skills are a “premium” product; with modern CAT technology, it seems unreasonable to let them drown in a sea of garbage and just fish for a few words.
My two cents 😊


 

