tmx from Parallel corpus of Patent Translation Resource? (Software applications)


Thread poster: Noe Tessmann

Noe Tessmann

Local time: 09:47
English to German
+ ...

TOPIC STARTER

Import took several days

Jan 2, 2015

Dear Michael,

thanks a lot for uploading, finally it worked. I imported the file but it took literally days as a background task. I have to wait until my broken Asus is back. Working on a low performer PC is a pain in ...

Good times ahead

Noe

Michael Beijer

United Kingdom
Local time: 08:47
Member (2009)
Dutch to English
+ ...

@Roberto:

Jan 2, 2015

Robert Bononno wrote:

I have the source files in FR and EN but don't believe I have any software or text editor that can manipulate and join the larger files (2.7 GB, 3+GB). One of my text editors, TextWrangler, refuses to open them. TextEdit will open the smaller ones but I haven't tried the larger files. I'm very reluctant to try to manipulate these in Excel; it's going to generate a humongous file. I have 8 GB RAM on the machine but these are big files. Might be easier to simply search the contents of the corpus on line (if possible).

Macs aren't great at handling very large text files, which is one of my reasons for sticking with Windows. You might want to ask over on the CafeTran mailing list, as several people use Macs there and are quite knowledgeable when it comes to this stuff.

I don't think anyone has added these files to an online database yet. However, they might pop up on the Opus site one of these days, which I recommend you have a look at every now and again: http://opus.lingfil.uu.se/ (the site also has a rudimentary online search interface)

Michael

https://groups.google.com/forum/#!forum/cafetranslators

[Edited at 2015-01-02 21:29 GMT]

Michael Beijer

United Kingdom
Local time: 08:47
Member (2009)
Dutch to English
+ ...

@Noe:

Jan 2, 2015

Noe Tessmann wrote:

Dear Michael,

thanks a lot for uploading, finally it worked. I imported the file but it took literally days as a background task. I have to wait until my broken Asus is back. Working on a low performer PC is a pain in ...

Good times ahead

Noe

Cool, that's good to hear.

Incidentally, are you actually interested in the metadata? If not, it would simplify the process of converting the data to a TMX somewhat. No big deal if you want it though. It's just an extra step or two.

I will most likely be doing the other folders ("title", "description" and "claims") sometime this weekend.

Michael

Meta Arkadia
Local time: 14:47
English to Indonesian
+ ...

Mac

Jan 2, 2015

Robert Bononno wrote:
One of my text editors, TextWrangler, refuses to open them.

Like for its big brother BBEdit, the maximum text file size for TextWrangler is limited to 384 MB, the text file size limit for OS X. To open and edit large text files, you'll have to use either a Java app and assign enough RAM to the heap or a Unix app, or split the files into ones less than 384 MB (the trick Michael's EmEditor uses). You can do the splitting in the Terminal.
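For anyone who would rather script the splitting than do it by hand, here is a minimal Python sketch. It assumes the corpus files are plain text with one segment per line; the function name `split_by_size` and the `.partNNN` naming are made up for illustration and are not part of any tool mentioned in this thread.

```python
import os


def split_by_size(path, max_bytes, out_dir):
    """Split a text file into numbered parts of at most max_bytes each.

    Parts are cut only at line ends, so no line is broken in two.
    A single line longer than max_bytes still gets a part of its own.
    Returns the list of part paths (hypothetical naming: name.part001, ...).
    """
    parts = []
    out = None
    written = 0
    with open(path, "rb") as src:      # binary mode, so sizes are exact bytes
        for line in src:               # streamed line by line, low memory use
            # Start a new part for the first line, or when this line
            # would push the current (non-empty) part over the limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                name = f"{os.path.basename(path)}.part{len(parts) + 1:03d}"
                part_path = os.path.join(out_dir, name)
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Called as `split_by_size("corpus.txt", 300 * 1024 * 1024, ".")`, it would write parts of at most 300 MB each into the current folder. The file name and the 300 MB figure are only examples.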

Cheers,

Hans

Noe Tessmann

Local time: 09:47
English to German
+ ...

TOPIC STARTER

Anyone already aligned the other parts (description, ...)?

Feb 23, 2015

Hi,

so my laptop has finally been fixed. Has anyone (Michael, you're the master of alignment) already aligned the other parts of this patent corpus? The abstracts are already really helpful.

Kind regards and a nice new week

Noe

This corpus site doesn't seem to be online.

Michael Beijer

United Kingdom
Local time: 08:47
Member (2009)
Dutch to English
+ ...

metadata too?

Feb 23, 2015

Hi Noe,

No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align.

Michael

Noe Tessmann

Local time: 09:47
English to German
+ ...

TOPIC STARTER

metadata are not so important.

Feb 23, 2015

Dearest Michael,

I think metadata are not so important. I don't need to know where exactly the translation comes from.
Whenever you have time. It's not urgent. I am fine with the abstracts part you kindly aligned.

All the best

Noe

Michael Beijer wrote:

Hi Noe,

No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align.

Michael

Jean Lachaud

United States
Local time: 03:47
English to French
+ ...

FR/EN and EN/FR too, please

Feb 23, 2015

Michael:

I am interested in the FR/EN and EN/FR versions, too. Or maybe a more detailed description of the workflow (I'm a Windows user).

Thanks in advance.

JL

Michael Beijer wrote:

Hi Noe,

No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align.

Michael

Michael Beijer

United Kingdom
Local time: 08:47
Member (2009)
Dutch to English
+ ...

Phew! (PatTR: Patent Translation Resource files converted to TMXs)

Feb 24, 2015

Wow, these files are very, very big.

OK, so I managed to do the first part of the Claims batch. Claims is so big, I will have to split it up into around 11 batches of 1,000,000 TUs each. That is, it will be spread across 11 TMXs.

• Claims #1 = here: (1)-PatTR-CLAIMS-(de-en)(TUs-1-1,000,000).tmx (185 MB)
• Claims #2 = here: (2)-PatTR-CLAIMS-(de-en)(TUs-1,000,000-2,000,000).tmx (175.48 MB)

I re-uploaded the TMX derived from "Abstract" here: PatTR-ABSTRACT-(de-en)(718,201-TUs).tmx (134.01 MB)

TMXs for the remaining Claims batches will follow as soon as I have a moment, and FR↔EN after that!

Here, in a nutshell, is my workflow:

• append .txt to file names
• open files in EmEditor (or a good text editor capable of opening large files; UltraEdit is also good)
• split these .txt files into manageable chunks (of 1 million TUs/lines each)
• in Ron's CSV Editor, create an empty file and paste in the contents of the .txt files (of src + trgt language) to create a tab-delimited .csv
• in Xbench, convert the aforementioned .csv to .tmx
• in Heartsome TMX editor, edit the TMX custom attributes and clean up the TMX (remove duplicates).
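
The split-and-convert part of the steps above could also be scripted. What follows is a rough Python sketch, not the workflow actually used here: it assumes the source and target files are UTF-8 plain text with one segment per line and line N of one file matching line N of the other. The function names `write_tmx` and `convert`, the output naming, and the header values are placeholders. It does not remove duplicates or edit custom attributes, so it does not replace the Heartsome step.

```python
import itertools
from xml.sax.saxutils import escape


def write_tmx(pairs, out_path, src_lang="de", tgt_lang="en"):
    """Write (source, target) string pairs to a minimal TMX 1.4 file."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<tmx version="1.4">\n')
        # Header values below are placeholders chosen for this sketch.
        out.write(
            '<header creationtool="sketch" creationtoolversion="0" '
            'segtype="sentence" o-tmf="txt" adminlang="en" '
            f'srclang="{src_lang}" datatype="plaintext"/>\n'
        )
        out.write("<body>\n")
        for src, tgt in pairs:
            src, tgt = src.strip(), tgt.strip()
            if not src or not tgt:
                continue  # skip pairs with an empty side
            out.write(
                f'<tu><tuv xml:lang="{src_lang}"><seg>{escape(src)}</seg></tuv>'
                f'<tuv xml:lang="{tgt_lang}"><seg>{escape(tgt)}</seg></tuv></tu>\n'
            )
            count += 1
        out.write("</body>\n</tmx>\n")
    return count


def convert(src_path, tgt_path, out_prefix, batch_size=1_000_000,
            src_lang="de", tgt_lang="en"):
    """Pair two line-aligned files and write one TMX per batch_size lines."""
    outputs = []
    with open(src_path, encoding="utf-8") as s, \
         open(tgt_path, encoding="utf-8") as t:
        pairs = zip(s, t)  # stops at the shorter file
        for n in itertools.count(1):
            batch = itertools.islice(pairs, batch_size)
            first = next(batch, None)
            if first is None:
                break  # no lines left
            out_path = f"{out_prefix}-{n}.tmx"
            write_tmx(itertools.chain([first], batch), out_path,
                      src_lang, tgt_lang)
            outputs.append(out_path)
    return outputs
```

Because `zip` stops at the shorter file, a mismatch in line counts goes unnoticed, so it is worth comparing the line counts of both files before trusting the result.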

Michael

PS: Not sure what's going on with the Opus corpora site (http://opus.lingfil.uu.se/ ).
PPS: Original files here: http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/

[Edited at 2015-02-24 14:05 GMT]

[Edited at 2015-02-24 14:06 GMT]

Noe Tessmann

Local time: 09:47
English to German
+ ...

TOPIC STARTER

You're my hero of the year

Feb 24, 2015

Dear Michael,

now I realize how complicated this alignment must be. 6 steps to get a usable tmx file out of it. I never could have figured that out. Thanks so much. I'll suck the claims part into my TM and enjoy.

Strange that nobody before tried to align this really good stuff.

Kindest regards

Noe

Michael Beijer

United Kingdom
Local time: 08:47
Member (2009)
Dutch to English
+ ...

You're welcome!

Feb 25, 2015

Yes, I keep wondering whether someone else might already have done it, and whether it might already be available somewhere else…

Michael

Noe Tessmann

Local time: 09:47
English to German
+ ...

TOPIC STARTER

1st part digested

Feb 25, 2015

Dear Michael,

it took half a day to import the 1st part of Claims into memoQ. This is much more than for Istvan's EU TMs. It's really a lot.

Thanks once again, I'll test it next week with a patent translation.

KR

Noe

Michael Beijer

United Kingdom
Local time: 08:47
Member (2009)
Dutch to English
+ ...

Hi Noe,

Feb 26, 2015

I think the only way to really search amounts of data of this size is to use something like TMLookup: http://www.farkastranslations.com/tmlookup.php

You can easily import all of these TMXs (or .txt files) (and a lot more) into a TMLookup database and then search it all as fast as lightning. It generally works a lot faster than any CAT tool I have ever tried (I've tried CafeTran, SDL Studio, memoQ, Felix, DVX2, Wordfast, Fluency and a few others).

I finished the entire CLAIMS batch (9 TMXs in total), and am currently uploading them all. I'll post links when they are ready!

Michael

Noe Tessmann

Local time: 09:47
English to German
+ ...

TOPIC STARTER

Really incredible

Feb 26, 2015

Dear Michael,

incredible you really managed to convert the whole corpus. Really amazing.
I already use the lookup tool for Andras' EU-TMs via Intelliwebsearch. You're right, nothing is faster than this tool.

Can't wait to download the stuff

Kind regards

Noe

2nl (X)

Netherlands
Local time: 09:47

UltraEdit handles large files

Feb 27, 2015

UltraEdit for Mac handles large files. A must-have editor for OS X, even if you have TW.
