Pages in topic: < [1 2 3 4 5] > |
tmx from Parallel corpus of Patent Translation Resource? Thread poster: Noe Tessmann
Noe Tessmann Local time: 09:47 English to German + ... TOPIC STARTER Import took several days | Jan 2, 2015 |
Dear Michael, thanks a lot for uploading, finally it worked. I imported the file but it took literally days as a background task. I have to wait until my broken Asus is back. Working on a low performer PC is a pain in ... Good times ahead Noe | | |
Michael Beijer United Kingdom Local time: 08:47 Member (2009) Dutch to English + ...
Robert Bononno wrote: I have the source files in FR and EN but don't believe I have any software or text editor that can manipulate and join the larger files (2.7 GB, 3+GB). One of my text editors, TextWrangler, refuses to open them. TextEdit will open the smaller ones but I haven't tried the larger files. I'm very reluctant to try to manipulate these in Excel; it's going to generate a humongous file. I have 8 GB RAM on the machine but these are big files. Might be easier to simply search the contents of the corpus online (if possible).
Macs aren't great at handling very large text files, which is one of my reasons for sticking with Windows. You might want to ask over on the CafeTran mailing list (https://groups.google.com/forum/#!forum/cafetranslators), as several people use Macs there and are quite knowledgeable when it comes to this stuff.
I don't think anyone has added these files to an online database yet. However, they might pop up on the Opus site one of these days, which I recommend you have a look at every now and again: http://opus.lingfil.uu.se/ (the site also has a rudimentary online search interface)
Michael
[Edited at 2015-01-02 21:29 GMT] | | |
Michael Beijer United Kingdom Local time: 08:47 Member (2009) Dutch to English + ...
Noe Tessmann wrote: Dear Michael, thanks a lot for uploading, finally it worked. I imported the file but it took literally days as a background task. I have to wait until my broken Asus is back. Working on a low performer PC is a pain in ... Good times ahead Noe
Cool, that's good to hear. Incidentally, are you actually interested in the metadata? If not, it would simplify the process of converting the data to a TMX somewhat. No big deal if you want it though. It's just an extra step or two. I will most likely be doing the other folders ("title", "description" and "claims") sometime this weekend. Michael | | |
Robert Bononno wrote: One of my text editors, TextWrangler, refuses to open them. Like for its big brother BBEdit, the maximum text file size for TextWrangler is limited to 384 MB, the text file size limit for OS X. To open and edit large text files, you'll have to use either a Java app and assign enough RAM to the heap or a Unix app, or split the files into ones less than 384 MB (the trick Michael's EmEditor uses). You can do the splitting in the Terminal. Cheers, Hans | |
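The splitting Hans describes can be done with the standard Unix `split` utility in the Terminal. As a rough alternative, here is a minimal Python sketch that cuts a large text file into parts below a size limit, breaking only at line ends so no segment is cut in half. The function name `split_by_size`, the part naming scheme and the 300 MB default are illustrative assumptions, chosen only to stay under the 384 MB figure mentioned above.

```python
from pathlib import Path


def split_by_size(src, out_dir, max_bytes=300 * 1024 * 1024):
    """Split src into numbered parts of at most max_bytes, cutting only at line ends.

    A single line longer than max_bytes is still written whole to its own part.
    Returns the list of part paths in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    parts = []
    out = None
    written = 0  # bytes written to the current part
    with src.open("rb") as fin:  # binary mode: no decoding, exact byte counts
        for line in fin:
            # Start a new part for the first line, or when this line would
            # push a non-empty part over the limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part = out_dir / f"{src.stem}.part{len(parts) + 1:03d}{src.suffix}"
                parts.append(part)
                out = part.open("wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Because the file is read line by line in binary mode, memory use stays small regardless of file size, and joining the parts back together in order reproduces the original file byte for byte.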
Noe Tessmann Local time: 09:47 English to German + ... TOPIC STARTER Anyone already aligned the other parts (description, ...)? | Feb 23, 2015 |
Hi, so my laptop has finally been fixed. Has anyone (Michael, you're the master of alignment) already aligned the other parts of this patent corpus? The abstracts are already really helpful. Kind regards and a nice new week Noe This corpus site doesn't seem to be online. | | |
Michael Beijer United Kingdom Local time: 08:47 Member (2009) Dutch to English + ... metadata too? | Feb 23, 2015 |
Hi Noe, No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align. Michael | | |
Noe Tessmann Local time: 09:47 English to German + ... TOPIC STARTER metadata are not so important. | Feb 23, 2015 |
Dearest Michael, I think metadata are not so important. I don't need to know where exactly the translation comes from. Whenever you have time. It's not urgent. I am fine with the abstracts part you kindly aligned. All the best Noe Michael Beijer wrote: Hi Noe, No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align. Michael | | |
Jean Lachaud United States Local time: 03:47 English to French + ... FR/EN and EN/FR too, please | Feb 23, 2015 |
Michael: I am interested in the FR/EN and EN/FR versions, too. Or maybe a more detailed description of the workflow (I'm a Windows user). Thanks in advance. JL Michael Beijer wrote: Hi Noe, No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align. Michael | |
Michael Beijer United Kingdom Local time: 08:47 Member (2009) Dutch to English + ... Phew! (PatTR: Patent Translation Resource files converted to TMXs) | Feb 24, 2015 |
Wow, these files are very, very big. OK, so I managed to do the first part of the Claims batch. Claims is so big, I will have to split it up into around 11 batches of 1,000,000 TUs each. That is, it will be spread across 11 TMXs.
• Claims #1 = here: (1)-PatTR-CLAIMS-(de-en)(TUs-1-1,000,000).tmx (185 MB)
• Claims #2 = here: (2)-PatTR-CLAIMS-(de-en)(TUs-1,000,000-2,000,000).tmx (175.48 MB)
I re-uploaded the TMX derived from "Abstract" here: PatTR-ABSTRACT-(de-en)(718,201-TUs).tmx (134.01 MB)
TMXs for Claims #2-11 will follow as soon as I have a moment, and FR↔EN after that!
Here, in a nutshell, is my workflow:
• append .txt to file names
• open files in EmEditor (or a good text editor capable of opening large files; UltraEdit is also good)
• split these .txt files into manageable chunks (of 1 million TUs/lines each)
• in Ron's CSV Editor, create empty file and paste in contents of .txt files (of src + trgt language) to create a tab-delimited .csv
• in Xbench, convert aforementioned .csv to .tmx
• in Heartsome TMX editor, edit the TMX custom attributes and clean up the TMX (remove duplicates).
Michael
PS: Not sure what's going on with the Opus corpora site (http://opus.lingfil.uu.se/ ).
PPS: Original files here: http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/
[Edited at 2015-02-24 14:05 GMT]
[Edited at 2015-02-24 14:06 GMT] | | |
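For anyone who would rather script the workflow above than click through several tools, here is a minimal Python sketch of the same idea. It is not Michael's actual toolchain. It assumes the source and target files are line-aligned (line N of one is the translation of line N of the other), ignores the metadata, and removes duplicates only within each output file. The function name `write_tmx_batches`, the output file names and the header attribute values are illustrative assumptions.

```python
from itertools import chain, islice
from pathlib import Path
from xml.sax.saxutils import escape

# Minimal TMX 1.4 header; the tool name and o-tmf value are placeholders.
HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<tmx version="1.4">\n'
    '  <header creationtool="pattr-sketch" creationtoolversion="0.1" '
    'segtype="sentence" o-tmf="plain-text" adminlang="en" '
    'srclang="{src}" datatype="plaintext"/>\n'
    '  <body>\n'
)
FOOTER = "  </body>\n</tmx>\n"


def write_tmx_batches(src_path, tgt_path, out_dir,
                      src_lang="de", tgt_lang="en", lines_per_file=1_000_000):
    """Turn two line-aligned text files into TMX files of lines_per_file pairs each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft:
        pairs = zip(fs, ft)  # stops at the shorter file; line counts should match
        while True:
            batch = islice(pairs, lines_per_file)
            first = next(batch, None)
            if first is None:  # no lines left
                break
            out_path = out_dir / f"batch-{len(written) + 1:02d}-{src_lang}-{tgt_lang}.tmx"
            seen = set()  # duplicates are removed per output file only
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(HEADER.format(src=src_lang))
                for s, t in chain([first], batch):
                    s, t = s.strip(), t.strip()
                    if not s or not t or (s, t) in seen:
                        continue  # skip empty segments and exact duplicates
                    seen.add((s, t))
                    # escape() handles &, < and >; control characters that are
                    # illegal in XML are not filtered in this sketch.
                    out.write(
                        "    <tu>\n"
                        f'      <tuv xml:lang="{src_lang}"><seg>{escape(s)}</seg></tuv>\n'
                        f'      <tuv xml:lang="{tgt_lang}"><seg>{escape(t)}</seg></tuv>\n'
                        "    </tu>\n"
                    )
                out.write(FOOTER)
            written.append(out_path)
    return written
```

Because duplicates and empty lines are dropped, a batch can end up with fewer TUs than lines read, which matches the order of the manual steps (split first, deduplicate last).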
Noe Tessmann Local time: 09:47 English to German + ... TOPIC STARTER You're my hero of the year | Feb 24, 2015 |
Dear Michael, now I realize how complicated this alignment must be. 6 steps to get a usable tmx file out of it. I never could have figured that out. Thanks so much. I'll suck the claims part into my TM and enjoy. Strange that nobody before tried to align this really good stuff. Kindest regards Noe | | |
Michael Beijer United Kingdom Local time: 08:47 Member (2009) Dutch to English + ... You're welcome! | Feb 25, 2015 |
Yes, I keep wondering whether someone else might already have done it, and whether it might already be available somewhere else… Michael | | |
Noe Tessmann Local time: 09:47 English to German + ... TOPIC STARTER 1st part digested | Feb 25, 2015 |
Dear Michael, it took half a day to import the 1st part of Claims into MemoQ. This is much more than for Istvan's EU TMs. It's really a lot. Thanks once again, I'll test it next week with a patent translation. KR Noe | |
Michael Beijer United Kingdom Local time: 08:47 Member (2009) Dutch to English + ...
I think the only way to really search amounts of data of this size is to use something like TMLookup: http://www.farkastranslations.com/tmlookup.php You can easily import all of these TMXs (or .txt files) (and a lot more) into a TMLookup database and then search it all as fast as lightning. It generally works a lot faster than any CAT tool I have ever tried (I've tried CafeTran, SDL Studio, memoQ, Felix, DVX2, Wordfast, Fluency and a few others). I finished the entire CLAIMS batch (9 TMXs in total), and am currently uploading them all. I'll post links when they are ready! Michael | | |
Noe Tessmann Local time: 09:47 English to German + ... TOPIC STARTER Really incredible | Feb 26, 2015 |
Dear Michael, incredible that you really managed to convert the whole corpus. Really amazing. I already use the lookup tool for Andras' EU-TMs via Intelliwebsearch. You're right, nothing is faster than this tool. Can't wait to download the stuff. Kind regards Noe | | |
2nl (X) Netherlands Local time: 09:47 UltraEdit handles large files | Feb 27, 2015 |
UltraEdit for Mac handles large files. A must-have editor for OS X, even if you have TW. | | |