PDF: I'm looking for trouble :)
Thread poster: MikeTrans
MikeTrans
Germany
Local time: 04:41
Italian to German
Dec 13, 2014

Hi,
I have installed several of the main tools for handling and editing PDF files and converting them to other formats, and I am testing them. So now I'm looking for trouble.

I'm in search of some good 'dead' PDFs of the kind that generally harass translators. Is there a freely accessible repository of such example files? I want to know how my programs deal with them, and whether they can convert them to some extent prior to translating.

Any help or links are highly welcome!
Greets,
Mike


 
Miguel Carmona
Miguel Carmona  Identity Verified
United States
Local time: 19:41
English to Spanish
... Dec 13, 2014

Why don't you create your own PDF documents for your testing?

All you need is a scanner; save the scans as PDF files.

Scan documents of different print quality, and you can even position them at different angles on the scanner. You can also try different degrees of brightness and contrast with the scanner software, or even add "noise" or other effects (like wavy text, blur, etc.) in Photoshop or similar software.

If you follow a systematic, scientific approach, keeping notes on the different settings you use, print quality, etc., and the corresponding results you obtain with the software you are testing, you can end up with a very interesting study.

============================
EDIT:

I was thinking, the process would be more systematic and the results more informative if you used just one page to produce PDFs of different qualities for the tests.

First, you could start with a really clean document, placed perfectly square on the scanner bed, which you would scan and save in any format that Photoshop or a similar program accepts. That would be your initial image which, once saved as a PDF file, will produce the baseline results in your various PDF-reading programs.

Then you can subject the image to any kind of treatment you want in Photoshop, as I mentioned above: angle, brightness (exposure), contrast, noise (“dirty” look), waviness, etc. Every time you change one of those settings, save the resulting image as a PDF file and use it for the tests in the PDF-reading programs you are comparing.
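
A minimal sketch of how such a one-setting-at-a-time test plan could be written down, in Python. The setting names, levels and file names are hypothetical and do not come from any scanner software or from Photoshop. The script only produces the note sheet; the images themselves would still be made by hand.

```python
import csv
import io

# Hypothetical baseline and degradation levels; the values are illustrative only.
BASELINE = {"angle_deg": 0, "brightness": 0, "contrast": 0, "noise_pct": 0}
LEVELS = {
    "angle_deg": [1, 3, 5],
    "brightness": [-30, 30],
    "contrast": [-30, 30],
    "noise_pct": [5, 15],
}


def build_test_plan(baseline, levels):
    """Return the baseline plus one variant per changed setting (one factor at a time)."""
    plan = [dict(baseline, file="page_baseline.pdf")]
    for setting, values in levels.items():
        for value in values:
            variant = dict(baseline)
            variant[setting] = value
            variant["file"] = f"page_{setting}_{value}.pdf"
            plan.append(variant)
    return plan


def plan_to_csv(plan):
    """Render the plan as CSV text, with an empty column for notes on each tool's result."""
    fields = ["file", *BASELINE.keys(), "ocr_result_notes"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(plan)
    return out.getvalue()


plan = build_test_plan(BASELINE, LEVELS)
print(plan_to_csv(plan))
```

Because every variant differs from the baseline in exactly one setting, a drop in recognition quality can be traced back to that setting alone.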

============================

Good luck!

[Edited at 2014-12-13 19:38 GMT]


 
MikeTrans
Germany
Local time: 04:41
Italian to German
TOPIC STARTER
Is a good or bad OCR the main source of problems with PDFs? Dec 13, 2014

Hi Miguel,
thank you for your extensive response, and excuse the somewhat relaxed title (I realized too late that it may not sound quite appropriate or serious...)

Yes, I was primarily thinking about scanned documents and a subsequent OCR pass that turns them into searchable text files. Mainly I want to compare different tools.
Sorry for the misleading word "testing": I meant that I have downloaded various PDF creation programs with trial licenses. So I don't have a scientific approach in mind, but of course I want to push these programs to their limits to see how they can help me. I may then also see more clearly whether and how much I should charge a client...

Actually I did find some sample files on the net which were published as examples, very simple printed forms or letters.
My programs run OCR on such simple prints in about 10 seconds (I don't have a scanner) and convert them very well, creating text or graphic boxes where appropriate and retaining the layout.
So, after that, I was just thinking: "It's all sunny...I should raise the difficulty or deal with real *practical* problems with PDFs relating to translators." Your suggestions make sense, but are these the things translators fear most with PDFs?

I should have asked this in my post above:
Still, I feel that some translators don't like to translate PDF files. Why?
Is it just a false assumption? Is the additional OCR of a scanned file the main concern? And what about possible problems arising from 'normal' PDFs with text?
Thanks very much to anyone who responds with examples of problems encountered.

Cheers,
Mike


 
Emma Goldsmith
Emma Goldsmith  Identity Verified
Spain
Local time: 04:41
Member (2004)
Spanish to English
The OCR program isn't good or bad in itself Dec 14, 2014

MikeTrans wrote:

My programs run OCR on such simple prints in about 10 seconds (I don't have a scanner) and convert them very well, creating text or graphic boxes where appropriate and retaining the layout.


Yes, OCR programs do a very good job with simple texts. Yes, they retain the layout.

Still, I feel that some translators don't like to translate PDF files. Why?


1. The apparently simple layout that you see in an OCR'd Word file can turn into a nightmare when you open it in a CAT tool:
http://signsandsymptomsoftranslation.com/2012/06/15/tag-soup-in-trados-studio/
That's because the job of OCR software is to faithfully reproduce what it sees. So every tiny variation in the space between characters (AKA kerning) will be reflected in a tag. Every time it thinks that the font has varied, another tag will be added.

2. Imagine a crystal clear text with a watermark in the background. An OCR program is likely to misinterpret every character that coincides with the watermark.

3. One simple solution is to set the OCR program to produce a plain text version of your scanned document. That avoids all "tag soup" issues, but, of course, you have to put all the formatting back into the Word document by hand. That's what I do, so long as the document isn't too complex.
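
The "tag soup" effect from point 1 can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: the `Run` class, the tolerance values and the idea of counting one tag per formatting change are hypothetical, and they do not describe how Trados Studio or any OCR program works internally.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text with uniform formatting, as an OCR engine might emit it."""
    text: str
    font: str
    size_pt: float
    spacing_pt: float  # character spacing the engine measured


def count_tags(runs):
    """Toy model: each formatting change between neighbouring runs costs one inline tag."""
    return sum(
        1
        for a, b in zip(runs, runs[1:])
        if (a.font, a.size_pt, a.spacing_pt) != (b.font, b.size_pt, b.spacing_pt)
    )


def merge_similar(runs, size_tol=0.5, spacing_tol=0.3):
    """Merge neighbouring runs with the same font and only slightly different size/spacing."""
    merged = []
    for run in runs:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.font == run.font
            and abs(prev.size_pt - run.size_pt) <= size_tol
            and abs(prev.spacing_pt - run.spacing_pt) <= spacing_tol
        ):
            # Keep the formatting of the earlier run and append the text.
            merged[-1] = Run(prev.text + run.text, prev.font, prev.size_pt, prev.spacing_pt)
        else:
            merged.append(run)
    return merged


# One visually uniform sentence, split by tiny measured differences.
runs = [
    Run("The pat", "Arial", 11.0, 0.0),
    Run("ient was ", "Arial", 11.0, 0.1),
    Run("dis", "Arial", 11.5, 0.0),
    Run("charged.", "Arial", 11.0, -0.1),
]
print(count_tags(runs))
print(count_tags(merge_similar(runs)))
```

Plain-text output is the extreme case of this merging: all formatting differences are thrown away, so no tags remain, and the formatting has to be put back by hand.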

Is it just a false assumption? Is the additional OCR of a scanned file the main concern? And what about possible problems arising from 'normal' PDFs with text?

Even editable PDFs can lead to problems. If you use a PDF converter, or simply open the PDF in a newer Word version, often the Table of Contents isn't preserved, bullet points are not automatically inserted, and there are issues with headers and footers and document sections.

It'll be interesting to hear what results you get testing different OCR programs, Mike. My guess is that they'll be pretty similar.
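
One way to put numbers on such a comparison is a character error rate: the edit distance between each tool's output and a hand-checked reference text, divided by the length of the reference. The Python sketch below assumes the texts have already been exported from the OCR programs; the tool names and sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance relative to the length of the hand-checked reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical sample: the reference is typed by hand, the outputs come from the tools.
reference = "Invoice No. 1024"
outputs = {
    "tool_a": "Invoice No. 1024",
    "tool_b": "lnvoice N0. 1O24",
}
for name, text in outputs.items():
    print(name, round(character_error_rate(reference, text), 3))
```

A measure like this only covers the recognized characters. It says nothing about layout or tag count, which are the other things that matter once the file goes into a CAT tool.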


 
Madeleine Chevassus
Madeleine Chevassus  Identity Verified
France
Local time: 04:41
Member (2010)
English to French
SITE LOCALIZER
as a simple translator, I don't like PDFs. Dec 14, 2014

Hi,

I have had several problems with PDFs.

When I quote, I always ask to receive the files before confirming that I'll take the job.

1) Is the PDF imported correctly by Studio 2014 or memoQ? If yes -> GO.

2) If the PDF is not imported correctly by Studio 2014 or memoQ, the result is an empty file!
In that case it is not possible to use my CAT tools!

3) In that case it is possible to use an Adobe product which turns the PDF into a Word document (paid, monthly fee), but there is a confidentiality issue because you must send the file, so I don't do that.

You can also try ABBYY FineReader or analogous tools. Personally, I didn't get good results with that kind of tool (I followed a webinar and worked with a German agency on a specific case).

An agency even suggested that I translate a 12,000-word bunch of documents in English (a terrible scanned PDF) from scratch and for a low rate. I refused. I don't want to type that much and recreate the layout, all without the help of a TM.

IMO, an agency should always provide you with an exact word count and ready files.

Have a nice weekend

Madeleine


 
finnword1
finnword1
United States
Local time: 22:41
English to Finnish
OCR Dec 14, 2014

Just now working on a 44-page PDF (Pretty Darn Frustrating) document of so-so quality. OmniPage did a good job for me.

 
Miguel Carmona
Miguel Carmona  Identity Verified
United States
Local time: 19:41
English to Spanish
@MikeTrans Dec 14, 2014

Now I understand.

You are talking about perfectly scanned or created PDF files, perfectly square, not slanted, not dirty (no spurious marks or spots here and there), perfectly readable, no blurry areas, no "eroded" text, etc. In short, an ideal situation.

You basically want perfectly clean PDF files but with different degrees of layout complexity with headings, subheadings, header, footer, columns, tables, figures with caption and callouts, etc.

I hope I understood you well.

[Edited at 2014-12-15 16:04 GMT]


 

