Simple markups for bold and italics in TMX?
Thread poster: Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Sep 3, 2022

I was quite impressed when I saw how complicated the markups for bold and italics in TMX look:

[Screenshot: Screen Shot 2022-09-04 at 06.59.33]

Question: Can this be simplified? E.g. is the markup numbering really necessary?


[Edited at 2022-09-04 05:00 GMT]


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Some info Sep 3, 2022

Found this info:

http://xml.coverpages.org/TMX-SpecV13.html


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
It's actually very simple Sep 3, 2022

Hans Lenting wrote:
I was quite impressed when I saw how complicated the markups for bold and italics in TMX look...

It looks complicated because you're reading it as a human. For a computer, it's reasonably simple. It allows the TMX file to specify which bits of text are to be considered "tags", without requiring that the tags be converted to some universal TMX format.

Broadly speaking, TMX recognizes two types of tags, namely standalone tags and paired tags. Paired tags are tags that come as a begin-tag and an end-tag.

Since your goal is to make it simpler for humans to read (and write?), you should start by not using escaped text in the tag. So, instead of using <b> as the bold tag (which in the TMX will be displayed as &lt;b&gt; ), you can use [b] as the bold tag. Heck, you can even use "B" and "b" as the opening and closing bold tags (see below).
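
A minimal Python sketch of the difference, for illustration only. The helper name make_seg is hypothetical, and the "i" attribute is left out for brevity even though real TMX requires it. It serializes the same sentence once with <b> and once with [b] as the tag text:

```python
import xml.etree.ElementTree as ET

def make_seg(parts):
    """Build a <seg> from (kind, text) pairs; kind is 'text', 'bpt' or 'ept'.

    Hypothetical helper for illustration; the required "i" attribute is omitted.
    """
    seg = ET.Element("seg")
    last = None  # most recently added <bpt>/<ept> element
    for kind, text in parts:
        if kind == "text":
            if last is None:
                seg.text = (seg.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(seg, kind)
            last.text = text
    return ET.tostring(seg, encoding="unicode")

# Same sentence, two choices for the text of the bold tag.
html_style = make_seg([("text", "The "), ("bpt", "<b>"), ("text", "cat"),
                       ("ept", "</b>"), ("text", " sat.")])
bracket_style = make_seg([("text", "The "), ("bpt", "[b]"), ("text", "cat"),
                          ("ept", "[/b]"), ("text", " sat.")])

print(html_style)     # angle brackets inside the tag text get escaped
print(bracket_style)  # square brackets are written as-is
```

The XML serializer has to escape the angle brackets in the first case, which is what makes that variant harder to read in a text editor.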

If you use e.g. just "B" as the bold tag, then it becomes difficult to read when you view it in a program that doesn't show or recognize the tags. In other words, if you use just "B" as the starting bold tag and "b" as the closing bold tag, then you may find a sentence that looks like "The Bcatb sat on the Bmatb." when viewed in a program that doesn't colour the tags (even though a CAT tool will correctly recognize that the "B" and "b" are tags and not part of the text).

Using "B" and "b" as the tags, the TMX would look like this:
The <bpt>B</bpt>cat<ept>b</ept> sat on the <bpt>B</bpt>mat<ept>b</ept>.

I'm not even sure it's necessary that tags have unique text. So it may be that you can actually just have this:
The <bpt>#</bpt>cat<ept>#</ept> sat on the <bpt>#</bpt>mat<ept>#</ept>.
The unparsed clear text would then be "The #cat# sat on the #mat#.", but the TMX file (and the CAT tool that parses it) will know that the first # is an opening bold tag and the second # is a closing bold tag.

So, just to be clear, if your TMX file contains this:
The <bpt>B</bpt>cat<ept>b</ept> sat on the <bpt>B</bpt>mat<ept>b</ept>.

...and you are translating e.g. an HTML file with the text "The <b>cat</b> sat on the <b>mat</b>." and your CAT tool correctly identifies the <b> and </b> as tags in the HTML file, then it will be a 100% match for the TU shown above, despite the fact that the TU uses "B" and "b" for the tags and not <b> and </b>.
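
A rough Python sketch of why that can work. It assumes that a tool compares segments with each tag reduced to a generic placeholder; how a real CAT tool computes its matches is tool-specific and not described in this thread. The function names skeleton and plain_text are hypothetical:

```python
import xml.etree.ElementTree as ET

def skeleton(seg_xml):
    """Return the segment text with every <bpt>/<ept> replaced by a placeholder."""
    seg = ET.fromstring(seg_xml)
    out = [seg.text or ""]
    for child in seg:
        out.append("{" + child.tag + "}")  # ignore what the tag actually contains
        out.append(child.tail or "")
    return "".join(out)

def plain_text(seg_xml):
    """Return what a tool sees if it ignores the markup and keeps all text."""
    return "".join(ET.fromstring(seg_xml).itertext())

# TU as stored in the TMX, with "B" and "b" as the tag text.
tm_seg = ("<seg>The <bpt>B</bpt>cat<ept>b</ept> sat on the "
          "<bpt>B</bpt>mat<ept>b</ept>.</seg>")
# The same sentence as it might come from an HTML file, with <b> and </b>.
html_seg = ("<seg>The <bpt>&lt;b&gt;</bpt>cat<ept>&lt;/b&gt;</ept> sat on the "
            "<bpt>&lt;b&gt;</bpt>mat<ept>&lt;/b&gt;</ept>.</seg>")

print(skeleton(tm_seg))
print(skeleton(tm_seg) == skeleton(html_seg))
print(plain_text(tm_seg))
```

Under that assumption both segments reduce to the same skeleton, while the unparsed text of the first one is the hard-to-read "The Bcatb sat on the Bmatb.".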

The problem for you, as you may have realized, is that you often do not have any control over the plaintext version of the tags that get encoded in the TMX file. For example, it may be that you have no choice but to use <b> and </b> because that is what the tags look like in the plaintext version of the text that you're adding to the TMX file. It may be that your particular CAT tool chooses (without allowing you to have any say in it) to write <b> into the TMX file whenever there is a bold tag (that's probably what's happening).

Added: if you want to keep the TMX file nice to read for humans, you can tell your CAT tool to write the TUs without formatting. In other words, if the text in your file is "The <b>cat</b> sat on the <b>mat</b>.", then it should write just "The <b>cat</b> sat on the <b>mat</b>." to the TMX file, as plain text with no <bpt> and <ept> elements around the tags. But then if you translate another file with that same text in it and the CAT tool recognizes e.g. <b> as a bold tag, the TU in the TMX will not be a 100% match for it (it probably won't even be a 50% match for it).

[Edited at 2022-09-03 12:30 GMT]


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
The eye Sep 3, 2022

Hans Lenting wrote:
Can this be simplified? E.g. is the markup numbering really necessary?

The "i" is mandatory. Numbering it allows the TMX file to identify which end-tag belongs to which start tag.

So, if you have this:
The <b>cat <b>sat on the mat</b></b>.
the numbers would help the CAT tool know whether the first </b> tag is a closing tag for the first <b> tag or for the second <b> tag.

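In my illustrative examples in my first post I omitted the "i" attribute, to make it simpler to type here in the forum, but in reality the "i" attribute is mandatory. That said, if you always make it "1", it probably won't break anything in a lenient tool, although the TMX specification says that each <bpt> within a segment must have its own unique "i" value.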
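
A small Python sketch of that pairing, using the nested bold example from this post. This is an illustration under assumptions: the function name pair_tags is hypothetical, and the thread does not say how any particular CAT tool pairs its tags:

```python
import xml.etree.ElementTree as ET

def pair_tags(seg_xml):
    """Pair each <ept> with the <bpt> that has the same "i" value.

    Returns (i, position of bpt, position of ept) tuples, where a position
    is the index of the tag among the child elements of <seg>.
    Raises KeyError if an <ept> has no matching <bpt>.
    """
    seg = ET.fromstring(seg_xml)
    opened = {}  # "i" value -> position of the still-open <bpt>
    pairs = []
    for pos, el in enumerate(seg):
        if el.tag == "bpt":
            opened[el.get("i")] = pos
        elif el.tag == "ept":
            i = el.get("i")
            pairs.append((i, opened.pop(i), pos))
    return pairs

# "The <b>cat <b>sat on the mat</b></b>." with numbered tags.
nested = ('<seg>The <bpt i="1">&lt;b&gt;</bpt>cat <bpt i="2">&lt;b&gt;</bpt>'
          'sat on the mat<ept i="2">&lt;/b&gt;</ept><ept i="1">&lt;/b&gt;</ept>.</seg>')

pairs = pair_tags(nested)
print(pairs)
```

With the numbers in place, the first </b> is unambiguously the partner of the second <b>. Without them, a tool would have to guess from the order alone.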


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Thank you! Sep 4, 2022

Thank you for all your info and the thorough explanation, Samuel.

I'll study it.

As you might have guessed, my question is related to this posting:

https://www.proz.com/forum/cat_tools_technical_help/358669-from_ms_word_table_to_tmx_file-page2.html#2967977

(I'm planning to add marking of bold and italics in the created TMX file.)

BTW: I posted this message in the CAT forum. A moderator moved it to the Transit forum.

Because it is a generic question about TMX, I have asked for this thread to be moved back to the CAT forum.


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
@Hans Sep 5, 2022

Hans Lenting wrote:
As you might have guessed, my question is related to this posting:
https://www.proz.com/forum/cat_tools_technical_help/358669-from_ms_word_table_to_tmx_file-page2.html#2967977
(I'm planning to add marking of bold and italics in the created TMX file.)

In that case, I suggest you use a "tag" that most people will interpret as a tag if they view the TM in a very simple tool that doesn't show the BPT etc. tags. In other words, either [b] or {b} or <b>. And since using angled brackets causes the TMX file to be even less human readable, I would be inclined to go for [b] or {b}.

If I were to do this for use by translators in my own language, I would even be tempted to use « and » because these quotes are NEVER used in either of my languages, although it might make the macro less useful for users of languages that use these symbols as ordinary quotes.


 
Yaotl Altan
Mexico
Member (2006)
English to Spanish
Colors Sep 8, 2022

Samuel's reply is unsurpassable.

Furthermore, the colors help a lot to identify tags. When they are monochrome, it is really hard work to progress at a good pace.


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Good suggestion! Sep 10, 2022

Samuel Murray wrote:

So, just to be clear, if your TMX file contains this:
The «bpt»B«/bpt»cat«ept»b«/ept» sat on the «bpt»B«/bpt»mat«ept»b«/ept».

...and you are translating e.g. an HTML file with the text "The «b»cat«/b» sat on the «b»mat«/b»." and your CAT tool correctly identifies the «b» and «/b» as tags in the HTML file, then it will be a 100% match for the TU shown above, despite the fact that the TU uses "B" and "b" for the tags and not «b» and «/b».


I came up with this solution, for bold, italics, underlined, superscript and subscript:

[Screenshot: Screen Shot 2022-09-10 at 12.32.32]

Different CAT tools may have different requirements for the properties in the header, but I had to add «prop type="x-processing_tags"»true«/prop» in order to make full matches possible.

In VBA the marking up would look like:


Sub ReplaceCharacterFormattingWithMarkup()
    'Replace character formatting with markup in TMX style.
    'Each block below finds one kind of formatting, wraps the found text
    '(^&) in «bpt»...«/bpt» and «ept»...«/ept», and removes the formatting.

    'Bold: B ... b
    Selection.Find.ClearFormatting
    Selection.Find.Font.Bold = True
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find.Replacement.Font
        .Bold = False
        .Italic = False
    End With
    With Selection.Find
        .Text = ""
        .Replacement.Text = "«bpt»B«/bpt»^&«ept»b«/ept»"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    'Italic: I ... i
    Selection.Find.ClearFormatting
    Selection.Find.Font.Italic = True
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find.Replacement.Font
        .Bold = False
        .Italic = False
    End With
    With Selection.Find
        .Text = ""
        .Replacement.Text = "«bpt»I«/bpt»^&«ept»i«/ept»"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    'Superscript: P ... p
    Selection.Find.ClearFormatting
    With Selection.Find.Font
        .Bold = False
        .Italic = False
        .Superscript = True
        .Subscript = False
    End With
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find.Replacement.Font
        .Bold = False
        .Italic = False
        .Superscript = False
        .Subscript = False
    End With
    With Selection.Find
        .Text = ""
        .Replacement.Text = "«bpt»P«/bpt»^&«ept»p«/ept»"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    'Subscript: S ... s
    Selection.Find.ClearFormatting
    With Selection.Find.Font
        .Superscript = False
        .Subscript = True
    End With
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find.Replacement.Font
        .Bold = False
        .Italic = False
        .Superscript = False
        .Subscript = False
    End With
    With Selection.Find
        .Text = ""
        .Replacement.Text = "«bpt»S«/bpt»^&«ept»s«/ept»"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    'Underline (single): U ... u
    Selection.Find.ClearFormatting
    Selection.Find.Font.Underline = wdUnderlineSingle
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find.Replacement.Font
        .Bold = False
        .Italic = False
        .Underline = False
        .Superscript = False
        .Subscript = False
    End With
    With Selection.Find
        .Text = ""
        .Replacement.Text = "«bpt»U«/bpt»^&«ept»u«/ept»"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub


[Edited at 2022-09-10 10:59 GMT]

 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Application Sep 11, 2022

See here for an application of this.

 

