A Proposal for Bidi Isolates in Unicode
neutral character. This is in contrast to the existing embedding formatting characters (LRE, RLE,
PDF), which have the effect of a strong character on their surroundings. Otherwise, isolates are
quite similar to embeddings: they declare a direction for the text inside them, and can be nested
inside another isolate or embedding (and vice versa). Isolates are just a kinder, gentler form of
embedding that prevents opposite-direction phrases from scrambling their surroundings.
To see the need for isolates, take the case of a Hebrew article title, “מאמר מעניין”, followed in our
English document by a hyphen and the article’s date of publication, let’s say “14 July 2012”.
When no formatting characters are used, the result is a mess:
July 2012 מאמר מעניין - 14<br />
That’s because the UBA has no way of knowing whether the “14” is a part of the RTL phrase or
of the LTR surroundings, and happens to guess incorrectly that it is a part of the RTL phrase.
Unfortunately, and here we get to the real point, surrounding the Hebrew title with explicit
embedding formatting characters (RLE and PDF) does not change this display at all, since the
explicit embedding still affects its surroundings in the same way as a strong RTL character.
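To make the ambiguity concrete, the following minimal Python sketch lists the bidi classes that the standard library's `unicodedata` module reports for the example text in its logical (stored) order. The helper name `bidi_class_runs` and the variable `logical` are illustrative only and do not come from the proposal.

```python
import unicodedata
from itertools import groupby

# Logical (stored) order of the example: Hebrew title, hyphen, date.
# The Hebrew title is written with escapes so the source stays readable.
logical = "\u05DE\u05D0\u05DE\u05E8 \u05DE\u05E2\u05E0\u05D9\u05D9\u05DF - 14 July 2012"

def bidi_class_runs(text):
    """Group consecutive characters that share a Unicode bidi class."""
    classified = ((unicodedata.bidirectional(ch), ch) for ch in text)
    return [
        (cls, "".join(ch for _, ch in group))
        for cls, group in groupby(classified, key=lambda pair: pair[0])
    ]

for cls, run in bidi_class_runs(logical):
    print(f"{cls:>2}  {run!r}")
```

Only the Hebrew words (class R) and "July" (class L) are strong. The "14" is a European number (EN), the hyphen is a European separator (ES), and the spaces are whitespace (WS), so none of them carries a direction of its own and each must be resolved from its neighbours.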
In contrast, surrounding the Hebrew title with the new isolate formatting characters will result in
the intended ordering:
14 July 2012 -מאמר מעניין<br />
(Since isolates are obviously not available yet, we achieved it here via traditional means
discussed in the “Isolates vs Embeddings and Marks” section below.)
Direction estimation for a nested phrase means that a part of a paragraph can be explicitly
marked to be of an unknown direction, which is to be determined from its content using the
same first-strong algorithm already specified by the UBA for whole paragraphs. It makes perfect
sense to limit this capability to a kind of isolate. Isolates that specify an explicit direction would
still be used where the direction of their content is known (and easily accessible) to the
document author, but if that direction is not readily known, it can be specified to be guessed
automatically. This is a very common scenario when generating documents on the basis of
templates and data.
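The first-strong estimation described above can be sketched in a few lines of Python. This is a simplified illustration rather than a conforming UBA implementation: it scans for the first character whose bidi class is L, R, or AL and does not skip over nested embeddings or isolates. The function name `first_strong_direction` and the `default` parameter are assumptions made for this example.

```python
import unicodedata

def first_strong_direction(text, default="ltr"):
    """Estimate direction from the first strong character in the text.

    Simplified sketch: returns "ltr" for class L, "rtl" for class R or AL,
    and `default` when the text contains no strong character at all.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default

# Data of unknown direction inserted into a template.
print(first_strong_direction("\u05DE\u05D0\u05DE\u05E8"))  # Hebrew word
print(first_strong_direction("14 July 2012"))              # digits are weak; "J" decides
print(first_strong_direction("2012"))                      # no strong character
```

Digits and punctuation are skipped because they are not strong, so a value such as "14 July 2012" is classified by its first letter rather than by its leading number.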
Specifically, we propose four new Unicode formatting code points:
● LRI (LEFT-TO-RIGHT ISOLATE): marks the beginning of a left-to-right isolate.
● RLI (RIGHT-TO-LEFT ISOLATE): marks the beginning of a right-to-left isolate.
● FSI (FIRST-STRONG ISOLATE): marks the beginning of a first-strong isolate, i.e. one
whose direction is determined by applying rules P2 and P3 to the isolate’s content as if it
were a separate paragraph.
● PDI (POP DIRECTIONAL ISOLATE): marks the end of an isolate.
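As a rough illustration of how a template system might use the four characters listed above, the sketch below wraps a piece of data in an isolate. It assumes the code points U+2066 to U+2069, which are the values these characters received when they were later encoded in the Unicode Standard; the proposal text itself does not fix them. The helper name `isolate` and its `direction` parameter are hypothetical.

```python
# Assumed code points (as later assigned in the Unicode Standard).
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST-STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text, direction=None):
    """Wrap text in an isolate.

    direction: "ltr", "rtl", or None to let the renderer apply
    first-strong estimation (FSI) to the wrapped content.
    """
    if direction == "ltr":
        opener = LRI
    elif direction == "rtl":
        opener = RLI
    elif direction is None:
        opener = FSI
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return opener + text + PDI

# The article title from the example, followed by the hyphen and the date.
title = "\u05DE\u05D0\u05DE\u05E8 \u05DE\u05E2\u05E0\u05D9\u05D9\u05DF"
line = isolate(title) + " - 14 July 2012"
```

Leaving `direction` unset matches the template scenario, where the author of the template does not know in advance whether the inserted title is LTR or RTL.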