01.11.2014 Views

A Proposal for Bidi Isolates in Unicode


neutral character. This is in contrast to the existing embedding formatting characters (LRE, RLE, PDF), which have the effect of a strong character on their surroundings. Otherwise, isolates are quite similar to embeddings: they declare a direction for the text inside them, and can be nested inside another isolate or embedding (and vice versa). Isolates are just a kinder, gentler form of embedding that prevents opposite-direction phrases from scrambling their surroundings.

To see the need for isolates, take the case of a Hebrew article title, “מאמר מעניין”, followed in our English document by a hyphen and the article’s date of publication, let’s say “14 July 2012”. When no formatting characters are used, the result is a mess:

July 2012 מאמר מעניין - 14

That’s because the UBA has no way of knowing whether the “14” is a part of the RTL phrase or of the LTR surroundings, and happens to guess incorrectly that it is a part of the RTL phrase. Unfortunately, and here we get to the real point, surrounding the Hebrew title with explicit embedding formatting characters (RLE and PDF) does not change this display at all, since the explicit embedding still affects its surroundings in the same way as a strong RTL character.

In contrast, surrounding the Hebrew title with the new isolate formatting characters will result in the intended ordering:

14 July 2012 ‏-​מאמר מעניין

(Since isolates are obviously not available yet, we achieved it here via traditional means discussed in the “Isolates vs Embeddings and Marks” section below.)
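
The three variants of the title example can be written out in logical (memory) order as a minimal Python sketch. The embedding controls RLE (U+202B) and PDF (U+202C) already exist in Unicode. The isolate values U+2067 (RLI) and U+2069 (PDI) are an assumption relative to this proposal: they are the code points later assigned in Unicode 6.3, and the text above does not specify them. The helper `strip_controls` is hypothetical and only serves to show that the three strings differ in nothing but invisible formatting characters.

```python
import unicodedata

# Existing explicit embedding controls.
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Assumption: isolate code points as later assigned in Unicode 6.3;
# the proposal text itself does not give numeric values.
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

title = "מאמר מעניין"
date = "14 July 2012"

# Logical order is identical in all three cases: title, hyphen, date.
bare = f"{title} - {date}"                  # no formatting characters
embedded = f"{RLE}{title}{PDF} - {date}"    # per the text, displays like `bare`
isolated = f"{RLI}{title}{PDI} - {date}"    # the proposed fix


def strip_controls(s):
    """Drop format characters (general category Cf), keeping visible text."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    for label, s in (("bare", bare), ("embedded", embedded), ("isolated", isolated)):
        controls = [unicodedata.name(ch) for ch in s if unicodedata.category(ch) == "Cf"]
        print(label, len(s), controls)
```

How each string is actually displayed depends on the renderer's bidi implementation; the sketch only constructs the input text.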

Direction estimation for a nested phrase means that a part of a paragraph can be explicitly marked to be of an unknown direction, which is to be determined from its content using the same first-strong algorithm already specified by the UBA for whole paragraphs. It makes perfect sense to limit this capability to a kind of isolate. Isolates that specify an explicit direction would still be used where the direction of their content is known (and easily accessible) to the document author, but if that direction is not readily known, it can be specified to be guessed automatically. This is a very common scenario when generating documents on the basis of templates and data.
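
The first-strong estimation described above can be illustrated with a short Python sketch. It scans for the first character whose bidi class is L, R, or AL and falls back to left-to-right when none is found. The function name `first_strong_direction` and its `default` parameter are hypothetical, and the sketch deliberately ignores any handling of nested embeddings or isolates inside the scanned text.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess a direction from the first strong character in `text`.

    Strong classes are L (left-to-right) and R / AL (right-to-left).
    Digits, spaces and punctuation are skipped because they are not strong.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found: fall back to the default direction.
    return default


if __name__ == "__main__":
    for sample in ("מאמר מעניין", "14 July 2012", "14 מאמר", "123 - 456"):
        print(repr(sample), first_strong_direction(sample))
```

Note that a leading number does not decide the direction: in "14 July 2012" the digits are skipped and the "J" settles it.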

Specifically, we propose four new Unicode formatting code points:

● LRI (LEFT-TO-RIGHT ISOLATE): marks the beginning of a left-to-right isolate.
● RLI (RIGHT-TO-LEFT ISOLATE): marks the beginning of a right-to-left isolate.
● FSI (FIRST-STRONG ISOLATE): marks the beginning of a first-strong isolate, i.e. one whose direction is determined by applying rules P2 and P3 to the isolate’s content as if it were a separate paragraph.
● PDI (POP DIRECTIONAL ISOLATE): marks the end of an isolate.
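
For the template-and-data scenario, the four characters listed above might be used as in the following Python sketch. The numeric values U+2066 to U+2069 are an assumption relative to this proposal (they are the code points later assigned in Unicode 6.3), and the helpers `isolate` and `render` are hypothetical names introduced only for illustration.

```python
# Assumption: code points as later assigned in Unicode 6.3.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST-STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap `text` in a directional isolate.

    direction="ltr" or "rtl" uses an explicit isolate; direction=None
    uses FSI so the renderer estimates the direction from the content.
    """
    if direction == "ltr":
        opener = LRI
    elif direction == "rtl":
        opener = RLI
    elif direction is None:
        opener = FSI
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return opener + text + PDI


def render(template, **fields):
    """Fill a str.format template, isolating every inserted value with FSI."""
    return template.format(**{name: isolate(value) for name, value in fields.items()})


if __name__ == "__main__":
    line = render("{title} - {date}", title="מאמר מעניין", date="14 July 2012")
    print([f"U+{ord(ch):04X}" for ch in line if ch in (LRI, RLI, FSI, PDI)])
```

Using FSI for every inserted field reflects the case where the template author does not know the direction of the data in advance; where the direction is known, passing it explicitly selects LRI or RLI instead.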
