Unicode Technical Note #???

Representing Hebrew in Unicode

Version	1 (first draft)
Authors	Peter Kirk
Date	2004-07-26
This Version	http://www.unicode.org/notes/tn99/tn99-1.html (Draft: http://qaya.org/academic/hebrew/tnHebrew-1-d1.html)
Previous Version	http://www.unicode.org/notes/tn99/tn99-87.html
Latest Version	http://www.unicode.org/notes/tn99/

Summary

This document provides an overview of how to represent the Hebrew language in Unicode, using Hebrew script. It also covers representation of Aramaic in biblical and other ancient and mediaeval texts when Hebrew script is used.

Status

This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes/.

1 Introduction

This document provides an overview of how to represent the Hebrew language in Unicode, using Hebrew script, whether unpointed or pointed according to the Tiberian pointing system. It also covers representation of Aramaic in biblical and other ancient and mediaeval texts when Hebrew script is used; the orthographic rules for such Aramaic are identical to those for Hebrew. Some of the principles given here also apply to other languages written in Hebrew script. However, this document explicitly does not cover the specific requirements of the Yiddish language. It also does not cover pointing systems other than the Tiberian system.

This document should be read in conjunction with the current version of The Unicode Standard [TUS], and especially with Section 8.1 of Version 4.0.0 of the Standard. This document is intended as further clarification of the Standard, and certainly not as superseding or contradicting it.

The document is divided into three main sections corresponding to the three main forms of Hebrew as currently written: unpointed text, which is the commonest form in the modern state of Israel; pointed text with Tiberian pointing, which is widely used for educational, poetic and liturgical material; and pointed and accented text, which is used almost exclusively for the Hebrew Bible. These three main forms are not entirely distinct: for example, a few points are sometimes added to otherwise unpointed texts to resolve ambiguous cases, and a few accents, especially METEG, may be added to otherwise unaccented texts, e.g. to indicate stress. Nevertheless it is convenient to consider the three forms separately here.

2 Unpointed Hebrew Text

Representation in Unicode of unpointed Hebrew text causes few difficulties. Text is written from right to left, using the Hebrew letters in the range U+05D0 to U+05EA. The final forms of letters are encoded as distinct characters, rather than as contextual variants as in Arabic, and so there is no difficulty in representing isolated letter forms, or words of foreign origin in which the normal final form rules are not observed. The punctuation marks U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM should be used to indicate abbreviations and foreign sounds in modern Hebrew; these two are spacing characters, and must not be confused with U+059C HEBREW ACCENT GERESH and U+059E HEBREW ACCENT GERSHAYIM, which are combining marks and part of the Hebrew accent system. U+05BE HEBREW PUNCTUATION MAQAF and U+05C3 HEBREW PUNCTUATION SOF PASUQ may also be used, along with punctuation from the General Punctuation block. U+20AA NEW SHEQEL SIGN is also available for use where needed. The Yiddish digraphs U+05F0 to U+05F2 should not be used in Hebrew or Aramaic text.
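
The distinction between the spacing punctuation marks and the combining accents of the same names can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard unicodedata module; the helper names describe and is_combining are illustrative only and are not taken from any standard.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, combining class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

def is_combining(ch):
    """True for nonspacing marks, such as the Hebrew accents."""
    return unicodedata.category(ch) == "Mn"

# The two punctuation marks and the two accents that share their names.
for cp in (0x05F3, 0x05F4, 0x059C, 0x059E):
    name, category, ccc = describe(chr(cp))
    print(f"U+{cp:04X} {name}: category={category}, combining class={ccc}")
```

A check of this kind may help to detect text in which an accent character has been typed where the punctuation mark was intended, or vice versa.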

The alphabetic presentation forms U+FB20 to U+FB29, which are variant forms of Hebrew letters, and the ligature U+FB4F should not be used except for special purposes. Substitution of variant forms and of ligatures should be left to the rendering engine.

The Hebrew Bible is sometimes written without points or accents. This is a requirement for synagogue scrolls, in which the only characters permitted are the basic Hebrew letters and the upper and lower dots or puncta extraordinaria. These dots should be represented in Unicode by U+05C4 HEBREW MARK UPPER DOT and (proposed and accepted) U+05C5 HEBREW MARK LOWER DOT, combining marks which should follow their base characters.

2.1 Numerals in Hebrew Text

In modern Hebrew text numerals are written with “Arabic” digits, i.e. U+0030 to U+0039 as in Latin script text, and they are written as in Latin script text with the most significant digit at the left. These should be represented in Unicode with the most significant digit first. The Unicode Bidirectional Algorithm [BiDi] will normally handle the alternation between right-to-left and left-to-right rendering. This Algorithm can sometimes give unexpected results, e.g. when processing a list of numbers or left-to-right words separated by punctuation embedded in Hebrew text, or a list of Hebrew words separated by punctuation embedded in left-to-right text; in such cases the override mechanism defined in the Algorithm may need to be used.
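
As an illustration of the override mechanism mentioned above, the following sketch inspects the bidirectional classes that the Algorithm works from and wraps a run of text in U+202D LEFT-TO-RIGHT OVERRIDE and U+202C POP DIRECTIONAL FORMATTING. The function names bidi_classes and force_ltr are hypothetical, and whether an override is the appropriate control in a given case has to be judged from the actual rendering.

```python
import unicodedata

LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def bidi_classes(text):
    """List the bidirectional class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]

def force_ltr(run):
    """Wrap a run so that its characters are all displayed left to right."""
    return LRO + run + PDF

# Alef, space, then the number 12 stored most significant digit first.
print(bidi_classes("\u05D0 12"))

# A list of numbers embedded in Hebrew text, protected by an override.
sentence = "\u05D0\u05D1 " + force_ltr("1, 2, 3") + " \u05D2\u05D3"
print([f"U+{ord(ch):04X}" for ch in sentence])
```

The digits remain in logical order in the character string in every case; the override affects only how the run is laid out for display.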

Numerals in older Hebrew text were written with Hebrew letters, and sometimes with additional marks to indicate that the letters were to be interpreted as numerals. The commonest additional marks are U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM. In some texts, including the marginal notes of some Hebrew Bible editions, numerals are indicated by either one or two dots above the letter. These dots, which are distinct from other combining marks used in Hebrew, should be represented in Unicode by U+0307 COMBINING DOT ABOVE and U+0308 COMBINING DIAERESIS.

3 Pointed Hebrew Text

This section covers only the Tiberian pointing system.

Pointed Hebrew text consists of unpointed text with the addition of the Hebrew vowel points U+05B0 to U+05BB, also U+05BC HEBREW POINT DAGESH OR MAPIQ, U+05BF HEBREW POINT RAFE, U+05C1 HEBREW POINT SHIN DOT and U+05C2 HEBREW POINT SIN DOT. U+05BD HEBREW POINT METEG is sometimes written in pointed text but is actually better considered part of the Hebrew accent system. These vowel points should always come after their associated base characters in the Unicode character string.

The alphabetic presentation forms U+FB1D and U+FB2A to U+FB4E, precomposed combinations of Hebrew letters and points, should not be used except for special purposes. They all have canonical decompositions to separate letters and points, and the separate letter and point characters should always be used instead.
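
The replacement of these presentation forms by separate letters and points can be sketched as follows, assuming Python's standard unicodedata module. The function names are illustrative. The sketch expands canonical decompositions only, so the variant letter forms U+FB20 to U+FB29, which have compatibility decompositions, are deliberately left unchanged, and the order of any other points in the text is not altered.

```python
import unicodedata

def expand_canonical(ch):
    """Recursively expand a character's canonical decomposition, if it has one."""
    decomposition = unicodedata.decomposition(ch)
    # An empty string means no decomposition; a leading "<" marks a
    # compatibility decomposition, which is not expanded here.
    if not decomposition or decomposition.startswith("<"):
        return ch
    return "".join(expand_canonical(chr(int(cp, 16))) for cp in decomposition.split())

def decompose_hebrew_presentation_forms(text):
    """Replace precomposed Hebrew letter-plus-point forms by separate characters."""
    return "".join(
        expand_canonical(ch) if "\uFB1D" <= ch <= "\uFB4E" else ch
        for ch in text
    )

# U+FB2A HEBREW LETTER SHIN WITH SHIN DOT becomes Shin followed by SHIN DOT.
result = decompose_hebrew_presentation_forms("\uFB2A")
print([f"U+{ord(ch):04X}" for ch in result])
```

Applying any Unicode normalization form has the same effect on these characters, but also reorders the points as described in Section 3.3.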

3.1 Issues with Holam

The Hebrew vowel Holam may be written either just as a dot, which is Holam Haser (“defective” Holam), or as a dot above a Vav glyph, which is Holam Male (full Holam). Holam Haser is normally written above the top left of the base character it is associated with. It is sometimes shifted to above the top right of a following Alef, although it is not logically associated with the Alef. In every case Holam Haser should be represented as U+05B9 HEBREW POINT HOLAM, following its logically associated base character. Also, Holam Haser is sometimes merged with a Shin or Sin dot, i.e. it is not written as a separate dot when its base character is Shin with Sin dot or when the following character is Shin with Shin dot. Nevertheless, it is preferable to include U+05B9 HEBREW POINT HOLAM in the Unicode representation, because it is logically present and because the decision to render it separately or not is a typographical one.
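
For example, in the name Moshe the Holam Haser on the Mem is commonly merged typographically with the Shin dot of the following Shin, but U+05B9 is still present in the character string. The sketch below, which assumes Python's standard unicodedata module, lists the characters of this word and locates grapheme clusters whose Holam a renderer might merge with a Shin or Sin dot; the function names clusters and mergeable_holams are hypothetical.

```python
import unicodedata

SHIN, SHIN_DOT, SIN_DOT, HOLAM = "\u05E9", "\u05C1", "\u05C2", "\u05B9"

# Mem + HOLAM, Shin + SHIN DOT + SEGOL, He: the name "Moshe".
MOSHE = "\u05DE\u05B9\u05E9\u05C1\u05B6\u05D4"

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not out:
            out.append(ch)
        else:
            out[-1] += ch
    return out

def mergeable_holams(text):
    """Indexes of clusters whose Holam may be merged with a Shin or Sin dot."""
    groups = clusters(text)
    hits = []
    for i, group in enumerate(groups):
        if HOLAM not in group:
            continue
        following = groups[i + 1] if i + 1 < len(groups) else ""
        on_sin = group[0] == SHIN and SIN_DOT in group
        before_shin = following[:1] == SHIN and SHIN_DOT in following
        if on_sin or before_shin:
            hits.append(i)
    return hits

print([unicodedata.name(ch) for ch in MOSHE])
print(mergeable_holams(MOSHE))
```

Because the test uses membership rather than position within the cluster, it gives the same answer whichever canonically equivalent order the points are stored in.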

Holam Male (pronounced O) and the sequence Vav Haluma (consonantal Vav followed by Holam Haser, pronounced VO) are often rendered identically. If this is the intention, both should be represented in Unicode by the sequence <05D5, 05B9>, i.e. <VAV, HOLAM>. However, in more exact typography, especially for biblical, liturgical, poetic and educational texts, Holam Male and Vav Haluma are distinguished by different positionings of the Holam dot. When it is intended to represent this distinction in Unicode, Holam Male should be represented as (***TBD***), and Vav Haluma as (***TBD***).

3.2 Issues with Sheva, Dagesh and Qamats

U+05B0 HEBREW POINT SHEVA should be used for both Sheva Nah (silent Sheva) and Sheva Na (mobile Sheva). U+05BC HEBREW POINT DAGESH OR MAPIQ should be used equally for Dagesh Hazaq (strong Dagesh), for Dagesh Qal (weak Dagesh), for the Mapiq dot used with U+05D4 HEBREW LETTER HE and occasionally with U+05D0 HEBREW LETTER ALEF, and for the Shuruq dot used with U+05D5 HEBREW LETTER VAV. The graphical distinctions occasionally made between varieties of Sheva, Dagesh, etc. can be represented in Unicode only with Private Use Area characters.

U+05B8 HEBREW POINT QAMATS should normally be used for both Qamats Gadol (or Qamats Rahav, with an A sound) and Qamats Qatan (or Qamats Hatuf, with an O sound). But because these two vowels are distinguished in print more regularly than varieties of Sheva and Dagesh (although still only in a small subset of liturgical texts), Unicode has defined a separate character (proposed and accepted) U+05BA HEBREW POINT QAMATS QATAN, which should be used only when there is a specific need to distinguish Qamats Qatan from Qamats Gadol with a distinct glyph.

3.3 Ordering of Points

It is common for a single base character to carry more than one point. Most base characters commonly carry both a vowel point and either U+05BC HEBREW POINT DAGESH OR MAPIQ or U+05BF HEBREW POINT RAFE, or sometimes both of the latter. U+05E9 HEBREW LETTER SHIN additionally regularly carries U+05C1 HEBREW POINT SHIN DOT or U+05C2 HEBREW POINT SIN DOT. From a logical and psychological point of view, Shin dot and Sin dot are most closely associated with the base character, followed by Dagesh, Mapiq, or Rafe, and the vowel point is least closely associated, and this is the expected typing order for the Hebrew grapheme cluster. Thus the order to be expected in the Unicode character string is the base character, followed by SHIN DOT or SIN DOT, then DAGESH OR MAPIQ and/or RAFE, then the vowel point. Because of this, and also for ease of implementation, many existing Hebrew rendering engines and fonts are designed to expect this order of characters.

Unfortunately, when combining classes were allocated to the Hebrew points, they were allocated neither according to the standard positional system nor according to the logically and psychologically expected ordering. Each of the points, and also U+05BD HEBREW POINT METEG, was allocated a unique combining class, with the vowels in the lowest numbered classes, SHIN DOT and SIN DOT in the highest numbered, and DAGESH OR MAPIQ and RAFE in between: precisely the opposite of what is logical. The Unicode Stability Policy [Stability] dictates that the relative orders of these combining classes cannot be changed. The implication of this is that in any of the defined Unicode Normalization Forms [Normalization] the order of characters for a Hebrew grapheme cluster is the base character, then the vowel point, then DAGESH OR MAPIQ and/or RAFE, then SHIN DOT or SIN DOT. These normalization forms include the one recommended for use on the Internet, as well as the forms generated by many Unicode processes.
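
The effect of normalization on a grapheme cluster typed in the logical order can be observed directly. The following sketch assumes Python's standard unicodedata module and takes Shin with SHIN DOT, DAGESH and QAMATS as its example; the function name combining_classes is illustrative.

```python
import unicodedata

SHIN, SHIN_DOT, DAGESH, QAMATS = "\u05E9", "\u05C1", "\u05BC", "\u05B8"

# Logical (typing) order: base, SHIN DOT, DAGESH, vowel point.
logical = SHIN + SHIN_DOT + DAGESH + QAMATS

def combining_classes(text):
    """Canonical combining class of each character, in string order."""
    return [unicodedata.combining(ch) for ch in text]

normalized = unicodedata.normalize("NFC", logical)

print(combining_classes(logical))     # [0, 24, 21, 18]
print(combining_classes(normalized))  # [0, 18, 21, 24]
```

The normalized string contains the same four characters, but with the vowel point first and SHIN DOT last.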

The consequence of this is that pointed Hebrew text is likely to be found with its points arranged in at least two very different orders. These orders are canonically equivalent, and so there can be no meaningful distinction between them. Indeed, according to The Unicode Standard [TUS], Version 4.0.0 Section 3.2, Conformance Requirement C9, “Ideally, an implementation would always interpret two canonical-equivalent character sequences identically.” Thus, all Unicode processes, including rendering engines and fonts, are expected to treat the two orders of points identically, including rendering them identically; but there is no requirement that this must happen.

The best advice that can be given to Hebrew users, and to those providing services including fonts for them, is that they should expect to find pointed Hebrew text written with both the logical and the normalized orders of points, and perhaps with other canonically equivalent orders. At least the two major orders should be rendered and otherwise processed identically. In practice the best way to achieve this may be to reorder the text before processing, either according to a Unicode normalization form or into a private canonically equivalent normalization form.
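
One possible private reordering is sketched below. It is an illustration only, not a recommended or standard algorithm: the text is first put into Normalization Form D, and the marks following each base character are then stably sorted so that SHIN DOT or SIN DOT come first, then DAGESH OR MAPIQ or RAFE, then everything else. Because marks of the same combining class keep their relative order, the result remains canonically equivalent to the input. The function names and the priority scheme are assumptions made for this example.

```python
import unicodedata

def _priority(ch):
    """Sort key placing marks in the logical order described in this section."""
    ccc = unicodedata.combining(ch)
    if ccc in (24, 25):   # SHIN DOT, SIN DOT
        return 0
    if ccc in (21, 23):   # DAGESH OR MAPIQ, RAFE
        return 1
    return 2              # vowel points, METEG, accents and other marks

def to_logical_order(text):
    """Reorder combining marks into logical order, preserving equivalence."""
    out, run = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            # A base character ends the current run of marks.
            out.extend(sorted(run, key=_priority))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=_priority))
    return "".join(out)

# Shin + QAMATS + DAGESH + SHIN DOT, as produced by normalization.
nfc_order = "\u05E9\u05B8\u05BC\u05C1"
print([f"U+{ord(ch):04X}" for ch in to_logical_order(nfc_order)])
```

Text in either of the two major orders is mapped to the same string, so later processing needs to handle only one order.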

3.4 Multiple Vowel Points

On rare occasions, mostly in the Hebrew Bible text, a single base character carries

4 Pointed and Accented Hebrew Text

Again, this section covers only the Tiberian system of pointing and accents.

5 References

[BiDi]	Unicode Standard Annex #9: The Bidirectional Algorithm http://www.unicode.org/reports/tr9/
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Normalization]	Unicode Standard Annex #15: Unicode Normalization Forms http://www.unicode.org/reports/tr15/
[Stability]	Unicode Stability Policy http://www.unicode.org/standard/stability_policy.html
[TUS]	The Unicode Standard http://www.unicode.org/standard/versions/enumeratedversions.html#Latest

Modifications

The following summarizes modifications from the previous version of this document.

1	Sample version

Copyright © 2004 [authors] and Unicode, Inc. All Rights Reserved. The Unicode Consortium and [authors] make no expressed or implied warranty of any kind, and assume no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical note. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.