| Version | 1 (first draft) |
| Authors | Peter Kirk |
| Date | 2004-07-26 |
| This Version | http://www.unicode.org/notes/tn99/tn99-1.html
(Draft: http://qaya.org/academic/hebrew/tnHebrew-1-d1.html) |
| Previous Version | http://www.unicode.org/notes/tn99/tn99-87.html |
| Latest Version | http://www.unicode.org/notes/tn99/ |
This document provides an overview of how to represent the Hebrew language in Unicode, using Hebrew script. It also covers representation of Aramaic in biblical and other ancient and mediaeval texts when Hebrew script is used.
This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes/.
This document provides an overview of how to represent in Unicode
the Hebrew
language, using Hebrew script, and either unpointed or using the
Tiberian pointing system. It also covers representation
of Aramaic in biblical and other ancient and mediaeval texts when
Hebrew script is used; the orthographic rules for such Aramaic are
identical to those for Hebrew. Some of the principles given here also
apply to
other languages written in Hebrew script. However, this document
explicitly does not cover the
specific requirements of the Yiddish language. It also does not cover
pointing systems other than the Tiberian system.
This document should be read in conjunction with the current version
of The
Unicode Standard [TUS], and especially with Section
8.1 of Version 4.0.0 of the Standard. This document is intended as
further clarification of the Standard, and certainly not as superseding
or contradicting it.
The document is divided into three main sections corresponding to
three main forms of Hebrew as currently written: unpointed text, which
is commonest
in the modern state of Israel; pointed text with Tiberian pointing,
which is widely used for
educational, poetic and liturgical material; and pointed and accented
text, which is used almost only for the Hebrew Bible. These three main
forms are not entirely distinct; for example, sometimes a few points
are added to otherwise unpointed texts to resolve ambiguous cases; and
a few accents, especially METEG, may be added to
otherwise unaccented texts, e.g. to indicate stress. Nevertheless, it is
convenient to consider the three forms separately here.
Representation in Unicode of unpointed Hebrew text causes few
difficulties. Text is written from right to left, using the Hebrew
letters in the range U+05D0 to U+05EA.
The final forms of letters are encoded as distinct characters, rather
than as contextual variants as in Arabic, and so there is no difficulty
in representing isolated forms, or words of foreign origin in which
the normal final form rules are not observed.
The punctuation marks U+05F3 HEBREW PUNCTUATION GERESH
and U+05F4 HEBREW PUNCTUATION GERSHAYIM
should be used
to indicate abbreviations
and foreign sounds in modern Hebrew; these two are spacing characters,
and must not be confused with U+059C HEBREW ACCENT GERESH
and U+059E HEBREW ACCENT GERSHAYIM,
which are combining
marks and part of the Hebrew accent system. U+05BE HEBREW
PUNCTUATION
MAQAF and U+05C3 HEBREW PUNCTUATION SOF PASUQ
may also be used along with punctuation from the General Punctuation
block. U+20AA NEW SHEQEL SIGN is also
available for use
where needed. The Yiddish digraphs U+05F0
to U+05F2 should not be used in Hebrew
or Aramaic text.
The alphabetic presentation forms U+FB20 to U+FB29,
which are variant forms of Hebrew letters, and the ligature U+FB4F
should not be
used except for special purposes. Substitution of variant forms and of
ligatures should be left to the rendering engine.
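As an illustration of the recommendations above for unpointed text, the following minimal Python sketch reports characters that are discouraged in ordinary Hebrew or Aramaic text. The helper name `find_discouraged` and the reason labels are hypothetical, introduced only for this example; the code point ranges are those named in the text.

```python
# Hypothetical helper (not part of any library): report characters that the
# text above advises against in ordinary unpointed Hebrew or Aramaic.
DISCOURAGED = {
    "Yiddish digraph": range(0x05F0, 0x05F3),        # U+05F0..U+05F2
    "accent, not punctuation": (0x059C, 0x059E),     # ACCENT GERESH, ACCENT GERSHAYIM
    "presentation form": range(0xFB20, 0xFB2A),      # U+FB20..U+FB29
    "ligature": (0xFB4F,),                           # U+FB4F
}


def find_discouraged(text):
    """Return a list of (index, 'U+XXXX', reason) for each discouraged character."""
    hits = []
    for i, ch in enumerate(text):
        for reason, codepoints in DISCOURAGED.items():
            if ord(ch) in codepoints:
                hits.append((i, f"U+{ord(ch):04X}", reason))
    return hits
```

A letter followed by U+05F3 HEBREW PUNCTUATION GERESH passes unreported, while the same letter followed by U+059C HEBREW ACCENT GERESH is flagged. A real checker would probably make the list configurable, since the presentation forms remain legitimate for special purposes.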
The Hebrew Bible is sometimes written without points or accents.
This is a requirement for synagogue scrolls in which the only
characters permitted are the basic Hebrew letters and the upper and
lower dots or puncta extraordinaria.
These dots should be represented in Unicode by U+05C4 HEBREW
MARK UPPER DOT and (proposed and accepted) U+05C5
HEBREW MARK LOWER DOT, combining marks which should follow
their base characters.
In modern Hebrew text numerals are written with “Arabic”
digits, i.e. U+0030 to U+0039 as in
Latin script text, and they are written as in Latin script text with
the
most significant digit at the left. These should be represented in
Unicode with the most significant digit first. The Unicode Bidirectional
Algorithm [BiDi] will normally handle the
alternation between right-to-left and left-to-right rendering. This
Algorithm can sometimes give unexpected results, e.g. when processing a
list of numbers or left-to-right words separated by punctuation
embedded in Hebrew text, or a list of Hebrew words separated by
punctuation embedded in left-to-right text; in such cases the override
mechanism defined in the Algorithm may need to be used.
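The storage order described above can be illustrated with a short Python sketch. The sample string and the helper `force_ltr` are assumptions made for this example only. The sketch shows that digits are stored most significant first alongside right-to-left letters, and how a run might be wrapped in the override controls U+202D LEFT-TO-RIGHT OVERRIDE and U+202C POP DIRECTIONAL FORMATTING when the default result is not the one wanted.

```python
import unicodedata

# Sample text (an assumption for illustration): three Hebrew letters, a space,
# then the digits of 2004 stored most significant digit first.
text = "\u05E9\u05E0\u05EA 2004"

# Bidirectional classes drive the algorithm: Hebrew letters are R,
# the space is WS, and ASCII digits are EN (European Number).
bidi_classes = [unicodedata.bidirectional(ch) for ch in text]

LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def force_ltr(run):
    """Wrap a run so that its characters are displayed strictly left to right."""
    return LRO + run + PDF


# For example, a list of numbers embedded in Hebrew text:
number_list = force_ltr("1, 2, 3")
```

Whether an override is the right tool depends on the content of the run; this is only a sketch of the mechanism the text refers to, not a recommendation for every case.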
Numerals in older Hebrew text were written with Hebrew letters, and
sometimes with additional marks to indicate that the letters were to be
interpreted as numerals. The commonest additional marks are U+05F3
HEBREW PUNCTUATION GERESH
and U+05F4 HEBREW PUNCTUATION GERSHAYIM.
In some texts,
including the marginal notes of some Hebrew Bible editions, numerals
are indicated by either one or two dots above the letter. These dots,
which are distinct from other combining marks used in Hebrew, should be
represented in Unicode by U+0307 COMBINING DOT ABOVE
and U+0308 COMBINING DIAERESIS.
This section covers only the Tiberian pointing system.
Pointed Hebrew text consists of unpointed text with the addition of
the Hebrew vowel points U+05B0 to U+05BB,
also U+05BC HEBREW POINT DAGESH OR MAPIQ, U+05BF
HEBREW POINT RAFE, U+05C1 HEBREW POINT SHIN DOT
and U+05C2 HEBREW POINT SIN DOT. U+05BD HEBREW
POINT METEG is sometimes written in pointed text but is
actually better considered part of the Hebrew accent system. These
vowel points should always come after their associated base characters
in the Unicode character string.
The alphabetic presentation forms U+FB1D and U+FB2A
to U+FB4E, precomposed combinations of Hebrew letters
and points, should not be used except for special purposes. They all
have canonical decompositions to separate letters and points, and these
characters should always be used separately.
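The canonical decompositions mentioned above can be observed with Python's standard `unicodedata` module. This is a minimal sketch; the two sample characters were chosen only for illustration.

```python
import unicodedata

# Two of the precomposed presentation forms discussed above.
yod_with_hiriq = "\uFB1D"        # HEBREW LETTER YOD WITH HIRIQ
shin_with_shin_dot = "\uFB2A"    # HEBREW LETTER SHIN WITH SHIN DOT

# Canonical decomposition yields the separate letter and point.
yod_decomposed = unicodedata.normalize("NFD", yod_with_hiriq)
shin_decomposed = unicodedata.normalize("NFD", shin_with_shin_dot)

# These forms are excluded from composition, so even the composed
# normalization form (NFC) produces the separate letter and point.
shin_nfc = unicodedata.normalize("NFC", shin_with_shin_dot)
```

Here `yod_decomposed` is YOD followed by HIRIQ, and both `shin_decomposed` and `shin_nfc` are SHIN followed by SHIN DOT. Normalizing text is therefore one simple way to eliminate these presentation forms.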
The Hebrew vowel Holam may be written either just as a dot, which is Holam Haser (“defective” Holam), or as a dot above a Vav glyph, which is Holam Male (full Holam). Holam Haser is normally written above the top left of the base character it is associated with. It is sometimes shifted to above the top right of a following Alef, although it is not logically associated with the Alef. In every case Holam Haser should be represented as U+05B9 HEBREW POINT HOLAM, following its logically associated base character. Also, Holam Haser is sometimes merged with a Shin or Sin dot, i.e. it is not written as a separate dot when its base character is Shin with Sin dot or when the following character is Shin with Shin dot. Nevertheless, it is preferable to include U+05B9 HEBREW POINT HOLAM in the Unicode representation, because it is logically present and because the decision to render it separately or not is a typographical one.
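As a worked example of the last case, consider a word in which Holam Haser on one letter is followed by Shin with Shin dot. The word chosen here (Mem with Holam, Shin with Shin dot and Segol, He) is an assumption made for illustration. The Holam is kept in the character string even if a font merges it visually with the Shin dot.

```python
import unicodedata

# Mem + Holam, Shin + Shin dot + Segol, He.
# The Holam follows Mem, its logical base character, even though a font
# may draw it merged with the Shin dot of the following letter.
word = "\u05DE\u05B9\u05E9\u05C1\u05B6\u05D4"

names = [unicodedata.name(ch) for ch in word]
```

The resulting list begins with `HEBREW LETTER MEM`, `HEBREW POINT HOLAM`, `HEBREW LETTER SHIN`, `HEBREW POINT SHIN DOT`, which shows that the Holam is encoded directly after the Mem.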
Holam Male (pronounced O)
and the sequence Vav Haluma
(consonantal Vav followed by Holam Haser, pronounced VO) are
often rendered identically. If this is the intention, both should be
represented in Unicode by the sequence <05D5, 05B9>,
i.e. <VAV, HOLAM>. However, in
more exact typography, especially for biblical, liturgical, poetic and
educational texts, Holam Male
and Vav Haluma are
distinguished by different positionings of the Holam dot. When it is intended to
represent this distinction in Unicode, Holam Male should be represented as
(***TBD***), and Vav Haluma
as (***TBD***).
U+05B0 HEBREW POINT SHEVA should be used for both Sheva Nah (silent Sheva) and Sheva Na (mobile Sheva). U+05BC HEBREW POINT
DAGESH OR MAPIQ should be used equally for Dagesh Hazaq (strong Dagesh), for Dagesh Qal (weak Dagesh), for the Mapiq dot used with U+05D4
HEBREW LETTER HE and occasionally with U+05D0 HEBREW
LETTER ALEF, and for the Shuruq
dot used with U+05D5 HEBREW LETTER VAV. The
occasionally made graphical distinctions between varieties of Sheva, Dagesh, etc. can be made in Unicode
only with Private Use Area characters.
U+05B8 HEBREW POINT QAMATS should normally be used
for both Qamats Gadol (or Qamats Rahav, with an A sound) and Qamats Qatan (or Qamats Hatuf, with an O sound). But
because these points are distinguished in print more regularly than
varieties of Sheva and Dagesh (although still only in
a small subset of liturgical texts), Unicode has defined a separate
character (proposed and accepted) U+05C7 HEBREW POINT QAMATS
QATAN, which should be used only when there is a specific need
to distinguish Qamats Qatan
from Qamats Gadol with a
distinct glyph.
It is common for a single base character to carry more than one
point. Most base characters commonly carry both a vowel point and
either
U+05BC HEBREW POINT DAGESH OR MAPIQ or U+05BF
HEBREW POINT RAFE, or sometimes both of the latter. U+05E9
HEBREW LETTER SHIN additionally regularly carries U+05C1
HEBREW POINT SHIN DOT or U+05C2 HEBREW POINT SIN DOT.
From a logical and psychological point of view, Shin dot and Sin dot are most closely associated
with the base character, followed by Dagesh,
Mapiq, or Rafe; the vowel point is least
closely associated. This is also the expected typing order for the
Hebrew grapheme cluster. Thus the order to be expected in the Unicode
character string is the
base character, followed by SHIN DOT or SIN DOT,
then DAGESH OR
MAPIQ and/or RAFE, then the vowel point.
Because of this, and also for ease of implementation,
many existing Hebrew rendering engines and fonts are designed to expect
this order of characters.
Unfortunately, when combining classes were allocated to the Hebrew
points, they were allocated neither according to the standard
positional system nor according to the logically and psychologically
expected ordering. Each of the points, and also U+05BD HEBREW
POINT METEG, was allocated a unique combining class, such
that the vowels are in the lowest numbered classes and SHIN
DOT and SIN DOT
in the highest numbered, with DAGESH OR
MAPIQ and RAFE in between, precisely the
opposite of what
is logical. The Unicode
Stability Policy [Stability] dictates that
the relative orders of these combining classes
cannot be changed. The implication of this is that in any of the
defined Unicode
Normalization Forms [Normalization]
the order of characters for a
Hebrew grapheme cluster is the base character, then the vowel point,
then DAGESH OR
MAPIQ and/or RAFE, then SHIN DOT
or SIN DOT. These normalization forms include Normalization Form C, the one
recommended for use on the Internet, as well as the forms generated by
many Unicode processes.
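The effect of the combining classes on normalization can be seen in a short Python sketch using the standard `unicodedata` module. The sample cluster (Shin with Shin dot, Dagesh and Qamats) is an assumption chosen for illustration.

```python
import unicodedata

# Canonical combining classes of some of the points discussed above.
points = {
    "QAMATS": "\u05B8",
    "DAGESH OR MAPIQ": "\u05BC",
    "RAFE": "\u05BF",
    "SHIN DOT": "\u05C1",
    "SIN DOT": "\u05C2",
}
classes = {name: unicodedata.combining(ch) for name, ch in points.items()}

# A cluster typed in the logical order:
# SHIN, SHIN DOT, DAGESH OR MAPIQ, QAMATS.
logical = "\u05E9\u05C1\u05BC\u05B8"

# Normalization sorts the marks by ascending combining class, giving
# SHIN, QAMATS, DAGESH OR MAPIQ, SHIN DOT.
normalized = unicodedata.normalize("NFC", logical)
```

The vowel has the lowest class (18), Dagesh and Rafe sit in between (21 and 23), and the Shin and Sin dots have the highest (24 and 25), so `normalized` reverses the order in which the marks were typed.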
The consequence of this is that pointed Hebrew text is likely to be
found with its points arranged in at least two very different orders.
These orders are canonically equivalent, and so there can be no
meaningful distinction between them. Indeed, according to The Unicode
Standard [TUS], Version 4.0.0 Section
3.2, Conformance Requirement C9, “Ideally, an implementation would
always interpret two canonical-equivalent character sequences
identically.” Thus, all Unicode processes, including rendering engines
and fonts, are expected to treat the two orders of points identically,
including rendering them identically; but there is no requirement that
this must happen.
The best advice that can be given to Hebrew users and to those
providing services including fonts for them is that they should expect
to find pointed Hebrew text written with both the logical and the
normalized orders of points, and perhaps with other canonically
equivalent orders, and that at least the two major orders should be
rendered and otherwise processed identically. In practice the best way
to do this may be to reorder the text before processing, either
according to a Unicode normalization form or into a private canonically
equivalent normalization form.
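A reordering into a private, canonically equivalent form might look like the following Python sketch. The function name `to_logical_order` and the priority table are hypothetical. The sketch assumes that only marks with a non-zero combining class are moved, and that a stable sort is used so that marks sharing a combining class keep their relative order; under those assumptions the result stays canonically equivalent to the input.

```python
import unicodedata

# Hypothetical priority table: a lower value sorts closer to the base character.
# Marks not listed here (vowel points, accents, etc.) get priority 2.
_PRIORITY = {
    "\u05C1": 0,  # HEBREW POINT SHIN DOT
    "\u05C2": 0,  # HEBREW POINT SIN DOT
    "\u05BC": 1,  # HEBREW POINT DAGESH OR MAPIQ
    "\u05BF": 1,  # HEBREW POINT RAFE
}


def _sort_run(run):
    # sorted() is stable, so marks with equal priority keep their input order.
    return sorted(run, key=lambda ch: _PRIORITY.get(ch, 2))


def to_logical_order(text):
    """Reorder each run of combining marks into the logical order."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) != 0:
            run.append(ch)          # collect marks following a base character
        else:
            out.extend(_sort_run(run))
            run = []
            out.append(ch)          # class 0 characters are never moved
    out.extend(_sort_run(run))
    return "".join(out)


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

Applied to a cluster in normalized order (SHIN, QAMATS, DAGESH OR MAPIQ, SHIN DOT), `to_logical_order` returns SHIN, SHIN DOT, DAGESH OR MAPIQ, QAMATS, and `canonically_equivalent` confirms that the two strings differ only in mark order. Reordering in the other direction needs no custom code, since any Unicode normalization form already provides it.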
On rare occasions, mostly in the Hebrew Bible text, a single base
character carries
Again, this section covers only the Tiberian system of pointing and accents.
| [BiDi] | Unicode Standard Annex #9: The
Bidirectional Algorithm http://www.unicode.org/reports/tr9/ |
| [FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
| [Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
| [Normalization] | Unicode Standard Annex #15: Unicode
Normalization Forms http://www.unicode.org/reports/tr15/ |
| [Stability] |
Unicode Stability Policy http://www.unicode.org/standard/stability_policy.html |
| [TUS] | The Unicode Standard http://www.unicode.org/standard/versions/enumeratedversions.html#Latest |
The following summarizes modifications from the previous version of this document.
| 1 | Sample version |
Copyright © 2004 [authors] and Unicode, Inc. All Rights Reserved. The Unicode Consortium and [authors] make no expressed or implied warranty of any kind, and assume no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical note. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.