Roots of Sourashtra: Devanagari Script Recognition

More than 300 million people around the world use the Devanagari script. It is the base script of many languages of India, including Hindi, Sanskrit, Marathi, Sourashtra, Konkani, and Maithili, and other languages use variants of it. Its basic set of symbols consists of 34 consonants and 18 vowels, and although Devanagari has its own symbols for numerals, Arabic numerals are now commonly used. Optical Character Recognition (OCR) for Devanagari is highly complex due to its rich set of conjuncts.

Devanagari is written from left to right along a horizontal line. Its basic set of symbols consists of 34 consonants ('vyanjan') and 18 vowels ('svar'). The characters of a word are joined by a horizontal bar along their tops, which forms a line from which the text appears to hang. Traditional manuscripts often ran words together without spaces, whereas modern printed text separates words with spaces. A single or double vertical line called a danda was traditionally used to indicate the end of a phrase or sentence. Devanagari also has a native set of symbols for numerals, though Arabic numerals are typically used.
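Because the horizontal bar described above is usually the densest row of ink in a line of text, a common first step in Devanagari OCR is to locate it with a horizontal projection (a count of ink pixels per row). The following is only a minimal sketch of that idea, assuming the image has already been reduced to a grid of 0 (background) and 1 (ink); the function name and the toy image are made up for illustration.

```python
def find_headline_row(bitmap):
    """Return the index of the row with the most ink pixels.

    `bitmap` is a list of rows; each row is a list of 0 (background)
    or 1 (ink). The densest row is taken as the headline position.
    """
    row_sums = [sum(row) for row in bitmap]
    return max(range(len(row_sums)), key=row_sums.__getitem__)


# Toy 6x10 "word image" (hypothetical data, not a real scan).
word = [
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # stroke above the bar
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # the horizontal bar
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],  # stroke below the letters
]

headline = find_headline_row(word)  # 1
```

Once the bar is found, the rows above it can be searched for upper marks and the rows below it for the main letter shapes. Real scans are skewed and noisy, so a practical system would need more than a single maximum.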

In part, Devanagari owes its complexity to its rich set of conjuncts. The script is largely phonetic: a word written in Devanagari can be pronounced in only one way, although not every possible pronunciation can be written exactly. A syllable ("akshar") is formed by a vowel alone or by any combination of consonants with a vowel.
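The syllable rule above can be sketched as a regular expression over Unicode text. This is a simplified illustration, not a complete grammar: it assumes the common code point ranges of the Unicode Devanagari block (consonants, independent vowels, vowel signs, the virama that joins consonants into conjuncts, and marks such as anusvara) and ignores rarer characters. The names `AKSHARA` and `split_aksharas` are made up for this example.

```python
import re

# Simplified character classes from the Unicode Devanagari block.
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant + optional nukta
VIRAMA = "\u094D"                                   # joins consonants
VOWEL = "[\u0904-\u0914]"                           # independent vowels
MATRA = "[\u093E-\u094C]"                           # dependent vowel signs
MODIFIER = "[\u0900-\u0903]"                        # candrabindu, anusvara, visarga

# Either a consonant cluster with an optional vowel sign,
# or an independent vowel standing alone.
AKSHARA = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{MATRA}|{VIRAMA})?{MODIFIER}*"
    f"|{VOWEL}{MODIFIER}*"
)


def split_aksharas(text):
    """Return the list of syllable units found in `text`."""
    return AKSHARA.findall(text)


syllables = split_aksharas("नमस्ते")  # ['न', 'म', 'स्ते']
```

The last unit shows a conjunct: two consonants joined by a virama and followed by a vowel sign are treated as one syllable.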

Here is a sample set of non-compound Devanagari characters.

You can clearly see that some characters have upper and lower modifiers. Here is a sample of Devanagari modifiers.

These modifiers make Optical Character Recognition (OCR) of Devanagari script very challenging. OCR is further complicated by compound characters, which make character separation and identification very difficult.

Examples of some compound characters are illustrated below.

OCR for Devanagari script becomes even more difficult when compound characters and modifiers occur in 'noisy' conditions. The image below illustrates a Devanagari document with background noise. Compound characters and modifiers are difficult to detect in this image because the background is not uniform in color and contains marks that must be distinguished from characters.
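A usual first step against this kind of noise is to binarize the image, that is, to decide for each pixel whether it is ink or background. The sketch below uses Otsu's method, a standard global threshold chosen here only for illustration; it is not necessarily what any particular Devanagari OCR system uses. It assumes 8-bit greyscale values (0 is black, 255 is white), and the function names are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that best separates dark from light.

    Pixels with value <= t form the dark class. The chosen t maximizes
    the between-class variance (Otsu's criterion).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of their grey levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a greyscale image (list of rows) to 1 for ink, 0 for background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Tiny made-up patch: a dark vertical stroke on an uneven light background.
patch = [
    [200, 30, 210],
    [190, 25, 220],
    [205, 35, 180],
]
ink = binarize(patch)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

A single global threshold works only when ink and background are well separated. For backgrounds that vary strongly across the page, locally adaptive thresholds are generally a better fit, and stray marks still have to be filtered out afterwards.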

Devanagari text can be represented in two ways: transliteration and Unicode. Both formats are widely used, and each claims to cover the entire Devanagari character set. A transliteration map is shown below. Transliteration converts letters of the English alphabet into Devanagari characters based on phonetic correspondence.
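To make the idea concrete, here is a small sketch of table-driven transliteration into Unicode. The table is a hypothetical subset invented for this example; it is not the map referred to above and does not follow any particular published scheme. It assumes that a vowel after a consonant is written as a dependent sign, that 'a' after a consonant is the inherent vowel and adds nothing, and that two adjacent consonants are joined with a virama.

```python
# Hypothetical subset of a transliteration table (illustration only).
CONSONANTS = {
    "k": "\u0915", "g": "\u0917", "t": "\u0924", "d": "\u0926", "n": "\u0928",
    "p": "\u092A", "m": "\u092E", "r": "\u0930", "s": "\u0938", "h": "\u0939",
}
# Latin vowel -> (independent letter, dependent sign used after a consonant)
VOWELS = {
    "a": ("\u0905", ""),        # inherent vowel: no sign needed
    "aa": ("\u0906", "\u093E"),
    "i": ("\u0907", "\u093F"),
    "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"),
    "e": ("\u090F", "\u0947"),
    "o": ("\u0913", "\u094B"),
}
VIRAMA = "\u094D"


def tokenize(word):
    """Split a Latin-letter word into table keys, longest match first."""
    tokens, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in VOWELS or chunk in CONSONANTS:
                tokens.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for {word[i]!r}")
    return tokens


def transliterate(word):
    """Convert a Latin-letter word to Devanagari using the tables above."""
    out = []
    after_consonant = False
    for tok in tokenize(word):
        if tok in VOWELS:
            independent, sign = VOWELS[tok]
            out.append(sign if after_consonant else independent)
            after_consonant = False
        else:
            if after_consonant:
                out.append(VIRAMA)  # join two consonants into a conjunct
            out.append(CONSONANTS[tok])
            after_consonant = True
    return "".join(out)


greeting = transliterate("namaste")  # नमस्ते
language = transliterate("hindii")   # हिन्दी
```

The output is ordinary Unicode text, so the same word can be stored either as the Latin-letter key sequence or as Devanagari code points.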

Free software is available that can convert English text into Devanagari based on the transliteration format.

Wednesday, January 6, 2010
