Asian Language & Alphabet Support for C Strings with Diacritics and Bi-Directional Texts
The C string preprocessor is an input and output text string converter inside IconEdit.
The input processor finds the C strings in text catalogs and makes fonts.
The output processor modifies the C strings to use the fonts on a normal left to right display with minimum embedded processor overhead.
The font and the converted C strings can be used directly by the compiler and the display drivers from RAMTEX.
The C string preprocessor has diacritic and directional support for the following languages and alphabets:
- Arabic, Assamese
- Bangla, Bengali, Bodo, Dogri, Buginese
- Cambodian
- Dari, Devanagari
- Farsi
- Gujarati, Gurmukhi
- Hebrew, Hindi
- Kannada, Kashmiri, Khmer, Konkani, Kusunda
- Lao
- N'Ko
- Maithili, Malayalam, Marathi, Meitei, Myanmar
- Nepali, Nihali
- Odia, Oriya
- Persian, Punjabi
- Sanskrit, Santali, Sindhi, Sinhala, Syriac
- Tamil, Telugu, Thaana, Thai, Tibetan
- Urdu
Input C string converter for hexadecimal characters, Asian alphabets and classic 8-bit texts
The input preprocessor converts C text strings to internal 16-bit Unicode in IconEdit.
- Convert UTF-8 hexadecimal text strings to 16-bit Unicode.
- Convert UTF-16 hexadecimal numbers in strings to 16-bit Unicode.
- Convert UTF-32 hexadecimal numbers for high plane emoji in strings to Unicode surrogate characters.
- Combine surrogate characters to find high plane characters such as emoji.
- Move high plane characters to the private area in 16-bit Unicode.
- Find combinations of characters, ligatures, and diacritics to make combined characters.
- Find and add Arabic presentation characters.
- Convert classic 8-bit encoded text strings to 16-bit Unicode.
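The surrogate steps in the list above can be illustrated with a small C sketch. It splits a high plane code point into a UTF-16 surrogate pair and combines a pair back into one code point. The function names are hypothetical and are not taken from IconEdit; only the arithmetic is standard UTF-16.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if u is a high (leading) surrogate, D800..DBFF. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800u && u <= 0xDBFFu; }

/* Nonzero if u is a low (trailing) surrogate, DC00..DFFF. */
int is_low_surrogate(uint16_t u) { return u >= 0xDC00u && u <= 0xDFFFu; }

/* Split a high plane code point (U+10000..U+10FFFF) into a surrogate pair. */
void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(hi != 0 && lo != 0);
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    cp -= 0x10000u;                             /* 20 bits remain */
    *hi = (uint16_t)(0xD800u + (cp >> 10));     /* upper 10 bits  */
    *lo = (uint16_t)(0xDC00u + (cp & 0x3FFu));  /* lower 10 bits  */
}

/* Combine a surrogate pair into the high plane code point it represents. */
uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u
         + ((uint32_t)(hi - 0xD800u) << 10)
         + (uint32_t)(lo - 0xDC00u);
}
```

For the emoji \U0001F603 this gives the surrogate pair D83D DE03, and combining D83D DE03 returns 1F603 again.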
After the input conversion, IconEdit creates all necessary characters for the text strings as one font.
In this example IconEdit reads and converts a C-like pseudocode file containing only these two lines:
wchar32 szSmile[]={L"Smiley স্মাইলি \U0001F603 !"};
wchar32 szCable[]={L"Cable Car ಕೇಬಲ್ ಕಾರು \U0001F6A1"};
The input converter ignores anything outside the double quotes.
The resulting font optimized for the text strings:
The input converter creates combined characters in the private area E700 to F8FF in Unicode.
High plane Unicode characters such as emoji can be addressed either as 16-bit or as 32-bit characters:
Addressing emoji as 16 bit characters will make your texts take up less memory.
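One way to picture the move of high plane characters into the private area is a small lookup table that hands out 16-bit private codes. The sketch below is only an assumption about how such a mapping could work: the type `private_map`, the function `private_code_for`, and the first-come allocation order are hypothetical. IconEdit's actual assignment of private codes may differ, for example because combined characters share the same E700 to F8FF range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRIVATE_FIRST 0xE700u   /* first private code, as stated in the text */
#define PRIVATE_LAST  0xF8FFu   /* last private code, as stated in the text  */
#define MAX_PRIVATE   (PRIVATE_LAST - PRIVATE_FIRST + 1u)

/* Hypothetical table: slot i holds the code point mapped to PRIVATE_FIRST + i. */
typedef struct {
    uint32_t source[MAX_PRIVATE];
    size_t   used;
} private_map;

/* Return the 16-bit private code for cp, allocating a slot on first use.
   Returns 0 when the private area is full. */
uint16_t private_code_for(private_map *map, uint32_t cp)
{
    size_t i;
    assert(map != NULL);
    for (i = 0; i < map->used; i++) {
        if (map->source[i] == cp)
            return (uint16_t)(PRIVATE_FIRST + i);   /* already mapped */
    }
    if (map->used >= MAX_PRIVATE)
        return 0;                                   /* no free slot   */
    map->source[map->used] = cp;
    return (uint16_t)(PRIVATE_FIRST + map->used++);
}
```

With such a table each emoji costs one 16-bit character in the text string instead of a surrogate pair, which is where the memory saving comes from.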
IconEdit always orders the characters in the font alphabetically according to Unicode.
The new Unicode character value (code point) is shown above each character.
The input text is shown automatically with the font:
Only the text inside the string is in the font, the rest is there for orientation.
Output C string converter for diacritics and combined characters
Combinations of basic characters and diacritics in the input strings are substituted with the combined characters in the private area.
- Find and replace Asian entry characters and diacritics with combined characters.
- Display Asian combined characters on a microcontroller without extra text processing.
A basic character plus diacritic example:
Asian texts are written as basic characters, each followed by zero or more diacritics.
This is how Asian texts are stored in a computer:
Each combination of a basic character followed by its diacritics is identified as a group:
The output converter searches the font for matching combined characters:
The text is then converted by the IconEdit output converter for correct display:
Combinations of basic characters and diacritics are combined to one character.
The output file with the converted text strings is linked to the font and the two should be used together by the compiler and the display.
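The substitution described in this section can be sketched in C as a table lookup over the 16-bit string. This is a minimal illustration, not IconEdit's implementation: the `combo` type and `substitute_combos` function are hypothetical, and the table entries reuse the private codes E700 and E701 from the Smiley example, assuming they stand for the groups স্মা and লি.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One font entry: a basic character plus diacritics, and its combined code. */
typedef struct {
    const uint16_t *seq;   /* basic character followed by its diacritics */
    size_t          len;   /* number of 16-bit characters in seq         */
    uint16_t        combined; /* private-area code of the combined glyph */
} combo;

/* Copy `in` to `out`, replacing each table sequence with its combined code.
   The table is tried in order, so longer sequences should be listed first.
   `out` needs room for n characters. Returns the number written. */
size_t substitute_combos(const uint16_t *in, size_t n,
                         const combo *table, size_t ntable,
                         uint16_t *out)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    while (i < n) {
        size_t t, k;
        int matched = 0;
        for (t = 0; t < ntable && !matched; t++) {
            if (table[t].len == 0 || table[t].len > n - i)
                continue;                      /* cannot fit here */
            for (k = 0; k < table[t].len && in[i + k] == table[t].seq[k]; k++) {
                /* compare character by character */
            }
            if (k == table[t].len) {           /* whole group matched */
                out[o++] = table[t].combined;
                i += k;
                matched = 1;
            }
        }
        if (!matched)
            out[o++] = in[i++];                /* keep character as is */
    }
    return o;
}
```

Fed with স্মাইলি (09B8 09CD 09AE 09BE 0987 09B2 09BF) and the two table entries, the sketch produces E700 0987 E701, the same three characters as in the converted Smiley string. The display driver then only has to draw one glyph per character.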
Output C string converter for presentation characters and bi-directional texts
Basic characters in the input strings are substituted with presentation characters and both text strings and symbols are reversed if necessary.
- Replace basic Arabic entry characters with presentation characters.
- Solve Middle Eastern bi-directional character and symbol direction issues.
- Display Arabic presentation characters in right to left strings.
A bi-directional example:
Arabic Text is Stored Left to Right but should be Displayed Right to Left:
The text is split up in types:
- Black is a single-digit number; no action is needed.
- Red is a mathematical operator; it should be mirrored.
- Blue is basic Arabic characters; they should be put in reverse order and substituted with presentation characters according to their position in the word.
- Green is a multi-digit number; its digits should keep their order.
Text strings with right to left characters are reversed for left to right displays:
Arabic Text with numbers Displayed Right to Left:
This is how the output converter stores the final string.
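The reordering rules above can be sketched in C for a string that is entirely right to left. This is a simplified illustration under stated assumptions, not the full Unicode Bidirectional Algorithm and not IconEdit's code: the function names are hypothetical, only a few mirrored symbols are handled, embedded left-to-right words are not treated, and the substitution with presentation characters is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse s[first..last-1] in place. */
void reverse_range(uint16_t *s, size_t first, size_t last)
{
    while (first + 1 < last) {
        uint16_t tmp = s[first];
        s[first++] = s[--last];
        s[last] = tmp;
    }
}

/* ASCII digits and Arabic-Indic digits (U+0660..U+0669). */
int is_digit16(uint16_t c)
{
    return (c >= '0' && c <= '9') || (c >= 0x0660u && c <= 0x0669u);
}

/* Mirror a small, assumed set of direction-sensitive symbols. */
uint16_t mirror16(uint16_t c)
{
    switch (c) {
    case '<': return '>';
    case '>': return '<';
    case '(': return ')';
    case ')': return '(';
    default:  return c;
    }
}

/* Convert a right-to-left string from storage order to the order needed
   on a left-to-right display: reverse everything, restore the digit order
   inside each number, and mirror operators. */
void rtl_to_visual(uint16_t *s, size_t n)
{
    size_t i = 0;
    assert(s != NULL || n == 0);
    reverse_range(s, 0, n);
    while (i < n) {
        if (is_digit16(s[i])) {
            size_t start = i;
            while (i < n && is_digit16(s[i]))
                i++;
            reverse_range(s, start, i);   /* multi-digit numbers keep order */
        } else {
            s[i] = mirror16(s[i]);
            i++;
        }
    }
}
```

A single digit is a run of length one, so reversing it changes nothing, which matches the "no action is needed" case in the list.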
Output C string converter for hexadecimal characters
IconEdit can convert the input text Smiley স্মাইলি \U0001F603 ! to one of the following output formats:
- Smiley \xE700ই\xE701 \xE706 ! Pure Unicode with private characters as 16-bit hexadecimal. This makes the text string easier to read for humans but makes no difference to the compiler.
- Smiley \xE700\x0987\xE701 \xE706 ! UTF-16 hexadecimal for old editors that cannot read Unicode. This is still Unicode to the compiler.
- Smiley \xEE\x9C\x80\xE0\xA6\x87\xEE\x9C\x81 \xEE\x9C\x86 ! UTF-8 hexadecimal for old 8-bit compilers that cannot handle Unicode strings. To the compiler, this is a classic 8-bit text. Use the UTF-8 option in the RAMTEX driver library to display the text as Unicode. This makes it possible to use 16-bit Unicode texts and fonts with an 8-bit compiler.
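The relation between the 16-bit and the UTF-8 hexadecimal formats above can be checked with a short C sketch of standard UTF-8 encoding for 16-bit characters. The function name `utf8_encode16` is hypothetical and is not a RAMTEX library call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit Unicode character as UTF-8.
   Writes 1 to 3 bytes to out and returns the byte count.
   Surrogates (D800..DFFF) are not valid input and must be
   combined or remapped by the caller first. */
size_t utf8_encode16(uint16_t cp, unsigned char out[3])
{
    assert(out != NULL);
    if (cp < 0x80u) {                       /* 7 bits: one byte      */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {                      /* 11 bits: two bytes    */
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    out[0] = (unsigned char)(0xE0u | (cp >> 12));   /* 16 bits: three */
    out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 3;
}
```

The private character E700 encodes to EE 9C 80 and the Bengali letter 0987 to E0 A6 87, which are the byte values shown in the UTF-8 hexadecimal output above.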
Memory consumption for different string formats
UTF-16 hexadecimal and Unicode use 2.0 bytes of ROM space per character for normal languages and 4.0 bytes for emoji and rare Chinese and Japanese names.
UTF-8 hexadecimal takes up a different amount of ROM space per character depending on language and alphabet:
- 1.0 byte per character: American English.
- 1.1 - 1.3 bytes per character: Other languages written with the Latin alphabet.
- 2.0 - 2.2 bytes per character: Other European and Middle Eastern languages except Arabic.
- 2.6 - 2.9 bytes per character: Arabic and South Asian languages.
- 3.0 bytes per character: Chinese, Japanese, and Korean.
- 4.0 bytes per character: Emoji and rare Chinese and Japanese names.
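To estimate the ROM use of a particular text in both formats, the per-character sizes can simply be added up. The C sketch below does this for a string of 16-bit characters; the function names are hypothetical, and string terminators and high plane characters stored as surrogate pairs are not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed for one 16-bit Unicode character. */
size_t utf8_bytes_for(uint16_t cp)
{
    if (cp < 0x80u)  return 1;   /* ASCII                          */
    if (cp < 0x800u) return 2;   /* e.g. Latin accents, Hebrew     */
    return 3;                    /* rest of the 16-bit range       */
}

/* Total UTF-8 size in bytes of a string of n 16-bit characters. */
size_t utf8_size(const uint16_t *s, size_t n)
{
    size_t i, total = 0;
    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += utf8_bytes_for(s[i]);
    return total;
}

/* Total UTF-16 size in bytes of the same string. */
size_t utf16_size(size_t n)
{
    return 2 * n;
}
```

For the converted Smiley string (14 characters, four of them above U+07FF) this gives 22 bytes as UTF-8 and 28 bytes as UTF-16, or about 1.6 bytes per character in UTF-8, because most of that string is plain ASCII.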
The connection between C string formats and text file formats
Windows can save plain text in 4 different file formats:
- ANSI, 8 bit: One byte per character. Classic 8-bit encoding of 256 characters for a few languages. Text is only portable to a limited number of countries.
- Unicode little endian, 16 bit: Two bytes per character with the least significant byte first. Unicode encoding of 65536 characters for all living languages. Text is portable anywhere.
- Unicode big endian, 16 bit: Two bytes per character with the least significant byte last. Unicode encoding of 65536 characters for all living languages. Text is portable anywhere.
- Unicode UTF-8, 8-24 bit: Between one and three bytes per character. Unicode encoding of 65536 characters for all living languages. Text is portable anywhere.
IconEdit can save C-source text strings in 4 different string formats:
- Unicode 16 bit: String texts and comments stay as 16-bit Unicode characters and are saved as Unicode text files. Both strings and comments are portable anywhere.
- Unicode 16 bit hexadecimal: String texts are converted to 16-bit hexadecimal characters and comments are converted to a classic 8-bit Windows or ISO-8859 encoding of your choice. The string texts are still Unicode to the compiler, but encapsulated in 7-bit ASCII. Strings are portable anywhere.
- Unicode UTF-8, 8-24 bit hexadecimal: String texts are converted to UTF-8 8-bit hexadecimal characters and comments are converted to a classic 8-bit Windows or ISO-8859 encoding of your choice. The string texts are still Unicode to the compiler, but encapsulated in 7-bit ASCII. Strings are portable anywhere.
- Classic 8 bit: Both string texts and comments are converted to a classic 8-bit DOS, Windows, or ISO-8859 encoding of your choice and saved as 8-bit text files. Strings, fonts, and comment texts are only portable to a limited number of countries.
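What "encapsulated in 7-bit ASCII" means for the 16-bit hexadecimal format can be shown with a small C sketch that writes a 16-bit string as printable ASCII plus \xHHHH escapes. The function `escape_hex16` is hypothetical and only approximates the output format shown earlier; IconEdit's exact escaping rules are not documented here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write s as 7-bit ASCII text for use inside a wide C string literal.
   Printable ASCII is copied, everything else becomes \xHHHH.
   Returns the length written (without the terminator), or 0 if
   out is too small. */
size_t escape_hex16(const uint16_t *s, size_t n, char *out, size_t cap)
{
    size_t i, o = 0;
    int prev_escaped = 0;
    assert(s != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        int plain = s[i] >= 0x20u && s[i] < 0x7Fu
                 && s[i] != '"' && s[i] != '\\';
        /* A C hex escape swallows all following hex digits, so a hex
           digit right after an escape is written as an escape too. */
        if (plain && prev_escaped && isxdigit((int)s[i]))
            plain = 0;
        if (plain) {
            if (o + 1 >= cap) return 0;
            out[o++] = (char)s[i];
        } else {
            if (o + 6 >= cap) return 0;
            o += (size_t)sprintf(out + o, "\\x%04X", (unsigned)s[i]);
        }
        prev_escaped = !plain;
    }
    if (o >= cap) return 0;
    out[o] = '\0';
    return o;
}
```

The resulting text contains only 7-bit ASCII, so it survives any editor or file encoding, while the compiler still sees the original 16-bit characters when it parses the escapes in a wide string literal.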
Trace characters through the process with the mouse help and blue marks
Blue marks can be set by the mouse and follow the character through all windows.
Use mouse help in all windows to see how the character is created and used.
Font with blue high-light for selected character:
Text with blue frame around selected character. Mouse help has an additional text length indicator so you can see if the text will fit the target display:
Output text with private characters as 16-bit hexadecimal:
UTF-8 hexadecimal output text, selected character takes up 3 bytes:
Both mouse help and blue marks can be turned off and on at any time.
More about Middle Eastern and South Asian fonts.