This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Unicode width data inconsistent/outdated


Hi Brian,

On 07.08.2017 at 21:07, Brian Inglis wrote:
...
Implementation considerations for handling the Unicode tables described in
	http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
and implemented in
	https://www.strchr.com/multi-stage_tables

ICU icu4[cj] uses a folded trie of the properties, where the unique property
combinations are indexed, strings of those indices are generated for fixed size
groups of character codes, unique values of those strings are then indexed, and
those indices assigned to each character code group. The result is a multi-level
indexing operation that returns the required property combination for each
character.

https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
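
A minimal sketch of the indexing scheme described above, not ICU's actual code: unique property values are numbered, each fixed-size group of code points becomes a tuple of those numbers, identical tuples are stored once, and a first-stage table maps each group to its stored tuple. The block size of 128, the use of Python's unicodedata.east_asian_width as the property, and all function names are assumptions made for illustration.

```python
import unicodedata

BLOCK_BITS = 7                 # assumed group size: 128 code points
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CP = 0x110000              # one past the last Unicode code point


def build_two_stage(prop):
    """Build (stage1, blocks, combos) for a per-code-point property."""
    values = [prop(cp) for cp in range(MAX_CP)]
    # Step 1: index the unique property values.
    combos = sorted(set(values))
    combo_id = {value: i for i, value in enumerate(combos)}
    # Step 2: turn each group of code points into a tuple of indices,
    # store each distinct tuple once, and record its index per group.
    blocks, block_id, stage1 = [], {}, []
    for start in range(0, MAX_CP, BLOCK_SIZE):
        block = tuple(combo_id[v] for v in values[start:start + BLOCK_SIZE])
        if block not in block_id:
            block_id[block] = len(blocks)
            blocks.append(block)
        stage1.append(block_id[block])
    return stage1, blocks, combos


def lookup(stage1, blocks, combos, cp):
    """Two index operations, then map the stored index back to the value."""
    block = blocks[stage1[cp >> BLOCK_BITS]]
    return combos[block[cp & (BLOCK_SIZE - 1)]]


def east_asian_width(cp):
    # Property taken from whatever Unicode version this Python ships with.
    return unicodedata.east_asian_width(chr(cp))


STAGE1, BLOCKS, COMBOS = build_two_stage(east_asian_width)
```

Comparing len(STAGE1) + len(BLOCKS) * BLOCK_SIZE with MAX_CP shows how much the deduplication saves for a given property and block size.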

The FOX Toolkit uses a similar approach, splitting the 21 bit character code
into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
eliminate redundancy.

ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
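
The 7/7/7 split can be sketched the same way. This is only an illustration of the idea, not the FOX Toolkit's tables or code: the top 7 bits of the code point select a mid-level table, the next 7 bits select a leaf, and the low 7 bits select the entry, with identical leaves and identical mid-level tables stored once. The general category from Python's unicodedata stands in for the property, and every name below is hypothetical.

```python
import unicodedata

BITS = 7
SIZE = 1 << BITS               # 128 entries per table
MASK = SIZE - 1
MAX_CP = 0x110000              # one past the last Unicode code point
TOP_ENTRIES = MAX_CP >> (2 * BITS)   # top-level slots actually needed


def intern_table(table, pool, index):
    """Store a table once in pool and return its position."""
    key = tuple(table)
    if key not in index:
        index[key] = len(pool)
        pool.append(key)
    return index[key]


def build_three_level(prop):
    """Return (top, mids, leaves) covering all code points below MAX_CP."""
    leaves, leaf_index = [], {}
    mids, mid_index = [], {}
    top = []
    for t in range(TOP_ENTRIES):
        mid = []
        for m in range(SIZE):
            base = (t << (2 * BITS)) | (m << BITS)
            leaf = [prop(base + i) for i in range(SIZE)]
            mid.append(intern_table(leaf, leaves, leaf_index))
        top.append(intern_table(mid, mids, mid_index))
    return top, mids, leaves


def lookup3(top, mids, leaves, cp):
    """Three index operations, one per 7-bit group of the code point."""
    mid = mids[top[cp >> (2 * BITS)]]
    leaf = leaves[mid[(cp >> BITS) & MASK]]
    return leaf[cp & MASK]


def category(cp):
    # Property taken from whatever Unicode version this Python ships with.
    return unicodedata.category(chr(cp))


TOP, MIDS, LEAVES = build_three_level(category)
```

Because whole mid-level tables are shared as well as leaves, large uniform stretches such as unassigned planes collapse to a single mid-level table and a single leaf.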

Thanks for the interesting links, I'll check them out.
But such multi-level tables don't really help unless there is a documented procedure for updating them (one is only available for the lowest level, not for the code-embedded levels). Also, as I've demonstrated, my more straightforward and more efficient approach even uses less total space than the multi-level approach if packed table entries are used.
Thomas


