This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Grepping Unicode files?


On 05/14/2015 11:14 AM, Vince Rice wrote:

Your mails are hard to read:
https://cygwin.com/acronyms/#PCYMTWLL

>>
>> None.  UTF16 is not a valid locale.  It is a valid encoding (wide
>> character), but locales must operate on multi-byte sequences, not wide
>> characters.  So you HAVE to convert from wide character to multi-byte
>> before you can do anything that requires a locale to work correctly.
> 
> Oh my, the rabbit-hole gets deeper. I don't know the difference between wide character and multi-byte. A little searching appears to indicate that Unicode is a type of wide-character, while multi-byte is ... well, I still don't know what multi-byte is. :) But, we're definitely out in the weeds of non-cygwinness here, and my file is UTF16, so I can learn what multi-byte is and the difference later.

First, you need to learn the difference between a character (which has a
name, a glyph when represented in a font, and a code point for what
order the character appears when listed in a set) and an encoding (which
describes how many bytes and the values of those bytes represent a code
point).  An encoding should have a mapping back to the character set,
but it is possible for some byte values to not have an assigned
character; it is also possible to require more than one byte to
represent a character.  A single character set can have more than one
encoding, and a character can exist in more than one character set.
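
To make the distinction concrete, here is a minimal Python sketch (my
own illustration, not something from this thread; the variable names
are arbitrary).  It shows one character with one code point turning
into different byte sequences under different encodings, and a byte
value that has no assigned character in ASCII:

```python
# One character, one code point, several byte encodings.
ch = "a"
code_point = ord(ch)  # 0x61 in ASCII, Latin-1 and Unicode alike

encodings = ["ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}
for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")

# Some byte values have no assigned character in a given encoding:
# ASCII only defines the values 0x00 through 0x7f.
try:
    b"\x80".decode("ascii")
    undefined_byte_rejected = False
except UnicodeDecodeError:
    undefined_byte_rejected = True
print("0x80 rejected by ASCII:", undefined_byte_rejected)
```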

Unicode is a definition of a character set (it covers the range u+00000
to u+10ffff, although not all of those values have a character assigned).
It is a superset of most other character definitions (ASCII being a
common one; other names you might have heard are Latin-1 and Latin-15,
the latter more formally ISO-8859-15, also known as Latin-9).  In fact,
it aims to someday be a character set that IS a superset of all others
(but it is constantly being amended and more characters defined, as
people point out useful? characters that have not yet been
incorporated).  Conversely, for any other character set out there, there
is a character that is defined in Unicode but not defined in the weaker set.

Unicode has multiple encodings; among them, the more popular encodings
are UTF-32 (also called UCS-4) (every character occupies exactly 4
bytes), UTF-16 (most characters occupy 2 bytes each, but some characters
require 4 bytes because they are represented as surrogate pairs), UTF-8
(characters occupy a variable number of bytes, where ASCII characters
are 1 byte, and the maximum space required is 4 bytes), and the Java
variant of UTF-8 (like UTF-8, except that u+0000 is encoded specially
and surrogate pairs are encoded literally requiring 6 bytes rather than
4 for characters above u+0ffff).  Other encodings are also mentioned
here: https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
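
The size rules above can be sketched in Python as follows.  The helper
names (utf8_len, utf16_len, utf32_len, matches_codecs) are hypothetical
ones I made up for illustration; the sketch simply compares the
thresholds against what Python's own codecs produce, and it skips the
surrogate range u+d800 to u+dfff, which cannot be encoded on its own:

```python
def utf8_len(cp: int) -> int:
    """Bytes UTF-8 needs for one code point (surrogates excluded)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def utf16_len(cp: int) -> int:
    """2 bytes up to U+FFFF, otherwise 4 bytes (a surrogate pair)."""
    return 2 if cp < 0x10000 else 4

def utf32_len(cp: int) -> int:
    """UTF-32 always uses 4 bytes."""
    return 4

def matches_codecs(cp: int) -> bool:
    """Compare the rules above with Python's built-in codecs."""
    ch = chr(cp)
    return (len(ch.encode("utf-8")) == utf8_len(cp)
            and len(ch.encode("utf-16-be")) == utf16_len(cp)
            and len(ch.encode("utf-32-be")) == utf32_len(cp))

samples = [0x61, 0xE9, 0x20AC, 0x1F4A9]
for cp in samples:
    print(f"U+{cp:04X}: utf-8={utf8_len(cp)} utf-16={utf16_len(cp)} "
          f"utf-32={utf32_len(cp)} agrees={matches_codecs(cp)}")
```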

Meanwhile, a single-byte encoding is one that has at most 256
characters; many older character sets meet this property (ASCII,
Latin-1, etc).  There are also character sets other than Unicode that
require multi-byte encodings (such as Shift-JIS and Big5), but as they
encode fewer characters than Unicode, they tend to be less popular
today.  That makes Unicode the character set of choice if you need to
communicate internationally.

More concretely, consider these examples (assuming your email client is
set to read UTF-8 email, because that's what I'm sending; the UTF-16 and
UTF-32 byte sequences are shown in big-endian order):

'a' (the character named "lowercase a"): defined in ASCII (code point
0x61, single-byte encoding '\x61'), defined in Latin-1 (code point 0x61,
single-byte encoding '\x61'), defined in Latin-15 (code point 0x61,
single-byte encoding '\x61'), defined in Unicode (code point u+00061,
single-byte UTF-8 encoding '\x61', single-byte Java encoding '\x61',
2-byte UTF-16 encoding '\x00\x61', 4-byte UTF-32 encoding
'\x00\x00\x00\x61')

'€' (the character named "euro sign"): not defined in ASCII, not defined
in Latin-1, defined in Latin-15 (code point 0xa4, single-byte encoding
'\xa4'), defined in Unicode (code point u+020ac, 3-byte UTF-8 encoding
'\xe2\x82\xac', 3-byte Java encoding '\xe2\x82\xac', 2-byte UTF-16
encoding '\x20\xac', 4-byte UTF-32 encoding '\x00\x00\x20\xac')

and my favorite, from
http://www.fileformat.info/info/unicode/char/1F4A9/index.htm

'💩' (the character named "pile of poo") (if your system font has a
rendering for this character, consider yourself lucky! - or is that
cursed?): not defined in ASCII, not defined in Latin-1, not defined in
Latin-15, defined in Unicode (code point u+1f4a9, 4-byte UTF-8 encoding
'\xf0\x9f\x92\xa9', 6-byte Java encoding '\xed\xa0\xbd\xed\xb2\xa9',
4-byte UTF-16 encoding '\xd8\x3d\xdc\xa9', 4-byte UTF-32 encoding
'\x00\x01\xf4\xa9').
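
If you want to check byte sequences like these yourself, a Python
sketch along the following lines should do.  Python has no built-in
codec for the Java variant, so java_modified_utf8 and can_encode are
hypothetical helpers written for this illustration, and the Java
helper is only an approximation that assumes well-formed input (no
lone surrogates):

```python
def can_encode(ch: str, encoding: str) -> bool:
    """True if the character exists in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def java_modified_utf8(text: str) -> bytes:
    """Approximate Java's variant of UTF-8 (illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"  # U+0000 is encoded specially
        elif cp > 0xFFFF:
            # Take the two UTF-16 surrogates and encode each one
            # literally as a 3-byte sequence.
            units = ch.encode("utf-16-be")
            for i in (0, 2):
                unit = int.from_bytes(units[i:i + 2], "big")
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)

euro = "\u20ac"
poo = "\U0001f4a9"

for label, ch in (("euro", euro), ("poo", poo)):
    print(label,
          "ascii:", can_encode(ch, "ascii"),
          "latin-1:", can_encode(ch, "latin-1"),
          "iso-8859-15:", can_encode(ch, "iso-8859-15"))
    print("  utf-8    ", ch.encode("utf-8").hex())
    print("  java     ", java_modified_utf8(ch).hex())
    print("  utf-16-be", ch.encode("utf-16-be").hex())
    print("  utf-32-be", ch.encode("utf-32-be").hex())
```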

One more piece of information: on Cygwin, wchar_t is 2 bytes (for
compatibility with Windows); that means that Cygwin prefers wide
operations in UTF-16, and has to use surrogate pairs for characters over
u+ffff. On Linux, glibc sets wchar_t to 4 bytes, and prefers wide
operations in UCS-4.
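
The surrogate-pair arithmetic is short enough to sketch in Python.
This is only an illustration of the standard UTF-16 rule, not of how
Cygwin or glibc implement it internally; to_surrogate_pair,
utf16_units and utf32_units are made-up names.  The unit counts stand
in for how many wchar_t elements a string would need with a 2-byte
versus a 4-byte wchar_t:

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = cp - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

def utf16_units(text: str) -> int:
    """Number of 2-byte units, as with a 2-byte wchar_t."""
    return len(text.encode("utf-16-le")) // 2

def utf32_units(text: str) -> int:
    """Number of 4-byte units, as with a 4-byte wchar_t."""
    return len(text.encode("utf-32-le")) // 4

high, low = to_surrogate_pair(0x1F4A9)
print(f"U+1F4A9 -> {high:04x} {low:04x}")
print("2-byte units:", utf16_units("\U0001f4a9"))
print("4-byte units:", utf32_units("\U0001f4a9"))
```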

>> grep cannot handle UTF16 natively.  iconv exists to do encoding
>> transformations, so that the rest of the system can live in multi-byte
>> world instead of worrying about wide-character encodings.
> 
> ... grep can't handle unicode files. Good to know. iconv it is.

No, grep can't handle UTF-16 or any other wide-character format.  But it
CAN handle Unicode files, provided those files are encoded in multibyte
UTF-8.
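
The same convert-first idea, sketched in Python rather than with iconv
and grep.  grep_utf16 is a hypothetical helper, and the sketch assumes
the file starts with a byte order mark so that Python's "utf-16" codec
can pick the right byte order.  The last lines show why a byte-oriented
search of the raw file comes up empty:

```python
import re

def grep_utf16(path, pattern):
    """Return the lines of a UTF-16 file that match a regex."""
    regex = re.compile(pattern)
    # Decoding while reading plays the role of the iconv step; the
    # "utf-16" codec uses the byte order mark to pick the byte order.
    with open(path, encoding="utf-16") as handle:
        return [line.rstrip("\n") for line in handle
                if regex.search(line)]

# Every ASCII character is padded with a NUL byte in UTF-16, so the
# plain byte string never appears in the raw data.
raw = "needle".encode("utf-16-le")
found_in_raw_bytes = b"needle" in raw
print(raw)
print("byte search finds it:", found_in_raw_bytes)
```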

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


