From: Prof Brian Ripley
Subject: Unicode and wchar_t
To: R core team
Date: Wed, 23 Dec 2020 17:02:48 +0000

I have been mulling over some issues and trying some fixes, but I think a wider review is necessary.

The support of UTF-8 and other MBCSs dates back to 2003 and is based on contributions from Ei-ji Nakama (whose English is sketchy, so I never understood some of the rationale). At that time Windows used UCS-2 internally without support for surrogate pairs (UTF-16), and Unicode allowed up to 32-bit code points, since limited to 21 bits. I doubt that points beyond the BMP (16-bit) are much used even now, certainly not beyond representing characters, although people seem to be testing them in weird ways (e.g. G. Csardi).

Much of the support for e.g. nchar, tolower, grep is based on converting to wide strings (wchar_t *), manipulating those and converting back, most often to UTF-8. A fundamental issue is that wchar_t is 16-bit, holding UTF-16 with surrogate pairs, on recent Windows, whereas it is 32-bit on all other known platforms -- most of the latter declare that a wide character is represented by the number of its Unicode point (UCS-4), but BSD-derived systems (notably macOS) and Solaris make no such declaration yet seem to use UCS-4. I think it is safe to assume that the representation is UCS-4 everywhere, maybe with a configure test.

Support for surrogate pairs has been introduced since 2003, but it has lots of problems (which got me started on looking into this). Not to mention unpaired surrogate points ....

Surrogate points can be represented in UTF-8, but it has (since 2003) been made clear that they are invalid there, and most people think they are also invalid in UCS-4.

Currently we use a mixture of system functions such as mbstowcs/wcstombs, which have varying degrees of correctness (and of support for surrogate points), and our own functions to convert to/from UTF-8 (which allow for marked encodings, at least UTF-8). Not all of the code allowed for marked encodings, more for UTF-8 than for Latin-1.
Then system functions/macros are used for some of the operations on wide characters (the list checked for is in m4/R.m4), although Ei-ji Nakama wrote replacements for some of these that are used on some platforms. A few things (notably using PCRE) are done by converting to UTF-8. However, those modified functions are not used by TRE. We have almost no idea of the quality of those system functions on 'new' platforms, e.g. musl-based Linux. And for *BSD I suspect they are as bad as they were in 2003.

Now that most platforms use UTF-8 (and maybe Windows will soon, after Tomas' efforts), we have other possibilities. If almost all character strings were in UTF-8 we could convert the remainder into UTF-8 and work in UTF-8, converting (by our own functions) to/from UCS-4 if needed. This could avoid surrogate points and dodgy system functions entirely.

One possibility is to explore using more of ICU, which I would expect to mean introducing a C++11 dependence (which seems OK now). Another is to use our own versions of those system functions, taken from glibc or elsewhere. Also, both provide well-tested regex implementations we could consider.

My aim, during the CRAN shutdown and in a lull in work on M1 Macs, is to get this documented, some holes filled and bugs fixed: I've convinced myself that is worth the effort. But I expect a better longer-term solution would be to use ICU where available (and maybe, or maybe not, make it available on Windows).

Brian

-- 
Brian D. Ripley, ripley@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford