From: Prof Brian Ripley
Subject: Unicode and wchar_t
To: R core team
Date: Wed, 23 Dec 2020 17:02:48 +0000

I have been mulling over some issues and trying some fixes, but I think a wider review is necessary.

The support of UTF-8 and other MBCSs dates back to 2003 and is based on contributions from Ei-ji Nakama (whose English is sketchy, so I never understood some of the rationale). At that time Windows used UCS-2 internally without support for surrogate pairs (UTF-16), and Unicode allowed up to 32-bit code points, since limited to 21 bits. I doubt that points beyond the BMP (16-bit) are much used even now, certainly not beyond representing characters, although people seem to be testing them in weird ways (e.g. G. Csardi).

Much of the support for e.g. nchar, tolower, grep is based on converting to wide strings (wchar_t *), manipulating those and converting back, most often to UTF-8. A fundamental issue is that wchar_t is 16-bit, holding UTF-16 with surrogate pairs, on recent Windows, whereas it is 32-bit on all other known platforms -- most of the latter declare that a wide character is represented by the number of its Unicode point (UCS-4), but BSD-derived systems (notably macOS) and Solaris make no such declaration yet seem to use UCS-4. I think it is safe to assume that the representation is UCS-4 everywhere, maybe with a configure test.

Support for surrogate pairs has been introduced since 2003, but it has lots of problems (which got me started on looking into this). Not to mention unpaired surrogate points ....

Surrogate points can be represented in UTF-8, but it has (since 2003) been made clear that they are invalid there, and most people think they are also invalid in UCS-4.

Currently we use a mixture of system functions such as mbstowcs/wcstombs, which have varying degrees of correctness (and of support for surrogate points), and our own functions to convert to/from UTF-8 (which allow for marked encodings, at least UTF-8). Not all of the code allowed for marked encodings, more for UTF-8 than for Latin-1.
Then system functions/macros are used for some of the operations on wide characters (the list checked for is in m4/R.m4), although Ei-ji Nakama wrote replacements for some of these that are used on some platforms. A few things (notably using PCRE) are done by converting to UTF-8. However, those modified functions are not used by TRE. We have almost no idea of the quality of those system functions on 'new' platforms, e.g. musl-based Linux. And for *BSD I suspect they are as bad as they were in 2003.

Now that most platforms use UTF-8 (and maybe Windows will soon, after Tomas' efforts), we have other possibilities. If almost all character strings were in UTF-8 we could convert the remainder into UTF-8 and work in UTF-8, converting (by our own functions) to/from UCS-4 if needed. This could avoid surrogate points and dodgy system functions entirely.

One possibility is to explore using more of ICU, which I would expect to mean introducing a C++11 dependence (which seems OK now). Another is to use our own versions of those system functions, taken from glibc or elsewhere. Also, both provide well-tested regex implementations we could consider.

My aim, during the CRAN shutdown and in a lull in work on M1 Macs, is to get this documented, some holes filled and bugs fixed: I've convinced myself that is worth the effort. But I expect a better longer-term solution would be to use ICU where available (and maybe, or maybe not, make it available on Windows).

Brian

-- 
Brian D. Ripley, ripley@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford