Issues with 2020-12 code handling encodings
===========================================

1) In many cases inputs marked as Latin-1 were not considered, or were
only considered in MBCS locales.  This should be easy to fix: Latin-1
maps directly onto the first 256 Unicode code points, so
marked-as-Latin-1 input can be converted trivially and then treated as
marked-as-UTF-8.  However, 'Latin-1' often really means CP1252 on
Windows, so it would be better to translate explicitly to UTF-8
(translateCharUTF8 does so and may well suffice).

2) Much code assumes that advancing a wchar_t* string is
character-by-character and that each element represents a single
character.  This is a C99/POSIX assumption for wcwidth and the
classification functions iswctype/iswalpha/....  That works for UCS-4
and UCS-2 (as used by all Windows pre-W2000 and by default for a while
thereafter) but not for UTF-16.  Examples are the TRE-based regex and
agrep code, and abbrev().

A smaller amount of code translates everything not in a native SBCS to
UTF-8 and advances character-by-character there.  This includes the
'fixed' and PCRE regex code.

3) There are (at least on some platforms) substitutes for the
single-wide-character functions in rlocale.[ch], but these are not
used for the TRE-based regexp engine, which uses the system versions.

4) There is an assumption that on a Unix-alike wchar_t is encoded in
UCS-4.  Some platforms declare this via the C99 define
__STDC_ISO_10646__, but macOS and Solaris do not, and most likely *BSD
do not (at least FreeBSD, from which macOS is derived, used not to).
It is assumed that conversions to/from UTF-8 are compatible with those
to/from the native encoding and with the system classification/width
functions.  At times warnings have been given but, with no known
exceptions, they are currently silenced.

5) Conversion functions have proliferated over the years.
We currently use (in a Unix-alike, checked for macOS) system functions:

mbrtowc mbstowcs wcrtomb wcstombs

It is unclear what mbrtowc does on Windows if a MBCS character needs
to be stored as a surrogate pair -- which probably cannot happen until
we get UTF-8 locales.

wcrtomb is used to check validity in plotmath.c and to convert to
native in rlocale.c.

wcstombs is used when reading from pipes on Windows, and to convert
back to native in make.names(), tolower() and chartr().

Rf_mbcsToUcs2 converts from native or marked UTF-8 to R_ucs2_t
  (unsigned short) by iconv.
  Used in grDevices.

Rf_mbrtowc (aka Mbrtowc) converts native to wchar_t* (a wrapper for
  system mbrtowc that reports failure as an error).
  Used in character.c, connections.c, parser, grep.c, plotmath.c.

Rf_mbtoucs converts from native to UCS-4 by iconv (should be mbs).
  Used in parser (with WC_NOT_UNICODE) and engine.c (converting pch
  to glyph).

Rf_ucstomb converts from UCS-4 to native by iconv (should be mbs).
  Used in devX11.c.

Rf_ucstoutf8 converts a single char from UCS-4 (unsigned int) to
  UTF-8 by iconv.
  Used in grDevices and exported.

Rf_utf8toucs converts a single UTF-8 character to wchar_t (high
  surrogate).
  Used in character.c, engine.c, printutils.c.

Rf_utf8toucs32 converts a single UTF-8 character to UCS-4 (to
  R_wchar_t, defined in rlocale.h to be unsigned int).
  Used in character.c, engine.c, printutils.c, sysutils.c.

Rf_utf8towcs converts a UTF-8 string to wchar_t*, possibly to UTF-16.
  Used in grDevices, character.c, connections.c, grep.c and on
  Windows.

Rf_wcstoutf8 converts by iconv from UCS-4 (even with surrogate
  points) or UTF-16.
  Used in parser, grep.c, character.c and on Windows.

Rf_wtransChar converts a CHARSXP, including marked encodings, to
  wchar_t by iconv.  Assumes UCS-4 on Unix, UCS-2 on Windows (so fails
  with surrogate pairs).
  Used in agrep.c, grep.c (for the TRE engine), character.c, and on
  Windows for system functions in sysutils.c and in package utils.
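Several of the entries above differ mainly in whether the target is
UCS-4 or (possibly) UTF-16.  As a rough illustration of what changes
when wchar_t is 16 bits wide, here is a minimal sketch of the UTF-16
surrogate arithmetic.  It is not R's implementation: the function
names ucs4_to_utf16, utf16_pair_to_ucs4 and utf16_nchars are
hypothetical, and uint16_t/uint32_t stand in for a 16-bit wchar_t and
for R_wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of R's API. */

/* Encode one code point as UTF-16.  Returns the number of 16-bit
   units written to out: 1 for the BMP, 2 for a surrogate pair. */
static size_t ucs4_to_utf16(uint32_t cp, uint16_t out[2])
{
    /* Only valid scalar values: in range and not itself a surrogate. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}

/* Recombine a high/low surrogate pair into one code point. */
static uint32_t utf16_pair_to_ucs4(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t) hi - 0xD800) << 10)
                   + ((uint32_t) lo - 0xDC00);
}

/* Count characters, not elements, in n UTF-16 units: every unit
   starts a character except a low surrogate. */
static size_t utf16_nchars(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (!(s[i] >= 0xDC00 && s[i] <= 0xDFFF))
            count++;
    return count;
}
```

Code that steps through such a buffer one element at a time sees two
elements for U+1F600, which is the failure mode noted for the
UCS-2-assuming converters.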

Windows only:

Rmbrtowc (aka mbrtowc)
Rwcrtomb (aka wcrtomb)
  Wrappers that allow FAKE_UTF8 to assume the current locale is UTF-8.
  They date back to 2005 and could be removed.

----------------------------------------------------------------------

Inventory:

character.c
-----------

nchar()
  nchar(, "w") looks up character widths, and on Windows that is
  limited to the BMP.

substr()
substr<-()
  These skip along by character in a MBCS or for a UTF-8-marked
  string.

startsWith()
  Works in bytes except in a non-UTF-8 MBCS.

abbrev()
makenames()
  Translates to the native encoding, works in wchar_t in a MBCS.

tolower/upper()
chartr()
  Translate UTF-8 and Latin-1 to wchar_t (which needs Unicode wide
  characters), the rest to the current charset.

strtrim()
  Converts all inputs to the native encoding, then to wchar_t* to get
  the display widths of chars.  Could do Latin-1 and UTF-8 directly.

strtoi()
  Assumes ASCII input.

strrep()
  Works in bytes.

gram.y
------

Encodings come up in a few places:

- tokenizing in a MBCS locale, which uses mbrtowc to read by char.

- finding not-necessarily-ASCII 'spaces', which relies on
  __STDC_ISO_10646__ on a Unix-alike.

- handling \u and \U character-string escapes, which creates
  UTF-8-marked CHARSXPs.  This has a non-__STDC_ISO_10646__ branch,
  probably never used.

grep.c
------

There are native modes for all three options; otherwise the 'fixed'
and PCRE options work in UTF-8, the TRE one in wchar_t*.

use_UTF8 is set for any UTF-8-marked inputs and, for PCRE, in a MBCS
locale.  For do_grep/gsub/regexpr, use_WC is set when use_UTF8 would
be.

printutils.c
------------

EncodeString/Rstrlen/Rstrwid use UTF-8 for some output on Windows and
wchar_t for \u and \U output escaping.

FIXME: Casts to wint_t when using iswprint, or calls with half of a
surrogate pair, and wint_t may not be big enough.

rlocale.c
---------

This supplies iswxxxxx functions for values up to 0x10fffd, but wint_t
is unsigned short on Windows (at least on MinGW-W64).
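The width mismatch noted above can be made concrete.  The following
is a minimal sketch, not code from rlocale.c: wint16_t is a
hypothetical stand-in for a 16-bit wint_t, and the three function
names are invented for illustration.  It shows that a plain cast of a
code point above the BMP yields a different, valid-looking value, and
one way a caller might guard against that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit wint_t (as reported for
   MinGW-W64); the real type is whatever <wchar.h> defines. */
typedef uint16_t wint16_t;

/* A code point is safe to pass through a 16-bit type only if it is
   in the BMP and is not one half of a surrogate pair. */
static int fits_wint16(uint32_t cp)
{
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* What an unchecked cast does: the high bits are silently dropped. */
static wint16_t cast_to_wint16(uint32_t cp)
{
    return (wint16_t) cp;
}

/* A guarded conversion: refuse rather than classify the wrong
   character. */
static wint16_t to_wint16_checked(uint32_t cp)
{
    assert(fits_wint16(cp));
    return (wint16_t) cp;
}
```

For example, cast_to_wint16(0x1F600) gives 0xF600, a private-use code
point, so a classification call on the result would answer a question
about the wrong character.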

character.c uses iswspace iswdigit iswalpha
gram.c uses blankwct, iswalpha iswalnum
plotmath.c uses iswdigit
util.c uses iswspace
gnuwin32/{dos_wglob.c,extra.c} use iswalpha
modules/X11/dataentry.c uses iswspace

__STDC_ISO_10646__ is defined for macOS, FreeBSD and Solaris in
character.c, gram.c and printutils.c, but it is currently unused in
the first and suppresses warnings in the other two.

----------------------------------------------------------------------

case tables