Issues with 2020-12 code handling encodings
===========================================

1) In many cases inputs marked as Latin-1 were not considered or only
considered in MBCS locales.  This should be easy to fix as every
Latin-1 character has the same code point in Unicode, so a
marked-as-Latin-1 string can in principle be handled like a
marked-as-UTF-8 one.  However, 'Latin-1' often really means CP1252 on
Windows so it would be better to translate explicitly to UTF-8
(translateCharUTF8 does so and may well suffice).
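
As an illustration of why marked-as-Latin-1 bytes cannot simply be
relabelled as UTF-8, here is a minimal sketch of a byte-level
translation.  The helper latin1_to_utf8 is hypothetical and is not the
code behind translateCharUTF8; it assumes true ISO 8859-1 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of R: convert a NUL-terminated
   ISO 8859-1 string to UTF-8.  'out' must have room for
   2 * strlen(in) + 1 bytes.  Returns the number of bytes written,
   excluding the terminating NUL. */
static size_t latin1_to_utf8(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *) in;
    size_t n = 0;
    for (; *p; p++) {
        if (*p < 0x80) {
            out[n++] = (char) *p;                    /* ASCII: unchanged */
        } else {
            out[n++] = (char) (0xC0 | (*p >> 6));    /* lead byte, 0xC2 or 0xC3 */
            out[n++] = (char) (0x80 | (*p & 0x3F));  /* continuation byte */
        }
    }
    out[n] = '\0';
    return n;
}
```

The Latin-1 bytes 63 61 66 E9 ("café") become 63 61 66 C3 A9.  For
CP1252 input this mapping is wrong for most bytes in 0x80-0x9F (for
example 0x80 is U+20AC there, not a C1 control), which is the reason
for preferring an explicit conversion.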


2) Much code assumes that advancing a wchar_t* string is
character-by-character and that each element represents a single
character.  This is a C99/POSIX assumption for wcwidth and the
classification functions iswctype/iswalpha/....  That works for UCS-4
and UCS-2 (as used by all Windows pre-W2000 and by default for a while
thereafter) but not for UTF-16.

Examples are

the TRE-based regex and agrep code.
abbrev()

A smaller amount of code translates everything not in a native SBCS to
UTF-8 and advances character-by-character there.  This includes the
'fixed' and PCRE regex code.
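
To illustrate why the one-element-per-character assumption in this
item fails for UTF-16, the following sketch encodes a single code
point as UTF-16.  The helper encode_utf16 is hypothetical and not
taken from the R sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to 'out' (1 or 2). */
static int encode_utf16(uint32_t cp, uint16_t out[2])
{
    /* must be a valid scalar value, not itself a surrogate */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out[0] = (uint16_t) cp;                   /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

U+00E9 takes one unit, but U+1F600 takes the two units D83D DE00, so
code that steps through a 16-bit wchar_t* one element at a time sees
two 'characters', neither of which is valid on its own.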


3) There are (at least on some platforms) substitutes for the
single-wide-character functions in rlocale.[ch] but these are not used
for the TRE-based regex engine, which uses the system versions.


4) There is an assumption that on a Unix-alike wchar_t is encoded in
UCS-4.  Some platforms declare this via C99 define __STDC_ISO_10646__
but macOS and Solaris do not, and most likely *BSD do not (at least
FreeBSD, from which macOS is derived, used not to).  It is assumed
that conversions to/from UTF-8 are compatible with those to/from the
native encoding and with the system classification/width functions.
At times warnings have been given, but with no known exceptions they
are currently silenced.


5) Conversion functions have proliferated over the years.  We currently
use (on a Unix-alike, checked for macOS)

system functions: mbrtowc mbstowcs wcrtomb wcstombs

It is unclear what mbrtowc does on Windows if a MBCS character needs
to be stored as a surrogate pair -- which probably cannot happen until
we get UTF-8 locales.

wcrtomb is used to check validity in plotmath.c and to convert to
native in rlocale.c.

wcstombs is used when reading from pipes on Windows, and to convert
back to native in make.names(), tolower() and chartr().


Rf_mbcsToUcs2
converts from native or marked UTF-8 to R_ucs2_t (unsigned short) by iconv.
Used in grDevices

Rf_mbrtowc (aka Mbrtowc)
converts native to wchar_t*
(a wrapper for system mbrtowc that reports failure as error.)
Used in character.c, connections.c, parser, grep.c, plotmath.c

Rf_mbtoucs
converts from native to UCS-4 by iconv (should be mbs)
Used in parser (with WC_NOT_UNICODE) and engine.c (converting pch to glyph).

Rf_ucstomb
converts from UCS-4 to native by iconv (should be mbs)
Used in devX11.c

Rf_ucstoutf8
converts a single char from UCS-4 (unsigned int) to UTF-8 by iconv.
Used in grDevices and exported.

Rf_utf8toucs
converts a single UTF-8 character to wchar_t (high surrogate)
Used in character.c, engine.c, printutils.c

Rf_utf8toucs32
converts a single UTF-8 character to UCS-4
(to R_wchar_t, defined in rlocale.h to be unsigned int)
Used in character.c, engine.c, printutils.c, sysutils.c

Rf_utf8towcs (to wchar_t)
converts a UTF-8 string to wchar_t*, possibly to UTF-16.
Used in grDevices, character.c, connections.c, grep.c and on Windows.

Rf_wcstoutf8
converts by iconv from UCS-4 (even with surrogate points) or UTF-16.
Used in parser, grep.c, character.c and on Windows.

Rf_wtransChar
converts a CHARSXP including marked encodings, to wchar_t by iconv.
Assumes UCS-4 on Unix, UCS-2 on Windows (so fails with surrogate pairs).
Used in agrep.c, grep.c (for TRE engine), character.c and on Windows for
system functions in sysutils.c and in package utils.

Windows only:
Rmbrtowc (aka mbrtowc) Rwcrtomb (aka wcrtomb)
wrappers that allow FAKE_UTF8 to assume current locale is UTF-8.  These
date back to 2005 and could be removed.

----------------------------------------------------------------------
Inventory:

character.c
-----------

nchar()
nchar(, "w") looks up character widths and on Windows that is limited
to the BMP.

substr() substr<-()
These skip along by character in a MBCS or for a UTF-8-marked string.

startsWith()
Works in bytes except in a non-UTF-8 MBCS.

abbrev()
make.names()
translates to the native encoding, works in wchar_t in a MBCS.

tolower/upper()
chartr()
translate UTF-8 and Latin-1 to wchar_t (which needs Unicode wide
  characters), rest to current charset

strtrim()
Converts all inputs to native encoding, then to wchar_t* to get
display widths of chars.  Could do Latin-1 and UTF-8 directly.

strtoi()
Assumes ASCII input.

strrep()
Works in bytes.


gram.y
------

Encodings come up in a few places:
- tokenizing in a MBCS locale, which uses mbrtowc to read by char.
- finding not-necessarily-ASCII 'spaces', which relies on
  __STDC_ISO_10646__ on a Unix-alike.
- handling \u and \U character-string escapes, which creates
  UTF-8-marked CHARSXPs.  This has a non __STDC_ISO_10646__ branch,
  probably never used.
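
The read-by-character pattern with mbrtowc mentioned in the first
point can be sketched as below.  This is not the tokenizer code from
gram.y; count_chars is a hypothetical helper that only counts
characters in the current locale's encoding.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in 's', interpreted in the
   current locale's encoding, by stepping with mbrtowc.  Returns
   (size_t) -1 on an invalid or incomplete multibyte sequence. */
static size_t count_chars(const char *s)
{
    assert(s != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);      /* initial conversion state */
    size_t left = strlen(s), n = 0;
    while (left > 0) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, s, left, &st);
        if (used == (size_t) -1 || used == (size_t) -2)
            return (size_t) -1;     /* invalid or truncated input */
        if (used == 0)
            break;                  /* NUL character: defensive only */
        s += used;                  /* advance by bytes consumed */
        left -= used;
        n++;
    }
    return n;
}
```

The count is of mbrtowc results, so it equals the number of characters
only where each character fits in one wchar_t.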


grep.c
------

There are native modes for all three options: otherwise the 'fixed'
and PCRE options work in UTF-8, the TRE one in wchar_t*.

use_UTF8 is set for any UTF-8-marked inputs and, for PCRE, in a MBCS
locale.

For do_grep/gsub/regexpr use_WC is set when use_UTF8 would be.


printutils.c
------------

EncodeString/Rstrlen/Rstrwid use UTF-8 for some output on Windows and
wchar_t for \u and \U output escaping.

FIXME: This casts to wint_t when using iswprint, which may not be big
enough, or calls it with half of a surrogate pair.


rlocale.c
---------

This supplies iswxxxxx functions for values up to 0x10fffd, but wint_t
is unsigned short on Windows (at least on MinGW-W64).

character.c uses iswspace iswdigit iswalpha
gram.c uses blankwct, iswalpha iswalnum
plotmath.c uses iswdigit
util.c uses iswspace
gnuwin32/{dos_wglob.c,extra.c} use iswalpha
modules/X11/dataentry.c uses iswspace


__STDC_ISO_10646__ is defined for macOS, FreeBSD and Solaris in
character.c, gram.c and printutils.c but currently unused in the first
and suppresses warnings in the other two.

----------------------------------------------------------------------
case tables