[bt-devel] The unicode path issue in SWORD/Windows

Matthew Talbert ransom1982 at gmail.com
Wed Oct 28 11:36:16 MST 2009


Sorry folks for not replying to this earlier. I may have time later
today to reply in more detail, but I just wanted to clear this up.

> Last I heard, this was not fixed.  It has to do with the C runtime
> library that MSVC uses.  It does not support non-ASCII characters with
> the fopen (or whatever SWORD is using) function, whereas libc on Linux
> and Mac does, and even the runtimes with some other Windows compilers
> do, such as Borland.  This is why the bug did not appear for users of
> The SWORD Project for Windows.

This is not really accurate. It has to do with the C runtime that
*everything* on Windows ultimately uses, msvcrt.dll (and its later
variations). This runtime, just like libc, is actually completely
oblivious to Unicode. Using Borland makes no difference, as BibleCS
was found to have the same issue. There is simply no escaping it if
you are writing software on Windows.

In a nutshell, the issue is that, for backwards compatibility, and
also for the very good reason that UTF-8 hadn't yet been created when
Microsoft's new filesystem NTFS was designed, Windows exposes path
names in two ways. One is in an 8-bit encoding (the current "ANSI"
code page), the other is UTF-16 (this also applies to other system
APIs, such as retrieving environment variables, getting lists of font
names, etc). When you use the standard C functions for retrieving
lists of file names from the OS, you get the results in the 8-bit
format. This works most of the time, but it will *not* work if the
filename has characters that are unrepresentable in the 8-bit encoding
(which is not Unicode). In cases like this, the *only* way to retrieve
the correct filename is to ask for it via the 'wide-character' API.
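
To make the narrow/wide distinction concrete, here is a minimal sketch
of opening a file through the wide-character API. It assumes the caller
holds the path as UTF-8; the helper name fopen_utf8 is hypothetical and
is not part of SWORD. MultiByteToWideChar and _wfopen are the real
Win32/CRT calls; on other platforms the sketch simply falls back to
fopen.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not part of SWORD): open a file whose path is
 * given as a UTF-8 string. */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    FILE *fp = NULL;
    /* First pass: ask how many wchar_t units are needed (incl. NUL). */
    int plen = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    int mlen = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
    if (plen <= 0 || mlen <= 0)
        return NULL;

    wchar_t *wpath = malloc((size_t)plen * sizeof *wpath);
    wchar_t *wmode = malloc((size_t)mlen * sizeof *wmode);

    /* Second pass: convert UTF-8 to UTF-16, then use the wide fopen. */
    if (wpath && wmode &&
        MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, plen) > 0 &&
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mlen) > 0)
        fp = _wfopen(wpath, wmode);

    free(wpath);
    free(wmode);
    return fp;
#else
    /* Elsewhere the C library passes the bytes through unchanged. */
    return fopen(path, mode);
#endif
}
```

The same pattern (convert, then call the wide variant) would apply to
directory listing, where the narrow functions are the ones that lose
information.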

The reason the fix has not gone into the library yet is (I imagine)
that it is non-trivial. Plus there are hard decisions to make, like
whether to continue supporting the 9x versions of Windows (which don't
have the wide-character API). If you want to support both 9x and NT in
the same DLL, then you will have to dynamically call the functions at
runtime. The far simpler solution would be to allow this as a
compile-time option, but that would mean creating two DLLs, one for 9x
and one for NT. For Xiphos, we don't even have the option of
supporting 9x, so I'd be fine with compile-time options.
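
The two options could look roughly like this. This is only a sketch:
the macro SWORD_WIDE_API_ONLY and the function use_wide_api are
hypothetical names, not anything in SWORD. For the runtime branch it
uses a version check via GetVersion (whose high bit is set on 9x)
rather than looking up individual functions, which is a simplification
of the dynamic approach described above.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 when the wide-character API should be used, 0 otherwise.
 * SWORD_WIDE_API_ONLY is a hypothetical compile-time option. */
int use_wide_api(void)
{
#if !defined(_WIN32)
    /* Not Windows: narrow paths are passed through as plain bytes. */
    return 0;
#elif defined(SWORD_WIDE_API_ONLY)
    /* Compile-time option: an NT-only build, no check needed. */
    return 1;
#else
    /* Single DLL for 9x and NT: decide at runtime. The high-order bit
     * of GetVersion() is set on Windows 95/98/Me and clear on NT. */
    return (GetVersion() & 0x80000000u) == 0;
#endif
}
```

With the compile-time option the check disappears entirely, which is
why it is the simpler of the two to maintain.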

Matthew

PS I'll just mention here, as food for thought, that the problem
Microsoft is attempting to solve here is dealing with tons of legacy
files that may have filenames in who knows what encoding, and making
them consistently available via UTF-16. This is a problem that Linux
avoids partially by using UTF-8 for filenames by convention, but
basically it just ignores the problem. While the libc functions like
"open" are agnostic to the encoding, software that displays or
retrieves filenames to/from users is not. This makes it impossible,
for example, to move a file that has a non-UTF-8 filename containing
non-ASCII characters via Nautilus (GNOME file browser). Yes, I've
tried it. It also means that if you get the user to type in a
filename, and they type it in UTF-8 (which will be standard for most
distros), but the filename is actually in some other encoding, "open"
will *not* be able to open the file, because it is looking for an
entirely different (non-existent) file.
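
The last point can be reproduced with a few lines of C on a typical
Linux filesystem. This is a sketch under assumptions: the function name
and the file name are made up for illustration, and it assumes the
filesystem stores names as raw bytes (some filesystems, notably on
other operating systems, reject or normalize such names).

```c
#include <stdio.h>

/* Creates a file whose name is "café" encoded as ISO-8859-1, then tries
 * to open it using the UTF-8 spelling of the same visible name.
 * Returns 1 if the UTF-8 name fails while the original bytes succeed,
 * 0 otherwise, and -1 if the file could not be created at all. */
int encoding_mismatch_demo(void)
{
    const char *latin1_name = "enc_demo_caf\xe9.txt";     /* é = E9    */
    const char *utf8_name   = "enc_demo_caf\xc3\xa9.txt"; /* é = C3 A9 */

    FILE *fp = fopen(latin1_name, "w");
    if (!fp)
        return -1;
    fclose(fp);

    /* What the user would type in a UTF-8 locale. */
    FILE *wrong = fopen(utf8_name, "r");
    int utf8_failed = (wrong == NULL);
    if (wrong)
        fclose(wrong);

    /* The exact bytes that are stored on disk. */
    FILE *right = fopen(latin1_name, "r");
    int bytes_worked = (right != NULL);
    if (right)
        fclose(right);

    remove(latin1_name);
    return utf8_failed && bytes_worked;
}
```

Both names would look identical to a user once decoded in their
respective encodings, but to "open" they are simply two different byte
strings.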


