std.encoding
Classes and functions for handling and transcoding between various encodings.
For cases where the encoding is known at compile-time, functions are provided
for arbitrary encoding and decoding of characters, arbitrary transcoding
between strings of different type, as well as validation and sanitization.
Encodings currently supported are UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1
(also known as LATIN-1), and WINDOWS-1252.
- The type AsciiChar represents an ASCII character.
- The type AsciiString represents an ASCII string.
- The type Latin1Char represents an ISO-8859-1 character.
- The type Latin1String represents an ISO-8859-1 string.
- The type Windows1252Char represents a Windows-1252 character.
- The type Windows1252String represents a Windows-1252 string.
For cases where the encoding is not known at compile-time, but is
known at run-time, the module provides the abstract class EncodingScheme
and its subclasses. To construct a run-time encoder/decoder, write, for
example:
auto e = EncodingScheme.create("utf-8");
This library supplies EncodingScheme subclasses for ASCII,
ISO-8859-1 (also known as LATIN-1), WINDOWS-1252, UTF-8, and (on
little-endian architectures) UTF-16LE and UTF-32LE; or (on big-endian
architectures) UTF-16BE and UTF-32BE.
This library provides a mechanism whereby other modules may add
EncodingScheme subclasses for any other encoding.
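As a minimal sketch of the run-time route described above, the following program looks up the UTF-8 scheme by name, encodes one code point into a caller-supplied buffer, and decodes it again. The buffer size of 4 and the variable names are illustrative choices, not part of the module.

```d
import std.encoding;

void main()
{
    // Look up an encoding scheme by name at run time.
    EncodingScheme e = EncodingScheme.create("utf-8");

    // Encode U+20AC (euro sign) into a caller-supplied buffer.
    ubyte[4] buffer;
    size_t n = e.encode(cast(dchar) 0x20AC, buffer[]);
    assert(n == 3); // U+20AC occupies three bytes in UTF-8

    // decode() consumes the bytes it reads from the front of the slice.
    const(ubyte)[] bytes = buffer[0 .. n];
    dchar c = e.decode(bytes);
    assert(c == 0x20AC);
    assert(bytes.length == 0);
}
```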
License: Boost License 1.0.
Authors: Janice Caron
Source: std/encoding.d
- Special value (INVALID_SEQUENCE) returned by safeDecode
enum AsciiChar;
alias AsciiString;
- Defines various character sets.
enum Latin1Char;
- Defines a Latin1-encoded character.
alias Latin1String;
- Defines a Latin1-encoded string (as an array of immutable(Latin1Char)).
enum Windows1252Char;
- Defines a Windows1252-encoded character.
alias Windows1252String;
- Defines a Windows1252-encoded string (as an array of immutable(Windows1252Char)).
bool isValidCodePoint(dchar c);
- Returns true if c is a valid code point.
Note that this includes the non-character code points U+FFFE and U+FFFF,
since these are valid code points (even though they are not valid
characters).
Supersedes:
This function supersedes std.utf.startsValidDchar().
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
dchar c | the code point to be tested
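A short sketch of the distinction drawn above between valid code points and valid characters. The surrogate and out-of-range values are added here for contrast; they are standard Unicode facts rather than something this section states.

```d
import std.encoding;

void main()
{
    assert(isValidCodePoint('A'));

    // Non-characters are still valid code points.
    assert(isValidCodePoint(cast(dchar) 0xFFFE));
    assert(isValidCodePoint(cast(dchar) 0xFFFF));

    // Surrogates and values above U+10FFFF are not code points.
    assert(!isValidCodePoint(cast(dchar) 0xD800));
    assert(!isValidCodePoint(cast(dchar) 0x110000));
}
```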
string encodingName(T)();
- Returns the name of an encoding.
The type of encoding cannot be deduced. Therefore, it is necessary to
explicitly specify the encoding type.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Examples:
assert(encodingName!(Latin1Char) == "ISO-8859-1");
bool canEncode(E)(dchar c);
- Returns true iff it is possible to represent the specified code point
in the encoding.
The type of encoding cannot be deduced. Therefore, it is necessary to
explicitly specify the encoding type.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Examples:
assert(canEncode!(Latin1Char)('A'));
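To make the repertoire differences concrete, here is a small sketch that queries several of the supported encodings. The code points chosen (U+00A0, U+20AC) are illustrative.

```d
import std.encoding;

void main()
{
    // U+00A0 (no-break space) is outside ASCII but inside Latin-1.
    assert(canEncode!(AsciiChar)('A'));
    assert(!canEncode!(AsciiChar)('\u00A0'));
    assert(canEncode!(Latin1Char)('\u00A0'));

    // U+20AC (euro sign) exists in Windows-1252 but not in Latin-1.
    assert(canEncode!(Windows1252Char)('\u20AC'));
    assert(!canEncode!(Latin1Char)('\u20AC'));

    // Nothing can encode a value beyond the Unicode range.
    assert(!canEncode!(char)(cast(dchar) 0x110000));
}
```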
bool isValidCodeUnit(E)(E c);
- Returns true if the code unit is legal. For example, the byte 0x80 would
not be legal in ASCII, because ASCII code units must always be in the range
0x00 to 0x7F.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
c | the code unit to be tested
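The ASCII case described above, plus a few others, can be checked directly. This is a sketch; the casts are needed only because the values are written as integer literals.

```d
import std.encoding;

void main()
{
    // ASCII code units must lie in 0x00 .. 0x7F.
    assert(isValidCodeUnit(cast(AsciiChar) 0x7F));
    assert(!isValidCodeUnit(cast(AsciiChar) 0x80));

    // 0xC0 can never appear in well-formed UTF-8.
    assert(!isValidCodeUnit(cast(char) 0xC0));

    // A surrogate is a legal UTF-16 code unit but not a legal UTF-32 one.
    assert(isValidCodeUnit(cast(wchar) 0xD800));
    assert(!isValidCodeUnit(cast(dchar) 0xD800));
}
```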
bool isValid(E)(const(E)[] s);
- Returns true if the string is encoded correctly.
Supersedes:
This function supersedes std.utf.validate(). Note, however, that this
function returns a bool indicating whether the input was valid,
whereas the older function would throw an exception.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string to be tested
size_t validLength(E)(const(E)[] s);
- Returns the length of the longest possible substring, starting from
the first code unit, which is validly encoded.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string to be tested
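A sketch showing isValid and validLength side by side. The malformed input is a hypothetical example: three ASCII letters followed by a UTF-8 sequence for U+20AC with its final byte cut off.

```d
import std.encoding;

void main()
{
    string good = "\u20AC100";
    assert(isValid(good));
    assert(validLength(good) == good.length);

    // "abc" followed by a truncated three-byte sequence.
    string bad = "abc\xE2\x82";
    assert(!isValid(bad));
    assert(validLength(bad) == 3); // only "abc" is validly encoded
}
```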
immutable(E)[] sanitize(E)(immutable(E)[] s);
- Sanitizes a string by replacing malformed code unit sequences with valid
code unit sequences. The result is guaranteed to be valid for this encoding.
If the input string is already valid, this function returns the original;
otherwise it constructs a new string by replacing all illegal code unit
sequences with the encoding's replacement character. Invalid sequences are
replaced with the Unicode replacement character (U+FFFD) if the
character repertoire contains it, and with '?' otherwise.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string to be sanitized
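The two behaviours described above (valid input returned unchanged, invalid input repaired) can be sketched as follows. The malformed bytes are an illustrative choice.

```d
import std.encoding;

void main()
{
    // Already valid: the very same string is handed back.
    string ok = "already valid";
    assert(sanitize(ok) is ok);

    // "\xF0\x80" is a malformed UTF-8 sequence.
    string dirty = "hello \xF0\x80world";
    assert(!isValid(dirty));

    string clean = sanitize(dirty);
    assert(isValid(clean));
    // UTF-8 can represent U+FFFD, so that is the replacement used.
    assert(clean == "hello \uFFFDworld");
}
```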
size_t firstSequence(E)(const(E)[] s);
- Returns the length of the first encoded sequence.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string to be sliced
size_t lastSequence(E)(const(E)[] s);
- Returns the length of the last encoded sequence.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string to be sliced
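A brief sketch of firstSequence and lastSequence. Lengths are counted in code units of the string's own encoding, so the same code point gives different answers in UTF-8 and UTF-16.

```d
import std.encoding;

void main()
{
    // U+20AC takes three UTF-8 code units.
    assert(firstSequence("\u20AC1000") == 3);
    assert(lastSequence("1000\u20AC") == 3);

    // An ASCII letter takes one.
    assert(firstSequence("hello") == 1);

    // In UTF-16, a code point outside the BMP is a surrogate pair.
    assert(firstSequence("\U0001F600x"w) == 2);
}
```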
sizediff_t index(E)(const(E)[] s, int n);
- Returns the array index at which the (n+1)th code point begins.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Supersedes:
This function supersedes std.utf.toUTFindex().
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string to be counted
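Because n counts code points while the result counts code units, the two diverge as soon as a multi-unit sequence appears. A small sketch with illustrative strings:

```d
import std.encoding;

void main()
{
    // The second code point ('1') starts after the 3-byte euro sign.
    assert(index("\u20AC100", 1) == 3);

    // 'h' is 1 byte and U+00E4 is 2 bytes, so the third code point starts at 3.
    assert(index("h\u00E4llo", 2) == 3);

    // The first code point always starts at offset 0.
    assert(index("hello", 0) == 0);
}
```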
dchar decode(S)(ref S s);
- Decodes a single code point.
This function removes one or more code units from the start of a string,
and returns the decoded code point which those code units represent.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Supersedes:
This function supersedes std.utf.decode(). Note, however, that the
function codePoints() supersedes it more conveniently.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string whose first code point is to be decoded
dchar decodeReverse(E)(ref const(E)[] s);
- Decodes a single code point from the end of a string.
This function removes one or more code units from the end of a string,
and returns the decoded code point which those code units represent.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string whose last code point is to be decoded
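Both decode and decodeReverse take their argument by reference and shrink it. The sketch below shows the slice before and after each call. The second string is declared as const(char)[] so that it matches the ref const(E)[] parameter exactly.

```d
import std.encoding;

void main()
{
    // decode() consumes from the front.
    string s = "\u20AC1";
    dchar first = decode(s);
    assert(first == 0x20AC);
    assert(s == "1");

    // decodeReverse() consumes from the back.
    const(char)[] t = "1\u20AC";
    dchar last = decodeReverse(t);
    assert(last == 0x20AC);
    assert(t == "1");
}
```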
dchar safeDecode(S)(ref S s);
- Decodes a single code point. The input does not have to be valid.
This function removes one or more code units from the start of a string,
and returns the decoded code point which those code units represent.
This function will accept an invalidly encoded string as input.
If an invalid sequence is found at the start of the string, this
function will remove it, and return the value INVALID_SEQUENCE.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string whose first code point is to be decoded
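A sketch of safeDecode on input that decode would reject. The input is a hypothetical string beginning with a stray UTF-8 continuation byte.

```d
import std.encoding;

void main()
{
    // 0x80 cannot start a UTF-8 sequence.
    string s = "\x80abc";
    immutable before = s.length;

    dchar c = safeDecode(s);
    assert(c == INVALID_SEQUENCE);
    assert(s.length < before); // the offending input was removed

    // Valid input decodes as usual.
    string ok = "A";
    assert(safeDecode(ok) == 'A');
    assert(ok.length == 0);
}
```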
size_t encodedLength(E)(dchar c);
- Returns the number of code units required to encode a single code point.
The input to this function MUST be a valid code point.
This is enforced by the function's in-contract.
The type of the output cannot be deduced. Therefore, it is necessary to
explicitly specify the encoding as a template parameter.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
c | the code point to be encoded
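Since the encoding has to be named explicitly, the same code point can be measured against several encodings. A sketch, with the two code points chosen for illustration:

```d
import std.encoding;

void main()
{
    dchar euro = cast(dchar) 0x20AC;   // inside the Basic Multilingual Plane
    dchar emoji = cast(dchar) 0x1F600; // outside it

    assert(encodedLength!(char)(euro) == 3);   // UTF-8
    assert(encodedLength!(wchar)(euro) == 1);  // UTF-16
    assert(encodedLength!(dchar)(euro) == 1);  // UTF-32

    assert(encodedLength!(char)(emoji) == 4);
    assert(encodedLength!(wchar)(emoji) == 2); // surrogate pair
}
```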
E[] encode(E)(dchar c);
- Encodes a single code point.
This function encodes a single code point into one or more code units.
It returns a string containing those code units.
The input to this function MUST be a valid code point.
This is enforced by the function's in-contract.
The type of the output cannot be deduced. Therefore, it is necessary to
explicitly specify the encoding as a template parameter.
Supersedes:
This function supersedes std.utf.encode(). Note, however, that the
function codeUnits() supersedes it more conveniently.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
c | the code point to be encoded
size_t encode(E)(dchar c, E[] array);
- Encodes a single code point into an array.
This function encodes a single code point into one or more code units.
The code units are stored in a user-supplied fixed-size array,
which must be passed by reference.
The input to this function MUST be a valid code point.
This is enforced by the function's in-contract.
The type of the output cannot be deduced. Therefore, it is necessary to
explicitly specify the encoding as a template parameter.
Supersedes:
This function supersedes std.utf.encode(). Note, however, that the
function codeUnits() supersedes it more conveniently.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
c | the code point to be encoded
Returns:
the number of code units written to the array
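The string-returning and array-writing forms of encode can be sketched as below. The two-element buffer and the variable names are illustrative, and the expected surrogate values follow from the UTF-16 definition rather than from this section.

```d
import std.encoding;

void main()
{
    // Form 1: the code units are returned as a new array.
    char[] units = encode!(char)(cast(dchar) 0x20AC);
    assert(units == "\u20AC"); // bytes E2 82 AC

    // Form 2: the code units are written into a caller-supplied array.
    wchar[2] buf;
    wchar[] slot = buf[];
    size_t n = encode!(wchar)(cast(dchar) 0x1F600, slot);
    assert(n == 2);            // surrogate pair
    assert(buf[0] == 0xD83D);
    assert(buf[1] == 0xDE00);
}
```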
void encode(E)(dchar c, void delegate(E) dg);
- Encodes a single code point to a delegate.
This function encodes a single code point into one or more code units.
The code units are passed one at a time to the supplied delegate.
The input to this function MUST be a valid code point.
This is enforced by the function's in-contract.
The type of the output cannot be deduced. Therefore, it is necessary to
explicitly specify the encoding as a template parameter.
Supersedes:
This function supersedes std.utf.encode(). Note, however, that the
function codeUnits() supersedes it more conveniently.
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
c | the code point to be encoded
CodePoints!(E) codePoints(E)(immutable(E)[] s);
- Returns a foreachable struct which can bidirectionally iterate over all
code points in a string.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
You can foreach either with or without an index. If an index is specified,
it will be initialized at each iteration with the offset into the string at
which the code point begins.
Supersedes:
This function supersedes std.utf.decode().
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the string to be decoded
Examples:
string s = "hello world";
foreach (c; codePoints(s))
{
    // do something with c, which is a dchar
}
Note that, currently, foreach (c; codePoints(s)) is superior to foreach (c; s)
in that the latter will fall over on encountering U+FFFF.
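The indexed form of the loop can be sketched as follows. The collected offsets are code unit positions, so they skip ahead by three over the euro sign. The arrays `points` and `offsets` are illustrative helpers.

```d
import std.encoding;

void main()
{
    string s = "a\u20ACb";

    // Without an index: each element is one decoded code point.
    dchar[] points;
    foreach (c; codePoints(s))
        points ~= c;
    assert(points == "a\u20ACb"d);

    // With an index: i is the offset at which the code point begins.
    size_t[] offsets;
    foreach (i, c; codePoints(s))
        offsets ~= i;
    size_t[] expected = [0, 1, 4];
    assert(offsets == expected);
}
```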
CodeUnits!(E) codeUnits(E)(dchar c);
- Returns a foreachable struct which can bidirectionally iterate over all
code units in a code point.
The input to this function MUST be a valid code point.
This is enforced by the function's in-contract.
The type of the output cannot be deduced. Therefore, it is necessary to
explicitly specify the encoding type in the template parameter.
Supersedes:
This function supersedes std.utf.encode().
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
c | the code point to be encoded
Examples:
import std.stdio : writefln;

dchar d = '\u20AC';
foreach (c; codeUnits!(char)(d))
{
    writefln("%X", c); // prints E2, 82 and AC on separate lines
}
size_t encode(Tgt, Src, R)(in const(Src[]) s, R range);
- Encodes the contents of s in units of type Tgt and writes the result to the
output range range. Returns the number of Tgt code units written.
void transcode(Src, Dst)(immutable(Src)[] s, out immutable(Dst)[] r);
- Converts a string from one encoding to another. (See also to!() below).
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Supersedes:
This function supersedes std.utf.toUTF8(), std.utf.toUTF16() and
std.utf.toUTF32() (but note that to!() supersedes it more conveniently).
Standards:
Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252
Parameters:
s | the source string
r | the destination string
Examples:
wstring ws;
transcode("hello world",ws);
Latin1String ls;
transcode(ws, ls);
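As a sketch of what the calls above produce, the following round trip goes from UTF-8 to UTF-16 to Latin-1 and back. The sample text is an assumption, chosen so that every character exists in Latin-1.

```d
import std.encoding;

void main()
{
    // UTF-8 to UTF-16.
    wstring ws;
    transcode("h\u00E9llo", ws);
    assert(ws == "h\u00E9llo"w);

    // UTF-16 to Latin-1: one code unit per character.
    Latin1String ls;
    transcode(ws, ls);
    assert(ls.length == 5);
    assert(ls[1] == cast(Latin1Char) 0xE9);

    // Latin-1 back to UTF-8.
    string back;
    transcode(ls, back);
    assert(back == "h\u00E9llo");
}
```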
class EncodingException: object.Exception;
- The base class for exceptions thrown by this module.
abstract class EncodingScheme;
- Abstract base class of all encoding schemes.
static void register(string className);
- Registers a subclass of EncodingScheme.
This function allows user-defined subclasses of EncodingScheme to
be declared in other modules.
Examples:
class Amiga1251 : EncodingScheme
{
    shared static this()
    {
        EncodingScheme.register("path.to.Amiga1251");
    }
}
static EncodingScheme create(string encodingName);
- Obtains a subclass of EncodingScheme which is capable of encoding
and decoding the named encoding scheme.
This function is only aware of EncodingSchemes which have been
registered with the register() function.
Examples:
auto scheme = EncodingScheme.create("Amiga-1251");
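For the schemes that ship with this module, create can be called without any prior registration. The sketch below assumes the built-in schemes are looked up by the names this module lists for each class (for instance "latin1" and "US-ASCII"). The downcasts are only there to show which subclass was returned.

```d
import std.encoding;

void main()
{
    // "latin1" is one of the names recognised by EncodingSchemeLatin1.
    EncodingScheme latin1 = EncodingScheme.create("latin1");
    assert((cast(EncodingSchemeLatin1) latin1) !is null);

    // "US-ASCII" is one of the names recognised by EncodingSchemeASCII.
    EncodingScheme ascii = EncodingScheme.create("US-ASCII");
    assert((cast(EncodingSchemeASCII) ascii) !is null);

    // The returned object answers questions about its own repertoire.
    assert(ascii.canEncode('A'));
    assert(!ascii.canEncode(cast(dchar) 0xE9));
    assert(latin1.canEncode(cast(dchar) 0xE9));
}
```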
abstract const string toString();
- Returns the standard name of the encoding scheme.
abstract const string[] names();
- Returns an array of all known names for this encoding scheme.
abstract const bool canEncode(dchar c);
- Returns true if the character c can be represented
in this encoding scheme.
abstract const size_t encodedLength(dchar c);
- Returns the number of ubytes required to encode this code point.
The input to this function MUST be a valid code point.
Parameters:
dchar c | the code point to be encoded
Returns:
the number of ubytes required.
abstract const size_t encode(dchar c, ubyte[] buffer);
- Encodes a single code point into a user-supplied, fixed-size buffer.
This function encodes a single code point into one or more ubytes.
The supplied buffer must be code unit aligned.
(For example, UTF-16LE or UTF-16BE must be wchar-aligned,
UTF-32LE or UTF-32BE must be dchar-aligned, etc.)
The input to this function MUST be a valid code point.
Parameters:
dchar c | the code point to be encoded
Returns:
the number of ubytes written.
abstract const dchar decode(ref const(ubyte)[] s);
- Decodes a single code point.
This function removes one or more ubytes from the start of an array,
and returns the decoded code point which those ubytes represent.
The input to this function MUST be validly encoded.
Parameters:
const(ubyte)[] s | the array whose first code point is to be decoded
abstract const dchar safeDecode(ref const(ubyte)[] s);
- Decodes a single code point. The input does not have to be valid.
This function removes one or more ubytes from the start of an array,
and returns the decoded code point which those ubytes represent.
This function will accept an invalidly encoded array as input.
If an invalid sequence is found at the start of the array, this
function will remove it, and return the value INVALID_SEQUENCE.
Parameters:
const(ubyte)[] s | the array whose first code point is to be decoded
abstract const @property immutable(ubyte)[] replacementSequence();
- Returns the sequence of ubytes to be used to represent
any character which cannot be represented in the encoding scheme.
Normally this will be a representation of some substitution
character, such as U+FFFD or '?'.
bool isValid(const(ubyte)[] s);
- Returns true if the array is encoded correctly.
Parameters:
const(ubyte)[] s | the array to be tested
size_t validLength(const(ubyte)[] s);
- Returns the length of the longest possible substring, starting from
the first element, which is validly encoded.
Parameters:
const(ubyte)[] s | the array to be tested
immutable(ubyte)[] sanitize(immutable(ubyte)[] s);
- Sanitizes an array by replacing malformed ubyte sequences with valid
ubyte sequences. The result is guaranteed to be valid for this
encoding scheme.
If the input array is already valid, this function returns the
original; otherwise it constructs a new array by replacing all illegal
sequences with the encoding scheme's replacement sequence.
Parameters:
immutable(ubyte)[] s | the array to be sanitized
size_t firstSequence(const(ubyte)[] s);
- Returns the length of the first encoded sequence.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Parameters:
const(ubyte)[] s | the array to be sliced
size_t count(const(ubyte)[] s);
- Returns the total number of code points encoded in a ubyte array.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Parameters:
const(ubyte)[] s | the array to be counted
sizediff_t index(const(ubyte)[] s, size_t n);
- Returns the array index at which the (n+1)th code point begins.
The input to this function MUST be validly encoded.
This is enforced by the function's in-contract.
Parameters:
const(ubyte)[] s | the array to be counted
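The non-abstract EncodingScheme helpers operate on raw ubyte arrays. A minimal sketch using the UTF-8 scheme follows. `std.string.representation` is used here only as a convenient way to view a string literal as immutable(ubyte)[].

```d
import std.encoding;
import std.string : representation;

void main()
{
    EncodingScheme e = EncodingScheme.create("utf-8");

    // 'a' (1 byte), U+20AC (3 bytes), 'b' (1 byte).
    immutable(ubyte)[] bytes = "a\u20ACb".representation;

    assert(e.isValid(bytes));
    assert(e.validLength(bytes) == 5);
    assert(e.firstSequence(bytes) == 1);
    assert(e.count(bytes) == 3);     // three code points
    assert(e.index(bytes, 2) == 4);  // 'b' begins at byte offset 4
}
```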
class EncodingSchemeASCII: std.encoding.EncodingScheme;
- EncodingScheme to handle ASCII
This scheme recognises the following names:
"ANSI_X3.4-1968",
"ANSI_X3.4-1986",
"ASCII",
"IBM367",
"ISO646-US",
"ISO_646.irv:1991",
"US-ASCII",
"cp367",
"csASCII",
"iso-ir-6",
"us"
class EncodingSchemeLatin1: std.encoding.EncodingScheme;
- EncodingScheme to handle Latin-1
This scheme recognises the following names:
"CP819",
"IBM819",
"ISO-8859-1",
"ISO_8859-1",
"ISO_8859-1:1987",
"csISOLatin1",
"iso-ir-100",
"l1",
"latin1"
class EncodingSchemeWindows1252: std.encoding.EncodingScheme;
- EncodingScheme to handle Windows-1252
This scheme recognises the following names:
"windows-1252"
class EncodingSchemeUtf8: std.encoding.EncodingScheme;
- EncodingScheme to handle UTF-8
This scheme recognises the following names:
"UTF-8"
class EncodingSchemeUtf16Native: std.encoding.EncodingScheme;
- EncodingScheme to handle UTF-16 in native byte order
This scheme recognises the following names:
"UTF-16LE" (little-endian architecture only)
"UTF-16BE" (big-endian architecture only)
class EncodingSchemeUtf32Native: std.encoding.EncodingScheme;
- EncodingScheme to handle UTF-32 in native byte order
This scheme recognises the following names:
"UTF-32LE" (little-endian architecture only)
"UTF-32BE" (big-endian architecture only)