Class Codec.

Inherits Garbage. Inherited by AsciiCodec, Cp932Codec, Cp949Codec, Cp950Codec, EucJpCodec, Gb2312Codec, GbkCodec, Iso2022JpCodec, Iso88591Codec, TableCodec, Utf16BeCodec, Utf16Codec, Utf16LeCodec, Utf7Codec and Utf8Codec.

The Codec class describes a mapping between UString and anything else.

Unicode is used as the native character set and encoding in Warehouse. All other encodings are mapped to or from it: to Unicode when, for example, parsing a mail message, and from Unicode when storing data in the database (as UTF-8).

A Codec is responsible for one such mapping. The Codec class also contains a factory to create an instance of the right subclass based on a name.

The source code for the codecs includes a number of generated files, e.g. the list of MIME character set names and the map from Unicode to ISO-8859-2. We choose to regard them as source files, because we may want to sever the link between the source and our version. For example, if the source is updated, we may or may not want to follow along.

Codec::Codec( const char * cs )

Constructs an empty Codec for character set cs, setting its state to Valid.

The construction of a codec sets it to its default state, whatever that is for each codec.

static EStringList * Codec::allCodecNames()

Returns a list of all canonical codec names. Aliases are not included in the list.

void Codec::append( UString & u, uint c )

Appends c to u. If c isn't a legal codepoint or there are other errors, this codec's state is modified appropriately.

static Codec * Codec::byName( const EString & s )

Looks up s in our list of MIME character set names and returns a Codec suitable for mapping that to/from Unicode.

If s is unknown, byName() returns 0.
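As an illustration of the lookup described above, here is a minimal, self-contained sketch of a name-based factory. It does not use the real Codec classes: SketchCodec, sketchByName() and the small alias table are hypothetical stand-ins, and the case-insensitive matching is an assumption based on MIME character set names being case-insensitive.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a Codec subclass; only the name is modelled.
struct SketchCodec {
    std::string name;
};

// Lower-cases a character set name so that lookups ignore case.
static std::string lowered(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns a codec for the canonical name or alias s, or a null pointer
// if s is unknown. The table is illustrative, not the real name list.
std::unique_ptr<SketchCodec> sketchByName(const std::string &s) {
    static const std::map<std::string, std::string> names = {
        {"utf-8", "UTF-8"},           {"utf8", "UTF-8"},
        {"us-ascii", "US-ASCII"},     {"ascii", "US-ASCII"},
        {"iso-8859-1", "ISO-8859-1"}, {"latin1", "ISO-8859-1"},
    };
    auto it = names.find(lowered(s));
    if (it == names.end())
        return nullptr;
    return std::unique_ptr<SketchCodec>(new SketchCodec{it->second});
}
```

As with byName(), a caller of this sketch has to check for a null result before using the codec.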

static Codec * Codec::byString( const EString & s )

Returns a codec likely to describe the encoding for s. This uses word lists and many other strategies.

Its assumptions:

If s contains a Unicode Byte Order Mark, it probably is a UTF-16BE or UTF-16LE string.

If s is a Russian string, it probably contains lots of common Russian words, and we can identify the character encoding by scanning for KOI8-R and ISO-8859-5 forms of some common words. Ditto for other languages.

If s uses typical Windows punctuation and is mostly ASCII, it's in a typical Windows encoding.

This function is a little slower than it could be, since it creates a largish number of short EString objects.
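The first of the assumptions above can be sketched as follows. This is only an illustration: guessFromBom() is a hypothetical helper, not part of Codec, and it simplifies matters by looking for the Byte Order Mark only at the very start of the string.

```cpp
#include <string>

// Returns "UTF-16BE" or "UTF-16LE" if s starts with the corresponding
// byte order mark, and an empty string otherwise.
std::string guessFromBom(const std::string &s) {
    if (s.size() < 2)
        return "";
    unsigned char first = static_cast<unsigned char>(s[0]);
    unsigned char second = static_cast<unsigned char>(s[1]);
    if (first == 0xFE && second == 0xFF)
        return "UTF-16BE"; // U+FEFF stored most significant byte first
    if (first == 0xFF && second == 0xFE)
        return "UTF-16LE"; // U+FEFF stored least significant byte first
    return "";
}
```

A fuller guesser would presumably fall through to the word-list and punctuation heuristics when no mark is found.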

static Codec * Codec::byString( const UString & u )

Returns a codec suitable for encoding the Unicode string u in such a way that the largest possible number of mail readers will understand the message.

EString Codec::error() const

Returns an error message describing why the codec is in Invalid state. If the codec is in Valid or BadlyFormed states, error() returns an empty string.

EString Codec::fromUnicode( const UString & u )

This pure virtual function maps u from Unicode to the codec's other encoding, and returns an EString containing the result.

Each reimplementation must decide how to handle codepoints that cannot be represented in the target encoding.
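To make that decision concrete, here is one possible policy, sketched for a US-ASCII target. AsciiEncoderSketch and EncodeState are hypothetical names, a std::vector of code points stands in for UString, and substituting '?' while marking the state Invalid is an assumption made for the example, not necessarily what any real subclass does.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class EncodeState { Valid, BadlyFormed, Invalid };

// Hypothetical encoder: maps code points to US-ASCII octets.
struct AsciiEncoderSketch {
    EncodeState state = EncodeState::Valid;

    std::string fromUnicode(const std::vector<std::uint32_t> &u) {
        std::string out;
        for (std::uint32_t c : u) {
            if (c < 128) {
                out += static_cast<char>(c);
            } else {
                // Not representable in ASCII: substitute and remember it.
                out += '?';
                state = EncodeState::Invalid;
            }
        }
        return out;
    }
};
```

Other policies, such as dropping the codepoint or picking a close ASCII equivalent, would fit the same interface.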

void Codec::mangleTrailingSurrogate( UString & u )

Checks whether the last codepoint in u is a leading surrogate, and flags an error if so.
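The check itself is small. A sketch of it, assuming a std::vector of code points in place of UString and the hypothetical name endsWithLeadingSurrogate(), could look like this. Leading (high) surrogates occupy the range U+D800 to U+DBFF.

```cpp
#include <cstdint>
#include <vector>

// Returns true if the last code point in u is a leading (high) surrogate,
// i.e. the first half of a UTF-16 pair whose second half never arrived.
bool endsWithLeadingSurrogate(const std::vector<std::uint32_t> &u) {
    return !u.empty() && u.back() >= 0xD800 && u.back() <= 0xDBFF;
}
```

How the real function flags the error is not shown here.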

EString Codec::name() const

Returns the name of the codec, as supplied to the constructor.

void Codec::recordError( const EString & s )

Records that the error s occurred. This is meant for errors other than invalid or undefined codepoints, and should be needed only by a stateful Codec. Also sets the state() to Invalid.

void Codec::recordError( uint pos )

Records that at octet index pos, an error happened and no code point could be found. This also sets the state() to Invalid.

void Codec::recordError( uint pos, const EString & input )

Records that at octet index pos in input, an error happened and no code point could be found. This also sets the state() to Invalid.

void Codec::recordError( uint pos, uint codepoint )

Records that codepoint (at octet index pos) is not valid and could not be converted to Unicode. This also sets the state() to Invalid.
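The interplay between the recordError() overloads, state() and error() might be sketched like this. ErrorRecorderSketch and ErrState are hypothetical, only two of the overloads are modelled, and the message texts are invented for the example.

```cpp
#include <string>

enum class ErrState { Valid, BadlyFormed, Invalid };

// Hypothetical recorder: keeps the latest error and the resulting state.
struct ErrorRecorderSketch {
    ErrState state = ErrState::Valid;
    std::string message;

    // No code point could be found at octet index pos.
    void recordError(unsigned pos) {
        message = "Parse error at index " + std::to_string(pos);
        state = ErrState::Invalid;
    }

    // codepoint, found at octet index pos, could not be converted.
    void recordError(unsigned pos, unsigned codepoint) {
        message = "Invalid codepoint " + std::to_string(codepoint) +
                  " at index " + std::to_string(pos);
        state = ErrState::Invalid;
    }

    // Empty unless the state is Invalid, as described for error().
    std::string error() const {
        return state == ErrState::Invalid ? message : std::string();
    }
};
```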

void Codec::reset()

This virtual function resets the codec. After calling reset(), the codec again reports that the input was wellformed() and valid(), and any codec state must have been set to the default state.

void Codec::setState( State st )

Sets the codec's state to st, which is one of Valid, BadlyFormed and Invalid.

Valid is the initial setting, and means that the Codec has seen only valid input. BadlyFormed means that the Codec has seen something it did not like, but was able to determine the meaning of that input. Invalid means that the Codec has seen input whose meaning could not be determined.

State Codec::state() const

Returns the current state of the codec, reflecting the codec's input up to this point.

UString Codec::toUnicode( const EString & s )

This pure virtual function maps s from the codec's encoding to Unicode, and returns a UString containing the result.

Reimplementations are expected to handle errors only by calling setState(). Each reimplementation is free to recover as seems suitable for its encoding.
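A reimplementation following this contract might look like the sketch below, which decodes US-ASCII. AsciiDecoderSketch and DecodeState are hypothetical names, a std::vector of code points stands in for UString, and recovering by emitting U+FFFD (the Unicode replacement character) is just one recovery choice among several.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class DecodeState { Valid, BadlyFormed, Invalid };

// Hypothetical decoder: maps US-ASCII octets to code points.
struct AsciiDecoderSketch {
    DecodeState state = DecodeState::Valid;

    void setState(DecodeState st) { state = st; }

    std::vector<std::uint32_t> toUnicode(const std::string &s) {
        std::vector<std::uint32_t> out;
        for (unsigned char octet : s) {
            if (octet < 0x80) {
                out.push_back(octet);
            } else {
                // Errors are signalled only through setState(); decoding
                // continues with a replacement character.
                out.push_back(0xFFFD);
                setState(DecodeState::Invalid);
            }
        }
        return out;
    }
};
```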

bool Codec::valid() const

Returns true if this codec's input has not yet seen any syntax errors, and false if it has.

bool Codec::wellformed() const

Returns true if this codec's input has so far been well-formed, and false if not. The definition of wellformedness is left to each subclass. As general guidance, to be wellformed, the input must avoid features that are discouraged or obsoleted by the relevant standard.
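One plausible reading of how state(), valid(), wellformed() and reset() relate is sketched below. StateSketch is a hypothetical class, and the mapping from the three states to the two predicates is an assumption drawn from the descriptions on this page, not a copy of the real implementation.

```cpp
enum class State { Valid, BadlyFormed, Invalid };

// Hypothetical model of the state-related part of a codec.
struct StateSketch {
    State st = State::Valid;

    void setState(State s) { st = s; }
    State state() const { return st; }

    // Assumed: only entirely clean input counts as well-formed.
    bool wellformed() const { return st == State::Valid; }

    // Assumed: badly formed input whose meaning was still determined
    // remains valid.
    bool valid() const { return st != State::Invalid; }

    // Back to the default state, so both predicates are true again.
    void reset() { st = State::Valid; }
};
```

Under this reading, BadlyFormed input is valid() but not wellformed(), and Invalid input is neither.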

Codec::~Codec()

Destroys the Codec.

This web page is based on source code belonging to The Archiveopteryx Developers. All rights reserved.