WebStack

docs/CHARSET.txt

400:5b276bbcbbb5
2005-07-16 paulb [project @ 2005-07-16 20:32:38 by paulb] Changed virtual path info in sub-resources so that it may be an empty string.
Unicode and Character Sets in WebStack
--------------------------------------

Unicode text should be converted to the chosen character set (encoding) when
written to the response stream.

Classic Python strings are written directly to the response stream without
encoding.
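
The distinction above can be sketched as follows. This is only an
illustration, not WebStack's own code: it assumes Python 3, where "str" plays
the role of Unicode text and "bytes" the role of a classic string, and the
function name write_to_response is hypothetical.

```python
import io

def write_to_response(stream, data, encoding="utf-8"):
    """Write data to a binary stream, encoding only Unicode text.

    Hypothetical helper for illustration; not part of WebStack.
    """
    if isinstance(data, str):
        # Unicode text: convert to the chosen character set first.
        stream.write(data.encode(encoding))
    else:
        # Byte string: written directly, without any encoding step.
        stream.write(data)

# A BytesIO object stands in for the response stream here.
stream = io.BytesIO()
write_to_response(stream, "caf\u00e9", "iso-8859-1")  # encoded on the way out
write_to_response(stream, b" raw")                    # passed through as-is
result = stream.getvalue()
```

Because byte strings pass through untouched, it falls to the application to
make sure that they already agree with the character set declared for the
response.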

Character Set Semantics in WebStack
-----------------------------------

Character sets (or encodings) are relevant in two areas:

 * The encoding of output data.
 * The processing of input data.

When producing HTML pages containing form fields and interpreting the values of
such fields from a request body, it is necessary to know...

 * The character set used to encode the values sent by the browser. This is
   typically determined by...

 * The character set used to encode the HTML page from which the field values
   originated.

It is therefore also necessary to remain consistent in the usage of character
sets when specifying content types. WebStack enforces the following rules:

 * Where the request content type specifies a character set, this is used to
   decode the request body parameters unless explicitly overridden.

 * Where the request content type does not specify a character set, a default
   character set is used to decode the request body parameters unless
   overridden.

 * No conversion is done at the request stream level, since information about
   the character set may be missing and the application may wish to override
   any default explicitly at a higher level (such as when it gets request body
   parameters).

 * Where the response content type specifies a character set, this is used to
   encode Unicode response data (eg. HTML pages).

 * Where the response content type does not specify a character set, a default
   character set is used to encode Unicode response data (eg. HTML pages).
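
The rules above might be approximated as in the following sketch. It assumes
Python 3 and uses only the standard library; the function names and the choice
of ISO-8859-1 as the default are assumptions made for illustration and do not
describe WebStack's actual API or default.

```python
from email.message import Message
from urllib.parse import parse_qs

# Assumed default for this sketch only; the real default may differ.
DEFAULT_CHARSET = "iso-8859-1"

def charset_of(content_type, default=DEFAULT_CHARSET):
    """Return the charset parameter of a content type, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

def decode_body_params(body, content_type, override=None):
    """Decode urlencoded request body parameters into Unicode text.

    The raw body is left as bytes until this point, so the caller can
    still override the character set explicitly.
    """
    encoding = override or charset_of(content_type)
    # An urlencoded body is plain ASCII; the percent-escapes carry the bytes.
    return parse_qs(body.decode("ascii"), encoding=encoding)

def encode_response(text, content_type):
    """Encode Unicode response data using the response content type."""
    return text.encode(charset_of(content_type))

form_type = "application/x-www-form-urlencoded"
latin1_fields = decode_body_params(b"name=caf%E9", form_type)
utf8_fields = decode_body_params(b"name=caf%C3%A9", form_type, override="utf-8")
page = encode_response("caf\u00e9", "text/html; charset=utf-8")
```

Here the first call falls back to the assumed default because the content type
carries no charset parameter, whereas the second shows an explicit override
taking precedence.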

Restrictions in and Omissions from Standards
--------------------------------------------

The encoding of character sets such as UTF-16 in HTTP POST request body
messages of content/media type application/x-www-form-urlencoded is not
properly standardised. Therefore, it is highly recommended that UTF-8 be used
as an encoding should the various single byte encodings (eg. ISO-8859-1) not
cover the range of characters to be displayed and received.
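
As a rough illustration of this recommendation, the sketch below (Python 3,
standard library only; the helper can_encode and the sample text are
hypothetical) shows a value which ISO-8859-1 cannot represent surviving a round
trip through urlencoding when UTF-8 is used.

```python
from urllib.parse import quote_plus, unquote_plus

def can_encode(text, charset):
    """Report whether every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# The euro sign (U+20AC) lies outside the range of ISO-8859-1.
text = "\u20ac5 caf\u00e9"

fits_latin1 = can_encode(text, "iso-8859-1")
fits_utf8 = can_encode(text, "utf-8")

# With UTF-8, each character becomes one or more percent-escaped bytes.
encoded = quote_plus(text, encoding="utf-8")
decoded = unquote_plus(encoded, encoding="utf-8")
```

The same bytes decoded with a different character set would yield different
text, which is why the page and the form handling need to agree on UTF-8.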

Framework Behaviour
-------------------

The Java Servlet API imposes restrictions on decoding request body parameters
by stating that the character encoding (ServletRequest.setCharacterEncoding)
must be set before any reading of the request body is attempted.