WebStack

docs/CHARSET.txt

289:43e23cde36c2
2004-09-21 paulb [project @ 2004-09-21 17:59:03 by paulb] Fixed path field discovery by removing parameters with empty names. This may not be totally correct, however.
Unicode and Character Sets in WebStack
--------------------------------------

Unicode text should be converted to the chosen character set (encoding) when
written to the response stream.

Classic Python strings (byte strings) are written directly to the response
stream without being encoded.
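The distinction above can be sketched as follows. This is only an
illustration written for Python 3, where str holds Unicode text and bytes
plays the role of the classic string; the write_text helper and the in-memory
stream are assumptions made for the example and are not part of WebStack.

```python
import io


def write_text(stream, data, encoding="utf-8"):
    """Write to a byte stream, encoding Unicode text but passing bytes through."""
    if isinstance(data, str):
        # Unicode text: convert to the chosen character set first.
        stream.write(data.encode(encoding))
    else:
        # Byte string: written as-is, with no encoding applied.
        stream.write(data)


response = io.BytesIO()  # stands in for the response stream (hypothetical)
write_text(response, "caf\u00e9 ", "iso-8859-1")    # Unicode, gets encoded
write_text(response, b"caf\xc3\xa9", "iso-8859-1")  # bytes, left untouched
body = response.getvalue()
```

Note that the second call writes UTF-8 bytes into a stream that is otherwise
ISO-8859-1: nothing checks or converts byte strings, so keeping them
consistent with the response's character set is left to the application.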

Character Set Semantics in WebStack
-----------------------------------

Character sets (or encodings) are relevant in two areas:

 * The encoding of output data.
 * The processing of input data.

When producing HTML pages containing form fields and interpreting the values of
such fields from a request body, it is necessary to know:

 * The character set used to encode the values sent by the browser. This is
   typically determined by the character set of the originating page,
   described next.

 * The character set used to encode the HTML page from which the field values
   originated.

It is therefore also necessary to use character sets consistently when
specifying content types. WebStack enforces the following rules:

 * Where the request content type specifies a character set, this is used to
   decode the request body parameters unless explicitly overridden.

 * Where the request content type does not specify a character set, a default
   character set is used to decode the request body parameters unless
   explicitly overridden.

 * Where the response content type specifies a character set, this is used to
   encode Unicode response data (e.g. HTML pages).

 * Where the response content type does not specify a character set, a default
   character set is used to encode Unicode response data (e.g. HTML pages).
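The four rules above amount to one lookup with a fallback. A minimal sketch
in Python 3 follows; the function names, the override argument and the
ISO-8859-1 default are all assumptions made for illustration and do not
reflect WebStack's actual API or its actual default.

```python
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # assumed default; the real default may differ


def charset_of(content_type, default=DEFAULT_CHARSET, override=None):
    """Choose the character set for a body with the given Content-Type value."""
    if override is not None:
        # An explicit override wins over anything in the content type.
        return override
    if content_type:
        # Let the standard library parse the "charset" parameter.
        message = Message()
        message["Content-Type"] = content_type
        charset = message.get_content_charset()
        if charset:
            return charset
    # No character set specified: fall back to the default.
    return default


def decode_parameter(raw, request_content_type, override=None):
    """Decode the raw bytes of a request body parameter (rules 1 and 2)."""
    return raw.decode(charset_of(request_content_type, override=override))


def encode_response(text, response_content_type):
    """Encode Unicode response data (rules 3 and 4)."""
    return text.encode(charset_of(response_content_type))
```

For example, encode_response("caf\u00e9", "text/html; charset=utf-8") yields
the two-byte UTF-8 form of the final character, whereas the same call with
plain "text/html" falls back to the assumed default and yields a single byte.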

Restrictions in and Omissions from Standards
--------------------------------------------

The way in which text in character sets such as UTF-16 is encoded in HTTP POST
request bodies of content/media type application/x-www-form-urlencoded is not
properly standardised. It is therefore highly recommended that UTF-8 be used
as the encoding wherever the various single-byte encodings (e.g. ISO-8859-1)
do not cover the range of characters to be displayed and received.
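To illustrate the recommendation, the following Python 3 sketch uses the
standard urllib.parse module to build and read a form-encoded body. The field
names and values are invented for the example.

```python
from urllib.parse import parse_qs, urlencode

# One value fits in ISO-8859-1; the euro sign in the other does not.
fields = {"name": "Zo\u00eb", "price": "\u20ac10"}

# UTF-8 can represent every field value.
utf8_body = urlencode(fields, encoding="utf-8")

# A single-byte encoding fails once a character falls outside its range.
try:
    urlencode(fields, encoding="iso-8859-1")
    latin1_covers_fields = True
except UnicodeEncodeError:
    latin1_covers_fields = False

# Decoding with the character set used for encoding recovers the values.
decoded = parse_qs(utf8_body, encoding="utf-8")

# Decoding with a different character set silently damages them:
# parse_qs replaces undecodable bytes with U+FFFD by default.
latin1_body = urlencode({"name": "Zo\u00eb"}, encoding="iso-8859-1")
mismatched = parse_qs(latin1_body, encoding="utf-8")
```

The body itself carries no indication of which character set produced the
percent-encoded bytes, which is why the page and the code that decodes its
submitted fields need to agree on one in advance.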

Framework Behaviour
-------------------

The Java Servlet API restricts the decoding of request body parameters: the
character encoding must be set (using ServletRequest.setCharacterEncoding)
before any attempt is made to read the request body.
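
The consequence of this ordering restriction can be modelled in a few lines
of Python 3. The Request class below is hypothetical: it is neither the
Servlet API nor WebStack code, and it assumes that a call made too late is
silently ignored, which should be checked against the Servlet documentation.

```python
from urllib.parse import parse_qs


class Request:
    """Hypothetical request whose parameters are decoded once, on first read."""

    def __init__(self, raw_body, encoding="iso-8859-1"):
        self.raw_body = raw_body  # percent-encoded body text
        self.encoding = encoding  # assumed default encoding
        self._parameters = None   # filled in by the first read

    def set_character_encoding(self, encoding):
        # Only effective while the body is still unread (assumed behaviour).
        if self._parameters is None:
            self.encoding = encoding

    def get_parameters(self):
        # The encoding in force at the first read is the one that sticks.
        if self._parameters is None:
            self._parameters = parse_qs(self.raw_body, encoding=self.encoding)
        return self._parameters


early = Request("name=Zo%C3%AB")
early.set_character_encoding("utf-8")  # set before reading: takes effect
early_name = early.get_parameters()["name"][0]

late = Request("name=Zo%C3%AB")
late.get_parameters()                  # body read with the default encoding
late.set_character_encoding("utf-8")   # too late: parameters already decoded
late_name = late.get_parameters()["name"][0]
```

Here early_name is the intended two-letter-plus-diaeresis name, while
late_name contains the UTF-8 bytes misread as ISO-8859-1. A framework layered
on such an API therefore has to settle the encoding before touching the body.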