WebStack (annotate docs/CHARSET.txt in 74ed715c5455)

Unicode and Character Sets in WebStack
--------------------------------------

Unicode text should be converted to the chosen character set (encoding) when
written to the response stream.

Classic Python strings are written directly to the response stream without
encoding.
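
The distinction between the two kinds of string can be illustrated with a
minimal sketch. It is written for Python 3, where str plays the role of Unicode
text and bytes stands in for classic Python strings. The write_to_response
helper and the use of io.BytesIO as a response stream are assumptions made for
illustration and are not WebStack's API.

```python
import io


def write_to_response(stream, data, encoding="utf-8"):
    """Write data to a binary response stream.

    Hypothetical helper, not part of WebStack: Unicode text (str) is
    converted to the chosen encoding, while byte strings (standing in for
    classic Python strings) are written exactly as given.
    """
    if isinstance(data, str):
        data = data.encode(encoding)
    stream.write(data)


stream = io.BytesIO()
# Unicode text: converted to ISO-8859-1 on the way out.
write_to_response(stream, "caf\u00e9 ", encoding="iso-8859-1")
# Byte string: passed through untouched, whatever encoding it is in.
write_to_response(stream, b"caf\xc3\xa9")
result = stream.getvalue()
```

Note that the second write leaves UTF-8 bytes in a stream that otherwise holds
ISO-8859-1 data, which is the kind of inconsistency that writing unencoded
strings can introduce.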

Character Set Semantics in WebStack
-----------------------------------

Character sets (or encodings) are relevant in two areas:

 * The encoding of output data.
 * The processing of input data.

When producing HTML pages containing form fields and interpreting the values of
such fields from a request body, it is necessary to know...

 * The character set used to encode the values sent by the browser. This is
   typically determined by...

 * The character set used to encode the HTML page from which the field values
   originated.

It is therefore also necessary to remain consistent in the usage of character
sets when specifying content types. WebStack enforces the following rules:

 * Where the request content type specifies a character set, this is used to
   decode the request body parameters unless explicitly overridden.

 * Where the request content type does not specify a character set, a default
   character set is used to decode the request body parameters unless
   overridden.

 * No conversion is done at the request stream level, since information about
   the character set may be missing and the application may wish to override
   any default explicitly at a higher level (such as when it gets request body
   parameters).

 * Where the response content type specifies a character set, this is used to
   encode Unicode response data (e.g. HTML pages).

 * Where the response content type does not specify a character set, a default
   character set is used to encode Unicode response data (e.g. HTML pages).
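
The rules above can be sketched in a few lines of Python 3 using only the
standard library. This is an illustration under stated assumptions and not
WebStack's implementation: the function names are hypothetical, and the default
of ISO-8859-1 is assumed here because this document does not name the default
character set.

```python
from email.message import Message
from urllib.parse import parse_qs

DEFAULT_CHARSET = "iso-8859-1"  # assumed default; the document does not name one


def charset_from_content_type(content_type, default=DEFAULT_CHARSET):
    """Return the charset parameter of a content type, or the default."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset() or default


def decode_body_parameters(body, content_type, override=None):
    """Decode urlencoded body parameters; an explicit override wins."""
    charset = override or charset_from_content_type(content_type)
    # The raw body stays as bytes until this point: nothing is converted
    # at the stream level.
    return parse_qs(body.decode("ascii"), encoding=charset)


def encode_response(text, content_type):
    """Encode Unicode response data using the response content type."""
    return text.encode(charset_from_content_type(content_type))


form_type = "application/x-www-form-urlencoded"
declared = decode_body_parameters(b"name=caf%C3%A9", form_type + "; charset=UTF-8")
defaulted = decode_body_parameters(b"name=caf%E9", form_type)
overridden = decode_body_parameters(b"name=caf%C3%A9", form_type, override="utf-8")
page = encode_response("caf\u00e9", "text/html; charset=UTF-8")
```

All three decoding calls yield the same Unicode value for "name"; only the
source of the character set differs (declared, default or override).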

Restrictions in and Omissions from Standards
--------------------------------------------

How text in encodings such as UTF-16 should be represented in HTTP POST request
bodies of content/media type application/x-www-form-urlencoded is not properly
standardised. It is therefore highly recommended that UTF-8 be used where the
various single byte encodings (e.g. ISO-8859-1) do not cover the range of
characters to be displayed and received.
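
The recommendation can be illustrated with a small Python 3 sketch. The
choose_form_encoding helper is hypothetical: it prefers a single byte encoding
and falls back to UTF-8 only when the text cannot be represented, which is one
possible reading of the advice above rather than something WebStack does.

```python
from urllib.parse import urlencode


def choose_form_encoding(text, single_byte="iso-8859-1"):
    """Prefer the single byte encoding; fall back to UTF-8 when it cannot
    represent the text. Hypothetical helper for illustration."""
    try:
        text.encode(single_byte)
        return single_byte
    except UnicodeEncodeError:
        return "utf-8"


# "café" fits in ISO-8859-1, so one percent-encoded byte is enough.
latin = urlencode({"q": "caf\u00e9"}, encoding=choose_form_encoding("caf\u00e9"))
# The euro sign is outside ISO-8859-1, so UTF-8 is used instead.
euro = urlencode({"q": "\u20ac"}, encoding=choose_form_encoding("\u20ac"))
```

In practice an application would normally settle on one encoding for all of
its pages and forms instead of choosing per value; the per-value choice here
only serves to show where the single byte encodings stop being sufficient.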

Framework Behaviour
-------------------

The Java Servlet API restricts the decoding of request body parameters: the
character encoding must be set (using ServletRequest.setCharacterEncoding)
before any attempt is made to read the request body.
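
The ordering constraint can be modelled with a toy request class in Python 3.
Everything here is hypothetical: the class, its method names and its default
encoding are assumptions made for illustration, and the choice to silently
ignore a late call is this sketch's own behaviour, not a statement about any
particular servlet container or about WebStack.

```python
class BodyRequest:
    """Toy model of a request whose encoding is fixed at the first read.

    Hypothetical class for illustration only; it is neither the Servlet
    API nor WebStack's request object.
    """

    def __init__(self, raw_body, default_encoding="iso-8859-1"):
        self._raw_body = raw_body
        self._encoding = default_encoding  # assumed default for this sketch
        self._body_read = False

    def set_character_encoding(self, encoding):
        # Assumption of this sketch: a call made after the body has been
        # read is silently ignored.
        if not self._body_read:
            self._encoding = encoding

    def read_body(self):
        self._body_read = True
        return self._raw_body.decode(self._encoding)


early = BodyRequest(b"caf\xc3\xa9")
early.set_character_encoding("utf-8")  # set before reading: takes effect
early_text = early.read_body()

late = BodyRequest(b"caf\xc3\xa9")
late_text = late.read_body()           # decoded with the default encoding
late.set_character_encoding("utf-8")   # too late: ignored in this sketch
late_text_again = late.read_body()
```

The second request shows why the order matters: once the body has been decoded
with the wrong character set, a later correction does not repair the text.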

