<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Character Encodings</title>
<link href="styles.css" rel="stylesheet" type="text/css" /></head>
<body>
<h1>Character Encodings</h1>
<p>When writing applications with WebStack, you should try to use
Python's Unicode objects as much as possible. However, there are a
number of places where plain Python strings can be involved:</p>
<ul>
<li><a href="parameters-headers.html">Inspecting query strings</a></li>
<li><a href="responses.html">Sending output in a response</a></li>
<li><a href="parameters.html">Receiving uploaded content</a></li>
<li><a href="state.html">Accessing cookie information</a></li>
<li><a href="sessions.html">Accessing session information</a> (see <a href="sessions-usage.html#Limitations">"Session Limitations and Guidelines"</a>)</li>
</ul>
<p>When Web pages (and other types of content) are sent to and from
users of your application, the text will be in some kind of character
encoding. For example, in English-speaking environments, the US-ASCII
encoding is common and contains the basic letters, numbers and symbols
used in English, whereas in Western Europe encodings like
ISO-8859-1 and ISO-8859-15 are typically used, since they contain
additional letters and symbols in order to support other languages.
UTF-8 is often used to encode text because it can represent the
characters of most languages at once and is therefore flexible enough
for many applications.</p>
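<p>As a small illustration of how these encodings differ, the following sketch encodes the same characters in several ways. It is written for Python 3, where <code>str</code> plays the role of the Unicode object and <code>bytes</code> that of the plain string; the sample text and the <code>can_encode</code> helper are assumptions made for the example and are not part of WebStack.</p>

```python
# "caf\u00e9" is "cafe" with an acute accent on the final letter;
# "\u20ac" is the euro sign. Escapes keep this source file pure ASCII.
text = "caf\u00e9"
euro = "\u20ac"

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # the accented letter needs two bytes

euro_latin9 = euro.encode("iso-8859-15")  # ISO-8859-15 includes the euro sign
euro_utf8 = euro.encode("utf-8")          # three bytes in UTF-8


def can_encode(value, encoding):
    """Return True if value can be represented in the given encoding."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False
```

<p>Here, US-ASCII cannot represent the accented letter at all, and ISO-8859-1 cannot represent the euro sign, whereas UTF-8 handles both at the cost of using more than one byte for such characters.</p>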
<p>The situation is more awkward for URLs received by applications,
since some of the request parameters must be decoded before they can be
interpreted. The URL text itself is encoded in US-ASCII, but it contains
special numeric codes (percent-encoded byte values such as
<code>%E9</code>) that indicate character values in the
original text encoding; see the <a href="parameters.html">description
of query strings</a> for more information.</p>
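<p>The effect of these numeric codes can be sketched with the Python 3 standard library (this uses <code>urllib.parse</code> rather than WebStack's own parameter handling, and the sample query strings are assumptions made for the example). The same percent-encoded bytes only yield the intended characters when decoded with the encoding that produced them:</p>

```python
from urllib.parse import parse_qs, unquote, unquote_to_bytes

# The URL is plain ASCII; "%E9" stands for the single byte 0xE9.
raw_bytes = unquote_to_bytes("caf%E9")

# Interpreting that byte requires knowing the original text encoding.
as_latin1 = unquote("caf%E9", encoding="iso-8859-1")
as_utf8 = unquote("caf%C3%A9", encoding="utf-8")

# Guessing wrongly does not fail loudly by default: the invalid byte is
# replaced with U+FFFD, the Unicode replacement character.
wrong_guess = unquote("caf%E9", encoding="utf-8")

# Whole query strings can be decoded with an explicit encoding, too.
fields = parse_qs("name=caf%C3%A9&lang=fr", encoding="utf-8")
```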
<h2>Recommendations</h2>
<dl>
<dt>The following recommendations should help you avoid issues with
incorrect characters in the Web pages (and other content) that you
produce:</dt>
</dl>
<h3>Use Unicode Objects for Textual Content</h3>
<p>Handling text in specific encodings using normal Python strings can
be difficult, and handling text in multiple encodings in the same
application can be highly error-prone. Fortunately, Python has support
for Unicode objects which let you think of letters, numbers, symbols
and all other characters in an abstract way.</p>
<ul>
<li>Convert textual content to Unicode as soon as possible.</li>
<li>If you must include hard-coded messages in your application code,
make sure to specify the encoding using the <a href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
at the top of your source file.</li>
<li>Remember that the standard library <code>codecs</code>
module contains useful functions to access streams as if Unicode
objects were being transmitted; for example:</li>
</ul>
<pre>import codecs

class MyResource:

    encoding = "utf-8"

    def respond(self, trans):
        stream = trans.get_request_stream()                       # only reads strings
        unicode_stream = codecs.getreader(self.encoding)(stream)  # reads Unicode objects

        # [Some activity...]

        out = trans.get_response_stream()                         # writes strings and Unicode objects
</pre>
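<p>The wrapping technique can be tried outside WebStack with in-memory streams. The following is a minimal sketch for Python 3, in which <code>io.BytesIO</code> objects stand in for the request and response streams; the names <code>request_stream</code> and <code>response_stream</code> and the sample text are assumptions made for the example.</p>

```python
import codecs
import io

ENCODING = "utf-8"  # assumed encoding for this example

# Stand-in for a request stream: it only yields encoded bytes.
request_stream = io.BytesIO("caf\u00e9".encode(ENCODING))

# Wrapping the stream makes reads return decoded text instead.
unicode_reader = codecs.getreader(ENCODING)(request_stream)
received = unicode_reader.read()

# Stand-in for a response stream: wrapping it lets us write text,
# which is encoded before it reaches the underlying stream.
response_stream = io.BytesIO()
unicode_writer = codecs.getwriter(ENCODING)(response_stream)
unicode_writer.write("na\u00efve")
sent = response_stream.getvalue()
```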
<h3>Use Strings for Binary Content</h3>
<p>If you are reading and writing binary content, Unicode objects are
inappropriate. Make sure to open files in binary mode, where necessary.</p>
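<p>As a brief sketch of this advice in Python 3 (where the plain string type for binary data is <code>bytes</code>), the following writes and reads a temporary file in binary mode so that the bytes are never decoded or translated; the sample payload is an assumption made for the example.</p>

```python
import os
import tempfile

# The eight-byte PNG signature, which deliberately contains CR and LF
# bytes that a text-mode file could translate on some platforms.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:   # "b" selects binary mode for writing
        f.write(payload)
    with open(path, "rb") as f:   # and for reading
        restored = f.read()
finally:
    os.remove(path)
```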
<h3>Use Explicit Encodings and Be Consistent</h3>
<p>Although WebStack has some support for detecting character encodings
used in requests, it is often best for your application to exercise
control over which encoding is used when <a href="parameters.html">inspecting
request parameters</a> and when <a href="responses.html">producing responses</a>.
The best way to do this is to decide which encoding is most suitable for
the data presented and received in your application and then to use it
throughout.</p>
<p>One approach which works acceptably for smaller applications is to
define an attribute (or a global) which is conveniently accessible and
which can be used directly with various transaction methods. Here is an
outline of code which does this:</p>
<pre>from WebStack.Generic import ContentType

class MyResource:

    encoding = "utf-8"  # We decide on "utf-8" as our chosen encoding.

    def respond(self, trans):
        # [Do various things.]

        fields = trans.get_fields_from_body(encoding=self.encoding)      # Explicitly use the encoding.

        # [Do other things with the Unicode values from the fields.]

        trans.set_content_type(ContentType("text/html", self.encoding))  # The output Web page uses the encoding.

        # [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.]</pre>
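<p>The final step in the outline, converting Unicode to raw strings with the chosen encoding, might look like the following sketch in Python 3. The <code>build_response</code> helper and the sample markup are hypothetical and are not part of WebStack; the point is only that one encoding value is used both to label and to encode the output.</p>

```python
ENCODING = "utf-8"  # the single encoding chosen for the application


def build_response(text, encoding=ENCODING):
    """Return a (content type value, encoded body) pair for the given text."""
    content_type = "text/html; charset=%s" % encoding
    body = text.encode(encoding)
    return content_type, body


content_type, body = build_response("<p>caf\u00e9</p>")

# A recipient that trusts the declared charset recovers the original text.
decoded = body.decode(ENCODING)
```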
<h3>Use EncodingSelector to Set the Default Encoding</h3>
<p>An arguably better approach is to use selectors (as described in <a href="selectors.html">"Selectors - Components for Dispatching to Resources"</a>), typically in a "site map" arrangement (as described in <a href="deploying.html">"Deploying a WebStack Application"</a>), specifically using the <code>EncodingSelector</code>:</p>
<pre>from WebStack.Generic import ContentType
from WebStack.Resources.Selectors import EncodingSelector  # module path assumed; see the selectors documentation

class MyResource:

    def respond(self, trans):
        # [Do various things.]

        fields = trans.get_fields_from_body()             # Encoding set by EncodingSelector.

        # [Do other things with the Unicode values from the fields.]

        trans.set_content_type(ContentType("text/html"))  # The output Web page uses the default encoding.

        # [Produce the response, making sure that the default encoding is used to convert Unicode to raw strings.]

def get_site_map():
    return EncodingSelector(MyResource(), "utf-8")</pre>
<h3>Tell Other Components Which Encoding to Use</h3>
<p>When using other components to generate content (see <a href="integrating.html">"Integrating with Other Systems"</a>), such
components may just write the generated content
straight to a normal stream (rather than one wrapped by a <code>codecs</code>
module function). In such cases, for textual content
such as XML or related formats (XHTML, SVG, HTML), you will probably need to
instruct the component to use your chosen encoding; for example:</p>
<pre># In the respond method, xml_document is an xml.dom.minidom.Document object...
xml_document.toxml(self.encoding)</pre>
<p>This will then generate the appropriate characters in the output <span style="font-style: italic;">and</span> specify the correct encoding
for the XML document.</p>
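<p>Both effects can be observed with a small standalone sketch for Python 3, using only the standard library's <code>xml.dom.minidom</code>; the document content is an assumption made for the example.</p>

```python
from xml.dom.minidom import getDOMImplementation

# Build a tiny document whose text contains a non-ASCII character
# ("caf\u00e9" is "cafe" with an acute accent on the final letter).
document = getDOMImplementation().createDocument(None, "p", None)
document.documentElement.appendChild(document.createTextNode("caf\u00e9"))

# Passing an encoding yields encoded bytes, with the XML declaration
# naming that same encoding.
as_utf8 = document.toxml("utf-8")
as_latin1 = document.toxml("iso-8859-1")
```

<p>The accented letter is serialised as two bytes in the UTF-8 output and as one byte in the ISO-8859-1 output, and each result begins with an XML declaration that states the encoding used.</p>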
</body></html>