WebStack

Annotated docs/encodings.html

732:7f1f02b485f8
2007-11-12 paulb [project @ 2007-11-12 00:50:03 by paulb] Introduced base classes for common authentication activities. Made cookie usage "safe" for usernames containing ":" characters. Added support for OpenID signatures.
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
  <title>Character Encodings</title>
  <link href="styles.css" rel="stylesheet" type="text/css" /></head>
<body>
<h1>Character Encodings</h1>
<p>When writing applications with WebStack, you should try to use
Python's Unicode objects as much as possible. However, there are a
number of places where plain Python strings can be involved:</p>
<ul>
  <li><a href="parameters-headers.html">Inspecting query strings</a></li>
  <li><a href="responses.html">Sending output in a response</a></li>
  <li><a href="parameters.html">Receiving uploaded content</a></li>
  <li><a href="state.html">Accessing cookie information</a></li>
  <li><a href="sessions.html">Accessing session information</a> (see the <a href="sessions-usage.html#Limitations">"Session Limitations and Guidelines"</a>)</li>
</ul>
<p>When Web pages (and other types of content) are sent to and from
users of your application, the text will be in some kind of character
encoding. For example, in English-speaking environments, the US-ASCII
encoding is common and contains the basic letters, numbers and symbols
used in English, whereas in Western Europe encodings like
ISO-8859-1 and ISO-8859-15 are typically used, since they contain
additional letters and symbols in order to support other languages.
UTF-8 is often used to encode text because it can represent every
Unicode character, covering virtually all languages simultaneously,
and is therefore flexible enough for many applications.</p>
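<p>As a rough illustration of how these encodings differ, the following sketch (not part of WebStack; the sample text and the <code>can_encode</code> helper are made up for this example) checks which of the encodings mentioned above can represent a short piece of text containing an accented letter and the euro sign. On Python 3, <code>str</code> plays the role of the Unicode object and <code>bytes</code> the role of the plain string.</p>

```python
# Sample text: "café €5" written with escapes so the source file stays ASCII.
text = u"caf\u00e9 \u20ac5"

def can_encode(text, encoding):
    """Hypothetical helper: report whether text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Which of the encodings discussed above can hold this text?
support = dict(
    (name, can_encode(text, name))
    for name in ("us-ascii", "iso-8859-1", "iso-8859-15", "utf-8")
)

# The same characters occupy a different number of bytes in each encoding.
latin9_bytes = text.encode("iso-8859-15")   # one byte per character
utf8_bytes = text.encode("utf-8")           # multi-byte for non-ASCII characters
```

<p>US-ASCII fails on the accented letter, and ISO-8859-1 fails on the euro sign, which is one of the characters that ISO-8859-15 added.</p>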
<p>When URLs are received in applications, the situation is a bit more
awkward, because some of the request parameters must be decoded before
they can be interpreted. The URL itself is encoded in US-ASCII, but it
can contain special numeric codes (percent-encoded values such as
<code>%E9</code>) that indicate character values in the
original text encoding - see the <a href="parameters.html">description
of query strings</a> for more information.</p>
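<p>The two-step nature of this decoding can be sketched with the standard library alone. The <code>decode_parameter</code> function below is a hypothetical helper written for this illustration, not a WebStack API: it first turns the numeric codes back into raw bytes and then interprets those bytes using the encoding you believe the sender used.</p>

```python
try:
    from urllib.parse import unquote_to_bytes                # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes           # Python 2

def decode_parameter(raw_value, encoding):
    """Hypothetical helper: decode one raw query string value.

    Step 1: undo the URL escaping ("+" means a space in query strings,
            "%XX" is a byte value), giving raw bytes.
    Step 2: interpret those bytes in the given character encoding.
    """
    raw_bytes = unquote_to_bytes(raw_value.replace("+", " "))
    return raw_bytes.decode(encoding)

# The same word, as sent by a UTF-8 page and by an ISO-8859-1 page.
from_utf8_page = decode_parameter("caf%C3%A9+au+lait", "utf-8")
from_latin1_page = decode_parameter("caf%E9+au+lait", "iso-8859-1")
```

<p>Both calls yield the same Unicode text, but only because the right encoding was supplied each time; the URL alone does not say which encoding applies.</p>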
<h2>Recommendations</h2>
<dl>
  <dt>The following recommendations should help you avoid issues with
incorrect characters in the Web pages (and other content) that you
produce:</dt>
</dl>
<h3>Use Unicode Objects for Textual Content</h3>
<p>Handling text in specific encodings using normal Python strings can
be difficult, and handling text in multiple encodings in the same
application can be highly error-prone. Fortunately, Python has support
for Unicode objects, which let you think of letters, numbers, symbols
and all other characters in an abstract way.</p>
<ul>
  <li>Convert textual content to Unicode as soon as possible.</li>
  <li>If you must include hard-coded messages in your application code,
make sure to specify the encoding using the <a href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
at the top of your source file.</li>
  <li>Remember that the standard library <code>codecs</code>
module contains useful functions to access streams as if Unicode
objects were being transmitted; for example:</li>
</ul>
<pre>import codecs<br /><br />class MyResource:<br /><br />    encoding = "utf-8"<br /><br />    def respond(self, trans):<br />        stream = trans.get_request_stream()                         # only reads strings<br />        unicode_stream = codecs.getreader(self.encoding)(stream)    # reads Unicode objects<br /><br />        # [Some activity...]<br /><br />        out = trans.get_response_stream()                           # writes strings and Unicode objects<br /></pre>
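<p>To try the <code>codecs</code> wrappers without a running WebStack application, the following self-contained sketch substitutes in-memory <code>io.BytesIO</code> objects for the request and response streams. Those stand-ins, and the sample text, are assumptions made for this illustration; only <code>codecs.getreader</code> and <code>codecs.getwriter</code> are the point.</p>

```python
import codecs
import io

encoding = "utf-8"

# Stand-in for a request stream: it only yields raw, encoded bytes.
request_stream = io.BytesIO(u"na\u00efve caf\u00e9".encode(encoding))

# Wrapped with a reader, the same stream yields Unicode text instead.
unicode_stream = codecs.getreader(encoding)(request_stream)
text = unicode_stream.read()

# Stand-in for a response stream, wrapped so that Unicode text written
# to it is encoded on the way out.
response_stream = io.BytesIO()
unicode_out = codecs.getwriter(encoding)(response_stream)
unicode_out.write(text.upper())

raw_output = response_stream.getvalue()     # encoded bytes, ready to send
```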
<h3>Use Strings for Binary Content</h3>
<p>If you are reading and writing binary content, Unicode objects are
inappropriate. Make sure to open files in binary mode, where necessary.</p>
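<p>A minimal sketch of binary-mode file handling follows. The payload (the eight-byte PNG signature) and the temporary file are chosen purely for illustration; the signature happens to contain line-ending bytes, which is exactly the kind of data that text mode can alter on some platforms.</p>

```python
import os
import tempfile

# Eight bytes that are not text: the PNG file signature.
payload = b"\x89PNG\r\n\x1a\n"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # "wb" and "rb" request binary mode: no decoding and no
    # translation of line endings takes place.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)
```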
<h3>Use Explicit Encodings and Be Consistent</h3>
<p>Although WebStack has some support for detecting character encodings
used in requests, it is often best for your application to exercise
control over which encoding is used when <a href="parameters.html">inspecting
request parameters</a> and when <a href="responses.html">producing responses</a>.
The best way to do this is to decide which encoding is most suitable for
the data presented and received in your application and then to use it
throughout.</p>
<p>One approach which works acceptably for smaller applications is to define
an attribute (or a global) which is conveniently accessible and which
can be used directly with various transaction methods. Here is an
outline of code which does this:</p>
<pre>from WebStack.Generic import ContentType<br /><br />class MyResource:<br /><br />    encoding = "utf-8"                                                     # We decide on "utf-8" as our chosen<br />                                                                           # encoding.<br />    def respond(self, trans):<br />        # [Do various things.]<br /><br />        fields = trans.get_fields_from_body(encoding=self.encoding)        # Explicitly use the encoding.<br /><br />        # [Do other things with the Unicode values from the fields.]<br /><br />        trans.set_content_type(ContentType("text/html", self.encoding))    # The output Web page uses the encoding.<br /><br />        # [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.]</pre>
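<p>To see why consistency matters, the following sketch (plain Python, independent of WebStack, with made-up sample text) encodes some text with the chosen encoding and then decodes it twice: once with the same encoding, and once with a different one, as a browser or another component might if it were told, or guessed, the wrong encoding.</p>

```python
encoding = "utf-8"                  # the encoding chosen for the application

page_text = u"caf\u00e9"            # Unicode text to be sent
raw = page_text.encode(encoding)    # what actually travels over the wire

# Decoding with the same encoding recovers the original text.
round_trip = raw.decode(encoding)

# Decoding with a different encoding raises no error here, but it
# silently produces the wrong characters.
garbled = raw.decode("iso-8859-1")
```

<p>The garbled result contains two characters where the accented letter should be, which is the familiar symptom of an encoding mismatch between producer and consumer.</p>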
<h3>Use EncodingSelector to Set the Default Encoding</h3>
<p>An arguably better approach is to use selectors (as described in <a href="selectors.html">"Selectors - Components for Dispatching to Resources"</a>), typically in a "site map" arrangement (as described in <a href="deploying.html">"Deploying a WebStack Application"</a>), specifically using the <code>EncodingSelector</code>:</p>
<pre>from WebStack.Generic import ContentType<br />from WebStack.Resources.Selectors import EncodingSelector           # Assumed module location; see selectors.html.<br /><br />class MyResource:<br /><br />    def respond(self, trans):<br />        # [Do various things.]<br /><br />        fields = trans.get_fields_from_body()                       # Encoding set by EncodingSelector.<br /><br />        # [Do other things with the Unicode values from the fields.]<br /><br />        trans.set_content_type(ContentType("text/html"))            # The output Web page uses the default encoding.<br /><br />        # [Produce the response, making sure that the default encoding is used to convert Unicode to raw strings.]<br /><br />def get_site_map():<br /><br />    return EncodingSelector(MyResource(), "utf-8")</pre>
<h3>Tell Other Components Which Encoding to Use</h3>
<p>When using other components to generate content (see <a href="integrating.html">"Integrating with Other Systems"</a>),
such components may just write the generated content
straight to a normal stream (rather than one wrapped by a <code>codecs</code>
module function). In such cases, for textual content
such as XML or related formats (XHTML, SVG, HTML), you will probably need to
instruct the component to use your chosen encoding; for example:</p>
<pre>        # In the respond method, xml_document is an xml.dom.minidom.Document object...<br />        xml_document.toxml(self.encoding)</pre>
<p>This will then generate the appropriate characters in the output <span style="font-style: italic;">and</span> specify the correct encoding
for the XML document.</p>
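<p>Both effects can be observed with a small standalone sketch using <code>xml.dom.minidom</code> from the standard library. The tiny document and the choice of ISO-8859-1 as the output encoding are made up for this illustration.</p>

```python
from xml.dom import minidom

encoding = "iso-8859-1"             # the encoding chosen for the output

# Parse a small document supplied as UTF-8 bytes.
xml_document = minidom.parseString(u"<p>caf\u00e9</p>".encode("utf-8"))

# Serialising with an explicit encoding yields encoded bytes whose XML
# declaration names that encoding.
output = xml_document.toxml(encoding)
```

<p>The accented letter is written as the single ISO-8859-1 byte, and the declaration at the start of the output names the same encoding, so a consumer of the document can decode it correctly.</p>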
</body></html>