WebStack

Annotated docs/encodings.html

383:74ed715c5455
2005-05-01 paulb [project @ 2005-05-01 18:16:52 by paulb] Added missing example for Zope.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Character Encodings</title>
  <meta name="generator"
 content="amaya 8.1a, see http://www.w3.org/Amaya/" />
  <link href="styles.css" rel="stylesheet" type="text/css" />
</head>
<body>
<h1>Character Encodings</h1>
<p>When writing applications with WebStack, you should try to use
Python's Unicode objects as much as possible. However, there are a
number of places where plain Python strings can be involved:</p>
<ul>
  <li><a href="parameters.html">Inspecting request parameters</a></li>
  <li><a href="responses.html">Sending output in a response</a></li>
  <li><a href="parameters.html">Receiving uploaded content</a></li>
  <li><a href="state.html">Accessing cookie information</a></li>
  <li><a href="sessions.html">Accessing session information</a></li>
</ul>
<p>When Web pages (and other types of content) are sent to and from
users of your application, the text will be in some kind of character
encoding. For example, in English-speaking environments, the US-ASCII
encoding is common and contains the basic letters, numbers and symbols
used in English, whereas in Western Europe encodings like
ISO-8859-1 and ISO-8859-15 are typically used, since they contain
additional letters and symbols in order to support other languages.
Often, UTF-8 is used to encode text because it can represent every
Unicode character, so it covers most languages simultaneously and is
therefore flexible enough for many applications.</p>
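<p>As a rough illustration of how these encodings differ, the sketch
below encodes one piece of text in each of them. It is written for
Python 3, where <code>str</code> plays the role of the Unicode objects
described here and <code>bytes</code> the role of plain strings; the
helper name <code>try_encode</code> is made up for this example.</p>

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
text = "caf\u00e9 \u20ac5"   # e-acute and the euro sign


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_bytes = try_encode(text, "us-ascii")       # None: US-ASCII has no e-acute
latin1_bytes = try_encode(text, "iso-8859-1")    # None: ISO-8859-1 has no euro sign
latin9_bytes = try_encode(text, "iso-8859-15")   # one byte per character
utf8_bytes = try_encode(text, "utf-8")           # multi-byte sequences where needed
```

<p>Only the last two encodings can represent the whole text, and they
produce different bytes for it, which is why the receiver must know
which encoding was used.</p>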
<p>When URLs are received by applications, interpreting the request
parameters they contain is a bit more awkward. The URL itself is
written in US-ASCII, but it may contain special numeric codes that
indicate character values in the original text encoding; see the <a href="parameters.html">description
of query strings</a> for more information.</p>
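<p>The numeric codes mentioned above can be seen with the Python 3
standard library, independently of WebStack. This is only a sketch of
the general mechanism, not of how WebStack itself decodes parameters;
the sample value is invented.</p>

```python
from urllib.parse import quote, unquote

value = "caf\u00e9"

# The same character becomes different codes under different encodings.
as_utf8 = quote(value, encoding="utf-8")           # two bytes for e-acute
as_latin1 = quote(value, encoding="iso-8859-1")    # one byte for e-acute

# Decoding with the encoding that was used recovers the text...
round_trip = unquote(as_utf8, encoding="utf-8")

# ...whereas guessing wrongly yields a replacement character instead.
wrong_guess = unquote(as_latin1, encoding="utf-8")
```

<p>The URL gives no indication of which encoding produced the codes,
so the application has to know or decide which one to apply.</p>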
<h2>Recommendations</h2>
<dl>
  <dt>The following recommendations should help you avoid issues with
incorrect characters in the Web pages (and other content) that you
produce:</dt>
</dl>
<h3>Use Unicode Objects for Textual Content</h3>
<p>Handling text in specific encodings using normal Python strings can
be difficult, and handling text in multiple encodings in the same
application can be highly error-prone. Fortunately, Python has support
for Unicode objects, which let you think of letters, numbers, symbols
and all other characters in an abstract way.</p>
<ul>
  <li>Convert textual content to Unicode as soon as possible (see below
for choosing encodings).</li>
  <li>If you must include hard-coded messages in your application code,
make sure to specify the encoding using the <a
 href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
at the top of your source file.</li>
  <li>Remember that the standard library <code>codecs</code>
module contains useful functions to access streams as if Unicode
objects were being transmitted; for example:</li>
</ul>
<pre>import codecs<br /><br />class MyResource:<br /><br />    encoding = "utf-8"<br /><br />    def respond(self, trans):<br />        stream = trans.get_request_stream()                         # only reads strings<br />        unicode_stream = codecs.getreader(self.encoding)(stream)    # reads Unicode objects<br /><br />        # [Some activity...]<br /><br />        out = trans.get_response_stream()                           # only writes strings<br />        unicode_out = codecs.getwriter(self.encoding)(out)          # writes Unicode objects</pre>
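<p>The same wrapping can be tried outside WebStack. The following
Python 3 sketch uses in-memory <code>io.BytesIO</code> objects as
stand-ins for the request and response streams; they are assumptions
made for the sake of a self-contained example, not WebStack
objects.</p>

```python
import codecs
import io

encoding = "utf-8"

# Stand-in for a request stream: it only yields encoded bytes.
request_stream = io.BytesIO("na\u00efve caf\u00e9\n".encode(encoding))
unicode_in = codecs.getreader(encoding)(request_stream)
line = unicode_in.readline()          # decoded text, line ending included

# Stand-in for a response stream: it only accepts encoded bytes.
response_stream = io.BytesIO()
unicode_out = codecs.getwriter(encoding)(response_stream)
unicode_out.write(line.upper())       # text in, encoded bytes out

raw = response_stream.getvalue()
```

<p>The application code between the two wrappers only ever sees text,
and the encoding is named in exactly one place.</p>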
<h3>Use Strings for Binary Content</h3>
<p>If you are reading and writing binary content, Unicode objects are
inappropriate. Make sure to open files in binary mode, where necessary.</p>
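<p>A minimal Python 3 sketch of binary-mode file handling follows. The
file name and payload are invented for the example; the payload
contains a carriage return and line feed pair, the kind of byte
sequence that text-mode newline translation can alter.</p>

```python
import os
import tempfile

# Eight arbitrary binary bytes, including 0x0D 0x0A (CR LF).
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "upload.bin")

    # "wb" and "rb" bypass any decoding and newline translation.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()
```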
<h3>Use Explicit Encodings and Be Consistent</h3>
<p>Although WebStack has some support for detecting character encodings
used in requests, it is often best for your application to exercise
control over which encoding is used when <a href="parameters.html">inspecting
request parameters</a> and when <a href="responses.html">producing responses</a>.
The best way to do this is to decide which encoding is most suitable for
the data presented and received in your application and then to use it
throughout. Here is an outline of code which does this:</p>
<pre>from WebStack.Generic import ContentType<br /><br />class MyResource:<br /><br />    encoding = "utf-8"                                                     # We decide on "utf-8" as our chosen<br />                                                                           # encoding.<br />    def respond(self, trans):<br />        # [Do various things.]<br /><br />        fields = trans.get_fields_from_body(encoding=self.encoding)        # Explicitly use the encoding.<br /><br />        # [Do other things with the Unicode values from the fields.]<br /><br />        trans.set_content_type(ContentType("text/html", self.encoding))    # The output Web page uses the encoding.<br /><br />        # [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.]</pre>
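<p>The principle of choosing one encoding and using it at every
boundary can also be sketched without any framework. In the Python 3
example below, <code>ENCODING</code>, <code>content_type_header</code>
and <code>handle</code> are hypothetical names and are not part of
WebStack.</p>

```python
ENCODING = "utf-8"   # the single encoding chosen for the application


def content_type_header(media_type, encoding):
    """Build a Content-Type value that announces the chosen encoding."""
    return "%s; charset=%s" % (media_type, encoding)


def handle(raw_field):
    """Decode input, work with text, and encode output consistently."""
    name = raw_field.decode(ENCODING)       # bytes in, text inside
    message = "Hello, %s!" % name           # all processing uses text
    header = content_type_header("text/plain", ENCODING)
    return header, message.encode(ENCODING)  # text out as bytes


header, body = handle("Zo\u00eb".encode(ENCODING))
```

<p>Because the same constant is used for decoding, for the declared
content type and for encoding, the three cannot drift apart.</p>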
<h3>Tell Other Components About Encodings</h3>
<p>When using other components to generate content (see <a
 href="integrating.html">"Integrating with Other Systems"</a>), it may
be the case that such components will just write the generated content
straight to a normal stream (rather than one wrapped by a <code>codecs</code>
module function). In such cases, it is likely that for textual content
such as XML or related formats (XHTML, SVG, HTML) you will need to
instruct the component to use your chosen encoding; for example:</p>
<pre>        # In the respond method, xml_document is an xml.dom.minidom.Document object...<br />        xml_document.toxml(self.encoding)</pre>
<p>This will then generate the appropriate characters in the output <span
 style="font-style: italic;">and</span> specify the correct encoding
for the XML document.</p>
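<p>Both effects can be observed with a small document built by hand.
The sketch below assumes Python 3, where <code>toxml</code> returns
<code>bytes</code> when given an encoding; the element name and text
are invented.</p>

```python
from xml.dom.minidom import Document

# Build a one-element document containing a non-ASCII character.
doc = Document()
root = doc.createElement("greeting")
root.appendChild(doc.createTextNode("caf\u00e9"))
doc.appendChild(root)

# Serialise with an explicit encoding.
raw = doc.toxml("iso-8859-1")
```

<p>The result begins with an XML declaration naming
<code>iso-8859-1</code>, and the e-acute is written as the single byte
that encoding assigns to it.</p>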
</body>
</html>