1.1 --- a/docs/encodings.html Tue Apr 26 18:33:06 2005 +0000
1.2 +++ b/docs/encodings.html Sat Apr 30 00:21:44 2005 +0000
1.3 @@ -1,48 +1,90 @@
1.4 -<?xml version="1.0" encoding="iso-8859-1"?>
1.5 -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
1.6 - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
1.7 +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
1.8 <html xmlns="http://www.w3.org/1999/xhtml">
1.9 <head>
1.10 <title>Character Encodings</title>
1.11 - <meta name="generator" content="amaya 8.1a, see http://www.w3.org/Amaya/" />
1.12 + <meta name="generator"
1.13 + content="amaya 8.1a, see http://www.w3.org/Amaya/" />
1.14 <link href="styles.css" rel="stylesheet" type="text/css" />
1.15 </head>
1.16 -
1.17 <body>
1.18 <h1>Character Encodings</h1>
1.19 -
1.20 -<p>WebStack tries to let applications work with Unicode as much as possible,
1.21 -but there are two places where plain Python strings can be involved:</p>
1.22 +<p>When writing applications with WebStack, you should try to use
1.23 +Python's Unicode objects as much as possible. However, there are a
1.24 +number of places where plain Python strings can be involved:</p>
1.25 <ul>
1.26 - <li>When <a href="responses.html">output is prepared</a> - for example, Web
1.27 - pages.</li>
1.28 <li>When <a href="parameters.html">inspecting request parameters</a>.</li>
1.29 + <li>When <a href="responses.html">sending output in a response</a>.</li>
1.30 + <li>When <a href="parameters.html">receiving uploaded content</a>.</li>
1.31 + <li>When <a href="state.html">accessing cookie information</a>.</li>
1.32 </ul>
1.33 -
1.34 +<p>When Web pages (and other types of content) are sent to and from
1.35 +users of your application, the text will be in some kind of character
1.36 +encoding. For example, in English-speaking environments, the US-ASCII
1.37 +encoding is common and contains the basic letters, numbers and symbols
1.38 +used in English, whereas in Western Europe encodings like
1.39 +ISO-8859-1 and ISO-8859-15 are typically used, since they contain
1.40 +additional letters and symbols in order to support other languages.
1.41 +Often, UTF-8 is used to encode text because it can represent every
1.42 +Unicode character and is therefore flexible enough for many applications.</p>
1.43 +<p>When URLs are received in applications and their request
1.44 +parameters need to be interpreted, the situation is a bit more
1.45 +awkward. The URL itself is written in US-ASCII, but it may contain
1.46 +percent-encoded numeric codes (such as <code>%E9</code>) which stand
1.47 +for byte values in the original text encoding - see the <a href="parameters.html">description
1.48 +of query strings</a> for more information.</p>
1.49 <h2>Recommendations</h2>
1.50 -
1.51 -<p>Although WebStack has some support for detecting character encodings used
1.52 -in requests, it is often best for your application to exercise control over
1.53 -which encoding is used when <a href="parameters.html">inspecting request
1.54 -parameters</a> and when <a href="responses.html">producing responses</a>. The
1.55 -best way to do this is to decide which encoding is most suitable for the data
1.56 -presented and received in your application and then to use it throughout.
1.57 +<dl>
1.58 + <dt>The following recommendations should help you avoid issues with
1.59 +incorrect characters in the Web pages (and other content) that you
1.60 +produce:</dt>
1.61 +</dl>
1.62 +<h3>Use Unicode Objects for Textual Content</h3>
1.63 +<p>Handling text in specific encodings using normal Python strings can
1.64 +be difficult, and handling text in multiple encodings in the same
1.65 +application can be highly error-prone. Fortunately, Python has support
1.66 +for Unicode objects which let you think of letters, numbers, symbols
1.67 +and all other characters in an abstract way.</p>
1.68 +<ul>
1.69 + <li>Convert textual content to Unicode as soon as possible (see below
1.70 +for choosing encodings).</li>
1.71 + <li>If you must include hard-coded messages in your application code,
1.72 +make sure to specify the encoding using the <a
1.73 + href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
1.74 +at the top of your source file.</li>
1.75 + <li>Remember that the standard library <code>codecs</code>
1.76 +module contains useful functions to access streams as if Unicode
1.77 +objects were being transmitted; for example:</li>
1.78 +</ul>
1.79 +<pre>import codecs<br /><br />class MyResource:<br /><br />    encoding = "utf-8"<br /><br />    def respond(self, trans):<br />        stream = trans.get_request_stream()  # only reads strings<br />        unicode_stream = codecs.getreader(self.encoding)(stream)  # reads Unicode objects<br /><br />        [Some activity...]<br /><br />        out = trans.get_response_stream()  # only writes strings<br />        unicode_out = codecs.getwriter(self.encoding)(out)  # writes Unicode objects</pre>
1.80 +<h3>Use Strings for Binary Content</h3>
1.81 +<p>If you are reading and writing binary content, Unicode objects are
1.82 +inappropriate. Make sure to open files in binary mode, where necessary.</p>
1.83 +<h3>Use Explicit Encodings and Be Consistent</h3>
1.84 +<p>Although WebStack has some support for detecting character encodings
1.85 +used
1.86 +in requests, it is often best for your application to exercise control
1.87 +over
1.88 +which encoding is used when <a href="parameters.html">inspecting
1.89 +request
1.90 +parameters</a> and when <a href="responses.html">producing responses</a>.
1.91 +The
1.92 +best way to do this is to decide which encoding is most suitable for
1.93 +the data
1.94 +presented and received in your application and then to use it
1.95 +throughout.
1.96 Here is an outline of code which does this:</p>
1.97 -<pre>from WebStack.Generic import ContentType
1.98 -
1.99 -class MyResource:
1.100 -
1.101 - encoding = "utf-8" # We decide on "utf-8" as our chosen
1.102 - # encoding.
1.103 - def respond(self, trans):
1.104 - [Do various things.]
1.105 -
1.106 - fields = trans.get_fields_from_body(encoding=self.encoding) # Explicitly use the encoding.
1.107 -
1.108 - [Do other things with the Unicode values from the fields.]
1.109 -
1.110 - trans.set_content_type(ContentType("text/html", self.encoding)) # The output Web page uses the encoding.
1.111 -
1.112 - [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.] </pre>
1.113 +<pre>from WebStack.Generic import ContentType<br /><br />class MyResource:<br /><br />    encoding = "utf-8"  # We decide on "utf-8" as our chosen encoding.<br /><br />    def respond(self, trans):<br />        [Do various things.]<br /><br />        fields = trans.get_fields_from_body(encoding=self.encoding)  # Explicitly use the encoding.<br /><br />        [Do other things with the Unicode values from the fields.]<br /><br />        trans.set_content_type(ContentType("text/html", self.encoding))  # The output Web page uses the encoding.<br /><br />        [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.]</pre>
1.114 +<h3>Communicate Encodings to Other Components</h3>
1.115 +<p>When using other components to generate content (see <a
1.116 + href="integrating.html">"Integrating with Other Systems"</a>), it may
1.117 +be the case that such components will just write the generated content
1.118 +straight to a normal stream (rather than one wrapped by a <code>codecs</code>
1.119 +module function). In such cases, it is likely that for textual content
1.120 +such as XML or related formats (XHTML, SVG, HTML) you will need to
1.121 +instruct the component to use your chosen encoding; for example:</p>
1.122 +<pre>        # In the respond method, xml_document is an xml.dom.minidom.Document object<br />        # and out is the plain stream obtained from trans.get_response_stream()...<br />        out.write(xml_document.toxml(self.encoding))</pre>
1.123 +<p>This will then generate the appropriate characters in the output <span
1.124 + style="font-style: italic;">and</span> specify the correct encoding
1.125 +for the XML document.</p>
1.126 </body>
1.127 </html>
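
The stream-wrapping and XML-serialisation advice in the added documentation can be tried outside WebStack. The following is a minimal sketch, not WebStack code: it assumes a current Python 3 interpreter, where the "Unicode objects" of the text correspond to `str` and "plain strings" to `bytes`. It uses `io.BytesIO` objects as stand-ins for the request and response streams, and the names `ENCODING`, `echo_upper` and `serialise` are hypothetical, chosen only for illustration.

```python
import codecs
import io
from xml.dom import minidom

ENCODING = "utf-8"  # hypothetical: the single encoding chosen for the application


def echo_upper(request_stream, response_stream, encoding=ENCODING):
    """Read encoded bytes as text, process the text, write encoded bytes."""
    reader = codecs.getreader(encoding)(request_stream)    # decodes while reading
    writer = codecs.getwriter(encoding)(response_stream)   # encodes while writing
    text = reader.read()
    writer.write(text.upper())
    return text


def serialise(document, encoding=ENCODING):
    """Return the document as bytes, naming the encoding in the XML declaration."""
    return document.toxml(encoding)


# Stand-ins for the byte-oriented request and response streams.
request = io.BytesIO("caf\u00e9".encode(ENCODING))
response = io.BytesIO()
received = echo_upper(request, response)

# Build a small document containing a non-ASCII character.
doc = minidom.Document()
para = doc.createElement("p")
para.appendChild(doc.createTextNode("caf\u00e9"))
doc.appendChild(para)
xml_bytes = serialise(doc)
```

Passing the encoding to `toxml` affects both the bytes produced and the encoding named in the XML declaration, which is the reason the documentation recommends giving the chosen encoding to such components rather than encoding their output afterwards.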