Removed recoding to UTF-8, since it failed for ISO-8859-15: byte sequences in the source were recoded as UTF-8, and ISO-8859-1 avoided such undesirable data only because it was special-cased. This change may break other encodings that are ASCII-incompatible: UTF-8 is likely to be the safe form of such data, one that the parser can understand, and without the recoding the parser will no longer recognise the grammar's tokens.
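
As a rough illustration of the ISO-8859-15 problem, the sketch below assumes a string literal holding the single byte 0xA4 (the euro sign in ISO-8859-15). The byte value and all variable names are chosen for illustration and do not appear in the changeset. It only shows that recoding the source to UTF-8 changes the bytes inside the literal; it does not reproduce the compiler's own code path.

```python
# Illustrative only: shows how recoding changes the bytes of a literal.
# The byte value and variable names are assumptions, not from the changeset.

# A source line as stored in an ISO-8859-15 file; 0xA4 is the euro sign there.
source_bytes = b's = "\xa4"'

# Roughly what the removed recode_to_utf8 step did to the source text.
recoded = source_bytes.decode("iso-8859-15").encode("utf-8")

# Extract the bytes between the quotes, before and after recoding.
original_payload = source_bytes[5:-1]
recoded_payload = recoded[5:-1]

print(repr(original_payload))  # one byte
print(repr(recoded_payload))   # three UTF-8 bytes
```

If the literal is meant to be a byte string, the recoded form no longer holds the byte the author wrote, which appears to be the kind of undesirable data the message describes.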

--- a/compiler/transformer.py	Fri Feb 03 23:25:00 2017 +0100
+++ b/compiler/transformer.py	Sat Feb 04 00:10:47 2017 +0100
@@ -669,11 +669,6 @@
 
     def decode_literal(self, lit):
         if self.encoding:
-            # this is particularly fragile & a bit of a
-            # hack... changes in compile.c:parsestr and
-            # tokenizer.c must be reflected here.
-            if self.encoding not in ['utf-8', 'iso-8859-1']:
-                lit = unicode(lit, 'utf-8').encode(self.encoding)
             return eval("# coding: %s\n%s" % (self.encoding, lit))
         else:
             return eval(lit)

--- a/pyparser/pyparse.py	Fri Feb 03 23:25:00 2017 +0100
+++ b/pyparser/pyparse.py	Sat Feb 04 00:10:47 2017 +0100
@@ -1,13 +1,6 @@
 from pyparser import parser, pytokenizer, pygram, error
 from pyparser import consts
 
-def recode_to_utf8(bytes, encoding):
-    text = bytes.decode(encoding)
-    if not isinstance(text, unicode):
-        raise error.SyntaxError("codec did not return a unicode object")
-    recoded = text.encode("utf-8")
-    return recoded
-
 def _normalize_encoding(encoding):
     """returns normalized name for <encoding>
 
@@ -103,17 +96,6 @@
                                         filename=compile_info.filename)
         else:
             enc = _normalize_encoding(_check_for_encoding(textsrc))
-            if enc is not None and enc not in ('utf-8', 'iso-8859-1'):
-                try:
-                    textsrc = recode_to_utf8(textsrc, enc)
-                except LookupError as e:
-                    # if the codec is not found, LookupError is raised.
-                    raise error.SyntaxError("Unknown encoding: %s" % enc,
-                                            filename=compile_info.filename)
-                # Transform unicode errors into SyntaxError
-                except UnicodeDecodeError as e:
-                    message = str(e)
-                    raise error.SyntaxError(message)
 
         flags = compile_info.flags
 
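
The risk to ASCII-incompatible encodings can be sketched as follows. UTF-16 is an assumed example (the changeset does not name a specific encoding), and the variable names are hypothetical. The sketch only checks whether an ASCII token sequence is visible in the raw bytes, which stands in for what a byte-oriented tokenizer would need.

```python
# Illustrative only: why an ASCII-incompatible encoding needs recoding
# before a byte-oriented tokenizer can see the grammar's tokens.
# UTF-16 is an assumed example; the changeset does not name one.

text = u"x = 1\n"
utf16_bytes = text.encode("utf-16-le")

# Each ASCII character is followed by a NUL byte, so the ASCII token
# sequence is not present in the raw bytes.
ascii_visible_raw = b"x = 1" in utf16_bytes

# Recoding to UTF-8, as the removed code did, yields ASCII-compatible bytes.
utf8_bytes = utf16_bytes.decode("utf-16-le").encode("utf-8")
ascii_visible_recoded = b"x = 1" in utf8_bytes

print(ascii_visible_raw, ascii_visible_recoded)
```

Without a recoding step of this kind, source in such an encoding reaches the parser in a form where the tokens cannot be matched byte for byte.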
2017-02-04	Paul Boddie
			compiler/transformer.py pyparser/pyparse.py