Lichen

docs/wiki/History

905:709a26b53318
2019-06-02 Paul Boddie Merged changes from the default branch. trailing-data
     1 = History =     2      3 Lichen is one of a number of experiments focused on the analysis of Python     4 source code with potential reliability and performance improvements in mind.     5 There have been a number of projects that can be regarded as its predecessors,     6 the first of these being a project with the name "Analysis" that was inspired     7 by the then-recent announcement of the     8 [[https://dspace.mit.edu/handle/1721.1/16688|Starkiller]] tool, which     9 attempted to compile Python to C++ (and was never publicly released). With the    10 emergence of the [[http://pypy.org/|PyPy]] project, the notion of writing a    11 Python implementation in a dialect of Python was also considered attractive.    12     13 == Initial Experiments ==    14     15 After some experience gained, a successor to "Analysis" was attempted with the    16 uninspired, but distinct, name of "analysis". This attempted to apply type    17 inference and also generate C programs that could be compiled, albeit with a    18 very simple run-time model that was not really amenable to further    19 improvement. However, some useful concepts were explored, such as abstract    20 syntax tree (AST) node annotations that could be used to direct code    21 generation activities and to provide annotated source code and documentation    22 summaries.    23     24 Various experiences with specialising functions and methods, plus general    25 experience working with the AST led to a successor called "simplify", which    26 attempted to extend specialisation to classes, thus introducing the notion of    27 multiple forms of a single class whose attributes have differing    28 characteristics. It also introduced the idea of a simplified instruction set    29 for program operations, in contrast to Python bytecode whose instructions can    30 represent relatively complicated operations, along with a simplified program    31 representation that was intended to make program analysis easier. Issues    32 around "data polymorphism" led to the realisation that attempting to identify    33 specific types, in order to facilitate insights into program behaviour or    34 optimisations of programs, was costly and not necessarily effective.    35     36 (Conversations with the author of [[http://shedskin.github.io/|Shed Skin]], a    37 tool already available at this time, which compiles Python programs complying    38 to certain restrictions to C++, were very useful in considering data    39 polymorphism issues and others. Shed Skin must effectively be what Starkiller    40 promised to be, and it has even been available in Debian for a number of years    41 so that others can evaluate it for their own needs.)    42     43 == Further Experiences ==    44     45 An alternative approach was taken in the next successor, "micropython" (not to    46 be confused with any other, more popular, project of the same name), where the    47 focus of analysis shifted to defining the attributes provided by program    48 objects and then determining which objects might be used in different places    49 in the program according to the attributes mentioned. It introduced a strategy    50 for representing program objects at run-time, and continued the work on a    51 simplified instruction set, defining its instructions in terms of the run-time    52 structures. The ability to generate annotated source code summaries was    53 enhanced, and the software could also generate programs in the simplified    54 instruction set that could be executed using an interpreter. Although    55 substantial progress was made, the increasing resistance experienced in    56 adapting a growing system to further forms of optimisations, and the need to    57 implement and test run-time functionality on a somewhat unique instruction set    58 architecture meant that progress became slower over time.    59     60 The dependence of "micropython" on costly whole-program analysis to identify    61 optimisations led to some further realisations. The notion of "attribute    62 usage" had allowed program namespaces to have their names characterised, and    63 the notion of well-defined structures had allowed the identification of    64 specific object types throughout programs, albeit with a degree of    65 understandable inaccuracy. Meanwhile, the run-time data model had relied on    66 restrictions on structures to permit relatively efficient structure access    67 operations even when object types cannot be identified. It thus appeared that    68 such techniques might be almost as potent on their own as the combination of    69 more costly techniques explored in earlier projects.    70     71 == Current Work ==    72     73 It was with such realisations that a new project was effectively born.    74 Tentatively called "!PythonLight" but renamed to "Lichen" as the code matured,    75 the objectives now involved a simpler processing framework that merely    76 attempted to catalogue structure members, to determine the origins of such    77 members, and to record data flow within namespaces in order to determine    78 attribute usage on names. The name "Lichen" can be pronounced in a way that is    79 a contraction of "like Python", indicating that although the language being    80 analysed is similar to Python, restrictions apply that make it not identical    81 to it.    82     83 Unlike previous efforts, instead of annotating the AST built by the Python    84 parser, separate catalogues would record details of program operations for    85 more convenient consumption by potentially separate analysis tools. It is    86 worth emphasising that much of the processing work done to characterise    87 program behaviour is effectively database work: searching for information that    88 matches particular criteria. Emitting tabular data for perusal and potential    89 use by other tools is a recognition of this, and it would not be unreasonable    90 to expect future versions of tools of this nature to actually employ more    91 conventional database management systems to facilitate program analysis.    92     93 Instead of targeting a project-specific virtual machine, the ultimate    94 objective of processing was amended to generate a C program that could be    95 compiled using existing tools. Although care must be taken to generate C    96 source code, delegating the low-level work of actually producing an executable    97 reduces the scope of the project and its associated demands.    98     99 Despite reasonable progress being made, certain features retained from Python   100 proved to be a burden, notably the behaviour of Python's import system and the   101 way in which programs can quickly accumulate superfluous code through that   102 system's operation. To focus on genuinely important functionality and on   103 producing working code, the project was [[../Restarted|restarted]] with   104 various additional simplifications made to the supported language.   105 Fortunately, this helped to drive progress, resulting in the development of a   106 toolchain that can generate C programs for compilation and execution.   107    108 Lichen attempts to balance the goals of requiring relatively inexpensive   109 analysis, offering meaningful insights into program behaviour, providing   110 output in the form of working, translated, standalone programs, and   111 encouraging further experimentation in the areas of language design, program   112 analysis and optimisation. Much of the standard library is itself written in   113 Lichen, even functionality that would not be written in Python for CPython,   114 thus realising one of the original ambitions for this work and its   115 predecessors.