Lichen

Annotated docs/wiki/History

905:709a26b53318
2019-06-02 Paul Boddie Merged changes from the default branch. trailing-data
paul@810 1
= History =
paul@810 2
paul@861 3
Lichen is one of a number of experiments focused on the analysis of Python
paul@861 4
source code with potential reliability and performance improvements in mind.
paul@861 5
There have been a number of projects that can be regarded as its predecessors,
paul@861 6
the first of these being a project with the name "Analysis" that was inspired
paul@861 7
by the then-recent announcement of the
paul@861 8
[[https://dspace.mit.edu/handle/1721.1/16688|Starkiller]] tool, which
paul@861 9
attempted to compile Python to C++ (and was never publicly released). With the
paul@861 10
emergence of the [[http://pypy.org/|PyPy]] project, the notion of writing a
paul@861 11
Python implementation in a dialect of Python was also considered attractive.
paul@810 12
paul@810 13
== Initial Experiments ==
paul@810 14
paul@861 15
After some experience gained, a successor to "Analysis" was attempted with the
paul@861 16
uninspired, but distinct, name of "analysis". This attempted to apply type
paul@861 17
inference and also generate C programs that could be compiled, albeit with a
paul@861 18
very simple run-time model that was not really amenable to further
paul@861 19
improvement. However, some useful concepts were explored, such as abstract
paul@861 20
syntax tree (AST) node annotations that could be used to direct code
paul@861 21
generation activities and to provide annotated source code and documentation
paul@861 22
summaries.
paul@810 23
paul@861 24
Various experiences with specialising functions and methods, plus general
paul@861 25
experience working with the AST led to a successor called "simplify", which
paul@861 26
attempted to extend specialisation to classes, thus introducing the notion of
paul@861 27
multiple forms of a single class whose attributes have differing
paul@861 28
characteristics. It also introduced the idea of a simplified instruction set
paul@861 29
for program operations, in contrast to Python bytecode whose instructions can
paul@861 30
represent relatively complicated operations, along with a simplified program
paul@861 31
representation that was intended to make program analysis easier. Issues
paul@861 32
around "data polymorphism" led to the realisation that attempting to identify
paul@861 33
specific types, in order to facilitate insights into program behaviour or
paul@861 34
optimisations of programs, was costly and not necessarily effective.
paul@810 35
paul@861 36
(Conversations with the author of [[http://shedskin.github.io/|Shed Skin]], a
paul@861 37
tool already available at this time, which compiles Python programs complying
paul@861 38
to certain restrictions to C++, were very useful in considering data
paul@861 39
polymorphism issues and others. Shed Skin must effectively be what Starkiller
paul@861 40
promised to be, and it has even been available in Debian for a number of years
paul@861 41
so that others can evaluate it for their own needs.)
paul@810 42
paul@810 43
== Further Experiences ==
paul@810 44
paul@861 45
An alternative approach was taken in the next successor, "micropython" (not to
paul@861 46
be confused with any other, more popular, project of the same name), where the
paul@861 47
focus of analysis shifted to defining the attributes provided by program
paul@861 48
objects and then determining which objects might be used in different places
paul@861 49
in the program according to the attributes mentioned. It introduced a strategy
paul@861 50
for representing program objects at run-time, and continued the work on a
paul@861 51
simplified instruction set, defining its instructions in terms of the run-time
paul@861 52
structures. The ability to generate annotated source code summaries was
paul@861 53
enhanced, and the software could also generate programs in the simplified
paul@861 54
instruction set that could be executed using an interpreter. Although
paul@861 55
substantial progress was made, the increasing resistance experienced in
paul@861 56
adapting a growing system to further forms of optimisations, and the need to
paul@861 57
implement and test run-time functionality on a somewhat unique instruction set
paul@861 58
architecture meant that progress became slower over time.
paul@810 59
paul@861 60
The dependence of "micropython" on costly whole-program analysis to identify
paul@861 61
optimisations led to some further realisations. The notion of "attribute
paul@861 62
usage" had allowed program namespaces to have their names characterised, and
paul@861 63
the notion of well-defined structures had allowed the identification of
paul@861 64
specific object types throughout programs, albeit with a degree of
paul@861 65
understandable inaccuracy. Meanwhile, the run-time data model had relied on
paul@861 66
restrictions on structures to permit relatively efficient structure access
paul@861 67
operations even when object types cannot be identified. It thus appeared that
paul@861 68
such techniques might be almost as potent on their own as the combination of
paul@861 69
more costly techniques explored in earlier projects.
paul@810 70
paul@810 71
== Current Work ==
paul@810 72
paul@861 73
It was with such realisations that a new project was effectively born.
paul@861 74
Tentatively called "!PythonLight" but renamed to "Lichen" as the code matured,
paul@861 75
the objectives now involved a simpler processing framework that merely
paul@861 76
attempted to catalogue structure members, to determine the origins of such
paul@861 77
members, and to record data flow within namespaces in order to determine
paul@861 78
attribute usage on names. The name "Lichen" can be pronounced in a way that is
paul@861 79
a contraction of "like Python", indicating that although the language being
paul@861 80
analysed is similar to Python, restrictions apply that make it not identical
paul@861 81
to it.
paul@810 82
paul@861 83
Unlike previous efforts, instead of annotating the AST built by the Python
paul@861 84
parser, separate catalogues would record details of program operations for
paul@861 85
more convenient consumption by potentially separate analysis tools. It is
paul@861 86
worth emphasising that much of the processing work done to characterise
paul@861 87
program behaviour is effectively database work: searching for information that
paul@861 88
matches particular criteria. Emitting tabular data for perusal and potential
paul@861 89
use by other tools is a recognition of this, and it would not be unreasonable
paul@861 90
to expect future versions of tools of this nature to actually employ more
paul@861 91
conventional database management systems to facilitate program analysis.
paul@810 92
paul@861 93
Instead of targeting a project-specific virtual machine, the ultimate
paul@861 94
objective of processing was amended to generate a C program that could be
paul@861 95
compiled using existing tools. Although care must be taken to generate C
paul@861 96
source code, delegating the low-level work of actually producing an executable
paul@861 97
reduces the scope of the project and its associated demands.
paul@810 98
paul@861 99
Despite reasonable progress being made, certain features retained from Python
paul@861 100
proved to be a burden, notably the behaviour of Python's import system and the
paul@861 101
way in which programs can quickly accumulate superfluous code through that
paul@861 102
system's operation. To focus on genuinely important functionality and on
paul@861 103
producing working code, the project was [[../Restarted|restarted]] with
paul@861 104
various additional simplifications made to the supported language.
paul@861 105
Fortunately, this helped to drive progress, resulting in the development of a
paul@861 106
toolchain that can generate C programs for compilation and execution.
paul@810 107
paul@861 108
Lichen attempts to balance the goals of requiring relatively inexpensive
paul@861 109
analysis, offering meaningful insights into program behaviour, providing
paul@861 110
output in the form of working, translated, standalone programs, and
paul@861 111
encouraging further experimentation in the areas of language design, program
paul@861 112
analysis and optimisation. Much of the standard library is itself written in
paul@861 113
Lichen, even functionality that would not be written in Python for CPython,
paul@861 114
thus realising one of the original ambitions for this work and its
paul@861 115
predecessors.