Lichen (annotate docs/wiki/History in 709a26b53318)

Lichen

Annotated docs/wiki/History

933:520036942e1e

905:709a26b53318

861:f745d151a441

2019-06-02

Paul Boddie

Merged changes from the default branch.

trailing-data

paul@810	1	= History =
paul@810	2
paul@861	3	Lichen is one of a number of experiments focused on the analysis of Python
paul@861	4	source code with potential reliability and performance improvements in mind.
paul@861	5	There have been a number of projects that can be regarded as its predecessors,
paul@861	6	the first of these being a project with the name "Analysis" that was inspired
paul@861	7	by the then-recent announcement of the
paul@861	8	[[https://dspace.mit.edu/handle/1721.1/16688\|Starkiller]] tool, which
paul@861	9	attempted to compile Python to C++ (and was never publicly released). With the
paul@861	10	emergence of the [[http://pypy.org/\|PyPy]] project, the notion of writing a
paul@861	11	Python implementation in a dialect of Python was also considered attractive.
paul@810	12
paul@810	13	== Initial Experiments ==
paul@810	14
paul@861	15	After some experience gained, a successor to "Analysis" was attempted with the
paul@861	16	uninspired, but distinct, name of "analysis". This attempted to apply type
paul@861	17	inference and also generate C programs that could be compiled, albeit with a
paul@861	18	very simple run-time model that was not really amenable to further
paul@861	19	improvement. However, some useful concepts were explored, such as abstract
paul@861	20	syntax tree (AST) node annotations that could be used to direct code
paul@861	21	generation activities and to provide annotated source code and documentation
paul@861	22	summaries.
paul@810	23
paul@861	24	Various experiences with specialising functions and methods, plus general
paul@861	25	experience working with the AST led to a successor called "simplify", which
paul@861	26	attempted to extend specialisation to classes, thus introducing the notion of
paul@861	27	multiple forms of a single class whose attributes have differing
paul@861	28	characteristics. It also introduced the idea of a simplified instruction set
paul@861	29	for program operations, in contrast to Python bytecode whose instructions can
paul@861	30	represent relatively complicated operations, along with a simplified program
paul@861	31	representation that was intended to make program analysis easier. Issues
paul@861	32	around "data polymorphism" led to the realisation that attempting to identify
paul@861	33	specific types, in order to facilitate insights into program behaviour or
paul@861	34	optimisations of programs, was costly and not necessarily effective.
paul@810	35
paul@861	36	(Conversations with the author of [[http://shedskin.github.io/\|Shed Skin]], a
paul@861	37	tool already available at this time, which compiles Python programs complying
paul@861	38	to certain restrictions to C++, were very useful in considering data
paul@861	39	polymorphism issues and others. Shed Skin must effectively be what Starkiller
paul@861	40	promised to be, and it has even been available in Debian for a number of years
paul@861	41	so that others can evaluate it for their own needs.)
paul@810	42
paul@810	43	== Further Experiences ==
paul@810	44
paul@861	45	An alternative approach was taken in the next successor, "micropython" (not to
paul@861	46	be confused with any other, more popular, project of the same name), where the
paul@861	47	focus of analysis shifted to defining the attributes provided by program
paul@861	48	objects and then determining which objects might be used in different places
paul@861	49	in the program according to the attributes mentioned. It introduced a strategy
paul@861	50	for representing program objects at run-time, and continued the work on a
paul@861	51	simplified instruction set, defining its instructions in terms of the run-time
paul@861	52	structures. The ability to generate annotated source code summaries was
paul@861	53	enhanced, and the software could also generate programs in the simplified
paul@861	54	instruction set that could be executed using an interpreter. Although
paul@861	55	substantial progress was made, the increasing resistance experienced in
paul@861	56	adapting a growing system to further forms of optimisations, and the need to
paul@861	57	implement and test run-time functionality on a somewhat unique instruction set
paul@861	58	architecture meant that progress became slower over time.
paul@810	59
paul@861	60	The dependence of "micropython" on costly whole-program analysis to identify
paul@861	61	optimisations led to some further realisations. The notion of "attribute
paul@861	62	usage" had allowed program namespaces to have their names characterised, and
paul@861	63	the notion of well-defined structures had allowed the identification of
paul@861	64	specific object types throughout programs, albeit with a degree of
paul@861	65	understandable inaccuracy. Meanwhile, the run-time data model had relied on
paul@861	66	restrictions on structures to permit relatively efficient structure access
paul@861	67	operations even when object types cannot be identified. It thus appeared that
paul@861	68	such techniques might be almost as potent on their own as the combination of
paul@861	69	more costly techniques explored in earlier projects.
paul@810	70
paul@810	71	== Current Work ==
paul@810	72
paul@861	73	It was with such realisations that a new project was effectively born.
paul@861	74	Tentatively called "!PythonLight" but renamed to "Lichen" as the code matured,
paul@861	75	the objectives now involved a simpler processing framework that merely
paul@861	76	attempted to catalogue structure members, to determine the origins of such
paul@861	77	members, and to record data flow within namespaces in order to determine
paul@861	78	attribute usage on names. The name "Lichen" can be pronounced in a way that is
paul@861	79	a contraction of "like Python", indicating that although the language being
paul@861	80	analysed is similar to Python, restrictions apply that make it not identical
paul@861	81	to it.
paul@810	82
paul@861	83	Unlike previous efforts, instead of annotating the AST built by the Python
paul@861	84	parser, separate catalogues would record details of program operations for
paul@861	85	more convenient consumption by potentially separate analysis tools. It is
paul@861	86	worth emphasising that much of the processing work done to characterise
paul@861	87	program behaviour is effectively database work: searching for information that
paul@861	88	matches particular criteria. Emitting tabular data for perusal and potential
paul@861	89	use by other tools is a recognition of this, and it would not be unreasonable
paul@861	90	to expect future versions of tools of this nature to actually employ more
paul@861	91	conventional database management systems to facilitate program analysis.
paul@810	92
paul@861	93	Instead of targeting a project-specific virtual machine, the ultimate
paul@861	94	objective of processing was amended to generate a C program that could be
paul@861	95	compiled using existing tools. Although care must be taken to generate C
paul@861	96	source code, delegating the low-level work of actually producing an executable
paul@861	97	reduces the scope of the project and its associated demands.
paul@810	98
paul@861	99	Despite reasonable progress being made, certain features retained from Python
paul@861	100	proved to be a burden, notably the behaviour of Python's import system and the
paul@861	101	way in which programs can quickly accumulate superfluous code through that
paul@861	102	system's operation. To focus on genuinely important functionality and on
paul@861	103	producing working code, the project was [[../Restarted\|restarted]] with
paul@861	104	various additional simplifications made to the supported language.
paul@861	105	Fortunately, this helped to drive progress, resulting in the development of a
paul@861	106	toolchain that can generate C programs for compilation and execution.
paul@810	107
paul@861	108	Lichen attempts to balance the goals of requiring relatively inexpensive
paul@861	109	analysis, offering meaningful insights into program behaviour, providing
paul@861	110	output in the form of working, translated, standalone programs, and
paul@861	111	encouraging further experimentation in the areas of language design, program
paul@861	112	analysis and optimisation. Much of the standard library is itself written in
paul@861	113	Lichen, even functionality that would not be written in Python for CPython,
paul@861	114	thus realising one of the original ambitions for this work and its
paul@861	115	predecessors.