Lichen

Annotated docs/wiki/Translation

905:709a26b53318
2019-06-02 Paul Boddie Merged changes from the default branch. trailing-data
paul@810 1
= Translating Programs =
paul@810 2
paul@861 3
The Lichen toolchain currently targets the C programming language, generating
paul@861 4
programs that are then compiled using existing C compilers. The process of
paul@861 5
translation involves the optimisation and generation of program structures,
paul@861 6
and the translation of Lichen language constructs into equivalent C language
paul@861 7
constructs.
paul@810 8
paul@810 9
<<TableOfContents(2,3)>>
paul@810 10
paul@810 11
== Attribute Access Plans ==
paul@810 12
paul@861 13
During [[../Inspection|inspection]], each attribute access is associated with
paul@861 14
a unique location and its details stored. During [[../Deduction|deduction]],
paul@861 15
such accesses are resolved and characterised. During
paul@861 16
[[../Optimisation|optimisation]], such accesses are encoded as '''instruction
paul@861 17
plans''' indicating the operations or instructions required to achieve the
paul@861 18
access. During translation, these instruction plans are generated as part of
paul@861 19
the final program.
paul@810 20
paul@810 21
== Structure Optimisation ==
paul@810 22
paul@861 23
All object structures need to support the attributes defined for them, and
paul@861 24
although it may be possible to determine the identity of attributes in advance
paul@861 25
of a program being run, there generally remains a need to provide mechanisms
paul@861 26
to determine the nature of an object at run-time and to access its attributes
paul@861 27
dynamically. To achieve this, an [[../Optimisation|optimisation]] process
paul@861 28
catalogues the presence of each attribute name on all program objects, then
paul@861 29
deciding upon a position for the structure member corresponding to that
paul@861 30
attribute name in all objects.
paul@810 31
paul@810 32
== Structure Generation ==
paul@810 33
paul@861 34
Each program consists of a number of structures providing the program's
paul@861 35
objects and defining classes, functions, modules and any predefined instances.
paul@810 36
paul@810 37
=== Attribute Representation ===
paul@810 38
paul@861 39
Attributes most typically provide references to objects within a certain
paul@861 40
optional context. Since attributes are employed to encode various other kinds
paul@861 41
of information, the data structure may be interpreted in other ways within
paul@861 42
suitably controlled environments. For example, the `__data__` attribute found
paul@861 43
on instances of certain built-in classes will encode references to other kinds
paul@861 44
of information that are used within native functions.
paul@810 45
paul@810 46
=== Object Representation ===
paul@810 47
paul@861 48
Objects provide collections of attributes, and the object representation is
paul@861 49
meant to support efficient computed access to attributes whilst also allowing
paul@861 50
a reasonably efficient means of testing and accessing attributes at run-time.
paul@810 51
paul@810 52
=== Structure Constants ===
paul@810 53
paul@861 54
For clarity, several classes of symbolic constant are defined in the
paul@861 55
translated program to help define structures or to refer to structure members.
paul@810 56
paul@810 57
|| '''Constant Class''' || '''Purpose''' || '''Example''' || '''Application''' ||
paul@810 58
|| `csize` || Indicates the number of attributes in a class || `__csize___builtins___list_list` ||<|3> Structure definition ||
paul@810 59
|| `isize` || Indicates the number of attributes in an instance || `__isize___builtins___list_list` ||
paul@810 60
|| `msize` || Indicates the number of attributes in a module || `__msize___builtins___list` ||
paul@810 61
|| `pminsize` || Indicates the minimum number of parameters expected by a callable || `__pminsize___builtins___list_list_append` ||<|2> Invocation ||
paul@810 62
|| `pmaxsize` || Indicates the maximum number of parameters expected by a callable || `__pmaxsize___builtins___list_list_append` ||
paul@810 63
|| `pcode` || Assigns a code to a particular parameter name || `__pcode_value` ||<|2> Keyword argument positioning ||
paul@810 64
|| `ppos` || Indicates the table position used by a particular parameter name || `__ppos_value` ||
paul@810 65
|| `code` || Assigns a code to a particular attribute name || `__code_value` ||<|2> Attribute access ||
paul@810 66
|| `pos` || Indicates the table position used by a particular attribute name || `__pos_value` ||
paul@810 67
paul@861 68
Such constants are encoded using helper functions, producing names resembling
paul@861 69
those above that are intended to be distinct and not to conflict with other
paul@861 70
defined names in the final program.
paul@810 71
paul@810 72
=== Instance Structures ===
paul@810 73
paul@861 74
Like all program objects, instances are defined in C according to a generic
paul@861 75
structure, meaning that all instances have the same generic members from the
paul@861 76
perspective of a C program. However, specific structures are defined for each
paul@861 77
class in order to conveniently describe the dimensions of instances of that
paul@861 78
class. For example, for the built-in list class, a declaration resembling the
paul@861 79
following is issued:
paul@810 80
paul@810 81
{{{
paul@810 82
typedef struct {
paul@810 83
    const __table * table;
paul@810 84
    unsigned int pos;
paul@810 85
    __attr attrs[__isize___builtins___list_list___init__];
paul@810 86
} __obj___builtins___list_list___init__;
paul@810 87
}}}
paul@810 88
paul@810 89
== Program Translation ==
paul@810 90
paul@810 91
=== Useful C Language Features ===
paul@810 92
paul@861 93
The translated code relies on the availability of certain useful features of
paul@861 94
C, some of them supported only in modern compilers:
paul@810 95
paul@861 96
 * Array and structure literals: these are used to define values within
paul@861 97
   expressions that would otherwise require separate definition; invocation
paul@861 98
   arguments are defined using inline expressions and the `__ARGS` macro (for
paul@861 99
   unknown invocation targets and for keyword arguments)
paul@861 100
 * [[https://gcc.gnu.org/onlinedocs/cpp/Variadic-Macros.html#Variadic-Macros|
paul@861 101
   Variadic macros]]: such macros permit variable numbers of arguments, thus
paul@861 102
   making the `__ARGS` macro possible
paul@861 103
 * [[WikiPedia:Comma operator]] sequences: these are used to construct
paul@861 104
   instruction sequences within expressions, providing a mechanism for
paul@861 105
   formulating attribute accesses
paul@810 106
paul@810 107
=== Function Naming ===
paul@810 108
paul@861 109
A naming scheme is applied to functions defined in the final C program to
paul@861 110
support different kinds of callables present in the original program.
paul@810 111
paul@810 112
|| '''Name Class''' || '''Purpose''' || '''Example''' ||
paul@810 113
|| `fn` || Indicates a plain function || `__fn___builtins___str_str` ||
paul@810 114
|| `main` || Indicates a module main program || `__main_sys` ||
paul@810 115
|| `new` || Indicates a class instantiator || `__new___builtins___list_list` ||
paul@810 116
|| `newliteral` || Provides an instantiator for literals mentioned in programs || `__newliteral___builtins___list_list` ||
paul@810 117
paul@810 118
=== Generic Program Functionality ===
paul@810 119
paul@861 120
Given the representation of the fundamental program objects, a suite of
paul@861 121
generic operations is provided to support the activities necessary to inspect
paul@861 122
and update program data. Such operations are largely concerned with generic
paul@861 123
object and attribute representations and are practically identical from
paul@861 124
program to program.
paul@810 125
paul@861 126
For example, an attribute provided by an object can be accessed in several
paul@861 127
different ways depending on the information known about the object when the
paul@861 128
program is translated. Consequently, a suite of attribute access operations is
paul@861 129
provided to support each different scenario:
paul@810 130
paul@810 131
||<-2|2> ||<-2> '''Attribute provider''' ||
paul@810 132
|| '''Class''' || '''Instance''' ||
paul@810 133
||<|3> '''Accessor''' || '''Class''' || `load_via_object` || ||
paul@810 134
|| '''Instance''' || `load_via_class` || `load_via_object` ||
paul@810 135
|| '''Class or instance''' || `get_class_and_load` ||  `check_and_load_via_any` ||
paul@810 136
paul@810 137
=== Specific Program Functionality ===
paul@810 138
paul@861 139
Since objects at the program level have their layouts
paul@861 140
[[../Optimisation|optimised]], and since these layouts may differ from program
paul@861 141
to program, a suite of specific operations is provided to support a range of
paul@861 142
activities within programs, such operations being configured with
paul@861 143
program-specific information in order to do the right thing within a given
paul@861 144
program.
paul@810 145
paul@810 146
|| '''Function''' || '''Purpose''' ||
paul@810 147
|| `__invoke` || A generic invocation function that populates the argument array and obtains the appropriate target ||
paul@810 148
|| `__new` || Creates a new object, setting the `__class__` attribute to the indicated class identity ||
paul@810 149
|| `__BOOL` || Tests the identity of an attribute value against the `True` constant's address ||
paul@810 150
|| `__GETDEFAULT` || Obtains a function default from additional members of a function instance beyond the core members ||
paul@810 151
|| `__SETDEFAULT` || Updates a function default from additional members of a function instance beyond the core members ||
paul@810 152
paul@810 153
=== Module Files ===
paul@810 154
paul@861 155
Each program module is defined in its own file in the translated program.
paul@861 156
Since C has no notion of general module-level code, a special main function is
paul@861 157
generated for each module to perform the operations defined at the module
paul@861 158
level in the original program. These main functions are called in turn by the
paul@861 159
principal `main` function of the final C program, and the order of invocation
paul@861 160
is defined by the module initialisation ordering established when resolving
paul@861 161
inter-module dependencies.
paul@810 162
paul@810 163
=== Statement Translation ===
paul@810 164
paul@861 165
Statements in the original program are generally translated to statements in
paul@861 166
the final C program. During inspection and translation, certain constructs are
paul@861 167
transformed to produce the operations necessary to implement such constructs.
paul@861 168
For instance, operators need to be rewritten as function invocations, and thus
paul@861 169
the form of the original program changes before it is translated to C.
paul@810 170
paul@861 171
When translating certain operations that may appear within a statement, a
paul@861 172
sequence of instructions is generated, and such instructions must be performed
paul@861 173
in turn. Fortunately, C supports a wide variety of operations - such as
paul@861 174
assignments - within expressions, and it also provides the comma operator
paul@861 175
which permits sequences of operations to be performed in order and to yield a
paul@861 176
result that can then be used in the context of an expression. Such sequences
paul@861 177
of operations are employed extensively to realise attribute access instruction
paul@861 178
plans.
paul@810 179
paul@810 180
=== Name References ===
paul@810 181
paul@861 182
Names can be translated in different ways for the final C program. Function
paul@861 183
locals and parameters correspond directly to locals in C functions.
paul@861 184
Module-level globals and class-level locals are translated to structure
paul@861 185
accesses. However, temporary names that are used in sequence assignment are
paul@861 186
translated to C function locals, either in functions corresponding to their
paul@861 187
equivalents in the original program, or in module initialisation functions
paul@861 188
corresponding to the module scope in the original program.
paul@810 189
paul@861 190
Certain special names that are formulated during [[../Inspection|inspection]]
paul@861 191
are converted to constant or static references.
paul@810 192
paul@810 193
=== Attribute Accesses ===
paul@810 194
paul@861 195
Each attribute access is associated with a particular access location,
paul@861 196
corresponding to an operation occurring in the original program identified
paul@861 197
during inspection. This location can be used as a key to access a
paul@861 198
[[#Attribute_Access_Plans|plan]] that describes how the access may be realised
paul@861 199
in the final program. Since most of the work computing the nature of the
paul@861 200
access is done during the preceding deduction and optimisation activities,
paul@861 201
what remains is mostly the emission of the plan into the generated program
paul@861 202
text with some substitutions performed.
paul@810 203
paul@810 204
=== Invocations ===
paul@810 205
paul@861 206
Each invocation involves an object whose identity may or may not be known at
paul@861 207
compile-time, plus a collection of arguments. The translation process is
paul@861 208
concerned with obtaining a target to be invoked in the generated program,
paul@861 209
populating an argument array appropriately for the target, and providing the
paul@861 210
means of invocation.
paul@810 211
paul@861 212
Where the identity of the object is not known at all, the information provided
paul@861 213
is effectively passed on to the special `__invoke` function which is then
paul@861 214
required to do all the work at run-time. However, one of the goals of
paul@861 215
analysing and compiling the program is to avoid doing such work at run-time
paul@861 216
and to take advantage of those cases where the identity of the object can be
paul@861 217
determined.
paul@810 218
paul@861 219
Thus, where the identity of the called object is known, details of the
paul@861 220
object's signature - its parameters and their positions - are used to populate
paul@861 221
the argument array and to determine whether this can be done successfully.
paul@861 222
Some callables require a context to be provided, and knowledge about the
paul@861 223
callable can be used to obtain or omit such a context from the arguments.
paul@810 224
paul@861 225
Finally, the invocation target must be obtained. Where some knowledge of the
paul@861 226
identity of the object is available, it may be sufficient to access the
paul@861 227
special `__fn__` member directly (potentially testing for its presence if this
paul@861 228
is not certain) and to call the appropriate function with the assembled
paul@861 229
argument array. Where a definite identity is available, the actual function
paul@861 230
itself may be named and thus called with the arguments. Such operations are
paul@861 231
mere fragments of the total work performed by the `__invoke` function in the
paul@861 232
least optimal case.
paul@810 233
paul@810 234
=== Assignments ===
paul@810 235
paul@861 236
Assignments of objects to names occur for the following kinds of statements:
paul@861 237
assignments, definition statements (`class` and `def`), import statements
paul@861 238
(`from` and `import`).
paul@810 239
paul@810 240
=== Guards ===
paul@810 241
paul@861 242
Guards are identified during [[../Deduction|deduction]] as being necessary to
paul@861 243
uphold deductions about the nature of objects referenced via names that may be
paul@861 244
undermined at run-time by potentially erroneous program behaviour.
paul@810 245
paul@810 246
=== Exceptions ===
paul@810 247
paul@861 248
The C programming language does not support exceptions in a form supported by
paul@861 249
more modern programming languages. Consequently, lower-level mechanisms need
paul@861 250
to be employed to reproduce the semantics of the `try` statement, its `except`
paul@861 251
and `finally` clauses, and the `raise` statement, as well as considering the
paul@861 252
effect of exceptions on the `return` statement.
paul@810 253
paul@861 254
Fortunately, some consideration has been given to this topic before. The
paul@861 255
[[http://www.nicemice.net/cexcept/|cexcept]] library - really a collection of
paul@861 256
convenience macros - provides a way of expressing exception constructs in a
paul@861 257
natural way in C while translating those constructs to usage of the `setjmp`
paul@861 258
and `longjmp` functions. Although direct usage of such functions could be
paul@861 259
generated during program translation, for convenience and to focus on other
paul@861 260
implementation concerns, it was decided to employ the cexcept macros and to
paul@861 261
build any additional functionality in terms of their usage.
paul@810 262
paul@810 263
A structure is defined to hold exception-related information:
paul@810 264
paul@810 265
{{{
paul@810 266
typedef struct
paul@810 267
{
paul@810 268
    __attr arg;
paul@810 269
    int raising;
paul@810 270
    int raising_else;
paul@810 271
    int completing;
paul@810 272
} __exc;
paul@810 273
}}}
paul@810 274
paul@861 275
Here, the `arg` member holds a raised exception object reference. Meanwhile,
paul@861 276
the remaining members interact with language constructs as follows:
paul@810 277
paul@810 278
|| '''Member''' || '''Description''' || '''Output Program Operation''' ||
paul@810 279
|| `raising` || Indicates an exception that has been raised and not yet handled || `__Raise` ||
paul@810 280
|| `raising_else` || Indicates an exception that must be re-raised because it occurred in an `else` clause || `__RaiseElse` ||
paul@810 281
|| `completing` || Indicates the completion of, or return from, a `try` clause requiring entry into a `finally` clause || `__Complete` and `__Return` ||
paul@810 282
paul@810 283
=== Memory Allocation and Garbage Collection ===
paul@810 284
paul@861 285
To avoid having to write a garbage collector, the
paul@861 286
[[http://www.hboehm.info/gc/|Boehm-Demers-Weiser garbage collector]] is
paul@861 287
employed by programs to allocate and free memory needed for its objects.