Lichen

docs/wiki/Inspection

934:2989aab1b4f7
2021-06-29 Paul Boddie Renamed the utf8string class to unicode, eliminating the unicode function. This means that the simple case of merely returning an object if it is already a Unicode object no longer occurs when using the unicode callable, but such behaviour might be better supported with more general customised instantiation functionality.
     1 = Inspecting Programs =     2      3 A program's source code is inspected module by module. As     4 [[../Imports|imports]] of modules are encountered, other modules are added to     5 the program. The inspection process for each module involves top-to-bottom     6 traversal of the code using depth-first processing of the abstract syntax     7 tree, with a stack of namespaces being employed to record definitions and     8 names within namespaces are they are visited. Inspection continues until all     9 modules in the program have been imported and inspected.    10     11 <<TableOfContents(2,3)>>    12     13 == The Inspection Process ==    14     15 The inspection process is focused on two primary tasks:    16     17  * Populating namespaces    18    * Identifying the names defined by each namespace    19    * Recording relationships between namespaces    20  * Describing the operations in each namespace    21    * Identifying the names in each namespace and the attributes used by those    22      names    23    * Recording details of assignments and invocations    24     25 The results of inspection are written out to [[../Cache|cache]] files, one for    26 each module in the program.    27     28 === Program Units for Inspection ===    29     30 The `inspector` module performs much of the inspection work, relying on the    31 `common` module for certain tasks, with the `modules` module providing the    32 relevant module abstractions including those writing to and reading from cache    33 files, and the `resolving` module completing each module's inspection. The    34 `importer` module coordinates inspection across the whole program.    35     36 === Name Identification ===    37     38 Names can be introduced to a namespace via several different mechanisms:    39     40  * Assignments to local names    41  * References to global names    42  * References to built-in names (defined in the built-in modules)    43  * Importing of modules    44  * Importing of names from modules    45     46 The origins of names are discovered by considering local, global and built-in    47 namespaces. Where a name is '''local''', it is defined in the same namespace.    48 Where a name is '''global''', it is defined in the same module at the module    49 level. Where a name is '''built-in''', it is defined in a module in the    50 `__builtins__` package and exposed via the `__builtins__` namespace.    51     52 Initial tentative identification of names will sort names into two categories:    53 locals and external names. Global variables employed in function (or method)    54 namespaces may not be defined when a function body is inspected, either within    55 a module or being imported from another module, and so it is not initially    56 possible to more specifically determine the nature of a name.    57     58 These categories are later refined to distinguish between globals and built-in    59 names as external names. The built-in origin of a name is only suggested when    60 no locals or globals can provide the name, and the final identification of    61 such names, plus other external names introduced as locals or globals via    62 imports, will occur at the end of the inspection activity. Names that are    63 unrecognised by then may be treated like unrecognised built-ins.    64     65 === Name Restrictions and Resolution ===    66     67 '''Static''' objects, forming the fundamental [[../Structure|structure]] of    68 the program, expose their names through the general structure hierarchy.    69 Classes, which are defined as part of this structure, depend on well-defined    70 names for any base classes they employ. For example:    71     72 {{{#!python numbers=disable    73 class C(A): # depends on A being a name that can be resolved (and being a class)    74     ...    75     76 class D(A.B.C): # depends on A.B.C being resolvable (and being a class)    77     ...    78 }}}    79     80 Base classes must be identifiable and unambiguous. Since base classes may be    81 imported, their identification may not necessarily occur immediately, but it    82 must be possible once all information about a program is known and when all    83 such dependencies are resolved.    84     85 === Attribute Identification ===    86     87 Attributes are introduced to namespaces as follows:    88     89  * For classes and modules, the definition of names defines attributes in    90    those namespaces    91  * Functions do not generally support attributes, although    92    [[../Representations#Special_Members|representation-specific attributes]]    93    do exist on functions    94  * For instances, assignments to attributes of the special `self` name,    95    performed in methods, define attributes in all instances of the method's    96    class    97     98 Attributes are only supported on program objects if they have been defined or    99 '''bound''' as described above. Any attempt to access or set an attribute on   100 an object using a name that was not determined through the above process is   101 considered an invalid operation. Thus, augmenting the attributes available on   102 an object (so-called "monkeypatching") is not possible.   103    104 When an attribute is accessed, the location of its use is recorded and   105 ultimately associated with assignments of the name involved, just as is done   106 for accesses of the plain name itself, but the attribute details are   107 subsequently collected together for each assignment or '''version''' of the   108 name. This is discussed below.   109    110 === Inherited Attributes ===   111    112 In order to support inheritance, a process of propagation makes class and   113 instance attributes available to any given class from its ancestor classes   114 according to a depth-first traversal of a class's base class hierarchy. Thus,   115 for each class, given the definition of its base classes, a complete   116 collection of class and instance attribute names can be determined. The actual   117 mechanism of propagation occurs during the consolidation phase of inspection,   118 principally because class bases are not generally immediately identifiable   119 upon completing the inspection of any given class.   120    121 === Name Assignments and Accesses ===   122    123 Each assignment of a name defines a '''version''' of the name within a   124 namespace. The location of this definition consists of the following:   125    126  * The namespace in which the assignment appears   127  * The name itself   128  * The version or assignment instance number   129    130 When a name is used, the location of its use is recorded and is ultimately   131 associated with the assignments that may be providing the name at that   132 location. This permits information about the type of the name to become   133 available for each usage location. The location of each name usage consists of   134 the following:   135    136  * The namespace in which the name appears   137  * The name itself   138  * The access instance number   139    140 Name accesses are described as special cases of attribute accesses: where   141 attributes would be indicated, none are specified.   142    143 === Attribute Accesses ===   144    145 Attribute '''accesses''' are operations involving attributes where those   146 attributes are obtained or set in conjunction with an '''accessor'''   147 expression. They are recorded in terms of location tuples, each describing an   148 access as follows:   149    150  * The namespace in which the access occurs   151  * Any named accessor or an anonymous accessor   152  * The attribute (or chain of attributes) involved (omitted for name accesses)   153  * The access instance number   154    155 As with name accesses, the access instance number distinguishes between   156 different accesses employing the same details.   157    158 Consider the following example:   159    160 {{{{#!table   161 <style="vertical-align: top">   162 {{{#!python numbers=disable   163    164 def f():   165     p = ...        # p version 0   166     p.a            # p.a access 0   167     if fn().a:     # anonymous a access 0   168         q = ...    # q version 0   169         q.a        # q.a access 0   170         p          # p access 0   171     else:   172         q = ...    # q version 1   173         q.a        # q.a access 1   174     q.b            # q.b access 0   175     p              # p access 1   176 }}}   177 ||   178 {{{#!graphviz   179 //format=svg   180 //transform=notugly   181 digraph accesses {   182   node [shape=box,fontsize="13.0",fontname="sans-serif",tooltip="Names and accesses"];   183   edge [tooltip="Names and accesses"];   184   rankdir=TB;   185    186   p [label="p = ...",style=filled,fillcolor=gold];   187    188   subgraph {   189     rank=same;   190     pa0expr [label="p.a",style=filled,fillcolor=cyan];   191     pa0 [label="p.a #0",style=filled,fillcolor=red];   192   }   193    194   subgraph {   195     rank=same;   196     if [label="if",shape=ellipse];   197     fnaexpr [label="fn().a",style=filled,fillcolor=cyan];   198     fna [label="{}.a #0",style=filled,fillcolor=red];   199   }   200    201   q0 [label="q = ...",style=filled,fillcolor=gold];   202    203   subgraph {   204     rank=same;   205     qa0expr [label="q.a",style=filled,fillcolor=cyan];   206     qa0 [label="q.a #0",style=filled,fillcolor=red];   207   }   208    209   subgraph {   210     rank=same;   211     p0expr [label="p",style=filled,fillcolor=cyan];   212     p0 [label="p.{} #0",style=filled,fillcolor=red];   213   }   214    215   else [label="else",shape=ellipse];   216   q1 [label="q = ...",style=filled,fillcolor=gold];   217    218   subgraph {   219     rank=same;   220     qa1expr [label="q.a #1",style=filled,fillcolor=cyan];   221     qa1 [label="q.a #1",style=filled,fillcolor=red];   222   }   223    224   subgraph {   225     rank=same;   226     qb0expr [label="q.b",style=filled,fillcolor=cyan];   227     qb0 [label="q.b #0",style=filled,fillcolor=red];   228   }   229    230   subgraph {   231     rank=same;   232     p1expr [label="p",style=filled,fillcolor=cyan];   233     p1 [label="p.{} #1",style=filled,fillcolor=red];   234   }   235    236   p -> pa0expr -> if -> fnaexpr -> q0 -> qa0expr -> p0expr -> qb0expr;   237   if -> else -> q1 -> qa1expr -> qb0expr;   238   qb0expr -> p1expr;   239    240   fnaexpr -> fna [dir=none,style=dashed];   241   pa0expr -> pa0 [dir=none,style=dashed];   242   qa0expr -> qa0 [dir=none,style=dashed];   243   p0expr -> p0 [dir=none,style=dashed];   244   qa1expr -> qa1 [dir=none,style=dashed];   245   qb0expr -> qb0 [dir=none,style=dashed];   246   p1expr -> p1 [dir=none,style=dashed];   247    248   p -> pa0 [dir=none];   249   p -> p0 [dir=none];   250   p -> p1 [dir=none];   251    252   q0 -> qa0 [dir=none];   253   q0 -> qb0 [dir=none];   254    255   q1 -> qa1 [dir=none];   256   q1 -> qb0 [dir=none];   257 }   258 }}}   259 }}}}   260    261 Since the names involved may be provided by different name versions, accesses   262 are counted independently. Meanwhile, a non-name or '''anonymous''' accessor   263 may be involved in an access. Such anonymous accessors are independent and do   264 not accumulate attribute usage because they potentially involve completely   265 different objects.   266    267 || '''Namespace''' || '''Name''' || '''Attribute''' || '''Access number''' ||   268 || `module.f`      || `p`        || `a`             || 0 ||   269 || `module.f`      || `p`        || `{}`            || 0 ||   270 || `module.f`      || `p`        || `{}`            || 1 ||   271 || `module.f`      || `q`        || `a`             || 0 ||   272 || `module.f`      || `q`        || `a`             || 1 ||   273 || `module.f`      || `q`        || `b`             || 0 ||   274 || `module.f`      || `{}`       || `a`             || 0 ||   275    276 Accessors may be recorded using a similar location scheme. For example:   277    278 || '''Namespace''' || '''Name''' || '''Attribute''' || '''Name version''' ||   279 || `module.f`      || `p`        || `{}`            || 0 ||   280 || `module.f`      || `q`        || `{}`            || 0 ||   281 || `module.f`      || `q`        || `{}`            || 1 ||   282    283 Here, the attribute field is left empty and indicates that name definitions   284 are being described. Although the data is superficially similar to name   285 accesses, it should be remembered that accesses employ an access number   286 whereas accessors employ a name version, with such identifiers being different   287 things.   288    289 === Names and Attribute Usage ===   290    291 Within each scope, the names employed by the program and the attributes used   292 with those names are tracked. As the path of execution diverges and converges   293 again through control-flow constructs, the tracking attempts to maintain the   294 '''attribute usage''' associated with each name, or specifically each   295 '''version''' of each name.   296    297 {{{{#!table   298 <style="vertical-align: top">   299 {{{#!python numbers=disable   300 y = ...         # y version 0   301 while cond0:   302     if cond1:   303         y.a1   304     elif cond2:   305         y = ... # y version 1   306         y.a2   307     else:   308         y.a3   309    310 # y version 0 is used with a1 or a3 or neither   311    312 # y version 1 is used with a2   313 }}}   314 ||   315 {{{#!graphviz   316 //format=svg   317 //transform=notugly   318 digraph usage {   319   node [shape=box,fontsize="13.0",fontname="sans-serif",tooltip="Name and attribute tracking"];   320   edge [tooltip="Name and attribute tracking"];   321   rankdir=TB;   322    323   sety0 [label="y = ...",style=filled,fillcolor=gold];   324   while [label="while cond0",shape=ellipse];   325    326   subgraph {   327     rank=same;   328     if [label="if cond1",shape=ellipse];   329     ya1 [label="y.a1",style=filled,fillcolor=cyan];   330   }   331    332   subgraph {   333     rank=same;   334     elif [label="elif cond2",shape=ellipse];   335     sety1 [label="y = ...",style=filled,fillcolor=gold];   336     ya2 [label="y.a2",style=filled,fillcolor=green];   337   }   338    339   subgraph {   340     rank=same;   341     ifelse [label="else",shape=ellipse];   342     ya3 [label="y.a3",style=filled,fillcolor=cyan];   343   }   344    345   endif [label="end if",shape=ellipse];   346    347   whileelse [label="(else)",shape=ellipse];   348    349   endwhile [label="end while",shape=ellipse];   350    351   sety0 -> while -> if -> elif -> ifelse -> endif;   352   if -> ya1 -> endif;   353   elif -> sety1 -> ya2 -> endif;   354   ifelse -> ya3 -> endif;   355   endif -> while [style=dashed];   356   while -> whileelse -> endwhile;   357 }   358 }}}   359 }}}}   360    361 The outcome of such tracking should be an indication of the attribute usage   362 with each name based on the shortest routes that names can take through the   363 control-flow structure. Such shortest-route usage defines the minimal   364 selection of attributes that can be considered used with a name, and thus such   365 usage defines the broadest selection of object types that can be identified as   366 supporting such attributes. In the above example, the following minimal   367 selections of attributes apply:   368    369 || '''Name Version''' || '''Minimal Set of Attributes''' || '''Types''' ||   370 || `y` (version 0) || ''empty set'' || ''any object'' ||   371 || `y` (version 1) || `a2` ||  ''only objects providing the single attribute'' ||   372    373 The assumption is made that any attribute used with a name is always provided   374 by the object referenced by that name and that the correct execution of the   375 code does not rely on an attribute being absent (and thus raising an   376 exception). By defining usage for each name, the toolchain can determine   377 whether any type can provide the attributes used with a name, producing a   378 compile-time error if no type supports such usage.   379    380 It is possible that certain routes taken by names might define attribute usage   381 that cannot be supported by types that do support the shortest-route usage,   382 yet it might not be appropriate to forbid usage of such types with such names:   383 the program logic may intend that such types do not visit the regions of the   384 code that employ the attributes that such types cannot support. However, as a   385 consequence, run-time tests will be required to prevent attribute accesses   386 that are inappropriate for such types from taking place. In the above example,   387 the following maximal selections apply:   388    389 || '''Name Version''' || '''Maximal Set of Attributes''' || '''Types''' ||   390 || `y` (version 0) || `a1`, `a3` || ''only objects providing both attributes'' ||   391 || `y` (version 1) || `a1`, `a2`, `a3` ||  ''only objects providing all three attributes'' ||   392    393 Tracking occurs separately in function or method namespaces and at a level   394 combining the static namespaces in a module. This latter combination of   395 namespaces permits the flow of global name details through class namespaces.   396 Tracking employs special '''tracking names''' with which usage is associated,   397 with globals and class attributes employing complete object paths to describe   398 their names, whereas locals merely employ the plain names defined and used in   399 local namespaces. Some examples follow:   400    401 || '''Namespace''' || '''Name''' || '''Name Scope''' || '''Tracking Name''' ||   402 || `__main__` (module) || `x` || global || `__main__.x` ||   403 || `__main__.C` (class) || `x` || global || `__main__.x` ||   404 || `__main__.C` (class) || `y` || local (class attribute) || `__main__.C.y` ||   405 || `__main__.f` (function) || `y` || local || `y` ||   406    407 === Name Initialisation and Aliases ===   408    409 Each name version may be associated with a value upon assignment provided by   410 an expression known as an '''initialiser'''. Such values may be used to inform   411 the interpretation of operations on names, restricting the types involved to   412 the initialised values. Some examples are as follows:   413    414 {{{#!python numbers=disable   415 x = 123         # initialiser is a constant, indicating a known type   416 x = classname() # initialiser refers to a class, indicating instantiation   417 }}}   418    419 Names may also be assigned the values of other names, and such a name becomes   420 an '''alias''' for such other names. Aliases can therefore be used to combine   421 attribute usage observations across names. Other forms of aliases involve   422 assigning attributes accessed via other names to any given name. Some examples   423 are as follows:   424    425 {{{#!python numbers=disable   426 y = x   # y    427 y.p     # p can be assumed to apply to x   428 z = x.q # z can be restricted to the types deduced for x.q   429 }}}   430    431 Initialising values are retained for later resolution because their identities   432 are not generally known until certain name resolution activities have taken   433 place.   434    435 === Invocations ===   436    437 During inspection only limited information is available about the nature of   438 invocations. However, some information will already be available about global   439 names and these may be defined by classes. Thus, invocations that cause the   440 instantiation of classes may be determined even during the inspection phase.   441    442 Information about class instantiation is most useful in combination with the   443 initialisation of names. When an assignment occurs, any initialising   444 expression that provides enough information can be evaluated to see if it   445 describes instantiation. If so, the nature of the instantiation can be   446 associated with the name and used in the deduction process to constrain any   447 usage of that name and its attributes. Such restrictions on the types   448 associated with names are applied in the deduction phase.   449    450 === Literals and Constants ===   451    452 Literal values or '''literals''' are specific values that appear in the   453 program, typically representing numbers, character strings, mappings or   454 collections. Literal numbers and strings are effectively '''constants''',   455 meaning that their values are unambiguously defined and can eventually be   456 referenced using a unique identifier that applies throughout the program and   457 which can refer to an initialised object in any generated program. Initially,   458 they are recorded for the namespace in which they appear. For example:   459    460 || '''Namespace''' || '''Identifier''' || '''Type''' || '''Encoding''' || '''Value''' ||   461 || `__main__` || `__main__.$c0` || `__builtins__.str.string` || `iso-8859-15` || `'\xc6\xd8\xc5'` ||   462 || `__main__` || `__main__.$c1` || `__builtins__.int.int` || || `123` ||   463    464 Since values may appear again in other namespaces, a mapping is generated from   465 such local identifiers to common global identifiers for constants. Where such   466 an identifier is derived from the value content, this can potentially occur   467 immediately, although in practice (and in general) such a mapping will be   468 generated after all modules have been inspected.   469    470 Collections and mappings may also be initialised using literal syntax but may   471 contain non-constant information such as names or expressions whose values are   472 unknown before the program is run. Such values can be represented as   473 instantiation operations of the appropriate type in any generated program, and   474 the instances of the type concerned can be associated with any names to which   475 such literals are assigned. Special names are used to refer to literal types   476 for collections and mappings. For example:   477    478 || '''Identifier''' || '''Type''' ||   479 || `$Ldict` || `__builtins__.dict.dict` ||   480 || `$Llist` || `__builtins__.list.list` ||   481 || `$Ltuple` || `__builtins__.tuple.tuple` ||   482    483 Such special names merely serve as convenient placeholders for the types of   484 any literals mentioned in a module.   485    486 == Consolidating Inspection Details ==   487    488 As briefly discussed above, certain activities occur after information has   489 been collected from an input program. Within each module, names that are   490 external to namespaces are recorded in a '''name reference''' collection.   491 These references identify unrecognised names but do not generally define their   492 origins. Examples of name references are as follows:   493    494 || '''Name''' || '''Identity''' ||   495 || `__main__.repr` || `<deferred>:__builtins__.repr` ||   496 || `__main__.sys` || `<module>:sys` ||   497    498 In order to obtain the final identities, deferred references may need to be   499 resolved, yielding concrete references:   500    501 || '''Name''' || '''Identity''' ||   502 || `__main__.repr` || `<function>:__builtins__.identity.repr` ||   503 || `__main__.sys` || `<module>:sys` ||   504    505 This process of resolution cannot be performed immediately with only the   506 knowledge available in a single module. Only with all program modules loaded   507 can such resolution occur.   508    509 === Identifying Deferred References ===   510    511 After all modules are loaded, and with all objects present in the program at   512 this point, each deferred reference defined within a module should ultimately   513 yield an identity. Any failure to do so indicates usage of a name not defined   514 in the program. The process of identifying them by resolving what each of them   515 points to, is illustrated by the following example:   516    517 || '''Reference''' || '''Object Path to Search''' ||   518 || `<depends>:__builtins__.repr` || `__builtins__.repr` ||   519 || `<depends>:__builtins__.identity.repr` || `__builtins__.identity.repr` ||   520 || `<function>:__builtins__.identity.repr` || ''identification has occurred'' ||   521    522 Initially, the origin information from the reference is used to find the   523 details of the referenced object. However, in this case, the details yield   524 another deferred reference that has not yet been resolved. Consequently, the   525 origin information must be used to find the details of the object referenced   526 in this case, finally yielding a reference with concrete object information.   527 Deferred references are mutated to concrete references, thus changing the   528 nature of such references throughout the accumulated data.   529    530 === Identifying Module Dependencies ===   531    532 During deferred reference resolution, relationships are catalogued between   533 modules in which deferred references are found and those providing the   534 corresponding referenced objects. In addition, module ordering dependencies   535 are established on the basis that some kinds of objects must be initialised   536 before being used in other parts of the program. As a result, the modules   537 providing such objects must themselves be initialised before the modules   538 referencing such objects. More information on this topic can be found in the   539 [[../Imports#Import_Sequencing|import sequencing]] documentation.   540    541 === Resolving Module Details ===   542    543 With all dependencies resolved, further work can be done with the details   544 within each module. The `resolving` module provides the functionality to each   545 module instance to perform such work.   546    547  1. Class bases can be identified.   548  1. Special names (referring to built-in objects employed by certain   549     operations) are resolved.   550  1. Each access involving an external name is investigated to see if it   551     provides an effectively constant object (as described below).   552  1. Invocation references are converted to provide instantiation information,   553     if appropriate.   554  1. Initialisers are investigated to see if they provide concrete object   555     details and can thus indicate the nature of a name.   556  1. Constant value literals are associated with the appropriate types.   557    558 Thus, attribute details for each module and its classes are finalised.   559 Meanwhile, modules that are still hidden are removed from the program.   560    561 ==== Investigating Accesses ====   562    563 External names will not have been resolved during the inspection process, but   564 with information about the whole program, it becomes possible to identify   565 names and to determine whether attribute accesses involving those names can   566 also be identified. Thus, name references from each namespace are investigated   567 in turn as follows.   568    569 Each name referencing a static object is considered a constant access of that   570 object. However, given a number of accompanying attributes describing an   571 access, each attribute is added in turn, and the resulting object path   572 identified. If the path identifies a constant object, the next attribute in   573 the access chain is added and the resulting object path identified, and so on.   574 The maximal collection of attributes identifying a constant object is then   575 recorded as a constant access, with any remaining attributes in the access   576 chain retained for traversal in a later phase.   577    578 Names yielding constant accesses do not need to have their identity deduced   579 through the application of attribute usage.   580    581 ==== Investigating Initialisers ====   582    583 During inspection, initialisers will have been recorded in terms of expression   584 results such as names, invocations, readily-identifiable instances, and   585 attribute accesses. With details of the whole program plus constant accesses   586 having been made available, such results may be definitively identified and   587 associated with initialised names.   588    589 Alias information may also be refined at this point by identifying attribute   590 accesses that are used to initialise names.   591    592 === Finalising Program Details ===   593    594 With class contents and relationships identified, class attribute inheritance   595 and instance attribute availability can be defined. Some classes may depend on   596 external state for their initialisation, and so additional module dependencies   597 may be defined at this stage. The `importer` module contains the functionality   598 to conduct these whole-program activities.   599    600 == Inspection Outcome ==   601    602 Once inspection is complete, the available knowledge concerns the whole   603 program and not merely individual modules or parts of the program. However,   604 only limited efforts have been made until this point, notably in the   605 acquisition of statically-defined object details when referencing names or   606 attributes, to integrate the different modules. The [[../Deduction|deduction]]   607 process is concerned with such integration.