INTRODUCTION

In this chapter the applicability of HGA to compiler construction is considered. This forms the core of the proof of the thesis, spanning experimental studies over about two decades. P. B. Hansen has pointed out that the general technique of HGA was used similarly by him throughout his life in the development of portable compilers. Here the first pioneering compilers in India for many major programming languages, developed by the author or as implementations of his design, are considered; all of them used HGA for their development.

The sixth application of HGA is a summary of applications of the method to pioneering compiler development efforts for major real-life programming languages in an industrial environment in the seventies and eighties in India, using moderately reliable computer systems. The yield of HGA has been the first experimental/commercial compilers/interpreters in India for PASCAL (sequential and concurrent), APL (LL(1) syntax directed compiler), CORAL-66 (employed in real-time applications), BLISS (experimental implementation), subsets of ADA, subsets of CHILL, FORTRAN-IV and FORTRAN-77 (scientific applications); LISP, SNOBOL, PROLOG (for computer science education); subsets of PL/1 (experimental); ALGOL-60 (commercial); C (experimental and limited commercial use), etc. The HGA development efforts used LL(1) uniformly, mostly in the form of top-down parsing without backup, with one symbol of look-ahead. An Extended-BNF formulation of the syntax was employed, and such features as did not fit the LL(1) framework were coerced to do so by (ugly(?)) semantic routines. The initial HLL version was worked out by the author in PASCAL with little debugging and extensive testing, and hand-coding was done through M.Phil thesis work (of the Hyderabad Central University) and by trainees in Computer Science and Engineering and Information Systems Engineering. The hardware platforms were the slowly stabilising TDC-316 and System 332 (a 32-bit medium-large copy of the IRIS-55 of CII-Honeywell Bull, from which, as Prof. J. Saltzer commented on his visit, very little software was inherited). Extensions of the studies of the applicability of HGA have been to various TWS including the SLR(1), LR(1) and LALR(1) techniques.
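By way of illustration, a minimal recursive-descent fragment in a PASCAL-like dialect is sketched below for a tiny hypothetical expression grammar. The grammar, the routine names (getsym, expect, error) and the end-of-input marker are illustrative only and are not taken from any of the actual compilers; what the sketch shows is the one-symbol look-ahead, top-down-without-backup style described above.

program LL1Sketch;
{ A minimal LL(1) recursive-descent sketch for the (hypothetical) grammar
    E -> T { ('+' | '-') T }
    T -> F { ('*' | '/') F }
    F -> digit | '(' E ')'
  Top-down without backup, with one symbol of look-ahead. }
var
  look : char;      { the single look-ahead symbol }
  line : string;    { illustrative input buffer }
  pos  : integer;

procedure error(msg : string);
begin
  writeln('syntax error at position ', pos, ': ', msg);
  halt
end;

procedure getsym;
begin
  if pos <= length(line) then
  begin look := line[pos]; pos := pos + 1 end
  else look := '#'                     { end-of-input marker }
end;

procedure expect(c : char);
begin
  if look = c then getsym else error('expected ' + c)
end;

procedure expression; forward;

procedure factor;
begin
  if look in ['0'..'9'] then getsym
  else if look = '(' then
  begin getsym; expression; expect(')') end
  else error('factor expected')
end;

procedure term;
begin
  factor;
  while look in ['*', '/'] do begin getsym; factor end
end;

procedure expression;
begin
  term;
  while look in ['+', '-'] do begin getsym; term end
end;

begin
  line := '(1+2)*3';
  pos := 1;
  getsym;
  expression;
  if look = '#' then writeln('accepted') else error('trailing input')
end.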

6.1 High level programming languages (HLL) made their appearance around the 1950s in the West and circa the early seventies in the indigenous environment. Early programming was at the level of machine/assembly language programming. The early languages were BASIC, FORTRAN, COBOL and ALGOL-60. The mastering of the translator writing area led to the acceptance of HLL programming with its advantages of ease of portability of software, ease of entry into the field of programming, and ease of the realisation of software reuse. A separation of the abstract or symbolic component of the program from the machine level abstractions needed to use the computer system was thus attempted and realised to some extent. Once the early resistance to symbolic programming was overcome, an explosion in symbolic programming languages occurred. The next step was to consider specifications of the problem in precise natural language which could then be converted to the symbolic programming language in a number of steps. The major HLLs that found their first experimental/commercial applications in the environment are considered below.

i) FORTRAN, whose name is derived from FORmula TRANslation, was the first and oldest programming language to find commercial acceptance. It was originally developed at IBM for large scale numerically oriented problems. It initiated the historical debate as to whether a compiler can totally do away with assembly language handcoding. It has been extensively used for HGA purposes as reported in this chapter.

ii) COBOL, which stands for COmmon Business Oriented Language, is the work-horse of business applications. It was originally created in 1959 and the major standardisations considered are COBOL-74 and COBOL-85. The implementation was on the TDC-316 and Hansen's Approach (HA) was suggested for the same, but not used. HGA was used with COBOL-74 extensively, as reported in chapter 9, for educational purposes and for software upgradation or downgrading of software, as reported in the appendices. The permanence of COBOL is a perennial debate, but no effective replacement has been found.

iii) ALGOL-60 stands for ALGOrithmic Language and the '60' refers to the year of definition, circa 1960. It is the basic historical foundation of HLL development and the origin of practical computer science. The first Indian implementation, by the author on the TDC-316, is reported in chapter 14 and HC is based on the same. HGA was used for scientific application packages on the same. It was redefined as ALGOL-68, which has not found much acceptance. However, an experimental implementation of a subset of ALGOL-68 using HGA was done by the author, as reported in this chapter. This is the only such attempt in India.

iv) BASIC stands for Beginner's All-purpose Symbolic Instruction Code and was defined in 1964 at Dartmouth College to make programming as easy as possible. It was meant to be learnt in a few hours. It was used with HGA for instructional purposes, and the first Indian implementation was by the author in 1969 for his Master's thesis at the Indian Institute of Technology, Kharagpur. The first commercial implementation using HGA was on the TDC-312.

v) LISP stands for LISt Processing and is the second oldest language after FORTRAN. An experimental implementation, the first in India, was done by the author using HGA with assembly language as the target language. It was used for instructional purposes and is reported in this chapter.

vi) SNOBOL, created in the early sixties at Bell Labs, is a string processing language with some number handling capabilities. An experimental implementation, the first in India, was done by the author on the TDC-332. It used FORTRAN and assembly language as the target implementation vehicles, was used for instructional purposes, and is reported in this chapter.

vii) PASCAL, named after the French philosopher Blaise Pascal, was designed by N. Wirth. The stated intention in defining the language was to teach the fundamentals of computer programming to beginners and still have an efficient implementation. With its extensions to Object PASCAL and Concurrent PASCAL it has been the basic tool for HGA studies for two decades. The first implementation in India was by the author, using a variety of target languages and hardware platforms. All implementations were studies of, and used, HGA as reported in this chapter.

viii) EULER, named after the famous mathematician, was a generalisation of ALGOL-60 by Wirth and Weber. It used simple precedence in its published formal algorithm of definition. Eminently suited to the use of HC, it was implemented by the author (along with Bernd Krieg) as part of a Compiler Writing course under Prof. David Gries. The first experimental implementation of the same in India was done using HGA by the author with FORTRAN as the target language on the TDC-332/IRIS-55.

ix) MODULA-2, born in 1980, is a descendant of the original MODULA and of PASCAL. It was designed by Prof. N. Wirth. The first implementation of a large subset was done for experimental purposes on the IRIS-55 using HGA with FORTRAN as the handcoded target vehicle.

x) PL/1 stands for Programming Language One and is as old as 1966. Meant as a programming language for all purposes, it has not met its target. This is because there seems to be an intrinsic requirement for a plurality of programming languages, especially when one considers applications over a very wide spectrum. The only indigenous implementations were experimental realisations based on subsets, like the ones defined by R. C. Holt, implemented on the TDC platforms. The implementations were all studies of, and used, HGA.

xi) APL stands for A Programming Language and was developed by K.Iverson at IBM in 1962. It is a general purpose language using generalised matrix operations. The implementation using LL(1) syntax was the first in India and was by the author using HGA.

xii) ADA named after the world's first programmer Lady Augusta Ada Byron, has been ordained by the Department of Defence (DoD) of USA as being a requirement for military applications. An experimental implementation using HGA has been carried out by the author for a subset.

xiii) PROLOG stands for PROgramming in LOGic and was first implemented in 1972 in the West. It is the vehicle of the Fifth Generation Project in Japan. An experimental implementation using HGA, the first in India, was by the author and is reported in this chapter.

xiv) The C programming language, defined in 1972, is the current workhorse of infrastructure programming in Controls, Communications and Computers in the environment. Its availability is the main alternative to HGA, especially with highly optimising versions available. It does not find use in critical nuclear applications except as a vehicle for the use of HGA. Implementations of C in the early eighties by the author used HGA uniformly.

xv) CORAL-66 stands for COmmon Real time Language and was defined in 1966 as the British standard for real-time software. An implementation, the first in India, was done by the author using HGA and was used for real-time software.

xvi) CHILL stands for CCITT HIgh Level Language and has been ordained by the CCITT, the international body for standardisation in Telegraphy and Telephony, as the vehicle for communication software. The only implementation of the same in India, of a very large subset for experimental purposes using HGA, was by the author.

xvii) Software command languages. Attempts in the early to mid-eighties at word processors and spreadsheets in the environment using HGA for commercial purposes were swamped by the availability of LOTUS 1-2-3 and WORDSTAR. HGA may not be of any use for such dedicated applications.

6.2 The experimental systems programs reported below were developed over the last two decades, mainly on the TDC family of machines, and used HGA. Some of the concrete experimentation with HGA is reported.

6.3 Around 1975 there were only a couple of hundred computer systems in India. This was also the time when the TDC effort at the Electronics Corporation of India Limited(ECIL) was at its peak to establish the indigenous computer industry, by national policy.

6.4 The TDC-312 (Trombay Digital Computer, 3 for the generation and 12 for the word length) and the 16 bit word length TDC-316 were developed and productionised and efforts were underway to develop and productionise the medium large System 332.

6.5 To aid the software development on the indigenous machines the IRIS-55 of CII-Honeywell-Bull had been procured. This had the FORTRAN and COBOL language processors, working in a batch environment.

6.6 At the same time, around 1975, there were only a handful of Universities in India offering Computer Science/Engineering/Technology programs. The bulk of the personnel with a formal computer background were more available to the West than to India (circa 1975). The University of Hyderabad initiated an academic program, offering an M.Phil in Computer Methods, and the entire program was conducted by the staff of the Computer Group of ECIL for a decade (1975-84).

6.7 The non-availability of any ALGOL-like or PASCAL-like programming language on the IRIS-55 led to difficulties in instruction in Programming Methodology and practical Systems Programming, as the only languages available were FORTRAN and COBOL. The visits of the eminent Prof. Hansen and Prof. C.A.R. Hoare around this time were crucial in motivating the development of the Compiler Writing projects envisaged, which were also used for instructional and commercial implementations.

6.8 The use of HGA led to the successful implementation of a couple of dozen practical systems programs.

6.9 One of the severe problems in some academic programs in Computer Science/Engineering/Technology in the seventies was that the students never got down to developing, testing and debugging any programs. Circa 1975, the micro-computer revolution had not come to India, and this led to computer professionals who had hardly ever written a program. In some cases this crucial and fundamental aspect (experience of extensive practical programming) was not even realised.

6.10 Computer Science/Engineering/Technology suffers from the drawback of having involved terminology or jargon for which quite often no accepted standards exist. This has led to the phenomenon, in India, of quite a few people using the terminology or jargon intuitively. To avoid this it was decided that the M.Phil program and the internal training programs be primarily oriented around intensive and extensive practical systems programming.

6.11 As Programming Methodology can be naturally taught in PASCAL or ALGOL-60, does not come as naturally in COBOL, and comes perhaps not at all in FORTRAN, the method advocated by Prof. Hansen proved to be crucial.

6.12 The method of practical Systems Programming allows one to see through a program: to see the assembly language and machine language equivalents and, with a little bit of effort, the microprogrammed equivalents of the program.

6.13 Apart from the above one picks up certain essential skills: the unwinding of recursion to iteration, the mapping of some high level data structures onto the addressing modes of the machine instruction set, the postfix representation of expressions, control structures and all HLL constructs, microprogrammed equivalents of HLL constructs and at times even the realisations in hardware.

6.14 It may appear that today, in 1994, such a method has outlived its usefulness, especially with the easy availability of microcomputers and the PC revolution; however, it is a crucial requirement in real-time programming that the HGA experience be substantial.

6.15 The basic aim of the systems programs given in the table was to experiment with a few ideas at the system level, so that an appreciation is gained of the amount of effort involved when one tries out alternative ideas, and also to appreciate in depth what goes on when one uses jargon to describe the ideas involved in a non-trivial systems program.

6.16 Most of the experimentation was with error recovery and error correction ideas involving compiler development, as most of the work in the seventies in the TDC effort was in the area of language processors. The method of development followed allowed the students/trainees to get a good feel of compiler writing and programming methodology. The culture of using flowcharts, so prevalent in the TDC environment, was also sought to be changed in this approach.

6.17 The students/trainees were supplied with three gross descriptions:

a) The algorithm stated grossly, with only a few steps of refinement, in a mixture of ALGOL-like, PASCAL-like or PL/1-like programming languages, with abstract data types at times. For most of the first compilers the IHLL was Standard PASCAL and HGA was used.

b) The algorithm stated in Standard PASCAL (a handwritten and undebugged version). Towards the early eighties it was possible to subject the IHLL to some degree of debugging and testing.

c) The algorithm with FORTRAN-like control and data structures (but handwritten and undebugged). This uses an MA for FORTRAN-IV representations of PASCAL constructs by handcoding.

6.18 The above gross descriptions were first subjected to a detailed study and then set aside. The implementation was then worked out systematically in a number of steps of refinement to obtain the final implementation in FORTRAN and/or assembly language.
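To indicate the flavour of this final hand mapping, a small hypothetical fragment is sketched below: a PASCAL while loop for a trivial summation, with a typical FORTRAN-IV style IF/GOTO equivalent, into which such a loop would be hand-coded, shown in the comments. The labels and variable names are illustrative only and are not drawn from any of the actual projects.

program HandMapSketch;
{ Summing the first n integers: the IHLL (PASCAL) form.      }
{ A typical hand-coded FORTRAN-IV equivalent (illustrative): }
{       I = 1                                                }
{       ISUM = 0                                             }
{  10   IF (I .GT. N) GO TO 20                               }
{       ISUM = ISUM + I                                      }
{       I = I + 1                                            }
{       GO TO 10                                             }
{  20   CONTINUE                                             }
var
  i, n, sum : integer;
begin
  n := 10;
  sum := 0;
  i := 1;
  while i <= n do
  begin
    sum := sum + i;
    i := i + 1
  end;
  writeln('sum = ', sum)
end.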

6.19 To consider larger and larger subsets of the language, the entire lexical analyser (using a finite automaton) and the entire syntax analyser for the whole language were first constructed with the LL(1) strategy. The semantics were then added for subsets of the language of increasing complexity.

6.20 This approach was followed for some of the other programming language implementations on other platforms, not reported here.

6.21 The M.Phil program in Computer Methods created around 100 trained and qualified personnel, and the in-house training at the Computer Group, ECIL, around 700 trained personnel in computer science. These figures do not include customer training and dedicated hardware training programs on the TDC series.

6.22 A record of optimisations (RO) was maintained as discussed in chapter 0 for documentation purposes.

6.23 Though quite a few of these personnel have been garnered by the West, a sizable portion of them are spread over the Indian computer industry and academic institutions offering programs in computer science/engineering/technology.

6.24 Though the TDC effort has supplied the country with around 1000 systems, one of the main achievements of the effort was the creation of trained personnel for the Indian computer industry and academic institutions.

6.25 A few comments may be in order to discuss the overall experimental conclusions:

6.25.1 Experiment 5 dealt with different symbol table strategies for a standard PASCAL compiler, and the elegance of the solution used in the PASCAL(P) technique was appreciated.

6.25.2 Experiment 9 is interesting as APL is a real-life language which requires a right to left parse. The syntax analyser was LL(1) based. By and large the experiments (spread out over a decade or two) indicated that LL(1) was a sufficient practical technique for most programming languages. The IHLL was PASCAL; a later implementation was verified in the IHLL and then reduced by handcoding to FORTRAN-IV and assembly language.

6.25.3 Experiment 6 was support material for the TDC-316 commercial ALGOL-60 compiler, which followed the Whetstone Compiler design of Randell and Russell. The Whetstone compiler stands out as a published version of a more or less correct compiler.

6.25.4 Experiments 3, 5, 7 and 12 used a hand-conversion of recursive descent to FORTRAN. Though this was done, it was found to be an error prone process. The use of syntax scaffolding using LL(1), either formally or intuitively, with the extended Backus Normal Form (BNF) notation, but using the same solutions for semantics, proved to be much better. In the extended BNF notation a bracket structure indicating one or more iterations was found useful, along with the usually found bracket structure for zero or more iterations. These correspond to the repeat and while control structures found in a PASCAL-like language. It was found that handcoding recursive routines using HGA is not easy.
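A minimal sketch of the two bracket structures and their realisation as parsing loops is given below. The grammar fragments and routine names are hypothetical, an identifier being taken as a single letter purely for brevity; the point illustrated is only the correspondence of the one-or-more bracket with repeat and the zero-or-more bracket with while.

program EbnfLoops;
{ A minimal sketch (hypothetical grammar) of how the two extended-BNF
  bracket structures map onto PASCAL control structures:
    one or more  : number = digit { digit }       -> repeat loop
    zero or more : idlist = ident { "," ident }   -> while loop }
var
  look : char;
  line : string;
  pos  : integer;

procedure error(msg : string);
begin writeln('error: ', msg); halt end;

procedure getsym;
begin
  if pos <= length(line) then
  begin look := line[pos]; pos := pos + 1 end
  else look := '#'
end;

procedure number;                       { one or more digits }
begin
  if not (look in ['0'..'9']) then error('digit expected');
  repeat
    getsym                              { consume one digit }
  until not (look in ['0'..'9'])        { one-or-more bracket: repeat }
end;

procedure ident;                        { a single letter, for brevity }
begin
  if look in ['a'..'z'] then getsym else error('ident expected')
end;

procedure idlist;                       { ident { "," ident } }
begin
  ident;
  while look = ',' do                   { zero-or-more bracket: while }
  begin getsym; ident end
end;

begin
  line := 'a,b,c'; pos := 1; getsym;
  idlist;
  if look = '#' then writeln('idlist accepted');
  line := '1994'; pos := 1; getsym;
  number;
  if look = '#' then writeln('number accepted')
end.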

6.25.5 Experiments 18 and 19 allowed the conclusion that it really does not matter what method of syntax analysis is used, so long as it is sufficient to serve as a scaffolding on which to hang the semantic routines. LL(1) is perhaps the easiest to understand and the most appropriate. HGA can be used to obtain the syntax analyser by handcoding rather than by a formal use of a construct for the syntax of traditional major real-life HLLs.

6.25.6 Experiment 14 used a small subset of ALGOL-60 for which LL(1) was used in the syntax analyser. In fact the subset was so chosen as to use LL(1) only.

6.25.7 Experiment 8 indicated that simple precedence error recovery is very poor, and it is difficult to talk about error correction with this technique.

6.25.8 Experiment 13 can be looked upon as a culmination of the introduction of macro facilities in the assembler development on the TDC family of machines. Here HGA was used with PASCAL as the IHLL, handcoded to assembly language.

6.25.9 Experiment 17 is perhaps the only SNOBOL interpreter implemented in India. Here the IHLL was PASCAL and the handcoding was to FORTRAN-IV and assembly language.

6.25.10 Experiment 20 consisted of a number of projects to study various syntax analysis techniques. The results were used for student projects in compiler writing courses in the M.Phil and the computer science/engineering oriented in-house training programs. Here the IHLL was PASCAL and the handcoding was to FORTRAN-IV and assembly language.

6.25.11 Experiment 1 used subsets of PASCAL, FORTRAN, ALGOL-60 and COBOL to study error correction strategies and hand experiments in semi-automatically generated transition matrices. The grammars used were historically tiny as the technique is demanding in space, time and code.

6.25.12 Experiments 9, 11, 14, 15, 18 and 19 indicated that though recursive descent (as in experiments 3, 12 and 15) is a technique which allows the syntax analyser to be written almost as fast as one can write, using LL(1) intuitively with the extended Backus Normal Form one can, with a little bit of practice, use the stack explicitly and still generate the syntax analyser almost as fast as one can write; this is thus an application of HGA.

6.26 One major conclusion drawn was that it is sufficient to use formal syntax analysis methods for guidance and to use patches which use state variables, as in the Whetstone Compiler. Thus a rigid conformance to formal techniques is not necessary.

6.27 Experiments 2 and 4 used LL(1) (recursive descent) equivalents and parsers using the extended BNF, the parsers being developed by handcoding.

6.28 Experiment 20 showed the use of a fast transitive closure algorithm especially as one was restricted by the memory size and segmentation feature of the IRIS-55. A fast transitive closure algorithm using the adjacency representation cuts down the time and memory required as discussed in chapter 2.

6.29 The experiments indicated that simple algorithms/solutions for syntactic error correction over a variety of languages most probably do not exist.

6.30 The code generation in some of the experiments was restricted to the compile-and-go concept, where inefficient code is generated and immediately executed. In some of the experiments studies were made to generate code for the TDC-312 and TDC-316 etc. as well, so that experience and knowhow were gained in code generation and code optimisation for 0-address, 1-address and 2-address machines. The bypassing of the code generation phase, or keeping it simple, made the application of HGA more tractable. (A miniature sketch of the compile-and-go idea for a 0-address machine is given after 6.37 below.)

6.31 Experiment 20 used as test data large subsets of C, PASCAL, ALGOL-60, BASIC, COBOL and FORTRAN. The LL(1) parser experiment also used LISP and APL as test data.

6.32 The basic philosophy generally followed was that theory and formal techniques only guide one's thinking and need not be rigidly adhered to. Thus most of the scanners only intuitively and broadly used the finite automata concept, and, except in experiment 20, most syntax analysers used the overall formal techniques as gross guidelines of the parsing technique involved.

6.33 In all the experiments extensive code optimisation techniques were not used, as the TDC environment did not require the same. A reasonable amount of optimisation was however used. This may have made HGA application easy.

6.34 Experiment 20 also experimented with the possibility of breaking up a grammar to see which technique fits which part. The test data was however for very small grammars.

6.35 The absence of any relevant reliable compiler (let alone an optimising one) was the motivation for using HGA in all the above experiments. The high unreliability of the compilers developed by HGA had to be taken into account, owing to the high unreliability of the hardware. The background of the programmers who practised HGA is as indicated. The current availability of technology makes a repetition of the above using HGA unnecessary except for expositional purposes.

6.36 Formal verification techniques were used with the view that they need not be formally applied but mainly guide one's thinking. These were mainly used on the PASCAL-like HLL specifications. The above implementations are not necessarily production software oriented except for (6).

6.37 The conclusion drawn was that HGA aids in controlling the complexity involved in 'thinking' out the implementation, though in these applications the assembly language equivalents are not generated by hand except for experimental FORTRAN subsets on the TDC-316 and TDC-332 systems.
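The miniature sketch referred to in 6.30 follows. The 'code' is simply a postfix string of single digits and operators, as would be emitted by a front end such as the recursive-descent sketch given earlier in this chapter, and it is executed immediately on a simulated operand stack; the driver string and the stack size are illustrative only.

program CompileAndGoSketch;
{ A minimal sketch (hypothetical) of the compile-and-go idea for a
  0-address (stack) machine: postfix 'code' is executed at once. }
const
  maxstack = 50;
var
  code  : string;                          { '12+3*' is postfix for (1+2)*3 }
  stack : array[1..maxstack] of integer;
  top, i, a, b : integer;
  c : char;

procedure push(v : integer);
begin top := top + 1; stack[top] := v end;

function pop : integer;
begin pop := stack[top]; top := top - 1 end;

begin
  code := '12+3*';
  top := 0;
  for i := 1 to length(code) do
  begin
    c := code[i];
    if c in ['0'..'9'] then
      push(ord(c) - ord('0'))              { load an immediate operand }
    else
    begin
      b := pop; a := pop;                  { 0-address: operands are implicit }
      case c of
        '+': push(a + b);
        '-': push(a - b);
        '*': push(a * b);
        '/': push(a div b)
      end
    end
  end;
  writeln('result = ', pop)                { prints 9 for (1+2)*3 }
end.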


PRACTICAL SYSTEMS PROGRAMMING ORIENTED COMPILER AND COMPILER RELATED PROJECTS WITH EMPHASIS ON A STUDY OF SYNTACTIC ERROR RECOVERY AND CORRECTION AND WITH THE USE OF HGA.

COMPUTING ENVIRONMENT: TDC-332/System 332/IRIS-55/PC-286,-386,-486, Pentium/ND series/CDC series/PC-LAN.

S.No.  Project and Purpose                                  Assistance/Implementation

01  Syntactic error recovery/correction with the use of transition matrices in compiling.
    Purpose: Experimental study of ad hoc strategies by the use of HGA.
    P. Gopalakrishnan, M.Phil Thesis, 1975-76.

02  A study of syntactic error correction using the LL(1) strategy of parsing.
    Purpose: Experimental study as per Floyd's approach.
    J. Vaidyanathan, M.Phil Thesis, 1975-76.

03  An implementation of PL/0 through a programming technique (HGA).
    Purpose: A study of Hansen's suggestion with the IHLL as PASCAL and the THLL as Assembly/Machine/FORTRAN.
    K. Gangaram, M.Phil Thesis, 1977-78.

04  Review of error correction through LL(1) techniques using HGA.
    Purpose: A refined experimental study as per Floyd's approach with the IHLL as PASCAL and the THLL as PASCAL.
    T. Rajmouli, M.Phil Thesis, 1977-78.

05  Generalised symbol table organisation for a PASCAL compiler using HGA with IHLL as PASCAL and THLL as Assembly/Machine/FORTRAN-IV.
    Purpose: An unwinding of a generalisation of the PASCAL(P) symbol table using IHLL as PASCAL and THLL as assembly/machine/FORTRAN.
    G.V. Subramaniam, M.Phil Thesis, 1977-78.

06  Design and implementation of features for TDC-316 ALGOL-60.
    Purpose: A review of the literature of I/O of ALGOL-60 and practical alternative implementations of I/O with FORTRAN as IHLL and assembly as THLL.
    B. Sukumaran Nair, M.Phil Thesis, 1975-76.

07  Anatomy of a typical PASCAL compiler.
    Purpose: An unwinding by hand of PASCAL(P) to FORTRAN as the implementation language.
    M. Mohan Reddy, M.Phil Thesis, 1982-83.

08  Experimental implementation with syntactic error-correction of an EULER interpreter.
    Purpose: An experimental study of simple precedence error recovery and correction.
    K. Subramaniam, M.Phil Thesis, 1981-82.

09  An experimental syntax oriented implementation of APL.
    Purpose: A syntax directed compiler for APL and the use of APL as a hardware description language for the TDC family of machines.
    Rita Siviah, M.Phil Thesis, 1980-81.

10  A critical study of implementation of the PASCAL(P) stack machine.
    Purpose: Towards the first implementation of PASCAL in India.
    IIT Madras Summer Trainees, 1976.

11  Syntax analyser with error correction of Concurrent PASCAL.
    Purpose: A study of Concurrent PASCAL for systems programming using HGA.
    N.T. Sreekumar, M.Phil Thesis, 1983-84.

12  An implementation of ALGOL-60 using the description of Grau et al.
    Purpose: An unwinding by hand, using HGA, of the ALGOL-60 translator which uses recursive descent.
    Trainees Project, 1978-79.

13  An implementation of GPM using HGA.
    Trainees Project, 1975.

14  An implementation of an ALGOL-68 subset using HGA.
    Purpose: A small subset of ALGOL-68 with IHLL as PASCAL and THLL as FORTRAN using HGA.
    Trainees Project, 1980.

15  An experimental LISP interpreter.
    Purpose: An implementation of Common LISP with PASCAL as IHLL and FORTRAN as THLL on the IRIS-55.
    Trainees Project, 1976.

16  Implementation of PL/1 subsets with HGA.
    Purpose: A small subset of PL/1 implemented as per the SP(K) subsets of Holt with PASCAL as IHLL and FORTRAN as THLL on the IRIS-55/System 332.
    Trainees Project, 1980-82.

17  An implementation of SNOBOL using HGA.
    Purpose: The first implementation of SNOBOL in India with PASCAL as IHLL and FORTRAN as THLL on the IRIS-55.
    Trainees Project, 1982.

18  A syntax directed implementation of FORTRAN.
    Purpose: An implementation of a large FORTRAN subset using 5 different parsing techniques but the same semantics.
    Trainees Project, 1976.

19  Syntax directed implementations of PASCAL using HGA with FORTRAN as THLL and PASCAL as IHLL.
    Purpose: An implementation of PASCAL using 5 different parsing techniques but with the same semantics, using HGA on the IRIS-55/System 332.
    Trainees Project, 1979.

20  Implementation of parser generators for simple precedence, operator precedence, extended precedence, transition matrices, LL(1), LR(1) and LALR(1) techniques.
    Purpose: The first such attempts in India.
    Trainees Project, 1974-82.

21  An experimental implementation of LOGO.
    Purpose: To develop knowhow in graphics.
    Trainees Project, 1983.

22  An experimental implementation of SmallTalk using HGA with PASCAL as IHLL and C++ as THLL.
    Trainees Project, 1985.

23  An experimental implementation of PHIGS with PASCAL as IHLL and C++ and assembly as THLL.
    Trainees Project, 1986.

24  An experimental implementation of ADA subsets using HGA.
    C. Subramaniam, M.Tech. Thesis, 1990-92.

25  An experimental implementation of CHILL using HGA for communication software.
    M.Tech. Thesis, 1992-93.

26  An experimental implementation of FORTH using HGA with PASCAL as IHLL and assembly as THLL, for astronomy departments.
    M.Tech. Thesis, 1986-87.

27  An experimental word-processor on the ND-55 using HGA with PASCAL as IHLL and C as THLL.
    Trainees Project, 1989.

28  An experimental implementation of MODULA using IHLL as PASCAL and THLL as C++.
    K. Neeraja, M.Phil Thesis, 1985.

29  An experimental implementation of PROLOG with C as THLL and PASCAL as IHLL.
    H. Padmashree, R&D Project, 1987.

30  An experimental implementation of CORAL-66 with PASCAL as IHLL and assembly as THLL.
    R&D Project, 1977.

 

INTRODUCTION

The seventh application of HGA elaborates on the third and shows a logical extension to bottom-up parsing techniques: from the constants in the parsing table that standard precedence techniques (and most generalisations of the same) consider, the ALCOR group considered algorithms in the parse table. The incorporation of error-correction suggested augmentation of the algorithms with data structures, and hence encapsulation in an object is a logical extension of the generalisation. Thus an object in a table of objects is to be messaged based on the top of the parse stack and the incoming symbol. The necessity of having to deal with different abstractions of the problem at the same level in the third application generalises to the concepts of Data Abstraction and Polymorphism in a very logical extension of ideas. The development of the TM parser, first without error-correction and then with error-correction incorporated as a subsequent step (with syntactic and semantic backup), allows the development of the Inheritance concept.

Thus we have the concepts of Data Abstraction, Polymorphism, Encapsulation and Inheritance from the intrinsic ideas involved in the third application, and this generalises to the concept of OOP (Object Oriented Programming). The parse table of objects obtained contains in effect a data-base based on the syntax of the programming language considered, and this generalises to the concept of OODBMS (Object Oriented Data Base Management Systems).

Thus a hypothetical point of view can be taken that OOP and OODBMS follow as logical generalisations of the intrinsic ideas involved in the efforts of the ALCOR group. This generalises HGA to use Object PASCAL in HLL descriptions especially in the Graphics Specialism.

7.1 To explore the use of HGA with OOP, a cosmetic software package for a civil engineering application has been underway since 1993. This is the development of a special Estimation and Costing Package for use by PC users, oriented to small to medium size (by Indian urban standards) construction of residential buildings/complexes and aiming at a target of corresponding builders/contractors. The special features of this package are:

a) The input is through a special purpose interactive cum procedural language 'CIVIL' which, with the FONT PROCESSING of the custom-made GIST card (Department of Electronics, Government of India) through CDAC (Pune, India), allows a single Machine Translation to be sufficient for a multi-lingual interface involving India's 15 major languages.

b) The language CIVIL's design is simplified by the standardisation of terminology and methods of the Costing and Estimation sub-area of Civil Engineering. LEX/YACC under UNIX are used here for a conversion from CIVIL (multi-lingual) to C++ for the procedural features, the interactive features being mapped onto the 'breakpoint' feature.

c) A dBase oriented database (with a multi-lingual interface that GIST allows) is to be used for the package for spatial/temporal/historical information.

d) Extensions to the man-machine interface by pen-input are planned for Graphical I/O, along with primitive intelligent Script Text Processing using neural nets.

e) Automatic OCR and Primitive Intelligent script processing techniques to input building plans that exist as a historical backlog or existing current information repositories.

f) Spoken input/output interfaces in the syllabaries in which Indian languages are written. For input, primitive neural nets with a limited vocabulary are being considered.

g) Extensions to other aspects of the Contractor/Builder's work by a suitable adaptive (temporal/historical) data-base with primitive transaction processing is being considered.

h) Application of existing 3-D graphics packages and picture data-bases to be integrated into the software to aid visual (cosmetic) views.

i) For commercialization, suitable cosmetic engineering using graphics and multi-media is being considered, like 'overdone' HELP and DIAGNOSTICS.

j) Existing standard PC based integrated packages for DSS are being exploited with a suitable WINDOWS environment along with the multi-lingual interface which GIST provides.

k) The aim of the project is to apply and study the use of HGA in emerging technologies.

7.2 A use of OOP with HGA is in pattern matching over a range of geometries. An elementary but encyclopaedic treatment of Geometry is found in Klein [Kle,39]. An extension of the methods of chapter 4 is possible to affine geometries and more generalised geometries.

7.3 In chapter 4 a pattern-matching problem of a sub-pattern against a template of a full master-pattern was considered. There the digitisation was by representing both the patterns by finite sets of computable points. The patterns, by generalisation, can be considered to be digitised by a finite number of computable geometric manifolds, i.e. either as a collection of lines, planes, or in general the equivalent algebraic manifolds.

7.4 To perform the pattern-matching subject to a threshold T it is only necessary to partition the patterns suitably and apply the redistribution theorem.

7.5 In affine geometry one can consider the transformation of points as being defined by the formulae:

x' = a1x + b1y + c1z + d1
y' = a2x + b2y + c2z + d2
z' = a3x + b3y + c3z + d3

7.5.1 Thus we have 12 unknowns and given 4 points we have enough linear equations to obtain the transformation.
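Since each of x', y', z' depends linearly on (x, y, z) through its own four coefficients, the 12 unknowns decouple into three 4x4 linear systems, one per output coordinate. The sketch below illustrates this counting argument with hypothetical sample points and a plain Gaussian elimination without pivoting safeguards; it is an illustration only, not production code from the package.

program AffineFitSketch;
{ Recover x' = a1 x + b1 y + c1 z + d1 (and similarly y', z') from
  4 point correspondences: one independent 4x4 system per coordinate. }
type
  vec4 = array[1..4] of real;
  mat4 = array[1..4, 1..4] of real;
var
  p, q   : array[1..4, 1..3] of real;   { p = source points, q = their images }
  m      : mat4;
  rhs, s : vec4;
  i, j, co : integer;

procedure solve4(a : mat4; b : vec4; var x : vec4);
var i, j, k : integer; f : real;
begin
  for k := 1 to 3 do                     { forward elimination, no pivoting }
    for i := k + 1 to 4 do
    begin
      f := a[i, k] / a[k, k];
      for j := k to 4 do a[i, j] := a[i, j] - f * a[k, j];
      b[i] := b[i] - f * b[k]
    end;
  for i := 4 downto 1 do                 { back substitution }
  begin
    f := b[i];
    for j := i + 1 to 4 do f := f - a[i, j] * x[j];
    x[i] := f / a[i, i]
  end
end;

begin
  { four sample points and their images under the hypothetical map
    x' = x + 2,  y' = 2y,  z' = x + z }
  p[1,1]:=1; p[1,2]:=0; p[1,3]:=0;   q[1,1]:=3; q[1,2]:=0; q[1,3]:=1;
  p[2,1]:=0; p[2,2]:=1; p[2,3]:=0;   q[2,1]:=2; q[2,2]:=2; q[2,3]:=0;
  p[3,1]:=0; p[3,2]:=0; p[3,3]:=1;   q[3,1]:=2; q[3,2]:=0; q[3,3]:=1;
  p[4,1]:=0; p[4,2]:=0; p[4,3]:=0;   q[4,1]:=2; q[4,2]:=0; q[4,3]:=0;

  for co := 1 to 3 do                    { one 4x4 system per coordinate }
  begin
    for i := 1 to 4 do
    begin
      for j := 1 to 3 do m[i, j] := p[i, j];
      m[i, 4] := 1.0;                    { coefficient of the constant d }
      rhs[i] := q[i, co]
    end;
    solve4(m, rhs, s);
    writeln('coordinate ', co, ': a=', s[1]:5:2, ' b=', s[2]:5:2,
            ' c=', s[3]:5:2, ' d=', s[4]:5:2)
  end
end.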

7.5.2 In projective geometry the transformation is:

x' =(a1x + b1y + c1z +d1)/(a4x + b4y + c4z + d4)

y' =(a2x + b2y + c2z + d2)/(a4x + b4y + c4z + d4)

z' =(a3x + b3y + c3z +d3)/(a4x + b4y + c4z + d4)

7.5.3 Here we have 16 unknowns, or 15 after leaving out a constant factor. Given 5 points we can determine the transformation parameters.

7.6 The above is formalised in Klein's Erlangen program as saying that affine geometry is represented by the affine group G12 and projective geometry by the projective group G15. For naive pattern matching it is only necessary to ensure that the required number of points k (k = 4 for affine geometry and k = 5 for projective geometry) lies in one partition when applying the pigeonhole principle. If T/(k-1) - 1 partitions are created and the points in the master-pattern or sub-pattern distributed among the same, then we are sure by the pigeonhole principle that if a match of threshold T or more exists then k points in the match occur in some partition. We are also sure that all matches with T or more points matching will have k points in some partition. In Euclidean space we can optimise by discarding points based on the matching metric of lengths. The redistribution theorem can then be used to suppress matches of fewer than T points.
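A minimal sketch of the partitioning step is given below. The point counts, the threshold and the round-robin distribution are illustrative; the partition count is written as (T-1) div (k-1), which coincides with the text's T/(k-1) - 1 whenever k-1 divides T exactly.

program PigeonholeSketch;
{ With P = (T-1) div (k-1) partitions, any set of T or more matching
  points distributed among the partitions must place at least k of them
  in some one partition, by the pigeonhole principle.  The 'points' here
  are just indices distributed round-robin. }
const
  M = 20;                 { points in the master-pattern (illustrative)  }
  T = 9;                  { match threshold                              }
  k = 4;                  { points needed to fix an affine transformation }
var
  partitionOf : array[1..M] of integer;
  count       : array[1..M] of integer;
  P, i : integer;
begin
  P := (T - 1) div (k - 1);               { largest safe partition count }
  writeln('number of partitions P = ', P);

  for i := 1 to M do                      { round-robin distribution }
    partitionOf[i] := ((i - 1) mod P) + 1;

  { suppose points 1..T happen to be the matching points: count them }
  for i := 1 to P do count[i] := 0;
  for i := 1 to T do
    count[partitionOf[i]] := count[partitionOf[i]] + 1;

  for i := 1 to P do
    if count[i] >= k then
      writeln('partition ', i, ' holds ', count[i],
              ' matching points (>= k = ', k, ')')
end.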

7.7 A generalisation is to use geometric manifolds instead of points. If the master-pattern consists of a finite number of computable manifolds (lines, areas, volumes, algebraic manifolds) and the sub-pattern is similarly considered to be a finite number of geometric manifolds (lines, areas, volumes, algebraic manifolds), and a requirement exists of at least T manifolds matching, it is only necessary to create (T/(k-1)) - 1 partitions. Here k is the number of points that determine the manifold to be considered. By distributing the points among the partitions we ensure by the pigeonhole principle that k points occur in some one partition. Thus by considering the points in a partition in groups of k one is certain to consider a manifold in a match, if it exists. Depending on the particular geometry considered one can apply the redistribution theorem, or cut down the number of cases depending on some metric in metrical geometry.

7.8 The simplest cases are those where a manifold in the master-pattern can be considered to have a unique corresponding manifold in the sub-pattern (or vice versa). In the case of the geometry of continuous deformations we have a group of infinite elements and the method breaks down.

7.9 Once we have determined one or two matching manifolds, the transformation is to be applied to all the points in the sub-pattern (master-pattern). Thus we still have a 'try all possibilities' component, as we still try all groups of k points in a partition. This 'try all possibilities' component can benefit from HGA and OOP, the latter for software reuse purposes with HC.

7.10 It is known that in the application of affine geometry the procedure of estimating motion parameters from a pair of range images by a solution of linear equations based on 4 points is naive, as one has to consider errors in the matching of points. A generalisation of the techniques of chapter 5 can be applied in this case, though what is normally used is to isolate features as the geometric manifolds.

INTRODUCTION

The eighth application of HGA is in the non-traditional area of the applicability of real-life programming language (ALGOL-like, PASCAL-like or COBOL-like) descriptions to aid the popularisation of education and literacy in Automata Theory and the Theory of Computation in the Indian environment, and thus applies HGA to (reliable (!)) abstract target machines. It was found that the use of the RAM and RASP turned out to be crucial in the training, instruction and mastery of the machine/assembly language instruction repertoire of a given processor. One can easily master the entire instruction repertoire, repeating elementary programming examples, using HGA, by isolating universal subsets of instructions in the repertoires. In this one takes abstract views of the instruction repertoire subsets, as dealing with a set manipulating machine or a propositional calculus machine at one end, and at the other extreme end the full power of a Universal machine: a string manipulating machine or a machine doing fixed and floating point arithmetic through algorithms. Such studies and training, in a totally industrial environment, were crucial in training assembly/machine language programmers, from the dilettante programmer to hard-core systems programmers, with proven success for a decade and a half. The TDC-12 and TDC-312 (Trombay Digital Computer, 3rd generation, 12 bit word length) employed octal machine language programming over a range of applications in Controls, Communications and Computers. An upgradation to assembly language took place only with the TDC-316, as the assembler (paper tape oriented on the TDC-12, -312) was too slow.

8.1 An extension of HGA, not too popular, was to use other abstract machines at the lower end of the hand-coding. While it was found that the RAM and RASP were directly useful, the equivalent abstract machines of the Chomsky hierarchy only served as cultural background, barring the finite automaton (which is everywhere). HGA, when used for elementary Computational Complexity theorems, is effective but laborious. It was concluded that while it is useful for introductory and moderate theorems (results), there is no other way to study, train, think, teach or do the area except as propounded by the pre-eminent Prof. J.E. Hopcroft (1969, 1979). Similar attempts are seen recently and similar conclusions can be drawn.

8.1.2 A different line of generalisation of HGA is to the abstraction of Automata. Automata are nothing but computing agents, and the full description of their behaviour is cumbersome and at times complex. HLL specifications have been usefully employed to describe their behaviour and construction.

8.1.3 The extreme length and complexity of a detailed description can at times be shortened by an algorithmic description. This can be considered a useful application of HGA. Thus all constructions in Formal Language and Automata Theory can be described in detail by an algorithmic specification in a real-life PASCAL-like language, with suitable ADTs, and these can be reduced by hand to the detailed abstract formalisations.

8.1.4 An extension of such constructions to Post machines and program machines is relatively straightforward.

8.1.5 However, such applications of HGA are fruitful, in a practical sense, only for simple constructions. More advanced constructions, though effective, turn out to be meaningless unless they are integrated by the use of the 'intuitive arguments' normally employed in more advanced constructions. Furthermore, demonstrations by counter-examples, as in the non-closure of cfg's under intersection, seem to have no mapping into HGA. Demonstrations like the decidability of equivalence of fa seem to be easier in the mathematical formulation of sets rather than in PASCAL-like descriptions, though these aid in controlling the abstraction of the formalism. The HLL descriptions in such cases may not be considered to be practical algorithms, but effectively an aid to control complexity and undue abstraction.

8.1.6 HGA as used here is an aid to describe the constructions and an aid in understanding 'what is done', but no aid in mastering the intuitive arguments and techniques that go into how to 'do the area'.

8.2 HGA applications to the area of syntax analysis techniques are extremely practical and have been employed, but in the pure area of Formal Language and Automata Theory perhaps a new environmental language and/or technique has to develop; just as ADTs help in controlling complexity, a new type of ADT oriented towards this area has to develop.

8.2.2 It is felt that this should be on the lines of tradeoffs between complexity of formalism, complexity of abstraction, complexity of HLL specifications/descriptions, complexity arising from different abstractions operating at the same level of abstraction, and complexity of hand-translation. The extended PASCAL to be employed in HGA thus varies depending on the domain of problems considered.

8.2.3 Thus a universalisation of HGA will require augmentation by suitable borrowings of features from different programming environments, and the techniques, skills and methodologies will vary for effective practical applicability.

8.2.4 Thus we have logical extensions to HGA: to choose the appropriate 'source specification', the 'translation specification' and the final 'target specification'. Thus if ultimately the target is machine language, we have to go through a variety of HGA stages. HGA is therefore not a single stage application, but in general will involve stages of greater and greater refinement.

8.2.5 Thus the HGA process will have to be considered by itself as a meta-process of stepwise refinements of the HGA process.

8.2.6 Assembly Language Programming (the use of RASP and Post Machines and Program Machines):

8.2.7 The instruction repertoire of real life digital computers, whether character-oriented (IBM 1401), decimal-oriented (IBM 1620), or binary machines, is partitioned into subsets such that each subset is universal in the sense of being able to compute the partial recursive functions.

8.2.8 An obvious subset consists of the arithmetic operations, the compare and conditional jumps, and the unconditional jumps. By reducing multiplication/division to addition, making at times the assumption of unlimited word length, one gets more and more universal sets.

8.2.9 By dropping the restriction on the register size, assuming it to be unlimited in size, one gets new register-oriented universal sets.

8.2.10 By reducing arithmetic to repeated incrementation and operators one isolates smaller universal sets.
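To illustrate the spirit of the reduction exercise, the sketch below realises multiplication by repeated addition and addition by repeated incrementation; in the actual exercise each routine would then be hand-coded into the isolated universal subset of the machine's instructions. The routine names and operand values are illustrative, and non-negative operands are assumed.

program UniversalSubsetSketch;
{ Multiplication reduced to repeated addition, and addition reduced to
  repeated incrementation/decrementation, so that only increment,
  decrement and a conditional test (here: while) are needed. }
var x, y : integer;

function add(a, b : integer) : integer;   { a + b by repeated INC/DEC, b >= 0 }
var r : integer;
begin
  r := a;
  while b > 0 do
  begin r := r + 1; b := b - 1 end;
  add := r
end;

function mul(a, b : integer) : integer;   { a * b by repeated addition, b >= 0 }
var r : integer;
begin
  r := 0;
  while b > 0 do
  begin r := add(r, a); b := b - 1 end;
  mul := r
end;

begin
  x := 6; y := 7;
  writeln(x, ' * ', y, ' = ', mul(x, y))   { prints 42 }
end.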

8.2.11 Once the coding scheme is understood (BCD, EBCDIC or ASCII), one proceeds to isolate string oriented operations, using string sizes that fit only one word, two words, multiple words, or at times ignoring the word size. One then obtains an abstraction of a real-life digital computer as a string manipulating machine. A class of sub-machines considered is to view the real-life digital computer as a set manipulating machine or a propositional calculus machine, through their equivalent realisations as bit string operations, as in standard PASCAL implementations.

8.2.12 The above process of isolation of subsets is generalised to stack operations. The exercise of reducing the instruction repertoire to universal subsets allows a comprehensive study and use, with elementary programming examples, of the instruction repertoire. A cultural skill and gain is the ability to migrate to different machines in the same environment without much cultural shock.

8.2.13 Before the above method was used in instruction, the migration of assembly language programming from the TDC-12 (PDP-8 like) to the TDC-316 (PDP-11 like) was found to involve some mental blocks, owing to the powerful addressing modes of the latter, not found in the former.

8.3 The mastery and appreciation of the use of the entire instruction repertoire is thus easily achieved and then applied to more sophisticated assignments; and if the on-the-job performance is considered, the method has turned out good entry-level assembly language programmers in the areas of real-time applications, communications software and systems programming.

8.4 The elementary programming techniques used were the PASCAL-like control structures and data structures (with limited use of pointers), and the aim was to think out the programming examples in the High Level Language, map the same systematically to unoptimised assembly language equivalents, and in a final pass optimise the assembly language.

8.5 Introducing many programming languages simultaneously

In an obvious extension of finding universal subsets of features, one could study a particular programming language, or many programming languages simultaneously.

8.5.2 The crucial concept of effective computations through RASPs, now equipped more as subset-FORTRAN, subset-COBOL, subset-PASCAL, etc., is essential to allow the entry-level programmer to migrate from one programming language to another.

8.5.3 A practical application of HGA is to the problem of Pattern Matching in Euclidean space considered in chapters 4 and 5. HGA is applied to obtain a practical algorithm which when mapped onto a LAN for the chance print matching problem leads to a viable solution.

8.5.4 It was informally opined by J. E. Hopcroft (circa 1972) that perhaps the only practical result of Automata Theory is Cook's theorem. The theorem states that if a 2-way deterministic pushdown automaton can perform a computation, then it can be simulated in linear time by a RAM, and hence a practical linear algorithm emerges on current general purpose digital computers.

8.5.6 The traditional application of the theorem has been the determination of a substring in a given string, or variants of the problem. By a suitable string encoding, the pattern matching problem in Euclidean space is solved by an extension and application of the theorem, using HGA to obtain a viable solution.

8.6 An extension of Cook's theorem is to the common subsequence problem. Given two strings x = a1a2---an and y = b1b2---bm, a common subsequence is c1c2c3---ck such that there exist j1 < j2 < j3 --- < jk and l1 < l2 < l3 --- < lk, with all ji in 1..n and li in 1..m, such that a(ji) = b(li) = ci, for i in 1..k. The common subsequence problem can be solved in time O(mn) [Hop,74]. A more efficient solution shows that, in the present application, it can be solved in time O(m log n) [Hun,77].
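A minimal sketch of the O(mn) tabular solution for the length of the longest common subsequence is given below; the array bound and the driver strings are illustrative only.

program LcsSketch;
{ The O(mn) dynamic-programming table for the longest common subsequence
  of x = a1..an and y = b1..bm:
    L[i, j] = length of the longest common subsequence of a1..ai and b1..bj. }
const
  maxlen = 50;
var
  x, y : string;
  L : array[0..maxlen, 0..maxlen] of integer;
  i, j, n, m : integer;
begin
  x := 'compiler'; y := 'complex';        { illustrative strings }
  n := length(x); m := length(y);

  for i := 0 to n do L[i, 0] := 0;
  for j := 0 to m do L[0, j] := 0;

  for i := 1 to n do
    for j := 1 to m do
      if x[i] = y[j] then
        L[i, j] := L[i - 1, j - 1] + 1
      else if L[i - 1, j] >= L[i, j - 1] then
        L[i, j] := L[i - 1, j]
      else
        L[i, j] := L[i, j - 1];

  writeln('length of longest common subsequence = ', L[n, m])   { prints 6 }
end.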

8.7 A mapping of a 2DPDA computation to a RAM can be considered to be an application of HGA, as the mapping is to a practical algorithm. In the pattern matching problem the data structure that practically arises is the position tree. Given a string x we form a string x$ with a new symbol $, not in the vocabulary of x. If x$ = a1a2---an a(n+1), then the indices refer to the positions in the string of the symbols of x. Associated with each position i is the shortest substring identifier of x, which is a substring of x starting from ai and uniquely identifying the position i. A position tree is a tree whose leaves are the positions 1 to n+1, the path from the root to a leaf being the substring identifier for that position. It is known that a position tree always exists and can be constructed in O(n**2) time, or even O(n) time by compacting the chains of nodes which have only one son. The common subsequence problem can be solved by using position trees in a straightforward manner.

8.8 The algorithm for the determination of the largest common subsequence is used as a scaffolding in the pattern matching problem in Euclidean space, by a suitable string encoding. The existence of a common subsequence in the encoding is shown to yield a coarse match which is then refined to obtain a fine match.

8.9 The more efficient O(m log n) algorithm [Hun,77] uses a variant. Given two strings (sequences) x and y, a threshold matrix Tik is set up, where the value of Tik is the portion of the string y, starting from the first position, that should be considered to obtain a subsequence of k matches with the first i symbols of x. Tik can be computed in a straightforward manner from T(i-1)k and T(i-1)(k-1). By compacting the data structures the efficient algorithm arises.

Definition: Given a pattern M and a pattern N, a base vector ik is any line of M in the i-bunch-vector and a base vector jl is any line of N in the j-bunch-vector.

Definition: For an i-bunch-vector (of M) and a j-bunch-vector (of N), if vertices k, p of M and l, q of N are considered, with i not= k not= p and l not= q not= j, then the included angles kip and ljq are referred to as alpha(kip) and beta(ljq) respectively.

Definition: An ordering of lines in the i-bunch-vector (of M) is a clockwise (or anticlockwise) ordering of lines (vectors) subject to the following conditions:

a) (A i,k,p,r)(i, k, p, r in M and i not= k not= p not= r): alpha(kip) < alpha(kir) implies ir occurs later in the ordering; alpha(kip) = alpha(kir) implies ir occurs later in the ordering if L(ip) < L(ir); and alpha(kip) = alpha(kir) with L(ip) = L(ir) implies ip occurs before ir (arbitrary choice).

Definition: A string encoding of the i-bunch-vector of M is defined as P1i alpha1i P2i alpha2i --- PMi alphaMi where (Aj)(j in 1..M) Pji is the length of the line ji and alpha(ji) is the included angle between the lines ji and (j+1)i.

Definition: A string encoding of the j-bunch-vector of N is defined as P1j beta1j P2j beta2j --- PNj betaNj where (Ai)(i in 1..N) Pij is the length of the line ij and beta(ij) is the included angle between the lines ij and (i+1)j.

Terminology: The string encoding of the i-bunch-vector of M is abbreviated as C(r=1 to M) Pri alphari and the string encoding of the j-bunch-vector of N as C(r=1 to N) Prj betarj. The C operator is analogous to the pi and sigma operators used for continued products and sums and is here used for concatenation of strings.

Definition: A string encoding of M, with all the bunch-vectors, is X = C(i=1 to M) ((C(r=1 to M) Pri alphari)##).

Definition: A string encoding of N, with all the bunch-vectors, is Y = C(j=1 to N) ((C(s=1 to N) Psj betasj)##).

Definition: A string encoding of the pattern matching problem is (C(i=1 to M) (((C(r=1 to M) Pri alphari)##) Y ###))#.

Definition: A matching sequence of lengths is any subsequence of x = C(r=1 to M) Pri and y = C(s=1 to N) Psj such that, if x and y are rewritten as C(k=l to l+T-1) Pmk dmk and C(k=l to l+T-1) Pnk dnk respectively, then (Az)(z=l to l+T-1)(Pmz = Pnz), i.e. the lengths Pmz, Pnz match; the dmk and dnk are arbitrary subsequences of x and y.

Lemma: A matching sequence of lengths is a necessary condition for a match to exist.

Proof: If a match exists a T-polygon exists and hence a vertex i exists such that the i-bunch-vector has a matching sequence of lines which is the longest common matching subsequence of lengths.

Definition: The included angle associated with dmz is defined as alpha(t-1) + alpha(t)+ --- + alpha(u-1) + alpha(u) where dmz is of the form alpha(t-1)Pt alpha(t) P(t+1)--- alpha(u-1)P(u)alpha(u) in M.

Definition: The included angle associated with dnz is defined as beta(t-1) + beta(t) + beta(t+1) + --- + beta(v-1) + beta(v) where dnz is of the form beta(t-1) Pt beta(t) P(t+1) --- beta(v-1) Pv beta(v) in N.

Lemma: For three points i, k, p in M with i not= k not= p and three points j, l, q in N such that j not= l not= q, a three point match i-->j, k-->l, p-->q exists iff dmz = dnz, where dmz = alpha(ki(k+1)) P(k+1)i alpha((k+1)i(k+2)) --- P(p-1)i alpha((p-1)ip) and dnz = beta(lj(l+1)) P(l+1)j beta((l+1)j(l+2)) --- P(q-1)j beta((q-1)jq).

Proof: The included angles alphakip and betaljq must be equal.

Definition: A linear encoding of the patterns is defined as (C(i=1 to M)(C(r=1 to M) Pri alphari)#)#(C(j=1 to N)(C(s=1 to N) Psj betasj)#)##.

Definition: A prima-facie one-point match of a point i in M and a point j in N is defined as having at least T lines in the i-bunch-vector matching at least T lines in the j-bunch-vector insofar as the lengths of the vectors are concerned.

Lemma: A coarse pattern matching can be done in time O((M**2 N**2 / T**2) log max(M,N)).

Proof: In the linear encoding of the pattern matching problem only points with a prima-facie one-point match need be considered. Thus only M/T + 1 and N/T + 1 points need be considered for M and N respectively, by the pigeonhole principle. To determine prima-facie one-point matches with sorted M-lines and N-lines for each node requires O(MM log N + NN log M) time, i.e. O(KK log K) time, where K = max(M,N). To find the largest common subsequence of matching lengths of the bunch-vectors associated with points i and j (in M and N respectively) takes O(MN) time, by a straightforward algorithm. Thus the total time taken is O((MMNN/TT) log K).

Lemma: The coarse matching can be done in O((KKK/TT) log K log K) time.

Proof: By using the more efficient algorithm for determining the largest common subsequence, the lemma follows.

Lemma: A fine pattern match takes O(MMNNlogK) time in the worst case.

Proof: The previous coarse match yields sets of T points that match. For a refinement, the diagonals of the T-polygon associated with the T points have to be checked, and this takes O(T(T-1)/2) or O(TT) time for each set of T points.

Lemma: In the case of a match of fewer than T points existing, an O(MM log M) algorithm is possible.

Proof: By partitioning and redistribution one can avoid pattern matching, as all the pigeonhole cases will arise. The time taken is only for sorting and comparison of lines. Here M >= N >= T is assumed.

Lemma: In the case where a match of >= T points occurs, an O(MM log M) time algorithm is possible.

Proof: By partitioning and redistribution the only time taken is for sorting and comparison of line lengths and the final pattern matching. Here M >= N >= T is assumed.

Lemma: A practical algorithm for the pattern matching problem takes O(M log M) time with a LAN of M nodes.

Proof: A straightforward parallel processing speed-up of the algorithm, with the LAN considered as M parallel processors, yields the result. Here M >= N >= T is assumed.

Outline of the final Algorithm:

[Step 1] Form the i-bunch-vectors and j-bunch-vectors.

[Step 2] Sort M-lines and N-lines using a precomputed table to determine the Pythagorean length from the cartesian coordinates.

[Step 3] Partition and redistribute either M or N and check for matching lines.

[Step 4] Have a matrix [1..M, 1..N] to record one-point matches, which is set up in step 3.

[Step 5] For each one-point match (i,j) with i in M and j in N, use the algorithm for determining the longest common subsequences to set up count := maximum number of lengths of the i-bunch-vector that match lengths in the j-bunch-vector.

[Step 5] if count for all one-point matches is < T then the match fails.

[Step 6] Proceed to a fine match by considering all triangles in M and N with vertices i and j in the M-polygon and N- polygon to see if a match with T points exists.

Lemma: If a match of T points exists the time taken cannot be more than O(TTKlogK).

Proof: At most TT one point matches exist and the time taken to determine the longest common subsequences is O(KlogK). To determine a fine match one has to check TT triangles.

Lemma: If a match of T points exists then the time taken cannot be more than O(TTlogK) with a LAN of K nodes.

Proof: Obvious.

INTRODUCTION

The eighth application of HGA is in the non-traditional area of the applicability of real-life programming language (ALGOL-like, PASCAL-like or COBOL-like) descriptions to aid the popularisation of education and literacy in Automata Theory and the Theory of Computation in the Indian environment, and thus applies HGA to (reliable (!)) abstract target machines. It was found that the use of the RAM and RASP turned out to be crucial in the training, instruction and mastery of the machine/assembly language instruction repertoire of a given processor. One can easily master the entire instruction repertoire, repeating elementary programming examples using HGA, by isolating universal subsets of instructions in the repertoires. In this one takes abstract views of the instruction repertoire subsets: as dealing with a set manipulating machine or a propositional calculus machine at one end, and at the other extreme the full power of a Universal machine, a string manipulating machine or a machine doing fixed and floating point arithmetic through algorithms. Such studies and training, in a totally industrial environment, were crucial in training assembly/machine language programmers, from the dilettante programmer to hard-core systems programmers, with proven success for a decade and a half. The TDC-12 & TDC-312 (Trombay Digital Computer, 3rd generation, 12 bit word length) employed octal machine language programming over a range of applications in Controls, Communications and Computers. An upgradation to assembly language took place only with the TDC-316, as the assembler (paper tape oriented on the TDC-12, -312) was too slow.

8.1 An extension of HGA, not too popular, was to use other abstract machines at the lower end of the hand-coding. While it was found that the RAM and RASP were directly useful, the equivalent abstract machines of the Chomsky hierarchy only served as cultural background, barring the finite automaton (which is everywhere). HGA, when used for elementary Computational Complexity theorems, is effective but laborious. It was concluded that while it is useful for introductory and moderate theorems (results), there is no other way to study, train, think, teach or do the area except as propounded by the pre-eminent Prof. J.E. Hopcroft (1969, 1979). Similar attempts are seen recently and similar conclusions can be drawn.

8.1.2 A different line of generalisation of HGA is to the abstraction of automata. Automata are nothing but computing agents, and the full description of their behaviour is cumbersome and at times complex. HLL specifications have been usefully employed to describe their behaviour and construction.

8.1.3 The extreme length and complexity of a detailed description can at times be shortened by algorithmic descriptions. This can be considered a useful application of HGA. Thus all constructions in Formal Language and Automata Theory can be described in detail by an algorithmic specification in a real-life PASCAL-like language, with suitable ADTs, and these can be reduced by hand to the detailed abstract formalisations.

8.1.4 An extension of such constructions to Post machines and program machines is relatively straightforward.

8.1.5 However, such applications of HGA are fruitful, in a practical sense, only for simple constructions. More advanced constructions, though effective, turn out to be meaningless unless they are integrated with the 'intuitive' arguments normally employed in more advanced constructions. Furthermore, demonstrations by counter-examples, as in the non-closure of context-free languages under intersection, seem to have no mapping into HGA. Demonstrations like the decidability of the equivalence of finite automata seem to be easier in the mathematical formulation of sets than in PASCAL-like descriptions, though these aid in controlling the abstraction of the formalism. The HLL descriptions in such cases may not be considered practical algorithms, but effectively an aid to control complexity and undue abstraction.

8.1.6 HGA as used here is an aid to describe the constructions and an aid in understanding 'what is done', but it is no aid in mastering the intuitive arguments and techniques that go into how to 'do the area'.

8.2 HGA applications to the area of syntax analysis techniques are extremely practical and have been employed, but in the pure area of Formal Language and Automata Theory perhaps a new environmental language and/or technique has to develop; just as ADTs help in controlling complexity, a new type of ADT oriented towards this area has to develop.

8.2.2 It is felt that this should be on the lines of tradeoffs between complexity of formalism, complexity of abstraction, complexity of HLL specifications/descriptions, complexity arising from different abstractions operating at the same level of abstraction, and complexity of hand-translation. The extended PASCAL to be employed in HGA thus varies depending on the domain of problems considered.

8.2.3 Thus a universalisation of HGA will require, augmentation by suitable borrowings of features from different programming environments and the techniques, skills and methodologies will vary for effective practical applicability.

8.2.4 Thus we have logical extensions to HGA: to choose the appropriate 'source specification', the 'translation specification' and the final 'target specification'. If ultimately the target is machine language, we have to go through a variety of HGA stages. Thus HGA is not a single stage application, but will in general involve stages of greater and greater refinement.

8.2.5 Thus the HGA process will have to be considered by itself as a meta-process of stepwise refinements of the HGA process.

8.2.6 Assembly Language Programming (the use of RASP and Post Machines and Program Machines):

8.2.7 The instruction repertoire of real-life digital computers, whether character-oriented (IBM 1401), decimal-oriented (IBM 1620) or binary machines, is partitioned into subsets such that each subset is universal in the sense of being able to compute the partial recursive functions.

8.2.8 An obvious subset consists of the arithmetic operations, the compare and conditional jumps, and the unconditional jumps. By reducing arithmetic multiplication/division to addition, at times making the assumption of unlimited word length, one gets more and more universal sets.
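As an elementary illustration (not part of the original training material), a minimal PASCAL sketch of the kind of exercise involved is given below: multiplication realised through the reduced universal subset of addition, comparison and jumps (rendered here as a while loop). The operand values are assumptions for the example only.

program MulByAdd;
var
  multiplicand, multiplier, product, count: integer;
begin
  multiplicand := 7;
  multiplier := 13;
  product := 0;
  count := 0;
  { repeated addition stands in for the MUL instruction }
  while count < multiplier do
  begin
    product := product + multiplicand;
    count := count + 1
  end;
  writeln(multiplicand, ' * ', multiplier, ' = ', product)
end.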

8.2.9 By dropping the restriction on the register size, assuming it to be unlimited in size, one gets new register-oriented universal sets.

8.2.10 By reducing arithmetic to repeated incrementation one isolates smaller universal sets.

8.2.11 Once the coding scheme is understood (BCD, EBCDIC or ASCII), one proceeds to isolate string oriented operations, using string sizes that fit only one word, two words, multiple words, or at times ignoring the word size. Then we obtain an abstraction of a real-life digital computer as a string manipulating machine. Another class of sub-machines considered is to view the real-life digital computer as a set manipulating machine or a propositional calculus machine, through their equivalent realisations as bit string operations, as in standard Pascal implementations.
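As an illustration (not part of the original training material), a minimal PASCAL sketch of the set manipulating machine view is given below: a set over 0..15 is held as a 16-bit word (a bit string), and membership and union are realised by elementary arithmetic on that word. The function names and example sets are assumptions.

program BitSetMachine;
var
  a, b, u: integer;

function Pow2(k: integer): integer;
var
  i, p: integer;
begin
  p := 1;
  for i := 1 to k do
    p := p * 2;
  Pow2 := p
end;

function Member(s, e: integer): boolean;
begin
  { test bit e of the word s }
  Member := ((s div Pow2(e)) mod 2) = 1
end;

function AddElem(s, e: integer): integer;
begin
  { set bit e of the word s }
  if Member(s, e) then
    AddElem := s
  else
    AddElem := s + Pow2(e)
end;

function SetUnion(x, y: integer): integer;
var
  e, w: integer;
begin
  w := x;
  for e := 0 to 15 do
    if Member(y, e) then
      w := AddElem(w, e);
  SetUnion := w
end;

begin
  a := AddElem(AddElem(0, 3), 5);   { the set {3,5} as the word 40 }
  b := AddElem(AddElem(0, 5), 7);   { the set {5,7} as the word 160 }
  u := SetUnion(a, b);
  writeln('union as a bit word: ', u);   { 168, i.e. bits 3, 5 and 7 set }
  writeln('is 7 a member? ', Member(u, 7))
end.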

8.2.12 The above process of isolation of subsets is generalised to stack operations. The exercise of reducing the instruction repertoire to universal subsets allows a comprehensive study, and use with elementary programming examples, of the instruction repertoire. A cultural skill & gain is the ability to migrate to different machines in the same environment without much cultural shock.

8.2.13 Before the above method was used in instruction, the migration of assembly language programming from the TDC-12 (PDP-8 like) to the TDC-316 (PDP-11 like) was found to encounter some mental blocks, owing to the powerful addressing modes of the latter, not found in the former.

8.3 The mastery and appreciation of the use of the entire instruction repertoire is thus easily achieved and then applied to more sophisticated assignments; and if the on-the-job performance is considered, the method has turned out good entry-level assembly language programmers in the areas of Real-Time Applications, Communications software and systems programming.

8.4 The elementary programming techniques used were the PASCAL-like control structures and data structures (with limited use of pointers), and the aim was to think out the programming examples in the High Level Language, map the same systematically to unoptimised assembly language equivalents, and in a final pass optimise the assembly language.

8.5 Introducing many programming languages simultaneously

In an obvious use of finding universal subsets of features, one could study a particular programming language, or many programming languages simultaneously.

8.5.2 The crucial concept of effective computations through RASPs, now viewed as subset-FORTRAN, subset-COBOL, subset-PASCAL machines, etc., is essential to allow the entry-level programmer to migrate from one programming language to another.

8.5.3 A practical application of HGA is to the problem of Pattern Matching in Euclidean space considered in chapters 4 and 5. HGA is applied to obtain a practical algorithm which when mapped onto a LAN for the chance print matching problem leads to a viable solution.

8.5.4 It was informally opined by J E Hopcroft (circa 1972) that perhaps the only practical result of Automata Theory is Cook's theorem. The theorem states that if a two-way deterministic pushdown automaton can perform a computation, then it can be simulated in linear time by a RAM, and hence a practical linear-time algorithm emerges on current general purpose digital computers.

8.5.6 The traditional application of the theorem has been the determination of a substring in a given string or variants of the problem. By a suitable string encoding the pattern matching problem in Euclidean space is solved by an extension and application of the theorem, and using HGA to obtain a viable solution.

8.6 An extension of Cook's theorem is to the common subsequence problem. Given two strings x = a1a2---an and y = b1b2---bm, a common subsequence is c1c2c3---ck such that there exist j1 < j2 < j3 < --- < jk and l1 < l2 < l3 < --- < lk, with each ji in 1..n and each li in 1..m, for which aji = bli = ci for i in 1..k. The common subsequence problem can be solved in time O(mn) [Hop,74]. A more efficient algorithm [Hun,77], used in the present application, solves it in time O(m log n).
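As an illustration (not part of the original development), a minimal PASCAL sketch of the straightforward O(mn) dynamic programming solution for the length of a longest common subsequence is given below; the two example strings are assumptions chosen only for the demonstration.

program CommonSubsequence;
const
  MaxLen = 100;
var
  x, y: string;
  c: array[0..MaxLen, 0..MaxLen] of integer;
  i, j, m, n: integer;
begin
  x := 'abracadabra';
  y := 'bandana';
  m := length(x);
  n := length(y);
  { c[i,j] = length of a longest common subsequence of x[1..i] and y[1..j] }
  for i := 0 to m do c[i, 0] := 0;
  for j := 0 to n do c[0, j] := 0;
  for i := 1 to m do
    for j := 1 to n do
      if x[i] = y[j] then
        c[i, j] := c[i - 1, j - 1] + 1
      else if c[i - 1, j] >= c[i, j - 1] then
        c[i, j] := c[i - 1, j]
      else
        c[i, j] := c[i, j - 1];
  writeln('length of a longest common subsequence = ', c[m, n])
end.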

8.7 A mapping of a 2DPDA computation to a RAM can be considered an application of HGA, as the mapping is to a practical algorithm. In the pattern matching problem the data structure that practically arises is the position tree. Given a string x we form a string x$ with a new symbol $, not in the vocabulary of x. If x$ = a1a2---an a(n+1), then the indices refer to the positions of the symbols in the string. Associated with each position i is the shortest substring identifier: the shortest substring of x$ starting from ai that uniquely identifies the position i. A position tree is a tree whose leaves are the positions 1 to n+1, the path from the root to a leaf being the substring identifier for that position. It is known that a position tree always exists and can be constructed in O(n**2) time, or even in O(n) time by compacting the chains of nodes which have only one son. The common subsequence problem can be solved by using position trees in a straightforward manner.
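As an illustration (not part of the original development), a minimal PASCAL sketch is given below of the naive determination of the shortest substring identifier of each position of x$; these identifiers are exactly the root-to-leaf paths of the position tree. The example string is an assumption.

program PositionIdentifiers;
var
  xs: string;
  n, i, j, len: integer;
  unique: boolean;
begin
  xs := 'ababb$';          { x followed by the end marker $ }
  n := length(xs);
  for i := 1 to n do
  begin
    len := 0;
    repeat
      { grow the substring starting at i until it occurs nowhere else }
      len := len + 1;
      unique := true;
      for j := 1 to n - len + 1 do
        if (j <> i) and (copy(xs, j, len) = copy(xs, i, len)) then
          unique := false
    until unique or (i + len - 1 = n);
    writeln('position ', i, ': identifier ', copy(xs, i, len))
  end
end.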

8.8 The algorithm for the determination of the largest common subsequence is used as a scaffolding in the pattern matching problem in Euclidean space, by a suitable string encoding. The existence of a common subsequence in the encoding is shown to yield a coarse match, which is then refined to obtain a fine match.

8.9 The more efficient O(m log n) algorithm [Hun,77] uses a variant. Given two strings (sequences) x and y, a threshold matrix Tik is set up, where the value of Tik is the portion of the string y, counted from the first position, that must be considered to obtain a subsequence of k matches with the first i symbols of x. Tik can be computed in a straightforward manner from T(i-1)k and T(i-1)(k-1). By compacting the data structures the efficient algorithm arises.

Definition: Given a pattern M and a pattern N, a base vector ik is any line of M in the i-bunch-vector and a base vector jl is any line of N in the j-bunch-vector.

Definition: For an i-bunch-vector (of M) and a j-bunch-vector (of N), if vertices k, p of M and l, q of N are considered, with i not= k not= p and j not= l not= q, then the included angles kip and ljq are referred to as alpha(kip) and beta(ljq) respectively.

Definition: An ordering of lines in the i-bunch-vector (of M) is a clockwise (or anticlockwise) ordering of the lines (vectors) subject to the following conditions:

a) ((A i,k,p,r)(i,k,p,r in M and i not= k not= p not= r)): alpha(kip) < alpha(kir) implies ir occurs later in the ordering than ip; alpha(kip) = alpha(kir) implies ir occurs later in the ordering if L(ip) < L(ir); and alpha(kip) = alpha(kir) with L(ip) = L(ir) implies ip occurs before ir (an arbitrary choice).

Definition: A string encoding of the i-bunch-vector of M is defined as P1i alpha1i P2i alpha2i --- PMi alphaMi, where (Aj)(j in 1..M) Pji is the length of the line ji and alpha(ji) is the included angle between the lines ji and (j+1)i.

Definition: A string encoding of the j-bunch-vector of N is defined as P1j beta1j P2j beta2j --- PNj betaNj, where (Ai)(i in 1..N) Pij is the length of the line ij and beta(ij) is the included angle between the lines ij and (i+1)j.

Terminology: The string encoding of the i-bunch-vector of M is abbreviated as C(r=1 to M) Pri alphari and the string encoding of the j-bunch-vector of N as C(s=1 to N) Psj betasj. The C operator is analogous to the pi and sigma operators used for continued products and sums, and is here used for the concatenation of strings.
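As an illustration (not part of the original development), a minimal PASCAL sketch of forming the lengths Pri of one i-bunch-vector from Cartesian coordinates (the Pythagorean length of each line from point i), sorted as required for the coarse match, is given below; the coordinates are assumptions.

program BunchVectorLengths;
const
  M = 5;
  i = 1;                       { the vertex whose bunch-vector is formed }
var
  px, py: array[1..M] of real; { pattern coordinates }
  P: array[1..M] of real;      { line lengths Pri, r <> i }
  cnt, r, a, b: integer;
  t: real;
begin
  px[1] := 0.0;  py[1] := 0.0;
  px[2] := 3.0;  py[2] := 4.0;
  px[3] := 6.0;  py[3] := 8.0;
  px[4] := 1.0;  py[4] := 1.0;
  px[5] := 5.0;  py[5] := 12.0;
  cnt := 0;
  for r := 1 to M do
    if r <> i then
    begin
      cnt := cnt + 1;
      P[cnt] := sqrt(sqr(px[r] - px[i]) + sqr(py[r] - py[i]))
    end;
  { simple exchange sort of the M-1 lengths into non-decreasing order }
  for a := 1 to cnt - 1 do
    for b := a + 1 to cnt do
      if P[b] < P[a] then
      begin
        t := P[a]; P[a] := P[b]; P[b] := t
      end;
  for a := 1 to cnt do
    writeln('length ', a, ' of the ', i, '-bunch-vector = ', P[a]:0:2)
end.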

Definition: A string encoding of M, with all the bunch-vectors, is X = C(i=1 to M) ((C(r=1 to M) Pri alphari)##).

Definition: A string encoding of N, with all the bunch-vectors, is Y = C(j=1 to N) ((C(s=1 to N) Psj betasj)##).

Definition: A string encoding of the pattern matching problem is (C(i=1 to M) ((C(r=1 to M) Pri alphari)##) Y ###)#.

Definition: A matching sequence of lengths is any subsequence of x = C(r=1 to M) Pri and y = C(s=1 to N) Psj such that if x and y are rewritten as C(k=l to l+T-1) Pmk dmk and C(k=l to l+T-1) Pnk dnk respectively, then (Az)(z = l to l+T-1)(Pmz = Pnz), i.e. the lengths Pmz and Pnz match; the dmk and dnk are arbitrary substrings of x and y.

Lemma: A matching sequence of lengths is a necessary condition for a match to exist.

Proof: If a match exists a T-polygon exists and hence a vertex i exists such that the i-bunch-vector has a matching sequence of lines which is the longest common matching subsequence of lengths.

Definition: The included angle associated with dmz is defined as alpha(t-1) + alpha(t) + --- + alpha(u-1) + alpha(u), where dmz is of the form alpha(t-1) Pt alpha(t) P(t+1) --- alpha(u-1) Pu alpha(u) in M.

Definition: The included angle associated with dnz is defined as beta(t-1) + beta(t) + --- + beta(v-1) + beta(v), where dnz is of the form beta(t-1) Pt beta(t) P(t+1) --- beta(v-1) Pv beta(v) in N.

Lemma: For three points i, k, p in M with i not= k not= p and three points j, l, q in N such that j not= l not= q, a three point match i-->j, k-->l, p-->q exists iff dmz = dnz, where dmz = alpha(ki(k+1)) P(k+1)i alpha((k+1)i(k+2)) --- P(p-1)i alpha((p-1)ip) and dnz = beta(lj(l+1)) P(l+1)j beta((l+1)j(l+2)) --- P(q-1)j beta((q-1)jq).

Proof: The included angles alpha(kip) and beta(ljq) must be equal.

Definition: A linear encoding of the pattern is defined as (C(i=1 to M)(C(r=1 to M) Pri alphari)#)# (C(j=1 to N)(C(s=1 to N) Psj betasj)#)##.

Definition: A prima-facie one-point match of a point i in M and a point j in N is defined as having at least T lines in the i-bunch-vector matching at least T lines in the j-bunch-vector, insofar as the lengths of the vectors are concerned.

Lemma: A coarse pattern matching can be done in time O((M**2 N**2 / T**2) log max(M,N)).

Proof: In the linear encoding of the pattern matching problem only points with a prima-facie one-point match need be considered. Thus only M/T + 1 and N/T + 1 points need be considered for M and N respectively, by the pigeonhole principle. To determine the prima-facie one-point matches with sorted M-lines and N-lines for each node requires O(M**2 log N + N**2 log M) time, i.e. O(K**2 log K) time, where K = max(M,N). To find the largest common subsequence of matching lengths of the bunch-vectors associated with points i and j (in M and N respectively) takes O(MN) time, by a straightforward algorithm. Thus the total time taken is O((M**2 N**2 / T**2) log K).

Lemma: The coarse matching can be done in O((K**3 / T**2) log K log K) time.

Proof: By using the more efficient algorithm for determining the largest common subsequence, the lemma follows.

Lemma: A fine pattern match takes O(M**2 N**2 log K) time in the worst case.

Proof: The previous coarse match yields sets of T points that match. For a refinement, the diagonals of the T-polygon associated with the T points have to be checked, and this takes O(T(T-1)/2), i.e. O(T**2), time for each set of T points.

Lemma: In the case of a match of fewer than T points existing, an O(M**2 log M) algorithm is possible.

Proof: By partitioning and redistribution one can avoid pattern matching, as all cases will contain holes. The time taken is only for sorting and comparison of lines. Here M >= N >= T is assumed.

Lemma: In the case where a match of >= T points occurs, an O(M**2 log M) time is possible.

Proof: By partitioning and redistribution, the only time taken is for sorting and comparison of line lengths and the final pattern matching. Here M >= N >= T is assumed.

Lemma: A practical algorithm for the pattern matching problem takes O(M log M) time with a LAN of M nodes.

Proof: A straightforward parallel processing speed-up of the algorithm, with the LAN considered as M parallel processors, yields the result. Here M >= N >= T is assumed.

Outline of the final Algorithm:

[Step 1] Form the i-bunch-vectors and j-bunch-vectors.

[Step 2] Sort the M-lines and N-lines, using a precomputed table to determine the Pythagorean length from the Cartesian coordinates.

[Step 3] Partition and redistribute either M or N and check for matching lines.

[Step 4] Have a matrix[1..M,1..N] to record one-point matches, which is set up in Step 3.

[Step 5] For each one-point match (i,j), with i in M and j in N, use the algorithm for determining the longest common subsequence to set count := the maximum number of lengths of the i-bunch-vector that match lengths in the j-bunch-vector.

[Step 6] If the count for every one-point match is < T, then the match fails.

[Step 7] Otherwise proceed to a fine match by considering all triangles in M and N with vertices i and j in the M-polygon and N-polygon, to see if a match with T points exists.
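As an illustration (not part of the original development), the following minimal PASCAL sketch corresponds to Steps 5 and 6 above for one candidate pair (i, j): the count of matching lengths between the two bunch-vectors is obtained as the length of their longest common subsequence and compared with the threshold T. The quantised length values and the value of T are assumptions.

program OnePointMatch;
const
  T = 3;
  MI = 5;     { lines in the i-bunch-vector }
  NJ = 6;     { lines in the j-bunch-vector }
var
  a: array[1..MI] of integer;
  b: array[1..NJ] of integer;
  c: array[0..MI, 0..NJ] of integer;
  r, s, count: integer;
begin
  { sorted, quantised lengths of the two bunch-vectors }
  a[1] := 9;  a[2] := 12; a[3] := 17; a[4] := 21; a[5] := 30;
  b[1] := 9;  b[2] := 11; b[3] := 12; b[4] := 17; b[5] := 25; b[6] := 30;
  for r := 0 to MI do c[r, 0] := 0;
  for s := 0 to NJ do c[0, s] := 0;
  for r := 1 to MI do
    for s := 1 to NJ do
      if a[r] = b[s] then
        c[r, s] := c[r - 1, s - 1] + 1
      else if c[r - 1, s] >= c[r, s - 1] then
        c[r, s] := c[r - 1, s]
      else
        c[r, s] := c[r, s - 1];
  count := c[MI, NJ];
  if count >= T then
    writeln('prima-facie one-point match (count = ', count, ')')
  else
    writeln('no one-point match (count = ', count, ')')
end.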

Lemma: If a match of T points exists, the time taken cannot be more than O(T**2 K log K).

Proof: At most T**2 one-point matches exist, and the time taken to determine the longest common subsequences is O(K log K). To determine a fine match one has to check T**2 triangles.

Lemma: If a match of T points exists, then the time taken cannot be more than O(T**2 log K) with a LAN of K nodes.

Proof: Obvious.

INTRODUCTION

In this chapter the use of HGA is extended, based on empirical experience, to the IS field, and its use with COBOL for IHLL descriptions of core computer science algorithms is outlined. The ninth application of HGA is in the non-traditional area of using COBOL (for quite some time the only language available to quite a few programmers in the environment) as a vehicle for Computer Science and Information Systems Engineering education, and thus attempts to show the applicability of HGA in the business software area. Though FORTRAN-IV and -V have been decried by the eminent Algolists, COBOL-74 has many features to aid structured programming, and step-wise refinement can be made visible when programming in COBOL-74. An attempted mapping, successful judging from the feedback obtained over a decade and a half, has been the use of the algorithms of Knuth D E 1969 [Knu,69] with COBOL-74 and HGA. The environment, as elsewhere in those days, allowed little memory; the COBOL-74 translator had poor optimisation and extensive overhead (both verbosity and time requirements), and HGA was essential to run the final programs in assembly language.

9.1 The advent of COBOL-74 made substantial applications in Information Systems possible. However, the high unreliability of the hardware/software combine required HGA, as extensive patching was required both in the source code and in the target code. (Reliability was achieved only by the mid-eighties.) There have been uses of HGA in the sense of conversion of COBOL packages to assembly language for efficiency. Today, however, the COBOL programmer views his COBOL machine as transparent (as elsewhere).

9.2 If one were to generalise Hansen's method of systems programming in the direction of having a source specification and an effectively equivalent target specification in some HLL, we have the case extended to using COBOL in the non-traditional application of instruction and use in basic Computer Science education: a method traditionally dictated by the requirements of the environment of the indigenous computer industry, and now applied to the DOEACC ALCSS, A, B and C level courses of the Government of India for the certification of computer courses. Traditional instruction in Data Structures, Algorithms, Programming Languages, Translator Writing, Operating Systems, Numerical Methods and HLL descriptions of automata behaviour has relied on assembly, ALGOL-like, PASCAL-like and C-like programming language specifications. The Information Systems field did not fall in this ambit and was essentially based on using COBOL as the vehicle. In the seventies and eighties in the indigenous industrial environment a need arose for instruction in computer science, but the only available HLL was COBOL. Considering the HLL descriptions as the source language, a hand translation was used to convert the same to COBOL-68, -74 or -85. The resulting COBOL specifications were 'optimised', i.e. converted into specifications that experienced COBOL programmers normally use in their programming.

9.3 Such a non-traditional specification can be likened to a logical generalisation of Hansen's method. Historically, Knuth's use of MIX for data structures was applied through HGA to FORTRAN in courses in Data Structures. If one considers MIX as the starting point of substantial and authoritative literature in the art of programming techniques, one can consider employing specifications in MIX as the initial step and hand-translating to COBOL, or vice-versa.

9.4 It was found that COBOL has many advantages as a programming language: it has many constructs for structured programming (but for the do-while construct in earlier versions), produces programs with extensive readability, poses challenging problems in the implementation of some features, and has a built-in TABLE data structure to simulate the main memory, thus enabling the non-traditional introduction of 'pointers' and 'dynamic storage allocation'. The main drawback in the use of COBOL has been its verbosity, but from the point of view of the fraction of time this verbosity occupies in the entire software life cycle, it is but a minor inconvenience.

9.5 The basic technique for instruction was to directly convert algorithms in MIX and the associated algorithmic specifications into COBOL and have them implemented.

9.6 This non-traditional use of COBOL had and has the advantage of fully exploiting the COBOL subroutine facilities and the verbs of the PROCEDURE DIVISION, and essentially satisfies the need of the indigenous industry in 'looking' at computer science through the Information Systems field.

9.7 The 'optimisation' of the target language relied on the empirical experience of the methodologies and preferences of advanced COBOL programming over the indigenous Information Systems life cycles.

9.8 For purposes of clarification it may be noted that COBOL has records, unions through REDEFINES, arrays through TABLES, sub-ranges through CONDITION NAMES, repeat, while (COBOL-85), for through PERFORM, if-then, if-then-else, assignment through COMPUTE and much more. The absence of pointers, however, requires a direct simulation of the heap, usually done through tables as in Knuth vol. I.
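As an illustration (not part of the original course material), a minimal sketch of such a simulation of 'pointers' and dynamic storage through tables, in the style of Knuth vol. I, is given below. It is written in PASCAL for brevity; the same parallel-table layout carries over directly to COBOL TABLES (OCCURS clauses). The sizes and values are assumptions.

program TableHeap;
const
  HeapSize = 10;
  Nil0 = 0;                      { 0 plays the role of the nil link }
var
  info: array[1..HeapSize] of integer;
  link: array[1..HeapSize] of integer;
  avail, head, p, k: integer;

function New0: integer;          { take a cell from the free list }
begin
  New0 := avail;
  if avail <> Nil0 then
    avail := link[avail]
end;

begin
  { build the free list }
  for k := 1 to HeapSize - 1 do
    link[k] := k + 1;
  link[HeapSize] := Nil0;
  avail := 1;
  { push 10, 20, 30 onto a simulated linked list }
  head := Nil0;
  for k := 1 to 3 do
  begin
    p := New0;
    info[p] := 10 * k;
    link[p] := head;
    head := p
  end;
  { traverse the simulated list }
  p := head;
  while p <> Nil0 do
  begin
    writeln(info[p]);
    p := link[p]
  end
end.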

9.9 The use in Numerical Methods was mainly for expository purposes, up to the level of Acton [Act,70], though most of the time a FORTRAN compiler was available in parallel.

9.10 The ALGOL-60 and PASCAL language processors were used mainly for streams in core computer science, but the wider base of requirement of COBOL programmers made this use of HGA indispensable.

9.11 In the Algorithms area the algorithms of Aho A V, Hopcroft J E & Ullman J D [Aho,74] were considered in COBOL, and the effective hand translation did not prove to be difficult. Thus, even if one starts with a specification in ADTs, one finds that COBOL can effectively be used as an algorithmic description and implementation language with the use of HGA.

9.12 A further generalisation of the use of COBOL has been the application of HGA to the classic treatment of operating systems of Shaw A C 1974 [Sha,74] in ALGOL-like descriptions. These were effectively hand-translated, by an application of HGA, to suitable COBOL environments. Thus it was concluded that the classical use of ALGOL, PASCAL and C was merely a historical accident rather than a necessity.

9.13 This method was also essential as the skills expected of a COBOL programmer in the environment were non-traditional, in the sense that most programming had to keep track of the various machine abstractions, 'the opaque machine', owing to the slow stabilisation of hardware, system software and application software within the environmental constraints of the Information Systems life cycles and the product life cycles. This environmental requirement demands skills both in assembly (machine) and COBOL programming at a non-trivial systems programming level.

9.14 The availability of more reliable software and hardware environments allows more and more 'pure' COBOL programming oriented to just the COBOL-machine. However, the environmental problem of non-identical productionised systems, and the effective unreliability of imported, dumped computer components not passing quality tests, still requires the COBOL programmer in the environment to be aware of the hard reality that programming in COBOL is not for the pure, abstract, reliable COBOL-85 machine.

9.15 As most of the straightforward features of COBOL have an almost direct mapping to PASCAL-like control and data structures, the use of formal verification techniques in COBOL is easily applicable. These were at most used intuitively, more as a cultural background than as an essential skill, in line with advanced COBOL programming practices rather than rigid Computer Science practices.

9.16 To summarise, we have an effective and proven use of HGA in a line of generalisation resulting from the use of COBOL for the non-traditional purpose of computer science education in the indigenous environment. A further line of generalisation, and of future validity of HGA, is in the massive front office bank computerisation currently underway in India, where the effective Analyst Programmer merely simulates the existing non-computer based Information Systems. Here the use of 4GLs like ORACLE has been considered to replace COBOL software. This can be considered an 'optimisation' in HGA, where a COBOL program segment is replaced effectively by a 'command' in ORACLE. Such an evaluation is required as the current COBOL solution is much cheaper than the ORACLE solution, and the number of branches, running into thousands, has a multiplicative effect on the overall savings. Thus we have a future line of generalisation of HGA in the direction of 4GLs.

INTRODUCTION

The tenth application of HGA is a summary of the attempts made to apply the technique to the development of communication software, real-time software and parallel processing applications.

10.1 The communication software applications have been oriented to radar development for military and civilian use (historically in assembly language with HGA, and currently involving up to 50,000 lines of 'C' code); the programming of SPC telephone exchanges (historically in assembly language with HGA and REMAL, and currently in C and C++); and applications in distributed architectures in SCADA (Supervisory Control and Data Acquisition) (historically in machine and assembly language with HGA, real-time FORTRAN and CORAL-66, and currently in 'C').

10.2 The parallel processing applications of HGA could have been used in the development of operating systems on mini-computers, though in practice assembly language was directly employed. The O/S implementations were by hard-core, experienced and competent assembly language programmers, to whom HGA did not appeal. The consolation is that HGA with RAMs and RASPs was employed to make them learn assembly language in the first place.

10.3 The tenth application of HGA is in part, in real-time application software and deals with three aspects of applications. One is the use of Simulator Software for Nuclear and Thermal Power Plants ( running plants as opposed to designing them). The second is the development of Human Machine Interfaces for Nuclear Plants. The third area considered is SCADA (Supervisory Control and Data Acquisition) applications in:-

a) Gas/Oil flow as distributed over hundreds of miles (distributed computer architecture) to and from refineries.

b) Distribution of energy in railway traction after generation.

c) Water distribution over hundreds of miles.

10.3.2 Traditional applications used HGA partly but current distributed architectures tolerate the use of 'C' as HLL directly.

10.4 The direct application of HGA in the tenth application has been in the development of compilers for CHILL (Communication Software purposes) and ADA and CORAL-66 (for real-time software). These are experimental implementations. A special language REMAL using HGA was designed and implemented under the direction of the author in the seventies for communication software.

10.5 A natural and crucial application of HGA with parallel processing concepts arises in the area of scientific software in the indigenous environment, with its curious economy and way of life. The human programming effort is cheap, the degree of materialism is low, a PC is not a personal computer but a shared facility, and a workstation is rare, owing to the relative levels of salary/income. However, the LAN as a centralised facility is common, and it can be considered a multiprocessor configuration. This allows the polyalgorithms that naturally and intrinsically arise in the area of Numerical Methods to be mapped onto the nodes of a LAN. The use of HGA allows the complexity/sophistication of the nodes to be less than if an HLL were directly employed, and this has a multiplicative effect in the savings in the cost of the LAN. The high cost of copyrighted imported software will, as the environment develops, lead to indigenisation of the software, and the use of HGA will lead to substantial savings. An elementary case in point has been the use of HGA in the polyalgorithms associated with the determination of the roots of polynomials, using FORTRAN as the HLL, mapped onto the nodes of the LAN. Another line of application is the practical use in industry of the algorithms resulting from the exploding studies in recent years of parallel programming speed-up of Numerical Methods algorithms. When the parallel algorithms are augmented by the use of HGA, the substantial savings in the cost of the LAN allow the practicality of commercial indigenous scientific software. By generalisation it is seen that HGA is applicable to all developing economies.

10.6 The main consideration in real-time systems is the requirement of time criticality in the response to an external event.

10.6.2 The historical uses have been in dataloggers in satellite launching, oceanography, cryogenics, numerical control, steel mills, radio propagation control in mountainous terrains, static testing of turbines, aeronautics, robotics, etc. These have had their software developed in assembly language, as no High Level Language (HLL) was available. However, attempts at initial simulation were done in an HLL, and thus HGA was used to some extent.

10.6.3 The Data Acquisition Systems (DAS) have involved major super-thermal power plants with 250 MW - 500 MW generation capacity, and nuclear reactors including a Fast Breeder Test Reactor (FBTR).

10.6.4 The SCADA (Supervisory Control And Data Acquisition) systems have been in applications involving water, oil & gas flow over thousands of km in distributed architectures; energy management systems (EMS) in power grids over thousands of miles using distributed computer architectures; and process control in steel, cement and fertilizers with a centralised facility.

10.6.5 The primary requirements of practical real-time systems are the total reliability of the software and hardware components of data acquisition and control. The software part has to be of proven reliability (if possible proven theoretically with RPC), in simulation models and finally in the field. An added crucial constraint is the time criticality of the response of various sections.

10.7 The time criticality is shown roughly by the following table which summarizes typical real-time applications considered from real-life experience in the environment over the last 25 years.

--------------------+----------------------------+---------------
Example Area        | Critical Component         | Time
of Application      | Description                | Criticality
--------------------+----------------------------+---------------
a. Information      | Payroll, Inventory         | Day(s)
   Systems          | Control                    |
b. Water, oil & gas | Leakage in the pipeline    | Hour(s)
   flow in pipelines|                            |
   - SCADA          |                            |
c. Process control  | Avalanche effect           | Minute(s)
   in cement, steel |                            |
   & fertilizer     |                            |
   plants           |                            |
d. Energy management| The grid isolation         | Second(s)
   systems e.g.     | problem                    |
   power generation |                            |
   & distribution   |                            |
e. Nuclear reactor  | Response to some           | Millisecond(s)
   applications     | conditions                 |
f. Satellite        | Fast scan response:        | Microsecond(s)
   launching &      | 288 critical inputs to be  |
   missiles         | scanned in 1 second out of |
                    | 900 overall inputs in      |
                    | 9 seconds.                 |
                    | Trajectory control.        |
g. Cryogenics       | Strengths of metals at     | Minute(s)
                    | low temperatures           |
h. Aeronautics      | On-line flight control     | Second(s)
i. Numerical control| Control of machines        | Second(s)
j. Turbine control  | Control of running         | Second(s)
                    | turbines (500 control      |
                    | loops)                     |
k. Oceanography     | Tide, wave statistics      | Minute(s)
l. Radio propagation| In mountainous terrains    | Second(s)
   control          |                            |
m. Railway traction | Restoration of break in    | Minute(s)
                    | traction power             |
n. Power plants     | Sequence of events,        | Second(s)
                    | 100-200, 5 msec resolution,|
                    | avalanche effects          |
o. Energy manage-   | Soaking pit automation     | Second(s)
   ment in steel    | project for energy savings |
   plant            |                            |
p. DAS for power    | Data acquisition fed to    | Second(s)
   plants           | screen, 1000 analog &      |
                    | 1000 digital inputs        |
q. Tankodrome       | Synchronization of         | Second(s)
                    | parallel processes         |
--------------------+----------------------------+---------------

10.8 The associated telecommunication software in distributed architectures has requirements given by the rough metric of up to 1 MB of code, but as it is embedded in the real-time system it has to be totally reliable. In the applications considered, the nature of the software can be gauged by the following table.

---------------------+-------------------------------------------
Communication Medium | Distances involved
---------------------+-------------------------------------------
a. Telephone cables  | 200 km, 300 baud, highly unreliable
                     | telephone lines in India, very high error
                     | correction required.
b. VHF, UHF          | Up to hundreds of km, and costly.
c. Satellite         | Leased channels on both indigenous &
   communication     | foreign satellites, for distances of
                     | thousand(s) of km.
d. Optical fiber     | Highly reliable, upcoming and valid in the
                     | future; long distances; speeds of up to
                     | 1 Mbit/sec easily possible.
---------------------+-------------------------------------------

10.9 The DAS in distributed architectures had its software core in an RPMC (Remote Plant Monitoring & Control) package, developed in OMSI PASCAL under the RSX-11M operating system on the PDP-11 platform. This being proven software, it was subjected to various alterations and transported onto VMS/VAX, Versados/UNIPOWER-30 and UNIX/PC platforms.

10.10 Hansen's Approach (HA) finds extensive use in the simulation of real-time systems. Here the real-time SCADA system is simulated in an HLL and then mapped onto a pair (software on a computational platform, hardware components consisting of, for example, relays or dedicated circuitry). In the case of critical components, a mapping to a triple is done (dedicated software on a dedicated microprocessor, hardware solutions like relays or circuitry, precomputation outside the critical period). An example of this is the use of df/dt relays in the grid isolation problem. Here df/dt has to be detected within one second and the grid isolated within one second; the time criticality is 2 seconds. An HGA solution mapping the problem to mechanical relays involves only the 20 msec mechanical delay of the relays. The relays that are to be tripped are determined by computing the load balancing, in a 5 second sampling period, and determining the load shedding that should take place. The pre-computation in the larger time slot of 5 seconds allows an HLL (real-time FORTRAN-77) to be used for the software of the critical period. The partial solution of the grid isolation problem starts with a simulation of the solution to check out the procedure, and maps it down to an interconnection of relays giving permissive contacts. To obtain the practical solution, repeated applications of HGA are used and checked against the timing constraints. A reverse use of HGA in the environment is in the upgradation of hardware turbine supervisory control logic to a PC based control solution in software.
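As an illustration (not part of the actual installation), a minimal PASCAL sketch of the flavour of the precomputation described above is given below: the load-shedding selection is precomputed in the leisurely 5 second slot, and the time-critical part merely detects df/dt and trips the already-selected relays. All feeder loads, thresholds and names used here are assumptions.

program GridIsolationSketch;
const
  NumFeeders = 5;
  DfDtLimit = -0.5;                    { Hz per second, assumed threshold }
var
  feederLoad: array[1..NumFeeders] of real;   { MW, in shedding priority order }
  shed: array[1..NumFeeders] of boolean;
  deficit, covered, f1, f2, dfdt, dt: real;
  k: integer;
begin
  feederLoad[1] := 20; feederLoad[2] := 35; feederLoad[3] := 15;
  feederLoad[4] := 50; feederLoad[5] := 25;
  { precomputation in the 5 second slot: pick feeders to shed }
  deficit := 60;                       { assumed generation shortfall, MW }
  covered := 0;
  for k := 1 to NumFeeders do
  begin
    shed[k] := covered < deficit;
    if shed[k] then
      covered := covered + feederLoad[k]
  end;
  { time-critical part: detect df/dt from two frequency samples }
  f1 := 50.00; f2 := 49.70; dt := 0.5;
  dfdt := (f2 - f1) / dt;
  if dfdt < DfDtLimit then
    for k := 1 to NumFeeders do
      if shed[k] then
        writeln('trip relay of feeder ', k)
end.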

10.11 Any real-time system can in principle be described by a program in an HLL. Thus a hypothetical view can be taken that a real-time system, including the environment it monitors and possibly controls, is only a computation described by the program (IHLL). The IHLL program is to be mapped into the real-time system. First it is simulated with various versions of the IHLL, and repeated applications of HGA take place to realize the real-time system software. The criticality of the time requirements requires extensive mapping of algorithms to hardware. In a typical Operator Information System (OIS) there is a critical requirement of about 2 seconds response time from the field to the operator's terminal for the roughly 2000 points considered. To effectively use some graphics for the operator, the basic algorithms for line, circle, triangle, polygon and shading of the same have to be realized by hardware. At times firmware is employed to stay within the crucial timing constraints. The mapping of certain algorithms to relays or interconnections of relays is common. At times an isolation of critical parts of the computation, and a mapping of the same to a dedicated micro-computer, is common. Thus the problem is thought out in an HLL, simulated, and then by zero or more applications of HGA mapped onto a combination of software, firmware and hardware realizations. The essential simulation that the process involves, and the repeated applications of HGA, allow the use of software verification techniques to 'think out' the entire real-time system. The final realisation is done by a human agent. It is experienced that each real-time system requires its own unique application specific mappings, and thus precludes the use of any automated general solution.

10.12 Isolation of sub-skills and sub-tasks in Operator Information Systems(OIS) in Supervisory Control & Data Acquisition(SCADA) Applications.

To illustrate the fundamental considerations involved in the real time software, a simplified view of certain aspects of OIS is taken.

10.12.1 In monitoring & supervising a real-time process, one needs a 'mimic' of the process. The process mimic can be either a hard mimic or a soft mimic. In a hard mimic, one has a very large mosaic panel, filling up a wall in the control room of the process, to aid the operator. However, a hard mimic is more for visitors than for actual use by the operator. (An alternative hard mimic is to use TV projection systems to fill up the wall with the process mimic.) The operator prefers soft mimics. Soft mimics are displayed on operator workstations, which are up to four in number. Additionally, two or more workstations may exist remotely for use by supervising senior management. An operator workstation is normally a PC 386/486, with floppy drives, a hard disk and a large (19" and above) color monitor. The advantages of soft mimics are that they are easy to modify and change by software, can display critical sections of the process mimic in detail with high resolution, and are amenable to carrying much more detailed information (especially for analog values) than is possible in a hard mimic. In a typical process having about 10000 parameters monitored, the process mimic is partitioned into about 50 sections.

10.12.2 In a typical process the 8000 - 10000 parameters to be monitored arise from 'points'. A point is a physical parameter arising from a physical component being monitored. The stimulus given by the point may be in the form of an analog, digital or pulse variation of the parameter value. The inputs are subjected to a real-time computation and a response is given within a useful & relevant time frame. The response is in the form of analog, digital or pulse outputs corresponding to the inputs received. The responses are translated into physical actions like closing a valve, changing the speed of a motor, etc. The existence of a response gives rise to a 'control loop'.

10.12.3 The typical analog input arises from the speed of rotation of a motor, a typical digital input arises from relays, and a typical pulse input arises from metering (e.g. coal flow on belts, oil flow in pipelines, energy consumption). An analog input, for example temperature, requires to be digitized. The physical parameter variation is converted by a transducer to an electrical current variation as a first step. The current is then digitized to a number, called the raw data, by an analog to digital converter (ADC). A digital input/output is basically binary in form. A pulse I/O requires a counter.

10.12.4 In the digitisation, the range of variation of the analog parameter may be modeled either linearly, piece-wise linearly or in a non-linear fashion, e.g. logarithmically.

10.12.5 The software process mimics are a mixture of graphics, text and enunciators (e.g. sound enunciators like alarms & hooters). Depending on the environment, the process mimics are specified on paper. This initial paper specification could be Single Line Diagrams (SLD) for power plants and power distribution systems, pipeline mimics for the flow of oil/water in pipelines, etc. These are stated in a precise language arising from the standards and conventions of the environment specialism. A typical simplified SLD of a power substation is shown below.

[Diagram omitted: a highly simplified typical substation SLD, showing a 132 KV feeder with circuit breakers and isolators, a transformer stepping 132 KV down to 33 KV / 11 KV, and the high tension and low tension sides.]

10.12.6 It may be noticed that in the SLD the physical components (points) are shown by icons, and these correspond to icons in the workstation graphics. Dynamic variation of a digital parameter may correspond to an event or alarm. It is mapped onto the static diagram by specifying a variation in colour, shape or shading of an icon, or an audio alarm enunciator. The soft mimics must be able to specify the static and dynamic aspects of the SLD specifications, and thus logically have a static and a dynamic component.

10.12.7 The static components of the soft mimics are normally stored on the operator workstation's local disk and have a relatively leisurely loading time of about 30 seconds. The dynamic components of the soft mimics are sent to the operator's workstation from the host computer and have a typical time criticality of 2-5 seconds. Considering the size of the process mimic, which reflects 10,000 parameters/points/icons, the soft mimics consist of about 50 sections of the process mimic having about 30-50 points per section.

10.12.8 The process of creating software mimics consists of two steps. The first step is the database creation of the points, and the subsequent second step is the picture generation.

10.12.9 In the first step the varying field input/output points are defined. The definition maps the point onto a record in PASCAL or a structure in C. The fields of the record contain the programmer-defined name and the various attributes of the point defined. These attributes could be the type (a classification of the point by the actual physical component type), the number (a unique identification number of the point), the low, high, very low, very high, low return and high return limits (given as digitized raw data numbers, i.e. as X-coordinates), and the engineering units along with the scale and offset (which define the Y-coordinate converted to engineering units as given by y = mx + c, with m as the scale, c as the offset and x as the raw data; the line is assumed to model a linear conversion in the digitisation). The low return & high return values, along with the low and high, in effect define an allowed band in the variation of the parameter. The very low and very high can be considered to give the degree of urgency with which an event should be attended to.
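As an illustration (not part of the actual package), a minimal PASCAL sketch of such a point definition record, together with the linear raw-data to engineering-units conversion y = mx + c, is given below (in C this would be a struct); all field names and values used here are assumptions.

program PointDefinition;
type
  PointRec = record
    name: string[20];
    ptype: integer;          { classification by physical component type }
    number: integer;         { unique identification number }
    low, high: real;         { allowed band, engineering units }
    veryLow, veryHigh: real; { urgency limits }
    lowReturn, highReturn: real;
    scale, offset: real      { m and c of y = m*x + c }
  end;

var
  p: PointRec;

function ToEngineering(var pt: PointRec; raw: integer): real;
begin
  { linear conversion of the ADC raw count to engineering units }
  ToEngineering := pt.scale * raw + pt.offset
end;

begin
  p.name := 'BOILER-TEMP-01';
  p.ptype := 3;
  p.number := 1042;
  p.low := 100.0;    p.high := 540.0;
  p.veryLow := 80.0; p.veryHigh := 560.0;
  p.lowReturn := 110.0; p.highReturn := 530.0;
  p.scale := 0.25;         { engineering units per raw count }
  p.offset := 0.0;
  writeln('raw 2000 -> ', ToEngineering(p, 2000):0:1, ' units')
end.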

10.12.10 The definition of points in the environment of the database creation of the SCADA software is done in the static picture mode of the static picture session.

10.12.11 The second step in the creation of the software mimics is picture generation. The pictures ( like SLDs ) are normally given on the paper by the user in the environment Specialism, with the icons shown and interconnected. Graphic tools are used to generate the pictures in the SCADA environment, with each icon mapping on to a point. The Metamorphoses Algorithm (MA) is thus elementary.

10.12.12 Thus in principle it is possible to design a systems program to implement the MA. In practice, human operators are used for the creation of software mimics.

10.12.13 Following the static picture session, the SCADA environment is modified to the dynamic picture mode and a dynamic picture session takes place.

10.12.14 In a dynamic picture session, the dynamic picture points are defined and the associated visual variation of icon shape, size, colour, shading or alarm enunciators specified. It can be viewed that a dynamic software mimic is superimposed on a static software mimic to reflect the variation of the state of the process. In dynamic pictures, alarms may be either informational (no action to be taken) or critical/urgent (requires operator's intervention).

10.12.15 To create the software mimics, the environment of a SCADA package is considered. A SCADA package (like the RPMC of Ferranti in OMSI PASCAL, S/3 of Texas Instruments (TI) in C or BECOS-30 of Brown Boveri Corporation in CORAL & FORTRAN) has the following components.

a. Telemetry Data Acquisition

b. Telecontrol

c. Data Processing (alarm recognition, reports, implementation associated actions, logging).

d. Picture display function

e. Report generation

f. Interface to application software

10.12.16 The SCADA package gives the environment of a HLL, which is sometimes called a Process Control Language (PCL). PCL has no standards and is normally proprietary.

10.12.17 However, complications to the MA arise from the presence of trend variables and the requirement of animated graphics displays. The trends of various critical or higher priority parameters are to be recorded, either as current trends or as historical trends, and displayed using graphical picturisation. As it would be cumbersome to monitor the trends of all the 8000 parameters, typically up to 400 trend variables are nominated, in a classification of four groups depending on various time frames. The animation of the software mimics could be elementary animation, like picturising the flow of materials such as coal, the filling/emptying of a tank, or the flow of oil/gas/coal/energy through the process. At a more sophisticated level there could be a realistic mapping from the animation to the actual process.

10.12.18 The third phase in the creation of software mimics is report generation. Reports on the status of the process are given per shift, daily, weekly, monthly or yearly.

10.12.19 The highly demanding response times of the dynamic component of the software mimics require that bare software techniques & methods be used and overhead avoided. Thus application specific requirements may dictate the communication protocols to be used, rather than employing a standard protocol like TCP/IP. The nature of the HLL to be used again depends on satisfying the time constraints. The tradeoff between implementations in hardware & software is again dictated by the time constraints.

10.12.20 To simulate, test, study & develop the software mimics an HLL can be used, but the final resolution by a human agent into a tradeoff between software & hardware means that HGA is employed as an intrinsic requirement. The initial development can use formal verification techniques to great advantage.

10.12.21 As an OIS for nuclear power plants is radically different from one for a thermal power plant, a general purpose translator for SLDs cannot be written once & for all. However, certain parts of the OIS software, like the graphics modules, can be reused. This gives rise to the use of OOP with HGA for these modules.

10.12.22 To summarize, it is seen that the following levels of sophistication of the use of HGA emerge.

a. Routine static picture creation which can be done by a BSc(Computer Science).

b. The dynamic components of pictures which require some programming in PCL and process knowledge, MSc/B.Tech(Computer Science).

c. The animation of the software mimics which require graphics specialism (MSc / B.Tech. (Computer Science)).

d. The trend variables which require software, process knowledge & management skills to allocate the trend variables (MSc / B.Tech. with MBA).

e. The implementation of the picture creation tools which require highly specialized skills in real-time and communication software ( a virtuoso real-time & communication software programmer).

10.12.23 An extension & generalization of Hoare's clause (HC) is to an application in a Full Scope Power Plant Training Simulator. Here the process mimics of a power plant in Hong Kong are to be generated as an export project. The database definition relates to about 10000 points and the picture generation to about 300 pictures of 30-50 points each. The report generation consists of test reports per shift, daily, weekly, monthly & yearly. Since the SLDs specified by the user are in a precise and standard format, the colour codes follow the relevant industry standards. In principle, the creation of the process mimics is an application of HC in HGA. Also, by using an extension of the fourth and fifth applications, it is possible to read the SLDs and automatically generate the process mimics, through the development of a dedicated systems program with a reading attachment. In practice, however, programmers are used as data entry operators to create the process mimics, and thus HGA is used.

10.12.24 The above considers SCADA & OIS in the indigenous environment. A state of the art SCADA, as reported by ABB, Germany, for a coal-fired plant generalizes to the use of integrated multimedia, with an integration of the operator into an integrated quintuple (operator, controls, communications, computers with integrated multimedia, the process). The report (IEEE Spectrum, June 1994) uses virtual reality, integrated multimedia, visualization and telecommunications on workstation platforms. However, for nuclear power plants the traditional simplicity is essentially what is recommended.

INTRODUCTION

In the Information Systems (IS) field one has the basic model: the Industry Structure Model (ISM) of the British Computer Society (BCS). As this model has been accepted by the European Commission, by about 40 countries around the world, and even by the DPMA of the USA, a study is undertaken to see the validity of the model in the environment considered here.

A study & survey of Information Systems life cycles, and an audit of the same, covering about 1000 systems ranging from minicomputers to mainframes, was initiated in 1987 in the indigenous environment. A reverse mapping of the said audit to the ISM both yields a rationalisation of the software environment and points out the major bottlenecks and discrepancies in IS life cycles in the indigenous environment.

11.2 It is seen that basically what was involved in IS was Level 3 or Analyst Programmer work. Level 4 work at the Systems Analyst level was much less. The basic bottleneck to IS was the scarcity of the IS Resource. There are major cultural, traditional and ethical attitudes and practices which make the IS Resource difficult to garner and gather. Non-taxed income at all levels leads to a parallel economy which is not recorded. Even the introduction of a Payroll was difficult in quite a few environments.

11.3 Even applications involving available IS resources, like the database on fingerprints or Provident Fund (PF) backlogs, would take decades to get into some computer readable form.

11.4 The absence of any scientific Management Practices (MP) makes the introduction of Decision Support Systems (DSS) difficult. The MP are based more on intuitive, traditional, regional and feudal practices than on scientific, rationalistic principles. The trend is towards acquiring the knowhow, practices and skills in scientific management, as required by Western industry, but successful cases are rare. Ethical, regional and economic considerations are bottlenecks.

11.5 Thus experience above Level 4 is rare. The problem is at first sight confusing, as career development above Level 4 has not been in conformance with ISM requirements of the technical, business and managerial skills required both in the core and non-core areas. The policy of even a major software house like Tata Consultancy Services (TCS), mainly engaged in the export of cheap Level 3 labor, has been to discourage any career development at Level 4 and above in-house, as a consistent policy over the last 25 years. This deliberate policy has led to a shortage of Software Project Management personnel at Level 4 and above in India. The trends in the Computer Maintenance Corporation (CMC) over the last decade have been brighter in developing personnel with skills at Level 4 and above in the indigenous environment.

11.6 In the IS area we have the basic phenomenon of local simplicity and global complexity. The global complexity is more a matter of effective management, which is not too effective at present in the environment owing to the absence of sufficient experience.

11.7 In the Insurance Environment the applications were merely to replace the existing Unit Record equipment by the indigenous computers. There was resistance to any further automation efforts owing both to the fear of unemployment and the more fundamental fear of parting with the Information Resource.

11.8 In the Banking Environment the applications were restricted to front office computerization efforts. Poor communications infrastructure makes interconnected networks of branches a concept of the future. Even the introduction of Payroll and Daily Wage accounting for casual labor proved to be herculean management tasks, as they basically harnessed the Information Resource. Though the 1000 systems considered were in the Environments of Transportation, Shipping, Railways, Healthcare, Dairy development, Steel, Cement, Fertilisers and so on (in practically all sectors of the industry), the use was restricted to Level 3 or at most Level 4 work so far as the Technical Specialisms requirements were concerned, the applications being mainly in Payroll, Accounting and Inventory Control. Decision Support Systems, where they existed, were mainly restricted to MIS reports without much integrity of the data. Operations Research applications were mainly non-existent.

11.9 In the Information Systems area the need is still to get into the automation of the mundane chores of library administration, rather than the use of any databases in Information Science. Exceptions exist in the form of a few primitive databases.

11.10 The integrity of the few national databases that exist is doubtful, as economic, social, historical, traditional and administrative factors make the creation and establishment of databases a one-time affair, with poor maintenance with respect to upgradation of the Information Resource. However, an awareness of the effective use of databases in commercial environments is seen to be the trend, especially in the consumer industry.

11.11 In the Education sector the shortage of faculty in the fields of Computer Science and Engineering and Information Systems Engineering and Management made the fields nascent and not too demanding on IS tools. Exceptions are present in a handful of institutions, but the computer in these environments was restricted to a department or two.

11.12 In the vast Government Environment a resistance to automation arose from the unemployment consequences envisaged and feared, the reluctance to part with and harness the Information Resource, the time frames needed to create files and databases from existing data, a general fear of the loss of freedom arising from automation, the traditional high cost of computation, the time frames taken by bureaucratic clearances, and the gaps between promised or expected performances and delivered realities. The easy availability of PCs and PC-LANs in the present context points to a major trend of change in the above, though it is doubtful if the historical cultural reluctance to harness the Information Resource can be overcome.

11.13 Though it is estimated that at present about half a million PCs exist in the environment, LANs are easily available, and the personnel required up to Level 4 are easy to come by, the Information Resource is seen to be the main bottleneck for the development of sophisticated IS for quite some time to come.

11.14 On the software side the applications merely required a two-level hierarchy of software management structure. This allowed the individualistic approach to succeed despite the unreliability of the hardware/software combine. The trend would still be towards individualistic disciplined programming, though the standard Software Engineering process considerations are gaining ground, in the sense of an improvement in the awareness of the existence of the same.

11.15 The various Software Technology Parks seem to be more of an offshore closed extension of various companies of the West, and their lasting effect on the indigenous environment would only be to generate an awareness, not a transfer of process methodologies knowhow, as they are merely employers of programming labor. The shift to sophistication will only acquire knowhow from an application of the principles enunciated by the academicians of the West, since that is the only source of free and authentic information. Attempts at collaboration with Western industry have always seen the commercial angle predominate when it came to acquiring knowhow, both on the hardware and software fronts.

11.16 In the Technical Specialisms, however, higher levels of the IS have been achieved in some instances. The Software Safety Specialism is nascent and the Knowledge Engineering Specialism is found to be academic in the environment. Substantially higher levels have been achieved in Software Engineering, Graphics etc. The area of Audit in Software is rather new.

11.17 However, if the ISM is applied to study the career growth of personnel in the IS area, a very clear relevance and applicability of the model emerges. The gross career growth over the last 25 years in the environment has shown the dichotomy of the core and non-core areas to be grossly applicable and valid. However, the Marketing Specialism is seen to dominate the Sales Specialism and to be treated as a core area. The Education & Training Specialism is as per the model. The entire IS Service Delivery Specialism is mirrored as per the ISM. The necessity to indigenously manufacture the hardware and software has given a greater importance to the Hardware and Research Specialisms. The growth rate in the Technical Specialisms is as per the model, with the career trajectories reported in IS Management being as in the West and forming a basis for the ISM. There has been a paucity of IS Audit, and the adverse effects of the same are being felt and have created awareness and responsibilities as per the ISM. The IS Sales Support growth exactly mirrors the ISM. The IS Procurement Specialism shows a much higher growth rate owing to the regulations and practices needed to import the components of the hardware portion of the IS Asset. The IS Policy and Management area has been vague, though the levels of responsibility are more or less mirrored by the ISM.

11.18 Thus the above empirical results show that, barring a few isolated instances, the ISM model is by reverse mapping applicable to the indigenous IS industry. The deviations are minor in nature, owing to environmental considerations. This is despite the fact that we are considering an environment where the Cartesian rationalistic philosophy is not generally accepted, the IS Resource is traditionally scarce owing perhaps to the historical psyche of the environment over generations, the environment considers the dichotomy between truth and fiction vague when it comes to short-term or long-term past information, and the Law of Contradiction is accepted as a valid truth in everyday life.

11.19 Note 1: The Audit had access to and collected a large store of information on which the above comments are based. However, most of the concrete facts and information are sensitive, classified and pertain to clients, and thus are not elaborated upon here.

11.20 Note 2: Two cases which show the difficulty, owing to the mindset, in collecting information can be seen from the survey of engineering education and the survey of the computer industry over the last two years by the IEEE Spectrum. Neither mirrors the entire scene considered, and the difficulty of incompleteness can be traced to the cultural and social mores and attitudes towards the Information Resource in the indigenous environment.

11.21 In any software development project one has to consider the life cycle of the software. As it would be proper to consider a general framework, the Industry Structure Model (ISM) of the British Computer Society (BCS) is chosen. This is a Performance Standard for Information Systems Engineering and Management. This model, though operative only from 1986 (Version I) and 1990 (Version II), is considered a general enough framework from which one could extrapolate backwards into the three distinct environments considered in this study. The first environment is the self-reliance period, the second is the collaborative period and the third (current) is the open period being built around PCs and PC-LANs.

11.22 Though the software reported here is more oriented to System Software and Algorithm Design and Analysis, the ISM framework becomes an important frame of reference, and uniformity of view is gained, especially when one considers the fact that one of the primary goals of the environment was the development of Information Systems and their nurture, care and maintenance through their entire life cycles.

11.23 The idealistic self-reliance period is briefly described in the report on the TDC-ALGOL-60 implementation in Chapter 14.

11.24 The concern now is the mapping of the ISM model into the collaborative and open periods.

11.25 The primary consideration vis-a-vis the self-reliance period is that reliable hardware could be taken for granted, and failures of hardware could be attended to by a well organized Service Level Delivery network which is effective in operation, though a large amount of ad hoc techniques and methods are used. Efficiency of production has been achieved in that two different pieces of the same system would be identical as far as software is concerned. The technology available is limited: though in principle, as one moves from the collaborative to the open period, everything is available from the point of view of hardware, financial considerations restrict the availability of technology. Thus the number of systems in the mainframe and super-mini classes is highly limited. Though there is a proliferation of PC and PC-LAN systems, it should be noted that the Personal Computer is still not "personal". Personal ownership of a PC or PC-LAN is still not a very common phenomenon in the environment. Even large organizations cannot afford to freely stock, distribute or make PCs available. The workstation is still a rarity, even for the software professional, even in leading organizations. The availability is through centralized PC facilities. The LAN, though accessible, is still not freely available.

11.26 A few LANs may be found and time on the same is much sought after. However, the reliability has done away with much of the earlier round the clock working with the computer.

11.27 On the system software side the high degree of reliability of Compilers, Operating Systems and Utilities is taken for granted. Bugs are of course discovered but the standard and accepted practice, as anywhere else, is to bypass the bugs and live with the problem.

11.28 The easy availability of system software at a next-to-nothing price is a reality in the environment, especially when it comes to PCs and PC-LANs; software without adequate documentation is more or less a reality. The documentation is either little or non-existent, except for standard PC and PC-LAN software for which the literature is enormous and easily available. The use of GUIs and Windows software is still nascent and has still to develop into a need for more and more uses of the Graphics Specialism.

11.29 One of the primary problems of the environment is the poor Communications infrastructure, which is to be bypassed by cable TV networks or WANs. Hence the degree and need of the Communications Specialism is slowly on the rise.

11.30 The Information Resource is a major bottleneck, and if one considers the Knowledge Engineering Specialism to come after a suitable degree of Database Specialism has been created and achieved, it is perhaps one area where little progress has been made at present. An Environment-based Knowledge Engineering Specialism is a great need.

11.31 The absence of a broad-based education of IS professionals leads to a great vacuum when it comes to the Environment Specialism of the ISM. It is only now that different segments dealing with the Information Resource are maturing in different areas of the economy, and an explosive growth in the Environment Specialism is anticipated. The combination of the Environment Specialism and the Knowledge Engineering Specialism, together with 4GLs and LANs, is expected to be the direction to be considered; in Engineering applications the LAN, considered as a parallel processor, will lead to explosive growth of IS.

11.32 There is sufficient know-how in both software life cycles and IS life cycles, but the knowledge is empirical rather than based on Software Engineering or IS techniques, methods and principles. The Science and Methodology underlying the same has to be adapted to the Environment. The need is to bridge the gap between the Western studies available and the empiricism of the Environment. The HGA applications may be viewed as an attempt in this direction.

11.33 Sophisticated Man Machine Interfaces are a great rarity and only the traditional MMIs are used. An explosive growth in this area, in the use of Multi-lingual Interfaces with cosmetic software, is in the offing. Voice MMIs are one area where much promise is expected in the Environment.

11.34 The low degree of IS so far has bypassed the requirements for IS Audit and Quality Audit. The effects of this in an Environment like the Banking Industry are still to be felt. The trend is now towards IS Security demands, and the simplicity and generality considerations of HGA are in this direction.

11.35 The Environment provides a substantial number of persons at a high level in the ISM insofar as the technical versatility and communication skills are concerned.

11.36 A primary requirement of the high levels of the ISM is the Managerial skills required. It is not known at present on what principles one should map Management Science and Theory as it exists in the West onto the Management levels and practices that exist. The low level of sophistication in the IS life cycles seen so far means that only a few are available with experience in IS Development Management, and only a few instances of IS management exist, e.g. the railway reservation project in India.

11.37 The dichotomy between idealistic and pragmatic Vision and Mission, and its translation into pragmatic rather than idealistic strategy, policies and tactics, does not exist in the environment, as the Indian IS industry does not have the necessary experience. The use of DSS being at a minimum, the use of MIS still very little, and the parallel illicit financial economy leading to an ethical and moral economy that makes the IS Resource a rarity, much more empirical experience has to be gained, collected and analyzed before the relevant principles of System Theory, Decision Theory, Communication Theory, and Management Theory and Science applicable to software Development Management Strategies emerge.

11.38 Level migration is at present based primarily on seniority, rather than as the ISM prescribes. The applicability of standard principles not having been found or demonstrated, the common-sense approach to IS Management at all levels becomes ad hoc and at times can degenerate to feudal practices, especially at higher levels of Management practice, as here only traditional trading practices and principles are the empirical and historical experience available to base the practices on. These practices and principles mostly pre-date the Industrial Era and relate to ethics, morals and value systems based more on Sociological principles than on sound business practices; where they do tend to business or financial practices they tend to rely on traditional experience in pre-Industrial era trades and business practices.

11.39 The mindset which affects the software life cycles, the management practices and the value systems involved results in an IS life cycle which is at times not based on rationalistic philosophy and principles. The non-acceptance of the Cartesian Philosophical dogma, and the mere co-existence with rather than acceptance of the applicability of Scientific Methods and Practices, is felt in the Managerial Practices, which naturally results in a great deal of need for Professional Management Practices.

11.40 The collaborative period has not resulted in a transfer of knowhow either in the technical area, in the form of techniques, methods or tools, or in the managerial area, in the form of professional business practices, and has even led at times to dubious practices and results on both sides, the West and the East; a chapter like many in history which need not be analyzed but merely ignored, as it is the aim and purpose of this study that only positive methods and results will have a lasting impact.

11.41 In the technical specialisms a very high level of the ISM model is evident as having been achieved. The technical specialisms are centered more around individuals, their individual skills and knowledge and their effective operation in the environment, with the technical stream/sub-stream as a major foundation. Especially when we consider the yardstick of a person with a technical specialism migrating to the West and his/her performance there in IS life cycles, we see that all levels of the ISM model are found in the technical specialism sub-stream.

11.42 An aspect that should be considered very important is the multitudinous number of persons operating at a very high level in the IS field from various Environments, who can perhaps best be classified as attempting the mature entry route into the ISM at high levels, mostly at the Hybrid Management levels, but not equipped with the required practical, technical or empirical in-depth knowledge, let alone in-depth or versatile knowledge at the lower levels. It would perhaps be proper to classify them at a Vision Level which is outside the ISM. The industry still has to develop to a point where the vision level translates to IS core and non-core streams. It would perhaps be wise to include in the ISM a Companion Level to accommodate persons of the mature entry route level who do not enter the ISM framework but who are sufficiently involved in IS policy.

11.43 A requirement of the ISM is that a particular job should not have IS functions across levels. The present situation is that, as in the case of the author, one is expected to operate and perform effectively over a number of levels and sub-streams, owing to the paucity of personnel and of the effective distribution of functions as required by the ISM.

11.44 The Vision and Mission of IS have changed drastically over the three periods considered, and one finds that the effective translation into relevant policies, strategies and plans to achieve the same has been affected. Furthermore the psyche, mindset and value system of the environment have affected the strategies and policies. The primary dichotomy between idealism and practicality being neither evident nor present, one finds effective managerial practices in strategy, tactics and operation ad hoc, arbitrary and mostly dubious. Perhaps as experience is gained over a substantial period of time, IS life cycles in the Environment will stabilize to yield effective and relevant policies, techniques, methods, practices and tools just as in the ISM, but with all shades modified to suit the environment. However, as an IS is still to be judged on performance also, one finds that, within the requirements and judgment of the Environments, such IS have been successful in the areas of MIS, Payroll, Police, Insurance, Banks etc., where the adherence is as per a reverse mapping that fits the ISM.

11.45 In the above, only the basic considerations relating the ISM to the Environment have been considered; a detailed mapping, which exists, is beyond the scope of the study considered here.

INTRODUCTION

In the software initiative currently underway to tackle the software crisis in real-time & communication software, one sees a replay of the historical Wagnerian epic as in the West and a debate as to whether software is an art, science, engineering, discipline or just plain management.

Four approaches to software development in the real-time and communications areas seem to be of relevance.

A) INDIVIDUALISTIC DISCIPLINED APPROACH:

The software development in projects with a successful track record has taken the view that software is a disciplined art with relevant, rigid structured disciplines. Thus the environment has taken to, adapted and used the individualistic approach as advocated by the Hoare/Hansen Generalised Approach (HGA).

This has yielded practical results in critical applications in real-time and communication software, with a logical mapping to at most two levels in the hierarchy, and to date no better method seems to be yielding the same quality of results. It seems to be a must in critical nuclear applications software.

B) SOFTWARE PROJECT MANAGEMENT WITH MULTIPLE LEVELS IN THE HIERARCHY.

The growth in the complexity and size of software requires a departure from the above, as in the Full Scope Power Plant Training Simulator Project. Software Management practices have to be introduced, with an isolation of the knowledge, skills and techniques required. A multidisciplinary team of more than two levels, with different skills, is required. This is a major departure from the two-level hierarchy of successful real-time software project experience. The success of Digitex of Sweden, with more than a million lines of ADA code, shows the possibility of real-time software as a Software Management Practice. Such knowhow is however proprietary; the skills and practices have to be rediscovered and generated anew in the environment. This requires a planned, deliberate, systematic effort to isolate and separate software modules and to map the same onto different programmers with different specializations, skills and competence at different levels of the hierarchy.

C) THE PROCESS METHOD OF SOFTWARE DEVELOPMENT.

The need to use less-than-virtuoso programmers, a lenience in the competence & skills level, to allow for some relaxation in the rigid disciplined approaches, and to cater to Technical Specialists and Specialisms, paves the way for the process methods known as Software Engineering Practices. This involves the well-known requirements of (structured) reviews, walkthroughs, (Quality) Audits, Quality Assurance (QA), third party audits and the introduction of standards. These standards can be either those evolved by the institution with ISO-9000 certification, or the higher Software Engineering Institute (SEI) levels, or the still higher Malcolm Baldrige National standards of the USA, or the asymptotically demanding Japanese standards. The applicability of the process method can only be to less demanding, non-critical sections of the real-time & communication software, a case in point being the picture generation portion of the software mimics of SCADA projects. The introduction of a hierarchy with more than two levels demands knowledge & skills in group dynamics and various management skills & practices to be followed.

D) FORMAL METHODS AND TOOLS

Recently, in real-time software for critical applications, the use of formal verification is gaining ground. Here, with HGA, one can be certain of the software correctness, though the complexity of formal verification must be tackled by demanding high simplicity in the real-time software algorithms employed.

The tools for the formal approach are realistic CASE tools, which could possibly use state transition diagrams to model the real-time environment, and the use of relevant computer-aided Formal Verification techniques. In highly critical nuclear real-time software, to enable formal verification, the IAEA (International Atomic Energy Agency) dictates that operating systems, recursion and complexity in algorithms be totally avoided. The KISS (Keep It Simple Stupid) philosophy has been gaining ground, is of especial relevance to real-time software, and has been broadly the philosophy of the environment.
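As an illustration of the simplicity demanded, the following sketch (in C, purely hypothetical and not taken from any of the projects reported here) shows a controller written as a flat state-transition table: there is no recursion, no dynamic memory and no operating-system dependence, so the entire behaviour is open to inspection and to formal argument.

    /* Illustrative sketch only: a hypothetical two-state pump controller
       written as a flat state-transition table, with no recursion, no
       dynamic memory and no operating-system services. */
    #include <stdio.h>

    typedef enum { IDLE, RUNNING, NUM_STATES } state_t;
    typedef enum { EV_START, EV_STOP, NUM_EVENTS } event_t;

    /* next_state[current state][event]: the whole behaviour is visible
       in one finite table, which eases both review and formal argument. */
    static const state_t next_state[NUM_STATES][NUM_EVENTS] = {
        /* IDLE    */ { RUNNING, IDLE },
        /* RUNNING */ { RUNNING, IDLE }
    };

    int main(void)
    {
        state_t s = IDLE;
        event_t demo[] = { EV_START, EV_STOP };  /* stands in for sensor input */
        for (unsigned i = 0; i < sizeof demo / sizeof demo[0]; i++) {
            s = next_state[s][demo[i]];
            printf("state = %d\n", (int)s);
        }
        return 0;
    }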

CONCLUSION

The above four methods and approaches are complementary and co-existing current solutions to the development of real-time & communication software in the environment.

Thus it can be seen that the indigenous real-time and communication software development experience over the last 25 years has resulted in empirical conclusions which are identical with those arrived at in the West. This is despite the fact that the environment considered does not fully subscribe to the Cartesian rationalistic philosophy and the validity of the Law of Contradiction is accepted as a normal way of life.


INTRODUCTION

The short-term future directions considered are the anticipated improvements in the capture of the Information Resource in the Indian Environment, which will enable extensions of HGA to more sophisticated Information System Life Cycles involving substantial use of the Database Specialism, extending HGA to 4GLs (a case in point being the ongoing use of ORACLE in front office bank computerization efforts). At present ethical, cultural, traditional etc. constraints make the capture of the Information Resource the main bottleneck in India in the use, rather than the applicability, of computers. Increased industrialisation and competition will demand more cosmetic applications, extending the use of HGA to the Graphics Specialism (a case in point being the ongoing cosmetics of military radar application software) with associated extensions of HGA to OOP. The long-term future directions are in the applicability of HGA to the Knowledge Engineering Specialism (a case in point being ORACLE based Knowledge bases in medical applications) and in the direction of the lower end of the handcoding: the non-traditional non-Von Neumann Architectures (a case in point being the use of systolic processors for Image Processing) and non-Electrical Engineering based technologies (a case in point being fingerprint recognition and processing with Optical Computers). The advent of Man Machine Interfaces (MMI) makes possible studies of HGA in non-traditional computation specifications at the higher end of the hand translation, i.e. here we move from the HLL specification/representation to the computation specification (cases: diagnostics, comments, computation specifications and results of computation representations using pictorial and audio MMI). As an indication, an overdone cosmetic application in elementary Civil Engineering, an Estimation and Costing package oriented to middle class Indian housing projects, is outlined. This employs pictorial, audio and text as input/output, in the 14 major languages of India (a multi-lingual interface with a single machine translation).

The role of HLL descriptions turns out to be very crucial and, in principle, the full (total) verification of the software the key factor. However, in practice, it appears that what has been pointed out for practical compiler writing, namely that formal techniques need not be used directly but only help to guide one's thinking, is also applicable to the total verification of the HLL specifications and descriptions.
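A minimal sketch of this guiding role is given below (in C, with hypothetical names): the loop invariant is stated as a comment and checked with assert() during testing, so the formal idea shapes the code without a mechanical verification step.

    /* Hypothetical illustration of formal ideas "guiding one's thinking"
       rather than being applied mechanically. */
    #include <assert.h>
    #include <stdio.h>

    /* Returns the sum of a[0..n-1].
       Invariant before each test of the loop condition:
           sum == a[0] + ... + a[i-1]   and   0 <= i <= n            */
    static long sum_array(const int a[], int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a[i];
            assert(i < n);            /* bounds respected during testing */
        }
        return sum;
    }

    int main(void)
    {
        int a[] = { 3, 1, 4, 1, 5 };
        printf("%ld\n", sum_array(a, 5));   /* prints 14 */
        return 0;
    }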

Thus, in principle, HLL description verifications in the non-traditional computation specifications that advances in Human Machine Interfaces make possible form a useful and crucial study. The paucity of Information Systems Audit and Quality requirements so far will, when the environment develops, require major applications of verification techniques. The paucity of traditional requirements in scientific applications (Numerical Methods software) and the Electronics Design Automation (EDA) areas in the practical scientific, engineering and industrial environments has led only to cursory studies in the area, and these are pointed out as future crucial applications of HGA.

The contemporary drive to streamline software development is to be integrated with the HGA options. The methods and algorithms given by the various generalisations of HGA will need to be considered in the current requirements on software of shortening the time to market, improved quality (by using program verification wherever possible), a need to tailor one's software to available technology, and a need to hold down costs. A study of software life cycles in the indigenous environment has shown that most software (98%) is delivered late and the cost of maintaining software has been 20-30 times the cost of developing the same. The need to reuse, update and re-engineer existing software thus becomes of prime importance.

A recognition of the need for maintainability has given rise to the ISO-9000 awareness and a trend towards the application of rationalistic management and sound engineering principles to software development. A major migration from assembly to C having already taken place, the migration to C++ is actively being contemplated.

The use of editors, debuggers and a searcher for determining software reuse by simpler specification of objects is the trend.

The types of algorithms considered (depth-first search, pattern matching, polyalgorithms for pattern matching etc.) can basically be converted to objects for reuse. Most of these have been considered at machine level, for which response time is critical insofar as execution time is concerned, and such code is time-consuming to write and debug.

Porting of the same to new processors is a very time consuming job as most of the code has to be rewritten. As assembly code is processor specific, it is not modular and thus does not lend itself easily to reusability. The new generation of optimising compilers in the near future is attempting to squeeze the most out of the performance of the processor and is expected to do away with hand-coding, as is being propagated. However, it is seen that the types of optimisation HGA requires involve program modification, conversion of parts of the computation to table lookup, and mapping onto parallel processing configurations for speed-up, and thus in a way constitute 'critical' sections of the overall system program, which will still have to be done by hand. This on first study appears to make portability to the new processors difficult. The much publicized optimising compiler cannot be expected to give more than 5 to 10% larger code size than the best handwritten assembly, but the types of optimisation HGA considers will call for an expert systems base to be incorporated into the optimising compiler, and this seems far away. The shrinking cost of memory is seen to be a major optimisation possibility, using memory-level table look-up or hash tables with pre-computed possibilities of the various computational results.
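A hedged sketch of the memory-for-time trade-off mentioned above follows (in C, with hypothetical names): the bit counts of all byte values are pre-computed once into a table, so that the critical-path cost reduces to table lookups.

    /* Sketch only: replacing a small computation by a precomputed table. */
    #include <stdio.h>

    static unsigned char bit_count[256];

    static void build_table(void)        /* done once, off the critical path */
    {
        for (int v = 0; v < 256; v++) {
            int c = 0;
            for (int b = v; b; b >>= 1)
                c += b & 1;
            bit_count[v] = (unsigned char)c;
        }
    }

    static int popcount16(unsigned v)    /* critical-path version: two lookups */
    {
        return bit_count[v & 0xFF] + bit_count[(v >> 8) & 0xFF];
    }

    int main(void)
    {
        build_table();
        printf("%d\n", popcount16(0xF0F1u));   /* prints 9 */
        return 0;
    }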

Some of the directions shown by the pattern recognition applications of HGA are radar signal processing, control systems, robotics, instrumentation and multimedia applications involving image processing. Here multiprocessing speed-ups will prove to be a major problem of managing software development teams. However, the tasks are easily seen to break down into a subproblem per processor. As the tools at the higher level language level for handling inter-processor communication become available, it will be easier to have parallel processing speed-ups, as in the pattern matching problem; a sketch of such a decomposition is given below.
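The following sketch (in C with POSIX threads, hypothetical and merely indicative) shows the one-subproblem-per-processor decomposition for a naive substring count: the text is split into two chunks, the first chunk reading past the cut by the pattern length so that no straddling match is lost.

    /* Sketch only: naive substring counting split across two worker threads. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    struct job { const char *text; size_t lo, hi; const char *pat; int hits; };

    static void *count_matches(void *arg)
    {
        struct job *j = arg;
        size_t m = strlen(j->pat);
        j->hits = 0;
        for (size_t i = j->lo; i + m <= j->hi; i++)
            if (memcmp(j->text + i, j->pat, m) == 0)
                j->hits++;
        return NULL;
    }

    int main(void)
    {
        const char *text = "abcabcababcabc";
        const char *pat  = "abc";
        size_t n = strlen(text), m = strlen(pat), mid = n / 2;

        /* first chunk reads past the cut by m-1 characters, so a match
           straddling the cut is still counted exactly once */
        struct job a = { text, 0,   mid + m - 1, pat, 0 };
        struct job b = { text, mid, n,           pat, 0 };

        pthread_t ta, tb;
        pthread_create(&ta, NULL, count_matches, &a);
        pthread_create(&tb, NULL, count_matches, &b);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);

        printf("matches = %d\n", a.hits + b.hits);   /* prints 4 */
        return 0;
    }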

The type of multi-processor code-development systems that automatically generate good C code, and are now available, can be used at the higher level. As technology becomes better and more available, the order of magnitude deterioration in speed that a migration to C and C++ involves can be tolerated.

The next step would be to concentrate on the 'what if' portion of the system software by using the HGA developed algorithms as icons. The basic icons represent objects at a different level. In a multi-level approach one could perhaps specify how close to the machine level one should be and what level of shift to a high-level language, with its attendant deterioration in response, is tolerable. At a very high kernel level of operation one could consider the invocation of the subroutines/messaging of the objects as being mapped into the interrupt structure of the processor being used.

The next step of sophistication would be to use the HGA developed core algorithms in a client/server environment using remote procedure calls (RPC).

The high optimisation of critical sections of the code need not be done every time by the designers. As in DSP libraries, one could build up hand-optimised, C-callable assembly routines tailored to a particular processor or LAN configuration for functions critical to peak performance. A possible extension would be to graph, string or matrix algorithms.
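A sketch of this DSP-library style is given below (in C, with a hypothetical routine name): the contract of the critical routine is fixed as a C prototype, a portable reference body is kept for checking, and a hand-optimised assembly body with the same name could be linked in instead on a particular processor.

    /* Sketch only: a C-callable contract with a portable reference body;
       an assembly body (hypothetical dot_asm.s) could replace it at link time. */
    #include <stdio.h>

    long dot_product(const short *a, const short *b, int n);   /* the contract */

    long dot_product(const short *a, const short *b, int n)    /* reference body */
    {
        long acc = 0;
        for (int i = 0; i < n; i++)
            acc += (long)a[i] * b[i];
        return acc;
    }

    int main(void)
    {
        short x[] = { 1, 2, 3 }, y[] = { 4, 5, 6 };
        printf("%ld\n", dot_product(x, y, 3));   /* prints 32 */
        return 0;
    }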

Thus it is seen that, though dramatic breakthroughs have occurred in processor speeds, this will not totally do away with assembly language programming for critical applications. A dent will be made by a migration to C and C++, but the technology will definitely have a higher price tag. HGA as used here is the only solution when the HLL solution gives too high an overhead percentage, and the solution is to take out such modules and hand-code the same using HGA.

HGA combines the biggest advantages of HLL by using their comparatively easy documentation facility, which the final assembly language routines do not have. In the area of embedded software, object oriented programming is easily implemented, and it is here that HGA helps: using the inherent modularity of embedded applications and the C or C++ or PASCAL-like aids to documentation/design/implementation, one finally implements the object in assembly language.
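The following minimal sketch (in C, hypothetical) indicates the style described: the 'object' is documented and exercised in C through a small table of function pointers, and the routine bound to the critical method is the one that would eventually be re-coded in assembly for the target.

    /* Sketch only: an embedded 'object' documented and tested in C. */
    #include <stdio.h>

    typedef struct sensor {
        int (*read_sample)(struct sensor *self);   /* critical method */
        int  last_value;                           /* object state     */
    } sensor_t;

    static int read_sample_c(sensor_t *self)       /* C version; an assembly
                                                       body could replace it */
    {
        self->last_value = 42;                     /* stands in for real I/O */
        return self->last_value;
    }

    int main(void)
    {
        sensor_t s = { read_sample_c, 0 };
        printf("sample = %d\n", s.read_sample(&s));
        return 0;
    }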

The type of HGA applications considered points to another line of approach. One can migrate from one processor to another: if it is known what CPUs are available off-the-shelf, it will be possible to consider different code generation phases for the C/C++ compilers for the target machine.

A major trend in the computer industry is the practically available parallel-computing environments. In the indigenous environment supercomputers, LANs, multiprocessor embedded systems are available.

A choice of programming model is being considered, as traditional versions of HLLs like C and FORTRAN were designed for uniprocessor machines executing a single thread of code. It is expected that FORTRAN 90 will contain data types and statements that allow entire vectors and arrays to be manipulated.

HGA is more oriented to tackling the issue of performance tuning. It is seen that in the area of parallel debuggers and performance monitoring tools few standards exist, and till such time as standards evolve HGA will find a place.

REFERENCES

[1][Act,70] Acton, F.S., Numerical Methods that almost work, Harper & Row, 1970.

[2][Ahi,90] Ahituv,N. & Neumann,S., Principles of Information Systems for Management, Wm.C.Brown, 1990.

[3][Aho,72] Aho,A.V. and Ullman,J.D.,The theory of Parsing,Translation and Compiling, vols. I and II,Prentice- Hall,72.

[4][Aho,74] Aho,A.V. et al., The Design and Analysis of Computer Algorithms, Addison-Wesley, 1974.

[5][Aho,74b] Aho,A.V. and Johnson,S.C., LR Parsing, Computing Surveys of the ACM, 6:2, 99-124, 1974.

[6][Aho,77] Aho,A.V. and Ullman,J.D., Principles of Compiler Design, Addison-Wesley, 1977.

[7][Aho,83] Aho,A.V. et. al., Data Structures and Algorithms, Addison-Wesley,1983.

[8][Aho,86] Aho,A.V. et al., Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986.

[9][Aki,89] Aki, S.G., The design and analysis of parallel algorithms, Prentice-Hall, 1989.

[10][Ala,78] Alagic,s & Arbib, M.A., The design of well- structured and correct programs, Springer-Verlag, 1978.

[11][Ans,74] ANSI COBOL, American National Standards Institute, Inc., 1974.

[][And,86] Fingerprint identification--data format for information interchange, American National Standards Institute, New York, 1986.

[12][Awa,91] Awad,E.M., Systems Analysis and Design, R.D.Irwin Inc., 1991.

[][Bai,84] Baird,H.S., Model-based image matching using location, Ph.D. thesis, Dept. of Electrical Engineering and Computer Science, Princeton University, 1984.

[13][Bar,67] Barron,D.W., Pascal - The Language and its Implementation, Wiley, 1981.

[14][Bat,70] Bates, F., & Douglas, M.L., Programming Language/One, Prentice-Hall, 1970.

[15][Bau,76] Bauer, F. L., & Eickel,J., Compiler Construction: An advanced course, 2nd ed., Lecture Notes in Computer Science 21, Springer-Verlag, 1976.

[16][Bay,67] Bayer,R., et al., The ALCOR ILLINOIS 7090/7094 post-mortem dump, CACM 10, 804-808, 1967.

[][Blu,73] Blum,H., Biological shape and visual science, J. Theoretical Biology, 38, 205-287,1973.

[17][Bro,74] Brooks,F, The mythical man-month, Datamation, 1974.

[][Bun,93] Bunke, H and Buhler, U., Applications of approximate string matching to 2D shape recognition, Pattern Recognition, 12, pp 1797-1812, December 1993.

[][Cow,83] Cowger, J. F., friction Ridge skin: Comparison and Identification of Fingerprints, Elsevier, New York,1983.

[18][CCI,85] CCITT High Level Language, vol VI, fascicle v 1.12, Malaga-Torremolineos, 1985.

[19][Cha,87] Charette, Software Engineering Environments, McGraw Hill, 1987.

[][Cha,92] Chaudhuri,D., et al., A modified metric to compute distances, Pattern Recognition, 25, pp 667-677, 1992.

[20][Cha,93] Chaudhuri,B.B. & Mazumdar,D.D., Two-tone Image Processing and Recognition, Wiley-Eastern, 1993.

[21][Clo,86] Clocksin,W.F. & Mellish,C.S., Programming in Prolog, Springer-Verlag, 1986.

[22][Coh,90] Cohen, D.I.a., Introduction to Computer theory, Wiley, 1990.

[23][Con,63] Conway,M.E., Design of a separable transition-diagram compiler, Comm ACM 6:7, 396-408, 1963.

[24][Con,63b] Conway, R.W. & Maxwell,W.L., CORC- the Cornell Computing Language, Comm ACM 6:6, 317-321, 1963.

[25][Con,70] Conway,R.W., et al., PL/C - A high performance subset of PL/I, TR 70-55, Computer Science Department, Cornell, 1970.

[26][Con,73] Conway,R.W. & Wilcox,T.R., Design & implementation of a diagnostic compiler for PL/1, Comm ACM 16:3, 169-179, 1973.

[][Cor,94] Cortelazzo, G et al, Trademark shapes detection by string matching techniques, Pattern Recognition, 27, pp 1005- 1018, August 1994.

[27][Cow,93] Cowart,R, Mastering Windows 3.1, bpb publications, 1993.

[28][Dat,86] Date,C.J., An Introduction to Database Systems, 4th ed., Addison-Wesley, 1986.

[29][Dav,88] Davis, G.B. & Hoffman, T.R., FORTRAN 77, McGraw Hill,1988.

[30][Der,71] DeRemer, F., Simple LR(k) grammars, comm ACM 14:7, 453-460, 1971.

[31][Dij,76] Dijkstra, E.W., A discipline of programming, Prentice-Hall, 1976.

[32][Fai,85] Fairley,R., Software Engineering Concepts, McGraw-Hill, 1985.

[][Fai,91] Fairhurst et al., Matching structural implementational models in the specification of image classifiers, Pattern Recognition, 24, pp 555-566, 1991.

[][FBI,82] The science of fingerprints: classification and uses, FBI, US Govt. Printing Office, Washington DC, 1984.

[33][Fei,88] Feitelson,D.G., Optical Computers, The MIT Press, 1988.

[34][Flo,68] Feldman,J.A. & Gries,D., Translator writing systems, CACM 11, 77-113, 1968.

[][Gal,1892] Galton,F., Fingerprints, Macmillan, London, 1892.

[][Gao,93] Gao,Q.G., et al., Curves detection based on perceptual organisation, Pattern Recognition, 26, pp 1039-1046, July 1993.

[35][Geh,88] Gehani, N. & McGettrick, A.D. ed., Concurrent Programming, Addison-Wesley, 1988.

[36][Gol,82] Goldschlager,L. & Lister,A., Computer Science - A Modern Introduction, Prentice-Hall, 1982.

[][Gol,92] Goldback,L., What is distance and why do we need the metric model for pattern learning?, Pattern Recognition, 25, pp 431-438, 1992.

[][Gol,92b] Goldgol et al., Matching and motion estimation of three dimensional point and line sets using eigenstructures without correspondences, Pattern Recognition, 25, pp 271-286, 1992.

[37][Gor,88] Gordon,M.J.C., Programming Language Theory and its Implementation, Prentice-Hall, 1988.

[38][Gra,67] Grau, A. A., et al., Translation of Algol-60, Springer-Verlag, 1967.

[39][Gri,65] Gries,D., et al., Some techniques used in the ALCOR ILLINOIS 7090, CACM 8, 496-500, 1965.

[40][Gri,68] Gries, D., The use of transition matrices in compiling, CACM, 26-34, 1968.

[41][Gri,71] Gries,D., Compiler Construction for Digital Computers, John Wiley & Sons, 1971.

[42][Gri,68] Griswold, R.E., et al., The SNOBOL 4 Programming Language, Prentice-Hall, 1968.

[][Gri,90] Griffin, P. M., et al, A methodology for pattern matching of complex objects, Pattern Recognition, 23, pp 245- 254, 1990.

[43][Han,77] Hansen, P.B., The Architecture of Concurrent Programs, Prentice-Hall, 1977.

[44][Har,87] Harrington,S., Computer Graphics - A Programming Approach, McGraw Hill, 1987.

[][Har,91] Haralick,R.M. et al., Glossary of computer vision terms, Pattern Recognition, 24, pp 69-93, 1991.

[][Hen,1900] Henry,E.R., Classification and Uses of Fingerprints, Routledge, London, 1900.

[][Her,1880] Herschel,W.J., Skin furrows on the hand, Nature 23, 76, 1880.

[][Hre,90] Hrechak,A.K. and McHugh,J.A., Automated fingerprint recognition using structural matching, Pattern Recognition, 23, pp 893-904, 1990.

[][Hil,93] Hildebrandt,T.H., et al., Optical recognition of Chinese characters: advances since 1980, Pattern Recognition, 26, pp 205-225, 1993.

[45][Hoa,73] Hoare,C.A.R. & Wirth,N., An axiomatic definition of the programming language PASCAL, Acta Informatica, 2, 1973.

[46][Hop,69] Hopcroft,J.E. & Ullman,J.D., Formal Languages and their Relation to Automata, Addison-Wesley, 1969.

[47][Hop,79] Hopcroft, J.E. & Ullman, J.D., Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.

[48][Hun,77] Hunt,J.W. & Szymanski,T.G., A fast algorithm for computing longest common subsequences, Comm ACM 20:5, 350-353, 1977.

[49][IBM,68] IBM PL/I(F) Compiler-Programming Logic Manual, Form Y28-6800, 1968.

[49][Ind,91] Industry Structure Model, Release 2, British Computer Society, 1991.

[50][Inf,93] Information Technology Software Life Cycle Process, LTD 33(1094)P1, Bureau of Indian Standards, 1993.

[51][Ive,62] Iverson, K, A Programming Language, Wiley, 1962.

[52][Jai,93] Jain, M.K., et al., Numerical Methods for Scientific & Engineering Computation, Wiley Eastern, 1993.

[53][Kah,89] Kahaner, D., et. al., Numerical Methods & Software, Prentice-Hall, 1989.

[54][Ker,78] Kernighan,B.W. and Ritchie,D.M., The C Programming Language, Prentice-Hall, 1978.

[55][Ker,84] Kernighan,B.W. and Pike,R., The UNIX Programming Environment, Prentice-Hall, 1984.

[56][Kle,56] Kleene,S.C., Representation of events in nerve nets and finite automata, in Automata Studies, 3-42, Princeton University Press, 1956.

[57][Kle,39] Klein,F., Geometry, Dover, 1939.

[58][Kne,85] Knepley, Ed. & Platt, R., Modula-2 Programming, Reston, 1985.

[59][Knu,71] Knuth, D.E., An empirical study of FORTRAN programs, Software Practice and Experience, 1:2,105-133, 1971.

[60][Knu,73] Knuth, D.E., The Art of Computer Programming, vols 1-3, Addison-Wesley, 1973.

[][Koc,87] Koch,M.K. and Kashyap,R.L., Using polygons to recognize and locate partially occluded objects, IEEE Trans. Pattern Anal. Mach. Intell., 9(4), 483-494, July 1987.

[][Kun,94] Kumar, S., et al, Parallel algorithms for circle detection in images, Pattern Recognition, 27, pp 1019-1028, August 1994.

[][Kur,94] Kurita,T. et al., Invariant distance measures for planar shapes based on complex autoregressive models, Pattern Recognition, 27, pp 903-911, 1994.

[][Lan,86] Landau, G.M. et al, Efficient string matching with k mismatches, Theoretical Computer Sci., 43, 239-249, 1986.

[][Lav,83] Lavine,D., et al., Recognition of spatial point patterns, Pattern Recognition, 16, 289-295, 1983.

[61][Law,87] Lawrence, P.D. & Marek,K., Real-time microcomputer system design- an introduction, McGraw Hill, 1987.

[62][Lee,87] Lee,D.T. and Preparata,F.P., Computational Geometry: A survey, IEEE Transactions on Computers, vol C-33, no.12, 1072-1101, 1984.

[][Lee,94] Lee,S.H., et al., A dynamic programming approach to line segment matching in stereo vision, Pattern Recognition, 29, pp 65-78, 1994.

[][Li,92] Li,S.Z., et al., Matching: Invariant to translations, rotations and scale changes, Pattern Recognition, 25, pp 583-594, 1992.

[63][Lin,72] Lindsey,C. and van der Meulen,S.G., Informal Introduction to Algol-68, American Elsevier, 1972.

[][Lin,82] Lin,C.H. et al., Fingerprint comparison I: Similarity of prints, J. Forensic Sci., 27, 290-304, 1982.

[][Lin,82b] Lin,C.H., et al., Fingerprint comparison II: On the development of a single fingerprint filing and searching system, J. Forensic Sci., 27, 305-317, 1982.

[][Liu,91] Liu,Y., et al., Determining straight line correspondences from intensity images, Pattern Recognition, 24, pp 489-504, 1991.

[][Liu,92] Liu,Y., et al., Three-dimensional motion determination from real scene images using straight line correspondences, Pattern Recognition, 25, pp 617-639, 1992.

[][Llo,87] Lloyd,S.A., et al., A parallel binocular stereo algorithm utilizing dynamic programming and relaxation labelling, Pattern Recognition, 23, pp 305-317, 1987.

[64][Loe,87] Loeckx,J. & Sieber,K., The Foundations of Program Verification, Wiley-Teubner, 1987.

[][Mas,93] Masuda,T. et al., Detection of partial symmetry using correlation with rotated-reflected images, Pattern Recognition, 8, 1245-1253, 1993.

[65][Mck,70] McKeeman,W.M., et al., A Compiler Generator, Prentice-Hall, 1970.

[66][Mic,87] Microsoft- GW BASIC- User's Guide and Reference Manual,1987.

[67][Mor,72] Morgan,H.L., Spelling correction in systems programs, CACM 13:2, 90-94, 1972.

[68][Mor,87] Morgan,R., et al., Introducing the UNIX System V, McGraw Hill, 1987.

[69][Nor,81] Nori, K.V. et al., Pascal(P) Implementation Notes in Barron[1981], pp 125-170, 1981.

[70][Nau,63] Naur,P., ed., Revised Report on the Algorithmic Language ALGOL-60, CACM 6, 1-17, 1963.

[71][Nau,76] Naur,P., et al., ed., Software Engineering, Concepts & Techniques, Petrocelli, 1976.

[][Oga,86] Ogawa,H., Labelled point pattern matching by Delaunay triangulation and maximal cliques, Pattern Recognition, 19, 35-40, 1986.

[72][Oxf,86] Oxford Dictionary of Computing, Oxford University Press, 1986.

[73][Pak,72] Pakin,S APL\360, Science Research Associates, 1972.

[][Par,93] Park,J.H. et al., Three-dimensional object representation and recognition based on surface normal images, Pattern Recognition, 26, pp 913-922, 1993.

[74][Pec,71] Peck,J.E.L., Algol 68 Implementation, North-Holland, 1971.

[75][Per,77] Perrott,R. H., ed., Software Engineering, Academic Press, 1977.

[76][Phi,86] Phillippakis, A.S. & Kazmier, C.J., Structured COBOL, McGraw Hill, 1986.

[77][Ran,64] Randell,B. & Russell,L.J., Algol 60 Implementation, Academic Press, 1964.

[][Ran,80] Ranade,S. and Rosenfeld,A., Point pattern matching using relaxation, Pattern Recognition, 12, 269-275, 1980.

[][Ran,91] Ranka,S., et al., Two-dimensional pattern matching with k mismatches, Pattern Recognition, 24, pp 31-40, 1991.

[][Rei,94] Rei,S.C., et al., Finding the motion, position and orientation of a planar patch in 3D space from scaled-orthographic projection, Pattern Recognition, 27, pp 9-26, 1994.

[78][Rit,79] Ritchie,D.M., A tour through the UNIX C Compiler, AT & T Bell Laboratories, 1979.

[79][Rit,74] Ritchie, D.M. and Thompson, K., The UNIX time- sharing system, CACM 17:7, 365-375, 1974.

[80][Rob,93] Robbins, J., Mastering DOS 6, BPB Publications, 1993.

[81][Ros,67] Rosen,S., Programming Systems and Languages, McGraw Hill, 1967.

[82][Roy,93] Roy,M.K. & Dastidar,D.G., COBOL Programming, Tata McGraw Hill, 1993.

[83][Sub,91] Sabata,B. & Aggarwal,J.K., Estimation of motion from a pair of range images: A review, Image Understanding, vol 54, no 3, 309-324, 1991.

[84][Sam,69] Sammet, J.E., Programming Languages: History & Fundamentals, Prentice-Hall, 1969.

[85][Sam,60] Samelson,K and Bauer, F.L., Sequential Formula Translation, Comm ACM 3:2,76-83, 1960.

[][Sam,92] Samal, A., et al, Automatic recognition and analysis of faces and facial expressions: a survey, Pattern Recognition, 25, pp 65-77, 1992.

[86][San,81] Sanders, D.H., Computers in Business, McGraw Hill, 1981.

[][Sav,87] Saviers, K.D., Friction skin characteristics: A study and comparison of proposed standards, Garden Grove California Police Department, July(1987).

[87][Sch,90] Scheifler,R.W. & Gettys,J., X Window System, Prentice-Hall, 1990.

[][Sco,51] Scott,D., Fingerprint Mechanics - A Handbook, Charles C. Thomas, Springfield, Illinois, 1951.

[88][Ser,93] Serpes, M., Mastering ORACLE 6.0, BPB Publications, 1993.

[89][Sha,88] Shaw,A.C., The Logical Design of Operating Systems, Prentice-Hall, 1988.

[][Spa,85] Sparrow,M.K. et al., A topological approach to the matching of single fingerprints: Development of algorithms for use on latent fingermarks, National Bureau of Standards Special Publication 500-126, U.S. Govt. Printing Office, Washington D.C., 1985.

[][Spa,85b] Sparrow,M.K. et al., A topological approach to the matching of single fingerprints: Development of algorithms for use on rolled impressions, National Bureau of Standards Special Publication 500-124, U.S. Govt. Printing Office, Washington D.C., 1985.

[][Spa,91] Spragur,A.A. et al., A method for the automatic inspection of printed circuit boards, Image Understanding, 54, pp 401-415, 1991.

[90][Sta,83] Starnes,T.W., Design Philosophy behind Motorola's MC 68000, BYTE, 70-92, April 1983.

[][Sto,86] Stoney,D.A. et al., A critical analysis of quantitative fingerprint models, J. Forensic Sci., 31, 1187-1216, 1986.

[91][She,90] Sheldon,T., Novell Netware, Osborne McGraw-Hill, 1990.

[92][Ste,71] Stearns,R.E., Deterministic top-down parsing, Proc. 5th Annual Princeton Conf. on Information Sciences and Systems, 182-188, 1971.

[93][Ste,84] Steele, G.L., Jr., Common LISP, Digital Press, 1984.

[94][Ste,90] Stevens, W.R., Unix network programming, Prentice-Hall, 1990.

[95][Sto,80] Stoneman Requirements for Ada Programming Support, Department of Defence, USA, 1980.

[96][Str,86] Stroustrup,B., The C++ Programming Language, Addison-Wesley, 1986.

[97][Tan,85] Tanenbaum,A.S., Computer Networks, Prentice-Hall, 1981.

[98][Tan,91] Tanenbaum,A.S., Structured Computer Organisation, Prentice-Hall, 1991.

[99][Ten,81] Tennent,R.D., Principles of Programming Languages, Prentice-Hall, 1981.

[][Ter,84] Teresa,M.A., et al., An iterative Hough procedure for three-dimensional object recognition, Pattern Recognition, 17, pp 621-630, 1984.

[][Ton,89] Ton,J. and Jain,A.K., Registering Landsat images by point matching, IEEE Trans. Geosci. Remote Sensing, 27, 642-651, September 1989.

[100][Toy,72] Toynbee, Arnold., A study of History, Thames & Hudson, 1972.

[101][Tre,85] Tremblay, J & Sorenson, P.G., The theory & practice of compiler writing, McGraw Hill, 1985.

[102][Vic,84] Vick,C.R. & Ramamoorthy,C.V., Handbook of Software Engineering, Van Nostrand, 1984.

[][Vin,93] Vinod,V.V. and Ghose,S., Point matching using asymmetric neural networks, Pattern Recognition, 8, 1207-1214, 1993.

[103][Weg,75] Wegner,P., Programming Languages, Information Structures and Machine Organisation, McGraw-Hill, 1968.

[][Weg,82] Wegstein,J.H., An automated fingerprint identification system, National Bureau of Standards Special Publication 500-589, US Govt. Printing Office, Washington DC, 1982.

[104][Wir,66] Wirth,N. & Weber,H., EULER: a generalisation of ALGOL and its formal definition, Part I, Comm ACM 9:1, 13-23, 1966.

[105][Wir,76] Wirth,N., Algorithms + Data Structures = Programs, Prentice-Hall, 1976.

[][Won,92] Wong,E.K., Model matching in robot vision by subgraph isomorphism, Pattern Recognition, 25, pp 271-286, 1992.

[107][Woo,87] Wood, D., Theory of Computation, John Wiley & Sons, 1987.

[108][Wul,71] Wulf,W.A., et al., BLISS: A language for systems programming, CACM, 14:12, 780-790, 1971.

[109][Wul,75] Wulf, W. A., et. al., The Design of an Optimising Compiler, American Elsevier, 1975.

[][Yan,94] Yang, G et al, Human face detection in a complex background, Pattern Recognition, 27, pp 53-64, Jan 1994.

[][Yoo,93] Yoo, J.H., An ellipse detection method from the polar and pole definition on conics, Pattern Recognition, 26, pp 307-315, 1993.

[110][You,79] Yourdon,E., et al., Learning to Program in Structured COBOL, Prentice-Hall, 1979.

[][Yuc,93] Yuceer,C. et al., A rotation, scaling and translation invariant pattern classification system, Pattern Recognition, 26, pp 687-710, May 1993.