Beginner's Mind

Saturday, October 11, 2008

Refactoring Expression Trees

In the first version of my Ix interpreter library I had two expression tree types: AST and AST (Abstract Syntax Tree and Abstract Semantic Graph, respectively). The idea was that syntax would be turned into an AST just after parsing, then by successive transformations become an Abstract Semantic Graph, where nodes in the graphs link to each other in appropriate ways (and so isn't really a tree at all).

The problem with this approach was that there are many transformation processes: the one that parses syntax to create AST, some transformations from AST to AST, and more from ASG to ASG. Where should which transformations take place? The idea, initially, was for all variable resolution to have taken place by the time the graph was an ASG, but this takes place in stages, and there is no easy way to keep track of "an AST with this kind of partial resolution".

Instead, an ideal language would allow the creation of custom types for each state in the process. One might imagine the series of transformations being expressed through a linear type system.

One could approximate a linear type system simply by giving stages names, for example a freshly parsed tree is of type ASG while the variable-resolution result is of type resolved ASG or something similar. After each transformation process we can "annotate" the value type to explain what's been done to it. It would only be meaningful to the user, but type compatibility prevent one from trying to execute, say, a tree which has not performed variable resolution or type checking. Immediately before the execution of the interpreter, we'd check for the existence of resolution, type checking, and other annotations.

I aim to have Ix support such features. We'll see how it works out.

Labels:

Friday, October 3, 2008

XML as a front-end syntax

(This is the first in what I hope to be a series of many more blog posts written about the programming language I am developing called Ix (pronounced like Styx).)

I'm sure that the title of this post has already inspired horror in the readers who have understood its true implication.

In the course of working on the Ix internals, the VM and evaluation logic, I've had a need to test various properties of the system. In test fashion, I often need to mock or stub out pieces I have not yet implemented. Since I haven't yet come close to nailing down the syntax, I've needed a means to specify tree-structured data, to specify the syntax tree to internal components. Viola! XML seemed like the right fit.

Indeed, XML seemed the right fit as a choice of Intermediate Representation; the idea being to parse the front-end Ix language, convert it into an Abstract Syntax Tree, then process it from there. If one wanted to serialize any of these steps to disk, a language like XML seems the natural fit.

Having been toying with the concept of using XML for the IR and specifying test programs in IR directly, it dawned on me to consider using XML as a front-end language. Surely this has been tried? Many people are familiar with Ant -- would they want to write a whole program this way?

On first impression there are many downsides from writing code with XML. Things can rarely be expressed concisely. However, I think there may be some advantages too:

Improved Block Clarity

When in the middle of a large chunk of code, the reader may be better able to understand it due to the closing syntax precisely explaining which blocks are closing. For example:


<fn name="something">
<if ...>
...

...
</if>
</fn>


When the developer comes to the end of the block, there is no confusion which block is closed. While the close of a straightforward function is easy to understand, I believe there will be many cases, the equivalent of Java code like the following, where readability is actually improved in this manner. Here is a snippet of Java code from the test Ix evaluator:

          public RuntimeException apply(Val val) {
return new InvalidValueTypeException(context, expected, val);
}
});
}
public FnVal visit(FnVal fn) {
return fn;
}
}


That's a lot of closing blocks! Can you tell what's what? Well no, because I omitted the top portion -- but even looking at the full code, you may be scrolled down and cannot see the top section either. I posit that some code like the following may be more readable:

    (You get the idea...)

</fn>
<fn name="visit"><public/><in name="fn" type="FnVal"/><out type="FnVal"/>
<return>fn</return>
</fn>
</class>


This property has been touted as an advantage of XML before; whether it actually improves programming readability is up for grabs, but seems in XML's favor. With a decent editor, the engineer won't actually end up typing more than "</", and so won't require more typing than typical languages' syntax.

Domain-Specific Languages are More Obvious and Better Supported

Here I am using an unusual definition of the term "Domain-Specific Language", which is traditionally a specialized foreign language, often as embedded within a general language. SQL is an example of a domain-specific language for database queries.

When I am working with a language, I often think of it as comprised of many smaller languages. Consider C++, for example. One has the "language of types", where T[N] means "array of type T, lengh N", such as int[10]. This is an entirely different language (in my thinking) than the "language of expressions", wherein T[N] means "access the Nth element of array T".

One of the interesting and novel aspects of C was the parallel syntax of arrays and types, seen in arrays and pointers and many other places. I am uncertain whether this kind of "analogy" is beneficial or detrimental, though I probably lean towards the latter, simply for the reason that a single lexical form having multiple meanings can be confusing because context is required to differentiate them (as is the case for type inference and other forms of conciseness-improving features which I believe harm readability -- but that's for another post).

In any case, I think of programming languages as compositions of these "domain-specific languages". Modern languages have many, including:

  1. The language of types ( T* )

  2. The language of expressions ( a + b )

  3. The language of hierarchy ( a.b.c ), often overlapping language of expressions

  4. The language of classes and structure; classes contain methods; methods contain the language of expressions



Thinking of languages in this way helps me to conceptually organize the syntax and semantics -- what kind of operators can be invoked in various circumstances. It also provides an obvious path to a modular language implementation.

Getting back to the point, as a language designer and a language user, I find it helpful for the circumstances to be very specific when we "context switch" into a domain-specific languages.

Some expressions absolutely require custom language support. For example, given a generic function-call syntax like Lisp S-expressions, one would not prefer to write ((. abc def) ghi) instead of abc.def.ghi; it's less clear. The author of Clojure, a Lisp-like language for the JVM, added a special syntax to parallel Java's dot operator because sequential, hierarchical access is unpalatable with S-expressions. In Clojure, (. target member ...) can be written (.member target ...), making the form more Lisp-like, and the language has primitive straightforward support for a Java-like dot operator too: new java.util.HashMap translates to Clojure (new java.util.HashMap).

With an XML syntax, such specific languages manifest as actual XML-embedded domain specific languages. While we could perform a literal translation of the S-expression form ((. abc def) ghi) into XML, the result would be poorly readable. Can we do better, but with XML? Perhaps something like this: <dot><o>abc</o><o>def</o><o>ghi</o></dot>. Ouch :-( No one is going to want to use a language like that. We fundamentally need a form more concise than XML can provide.

Instead, we can embed our language within XML, something like this: <dot>abc.def.ghi</dot>. It's "embedded" in the sense that the document has structure which is not represented in the XML -- generally an XML no-no as I understand it. However, for a programming language, our goal is to provide a workable syntax, and the embedded form is almost in every way preferable. (The only manner in which it's not preferable is that the expression requires a custom, non-XML parser.)

This is effectively a disadvantage of XML, in that one needs to work outside the XML framework to accomplish something, but that's only a problem if one sets out to do everything in XML, rather than use it for its strengths. The advantage of this kind of embedding is that it makes the embedding very obvious to the user; the engineer can be come directly familiar with the various sorts of embedded languages as he sees the various XML nodes invoking them. He understands which cases invoke which kind of syntax and there is little room for a lack of clarity in this matter.

An XML programming language syntax also naturally supports user extensions for domain-specific languages, since all it needs to do is permit the embedding of arbitrary XML trees in the language structure -- the syntax of the language is in this manner naturally extensible. One could imagine an implementation that allows language-engine-plugins to work on the abstract document, transforming custom DSL nodes into standard language nodes until no DSL nodes are left; and if any are left, there is a syntax error.

An XML-based language could support this much more easily, and in a more first-class manner, than others.

Interoperability

This property is simple. XML is a well-known, very well-supported standard. It would be easy for proprietary programs to read and write the language if its syntax is XML, since so much software already speaks XML.

The simple, parsable syntax would facilitate additional machine processing of the syntax itself -- for things like refactoring or bug finding. Handling the syntax is no longer the arduous task of writing a parser.

Support for Literate Programming

I imagine an XML language would provide excellent support for Literate Programming in Knuth fashion:

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: "Literate Programming."

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.


In short, Literate software contains extensive documentation and other elements meant mainly for human, not computer, consumption. XML would obviously provide incomparable support for this, as HTML and other XML-based documents could be embedded directly in the programs.




Overall, the use of XML syntax presents both opportunities and inherent disadvantages. It's difficult to conceive of a pure XML language, which represents all language structure as XML document structure. I wouldn't want to use that. However, there are some apparent significant advantages that are making me carefully look at the option of providing a front-end syntax for Ix which is primarily XML.

Regardless of whether XML ends up being a permanent front-end for the language, I think taking a critical look at an XML representation of the language tree will give me more familiarity with the deep structure and relationship between language elements, which should improve the ultimate syntax, whatever it may be.

Labels: