Beginner's Mind

Sunday, June 28, 2009

Moving to Tumblr

I've moved my blog to Tumblr. Check it out at http://blog.jcrites.com

Sunday, December 21, 2008

Cross-applicable

Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all sentences short or avoid all detail and treat subjects only in outline, but that every word tell.

From "The Elements of Style" by William Strunk Jr

Monday, December 8, 2008

Semantic Patches

One of the goals of my programming language Ix is to support the evolution of programs in a first-class way. I haven't figured out all of the details, but the idea is for the language to know about version 1 and version 2 of the program, and to provide warnings when certain kinds of dangerous transformations occur.

Having a static type system also provides similar functionality without any explicit notions of transformations through time; if you change a function to remove a parameter, then client code which needs to be changed will fail to compile. Thus the static type system provides implicit support for verifying program changes.

Could a language support something more powerful? Some others were thinking along these lines and developed Semantic Patches:

Our goal is to document and automate the kinds of collateral evolutions that occur in code. Because programmers are accustomed to manipulating program modifications in terms of patch files, we base our transformation language on the patch syntax, extending patches to semantic patches. But as opposed to a traditional patch, a single small semantic patch can modify hundreds of files, at thousands of code sites. This is because the features of our semantic patch language (SmPL) make a semantic patch generic by abstracting away the specific details and variations at each code site among all drivers. Semantic patches, and the associated transformation engine spatch, abstract away:


  • Differences in spacing, indentation, and comments

  • Choice of names given to variables (use of metavariables)

  • Irrelevant code (use of '...' operator)

  • Other variations in coding style (use of isomorphisms) e.g. if(!y) <=> if(y==NULL) <=> if(NULL==y)




A natural extension of this technology is to perform validation along with the transformation.

Friday, October 31, 2008

An Introduction to Ix

For years, I have had a dream. This dream has been a waking dream, a fantasy that preoccupies my waking thoughts. It is a mundane dream, in a sense: I dream of my ideal tool, the tool I use to practice my trade. Mundane in the sense that a carpenter dreaming of the ideal hammer is mundane.

The tool I use is more abstract. It’s the tool man uses to control machines, a programming language. Unlike natural languages, such as English or Spanish, programming languages are invented by men deliberately to serve their purposes. There are many of these languages, thousands. Why so many?

I started thinking about developing my own language in my first college programming languages course. “Wow, this is easy,” I thought after implementing my first interpreter. It’s not easy to make it run fast, but it’s relatively easy to make it exist.

My design of Ix has been influenced by the best of all the programming languages I’ve used. In a sense I’ve been most impressed by Lisp, where code is data, and programs can rewrite themselves. However, in order to achieve the code is data objective, Lisp had to give up a lot of readability and conciseness of which other languages are capable.

I felt that Lisp came very close to hitting the mark, the mark for which I myself am aiming (“the perfect programming language”), but ultimately missed. One of its failings is its choice of primitive data structure, the cons cell or the pair. During the development of Ix, I’ve made it a priority to select the ideal fundamental structure.

The language’s fundamental construct should be one with which we can build any data structure known to computer science. The cons cell fails at this. It can’t, for example, be used to implement arrays. Although Lisp’s syntax was designed so that “there is only one sort of stuff” (the cons cell), and all syntax is expressed with it, I think it’s the wrong primitive construct with which to approach computer science, and the wrong fundamental construct even for language syntax. You can see it break down when Lisp needs to represent arrays or associative structures. (‘a 1 ‘b 2) is not a clear way to represent that ‘a maps to 1 and ‘b to 2.

Having decided the ideal primitive construct can’t be cons, I set out to find it. Since in the language I am building data is code, that also means that the primitive with which I represent language syntax will also be the primitive with which to build data structures. The first structure I considered was the array. Indeed, an array of length 2 is a cons, so surely arrays are at least as good a choice.

I came very close to deciding upon the array as the basic structure, but what held me back was the feeling that the language should be able to support named parameters. Yes, that’s all. I hacked together a clunky implementation where the basic syntactical unit had both array- and map-like properties before I finally put things together after reading Steve Yegge’s blog post called The Universal Design Pattern. I don’t necessarily think he was meaning to convey the idea I took away from it. Steve intended to convey that (property) maps are a useful data structure with which to build programs; what I ended up realizing is that they are actually the ideal fundamental data structure for programming language syntax, and are the most fundamental structure in computer science.

To make this connection it helps to think of data structures in an abstract mathematical sense. Consider arrays and linked lists. What are the most basic properties of these structures?

I would argue that fundamentally, these data structures are associations, specifically functions. They associate position (index) to an element value. A linked list with a section of the Fibonacci sequence {5, 8, 13, 21, 34} is a function mapping the indexes {1, 2, 3, 4, 5} to the values, so:

Index

Value

1

5

2

8

3

13

4

21

5

34

An array with the same contents represents the same association. Looking at the situation with the greatest degree of abstraction, arrays and linked lists are two implementations of a function; being a function is their first property. Only when we move down a rung on the abstraction ladder does the performance of the two structures come into play, such as the lookup or insertion time.

Lisp-style functional programming has taken a rather different approach on the situation, looking at the cons cell as a general recursive type, which it is. But without the language having a generalized concept of collections, it’s hard to get there.

So, Ix is Lisp-like but starts with the premise that the basic unit of syntax is the map. The key is taking the realizations from earlier, and noting that elements in a map don’t have to be “named” in any sense. The key might just be an integer which is the element’s position.

Ix’s syntax starts out like Lisp. (f a b) represents calling the function f with arguments a and b. In many contexts, like at the beginning of a statement, a function call is implied, and so the parentheses may be omitted. Assuming I can properly avoid ambiguity, C-style function call notation may be used: f(a b).

For clarity, the symbols “;” or “,” can be used to delimit arguments, so our recurring example could be written (among other ways):

· f(a,b)

· f(a;b)

· (f a; b)

Function call syntax is a map. A named argument can be specified with the maps-to operator, which is the symbol “:”. Example: open-socket(host : “example.com”; port : 22).

Aside from the function call syntax “( )”, Ix has two other primitive notations: (explicit) maps, and blocks.

Collections

Maps, primarily known as collections, are created by placing an expression inside “[ ]”. Elements in the maps may be identified by name (explicitly) or by position (implicitly). For example, a map associating the symbol A with 1 and B with 2 can be written explicitly with the aforementioned maps-to operator:

[ a : 1; b : 2]

Elements which are not explicitly assigned a name are implicitly assigned a position. For example, the portion of the Fibonacci sequence mentioned earlier can be written:

· [ 5 8 13 21 34 ]

· [ 5, 8, 13, 21, 34 ]

· [ 5; 8; 13; 21; 34 ]

Ix presents this variety of syntax meaning the same thing to allow the expression of connotations – generally, for the programmer to use the syntax that “looks right”.

The syntax of function calls and the syntax of collections (maps) is thus the same; the only difference is that function calls indicate function-call evaluation, while the collection syntax indicates the creation of a collection.

Blocks

The remaining type of “object group” notation is the block, which uses “{ }”. Block syntax indicates the creation of an anonymous function; the contents of a block is a collection of expressions. For example, (+ a b) represents the sum of a and b (evaluated presently), while {+ a b} represents an anonymous function taking parameters a, b and returning their sum. When a block is evaluated, its constituent expressions are evaluated sequentially.

By default, parameters are not positional, so one would have to call such a function with named parameters. For example:

({+ a b} a : 1, b : 2)

The final primitive piece of Ix notation is the function binding operator, written “=>”. The block notation creates an anonymous function, and any contained variables are free variables. In order to evaluate the function, the variables must be assigned values. This can be done by calling the function with named arguments, or can be done by using the binding operator, which assigns function arguments positions. For example:

[a b] => {+ a b}

The result of the previous example expression is a function with two positional arguments.

Macros

So far, we’ve seen how to call functions, how to create collections of values, and how to create anonymous functions. We haven’t yet seen how to create variables or anything else.

A design goal of Ix is for the syntax to be unified, following the Lisp “code is data” mantra. In Ix, code is indeed data, and the creation of variables and other standard constructs is through special function calls.

An Ix macro is a function which takes as its arguments program fragments, called expressions. It may choose to evaluate these expressions, or may not, or may transform them to some other meaning. One important Ix macro is def, which creates named variables. The following example creates a variable n with value 3:

def n 3

How does it make sense for “n” to be an argument to a function call if the variable doesn’t exist? The answer lies in the concept of the macro: the macro isn’t a function of values, but rather program expressions. Thus, the def macro actually takes in the expression “n” in an abstract sense, and by inspecting it creates a variable binding.

Another macro is if, which takes in a value and two blocks, and acts as one would expect. Assuming some basic comparison operators, we now have enough tools to implement the Fibonacci function:

def fib [n] => {

if (> n 2) {

+ fib(- n 1) fib(- n 2);

} {

n;

}

}

What we have now is a syntax very similar to a high-level language, one that is much friendlier to humans, yet every component is formed from the simple primitives explained herein.

I haven’t yet fully designed the syntax and semantics for higher constructs like classes and interfaces, but I do have some ideas how they’d work. First, consider an additional operator :: which acts like Haskell’s “$”, which right-associates the remaining expression. In other words, assume the following expressions are congruent:

· g (f a)

· (g (f a))

· g :: f a


I'm hesitating t to post this publicly, just because I know the language is likely to undergo rapid evolution as I apply increasing scrutiny to the syntax, and begin implementing samples in the language. The precise syntax needn't be decided with finality this early on, partially because I'm designing the interpreter in a very abstract manner (working only on "expression tree" objects) so as to function as an API exposing the syntax for other consumers, but also because one of my goals is to allow the language to be "skinned" in the sense that translation to/from various syntaxes should be pluggable.

Thursday, October 30, 2008

You can't insure what has already happened

I like Obama. He seems like a moral, intelligent guy who thinks a lot about the decisions he has to make. I would expect his policies to be well thought-out, reasoned from first principles.

That's why I was surprised to read the details of his proposed health plan. They make no sense. I'm not speaking of my opinion -- I'm speaking of logical, objective sense.

Obama's page on healthcare states the following:

Require insurance companies to cover pre-existing conditions so all Americans regardless of their health status or history can get comprehensive benefits at fair and stable premiums.


Before I respond to this proposal specifically, let's review what "insurance" is. Bad things happen in the world: houses burn down. A lot of people invest a significant portion of their savings into their houses, and they'd be in a very bad place if the house were to burn down.

Disasters are unpredictable, and we don't know whether any particular house might burn down. We can estimate general risk, based on general factors, such as the temperature and the surroundings (dry brush nearby?). Consider the two possible outcomes after purchasing insurance:

  • Your house burns down, in which case you come out even.

  • Nothing happens, in which case you've suffered a loss of the premium.


The idea behind insurance is that we're betting something bad will happen. We're placing a bet such that we come out ahead if the bad thing happens. We're probably risk-averse, meaning we'd prefer to pay a few dollars a month to remove the (financial) risk, even if we don't expect it to happen. Or perhaps we're betting that the odds of our house burning down are large enough to warrant the cost of the insurance premium.

The insurance company charges a little bit more than the statistical cost of someone's house burning down, and so takes in a profit. The customers get piece of mind, knowing that there is no chance of losing their savings. This works out fine -- but it's entirely based on the fact that we don't know whether the house will burn down. That's the concept of risk.

If your house is currently on fire, will you be able to get any insurance company to sell you an insurance policy? Of course not. Since the house is already on fire, there is nothing to "insure" against. It's going to burn down, and someone will have to suffer that loss.

"Insuring" against events which have already happened is not only bad business, it simply makes no sense. When the house has already burned down, it makes no sense to after the fact ask an insurance company to sell you a policy. Hopefully now the direction of this essay is obvious.

Obama's healthcare plan is senseless because it requires healthcare insurance companies to give up their "pre-existing condition" clauses. But without those clauses it's not insurance!

The reason you can pay a premium of (say) $200 per month to have your healthcare costs covered in case you get sick and have a $10,000 hospital bill is precisely because it's not known whether you will get sick and incur the cost. If it's known in advance that you will get sick for sure, there is no basis with which to make a bet. The only thing to "insure" is other, unforeseen illnesses.

Forcing healthcare companies to remove pre-existing condition clauses is tantamount to forcing housing insurance companies to sell insurance to people whose houses are currently on fire. Some people currently can't get healthcare insurance because they're already to sick and there is nothing to insure. Requiring healthcare insurance companies to cover them only passes their financial burden on to everyone else, in the form of their premiums going up.

Insurance is about things that haven't happened yet. Anyone who disagrees is welcome to sell economic insurance against George W. Bush winning the 2004 Presidential elections, or against the housing crisis.


I strongly support healthcare reform in America. I do believe the system is bad, and I haven't made up my mind, but I probably support government-provided healthcare rather than insurance. I have no problem with that; I believe that people who understand such things well should go to work and build something better. We can discuss such a system and reason about it, again from our morals and first principles. I support working towards that goal, knowing from the beginning that we're operating outside the free market, because market efficiency is not our goal.

But I can't get behind government interference with the market in a way that is senseless and in staunch opposition to the principles the underlie that market. If you're creating a market, don't interfere with it. An insurance market is for insurance; there's nothing to insure when you already know the outcome.

Saturday, October 11, 2008

Refactoring Expression Trees

In the first version of my Ix interpreter library I had two expression tree types: AST and AST (Abstract Syntax Tree and Abstract Semantic Graph, respectively). The idea was that syntax would be turned into an AST just after parsing, then by successive transformations become an Abstract Semantic Graph, where nodes in the graphs link to each other in appropriate ways (and so isn't really a tree at all).

The problem with this approach was that there are many transformation processes: the one that parses syntax to create AST, some transformations from AST to AST, and more from ASG to ASG. Where should which transformations take place? The idea, initially, was for all variable resolution to have taken place by the time the graph was an ASG, but this takes place in stages, and there is no easy way to keep track of "an AST with this kind of partial resolution".

Instead, an ideal language would allow the creation of custom types for each state in the process. One might imagine the series of transformations being expressed through a linear type system.

One could approximate a linear type system simply by giving stages names, for example a freshly parsed tree is of type ASG while the variable-resolution result is of type resolved ASG or something similar. After each transformation process we can "annotate" the value type to explain what's been done to it. It would only be meaningful to the user, but type compatibility prevent one from trying to execute, say, a tree which has not performed variable resolution or type checking. Immediately before the execution of the interpreter, we'd check for the existence of resolution, type checking, and other annotations.

I aim to have Ix support such features. We'll see how it works out.

Labels:

Friday, October 3, 2008

XML as a front-end syntax

(This is the first in what I hope to be a series of many more blog posts written about the programming language I am developing called Ix (pronounced like Styx).)

I'm sure that the title of this post has already inspired horror in the readers who have understood its true implication.

In the course of working on the Ix internals, the VM and evaluation logic, I've had a need to test various properties of the system. In test fashion, I often need to mock or stub out pieces I have not yet implemented. Since I haven't yet come close to nailing down the syntax, I've needed a means to specify tree-structured data, to specify the syntax tree to internal components. Viola! XML seemed like the right fit.

Indeed, XML seemed the right fit as a choice of Intermediate Representation; the idea being to parse the front-end Ix language, convert it into an Abstract Syntax Tree, then process it from there. If one wanted to serialize any of these steps to disk, a language like XML seems the natural fit.

Having been toying with the concept of using XML for the IR and specifying test programs in IR directly, it dawned on me to consider using XML as a front-end language. Surely this has been tried? Many people are familiar with Ant -- would they want to write a whole program this way?

On first impression there are many downsides from writing code with XML. Things can rarely be expressed concisely. However, I think there may be some advantages too:

Improved Block Clarity

When in the middle of a large chunk of code, the reader may be better able to understand it due to the closing syntax precisely explaining which blocks are closing. For example:


<fn name="something">
<if ...>
...

...
</if>
</fn>


When the developer comes to the end of the block, there is no confusion which block is closed. While the close of a straightforward function is easy to understand, I believe there will be many cases, the equivalent of Java code like the following, where readability is actually improved in this manner. Here is a snippet of Java code from the test Ix evaluator:

          public RuntimeException apply(Val val) {
return new InvalidValueTypeException(context, expected, val);
}
});
}
public FnVal visit(FnVal fn) {
return fn;
}
}


That's a lot of closing blocks! Can you tell what's what? Well no, because I omitted the top portion -- but even looking at the full code, you may be scrolled down and cannot see the top section either. I posit that some code like the following may be more readable:

    (You get the idea...)

</fn>
<fn name="visit"><public/><in name="fn" type="FnVal"/><out type="FnVal"/>
<return>fn</return>
</fn>
</class>


This property has been touted as an advantage of XML before; whether it actually improves programming readability is up for grabs, but seems in XML's favor. With a decent editor, the engineer won't actually end up typing more than "</", and so won't require more typing than typical languages' syntax.

Domain-Specific Languages are More Obvious and Better Supported

Here I am using an unusual definition of the term "Domain-Specific Language", which is traditionally a specialized foreign language, often as embedded within a general language. SQL is an example of a domain-specific language for database queries.

When I am working with a language, I often think of it as comprised of many smaller languages. Consider C++, for example. One has the "language of types", where T[N] means "array of type T, lengh N", such as int[10]. This is an entirely different language (in my thinking) than the "language of expressions", wherein T[N] means "access the Nth element of array T".

One of the interesting and novel aspects of C was the parallel syntax of arrays and types, seen in arrays and pointers and many other places. I am uncertain whether this kind of "analogy" is beneficial or detrimental, though I probably lean towards the latter, simply for the reason that a single lexical form having multiple meanings can be confusing because context is required to differentiate them (as is the case for type inference and other forms of conciseness-improving features which I believe harm readability -- but that's for another post).

In any case, I think of programming languages as compositions of these "domain-specific languages". Modern languages have many, including:

  1. The language of types ( T* )

  2. The language of expressions ( a + b )

  3. The language of hierarchy ( a.b.c ), often overlapping language of expressions

  4. The language of classes and structure; classes contain methods; methods contain the language of expressions



Thinking of languages in this way helps me to conceptually organize the syntax and semantics -- what kind of operators can be invoked in various circumstances. It also provides an obvious path to a modular language implementation.

Getting back to the point, as a language designer and a language user, I find it helpful for the circumstances to be very specific when we "context switch" into a domain-specific languages.

Some expressions absolutely require custom language support. For example, given a generic function-call syntax like Lisp S-expressions, one would not prefer to write ((. abc def) ghi) instead of abc.def.ghi; it's less clear. The author of Clojure, a Lisp-like language for the JVM, added a special syntax to parallel Java's dot operator because sequential, hierarchical access is unpalatable with S-expressions. In Clojure, (. target member ...) can be written (.member target ...), making the form more Lisp-like, and the language has primitive straightforward support for a Java-like dot operator too: new java.util.HashMap translates to Clojure (new java.util.HashMap).

With an XML syntax, such specific languages manifest as actual XML-embedded domain specific languages. While we could perform a literal translation of the S-expression form ((. abc def) ghi) into XML, the result would be poorly readable. Can we do better, but with XML? Perhaps something like this: <dot><o>abc</o><o>def</o><o>ghi</o></dot>. Ouch :-( No one is going to want to use a language like that. We fundamentally need a form more concise than XML can provide.

Instead, we can embed our language within XML, something like this: <dot>abc.def.ghi</dot>. It's "embedded" in the sense that the document has structure which is not represented in the XML -- generally an XML no-no as I understand it. However, for a programming language, our goal is to provide a workable syntax, and the embedded form is almost in every way preferable. (The only manner in which it's not preferable is that the expression requires a custom, non-XML parser.)

This is effectively a disadvantage of XML, in that one needs to work outside the XML framework to accomplish something, but that's only a problem if one sets out to do everything in XML, rather than use it for its strengths. The advantage of this kind of embedding is that it makes the embedding very obvious to the user; the engineer can be come directly familiar with the various sorts of embedded languages as he sees the various XML nodes invoking them. He understands which cases invoke which kind of syntax and there is little room for a lack of clarity in this matter.

An XML programming language syntax also naturally supports user extensions for domain-specific languages, since all it needs to do is permit the embedding of arbitrary XML trees in the language structure -- the syntax of the language is in this manner naturally extensible. One could imagine an implementation that allows language-engine-plugins to work on the abstract document, transforming custom DSL nodes into standard language nodes until no DSL nodes are left; and if any are left, there is a syntax error.

An XML-based language could support this much more easily, and in a more first-class manner, than others.

Interoperability

This property is simple. XML is a well-known, very well-supported standard. It would be easy for proprietary programs to read and write the language if its syntax is XML, since so much software already speaks XML.

The simple, parsable syntax would facilitate additional machine processing of the syntax itself -- for things like refactoring or bug finding. Handling the syntax is no longer the arduous task of writing a parser.

Support for Literate Programming

I imagine an XML language would provide excellent support for Literate Programming in Knuth fashion:

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: "Literate Programming."

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.


In short, Literate software contains extensive documentation and other elements meant mainly for human, not computer, consumption. XML would obviously provide incomparable support for this, as HTML and other XML-based documents could be embedded directly in the programs.




Overall, the use of XML syntax presents both opportunities and inherent disadvantages. It's difficult to conceive of a pure XML language, which represents all language structure as XML document structure. I wouldn't want to use that. However, there are some apparent significant advantages that are making me carefully look at the option of providing a front-end syntax for Ix which is primarily XML.

Regardless of whether XML ends up being a permanent front-end for the language, I think taking a critical look at an XML representation of the language tree will give me more familiarity with the deep structure and relationship between language elements, which should improve the ultimate syntax, whatever it may be.

Labels: