Monday, August 10, 2015

Creating your own DSL - Parsing

  1. Why use a DSL?
  2. Why create your own DSL?
  3. What makes a good DSL?
  4. Creating your own DSL - Parsing
  5. Creating your own DSL - Parsing (with Ruby)
  6. Creating the Packager DSL - Initial steps
  7. Creating the Packager DSL - First feature
So far, we've talked about the whys and wherefores of DSLs. If you've made it this far, you probably agree that DSLs are a good idea. You have probably identified a spot in your processes where a DSL would make life so much easier. So, let's get started.

At its heart, creating a programming language (or DSL) comes down to three activities:
  1. Parsing the program file into a data structure
    • If there are errors in syntax, inform the user here.
  2. Validating the data structure
    • If there are errors in semantics, inform the user here.
  3. Executing the requested activities
    • If there are errors in what was attempted, inform the user here.
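
The three activities above can be sketched as three small functions, each with its own kind of error. This is only an illustration, not the design this series ends up with: the "key value" line format, the required keys, the unreachable-address rule, and every class and method name are assumptions made up for the example.

```ruby
# Hypothetical three-phase pipeline; every name here is illustrative.
class DslSyntaxError < StandardError; end
class DslSemanticError < StandardError; end
class DslRuntimeError < StandardError; end

REQUIRED_KEYS = %w[name ip].freeze

# Phase 1: parse the program text into a data structure.
# The made-up syntax is one "key value" pair per line.
def parse(source)
  source.each_line.with_index(1).each_with_object({}) do |(line, number), data|
    next if line.strip.empty?
    key, value = line.strip.split(/\s+/, 2)
    raise DslSyntaxError, "line #{number}: expected 'key value'" if value.nil?
    data[key] = value
  end
end

# Phase 2: validate the data structure (semantics, not syntax).
def validate(data)
  missing = REQUIRED_KEYS - data.keys
  raise DslSemanticError, "missing: #{missing.join(', ')}" unless missing.empty?
  data
end

# Phase 3: execute the requested activity. Here it only builds a message.
def execute(data)
  raise DslRuntimeError, "cannot reach #{data['ip']}" if data["ip"] == "0.0.0.0"
  "provisioned #{data['name']} at #{data['ip']}"
end

def run(source)
  execute(validate(parse(source)))
end
```

Keeping the phases separate means each error can tell the user which kind of mistake was made: a malformed line, a missing setting, or a failure while doing the work.
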
Before we can design our language, we need to pick our parsing process. The parsing step is the user interface, so the parser we choose strongly shapes what kind of language we can create for our users. If we pick a very simplistic parser, we constrain ourselves to a possibly unusable DSL. It doesn't matter what wonderful things our DSL can do if no-one is able to work with the language itself. (For example, compare COBOL and BASIC with Java, Python, and Ruby.) On the other hand, if we pick a very complex parser, we may never end up creating the language at all.

There are hundreds of tools for creating programming languages. (Make no mistake - you're doing just that.) The problem with most of them (and the reason most DSLs are never created) is that they're far too complicated: they expect you to write a lexer and a parser that generate an AST. Most developers simply will not be able to reframe their simple DSL in terms of tokens and similar parsing terms.

The good news is that most DSLs do not need the full treatment of a parser+lexer. Most DSLs are better ways of describing nested data structures. So, maybe we should try something like JSON or YAML. And some tools (such as Ansible and Salt) do just that. They use a YAML parser to handle step 1. This definitely solves the problem of parsing - just let someone else do it. :)
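
As a rough sketch of "let someone else do it", here is what step 1 looks like when it is handed to Ruby's standard YAML library. The `servers` layout and the `load_servers` helper are assumptions for illustration, not the actual format used by Ansible or Salt.

```ruby
require "yaml"

# Hypothetical DSL file contents: a plain nested data structure.
SERVER_YAML = <<~'DOC'
  servers:
    - name: www1.place.com
      ip: 192.168.50.1
    - name: www2.place.com
      ip: 192.168.50.2
DOC

# Step 1 (parsing) is entirely the YAML library's job. We only translate
# its syntax error into something addressed to the DSL user.
def load_servers(text)
  YAML.safe_load(text).fetch("servers")
rescue Psych::SyntaxError => e
  raise ArgumentError, "syntax error in DSL file: #{e.message}"
end
```

Calling `load_servers(SERVER_YAML)` returns an array of hashes, one per server, ready for the validation step.
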

YAML and JSON, though, while very easy for the DSL author to work with, are a really poor interface for the DSL user. Data structures become extremely brittle the moment you use them for anything non-trivial. This matters because users will be working within the DSL 100x more than the author will be working within the DSL definition. We should optimize (wherever possible) for the most usage.

The first major issue is the lack of variables. Every programming language (and DSLs are no exception) works in problem spaces that want to reuse the same values. When using just YAML, the user ends up having to re-specify the same value over and over. When (not if!) that value changes, the user will invariably forget one place to change the value.

An enterprising DSL author might decide to create a section in their document called "variables" (or somesuch) and allow the user to specify a hashtable of key-value pairs for use elsewhere in the DSL. This would work, but it becomes very cumbersome to work with. Now, the user has to specify "This is a variable lookup" which means the author has to provide special tokens (or sigils) for doing that. Now, the user cannot just "use YAML". It's YAML+. And every DSL author will have their own unique '+'. No-one wants to invest in non-transferable skills.
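
To make the "YAML+" problem concrete, here is a minimal sketch of what the author ends up writing. The `variables` section, the `${name}` sigil, and the `expand` helper are all invented for this example, which is exactly the point: no YAML parser knows about them.

```ruby
require "yaml"

# Hypothetical "YAML+" document. The `${domain}` sigil is our own invention.
YAML_PLUS = <<~'DOC'
  variables:
    domain: place.com
  servers:
    - name: www1.${domain}
    - name: www2.${domain}
DOC

# Walk the parsed structure and substitute `${name}` in every string.
# An unknown variable raises KeyError, which the author must also report.
def expand(node, vars)
  case node
  when Hash   then node.transform_values { |value| expand(value, vars) }
  when Array  then node.map { |value| expand(value, vars) }
  when String then node.gsub(/\$\{(\w+)\}/) { vars.fetch(Regexp.last_match(1)) }
  else node
  end
end

def load_yaml_plus(text)
  doc = YAML.safe_load(text)
  expand(doc.fetch("servers"), doc.fetch("variables", {}))
end
```

The YAML library still does the parsing, but the substitution rules, the escaping rules, and the error messages for the sigil are now a small language of their own that the user has to learn.
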

The second, larger issue is the lack of control mechanisms. Programming languages provide three key control mechanisms:
  1. Branching (if-then-else, switch/case)
  2. Looping (for loops, while loops)
  3. Abstraction (subroutines)
Without these, data structures balloon in size because the user ends up having to repeat themselves. An example illustrates this best:

server {
      name "www1.place.com"
      ip "192.168.50.1"
      ...
}
server {
      name "www2.place.com"
      ip "192.168.50.2"
      ...
}
server {
      name "www3.place.com"
      ip "192.168.50.3"
      ...
}

versus

(1 .. 3).each do |id|
      server {
            name "www#{id}.place.com"
            ip "192.168.50.#{id}"
            ...
      }
end

Now, imagine the server definition runs to 50 lines, all of them identical except for the two lines in the example. And instead of 3 webservers, your product is in heavy use and has 10 servers. Or 30. Which would you prefer to maintain, as a user? Your users will feel the same way.

The maintainers of Salt and Ansible recognized this problem and provide an interesting solution. Their DSL files aren't just YAML. Salt state files are, by default, Jinja2 templates that render to YAML, and Ansible playbooks are YAML with Jinja2 expressions embedded in them. Jinja2 is a templating language that provides variables and the control structures missing from YAML. Obviously, some products (Salt and Ansible) feel this is a workable solution. Their users must feel the same way, or they wouldn't use the product.
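
The template-then-parse approach can be sketched in Ruby, with one loud caveat: Jinja2 is a Python library, so this example swaps in ERB from Ruby's standard library as a stand-in. The template contents and the `render_servers` helper are assumptions for illustration and do not reproduce any Salt or Ansible file.

```ruby
require "erb"
require "yaml"

# Hypothetical template: ERB tags (standing in for Jinja2) wrapped around YAML.
# The user has to read and write both syntaxes at once.
TEMPLATE = <<~'DOC'
  servers:
  <% (1..3).each do |id| -%>
    - name: www<%= id %>.place.com
      ip: 192.168.50.<%= id %>
  <% end -%>
DOC

# Pass 1 renders the template to YAML text; pass 2 parses that text.
def render_servers(template)
  yaml_text = ERB.new(template, trim_mode: "-").result
  YAML.safe_load(yaml_text).fetch("servers")
end
```

The loop is back, but so is a second language: a mistake in the template tags and a mistake in the YAML indentation produce errors from two different tools.
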

I don't agree. This forces users to learn two languages - YAML and Jinja2. That's twice the barrier to entry and twice the opportunity for the user to make an error. DSLs are meant to reduce the barriers to entry and reduce the potential for error. But, we still want someone else to handle all the parsing (because parsing is hard and error-prone). We need an easy way to set key-value pairs (because that's most of what we want), but we still want variables and all the valuable control structures. While 90% of all the usage of the type of DSL we're writing could be satisfied by basic YAML, we want the escape hatch of a full-on programming language for when we need it. There has to be a better tool.

That better tool is Ruby.
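
As a taste of why, the loop example from earlier in this post is already nearly valid Ruby. The following is a minimal sketch, not the design the rest of this series builds: `ServerBuilder`, `Definitions`, and `load_definitions` are hypothetical names, and it relies on `instance_eval`, which means the DSL file can run arbitrary Ruby code.

```ruby
# Hypothetical builder: records the settings named inside `server { ... }`.
class ServerBuilder
  def initialize
    @settings = {}
  end

  def name(value)
    @settings[:name] = value
  end

  def ip(value)
    @settings[:ip] = value
  end

  def to_h
    @settings.dup
  end
end

# Hypothetical top-level context: the DSL file is evaluated against this.
class Definitions
  attr_reader :servers

  def initialize
    @servers = []
  end

  def server(&block)
    builder = ServerBuilder.new
    builder.instance_eval(&block) # `name` and `ip` resolve on the builder
    @servers << builder.to_h
  end
end

# Ruby's own parser handles step 1; we never write a lexer or a grammar.
def load_definitions(source)
  Definitions.new.tap { |defs| defs.instance_eval(source) }.servers
end

DSL_SOURCE = <<~'DOC'
  (1..3).each do |id|
    server {
      name "www#{id}.place.com"
      ip "192.168.50.#{id}"
    }
  end
DOC
```

Calling `load_definitions(DSL_SOURCE)` returns three hashes, one per server. Variables, loops, and key-value settings all come from one language, and Ruby reports syntax errors for us.
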