GopherCon 2018 - How to Write a Parser in Go
Presenter: Sugu Sougoumarane
Liveblogger: Nick Snyder
Sugu Sougoumarane is the co-creator of Vitess, which contains a SQL parser that is now used by many other projects. In this talk he demonstrates how to write a parser using goyacc.
Summary
- goyacc is almost an exact translation of the original yacc, except it is written in Go.
- Using a parser generator like goyacc is a quick way to get a working parser for a LALR(1) grammar.
- Lex is not code that you live in; it is code you write once and then use for a long time. It is ok if the code is not clean.
- Code examples
Parser use cases
- Language
- Data files (most common)
- Network wire format
- Complex state machines
How to write a parser in two easy steps
goyacc
- Translates a formal grammar into code.
- Creates a state machine that it runs through.
- Works with LALR(1) grammars (look head one token and decide what action to take).
- A surprising number of languages can be parsed with LALR(1) parsers.
- Embeds custom code.
Why goyacc
- Readable - the file format reads like natural language
- Extensible - easy to add rules
- Easy testing - yacc is already tested, so only need to test your own parts as input to yacc
- Efficient - because it uses a state machine
- Detect conflicts - it will tell you if you add grammar rules that conflict (this is the best part)
Why not goyacc
- If you need error handling. It can tell you syntax error at position, but can't recover from error or give nicer error messaging.
- If you need extreme efficiency; can't beat hand-coded.
- If you need to parse a complex (non-LALR(1)) grammar.
Using goyacc
General steps:
- Express grammar.
- Grammar rules can return values and assign types.
- Write code to return values.
- Write lexical analyzer.
Goyacc is almost an exact translation of the original yacc so some of the idiosyncrasies have been inherited. For example, C programs return only 1 value: 0 for success and 1 for failure. This means you need awkward boilerplate to give values to the lexer:
Use go generate to create the actual parser.go file:
How to parse a phone number
Area code has three parts: area code, first part, second part.
Captital letters signify tokens.
How to return values
The generated parser is just a single function that runs a state machine and uses local variables.
These variables are saved in a union data structure:
Actions run Go code (i.e. everything inside the braces) when a rule matches. Dollar variables address a variable that is a value returned by the parser.
Lexing
Two things are happening concurrently during lexing:
- Rules are getting matched. The int returned determines which rules match.
- Code is getting executed depending on which rule is matched. The result value is used inside the code you write.
Sometimes lex can return the byte itself as an int. Yacc has builtin predetermined tokens so all first 127 bytes are reserved and can be returned without telling the parser you are returning them
How does it work?
Goyacc
- Generates a const string per symbol
- Defines an interface for the Lexer
- Defines a parse function that accepts the lex as input
Options for lexing
- Rob Pike https://talks.golang.org/2011/lex.slide
- Fatih Arslan https://medium.com/@farslan/a-look-at-go-scanner-packages-11710c2655fc
- Use a ‘lex’ port
- go/scanner
- Hand-code
Lex is not code that you live in. It is code you write once and then use for a long time. Ok if the code is not clean.
Future improvements
For complicated grammars (e.g. SQL), Goyacc can generate a big result structure that is expensive to pass around. Goyacc actually assigns this structure every time there is a state transition.
C (yacc) has structure called union which efficiently packs the datastructre, but there is no equivalent in Go....except interfaces are a very close equivalent!
Unlike C union type, you can type assert an interface in Go. One limitation with using a type asserted Go interface is that it is an rvalue which means you can't assign to it.
Switching Vitess to use an interface instead of struct doubles performance, but would be a backward incompatible change to goyacc.