SH:Blog:

Recursive Descent Parsing Without Lexing

You don't have to make a lexer first, before parsing, in writing your recursive descent parser. The main annoyance of parsing without lexing is that you have to put these "skip_whitespace"† function calls everywhere and keep track of whether you've skipped whitespace already and can expect a non-whitespace character beginning a token.‡

Let's say you've got a function that parses an identifier. The question you'll ask yourself is, should parse_identifier skip whitespace first? Or should it expect whitespace to be skipped for it? (What should the general rule be, here?)

The best choice is to have all your functions expect whitespace to be pre-skipped for them.

This makes it very likely that your whitespace-related bugs (where you forgot to skip whitespace) are clearly visible and revealed by testing. It means that your code is going to be crystal clear about where whitespace is supposed to have been skipped, because whenever functions are called, that's a point where we expect it to have been skipped.

It's also a good idea, but not as useful, to avoid having functions that sometimes skip whitespace before returning, but not all the time. To fix that, put a skip_whitespace at the end of those control flow paths which otherwise don't skip whitespace. That way, you're less likely to have a place where you wrongfully assumed you didn't need to skip whitespace.

(We don't want to gratuitously skip whitespace, ever, because then anybody reading the code will have no idea when and where it's really necessary, which means any changes will need yet more gratuitous calls.)

- Sam

(posted July 12 '16)


† It also skips comments.
‡ The secondary annoyance is some extra lookahead/rewind logic when dealing with keywords and other token-like things.