Advanced Syntax

Precedence Parsing
Non-associative operators
"Advanced" Syntax

Precedence Parsing

We've all heard about "order of operations," which tells us how to treat syntax like a * b + c * d ^ e ^ f / g as if it were (a * b) + ((c * (d ^ (e ^ f))) / g). For the sake of example we're using ^ for exponentiation, because that's right-associative.

We can specify precedence levels for binary operators by giving them numbers and telling whether they're left or right associative.

+, - : 10, left
*, / : 20, left
^    : 30, right

The operators with higher precedence get their neighbors attached to them first, and ties are resolved based on left/right associativity. But here's what I do. Instead of saying "left" and "right," specify "left-precedence" and "right-precedence" values, or "grabbing-strengths," that say how hard an operator grabs at its neighbor.

+, - : 104, 106
*, / : 204, 206
^    : 306, 304

We can see the grabbing strengths in action:

a 204*206 b 104+106 c 204*206 d 306^304 e 306^304 f 204/206 g

Based on the grabbing strengths,

b gets pulled left because 206 > 104
c gets pulled right because 106 < 204
d gets pulled right because 206 < 306
e gets pulled right because 304 < 306
f gets pulled left because 304 > 204

(Hold on. You might be bored. Skip to non-associative operators if you want to skip the hum-drum rehash of precedence climbing.)

What happens next? Operators that successfully pull both of their neighbors inward will "win." We parenthesize those sub-expressions first:

(a * b) 104+106 c 204*206 d 306^304 (e ^ f) 204/206 g

And then we keep parenthesizing winners until we're done:

(a * b) 104+106 c 204*206 (d ^ (e ^ f)) 204/206 g

(a * b) 104+106 (c * (d ^ (e ^ f))) 204/206 g

(a * b) 104+106 ((c * (d ^ (e ^ f))) / g)

((a * b) + ((c * (d ^ (e ^ f))) / g)

Let's say we're writing a parser by hand – in other words, we consume and peek at input off an input stream. We define a function to parse expressions, from left-to-right, like this:

enum Direction { Left, Right, Indeterminate };

Expr parse_expr(InputStream *p) {
    return parse_expr_after_prec(p, 0);
}

// We're walking through a text file from left to right. prec is
// the right precedence of whatever operator we've already parsed,
// grabbing at the stuff we're about to parse, from the left side.
// This function consumes atoms and binary operators until
// it gets to a weaker operator.
Expr parse_expr_after_prec(InputStream *p, int prec) {
    // First consume an "atom" -- something that isn't a binary operator.
    Expr lhs = parse_atom(p);

  continue_after_lhs:
    // Look for the operator but don't consume it.
    char op;
    if (!try_peek_operator(p, &op)) {
        // No operator, we're at the end of the expression.
        return lhs;
    }
    int op_left_prec = left_precedence(op);
    Direction dir = grab_direction(prec, op_left_prec);
    switch (dir) {
    case Left:
        // The next operator grabs more weakly, so we abandon!
        return lhs;
    case Right: {
        // (Consume the operator we peeked at.)
        char op2 = parse_operator(p);
        assert(op2 == op);

        // op grabs more strongly.  So, first, we consume its
        // right-hand expression, using its right precedence.
        Expr rhs = parse_expr_after_prec(p, right_precedence(op));

        // op grabbed more strongly than whatever's to the left of lhs,
        // and it grabbed more strongly than whatever's to the right of
        // rhs.    So we'd parenthesize (lhs op rhs), so-to-speak, and make
        // that our new lhs.
        lhs = build_binop_expr(lhs, op, rhs);
        goto continue_after_lhs;
    } break;
    case Indeterminate:
        // There's no ordering relation between prec and op_left_prec.
        // Parse fails!
        throw parse_error("sorry!");
    }
}

Direction grab_direction(int lhs, int rhs) {
    if (lhs > rhs) {
        return Left;
    } else if (lhs < rhs) {
        return Right;
    } else {
        return Indeterminate;
    }
}

int left_precedence(char op) {
    switch (op) {
    case '+': return 104;
    case '-': return 104;
    case '*': return 204;
    case '/': return 204;
    case '^': return 306;
    default:
        throw logic_error("bad operator!");
    }
}

int right_precedence(char op) {
    switch (op) {
    case '+': return 106;
    case '-': return 106;
    case '*': return 206;
    case '/': return 206;
    case '^': return 304;
    default:
        throw logic_error("bad operator!");
    }
}

The loop in parse_expr_after_prec is a standard technique for parsing operators. You can modify it to handle ternary operators, function call syntax (if it's not highest-precedence(??)) and generally, whatever you need. Unary operators should be given maximum precedence and be handled by parse_atom.

Non-associative operators

We use grab_direction to compare grabbing strengths. This gives us the flexibility to define a partial ordering. Suppose we change its definition:

Direction grab_direction(int x, int y) {
    // Treats e.g. 105 as being incomparable to 104 and 106.
    if (lhs > rhs + 1) {
        return Left;
    } else if (lhs + 1 < rhs) {
        return Right;
    } else {
        return Indeterminate;
    }
}

Suppose we add an operator "%" (the modulo operator) with left-precedence 205 and right-precedence 205. Then expressions like

a 204/206 b 205%205 c

are invalid. The reason is, ask yourself: which grabs b harder? It's a stalemate. Thus you're not allowed to mix modulo up with multiplication and division – you're forced to use parentheses.

In fact, I recommend the following choices, if you're in charge of the syntax:

+, - : 104, 106
*    : 204, 206
/    : 204, 205
%    : 205, 205
^    : 306, 304

This prohibits expressions like a / b / c and a / b * c.

"Advanced" Syntax

We're currently using numbers to represent grabbing strengths, and requiring the number to differ by at least 2 in order for an operator to "win." This defines a partial ordering. We could make a fancier partial ordering. Suppose we add the bitwise operators "&" and "|", and the equality operator, "==", to the language. We want to:

prohibit mixing bitwise operators with any others, as in a & b | c and a & b + c
permit associativity with the same operator, like a & b & c
permit comparisons like a + b == c | d

We could define precedence levels this way:

+, - : A-104, A-106
*    : A-204, A-206
/    : A-204, A-205
%    : A-205, A-205
^    : A-306, A-304  // exponentiation, not xor

&    : B-1, B-1
|    : C-1, C-1

==   : D-0, D-0

The partial ordering is that the A's worked as before, and A > D, B > D, and C > D. In code, like this:

struct PrecLevel {
    char letter;  // A, B, C, D
    int number;
};

Direction grab_direction(PrecLevel lhs, PrecLevel rhs) {
    switch (lhs.letter) {
    case 'A':
        switch (rhs.letter) {
        case 'A':
            if (lhs.number > rhs.number + 1) {
                return Left;
            } else if (lhs.number + 1 < rhs.number) {
                return Right;
            } else {
                return Indeterminate;
            }
        case 'B': return Indeterminate;
        case 'C': return Indeterminate;
        case 'D': return Left;
        }
    case 'B':
        switch (rhs.letter) {
        case 'A': return Indeterminate;
        case 'B': return Left;
        case 'C': return Indeterminate;
        case 'D': return Left;
        }
    case 'C':
        switch (rhs.letter) {
        case 'A': return Indeterminate;
        case 'B': return Indeterminate;
        case 'C': return Left;
        case 'D': return Left;
        }
    case 'D':
        switch (rhs.letter) {
        case 'A': return Right;
        case 'B': return Right;
        case 'C': return Right;
        case 'D': return Indeterminate;
        }
    }
}

And that's how you can make a language with better syntax.

(Do you want to know why the levels are 104, 106, 204, ... and not 14, 16, 24, ...? Ask a BASIC programmer.)

- Sam

(posted June 28 '16)

Update: Table-Based Precedence

A simpler way to formulate operator precedence is with a table of size n². For operators x and y, the table specifies whether the expression a x b y c pulls left or right. In other words, the table displays the value of grab_direction(prec_level(x), prec_level(y)).

	+	-	*	/	%	^	&	\|	==
+	L	L	R	R	R	R			L
-	L	L	R	R	R	R			L
*	L	L	L	L		R			L
/	L	L				R			L
%	L	L				R			L
^	L	L	L	L	L	R			L
\|								L	L
&							L		L
==	R	R	R	R	R	R	R	R

(updated May 27 '20)

	+	-	*	/	%	^	&	\|	==
+	L	L	R	R	R	R			L
-	L	L	R	R	R	R			L
*	L	L	L	L		R			L
/	L	L				R			L
%	L	L				R			L
^	L	L	L	L	L	R			L
\|								L	L
&							L		L
==	R	R	R	R	R	R	R	R

	+	-	*	/	%	^	&	\|	==
+	L	L	R	R	R	R			L
-	L	L	R	R	R	R			L
*	L	L	L	L		R			L
/	L	L				R			L
%	L	L				R			L
^	L	L	L	L	L	R			L
\|								L	L
&							L		L
==	R	R	R	R	R	R	R	R

	+	-	*	/	%	^	&	\|	==
+	L	L	R	R	R	R			L
-	L	L	R	R	R	R			L
*	L	L	L	L		R			L
/	L	L				R			L
%	L	L				R			L
^	L	L	L	L	L	R			L
\|								L	L
&							L		L
==	R	R	R	R	R	R	R	R