Deep AST matching guide

So you want to take advantage of the javascript AST, and move beyond elemental use cases, (e.g. finding string concatenation in order to convert to template literals).

The technique well go over is match more than a single node, while keeping track of where we are within the AST as we traverse it. As a simple example, let's say you want to match function-returns where the return is an object with a key called `coolKey`, but only if the function name also starts with `cool`. We're going to need a rough understanding of the structure of the AST, and how AST traversing libraries move through the tree structure.

Firstly the structure of the AST. For any given thing you want to match on, you can simply paste your code into astexplorer.net and inspect it, though something to keep in mind is that it's generally safe to assume that the AST will follow the same rough structure as your code indentation. We use indentation as a way to visually represent hierarchy of the code, and the AST has a similar hierarchy (though with a great deal many more levels). They are similar enough to know that the return statement of a function is going to be nested further in the structure than the function definition.

Now AST libraries. These generally take callbacks for entering and existing nodes, though in what order? How do they pick the next node to enter? Since most folks will be familiar XML style tree structure, let's use that as an example.

```
<body>
    <node1>
        <node2>
            <node3>
                <node4>
                </node4>
            </node3>
        </node2>
        <node5>
        </node5>
    </node1>
    <node6>
    </node6>
    <node7>
    </node7>
</body>
```

It will call enter and leave nodes in the same order as if you were reading the above line-by-line. Or how I like to think of it is, once it enters a node, it will fully explore everything nested in it before it leaves that node. To be 100% explicit the order would be.

enter-node1, enter-node2, enter-node3,
enter-node4, leave-node4, leave-node3,
leave-node2, enter-node5, leave-node5,
leave-node1, enter-node6, leave-node6,
enter-node7, leave-node7

Back to our first example, putting the simplest version of what we're trying to match into AST explorer.

function coolFunction() {
    return {
      coolKey: 'should match',
    }
}

The AST for this (with properties remove for brevity) would look like:

{
    "type": "FunctionDeclaration",
    "id": {
        "type": "Identifier",
        "name": "coolFunction"
    },
    "body": {
        "type": "BlockStatement",
        "body": [{
            "type": "ReturnStatement",
            "argument": {
                "type": "ObjectExpression",
                "properties": [{
                    "type": "Property",
                    "key": {
                        "type": "Identifier",
                        "name": "coolKey"
                    },
                    "value": {
                        "type": "Literal",
                        "value": "should match",
                    },
                }]
            }
        }]
    }
}

This example is simple enough that we could match the entire thing from the function declaration node. However, that wouldn't be very robust as the return statement could be nested in an `if` statement, possibly multiple times, changing the structure of the AST significantly. It's better to match the function declaration node and the return node separately. Doing so will also be useful later if we want to expand code to cover more cases, say class and arrow function declarations.

Code for matching the function declaration and the return statement:

const isCoolFunction = node => 
    node.type === 'FunctionDeclaration' &&
    node.id.type === 'Identifier' &&
    node.id.name.startsWith('cool')

const isCoolReturn = node => 
    node.type === 'ReturnStatement' &&
    node.argument.type === 'ObjectExpression' &&
    node.argument.properties.some(property => 
        property.key &&
        property.key.type === 'Identifier' &&
        property.key.name === 'coolKey'
    )

Putting it all together, we can now match both the function name and the return statement. We know the return node is nested within the function declaration node. Therefore we'll only match the return statement if we're already within a `coolFunction` node by setting a flag `isMatchCoolFunction`.

let isMatchCoolFunction = false;
const onEnter = node => {
    if (isCoolFunction(node)) {
        isMatchCoolFunction = true;
    }
    if (isMatchCoolFunction && isCoolReturn(node)) {
        console.log('found it 🤘, so cool!');
    }
};

const onLeave = node => {
    if(isCoolFunction(node)) {
        isMatchCoolFunction = false;
    }
};

Please note that we're matching the `coolFunction` node again when leaving the function node so that we can unset set flag. Without this step, we'd start matching cool return statements with un-coolFunctions ... Not cool at all 💅.

Some more sincere examples I know of are from using a framework like Angular where I aimed to get the component name and template path together (`coolDirective` and `template/path.html` respectively).

angularModule.directive('coolDirective', () => {
    // component logic
    return {
        template: require('template/path.html'),
    };
});

The process would be very similar to what we just went through, use a flag to for when you've entered a component declaration, then match the template string. What if there are multiple components chained off one another.

angularModule
    .directive('coolDirective1', () => {/* ... */})
    .directive('coolDirective2', () => {/* ... */})
    .directive('coolDirective3', () => {/* ... */})

The chaining complicates things as the AST structure is a little tricky, lets look at another simplified-XML style version (with emojis for visually clarity of the hierarchy):

<body>
    <🏂CallExpression-directive3>
        <🏂MemberExpression-directive3>
            <🚒CallExpression-directive2>
                <🚒MemberExpression-directive2>
                    <🧚CallExpression-directive1>
                        <🧚MemberExpression-directive1>
                        <🧚arguments-for-directive1 />
                    <🧚/CallExpression-directive1>
                <🚒/MemberExpression-directive2>
                <🚒arguments-for-directive2 />
            <🚒/CallExpression-directive2>
        <🏂/MemberExpression-directive3>
        <🏂arguments-for-directive3/>
    <🏂/CallExpression-directive3>
</body>

First thing we might notice is the reversed order as it starts with number 3; however, that's of little concern. Note that the `CallExpression` is where we'd match the component declaration, but it's within the `arguments` that we'd match the return statement and where the template path is defined. The challenge is that the `MemberExpression` is before it's sibling node "arguments", and the `MemberExpression` is what contains the next `CallExpression`. The result is that we'll hit all of the `CallExpression`s first before we hit any of the `arguments`, and what's more when we do get the `arguments`, we'll get them in the opposite order. Again 100% explicit the order we'll enter these nodes is (skipping the `MemberExpression`s:

CallExpression-directive3, CallExpression-directive2,
CallExpression-directive1, arguments-for-directive1,
arguments-for-directive2, arguments-for-directive3, 

Once we know this, solving this problem becomes simple, we can keep track of what CallExpressions we've entered with a `stack` approach, and pop and push appropriately.

const isCoolChainedDirectiveDeclaration = node => {
  /*Here well match the CallExpressions that are "directives"*/
};
const getDirectiveName = node => {
  /* returns the component/directive name i.e. "coolDirective1" */
};
const isDirectiveReturnStatement = node => {
  /*Here well match the the return statements containing the template path*/
};
const getTemplatePath = node => {
  /* extracts the template path from the return node */
};

const coolDirectiveStack = [];

const onEnter = node => {
  if (isCoolChainedDirectiveDeclaration(node)) {
    const directiveName = getDirectiveName(node);
    coolDirectiveStack.unshift(directiveName);
  }
  if (
    coolDirectiveStack.length 
    && isDirectiveReturnStatement(node)
  ) {
    const templatePath = getTemplatePath(node);
    const directiveName = coolDirectiveStack[0];
    console.log(
      `Yew!🤘The directive ${directiveName}'s 
      template is at: ${templatePath}`
    );
  }
};

const onLeave = node => {
  if (isCoolChainedDirectiveDeclaration(node)) {
    coolDirectiveStack.shift();
  }
};

The approach is very similar to how we matched our coolFuntion/Return, except we're using a stack instead of a flag to keep track of how nested we are in the tree. Note as a personal preference I use .unshift and .shift, Rather than .push and .pop as I find `stack[0]` to be an elegant method for accessing the last thing on the stack, rather than `stack[stack.length - 1]`.

None of these examples will likely be of direct use to you, though the techniques of matching to multiple nodes within the broader pattern you want to match, while keeping track of where you are with the AST should be applicable for you matching advanced and deeply nested patterns within your code.

Did you know you can get Electronic-Mail sent to your inter-web inbox, for FREE !?!?
You should stay up-to-date with my work by signing up below.