A C Compiler Writing Journey (Part 2): The Parser

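This is the second part of our compiler series, in which we move on to syntax analysis and semantic analysis. In this part we will write a simple parser.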

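The parser has two main jobs:

Recognize the syntax elements of the input, build an AST (Abstract Syntax Tree) from them, and make sure the input conforms to the grammar rules
Interpret the AST and compute the value of the expression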

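Before we start writing code, please read up on the two most important concepts in this part.

Abstract syntax trees (AST): https://blog.csdn.net/lockhou/article/details/109700312

Backus-Naur Form (BNF): https://www.bilibili.com/video/BV1Us411h72K?from=search&seid=2377033397008241337

The elements we need to recognize are the four basic arithmetic operators +, -, *, / and decimal integers, five syntax elements in total. So let us first define a grammar for the language our parser will recognize. Here we describe it in BNF: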

expression: number

          | expression '*' expression

          | expression '/' expression

          | expression '+' expression

          | expression '-' expression

          ;

number:  T_INTLIT

         ;

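A BNF grammar is defined recursively, so we also need a recursive function to parse the input expression. In every expression that can be built from our current syntax elements, the first element is always a number; anything else is a syntax error. The number is followed either by an operator or by nothing at all, in which case the expression is a single number. We can therefore express our recursive descent parsing function in the following pseudocode: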

function expression() {

  Scan and check the first token is a number. Error if it's not

  Get the next token

  If we have reached the end of the input, return, i.e. base case

  Otherwise, call expression()

}

Let us simulate one run of this function on the input 2 + 3 - 5 T_EOF, where T_EOF is the token that marks the end of the input.

expression0:

  Scan in the 2, it's a number

  Get next token, +, which isn't T_EOF

  Call expression()

    expression1:

      Scan in the 3, it's a number

      Get next token, -, which isn't T_EOF

      Call expression()

        expression2:

          Scan in the 5, it's a number

          Get next token, T_EOF, so return from expression2

      return from expression1

  return from expression0

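To do semantic analysis, we need code that either interprets the recognized input or translates it into another format, such as assembly code. In this part of the journey we will build an interpreter for the input. To get there, we first have to convert the input into an abstract syntax tree.

The node structure of the abstract syntax tree is defined as follows: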

// defs.h

// AST node types

enum {

  A_ADD, A_SUBTRACT, A_MULTIPLY, A_DIVIDE, A_INTLIT

};

// Abstract Syntax Tree structure

struct ASTnode {

  int op;                               // "Operation" to be performed on this tree

  struct ASTnode *left;                 // Left and right child trees

  struct ASTnode *right;

  int intvalue;                         // For A_INTLIT, the integer value

};

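The op field gives the node's type. When op is an operator such as A_ADD or A_SUBTRACT, the node has left and right sub-trees, and we apply the operator that op stands for to the values of the two sub-trees. When op is A_INTLIT, the node represents an integer value and is a leaf; its intvalue field holds that integer.

The code in tree.c builds ASTs. The function mkastnode() creates a node and returns a pointer to it: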

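// tree.c

#include <stdio.h>                      // fprintf()

#include <stdlib.h>                     // malloc(), exit()

#include "defs.h"                       // struct ASTnode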

// Build and return a generic AST node

struct ASTnode *mkastnode(int op, struct ASTnode *left,

                          struct ASTnode *right, int intvalue) {

  struct ASTnode *n;

  // Malloc a new ASTnode

  n = (struct ASTnode *) malloc(sizeof(struct ASTnode));

  if (n == NULL) {

    fprintf(stderr, "Unable to malloc in mkastnode()\n");

    exit(1);

  }

  // Copy in the field values and return it

  n->op = op;

  n->left = left;

  n->right = right;

  n->intvalue = intvalue;

  return (n);

}

We wrap it in two more convenient functions, one that creates a leaf node and one that creates a unary node with only a left child:

// Make an AST leaf node

struct ASTnode *mkastleaf(int op, int intvalue) {

  return (mkastnode(op, NULL, NULL, intvalue));

}

// Make a unary AST node: only one child

struct ASTnode *mkastunary(int op, struct ASTnode *left, int intvalue) {

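  return (mkastnode(op, left, NULL, intvalue));

}

We will use an AST to store each expression we recognize, so that we can later traverse it recursively to compute the expression's final value. We do want to deal with the precedence of the arithmetic operators. Here is an example. Consider the expression 2 * 3 + 4 * 5. Multiplication has higher precedence than addition, so we want to bind the multiplication operands together and perform those operations before the addition.

If we generated an AST for it, the tree would look like this: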

          +

         / \

        /   \

       /     \

      *       *

     / \     / \

    2   3   4   5

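Then, when traversing the tree, we would perform 2 * 3 first, then 4 * 5. Once we have those results, we can pass them up to the root of the tree to perform the addition.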

Before we start parsing, we need a function that converts a scanned token into an AST node operation value:

// expr.c

// Convert a token into an AST operation.

int arithop(int tok) {

  switch (tok) {

    case T_PLUS:

      return (A_ADD);

    case T_MINUS:

      return (A_SUBTRACT);

    case T_STAR:

      return (A_MULTIPLY);

    case T_SLASH:

      return (A_DIVIDE);

    default:

      fprintf(stderr, "unknown token in arithop() on line %d\n", Line);

      exit(1);

  }

}

We also need a function that checks that the next token is an integer literal and builds an AST node to hold the literal value:

// Parse a primary factor and return an

// AST node representing it.

static struct ASTnode *primary(void) {

  struct ASTnode *n;

  // For an INTLIT token, make a leaf AST node for it

  // and scan in the next token. Otherwise, a syntax error

  // for any other token type.

  switch (Token.token) {

    case T_INTLIT:

      n = mkastleaf(A_INTLIT, Token.intvalue);

      scan(&Token);

      return (n);

    default:

      fprintf(stderr, "syntax error on line %d\n", Line);

      exit(1);

  }

}

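Token here is a global variable that holds the most recently scanned token.

We can now write the function that parses the input expression and builds the AST: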

// Return an AST tree whose root is a binary operator

struct ASTnode *binexpr(void) {

  struct ASTnode *n, *left, *right;

  int nodetype;

  // Get the integer literal on the left.

  // Fetch the next token at the same time.

  left = primary();

  // If no tokens left, return just the left node

  if (Token.token == T_EOF)

    return (left);

  // Convert the token into a node type

  nodetype = arithop(Token.token);

  // Get the next token in

  scan(&Token);

  // Recursively get the right-hand tree

  right = binexpr();

  // Now build a tree with both sub-trees

  n = mkastnode(nodetype, left, right, 0);

  return (n);

}

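This is only a very simple parser, and the tree it builds does not take operator precedence into account. Because binexpr() recurses on everything to the right of the first operator, the input 2 * 3 + 4 * 5 is parsed as 2 * (3 + (4 * 5)):

     *

    / \

   2   +

      / \

     3   *

        / \

       4   5

The correct tree structure should look like this: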

          +

         / \

        /   \

       /     \

      *       *

     / \     / \

    2   3   4   5

We will generate the correct AST in the next part.

Next, let us try to write code that interprets this AST recursively. Taking the correct syntax tree as our example, the pseudocode is:

interpretTree:

  First, interpret the left-hand sub-tree and get its value

  Then, interpret the right-hand sub-tree and get its value

  Perform the operation in the node at the root of our tree

  on the two sub-tree values, and return this value

The sequence of calls can be shown as follows:

interpretTree0(tree with +):

  Call interpretTree1(left tree with *):

     Call interpretTree2(tree with 2):

       No maths operation, just return 2

     Call interpretTree3(tree with 3):

       No maths operation, just return 3

     Perform 2 * 3, return 6

  Call interpretTree1(right tree with *):

     Call interpretTree2(tree with 4):

       No maths operation, just return 4

     Call interpretTree3(tree with 5):

       No maths operation, just return 5

     Perform 4 * 5, return 20

  Perform 6 + 20, return 26

Here is the function in interp.c, written from the pseudocode above:

// Given an AST, interpret the

// operators in it and return

// a final value.

int interpretAST(struct ASTnode *n) {

  int leftval = 0, rightval = 0;        // Initialized in case a child is missing

  // Get the left and right sub-tree values

  if (n->left)

    leftval = interpretAST(n->left);

  if (n->right)

    rightval = interpretAST(n->right);

  switch (n->op) {

    case A_ADD:

      return (leftval + rightval);

    case A_SUBTRACT:

      return (leftval - rightval);

    case A_MULTIPLY:

      return (leftval * rightval);

    case A_DIVIDE:

      return (leftval / rightval);

    case A_INTLIT:

      return (n->intvalue);

    default:

      fprintf(stderr, "Unknown AST operator %d\n", n->op);

      exit(1);

  }

}

There is some other code as well, such as the call to the interpreter in main():

  scan(&Token);                 // Get the first token from the input

  n = binexpr();                // Parse the expression in the file

  printf("%d\n", interpretAST(n));      // Calculate the final result

  exit(0);

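That concludes this part. In the next part we will modify the parser so that it performs semantic analysis on the expressions and computes the correct result.

GitHub repository for this article: https://github.com/Shaw9379/acwj/tree/master/02_Parser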
