Parser 3


The first stage in parsing will be "tokenizing" (technically known as lexical analysis), that is, splitting the text of the expression into individual tokens. For example, the expression:

  Result = 3.14 * cos(Phase / 2 - 45)

might be broken down into the tokens:

  Identifier (Result)
  Equals
  Number (3.14)
  Multiply
  Function (cos)
  Left Bracket
  Identifier (Phase)
  Divide
  Number (2)
  Subtract
  Number (45)
  Right Bracket

Token Types

The following token types will be used for our expression parser:

enum token_t 
{ 
	TOKEN_UNKNOWN, 
	TOKEN_NUMBER, 
	TOKEN_FUNCTION, 
	TOKEN_ADD, 
	TOKEN_SUBTRACT, 
	TOKEN_MULTIPLY,
	TOKEN_DIVIDE,
	TOKEN_EXPONENT,
	TOKEN_IDENTIFIER,
	TOKEN_LEFTBRACKET,
	TOKEN_RIGHTBRACKET,
	TOKEN_EQUAL,
	TOKEN_END,
	TOKEN_MODULUS,
	TOKEN_NEGATE,
	TOKEN_COMMA
};

Note that some of these tokens could be defined differently. For example, TOKEN_ADD and TOKEN_SUBTRACT could be combined into a single TOKEN_ADDOP token, with the individual add/subtract operators distinguished during the later compiling phases. For our simple compiler the difference is minor, and it is easier to separate the operators sooner rather than later.
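
For illustration only, here is a brief hedged sketch of that alternative (it is not what this parser does): a combined TOKEN_ADDOP whose token record keeps the operator character, so the later compile step can still tell add from subtract. The names below are assumptions rather than code from this series.

/*
 * Sketch of the TOKEN_ADDOP alternative described above (not used here).
 * The token record carries the operator character so the compiler can
 * distinguish '+' from '-' later.
 */
enum alttoken_t
{
	ALTTOKEN_NUMBER,
	ALTTOKEN_ADDOP,		/* '+' or '-' */
	ALTTOKEN_MULTOP		/* '*' or '/' */
	/* ... */
};
 
struct alttoken_entry_t
{
	alttoken_t	Type;
	char		OpChar;		/* Which operator an ADDOP/MULTOP actually is */
};

The compiling phase would then switch on OpChar to select the add or subtract operation.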

Character Classes

In order to turn a raw string into tokens we will define various character classes according to our expression grammar. For our simple parser each character can be assigned a single class. For more complex grammars we may have to consider assigning each character multiple classes (i.e., bit fields instead of an enumeration).

enum charclass_t
{
	CC_UNKNOWN = 0,		/* Anything not covered below */
	CC_DIGIT,		/* Digits: 0-9 */
	CC_ALPHA,		/* Alpha: a-zA-Z */
	CC_UNDERSCORE,		/* Underscore: _ */
	CC_WHITESPACE,		/* Whitespace: space \t\n\v\f\r */
	CC_ADDOP,		/* Add Operators: +- */
	CC_MULTOP,		/* Multiply Operators: * / */
	CC_EQUAL,		/* Equal: = */
	CC_DECIMAL,		/* Decimal: . */
	CC_RIGHTBRACKET,	/* Right Bracket: ) */
	CC_LEFTBRACKET,		/* Left Bracket: ( */
	CC_END,			/* End: \0 */
	CC_NEGATE,		/* Negate: ~ */
	CC_EXPONENT,		/* Exponent: ^ */
	CC_MODULUS,		/* Modulus: % */
	CC_COMMA		/* Comma: , */
};
 
/*
 * Local definitions to make the character class array smaller
 */
#define CC_UK CC_UNKNOWN
#define CC_DI CC_DIGIT
#define CC_AL CC_ALPHA
#define CC_US CC_UNDERSCORE
#define CC_WS CC_WHITESPACE
#define CC_AD CC_ADDOP
#define CC_ML CC_MULTOP
#define CC_EQ CC_EQUAL
#define CC_DC CC_DECIMAL
#define CC_RB CC_RIGHTBRACKET
#define CC_LB CC_LEFTBRACKET
#define CC_EN CC_END
#define CC_NG CC_NEGATE
#define CC_EX CC_EXPONENT
#define CC_MD CC_MODULUS
#define CC_CM CC_COMMA         
 
/*
 * Holds the character class for all 8-bit characters
 */
charclass_t g_CharClasses[256] = 
{
	CC_EN, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_WS, CC_WS, CC_WS, CC_WS, CC_WS, CC_UK, CC_UK, //0
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //16
	CC_WS, CC_UK, CC_UK, CC_UK, CC_UK, CC_MD, CC_UK, CC_UK, CC_LB, CC_RB, CC_ML, CC_AD, CC_CM, CC_AD, CC_DC, CC_ML, //32
	CC_DI, CC_DI, CC_DI, CC_DI, CC_DI, CC_DI, CC_DI, CC_DI, CC_DI, CC_DI, CC_UK, CC_UK, CC_UK, CC_EQ, CC_UK, CC_UK, //48
	CC_UK, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, //64
	CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_UK, CC_UK, CC_UK, CC_EX, CC_US, //80
	CC_UK, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, //96
	CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_AL, CC_UK, CC_UK, CC_UK, CC_NG, CC_UK, //112
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //128
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //144
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //160
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //176
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //192
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //208
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, //224
	CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK, CC_UK	//240
};
 
/*
 * Converts a character to its character class.
 */
charclass_t GetCharClass (const char Char)
{
	return (g_CharClasses[(unsigned char) Char]);
}

The GetCharClass function above simply converts a character into its assigned character class. Because our language grammar is so simple, we can also define a function that converts a character directly into its associated token type, where one applies:

/*
 * Converts a character to its associated token (if any). Used during
 * the tokenizing process.
 */
token_t ConvertCharClassToToken (const char Char)
{
	switch (GetCharClass(Char))
	{
	case CC_ADDOP:			
		if (Char == '+') return (TOKEN_ADD); 
		if (Char == '-') return (TOKEN_SUBTRACT); 
		return (TOKEN_UNKNOWN);
	case CC_MULTOP:			
		if (Char == '*') return (TOKEN_MULTIPLY); 
		if (Char == '/') return (TOKEN_DIVIDE); 
		return (TOKEN_UNKNOWN); 
 
	case CC_EQUAL:		return (TOKEN_EQUAL);
	case CC_RIGHTBRACKET:	return (TOKEN_RIGHTBRACKET);
	case CC_LEFTBRACKET:	return (TOKEN_LEFTBRACKET);
	case CC_END:		return (TOKEN_END);
	case CC_NEGATE:		return (TOKEN_NEGATE);
	case CC_EXPONENT:	return (TOKEN_EXPONENT);
	case CC_MODULUS:	return (TOKEN_MODULUS);
	case CC_COMMA:		return (TOKEN_COMMA);
 
	case CC_DIGIT:
	case CC_ALPHA:
	case CC_UNDERSCORE:
	case CC_DECIMAL:
	case CC_WHITESPACE:
	case CC_UNKNOWN:
	default:		return (TOKEN_UNKNOWN); 
	}
 
}

This function is not strictly required, but it greatly simplifies the tokenizing code.
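
As a quick sanity check, the two helpers can be exercised directly. This is just an illustrative sketch; the test function and its checks are not part of the original code:

#include <cassert>
 
/*
 * Spot checks for the character class table and the token conversion helper.
 */
void TestCharClassHelpers (void)
{
	assert(GetCharClass('7') == CC_DIGIT);
	assert(GetCharClass('q') == CC_ALPHA);
	assert(GetCharClass('_') == CC_UNDERSCORE);
	assert(GetCharClass(' ') == CC_WHITESPACE);
 
	/* Single-character operators map straight to tokens */
	assert(ConvertCharClassToToken('*') == TOKEN_MULTIPLY);
	assert(ConvertCharClassToToken('-') == TOKEN_SUBTRACT);
	assert(ConvertCharClassToToken('(') == TOKEN_LEFTBRACKET);
 
	/* Digits and letters need the longer tokenizing routines below */
	assert(ConvertCharClassToToken('7') == TOKEN_UNKNOWN);
}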

Token Formats

  • Numbers -- Since all numbers are floating point, the familiar C/C++ floating-point number format will be used:
     [0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?
  
  For example:
     12345
     1.456
     971832.13945
     1e10
     1e+10
     1e-10
     1.9028E+19
We did not add support for NaN or INF values, although we could if we wanted to (it would just make our number tokenizing function a little more complex). Also note that a number token does not include a leading sign, since signs are parsed separately as operators, nor does the tokenizer try to verify that the number is representable (it could underflow or overflow the storage format used). A small sketch that checks a string against this format is shown after this list.
  • Operators -- All of our operators are single characters, so we can simply tokenize them one at a time as we find them. In more complex grammars with multi-character operators, the usual rule is to construct the longest possible operator token.
  • Identifiers -- Identifiers in our grammar cover both variable and function names. Identifiers must start with a letter, followed by any number of letters, digits, or underscores, and are case-insensitive. To keep things simple we will also make the rule that a variable cannot have the same name as a function (again, this is not required; it just makes things simpler for us).
  • White Space -- We will consider all spaces, tabs, carriage returns, and line feeds to be white space, which will be ignored.
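
As a quick illustration of the number format above (and not part of the parser itself), the pattern can be checked against a string with std::regex. The function name is an assumption used only for this sketch:

#include <regex>
#include <string>
 
/*
 * Returns true if Text matches the number format described above.
 * Illustrative only; the real tokenizer scans the characters directly.
 */
bool MatchesNumberFormat (const std::string& Text)
{
	static const std::regex NumberPattern("[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?");
	return (std::regex_match(Text, NumberPattern));
}

For example, MatchesNumberFormat("1.9028E+19") returns true, while MatchesNumberFormat("1.") and MatchesNumberFormat("-45") return false (the sign is a separate token).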


Tokenizing

The tokenizing process takes the expression string as input and outputs a list of tokens. The tokenizing code is shown in basic pseudo-code below.
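
The pseudo-code relies on an AddToken helper and a token list whose exact form is not shown in this article. As a rough idea of what they might look like, here is a minimal hedged sketch (the names and fields are assumptions, not the series' actual classes):

#include <string>
#include <vector>
 
/* One parsed token: its type plus the source text it was built from */
struct token_entry_t
{
	token_t		Type;
	std::string	Text;
};
 
/* The tokenizer's output list */
std::vector<token_entry_t> g_Tokens;
 
/* Appends a token (and optionally its text) to the output list */
void AddToken (const token_t Type, const std::string& Text = "")
{
	token_entry_t Entry;
	Entry.Type = Type;
	Entry.Text = Text;
	g_Tokens.push_back(Entry);
}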

	bool TokenizeIdentifier (string InputString)
	{
		CurrentPos = InputString[0]
 
		while (!CurrentPos.EndOfString())
		{
			CharType = GetCharClass(CurrentPos);
			if (CharType != CC_ALPHA && CharType != CC_DIGIT && CharType != CC_UNDERSCORE) break
			CurrentPos.NextChar
		}
 
		Identifier = InputString.Left(CurrentPos)
		Function = FindFunction(Identifier)  // Once functions are implemented (they are not yet; see the note at the end)
 
		if (Function)
			AddToken(TOKEN_FUNCTION, Identifier)
		else
			AddToken(TOKEN_IDENTIFIER, Identifier)
		endif
 
		return true
	}
 
	bool TokenizeNumber (InputString)
	{
		CurrentPos = InputString[0]
 
			// Parse the first digit group, left of the decimal point/exponent
		while (!CurrentPos.EndOfString)
		{
			switch (CurrentPos)
			{
			case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9':
				break
			case 'e': case 'E':
				goto TokenizeNumber_ParseExponent
			case '.':
				goto TokenizeNumber_ParseSecondNumber
			default:
				goto TokenizeNumber_End
			}
 
			CurrentPos.NextChar
		}
 
		goto TokenizeNumber_End
 
	TokenizeNumber_ParseSecondNumber:
 
		CurrentPos.NextChar
 
			// Parse the numbers after the decimal point
		while (!CurrentPos.EndOfString)
		{
			switch (CurrentPos)
			{
			case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9':
				break
			case 'e': case 'E':
				goto TokenizeNumber_ParseExponent
			default:
				goto TokenizeNumber_End
			}
 
			CurrentPos.NextChar
		}
 
		goto TokenizeNumber_End
 
	TokenizeNumber_ParseExponent:
 
		CurrentPos.NextChar
 
			// Parse the optional sign before the exponent
		switch (CurrentPos)
		{
		case '+': case '-':
			CurrentPos.NextChar
			break
		}
 
			// Parse the exponent digits
		while (!CurrentPos.EndOfString)
		{
			switch (CurrentPos)
			{
			case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9':
				break
			default:
				goto TokenizeNumber_End
			}
 
			CurrentPos.NextChar
		}
 
	TokenizeNumber_End:
 
		AddToken(TOKEN_NUMBER)
		return true
	}
 
	bool Tokenize (string Statement)
	{
		CurrentPos = Statement[0]
		EndParse = false
 
		while (!EndParse)
		{
			CharType = GetCharClass(CurrentPos)
			if (CurrentPos.EndOfString()) EndParse = true
 
			switch (CharType)
			{
			case CC_UNKNOWN:
				return false
			case CC_DIGIT:
			case CC_DECIMAL:
				if (!TokenizeNumber(CurrentPos)) return false
				break
			case CC_ALPHA:
				if (!TokenizeIdentifier(CurrentPos)) return false
				break
			case CC_END:
				AddToken(TOKEN_END)
				EndParse = true
				break
			case CC_WHITESPACE:
				CurrentPos.NextChar
				break
			default:
				TokenType = ConvertCharClassToToken(CurrentPos)
 
				if (TokenType == TOKEN_UNKNOWN)
					return false
				else
					AddToken(TokenType)
					CurrentPos.NextChar
				endif
 
				break
			} // End of switch
 
		} // End of while loop
 
		if (CharType != CC_END)	AddToken(TOKEN_END);
		return true
	}

While that may look complicated, all we're doing is pulling numbers, operators, and identifiers (function and variable names) out of the string. We are not yet determining whether the expression is valid. At this stage the only thing that can cause a failure is encountering an unknown character. The tokenizing code is essentially just a very basic regular-expression/finite state machine implemented with switches and gotos.
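
For comparison, here is a compact, standalone hedged sketch of the same scanning logic in real C++. It uses <cctype> classification and explicit loops instead of the character class table and gotos, and the names (SimpleTokenize, simple_token_t) are assumptions rather than the series' actual implementation; only ConvertCharClassToToken from earlier is reused for the single-character operators:

#include <cctype>
#include <string>
#include <vector>
 
/* Condensed sketch only: a token type plus the text it was built from */
struct simple_token_t
{
	token_t		Type;
	std::string	Text;
};
 
/*
 * Tokenizes Statement into Tokens; returns false on an unknown character.
 * Uses <cctype> instead of the character class table purely to keep the
 * sketch self-contained.
 */
bool SimpleTokenize (const std::string& Statement, std::vector<simple_token_t>& Tokens)
{
	size_t Pos = 0;
 
	while (Pos < Statement.size())
	{
		const unsigned char c = (unsigned char) Statement[Pos];
 
		if (std::isspace(c))				/* Skip white space */
		{
			++Pos;
		}
		else if (std::isdigit(c) || c == '.')		/* Number */
		{
			size_t Start = Pos;
			while (Pos < Statement.size() && std::isdigit((unsigned char) Statement[Pos])) ++Pos;
			if (Pos < Statement.size() && Statement[Pos] == '.')	/* Fractional part */
			{
				++Pos;
				while (Pos < Statement.size() && std::isdigit((unsigned char) Statement[Pos])) ++Pos;
			}
			if (Pos < Statement.size() && (Statement[Pos] == 'e' || Statement[Pos] == 'E'))	/* Exponent */
			{
				++Pos;
				if (Pos < Statement.size() && (Statement[Pos] == '+' || Statement[Pos] == '-')) ++Pos;
				while (Pos < Statement.size() && std::isdigit((unsigned char) Statement[Pos])) ++Pos;
			}
			Tokens.push_back({ TOKEN_NUMBER, Statement.substr(Start, Pos - Start) });
		}
		else if (std::isalpha(c))			/* Identifier (or function name) */
		{
			size_t Start = Pos;
			while (Pos < Statement.size() &&
			       (std::isalnum((unsigned char) Statement[Pos]) || Statement[Pos] == '_')) ++Pos;
			Tokens.push_back({ TOKEN_IDENTIFIER, Statement.substr(Start, Pos - Start) });
		}
		else						/* Single-character operators */
		{
			token_t Type = ConvertCharClassToToken((char) c);
			if (Type == TOKEN_UNKNOWN) return false;
			Tokens.push_back({ Type, std::string(1, (char) c) });
			++Pos;
		}
	}
 
	Tokens.push_back({ TOKEN_END, "" });
	return true;
}

Running SimpleTokenize on "Result = 7.1*cos(Phase - 45)" produces the same token sequence as the pseudo-code above (with cos still reported as an identifier).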

We are also not particularly worried about tokenizer performance, since the expression strings will be relatively short (easily within 100 characters) and only need to be tokenized/compiled once.


Testing the Tokenizer

Testing the tokenizer with some sample expressions is a simple matter:

#include <tchar.h>		/* _tmain, _TCHAR */
#include <conio.h>		/* _getch */
/* ...plus the header that declares parser::CExpressionParser3a */
 
int _tmain(int argc, _TCHAR* argv[])
{
	parser::CExpressionParser3a Parser;
 
	Parser.SetStatement("Result = 7.1*cos(Phase - 45)");
	Parser.Tokenize();
	Parser.PrintTokens();
 
	_getch();
	return 0;
}

which outputs:

       Token(Identifier) = 'Result'
       Token(Equal) = '='
       Token(Number) = '7.1'
       Token(Multiply) = '*'
       Token(Identifier) = 'cos'
       Token(Left Bracket) = '('
       Token(Identifier) = 'Phase'
       Token(Subtract) = '-'
       Token(Number) = '45'
       Token(Right Bracket) = ')'
       Token(EndofExpression) = 

Note that we haven't implemented functions yet, so the cos function is recorded as a plain identifier.
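
When functions are added, the FindFunction call in TokenizeIdentifier only has to decide whether an identifier names a known function. A minimal hedged sketch of such a lookup (the function list and the names here are assumptions, not the series' eventual design):

#include <cctype>
#include <string>
 
/* An assumed list of the function names the parser will recognize */
static const char* g_FunctionNames[] = { "cos", "sin", "tan", "sqrt", "abs" };
 
/* Case-insensitive comparison, since identifiers are case-insensitive */
static bool EqualNoCase (const std::string& A, const char* B)
{
	size_t i = 0;
	for (; i < A.size() && B[i] != '\0'; ++i)
	{
		if (std::tolower((unsigned char) A[i]) != std::tolower((unsigned char) B[i])) return false;
	}
	return (i == A.size() && B[i] == '\0');
}
 
/* Returns true if Identifier names a known function */
bool FindFunction (const std::string& Identifier)
{
	for (size_t i = 0; i < sizeof(g_FunctionNames) / sizeof(g_FunctionNames[0]); ++i)
	{
		if (EqualNoCase(Identifier, g_FunctionNames[i])) return true;
	}
	return false;
}

With a lookup like this in place, TokenizeIdentifier would emit TOKEN_FUNCTION for cos instead of TOKEN_IDENTIFIER.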


Next: Parser_4
