[NAME]
ALL.dao.grammar.lexical

[TITLE]
Lexical Structures

[DESCRIPTION]

Dao programs are written as text encoded in UTF-8, an ASCII-compatible (American Standard Code for Information Interchange) multi-byte Unicode encoding. Some character conversions are performed on the program source code during lexical translation.

0.1 Character Conversion from DBC to SBC

Double Byte Characters (DBC) in the Unicode range 0xff00-0xff5f (exclusive) are converted to Single Byte Characters (SBC). As the code points given in the following sections show, a DBC code point is the corresponding SBC code point plus 0xfee0. The conversion occurs only outside of comments and string literals. Its purpose is to allow proper interpretation of operators that are not encoded in ASCII.

0.2 Comments

Dao mainly uses the sharp mark # to mark comments.

* Single-line comment: from # to the end of the line;
* Multi-line comment: enclosed between #{ and #};

Here # (0x23) and { (0x7b) may also appear in their DBC versions, namely 0x23+0xfee0 and 0x7b+0xfee0. Multi-line comments may contain other #{ and #} as long as they are properly paired; that is, multi-line comments may be nested.

0.3 Quotation Marks

The basic quotation marks used to enclose string literals are the single quotation mark 0x27 and the double quotation mark 0x22. Some other marks are also interpreted as quotation marks. They are listed below, each with its code point, the mark it is paired with, and the conversion applied to it:

* Single quotation mark: 0x27; paired with another single quotation mark; no conversion;
* Double quotation mark: 0x22; paired with another double quotation mark; no conversion;
* DBC single quotation mark: 0x27+0xfee0; paired with another DBC single quotation mark; converted to single quotation mark 0x27;
* DBC double quotation mark: 0x22+0xfee0; paired with another DBC double quotation mark; converted to double quotation mark 0x22;
* Left single quotation mark: 0x2018; paired with right single quotation mark 0x2019; converted to single quotation mark 0x27;
* Left double quotation mark: 0x201c; paired with right double quotation mark 0x201d; converted to double quotation mark 0x22;

0.4 Keywords

The following keywords are reserved for the language:

* Types:

    TypeKeyword ::= type | any | int | float | complex | long | string
                  | enum | array | list | map | tuple | cdata

* Structures:

    StructKeyword ::= interface | class | routine | operator | syntax

* Storage/scoping:

    StorageKeyword ::= const | global | static | var

* Permissions:

    PermKeyword ::= private | protected | public

* Built-in constants/variables:

    ConstVarKeyword ::= none | self

* Control statements:

    ControlKeyword ::= defer | if | else | for | while | do
                     | switch | case | default | break | skip

* Other statements:

    OtherStmtKeyword ::= type | use | load | as | return | yield

* Operators:

    OperatorKeyword ::= and | or | not | in

* Miscellaneous:

    MiscKeyword ::=

    Keyword ::= TypeKeyword | StructKeyword | StorageKeyword
              | PermKeyword | ConstVarKeyword | ControlKeyword
              | OtherStmtKeyword | OperatorKeyword | MiscKeyword

0.5 Basic Character Class Definitions

Basic character classes:

    DecDigit    ::= '0' ... '9'
    HexDigit    ::= DecDigit | 'a' ... 'f' | 'A' ... 'F'
    AsciiLetter ::= 'a' ... 'z' | 'A' ... 'Z'

    WideChar  ::= "UTF-8 encoded unit of one or more bytes"
    WideAlpha ::= WideChar & iswalpha( WideChar ) != 0
    WideAlnum ::= WideChar & iswalnum( WideChar ) != 0

Here iswalpha() and iswalnum() are the C99 functions that test whether a wide character belongs to a certain class. A WideChar can span more than one byte; in that case, its UTF-8 bytes are converted to a Unicode code point before being passed to the C99 test functions.

0.6 Identifiers

    AsciiIdentifier ::= ( AsciiLetter | '_' ) ( AsciiLetter | DecDigit | '_' )*
    WideIdentifier  ::= ( WideAlpha | '_' ) ( WideAlnum | '_' )*

    Identifier ::= AsciiIdentifier | WideIdentifier

0.7 Literals

0.8 Number Literals

Integer literals:

    DecInteger ::= DecDigit +
    HexInteger ::= ( '0x' | '0X' ) HexDigit +

    Integer ::= DecInteger | HexInteger

    LongBase    ::= '2' ... '16'
    LongInteger ::= Integer 'L' [ LongBase ]

Floating-point number literals:

    DotDec    ::= DecDigit * '.' DecDigit +
    DecDot    ::= DecDigit + '.' DecDigit *
    DecNumber ::= DecInteger | DotDec | DecDot
    SciNumber ::= DecNumber ( 'e' | 'E' ) [ '+' | '-' ] DecInteger

    Float  ::= ( DecInteger | DecNumber | SciNumber ) 'F'
    Double ::= ( DecInteger | DecNumber | SciNumber ) [ 'D' ]

Complex number, imaginary part literal:

    ComplexImaginary ::= [ Float ] 'C'

Symbol literal:

    Symbol ::= '$' Identifier

Type holder literal:

    TypeHolder ::= '@' Identifier

0.9 String Literal

Basic string literal:

    SingleQuoteString ::= ' ' ' ValidCharSequence ' ' '
    DoubleQuoteString ::= ' " ' ValidCharSequence ' " '

String literal with DBC quotation marks:

    DBCSingleQuoteString ::= ' ＇ ' ValidCharSequence ' ＇ '
    DBCDoubleQuoteString ::= ' ＂ ' ValidCharSequence ' ＂ '

String literal with Unicode single and double quotation marks:

    USingleQuoteString ::= ' ‘ ' ValidCharSequence ' ’ '
    UDoubleQuoteString ::= ' “ ' ValidCharSequence ' ” '

Verbatim string literal:

    VerbatimMBString ::= '@[' [Delimiter] ']' Characters '@[' [Delimiter] ']'
    VerbatimWCString ::= '@@[' [Delimiter] ']' Characters '@@[' [Delimiter] ']'

Here Delimiter can contain letters, digits, underscores, blank spaces, dots, colons, dashes and assignment marks. It must be unique, such that '@[' [Delimiter] ']' or '@@[' [Delimiter] ']' does not appear in the string content.

A ValidCharSequence is a sequence of characters in which the enclosing quotation marks may appear only as escaped characters. So the following are valid string literals:

    ' " '
    ' “ '    # the enclosing mark is ', so " and “ can appear without problem
    " ' "
    " ” "
    “ ' ' ”  # the same holds for the other quotation marks
    ' \' '
    " \" "

String literal:

    MultiByteString ::= SingleQuoteString | DBCSingleQuoteString
                      | USingleQuoteString | VerbatimMBString

    WideCharString ::= DoubleQuoteString | DBCDoubleQuoteString
                     | UDoubleQuoteString | VerbatimWCString

    String ::= MultiByteString + | WideCharString +

Here the repetition marks mean that two or more MultiByteString or WideCharString literals can be placed one after another, and they will be joined into a single string literal during lexical translation.

0.10 Escape Sequences in String Literal

Escape characters:

* \\: backslash;
* \t: horizontal tab;
* \f: form feed; (not implemented)
* \n: line feed;
* \r: carriage return;
* \': single quotation mark;
* \": double quotation mark;

Escape digits (not implemented):

* \ooo: character with octal value ooo;
* \xhh: character with hex value hh;
* \uxxxx: Unicode character with hex value xxxx;
* \uxxxxxxxx: Unicode character with hex value xxxxxxxx;

0.11 Operators

* Left unary operators:

    LeftUnaryOperator ::= '++' | '--' | '!' | '~' | '%' | 'not'

* Right unary operators:

    RightUnaryOperator ::=

* Binary operators:

    BinArith ::= '+' | '-' | '*' | '/' | '%' | '**'
    BinComp  ::= '==' | '!=' | '<' | '>' | '<=' | '>='
    BinBool  ::= '&&' | '||' | 'and' | 'or'
    BinBit   ::= '&' | '|' | '^' | '<<' | '>>'
    BinMisc  ::= 'in' | 'not in' | '?=' | '?<'

    BinaryOperator ::= BinArith | BinComp | BinBool | BinBit | BinMisc

* Composite assignment operators:

    AssignmentOperator ::= '+=' | '-=' | '*=' | '/=' | '&=' | '|='

* Other operators:

    OtherOperator ::= '->' | '=>' | ':' | '.' | '...'

    UnaryOperator ::= LeftUnaryOperator | RightUnaryOperator

    Operator ::= UnaryOperator | BinaryOperator
               | AssignmentOperator | OtherOperator

0.12 Miscellaneous

0.13 Semicolon

As in some other languages, a semicolon can be used to mark the end of a statement. However, the use of semicolons is optional: the compiler is able to determine the end of a statement based on some semantic rules.
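
As an illustration of the DBC-to-SBC conversion described in Section 0.1, the following Python sketch maps each code point strictly between 0xff00 and 0xff5f to its single-byte counterpart by subtracting 0xfee0, the offset implied by the code points quoted in Sections 0.2 and 0.3. It is not taken from the Dao lexer: the names DBC_OFFSET, dbc_to_sbc and convert_text are hypothetical, and the sketch deliberately ignores the rule that comments and string literals are left unconverted.

```python
# Hypothetical sketch of the DBC-to-SBC mapping; not Dao's actual lexer code.
# Assumption: the offset 0xfee0 is inferred from "0x23+0xfee0" in the text.
DBC_OFFSET = 0xFEE0


def dbc_to_sbc(ch: str) -> str:
    """Map one double-byte (fullwidth) character to its ASCII counterpart.

    Only code points strictly between 0xff00 and 0xff5f are converted,
    matching the exclusive range stated in Section 0.1.
    """
    code = ord(ch)
    if 0xFF00 < code < 0xFF5F:
        return chr(code - DBC_OFFSET)
    return ch


def convert_text(text: str) -> str:
    """Apply dbc_to_sbc to every character of text.

    Simplification: a real lexer would skip comments and string literals.
    """
    return "".join(dbc_to_sbc(ch) for ch in text)
```

For instance, convert_text("a\uff0bb") yields "a+b", since 0xff0b - 0xfee0 = 0x2b, while the boundary code points 0xff00 and 0xff5f are returned unchanged.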