[NAME]
ALL.dao.grammar.lexical

[TITLE]
Lexical Structures

[DESCRIPTION]

Dao programs are written as text encoded in UTF-8, a multi-byte Unicode encoding that is
compatible with ASCII (American Standard Code for Information Interchange). Some character
conversions are performed on the program source code during lexical translation.

 0.1  Character Conversion from DBC to SBC 

Double Byte Characters (DBC) in the Unicode range 0xff00-0xff5f (exclusive) are converted
to Single Byte Characters (SBC). This conversion occurs only outside of comments and
string literals. It allows operators that are not encoded in ASCII to be interpreted
properly.
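
The conversion above can be sketched in Python as follows. This is only an illustration,
not the Dao lexer itself: the function name dbc_to_sbc is hypothetical, the offset 0xfee0
is the one used in the Comments section for the DBC forms of # and {, and the sketch
converts every character it is given, so skipping comments and string literals is left to
the caller.

```python
DBC_OFFSET = 0xFEE0  # distance between a DBC code point and its ASCII counterpart


def dbc_to_sbc(text: str) -> str:
    """Map code points strictly between 0xff00 and 0xff5f to their ASCII forms."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF00 < cp < 0xFF5F:
            out.append(chr(cp - DBC_OFFSET))
        else:
            # Everything else, including the excluded end points, is kept as-is.
            out.append(ch)
    return "".join(out)
```

For example, dbc_to_sbc applied to the three DBC characters 0xff41, 0xff0b, 0xff42 yields
the ASCII text a+b.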

 0.2  Comments 

Dao mainly uses the sharp mark # to mark comments. 
  *  Single line comment: from # to the end of the line;
  *  Multiple line comment: enclosed between #{ and #}; 
Here # (0x23) and { (0x7b) may also appear in their DBC versions, namely 0x23+0xfee0 and
0x7b+0xfee0. Multiple line comments may contain other #{ and #} marks as long as they are
properly paired; that is, multiple line comments can be nested.
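
The nesting rule can be illustrated with a small depth counter. The Python sketch below is
a simplified assumption of how such a scanner might look: the name strip_comments is
hypothetical, only the ASCII forms of # and { are handled, and string literals are not
recognized, so a # inside a string would wrongly be treated as a comment.

```python
def strip_comments(src: str) -> str:
    """Remove single line and (nested) multiple line comments from src."""
    out = []
    i, depth, n = 0, 0, len(src)
    while i < n:
        if src.startswith("#{", i):
            depth += 1          # entering one more level of #{ ... #}
            i += 2
        elif depth > 0 and src.startswith("#}", i):
            depth -= 1          # leaving one level
            i += 2
        elif depth > 0:
            i += 1              # inside a multiple line comment: drop the character
        elif src[i] == "#":
            # Single line comment: skip up to, but not including, the newline.
            while i < n and src[i] != "\n":
                i += 1
        else:
            out.append(src[i])
            i += 1
    return "".join(out)
```

With this sketch, the inner #{ y #} in "a #{ x #{ y #} z #} b" does not end the outer
comment; only the final #} does.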

 0.3  Quotation Marks 

The basic quotation marks that enclose string literals are the single quotation mark 0x27
and the double quotation mark 0x22. Some other marks are also interpreted as quotation
marks; they are listed below together with their pairing and conversion rules:

  *  Single quotation mark: 0x27;
     Paired with another single quotation mark;
     No conversion;
  *  Double quotation mark: 0x22;
     Paired with another double quotation mark;
     No conversion;
  *  DBC single quotation mark: 0x27+0xfee0;
     Paired with another DBC single quotation mark;
     Conversion to single quotation mark 0x27; 
  *  DBC double quotation mark: 0x22+0xfee0;
     Paired with another DBC double quotation mark;
     Conversion to double quotation mark 0x22; 
  *  Left single quotation mark: 0x2018;
     Paired with right single quotation mark 0x2019;
     Conversion to single quotation mark 0x27; 
  *  Left double quotation mark: 0x201c;
     Paired with right double quotation mark 0x201d;
     Conversion to double quotation mark 0x22; 
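
The pairing and conversion rules listed above fit naturally into a lookup table. The
following Python sketch is an assumption about one possible encoding of that table; the
names QUOTE_PAIRS and normalize_quotes are hypothetical, and the body of the literal is
copied unchanged (no escaping is added or checked).

```python
# opening mark -> (required closing mark, ASCII mark it is converted to)
QUOTE_PAIRS = {
    "\u0027": ("\u0027", "'"),   # single quotation mark, no conversion
    "\u0022": ("\u0022", '"'),   # double quotation mark, no conversion
    "\uff07": ("\uff07", "'"),   # DBC single quotation mark, 0x27+0xfee0
    "\uff02": ("\uff02", '"'),   # DBC double quotation mark, 0x22+0xfee0
    "\u2018": ("\u2019", "'"),   # left/right single quotation marks
    "\u201c": ("\u201d", '"'),   # left/right double quotation marks
}


def normalize_quotes(literal: str) -> str:
    """Check the quote pairing of a literal and convert its marks to ASCII."""
    if len(literal) < 2 or literal[0] not in QUOTE_PAIRS:
        raise ValueError("not a quoted literal")
    closer, ascii_mark = QUOTE_PAIRS[literal[0]]
    if literal[-1] != closer:
        raise ValueError("mismatched quotation marks")
    return ascii_mark + literal[1:-1] + ascii_mark
```

A literal opened with 0x2018 must therefore be closed with 0x2019, and both marks end up
as 0x27.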


 0.4  Keywords 

The following keywords are reserved for the language:
  *  Types:
          
        1  TypeKeyword ::= type | any | int | float | complex | long | string
        2                 | enum | array | list | map | tuple | cdata
          
     
  *  Structures:
          
        1  StructKeyword ::= interface | class | routine | operator | syntax
          
     
  *  Storage/scoping:
          
        1  StorageKeyword ::= const | global | static | var
          
     
  *  Permissions:
          
        1  PermKeyword ::= private | protected | public
          
     
  *  Built-in constants/variables:
          
        1  ConstVarKeyword ::= none | self
          
     
  *  Control statements:
          
        1  ControlKeyword ::= defer | if | else | for | while | do 
        2                    | switch | case | default | break | skip
          
     
  *  Other statements:
          
        1  OtherStmtKeyword ::= type | use | load | as | return | yield
          
     
  *  Operators:
          
        1  OperatorKeyword ::= and | or | not | in
          
     
  *  Miscellaneous:
          
        1  MiscKeyword ::=
          


     
   1  Keyword ::= TypeKeyword | StructKeyword | StorageKeyword 
   2            | PermKeyword | ConstVarKeyword | ControlKeyword
   3            | OtherStmtKeyword | OperatorKeyword | MiscKeyword
     

 0.5  Basic Character Class Definitions 

Basic character classes:
     
   1  DecDigit ::= '0' ... '9'
   2  HexDigit ::= DecDigit | 'a' ... 'f' | 'A' ... 'F'
   3  AsciiLetter ::= 'a' ... 'z' | 'A' ... 'Z'
   4  
   5  WideChar ::= "UTF-8 encoded unit of one or more bytes"
   6  WideAlpha ::= WideChar & iswalpha( WideChar ) != 0
   7  WideAlnum ::= WideChar & iswalnum( WideChar ) != 0
     
Here iswalpha() and iswalnum() are the C99 functions that test whether a wide character
belongs to a certain class. A WideChar can span more than one byte; in that case, its
UTF-8 bytes are converted to a Unicode code point before being passed to the C99 test
functions.

 0.6  Identifiers 

     
   1  AsciiIdentifier ::= ( AsciiLetter | '_' ) ( AsciiLetter | DecDigit | '_' )*
   2  WideIdentifier ::= ( WideAlpha | '_' ) ( WideAlnum | '_' )*
   3  
   4  Identifier ::= AsciiIdentifier | WideIdentifier
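
As a rough illustration of the Identifier rule, the Python sketch below uses str.isalpha()
and str.isalnum() as stand-ins for iswalpha() and iswalnum(). This substitution is an
assumption: the C functions depend on the current locale, so the two pairs may disagree on
some characters. The name is_identifier is hypothetical, and reserved keywords are not
excluded here.

```python
def is_identifier(text: str) -> bool:
    """Approximate check of Identifier ::= AsciiIdentifier | WideIdentifier."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    # First character: a letter (ASCII or wide) or an underscore.
    if not (first.isalpha() or first == "_"):
        return False
    # Remaining characters: letters, digits or underscores.
    return all(ch.isalnum() or ch == "_" for ch in rest)
```

Under these assumptions, _tmp1 and a name written in CJK characters are both accepted,
while 1abc and a-b are rejected.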
     


 0.7  Literals 

 0.8  Number Literals 
Integer literals:
     
   1  DecInteger ::= DecDigit +
   2  HexInteger ::= ( '0x' | '0X' ) HexDigit +
   3  
   4  Integer ::= DecInteger | HexInteger
   5  
   6  LongBase ::= '2' ... '16'
   7  LongInteger ::= Integer 'L' [ LongBase ]
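
The integer rules translate directly into regular expressions. The Python sketch below is
one possible transcription, with hypothetical names (INTEGER_RE, LONG_RE,
classify_integer); it reads LongBase as a decimal number from 2 to 16 and does not check
that the digits are valid for the given base.

```python
import re

# Integer ::= DecInteger | HexInteger
INTEGER = r"(?:0[xX][0-9a-fA-F]+|[0-9]+)"
INTEGER_RE = re.compile(INTEGER)
# LongInteger ::= Integer 'L' [ LongBase ], with LongBase read as 2..16
LONG_RE = re.compile(INTEGER + r"L(?:1[0-6]|[2-9])?")


def classify_integer(token: str):
    """Return the name of the matching rule, or None if neither rule matches."""
    if INTEGER_RE.fullmatch(token):
        return "Integer"
    if LONG_RE.fullmatch(token):
        return "LongInteger"
    return None
```

For instance, 0x1F is classified as Integer and 1011L2 as LongInteger, while 99L17 is
rejected because 17 is outside the assumed base range.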
     


Floating-point number literals:
     
   1  DotDec ::= DecDigit * '.' DecDigit +
   2  DecDot ::= DecDigit + '.' DecDigit *
   3  DecFraction ::= DotDec | DecDot
   4  DecNumber ::= DecInteger | DecFraction
   5  SciNumber ::= DecNumber ( 'e' | 'E' ) [ '+' | '-' ] DecInteger
   6  
   7  Float  ::= ( DecInteger | DecNumber | SciNumber ) 'F'
   8  Double ::= ( DecInteger | DecNumber | SciNumber ) [ 'D' ]
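
The Float and Double rules can also be approximated with regular expressions. In the
Python sketch below, the mantissa is read as DecInteger | DotDec | DecDot, which is an
interpretation of the DecNumber rules above, and all names (FLOAT_RE, DOUBLE_RE,
classify_float) are hypothetical. Because the D suffix is optional, a bare decimal integer
also matches Double; how the real lexer resolves that overlap with Integer is not covered
here.

```python
import re

# DotDec | DecDot | DecInteger
MANTISSA = r"(?:[0-9]*\.[0-9]+|[0-9]+\.[0-9]*|[0-9]+)"
# Optional exponent part, as in SciNumber
NUMBER = MANTISSA + r"(?:[eE][+-]?[0-9]+)?"

FLOAT_RE = re.compile(NUMBER + "F")    # Float requires the 'F' suffix
DOUBLE_RE = re.compile(NUMBER + "D?")  # Double has an optional 'D' suffix


def classify_float(token: str):
    """Return 'Float', 'Double', or None for a candidate number token."""
    if FLOAT_RE.fullmatch(token):
        return "Float"
    if DOUBLE_RE.fullmatch(token):
        return "Double"
    return None
```

Under this reading, 1.5F and 1e10F are Float, .5 and 2e-3D are Double, and a lone dot is
not a number.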
     

Complex number, imaginary part literal:
     
   1  ComplexImaginary ::= [ Float ] 'C'
     

Symbol literal:
     
   1  Symbol ::= '$' Identifier
     

Type holder literal:
     
   1  TypeHolder ::= '@' Identifier
     


 0.9  String Literal 

Basic string literal:
     
   1  SingleQuoteString ::= ' ' ' ValidCharSequence ' ' '
   2  DoubleQuoteString ::= ' " ' ValidCharSequence ' " '
     

String literal with DBC quotation marks:
     
   1  DBCSingleQuoteString ::= ' ＇ ' ValidCharSequence ' ＇ '
   2  DBCDoubleQuoteString ::= ' ＂ ' ValidCharSequence ' ＂ '
     

String literal with Unicode single and double quotation marks:
     
   1  USingleQuoteString ::= ' ‘ ' ValidCharSequence ' ’ '
   2  UDoubleQuoteString ::= ' “ ' ValidCharSequence ' ” '
     

Verbatim string literal:
     
   1  VerbatimMBString ::= '@[' [Delimiter] ']' Characters '@[' [Delimiter] ']'
   2  VerbatimWCString ::= '@@[' [Delimiter] ']' Characters '@@[' [Delimiter] ']'
     
Where Delimiter can contain letters, digits, underscores, blank spaces, dots, colons, 
dashes and assignment marks. It must be unique such that '@[' [Delimiter] ']' or '@@[' 
[Delimiter] ']' does not appear in the string content.
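
A verbatim string can be recognized by reading the opening marker and then searching for
the same marker again. The Python sketch below is a simplified assumption of that process:
the names OPEN_RE and parse_verbatim are hypothetical, the delimiter character set is
taken from the description above with "assignment marks" read as '=', and the input is
expected to start at the opening marker.

```python
import re

# '@[' or '@@[' followed by an optional delimiter and ']'
OPEN_RE = re.compile(r"(@@?)\[([A-Za-z0-9_ .:=-]*)\]")


def parse_verbatim(src: str):
    """Return (rule name, content) for a verbatim string at the start of src."""
    m = OPEN_RE.match(src)
    if m is None:
        return None
    marker = m.group(0)                 # e.g. '@[eof]' or '@@[]'
    end = src.find(marker, m.end())     # the closing marker repeats the opening one
    if end < 0:
        return None
    rule = "VerbatimWCString" if m.group(1) == "@@" else "VerbatimMBString"
    return rule, src[m.end():end]
```

Since the content is taken verbatim, quotation marks and backslashes inside it need no
escaping; only the marker itself must not occur in the content.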

Here a ValidCharSequence is a sequence of characters in which the enclosing quotation
marks may appear only as escaped characters. So the following are valid string literals:
     
   1  ' " '
   2  ' “ '  # the enclosing mark is ', so " and “ can appear without problem
   3  " ' "
   4  " ” "
   5  ' '  # the same for other quotations
   6  ' \' '
   7  " \" "
     

String literal:
     
   1  MultiByteString ::= SingleQuoteString | DBCSingleQuoteString
   2                    | USingleQuoteString | VerbatimMBString
   3  
   4  WideCharString ::= DoubleQuoteString | DBCDoubleQuoteString
   5                   | UDoubleQuoteString | VerbatimWCString
   6  
   7  String ::= MultiByteString + | WideCharString +
     
Here the repetition marks mean that two or more MultiByteString or WideCharString literals
can be placed one after another, and they will be joined into a single string literal
during lexical translation.

 0.10  Escape Sequences in String Literal 

Escape characters:
  *  \\: backslash;
  *  \t: horizontal tab;
  *  \f: form feed; (not implemented)
  *  \n: line feed;
  *  \r: carriage return;
  *  \': single quotation mark;
  *  \": double quotation mark; 

Escape digits (not implemented):
  *  \ooo: character with octal value ooo;
  *  \xhh: character with hex value hh;
  *  \uxxxx: Unicode character with hex value xxxx;
  *  \uxxxxxxxx: Unicode character with hex value xxxxxxxx; 
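
The escape characters listed as implemented can be decoded with a simple table lookup.
The Python sketch below is only an illustration with hypothetical names (ESCAPES,
unescape); it omits \f and the escape digits because they are marked as not implemented,
and leaving unknown escapes untouched is an assumption, not documented Dao behavior.

```python
# Character after the backslash -> character it stands for
ESCAPES = {
    "\\": "\\",
    "t": "\t",
    "n": "\n",
    "r": "\r",
    "'": "'",
    '"': '"',
}


def unescape(body: str) -> str:
    """Decode the implemented escape characters in the body of a string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] in ESCAPES:
            out.append(ESCAPES[body[i + 1]])
            i += 2
        else:
            # Ordinary character, or a backslash that starts no known escape.
            out.append(ch)
            i += 1
    return "".join(out)
```

Note that a doubled backslash is consumed as one unit, so the n that follows it is kept as
a plain letter rather than being turned into a line feed.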


 0.11  Operators 


  *  Left unary operators:
          
        1  LeftUnaryOperator ::= '++' | '--' | '!' | '~' | '%' | 'not'
          

  *  Right unary operators:
          
        1  RightUnaryOperator ::=
          

  *  Binary operators:
          
        1  BinArith ::= '+' | '-' | '*' | '/' | '%' | '**'
        2  BinComp  ::= '==' | '!=' | '<' | '>' | '<=' | '>='
        3  BinBool  ::= '&&' | '||' | 'and' | 'or'
        4  BinBit   ::= '&' | '|' | '^' | '<<' | '>>'
        5  BinMisc  ::= 'in' | 'not in' | '?=' | '?<'
        6  
        7  BinaryOperator ::= BinArith | BinComp | BinBool | BinBit | BinMisc
          
     
  *  Composite assignment operators:
          
        1  AssignmentOperator ::= '+=' | '-=' | '*=' | '/=' | '&=' | '|='
          
     
  *  Other operators:
          
        1  OtherOperator ::= '->' | '=>' | ':' | '.' | '...'
          
     
     
   1  UnaryOperator ::= LeftUnaryOperator | RightUnaryOperator
   2  
   3  Operator ::= UnaryOperator | BinaryOperator 
   4             | AssignmentOperator | OtherOperator
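
Several operators above share a prefix, for example '*' and '**', or '.' and '...'. The
text does not say how the lexer chooses between them; the Python sketch below assumes the
common longest-match convention, which may differ from the actual Dao implementation. The
names SYMBOL_OPERATORS and match_operator are hypothetical, and the word operators (and,
or, not, in) are left out because they are lexed as keywords.

```python
# Symbolic operators collected from the grammar rules above
SYMBOL_OPERATORS = [
    "++", "--", "!", "~", "%",
    "+", "-", "*", "/", "**",
    "==", "!=", "<", ">", "<=", ">=",
    "&&", "||", "&", "|", "^", "<<", ">>",
    "?=", "?<",
    "+=", "-=", "*=", "/=", "&=", "|=",
    "->", "=>", ":", ".", "...",
]

# Try longer operators first so that '**' wins over '*', '...' over '.', etc.
BY_LENGTH = sorted(set(SYMBOL_OPERATORS), key=len, reverse=True)


def match_operator(src: str, pos: int = 0):
    """Return the longest listed operator starting at src[pos], or None."""
    for op in BY_LENGTH:
        if src.startswith(op, pos):
            return op
    return None
```

With this assumption, the input ...x yields '...' rather than three separate '.' tokens.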
     

 0.12  Miscellaneous 

 0.13  Semicolon 

As in some other languages, a semicolon can be used to mark the end of a statement.
However, the semicolon is optional: the compiler is able to determine the end of a
statement based on some semantic rules.