[NAME]
ALL.dao.type.string.pattern

[TITLE]
String Pattern Matching

[DESCRIPTION]

0.1 Introduction

Dao has built-in support for string pattern matching based on regular expressions. A regular expression (regex) is a string representing a pattern (a set of rules) from which a set of strings can be constructed. The pattern is used to search a string for substrings that can be constructed from it, namely substrings that match the pattern. A number of operations can be performed on the matched substrings, including extraction, replacement and splitting.

Starting from the latest release, regex functionality is accessed through string methods. The previous Perl-style regex has been removed to simplify the language and its implementation. The syntax of the new regex is similar to that of Lua, but most of the character class names are different.

0.2 Character Class

A character class identifies a set of characters.

* x : an ordinary character represents itself; this excludes the magic characters ^$|()%.[]*+-?{}<>;
* . : a dot represents any character;
* %a : all alphabetic characters;
* %s : all white space characters;
* %k : all control characters;
* %p : all punctuation characters;
* %d : all digits;
* %x : all hexadecimal digits;
* %c : all lower case characters;
* %w : all alphabetic characters, digits and the character _;
* %A : non-alphabetic characters, complement of %a;
* %S : non-white-space characters, complement of %s;
* %K : non-control characters, complement of %k;
* %P : non-punctuation characters, complement of %p;
* %D : non-digits, complement of %d;
* %X : non-hexadecimal digits, complement of %x;
* %C : upper case characters;
* %W : complement of %w;
* %x : represents the character x itself, where x is any non-alphanumeric character; x may also be an alphabetic character if it is not one of the character class symbols or b or B;
* [set] : represents the union of all characters in set; a range of characters from x up to y can be included in set as x-y; the character classes above can also be included in set;
* [^set] : complement of [set];

0.3 Pattern Item

A pattern item can be

* a single character class;
* ^ : match at the beginning of a string;
* $ : match at the end of a string;
* %n : match the n-th captured substring; n can be one or more digits;
* %bxy : match a balanced pair of characters x and y; here balanced means that, starting from the same matched position, the matched substring contains the same number of x and y, and that this number is the minimum possible; this is the same as in Lua;
* %B{pattern1}{pattern2} : match a balanced pair of patterns pattern1 and pattern2; here balanced has the same meaning as in %bxy;

A pattern item e can be optionally skipped or matched repeatedly, as indicated by:

* e? : match zero times or once;
* e* : match zero or more times;
* e+ : match once or more;
* e{n} : match exactly n times;
* e{n,} : match at least n times;
* e{,n} : match at most n times;
* e{n,m} : match at least n times and at most m times;

0.4 Grouping and Captures

In a pattern, one or more pattern items can be grouped together by parentheses to form a sub-pattern (group). Alternative patterns in a group are separated by |, and the group can be optionally skipped if an empty alternative is specified, as in (|pattern) or (pattern|). When a string is matched against a pattern, the substrings that match the groups can be captured for other use. Captures are numbered according to the position of their left parenthesis. For example, in the pattern (%a+)%s*(%d+(%a+)), the first (%a+) has group number 1, (%d+(%a+)) has group number 2, and the second (%a+) has group number 3. For convenience, the whole pattern has group number 0.
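The numbering rule above can be tried out with a small sketch. It is written in Python rather than Dao, because Python's `re` module also numbers groups by their left parenthesis; the `translate` helper and the `DAO_TO_PY` table are hypothetical names introduced only for this illustration, and the class mappings are rough approximations, not an exact equivalence between the two dialects.

```python
import re

# Rough Python equivalents of the Dao classes used in the example pattern.
# This mapping is an assumption made for illustration only.
DAO_TO_PY = {"%a": "[A-Za-z]", "%s": r"\s", "%d": r"\d"}


def translate(dao_pattern):
    """Replace the few Dao character classes listed above with Python ones."""
    out = dao_pattern
    for dao_cls, py_cls in DAO_TO_PY.items():
        out = out.replace(dao_cls, py_cls)
    return out


py_pattern = translate("(%a+)%s*(%d+(%a+))")
m = re.match(py_pattern, "width 42px")

# Group 0 is the whole match; groups 1..3 follow the order of their
# left parentheses in the pattern.
groups = {n: m.group(n) for n in range(4)}
print(groups)  # {0: 'width 42px', 1: 'width', 2: '42px', 3: 'px'}
```

Group 3 is nested inside group 2, so its text `px` is also part of group 2's capture `42px`.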
When there are multiple possible ways of matching a substring starting from the same position, the matching length is calculated as the sum of the lengths of the sub-matches of all groups in the pattern (including group 0), and the match with the maximum matching length is returned as the result. In this way, one can put an extra level of parentheses around a group to give that group higher priority in matching. For example, when 1a2 is matched against the pattern (%d%w*)(%w*%d), there are two possible ways of matching: 1a matches (%d%w*) and 2 matches (%w*%d), or 1 matches (%d%w*) and a2 matches (%w*%d). If an extra pair of parentheses is added around one of the groups, for example (%d%w*)((%w*%d)), the matching becomes unique: the second way is chosen, where the letter a is matched in the last group, because the length of a2 is then counted for both the outer and the inner group.

0.5 String Matching Methods

As in Lua, regular expression matching is accessed through various string methods. Regular expression patterns are stored in strings and passed to these methods. Each pattern string corresponds to an internal representation of a regular expression, which is compiled from the pattern string the first time it is used. Although strings representing the same pattern may be passed to these methods many times, the pattern is compiled only once per process (virtual machine process), so the overhead of compiling a regular expression can normally be ignored.
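The compile-once behaviour described above can be sketched as follows. This is a Python illustration of the general caching idea, not Dao's actual implementation: `compiled` and `find_all` are hypothetical names, and Python's `re` syntax stands in for Dao patterns.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compiled(pattern):
    """Compile `pattern` on first use and return the cached object afterwards."""
    return re.compile(pattern)


def find_all(pattern, text):
    """Search `text` using the cached compiled form of `pattern`."""
    return compiled(pattern).findall(text)


# The same pattern string is passed twice, but it is compiled only once:
# both calls below reuse a single compiled object.
first = find_all(r"\d+", "a1b22")
second = find_all(r"\d+", "x333")
print(first, second)
```

Keying the cache on the pattern string means callers can keep passing plain strings, and the compilation cost is paid only on the first use of each distinct pattern.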
The following methods are provided:

* string.pfind(): see dao.type.string.pattern.pfind;
* string.match(): see dao.type.string.pattern.match;
* string.submatch(): see dao.type.string.pattern.submatch;
* string.extract(): see dao.type.string.pattern.extract;
* string.capture(): see dao.type.string.pattern.capture;
* string.change(): see dao.type.string.pattern.change;

[STRUCTURE]

dao.type.string.pattern--| dao.type.string.pattern: String Pattern Matching (11.1 KB)
  |--pfind-----| dao.type.string.pattern.pfind: string.pfind() (0.6 KB)
  |--match-----| dao.type.string.pattern.match: string.match() (0.3 KB)
  |--submatch--| dao.type.string.pattern.submatch: string.submatch() (0.4 KB)
  |--extract---| dao.type.string.pattern.extract: string.extract() (0.9 KB)
  |--capture---| dao.type.string.pattern.capture: string.capture() (0.3 KB)
  |--change----| dao.type.string.pattern.change: string.change() (0.7 KB)