#*****************************************************************************
#
#   Copyright (C) 2016 and later: Unicode, Inc. and others.
#   License & terms of use: http://www.unicode.org/copyright.html
#
#*****************************************************************************
#*****************************************************************************
#
#   Copyright (C) 2002-2016, International Business Machines Corporation and others.
#   All Rights Reserved.
#
#*****************************************************************************
#
#   file:  rbbirpt.txt
#   ICU Break Iterator Rule Parser State Table
#
#   This state table is used when reading and parsing a set of RBBI rules.
#   The rule parser uses a state machine; the data in this file define the
#   state transitions that occur for each input character.
#
#   *** This file defines the RBBI rule grammar. This is it.
#   *** The determination of what is accepted is here.
#
#   This file is processed by a Perl script "rbbicst.pl" to produce initialized C arrays
#   that are then built with the rule parser.
#
#       perl rbbicst.pl < rbbirpt.txt > rbbirpt.h

#
#   Here is the syntax of the state definitions in this file:
#
#
#StateName:
#   input-char           n next-state            ^push-state        action
#   input-char           n next-state            ^push-state        action
#       |                |   |                      |                 |
#       |                |   |                      |                 |--- action to be performed by state machine
#       |                |   |                      |                      See function RBBIRuleScanner::doParseActions()
#       |                |   |                      |
#       |                |   |                      |--- Push this named state onto the state stack.
#       |                |   |                           Later, when next state is specified as "pop",
#       |                |   |                           the pushed state will become the current state.
#       |                |   |
#       |                |   |--- Transition to this state if the current input character matches the input
#       |                |        character or char class in the left hand column.  "pop" causes the next
#       |                |        state to be popped from the state stack.
#       |                |
#       |                |--- When making the state transition specified on this line, advance to the next
#       |                     character from the input only if 'n' appears here.
#       |
#       |--- Character or named character classes to test for.  If the current character being scanned
#            matches, perform the actions and go to the state specified on this line.
#            The input character is tested sequentially, in the order written.  The characters and
#            character classes tested for do not need to be mutually exclusive.  The first match wins.
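#
#   Worked example (an illustrative sketch, not part of the original table).  The character
#   classes are defined outside this file, so assume here that 'x' is a name_start_char,
#   'a' is a rule_char, and that '$', '=', ';' and the space are not escaped and are not
#   members of rule_char or name_char.  Under those assumptions the assignment
#
#       $x = a;
#
#   would be expected to step through the states defined below as follows
#   ("n" marks rows that advance to the next input character):
#
#       state             input   row matched        n   effect
#       start             '$'     '$'                    go to scan-var-name, push assign-or-rule, doExprStart
#       scan-var-name     '$'     '$'                n   go to scan-var-start, doStartVariableName
#       scan-var-start    'x'     name_start_char    n   go to scan-var-body
#       scan-var-body     ' '     default                pop (to assign-or-rule), doEndVariableName
#       assign-or-rule    ' '     white_space        n   stay in assign-or-rule
#       assign-or-rule    '='     '='                n   go to term, push assign-end, doStartAssign
#       term              ' '     white_space        n   stay in term
#       term              'a'     rule_char          n   go to expr-mod, doRuleChar
#       expr-mod          ';'     default                go to expr-cont
#       expr-cont         ';'     default                pop (to assign-end), doExprFinished
#       assign-end        ';'     ';'                n   go to start, doEndAssign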
#


#
#   start state, scan position is at the beginning of the rules file, or in between two rules.
#
start:
    escaped                term                  ^break-rule-end    doExprStart
    white_space          n start
    '^'                  n start-after-caret     ^break-rule-end    doNoChain
    '$'                    scan-var-name         ^assign-or-rule    doExprStart
    '!'                  n rev-option
    ';'                  n start                                    # ignore empty rules.
    eof                    exit
    default                term                  ^break-rule-end    doExprStart

#
#   break-rule-end:  Returned from doing a break-rule expression.
#
break-rule-end:
    ';'                  n start                                    doEndOfRule
    white_space          n break-rule-end
    default                errorDeath                               doRuleError

#
#   start of a rule, after having seen a '^' (inhibits rule chain in).
#       Similar to the main 'start' state in most respects, except
#          - An empty rule is an error.
#          - A second '^' is an error.
#
start-after-caret:
    escaped                term                                     doExprStart
    white_space          n start-after-caret
    '^'                    errorDeath                               doRuleError    # two '^'s
    '$'                    scan-var-name         ^term-var-ref      doExprStart
    ';'                    errorDeath                               doRuleError    # ^ ;
    eof                    errorDeath                               doRuleError
    default                term                                     doExprStart

#
#   !   We've just scanned a '!', indicating either a !!keyword flag or a
#       !Reverse rule.
#
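#   Worked example (an illustrative sketch, not part of the original table).  Assuming that
#   'c' is a name_start_char, that the letters of "chain" are name_char, and that ';' is not,
#   the option statement
#
#       !!chain;
#
#   would be expected to pass through the states below as follows:
#
#       state           input              row matched        n   effect
#       start           '!'                '!'                n   go to rev-option
#       rev-option      '!'                '!'                n   go to option-scan1
#       option-scan1    'c'                name_start_char    n   go to option-scan2, doOptionStart
#       option-scan2    'h' 'a' 'i' 'n'    name_char          n   stay in option-scan2
#       option-scan2    ';'                default                go to option-scan3, doOptionEnd
#       option-scan3    ';'                ';'                n   go to start
#
#   The table itself accepts any well-formed name after "!!"; whether the name is a
#   recognized option is not decided by these states.  A single '!' followed by anything
#   other than a second '!' takes the default row of rev-option instead, which goes to
#   reverse-rule, pushes break-rule-end and performs doReverseDir.
#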
rev-option:
    '!'                  n option-scan1
    default                reverse-rule          ^break-rule-end    doReverseDir

option-scan1:
    name_start_char      n option-scan2                             doOptionStart
    default                errorDeath                               doRuleError

option-scan2:
    name_char            n option-scan2
    default                option-scan3                             doOptionEnd

option-scan3:
    ';'                  n start
    white_space          n option-scan3
    default                errorDeath                               doRuleError


reverse-rule:
    default                term                  ^break-rule-end    doExprStart


#
#   term.  Eat through a single rule character, or a composite thing, which
#          could be a parenthesized expression, a variable name, or a Unicode Set.
#
term:
    escaped              n expr-mod                                 doRuleChar
    white_space          n term
    rule_char            n expr-mod                                 doRuleChar
    '['                    scan-unicode-set      ^expr-mod
    '('                  n term                  ^expr-mod          doLParen
    '$'                    scan-var-name         ^term-var-ref
    '.'                  n expr-mod                                 doDotAny
    default                errorDeath                               doRuleError


#
#   term-var-ref   We've just finished scanning a reference to a $variable.
#                  Check that the variable was defined.
#                  The variable name scanning is in common with assignment statements,
#                  so the check can't be done there.
term-var-ref:
    default                expr-mod                                 doCheckVarDef


#
#   expr-mod      We've just finished scanning a term, now look for the optional
#                 trailing '*', '?', '+'
#
expr-mod:
    white_space          n expr-mod
    '*'                  n expr-cont                                doUnaryOpStar
    '+'                  n expr-cont                                doUnaryOpPlus
    '?'                  n expr-cont                                doUnaryOpQuestion
    default                expr-cont


#
#   expr-cont      Expression, continuation.  At a point where additional terms are
#                  allowed, but not required.
#
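#   Worked example of the state stack (an illustrative sketch, not part of the original
#   table).  Assuming that 'a' and 'b' are rule_char and that '(', '|', ')', '*' and ';'
#   are not, the rule
#
#       (a|b)*;
#
#   would be expected to use push and pop as follows (stack shown after each step,
#   most recently pushed state last):
#
#       state             input   row matched   n   effect                                      stack
#       start             '('     default           go to term, doExprStart                     break-rule-end
#       term              '('     '('           n   go to term, doLParen                        break-rule-end expr-mod
#       term              'a'     rule_char     n   go to expr-mod, doRuleChar                  break-rule-end expr-mod
#       expr-mod          '|'     default           go to expr-cont                             break-rule-end expr-mod
#       expr-cont         '|'     '|'           n   go to term, doExprOrOperator                break-rule-end expr-mod
#       term              'b'     rule_char     n   go to expr-mod, doRuleChar                  break-rule-end expr-mod
#       expr-mod          ')'     default           go to expr-cont                             break-rule-end expr-mod
#       expr-cont         ')'     ')'           n   pop (to expr-mod), doExprRParen             break-rule-end
#       expr-mod          '*'     '*'           n   go to expr-cont, doUnaryOpStar              break-rule-end
#       expr-cont         ';'     default           pop (to break-rule-end), doExprFinished     (empty)
#       break-rule-end    ';'     ';'           n   go to start, doEndOfRule                    (empty)
#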
expr-cont:
    escaped                term                                     doExprCatOperator
    white_space          n expr-cont
    rule_char              term                                     doExprCatOperator
    '['                    term                                     doExprCatOperator
    '('                    term                                     doExprCatOperator
    '$'                    term                                     doExprCatOperator
    '.'                    term                                     doExprCatOperator
    '/'                    look-ahead                               doExprCatOperator
    '{'                  n tag-open                                 doExprCatOperator
    '|'                  n term                                     doExprOrOperator
    ')'                  n pop                                      doExprRParen
    default                pop                                      doExprFinished


#
#   look-ahead    Scanning a '/', which identifies a break point, assuming that the
#                 remainder of the expression matches.
#
#                 Generate a parse tree as if this was a special kind of input symbol
#                 appearing in an otherwise normal concatenation expression.
#
look-ahead:
    '/'                  n expr-cont-no-slash                       doSlash
    default                errorDeath


#
#   expr-cont-no-slash    Expression, continuation.  At a point where additional terms are
#                         allowed, but not required.  Just like
#                         expr-cont, above, except that no '/'
#                         look-ahead symbol is permitted.
#
expr-cont-no-slash:
    escaped                term                                     doExprCatOperator
    white_space          n expr-cont-no-slash
    rule_char              term                                     doExprCatOperator
    '['                    term                                     doExprCatOperator
    '('                    term                                     doExprCatOperator
    '$'                    term                                     doExprCatOperator
    '.'                    term                                     doExprCatOperator
    '|'                  n term                                     doExprOrOperator
    ')'                  n pop                                      doExprRParen
    default                pop                                      doExprFinished


#
#   tags          scanning a '{', the opening delimiter for a tag that identifies
#                 the kind of match.  Scan the whole {dddd} tag, where d=digit
#
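#   Worked example (an illustrative sketch, not part of the original table).  Assuming that
#   'a' is a rule_char, that '1' and '0' are digit_char, and that '{', '}' and ';' are not
#   rule_char, the tagged rule
#
#       a{100};
#
#   would be expected to pass through the tag states below as follows:
#
#       state               input          row matched   n   effect
#       start               'a'            default           go to term, push break-rule-end, doExprStart
#       term                'a'            rule_char     n   go to expr-mod, doRuleChar
#       expr-mod            '{'            default           go to expr-cont
#       expr-cont           '{'            '{'           n   go to tag-open, doExprCatOperator
#       tag-open            '1'            digit_char        go to tag-value, doStartTagValue
#       tag-value           '1' '0' '0'    digit_char    n   stay in tag-value, doTagDigit (once per digit)
#       tag-value           '}'            '}'               go to tag-close
#       tag-close           '}'            '}'           n   go to expr-cont-no-tag, doTagValue
#       expr-cont-no-tag    ';'            default           pop (to break-rule-end), doExprFinished
#       break-rule-end      ';'            ';'           n   go to start, doEndOfRule
#
#   The first digit is matched twice: tag-open sees it without advancing, so that
#   tag-value can apply doTagDigit to every digit, the first included.
#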
tag-open:
    white_space          n tag-open
    digit_char             tag-value                                doStartTagValue
    default                errorDeath                               doTagExpectedError

tag-value:
    white_space          n tag-close
    '}'                    tag-close
    digit_char           n tag-value                                doTagDigit
    default                errorDeath                               doTagExpectedError

tag-close:
    white_space          n tag-close
    '}'                  n expr-cont-no-tag                         doTagValue
    default                errorDeath                               doTagExpectedError


#
#   expr-cont-no-tag    Expression, continuation.  At a point where additional terms are
#                       allowed, but not required.  Just like
#                       expr-cont, above, except that no "{ddd}"
#                       tagging is permitted.
#
expr-cont-no-tag:
    escaped                term                                     doExprCatOperator
    white_space          n expr-cont-no-tag
    rule_char              term                                     doExprCatOperator
    '['                    term                                     doExprCatOperator
    '('                    term                                     doExprCatOperator
    '$'                    term                                     doExprCatOperator
    '.'                    term                                     doExprCatOperator
    '/'                    look-ahead                               doExprCatOperator
    '|'                  n term                                     doExprOrOperator
    ')'                  n pop                                      doExprRParen
    default                pop                                      doExprFinished


#
#   Variable Name Scanning.
#
#                    The state that branched to here must have pushed a return state
#                    to go to after completion of the variable name scanning.
#
#                    The current input character must be the $ that introduces the name.
#                    The $ is consumed here rather than in the state that first detected it
#                    so that the doStartVariableName action only needs to happen in one
#                    place (here), and the other states don't need to worry about it.
#
scan-var-name:
    '$'                  n scan-var-start                           doStartVariableName
    default                errorDeath


scan-var-start:
    name_start_char      n scan-var-body
    default                errorDeath                               doVariableNameExpectedErr

scan-var-body:
    name_char            n scan-var-body
    default                pop                                      doEndVariableName


#
#   scan-unicode-set   Unicode Sets are parsed by the UnicodeSet class.
#                      Within the RBBI parser, after finding the first character
#                      of a Unicode Set, we just hand the rule input at that
#                      point off to the Unicode Set constructor, then pick
#                      up parsing after the close of the set.
#
#                      The action for this state invokes the UnicodeSet parser.
#
scan-unicode-set:
    '['                  n pop                                      doScanUnicodeSet
    'p'                  n pop                                      doScanUnicodeSet
    'P'                  n pop                                      doScanUnicodeSet
    default                errorDeath


#
#   assign-or-rule.   A $variable was encountered at the start of something, could be
#                     either an assignment statement or a rule, depending on whether an '='
#                     follows the variable name.  We get to this state when the variable name
#                     scanning does a return.
#
assign-or-rule:
    white_space          n assign-or-rule
    '='                  n term                  ^assign-end        doStartAssign     # variable was target of assignment
    default                term-var-ref          ^break-rule-end                      # variable was a term in a rule


#
#   assign-end        This state is entered when the end of the expression on the
#                     right hand side of an assignment is found.  We get here via
#                     a pop; this state is pushed when the '=' in an assignment is found.
#
#                     The only thing allowed at this point is a ';'.  The RHS of an
#                     assignment must look like a rule expression, and we come here
#                     when what is being scanned no longer looks like an expression.
#
assign-end:
    ';'                  n start                                    doEndAssign
    default                errorDeath                               doRuleErrorAssignExpr


#
#   errorDeath.   This state is specified as the next state whenever a syntax error
#                 in the source rules is detected.  Barring bugs, the state machine will never
#                 actually get here, but will stop because of the action associated with the error.
#                 But, just in case, this state asks the state machine to exit.
errorDeath:
    default              n errorDeath                               doExit