String tokenizer is the simplest parser of all. In java, arguably, it
is much more frequently used than regular expressions. Yet I fail to
see any parsing theory book ever mentioning it.
In my practical experience it is easer write a scanner on string
tokenizer foundation rather than to use off the shellf reg exp engine.
I'm interested however what is the language theory perspective onto
this phenomenon.
Formally, a set of terminals is partitioned into a separator or a set
of separators, and the rest of terminals. Then, string tokenizer
translates a given word into a set (or list) of words. Here we have
the first technical difficulty, what exactly this translation is? Is
it a mapping from Monoid to Idempotent Semiring? Certainly not if we
insist on the result being a list of words, and not the set. Then if
we try exrtend this mapping naturally to the domain of sets, then we
have to deal with set of sets (or set of lists) on the range side of
the mapping?
Consider an example, an alphabet of terminals {a,b,c} with the "a"
being a separator. Can you suggest a grammar that be able to tokenize
the word "babcac" into {b,bc,c}? Sure something like
s >= a
t >= b
t >= c
w >= wt
w >= 1
u >= ws
v >= 1
v >= uv
would work, but does this 8(!) rule grammar present any new insight to
what string tokenizor is? Besides, assuming this grammar produces an
unambigious parse tree, there is still an extra step of extracting the
set of words from the tree.


|