Talk About Network

Google


Register and Login
Nick
Password
Register create new account Sign up is FREE and you can post replies, new topics, bookmark posts and more!
Recover lost password


Programming > Compilers > Re: Leftmost lo...
Latest [ Topics | Posts ] Archive Post A New Topic Post a Reply
<< Topic < Post Post 5 of 8 Topic 2416 of 2474
Post > Topic >>

Re: Leftmost longest match with DFA search

by "Russ Cox" <rsc@[EMAIL PROTECTED] > May 12, 2008 at 08:09 AM

Stefan Monnier wrote:
> Can someone point me to articles that discuss various ways to get the
> leftmost longest match when implementing regexp search using a DFA?
>
> The "obvious" solution ...

Actually, the "obvious" solution is to implement unanchored search
using repeated anchored searches.  Running an anchored DFA search and
remembering the last position that produces a matching state will give
you the longest match at that point in the string.  You can create an
unanchored search by trying an anchored search at position 0, then
position 1, and so on.  If you stop the DFA when it gets to a dead
state, then if you are lucky most of the searches will process very
little of the string, and the whole thing will run in something like
linear time in many cases.  In the worst case, it is quadratic in the
length of the text just to find one match.

This method--implementing unanchored search via repeated anchored
search--is what Bell Labs awk, GNU libc (and gawk), and PCRE all do when
using DFAs.  (In fact the backtracking implementations all do this too,
and the DFAs picked up the bad habit from them.)

> ... of turning the problem "search for RE" into the
> problem "match .*RE" (where I use "match" here to mean "anchored
> search") only gives you the leftmost shortest match.

Turning RE into .*RE gives you an unanchored search, but the particular
match it finds depends on other implementation details, specifically
whether you stop and return the position of the first matching state or
continue on and return the last matching state (as above).

If you are searching for RE and you stop at the first matching state,
you do get the shortest match, but what you've really found is the
earliest rightmost endpoint among all (anchored) matches.  Since this
is an anchored search, that's the shortest match.

If you are searching for .*RE and you stop at the first matching state,
you get the earliest rightmost endpoint among all (unanchored) matches.
You don't know where the (RE part of the) match begins, and it is not
guaranteed to be a leftmost match.  For example, if you are searching
for a*bb|b in abb, you will stop at the position between the two b's,
marking the right endpoint of the "b" match, but the leftmost match is
the entire string "abb".

What you actually find this way is the right endpoint of the leftmost
shortest non-overlapping match.  To find the left endpoint, you can,
adapting Daniel Villeneuve's suggestion, match the reverse expression
ER starting at that right endpoint and stop at the earliest match.
You can repeat this operation to find a set containing all the shortest
non-overlapping matches.

Tradtionally, when you ask for all matches for a RE in a text, you get
a set of leftmost-longest non-overlapping matches.  Clarke and Cormack
make a case for using the shortest non-overlapping matches instead in
their paper ``On the use of Regular Expressions for Searching Text''
(TOPLAS 1995).  http://www.cs.uwaterloo.ca/research/tr/1995/07/regexp.pdf

Russ
 




 8 Posts in Topic:
Leftmost longest match with DFA search
Stefan Monnier <monnie  2008-05-10 04:36:57 
Re: Leftmost longest match with DFA search
Daniel Villeneuve <dan  2008-05-11 11:42:22 
Re: Leftmost longest match with DFA search
Stefan Monnier <monnie  2008-05-13 04:51:39 
Re: Leftmost longest match with DFA search
Danny.Dube@[EMAIL PROTECT  2008-05-15 11:58:45 
Re: Leftmost longest match with DFA search
"Russ Cox" <  2008-05-12 08:09:04 
Re: Leftmost longest match with DFA search
Stefan Monnier <monnie  2008-05-13 05:02:56 
Re: Leftmost longest match with DFA search
Danny.Dube@[EMAIL PROTECT  2008-05-13 13:58:56 
Re: Leftmost longest match with DFA search
"Russ Cox" <  2008-05-14 16:30:00 

Post A Reply:
  Go here to Signup

AddThis Feed Button


About - Advertising - Contact - Frequently Asked Questions - Privacy Policy - Terms of Use - Signup

Contact
tan12V112 Sat Jul 26 1:11:14 CDT 2008.