
Re: Less greedy pattern match

by pk <pk@[EMAIL PROTECTED] > Apr 18, 2008 at 07:05 PM

On Friday 18 April 2008 16:27, Ed Morton wrote:

> I'm actually pretty impressed with the elegance of that solution (thanks
> Rajan!) and it's easily generalised to:
> 
> gawk -v RS='(begin|end)' -v ORS= 'NR%2'
> 
> where "begin" is the pattern that starts the segment you want to remove
> (<t1> in this case) and "end" is the pattern that terminates it (</t1>).
> 
> The only thing gawk-specific (or non-POSIX at least) about it now is the
> use of REs in the RS, but I'd like to see if there's a POSIX version
> anyway. This is close:
> 
> awk -F'</?t1>' -v RS= '{for (i=1;i<=NF;i+=2) printf "%s",$i; print ""}'
> 
> but would add a trailing blank line at the end of the file, and won't
> properly handle blank lines between the "<t1>" and "</t1>". This would
> solve those problems:
> 
> awk -F'</?t1>' 'BEGIN{RS=SUBSEP}{for (i=1;i<=NF;i+=2) printf "%s",$i; print ""}'
> 
> but means the whole file has to be read in as one record which may not be
> very efficient for large files. Any ideas?

Well, my take is as follows. The basic idea is to read the file with the
default RS (one line per record) and FS='(begin|end)', so that only one line
has to be held in memory at a time. Then, we have to pay attention to these
patterns:

........begin...\n
.....end........

When a line that ends with "begin...\n" is found, remove the even-numbered
fields (including the last field, which ends the line) and note that we are
in the middle of something that must be removed. When a line that begins
with "...end" is encountered while in that state, leave the "in the middle"
condition and remove the odd-numbered fields instead (including the
"...end" one). If such a line has an odd number of fields, we are again in
the "in the middle" condition for the next line; otherwise not. Along with
all this, there are some checks to avoid printing newlines when they are
not needed.
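
To make the field counting concrete, here is a small sketch that prints NF
and the fields awk sees for three sample lines, using FS='</?t1>'. The
sample lines and the function name show_fields are made up for illustration
and are not from the thread.

```shell
# show_fields: hypothetical helper, only for illustration.
# Prints NF and each field for three made-up lines split on <t1> or </t1>.
show_fields() {
    printf '%s\n' 'keep1<t1>drop1</t1>keep2<t1>drop2' 'drop3' 'drop4</t1>keep3' |
    awk -F'</?t1>' '{
        printf "NF=%d:", NF
        for (i = 1; i <= NF; i++) printf " [%s]", $i
        print ""
    }'
}

show_fields
# NF=4: [keep1] [drop1] [keep2] [drop2]
# NF=1: [drop3]
# NF=2: [drop4] [keep3]
```

The first line starts outside a block and has an even NF, so it ends inside
one; the second line lies entirely inside the block; the third starts inside
and has an even NF, so it ends outside again.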

This is the script (m represents the "in the middle" flag):

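# m==0: the line starts outside a begin...end block; keep odd-numbered fields
m==0 {for(i=1;i<=NF;i+=2) {printf "%s",$i};
      # even NF: the line ends inside a block, so set m and hold the newline
      if (NF && !(NF%2)) {m=1} else {print""}
      next }

# m==1: the line starts inside a block; keep even-numbered fields
m==1 {for(i=2;i<=NF;i+=2) {printf "%s",$i};
      # odd NF: the line also ends inside a block
      if (NF) {m=(NF%2)?1:0};
      if (NF>1) print "" }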
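
For trying the same logic with other delimiters, one possible sketch is a
shell wrapper that builds FS from two patterns. The function name
strip_blocks and the awk variables b and e are my own assumptions, not
something from the thread, and the patterns are assumed to contain no
backslashes (awk's -v processes escape sequences).

```shell
# strip_blocks BEGIN_RE END_RE [FILE...]   (hypothetical helper name)
# Same per-line logic as the script above, with FS built from two patterns.
strip_blocks() {
    b=$1; e=$2; shift 2
    awk -v b="$b" -v e="$e" '
    BEGIN { FS = "(" b ")|(" e ")" }      # split on either delimiter
    m==0 { for (i=1; i<=NF; i+=2) printf "%s", $i
           if (NF && !(NF%2)) { m=1 } else { print "" }
           next }
    m==1 { for (i=2; i<=NF; i+=2) printf "%s", $i
           if (NF) m = (NF%2) ? 1 : 0
           if (NF>1) print "" }
    ' "$@"
}

# Usage: a block that spans two lines.
printf '%s\n' 'a<t1>x' 'y</t1>b' | strip_blocks '<t1>' '</t1>'
# ab
```

Like the original, this assumes the begin and end patterns alternate
properly in the input.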

I did some tests with various input file formats, and it seems to work, but
there may be (read: most certainly are) bugs, and it can probably be
optimized a bit, although that would probably make it less readable (not
that it's very readable as it is, but anyway...).

$ cat file1
<tag>
<section>
<t1>
foo
</t1>
<t2>blah</t2>
<t1>bar<subt1>bah
</subt1>
</t1></section>
<section>
<t1>baz</t1>
<t2>blah</t2>
<t1>baz</t1>
</section></tag>

<t1>

foo

</t1>
<t2>bar</t2>



<section>gkjkjk</section>
$
$ awk -F'</?t1>' '
m==0 {for(i=1;i<=NF;i+=2) {printf "%s",$i};
      if (NF && !(NF%2)) {m=1} else {print""}
      next }

m==1 {for(i=2;i<=NF;i+=2) {printf "%s",$i};
      if (NF) {m=(NF%2)?1:0};
      if (NF>1) print "" }
' file1
<tag>
<section>

<t2>blah</t2>
</section>
<section>

<t2>blah</t2>

</section></tag>


<t2>bar</t2>



<section>gkjkjk</section>
$
$ perl -p0e 's%<t1>.+?</t1>%%gs' file1
<tag>
<section>

<t2>blah</t2>
</section>
<section>

<t2>blah</t2>

</section></tag>


<t2>bar</t2>



<section>gkjkjk</section>
$ # I put the script in the file f.awk here
$ diff <(perl -p0e 's%<t1>.+?</t1>%%gs' file1) <(awk -f f.awk file1)
$

I'd like to hear what you guys have to say about this script.
Thanks!

-- 
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.



