On Friday 18 April 2008 16:27, Ed Morton wrote:
> I'm actually pretty impressed with the elegance of that solution (thanks
> Rajan!) and it's easily generalised to:
>
> gawk -v RS='(begin|end)' -v ORS= 'NR%2'
>
> where "begin" is the pattern that starts the segment you want to remove
> (<t1> in this case) and "end" is the pattern that terminates it (</t1>).
>
> The only thing gawk-specific (or non-POSIX at least) about it now is the
> use of REs in the RS, but I'd like to see if there's a POSIX version
> anyway. This is close:
>
> awk -F'</?t1>' -v RS= '{for (i=1;i<=NF;i+=2) printf "%s",$i; print ""}'
>
> but would add a trailing blank line at the end of the file, and won't
> properly handle blank lines between the "<t1>" and "</t1>". This would
> solve those problems:
>
> awk -F'</?t1>' 'BEGIN{RS=SUBSEP}{for (i=1;i<=NF;i+=2) printf "%s",$i;
> print ""}'
>
> but means the whole file has to be read in as one record which may not be
> very efficient for large files. Any ideas?
Well, my take is as follows. The basic idea is to read the file with the
usual RS and FS='(begin|end)'. Then, we have to pay attention to these
patterns:
........begin...\n
.....end........
When a line ends with "begin...\n" (that is, it has an even number of
fields), print the odd-numbered fields, drop the even-numbered ones
(including the last field, which runs to the end of the line), and note that
we are now in the middle of something that must be removed. While in that
state the roles are swapped: field 1 of the next line (the "...end" part)
belongs to the segment, so drop the odd-numbered fields and print the
even-numbered ones.
If such a line has an odd number of fields, it ends inside a segment and we
are still "in the middle" for the next line; otherwise we are not. Along with
all this, there are some checks to avoid printing newlines when they are not
needed.
This is the script (m represents the "in the middle" flag):
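m==0 { # outside a segment: odd-numbered fields are kept
       for (i=1; i<=NF; i+=2) printf "%s", $i
       # even NF: the line ends inside a segment, so hold back the newline
       if (NF && !(NF%2)) {m=1} else {print ""}
       next }
m==1 { # inside a segment: field 1 is discarded, even-numbered fields are kept
       for (i=2; i<=NF; i+=2) printf "%s", $i
       # odd NF: still inside a segment at end of line; even NF: back outside
       if (NF) {m=(NF%2)?1:0}
       if (NF>1) print "" }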
I did some tests with various input file formats, and it seems to work, but
there may be (read: most certainly are) bugs, and it can probably be optimized
a bit, although that would probably make it less readable (not that it's very
readable as it stands, but anyway...).
$ cat file1
<tag>
<section>
<t1>
foo
</t1>
<t2>blah</t2>
<t1>bar<subt1>bah
</subt1>
</t1></section>
<section>
<t1>baz</t1>
<t2>blah</t2>
<t1>baz</t1>
</section></tag>
<t1>
foo
</t1>
<t2>bar</t2>
<section>gkjkjk</section>
$
$ awk -F'</?t1>' '
m==0 {for(i=1;i<=NF;i+=2) {printf "%s",$i};
if (NF && !(NF%2)) {m=1} else {print""}
next }
m==1 {for(i=2;i<=NF;i+=2) {printf "%s",$i};
if (NF) {m=(NF%2)?1:0};
if (NF>1) print "" }
' file1
<tag>
<section>
<t2>blah</t2>
</section>
<section>
<t2>blah</t2>
</section></tag>
<t2>bar</t2>
<section>gkjkjk</section>
$
$ perl -p0e 's%<t1>.+?</t1>%%gs' file1
<tag>
<section>
<t2>blah</t2>
</section>
<section>
<t2>blah</t2>
</section></tag>
<t2>bar</t2>
<section>gkjkjk</section>
$ # I put the script in the file f.awk here
$ diff <(perl -p0e 's%<t1>.+?</t1>%%gs' file1) <(awk -f f.awk file1)
$
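One input shape that file1 does not seem to exercise is a line, read while m==1, that closes one segment and opens another (odd NF greater than 1). There the line's own newline lies inside the removed text, yet the "NF>1" check still prints one. The sketch below compares the script as posted with a possible variant of the m==1 rule that prints the newline only when the line leaves the "in the middle" state. The function names and the three-line sample are made up for illustration, and the variant is only checked against this one case.

```shell
# The script as posted.
posted_script() {
    awk -F'</?t1>' '
    m==0 {for(i=1;i<=NF;i+=2) {printf "%s",$i};
          if (NF && !(NF%2)) {m=1} else {print""}
          next }
    m==1 {for(i=2;i<=NF;i+=2) {printf "%s",$i};
          if (NF) {m=(NF%2)?1:0};
          if (NF>1) print "" }
    '
}

# Hypothetical variant: in the m==1 rule, print the newline only when NF is
# even, i.e. when the line ends outside a segment.
variant_script() {
    awk -F'</?t1>' '
    m==0 {for(i=1;i<=NF;i+=2) {printf "%s",$i};
          if (NF && !(NF%2)) {m=1} else {print""}
          next }
    m==1 {for(i=2;i<=NF;i+=2) {printf "%s",$i};
          if (NF && !(NF%2)) {m=0; print ""} }
    '
}

# The second line closes one segment and immediately opens another.
sample='a<t1>b
c</t1>d<t1>e
f</t1>g'

posted=$(printf '%s\n' "$sample" | posted_script)
variant=$(printf '%s\n' "$sample" | variant_script)

printf 'posted:\n%s\nvariant:\n%s\n' "$posted" "$variant"
```

On this sample the posted script splits the result over two lines ("ad" and "g"), while the variant gives the single line "adg", which is what removing everything from each <t1> to the next </t1> (newlines included) leaves behind.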
I'd like to hear what you guys have to say about this script.
Thanks!
--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

