On 19 Mar., 14:05, Hermann Peifer <pei...@[EMAIL PROTECTED]> wrote:
> Hermann Peifer wrote:
>
> > Here is the xgawk version of the same script. It works fine for me with
> > your test data. No pre-formatting of bigfile.xml is needed. However, for
> > this solution you need to have xgawk and the library xmlcopy.awk
> > available. In xmlcopy.awk, I made a minor change at the very end:
>
> >    # printf( "%s", token )
> >    return token
>
> > Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
>
> > $ cat split_big_xmlfile.awk
>
> > # Include the xmlcopy.awk library
> > # Make sure that xgawk finds it
> > @include xmlcopy
>
> > BEGIN { new_chunk = 1 ; size = 100 }
>
> > # Remember XML declaration of bigfile.xml
> > XMLDECLARATION { header = XmlCopy() }
>
> > # Remember root element, define the footer
> > XMLSTARTELEM && XMLDEPTH == 1 {
> >     header = header XmlCopy()
> >     footer = "</" XMLSTARTELEM ">"
> > }
>
> > # Only care about OfferInfos and their children
> > XMLPATH ~ /OfferInfo/ {
> >     if (new_chunk) {
> >         outfile = "chunk" sprintf("%07d", num) ".xml"
> >         printf "%s", header > outfile
> >         new_chunk = 0
> >     }
> >     printf "%s", XmlCopy() > outfile
> > }
>
> > # Decide if it's time to add a footer and start a new chunk
> > XMLENDELEM == "OfferInfo" {
> >     num = int(++count/size)
> >     if (num > prev_num) {
> >         print footer > outfile
> >         new_chunk = 1
> >     }
> >     prev_num = num
> > }
>
> > # Avoid double footers, if at the end: count % size == 0
> > END { if (!new_chunk) print footer > outfile }
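The counting logic in the script above can be tried without xgawk or xmlcopy.awk. What follows is only a rough sketch in plain awk, wrapped in a shell function: the name chunk_demo is hypothetical, and each input line stands in for one closed OfferInfo element. It prints the chunk file name each record would go to instead of writing any files.

```shell
# Hypothetical helper, not part of the quoted script.
# Usage: chunk_demo SIZE < records
chunk_demo() {
    awk -v size="$1" '
        {
            # The record goes to the chunk that is currently open.
            # num is still unset for the first record, so %07d prints 0000000.
            printf "chunk%07d.xml\t%s\n", num, $0

            # Same arithmetic as in the quoted script: after every
            # size records the integer quotient moves on by one.
            num = int(++count / size)
        }
    '
}

printf '%s\n' a b c d e | chunk_demo 2
```

With a size of 2, records a and b are assigned to chunk0000000.xml, c and d to chunk0000001.xml, and e to chunk0000002.xml.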
>
> Just in case someone is interested, here is yet another version of
> the same script, where the chunk size is defined in bytes (and checked via
> XMLLEN, as suggested by Juergen).
>
> Hermann
>
> $ cat split_big_xmlfile.awk
>
> # Include the xmlcopy.awk library
> # Make sure that xgawk finds it
> @include xmlcopy
>
> # new_chunk can be anything here, but not 0 or ""
> # size value defines approx. chunk size in bytes
> # you might have to worry about XMLCHARSET (or not)
> BEGIN {
>     new_chunk = size = 250000000
>     # XMLCHARSET = "ISO-8859-1"
> }
>
> # Remember original XML declaration
> XMLDECLARATION { header = XmlCopy() }
>
> # Remember original root element, define the footer
> XMLSTARTELEM && XMLDEPTH == 1 {
>     header = header ORS XmlCopy() ORS
>     footer = ORS "</" XMLSTARTELEM ">"
> }
>
> # Only care about these elements and their children
> XMLPATH ~ /OfferInfo/ {
>     if (new_chunk) {
>         outfile = "chunk" sprintf("%07d", num) ".xml"
>         printf "%s", header > outfile
>         new_chunk = ""
>     }
>     printf "%s", XmlCopy() > outfile
>     chunk_size += XMLLEN
> }
>
> # Decide if it's time to add a footer and start with a new chunk
> XMLENDELEM == "OfferInfo" && chunk_size > size {
>     printf "%s", footer > outfile
>     num++
>     new_chunk = "it's time now"
>     chunk_size = 0
> }
>
> END {
>     # Footer for the last chunk, but avoid double footers
>     if (!new_chunk) printf "%s", footer > outfile
>
>     # Print XMLERRORs, if any. Xgawk is somewhat lazy in
>     # this respect and might silently die, if you don't have:
>     if (XMLERROR)
>         printf("XMLERROR '%s' at row %d col %d len %d\n",
>                XMLERROR, XMLROW, XMLCOL, XMLLEN)
> }
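The byte-based cut-off can be sketched the same way in plain awk. This is an illustration under stated assumptions, not the quoted script: size_demo is a hypothetical name, each input line stands in for one complete OfferInfo element, and length($0) + 1 (the line plus its newline) stands in for the XMLLEN values that the script adds up.

```shell
# Hypothetical helper, not part of the quoted script.
# Usage: size_demo BYTES < records
size_demo() {
    awk -v size="$1" '
        {
            # The record is written to the chunk that is currently open.
            printf "chunk%07d.xml\t%s\n", num, $0

            # Stand-in for chunk_size += XMLLEN
            chunk_size += length($0) + 1

            # The check runs only after a complete record, so a chunk
            # is closed once it has already grown past the limit.
            if (chunk_size > size) {
                num++
                chunk_size = 0
            }
        }
    '
}

printf '%s\n' aaaa bbbb cccc dddd | size_demo 8
```

With a limit of 8 bytes and records of 5 bytes each, a chunk is closed after its second record, at 10 bytes. That overshoot is why the size value only defines the approximate chunk size.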
I am lost for words! Thanks a lot. BTW, I already searched for the
xmlcopy.awk script, but without luck. xgawk and the utilities are installed,
but not xmlcopy.awk. Do you have a URL?
Malapha

