On 20 Mar., 09:52, Hermann Peifer <pei...@[EMAIL PROTECTED]> wrote:
> Malapha wrote:
> > On 19 Mar., 14:05, Hermann Peifer <pei...@[EMAIL PROTECTED]> wrote:
> >> Hermann Peifer wrote:
>
> >>> Here is the xgawk version of the same script. It works fine for me with
> >>> your test data. No pre-formatting of bigfile.xml is needed. However, for
> >>> this solution you need to have xgawk and the library xmlcopy.awk
> >>> available. In xmlcopy.awk, I made a minor change at the very end:
> >>>     # printf( "%s", token )
> >>>     return token
> >>> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
> >>> $ cat split_big_xmlfile.awk
> >>> # Include the xmlcopy.awk library
> >>> # Make sure that xgawk finds it
> >>> @include xmlcopy
> >>> BEGIN { new_chunk = 1 ; size = 100 }
> >>> # Remember XML declaration of bigfile.xml
> >>> XMLDECLARATION { header = XmlCopy() }
> >>> # Remember root element, define the footer
> >>> XMLSTARTELEM && XMLDEPTH == 1 {
> >>>     header = header XmlCopy()
> >>>     footer = "</" XMLSTARTELEM ">"
> >>> }
> >>> # Only care about OfferInfos and their children
> >>> XMLPATH ~ /OfferInfo/ {
> >>>     if (new_chunk) {
> >>>         outfile = "chunk" sprintf("%07d", num) ".xml"
> >>>         printf "%s", header > outfile
> >>>         new_chunk = 0
> >>>     }
> >>>     printf "%s", XmlCopy() > outfile
> >>> }
> >>> # Decide if it's time to add a footer and start a new chunk
> >>> XMLENDELEM == "OfferInfo" {
> >>>     num = int(++count/size)
> >>>     if (num > prev_num) {
> >>>         print footer > outfile
> >>>         new_chunk = 1
> >>>     }
> >>>     prev_num = num
> >>> }
> >>> # Avoid double footers, if at the end: count%size = 0
> >>> END { if (!new_chunk) print footer > outfile }
> >> Just in case someone is interested, here is yet another version of
> >> the same script, where chunk size is defined in bytes (and checked via
> >> XMLLEN, as suggested by Juergen).
>
> >> Hermann
>
> >> $ cat split_big_xmlfile.awk
>
> >> # Include the xmlcopy.awk library
> >> # Make sure that xgawk finds it
> >> @include xmlcopy
>
> >> # new_chunk can be anything here, but not 0 or ""
> >> # size value defines approx. chunk size in bytes
> >> # you might have to worry about XMLCHARSET (or not)
> >> BEGIN {
> >>     new_chunk = size = 250000000
> >>     # XMLCHARSET = "ISO-8859-1"
> >> }
>
> >> # Remember original XML declaration
> >> XMLDECLARATION { header = XmlCopy() }
>
> >> # Remember original root element, define the footer
> >> XMLSTARTELEM && XMLDEPTH == 1 {
> >>     header = header ORS XmlCopy() ORS
> >>     footer = ORS "</" XMLSTARTELEM ">"
> >> }
>
> >> # Only care about these elements and their children
> >> XMLPATH ~ /OfferInfo/ {
> >>     if (new_chunk) {
> >>         outfile = "chunk" sprintf("%07d", num) ".xml"
> >>         printf "%s", header > outfile
> >>         new_chunk = ""
> >>     }
> >>     printf "%s", XmlCopy() > outfile
> >>     chunk_size += XMLLEN
> >> }
>
> >> # Decide if it's time to add a footer and start with a new chunk
> >> XMLENDELEM == "OfferInfo" && chunk_size > size {
> >>     printf "%s", footer > outfile
> >>     num++
> >>     new_chunk = "it's time now"
> >>     chunk_size = 0
> >> }
>
> >> END {
> >>     # Footer for the last chunk, but avoid double footers
> >>     if (!new_chunk) printf "%s", footer > outfile
> >
> >>     # Print XMLERRORs, if any. Xgawk is somewhat lazy in
> >>     # this respect and might silently die, if you don't have:
> >>     if (XMLERROR)
> >>         printf("XMLERROR '%s' at row %d col %d len %d\n",
> >>                XMLERROR, XMLROW, XMLCOL, XMLLEN)
> >> }
>
> > I am lost for words! Thanks a lot. BTW, I already searched for the
> > xmlcopy.awk script, but without luck. xgawk and the utilities are
> > installed, but not xmlcopy.awk. Do you have a URL?
>
> > Malapha
>
> On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk
>
> It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
> https://sourceforge.net/project/showfiles.php?group_id=133165
>
> A third place is the source code repository, see here:
> http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/
>
> Hermann
Thanks again. I got everything up and running - and it worked :-) I
also modified XMLCOPY as suggested.
Here are some benchmarks:

Type                 Minutes   Size
BYTESHRED_XMLCOPY    7.97      322 MB
COUNTSHRED           0.58      322 MB
COUNTSHRED_XMLCOPY   7.55      322 MB

COUNTSHRED_XMLCOPY uses the xmlcopy method. As you can see, the
text-based method (Hermann's first script) is by far the fastest: about
35 seconds, against 7.5 to 8 minutes for the two xmlcopy variants. Its
disadvantage is that the XML input file has to be well formed. I am
still undecided about which method to use. Since my files are up to
3 GB in size, COUNTSHRED seems to be the one.
One more question: in my XML files there is another element alongside
<OfferInfo>, named <CancelOfferInfo>. Where do I need to put it in
the code so that it also gets processed?
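
My own reading of the byte-size script, untested with xgawk: the copy
rule tests XMLPATH with an unanchored regular expression, so /OfferInfo/
should already match a path ending in CancelOfferInfo. The chunk-boundary
rule, however, compares XMLENDELEM with == against the exact string
"OfferInfo", so it would not fire on </CancelOfferInfo>. If that is
right, only that comparison would have to become a regex match such as
XMLENDELEM ~ /^(Cancel)?OfferInfo$/. The sketch below checks just the
string matching in plain awk, without xgawk. The function name
offer_match_demo and the /Offers/... paths are made up for illustration;
they are not part of Hermann's script.

```shell
# Plain awk only: shows how the tests in the script treat each element name.
# offer_match_demo and the /Offers/... paths are hypothetical.
offer_match_demo() {
    awk 'BEGIN {
        n = split("OfferInfo CancelOfferInfo Other", name, " ")
        for (i = 1; i <= n; i++) {
            path  = "/Offers/" name[i]                  # stand-in for XMLPATH
            copy  = (path ~ /OfferInfo/)                # test used by the copy rule
            exact = (name[i] == "OfferInfo")            # test used by the boundary rule
            both  = (name[i] ~ /^(Cancel)?OfferInfo$/)  # possible replacement test
            print name[i], copy, exact, both
        }
    }'
}
offer_match_demo
```

This prints "OfferInfo 1 1 1", "CancelOfferInfo 1 0 1" and "Other 0 0 0",
so the anchored regex accepts both element names while the == test
accepts only the first. Whether XMLPATH really looks like these made-up
paths in my files is something I still have to check.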
Many thanks
Mala

