Re: Splitting huge XML Files into fixsized wellformed parts

by Malapha <malapha@[EMAIL PROTECTED] > Mar 25, 2008 at 03:39 AM

On 20 Mar., 09:52, Hermann Peifer <pei...@[EMAIL PROTECTED]> wrote:
> Malapha wrote:
> > On 19 Mar., 14:05, Hermann Peifer <pei...@[EMAIL PROTECTED]> wrote:
> >> Hermann Peifer wrote:
>
> >>> Here is the xgawk version of the same script. It works fine for me with
> >>> your testdata. No pre-formatting of bigfile.xml is needed. However, for
> >>> this solution you need to have xgawk and the library xmlcopy.awk
> >>> available. In xmlcopy.awk, I made a minor change at the very end:
> >>>     # printf( "%s", token )
> >>>     return token
> >>> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
> >>> $ cat split_big_xmlfile.awk
> >>> # Include the xmlcopy.awk library
> >>> # Make sure that xgawk finds it
> >>> @include xmlcopy
> >>> BEGIN { new_chunk = 1 ; size = 100 }
> >>> # Remember XML declaration of bigfile.xml
> >>> XMLDECLARATION { header = XmlCopy() }
> >>> # Remember root element, define the footer
> >>> XMLSTARTELEM && XMLDEPTH == 1 {
> >>>     header = header XmlCopy()
> >>>     footer = "</" XMLSTARTELEM ">"
> >>> }
> >>> # Only care about OfferInfos and their children
> >>> XMLPATH ~ /OfferInfo/ {
> >>>     if (new_chunk) {
> >>>         outfile = "chunk" sprintf("%07d", num) ".xml"
> >>>         printf "%s", header > outfile
> >>>         new_chunk = 0
> >>>     }
> >>>     printf "%s", XmlCopy() > outfile
> >>> }
> >>> # Decide if it's time to add a footer and start a new chunk
> >>> XMLENDELEM == "OfferInfo" {
> >>>     num = int(++count/size)
> >>>     if (num > prev_num) {
> >>>         print footer > outfile
> >>>         new_chunk = 1
> >>>     }
> >>>     prev_num = num
> >>> }
> >>> # Avoid double footers, if at the end: count%size == 0
> >>> END { if (!new_chunk) print footer > outfile }
> >> Just in case someone would be interested, here is yet another version of
> >> the same script, where chunk size is defined in bytes (and checked via
> >> XMLLEN, as suggested by Juergen).
>
> >> Hermann
>
> >> $ cat split_big_xmlfile.awk
>
> >> # Include the xmlcopy.awk library
> >> # Make sure that xgawk finds it
> >> @include xmlcopy
>
> >> # new_chunk can be anything here, but not 0 or ""
> >> # size value defines approx. chunk size in bytes
> >> # you might have to worry about XMLCHARSET (or not)
> >> BEGIN {
> >>     new_chunk = size = 250000000
> >>     # XMLCHARSET = "ISO-8859-1"
> >> }
>
> >> # Remember original XML declaration
> >> XMLDECLARATION { header = XmlCopy() }
>
> >> # Remember original root element, define the footer
> >> XMLSTARTELEM && XMLDEPTH == 1 {
> >>     header = header ORS XmlCopy() ORS
> >>     footer = ORS "</" XMLSTARTELEM ">"
> >> }
>
> >> # Only care about these elements and their children
> >> XMLPATH ~ /OfferInfo/ {
> >>     if (new_chunk) {
> >>         outfile = "chunk" sprintf("%07d", num) ".xml"
> >>         printf "%s", header > outfile
> >>         new_chunk = ""
> >>     }
> >>     printf "%s", XmlCopy() > outfile
> >>     chunk_size += XMLLEN
> >> }
>
> >> # Decide if it's time to add a footer and start with a new chunk
> >> XMLENDELEM == "OfferInfo" && chunk_size > size {
> >>     printf "%s", footer > outfile
> >>     num++
> >>     new_chunk = "it's time now"
> >>     chunk_size = 0
> >> }
>
> >> END {
> >>     # Footer for the last chunk, but avoid double footers
> >>     if (!new_chunk) printf "%s", footer > outfile
> >>
> >>     # Print XMLERRORs, if any. Xgawk is somewhat lazy in
> >>     # this respect and might silently die, if you don't have:
> >>     if (XMLERROR)
> >>         printf("XMLERROR '%s' at row %d col %d len %d\n",
> >>                XMLERROR, XMLROW, XMLCOL, XMLLEN)
> >> }
>
> > I am lost for words! Thanks a lot. BTW, I already searched for the
> > xmlcopy.awk script, but without luck. xgawk and the utils are installed,
> > but not xmlcopy.awk. Do you have a URL?
>
> > Malapha
>
> On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk
>
> It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
> https://sourceforge.net/project/showfiles.php?group_id=133165
>
> A third place is the source code repository, see here:
> http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/
>
> Hermann

Thanks again. I got everything up and running, and it worked :-) I
also modified xmlcopy.awk as suggested.

Here are some benchmarks:

Type                 Minutes   Size
BYTESHRED_XMLCOPY    7.97      322 MB
COUNTSHRED           0.58      322 MB
COUNTSHRED_XMLCOPY   7.55      322 MB

COUNTSHRED_XMLCOPY uses the xmlcopy method. As you can see, the
text-based method (Hermann's first script) is by far the fastest. Its
disadvantage is that the XML input file has to be well formed. I am
still undecided about which method to use. As I have file sizes of up
to 3 GB, COUNTSHRED seems to be the one.
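
To put the benchmark figures into perspective for a 3 GB input, here is a rough extrapolation in plain awk. It is only a sketch under stated assumptions: run time is assumed to grow linearly with file size (not measured), 3 GB is taken as 3072 MB, the minutes are rounded to two decimals, and the function name estimate is made up for this example.

```sh
# Rough extrapolation of the benchmark figures to a larger input.
# Assumption: run time scales linearly with input size.
estimate() {
    awk -v target_mb=3072 '
        # Input columns: method name, minutes for the test run, test size in MB
        { printf "%-20s %6.1f MB/min  ~%5.1f min for %d MB\n",
                 $1, $3 / $2, $2 * target_mb / $3, target_mb }
    '
}

estimate <<'EOF'
BYTESHRED_XMLCOPY  7.97 322
COUNTSHRED         0.58 322
COUNTSHRED_XMLCOPY 7.55 322
EOF
```

Under these assumptions, COUNTSHRED would need roughly five to six minutes for 3 GB, while both xmlcopy variants would need well over an hour.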

One more question: in my XML files there is another tag next to
<OfferInfo>, named <CancelOfferInfo>. Where do I need to place this in
the code so that it also gets processed?
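
One observation that may help here: the copy rule in the quoted scripts uses an unanchored regex match (XMLPATH ~ /OfferInfo/), whereas the rule that decides about chunk boundaries uses an exact string comparison (XMLENDELEM == "OfferInfo"). The sketch below simulates both tests in plain awk, without xgawk. Column 1 stands in for XMLPATH and column 2 for XMLENDELEM; the root element name "Offers", the slash-separated path layout, and the function name check_rules are assumptions made for the illustration.

```sh
# Plain awk only: the XML variables of xgawk are simulated with fields.
# $1 plays the role of XMLPATH, $2 the role of XMLENDELEM.
check_rules() {
    awk '{
        copied  = ($1 ~ /OfferInfo/)            ? "yes" : "no"  # copy rule as posted
        exact   = ($2 == "OfferInfo")           ? "yes" : "no"  # boundary rule as posted
        relaxed = ($2 ~ /^(Cancel)?OfferInfo$/) ? "yes" : "no"  # possible replacement
        printf "%-16s copied=%s counted_exact=%s counted_relaxed=%s\n",
               $2, copied, exact, relaxed
    }'
}

check_rules <<'EOF'
/Offers/OfferInfo        OfferInfo
/Offers/CancelOfferInfo  CancelOfferInfo
/Offers/Other            Other
EOF
```

If xgawk's XMLPATH really looks like the simulated paths, <CancelOfferInfo> elements are already copied by the existing rule, but they are neither counted nor used as chunk boundaries. Replacing the comparison with XMLENDELEM ~ /^(Cancel)?OfferInfo$/ is one possible way to treat both element types alike; this has not been tested against the real data.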


Many thanks
Mala