Re: Splitting huge XML Files into fixsized wellformed parts

by Hermann Peifer <peifer@[EMAIL PROTECTED] > Mar 19, 2008 at 02:05 PM

Hermann Peifer wrote:
> 
> Here is the xgawk version of the same script. It works fine for me with 
> your test data. No pre-formatting of bigfile.xml is needed. However, for 
> this solution you need to have xgawk and the library xmlcopy.awk 
> available. In xmlcopy.awk, I made a minor change at the very end:
> 
>    # printf( "%s", token )
>    return token
> 
> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
> 
> $ cat split_big_xmlfile.awk
> 
> # Include the xmlcopy.awk library
> # Make sure that xgawk finds it
> @include xmlcopy
> 
> BEGIN { new_chunk = 1 ; size = 100 }
> 
> # Remember XML declaration of bigfile.xml
> XMLDECLARATION { header = XmlCopy() }
> 
> # Remember root element, define the footer
> XMLSTARTELEM && XMLDEPTH == 1 {
>     header = header XmlCopy()
>     footer = "</" XMLSTARTELEM ">"
> }
> 
> # Only care about OfferInfos and their children
> XMLPATH ~ /OfferInfo/ {
>     if (new_chunk) {
>         outfile = "chunk" sprintf("%07d", num) ".xml"
>         printf "%s", header > outfile
>         new_chunk = 0
>     }
>     printf "%s", XmlCopy() > outfile
> }
> 
> # Decide if it's time to add a footer and start a new chunk
> XMLENDELEM == "OfferInfo" {
>     num = int(++count/size)
>     if (num > prev_num) {
>         print footer > outfile
>         new_chunk = 1
>     }
>     prev_num = num
> }
> 
> # Avoid double footers, if at the end: count%size = 0
> END { if (!new_chunk) print footer > outfile }


Just in case someone is interested, here is yet another version of 
the same script, where the chunk size is defined in bytes (and checked 
via XMLLEN, as suggested by Juergen).

Hermann

$ cat split_big_xmlfile.awk

# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

# new_chunk can be anything here, but not 0 or ""
# size value defines approx. chunk size in bytes
# you might have to worry about XMLCHARSET (or not)
BEGIN {
         new_chunk = size = 250000000
         # XMLCHARSET = "ISO-8859-1"
}

# Remember original XML declaration
XMLDECLARATION { header = XmlCopy() }

# Remember original root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
         header = header ORS XmlCopy() ORS
         footer = ORS "</" XMLSTARTELEM ">"
}

# Only care about these elements and their children
XMLPATH ~ /OfferInfo/ {
         if (new_chunk) {
                 outfile = "chunk" sprintf("%07d", num) ".xml"
                 printf "%s", header > outfile
                 new_chunk = ""
         }
         printf "%s", XmlCopy() > outfile
         chunk_size += XMLLEN
}

# Decide if it's time to add a footer and start with a new chunk
XMLENDELEM == "OfferInfo" && chunk_size > size {
         printf "%s", footer > outfile
         num++
         new_chunk = "it's time now"
         chunk_size = 0
}

END {
         # Footer for the last chunk, but avoid double footers
         if (!new_chunk) printf "%s", footer > outfile

         # Print XMLERRORs, if any. Xgawk is somewhat lazy in
         # this respect and might silently die, if you don't have:
         if (XMLERROR)
                 printf("XMLERROR '%s' at row %d col %d len %d\n",
                         XMLERROR, XMLROW, XMLCOL, XMLLEN)
}
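
For readers without xgawk at hand, the chunk bookkeeping of the script above (header on the first record of a chunk, footer once the size threshold is passed, no double footer at the end) can be tried out in plain awk. What follows is only a sketch under strong assumptions, not a replacement for the xgawk script: it does no XML parsing at all, it assumes that every OfferInfo record sits on a single line, that the root start tag has no attributes, and it uses length($0) + 1 as a stand-in for XMLLEN. The test data, the root element name Offers and the threshold of 60 are made up for illustration.

```shell
#!/bin/sh
# Sketch only: run it in an empty scratch directory, it writes files there.

# Hypothetical test data: 5 one-line records, 33 characters each.
{
    echo '<?xml version="1.0"?>'
    echo '<Offers>'
    for i in 1 2 3 4 5; do
        echo "<OfferInfo><id>$i</id></OfferInfo>"
    done
    echo '</Offers>'
} > bigfile.xml

# size is the approximate chunk size (60 here, just to force a few chunks).
awk -v size=60 '
BEGIN { new_chunk = 1 }

# Line 1 is assumed to be the XML declaration
NR == 1 { header = $0 ORS; next }

# Line 2 is assumed to be the root start tag without attributes,
# so "<Offers>" can be turned into "</Offers>" with substr()
NR == 2 {
    header = header $0 ORS
    footer = "</" substr($0, 2)
    next
}

# Skip the original root end tag, each chunk gets its own footer
$0 == footer { next }

# Everything else is treated as one complete record
{
    if (new_chunk) {
        outfile = sprintf("chunk%07d.xml", num)
        printf "%s", header > outfile
        new_chunk = 0
    }
    print > outfile
    # Stand-in for XMLLEN: characters in the line plus the newline
    chunk_size += length($0) + 1

    # Time to add a footer and start a new chunk
    if (chunk_size > size) {
        print footer > outfile
        close(outfile)
        num++
        new_chunk = 1
        chunk_size = 0
    }
}

# Footer for the last chunk, but avoid double footers
END { if (!new_chunk) print footer > outfile }
' bigfile.xml

# Show how many records ended up in each chunk
grep -c '<OfferInfo>' chunk*.xml
```

With the made-up data, each record counts as 34, so the threshold of 60 is passed after every second record and the five records should end up in three chunks (2, 2 and 1 records). As in the xgawk script, a chunk can exceed the requested size by up to one record, because the check happens only after a complete record has been written. Note that length() counts characters rather than bytes in some awk implementations and locales, so for non-ASCII data the sizes are approximate.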



