Re: Splitting huge XML Files into fixsized wellformed parts

by Hermann Peifer <peifer@[EMAIL PROTECTED]>, Mar 20, 2008 at 09:52 AM

Malapha wrote:
> On 19 Mar., 14:05, Hermann Peifer <pei...@[EMAIL PROTECTED]> wrote:
>> Hermann Peifer wrote:
>>
>>> Here is the xgawk version of the same script. It works fine for me with
>>> your test data. No pre-formatting of bigfile.xml is needed. However, for
>>> this solution you need to have xgawk and the library xmlcopy.awk
>>> available. In xmlcopy.awk, I made a minor change at the very end:
>>>    # printf( "%s", token )
>>>    return token
>>> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
>>> $ cat split_big_xmlfile.awk
>>> # Include the xmlcopy.awk library
>>> # Make sure that xgawk finds it
>>> @include xmlcopy
>>> BEGIN { new_chunk = 1 ; size = 100 }
>>> # Remember XML declaration of bigfile.xml
>>> XMLDECLARATION { header = XmlCopy() }
>>> # Remember root element, define the footer
>>> XMLSTARTELEM && XMLDEPTH == 1 {
>>>     header = header XmlCopy()
>>>     footer = "</" XMLSTARTELEM ">"
>>> }
>>> # Only care about OfferInfos and their children
>>> XMLPATH ~ /OfferInfo/ {
>>>     if (new_chunk) {
>>>         outfile = "chunk" sprintf("%07d", num) ".xml"
>>>         printf "%s", header > outfile
>>>         new_chunk = 0
>>>     }
>>>     printf "%s", XmlCopy() > outfile
>>> }
>>> # Decide if it's time to add a footer and start a new chunk
>>> XMLENDELEM == "OfferInfo" {
>>>     num = int(++count/size)
>>>     if (num > prev_num) {
>>>         print footer > outfile
>>>         new_chunk = 1
>>>     }
>>>     prev_num = num
>>> }
>>> # Avoid double footers, if at the end: count%size = 0
>>> END { if (!new_chunk) print footer > outfile }
>> Just in case someone is interested, here is yet another version of
>> the same script, where the chunk size is defined in bytes (and checked
>> via XMLLEN, as suggested by Juergen).
>>
>> Hermann
>>
>> $ cat split_big_xmlfile.awk
>>
>> # Include the xmlcopy.awk library
>> # Make sure that xgawk finds it
>> @include xmlcopy
>>
>> # new_chunk can be anything here, but not 0 or ""
>> # size value defines approx. chunk size in bytes
>> # you might have to worry about XMLCHARSET (or not)
>> BEGIN {
>>          new_chunk = size = 250000000
>>          # XMLCHARSET = "ISO-8859-1"
>>
>> }
>>
>> # Remember original XML declaration
>> XMLDECLARATION { header = XmlCopy() }
>>
>> # Remember original root element, define the footer
>> XMLSTARTELEM && XMLDEPTH == 1 {
>>          header = header ORS XmlCopy() ORS
>>          footer = ORS "</" XMLSTARTELEM ">"
>>
>> }
>>
>> # Only care about these elements and their children
>> XMLPATH ~ /OfferInfo/ {
>>          if (new_chunk) {
>>                  outfile = "chunk" sprintf("%07d", num) ".xml"
>>                  printf "%s", header > outfile
>>                  new_chunk = ""
>>          }
>>          printf "%s", XmlCopy() > outfile
>>          chunk_size += XMLLEN
>>
>> }
>>
>> # Decide if it's time to add a footer and start with a new chunk
>> XMLENDELEM == "OfferInfo" && chunk_size > size {
>>          printf "%s", footer > outfile
>>          num++
>>          new_chunk = "it's time now"
>>          chunk_size = 0
>>
>> }
>>
>> END {
>>          # Footer for the last chunk, but avoid double footers
>>          if (!new_chunk) printf "%s", footer > outfile
>>
>>          # Print XMLERRORs, if any. Xgawk is somewhat lazy in
>>          # this respect and might silently die, if you don't have:
>>          if (XMLERROR)
>>                  printf("XMLERROR '%s' at row %d col %d len %d\n",
>>                          XMLERROR, XMLROW, XMLCOL, XMLLEN)
>>
>> }
> 
> I am at a loss for words! Thanks a lot. BTW, I already searched for the
> xmlcopy.awk script, but without luck. xgawk and the utils are installed,
> but not xmlcopy.awk. Do you have a URL?
> 
> Malapha

On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk

It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
https://sourceforge.net/project/showfiles.php?group_id=133165

A third place is the source code repository, see here:
http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/
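
To check whether the library is somewhere xgawk would look, one can search the awk library path by hand. The following is only a minimal sketch: it assumes that xgawk honours the AWKPATH environment variable the way gawk does, and the function name find_awk_lib as well as the fallback directory list are made up for illustration.

```shell
# Hypothetical helper (not part of xgawk): print the first location of an
# awk library file found in a colon-separated search path.
# Assumption: xgawk searches AWKPATH like gawk does. The fallback list
# below is a guess based on the install location mentioned above.
find_awk_lib() {
    lib=$1
    path=${2:-${AWKPATH:-.:/usr/local/share/xgawk}}
    old_ifs=$IFS
    IFS=:
    for dir in $path; do
        if [ -f "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$lib"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Usage: find_awk_lib xmlcopy.awk || echo "xmlcopy.awk not found"
```

If nothing is found, copying xmlcopy.awk from the release tarball into one of the searched directories (or next to the script) should be enough.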

Hermann
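
For readers who want to see how the count-based version quoted above assigns elements to chunk files, here is a small sketch that replays only its bookkeeping (new_chunk, count, num, prev_num) in plain awk, without any XML parsing. The function name chunk_names and its two arguments are hypothetical; the loop index merely stands in for successive OfferInfo elements.

```shell
# Hypothetical demo: replay the chunk bookkeeping of the count-based script.
# $1 = number of elements to simulate, $2 = elements per chunk (size).
chunk_names() {
    awk -v total="$1" -v size="$2" 'BEGIN {
        new_chunk = 1
        for (i = 1; i <= total; i++) {
            # Start of an element: pick a new output file if needed
            if (new_chunk) {
                outfile = "chunk" sprintf("%07d", num) ".xml"
                new_chunk = 0
            }
            print i, outfile
            # End of an element: same test as in the XMLENDELEM rule
            num = int(++count / size)
            if (num > prev_num) new_chunk = 1
            prev_num = num
        }
    }'
}
```

With `chunk_names 5 2`, elements 1 and 2 go to chunk0000000.xml, elements 3 and 4 to chunk0000001.xml, and element 5 to chunk0000002.xml, so every chunk except possibly the last holds exactly `size` elements.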



