Malapha wrote:
> On 19 Mar., 14:05, Hermann Peifer wrote:
>> Hermann Peifer wrote:
>>
>>> Here is the xgawk version of the same script. It works fine for me with
>>> your test data. No pre-formatting of bigfile.xml is needed. However, for
>>> this solution you need to have xgawk and the library xmlcopy.awk
>>> available. In xmlcopy.awk, I made a minor change at the very end:
>>> # printf( "%s", token )
>>> return token
>>> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
>>> $ cat split_big_xmlfile.awk
>>> # Include the xmlcopy.awk library
>>> # Make sure that xgawk finds it
>>> @include xmlcopy
>>> BEGIN { new_chunk = 1 ; size = 100 }
>>> # Remember XML declaration of bigfile.xml
>>> XMLDECLARATION { header = XmlCopy() }
>>> # Remember root element, define the footer
>>> XMLSTARTELEM && XMLDEPTH == 1 {
>>>     header = header XmlCopy()
>>>     footer = "</" XMLSTARTELEM ">"
>>> }
>>> # Only care about OfferInfos and their children
>>> XMLPATH ~ /OfferInfo/ {
>>>     if (new_chunk) {
>>>         outfile = "chunk" sprintf("%07d", num) ".xml"
>>>         printf "%s", header > outfile
>>>         new_chunk = 0
>>>     }
>>>     printf "%s", XmlCopy() > outfile
>>> }
>>> # Decide if it's time to add a footer and start a new chunk
>>> XMLENDELEM == "OfferInfo" {
>>>     num = int(++count/size)
>>>     if (num > prev_num) {
>>>         print footer > outfile
>>>         new_chunk = 1
>>>     }
>>>     prev_num = num
>>> }
>>> # Avoid double footers, if at the end: count%size = 0
>>> END { if (!new_chunk) print footer > outfile }
>> Just in case someone is interested, here is yet another version of
>> the same script, where the chunk size is defined in bytes (and checked
>> via XMLLEN, as suggested by Juergen).
>>
>> Hermann
>>
>> $ cat split_big_xmlfile.awk
>>
>> # Include the xmlcopy.awk library
>> # Make sure that xgawk finds it
>> @include xmlcopy
>>
>> # new_chunk can be anything here, but not 0 or ""
>> # size value defines approx. chunk size in bytes
>> # you might have to worry about XMLCHARSET (or not)
>> BEGIN {
>>     new_chunk = size = 250000000
>>     # XMLCHARSET = "ISO-8859-1"
>> }
>>
>> # Remember original XML declaration
>> XMLDECLARATION { header = XmlCopy() }
>>
>> # Remember original root element, define the footer
>> XMLSTARTELEM && XMLDEPTH == 1 {
>>     header = header ORS XmlCopy() ORS
>>     footer = ORS "</" XMLSTARTELEM ">"
>> }
>>
>> # Only care about these elements and their children
>> XMLPATH ~ /OfferInfo/ {
>>     if (new_chunk) {
>>         outfile = "chunk" sprintf("%07d", num) ".xml"
>>         printf "%s", header > outfile
>>         new_chunk = ""
>>     }
>>     printf "%s", XmlCopy() > outfile
>>     chunk_size += XMLLEN
>> }
>>
>> # Decide if it's time to add a footer and start with a new chunk
>> XMLENDELEM == "OfferInfo" && chunk_size > size {
>>     printf "%s", footer > outfile
>>     num++
>>     new_chunk = "it's time now"
>>     chunk_size = 0
>> }
>>
>> END {
>>     # Footer for the last chunk, but avoid double footers
>>     if (!new_chunk) printf "%s", footer > outfile
>>
>>     # Print XMLERRORs, if any. Xgawk is somewhat lazy in
>>     # this respect and might silently die, if you don't have:
>>     if (XMLERROR)
>>         printf("XMLERROR '%s' at row %d col %d len %d\n",
>>                XMLERROR, XMLROW, XMLCOL, XMLLEN)
>> }
>
> I am at a loss for words! Thanks a lot. BTW, I already searched for the
> xmlcopy.awk script, but without luck. xgawk and the utils are installed,
> but not xmlcopy.awk. Do you have a URL?
>
> Malapha
On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk
It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
https://sourceforge.net/project/showfiles.php?group_id=133165
A third place is the source code repository, see here:
http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/
Hermann
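
For anyone else hunting for the library, here is a minimal shell sketch for checking a few likely locations and making the result visible to `@include`. It assumes that xgawk honours the AWKPATH environment variable the same way gawk does. Only /usr/local/share/xgawk comes from the message above; the other directories and the helper name find_xmlcopy are hypothetical and should be adjusted to your installation.

```shell
# Hypothetical helper: print the first directory that contains xmlcopy.awk.
# Returns non-zero if none of the given directories has the file.
find_xmlcopy() {
    for dir in "$@"; do
        if [ -f "$dir/xmlcopy.awk" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate directories: the first is from the post, the others are guesses.
if libdir=$(find_xmlcopy /usr/local/share/xgawk /usr/share/xgawk /usr/local/share/awk); then
    # Setting AWKPATH replaces the built-in default search path,
    # so keep the current directory and any existing value in it.
    AWKPATH=".:$libdir${AWKPATH:+:$AWKPATH}"
    export AWKPATH
    echo "xmlcopy.awk found in $libdir"
else
    echo "xmlcopy.awk not found; take it from the release tarball or the repository" >&2
fi
```

If the file turns up, `xgawk -f split_big_xmlfile.awk bigfile.xml` should then be able to resolve the include from the same shell session.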

