Hermann Peifer wrote:
>
> Here is the xgawk version of the same script. It works fine for me with
> your testdata. No pre-formatting of bigfile.xml is needed. However, for
> this solution you need to have xgawk and the library xmlcopy.awk
> available. In xmlcopy.awk, I made a minor change at the very end:
>
> # printf( "%s", token )
> return token
>
> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
>
> $ cat split_big_xmlfile.awk
>
> # Include the xmlcopy.awk library
> # Make sure that xgawk finds it
> @include xmlcopy
>
> BEGIN { new_chunk = 1 ; size = 100 }
>
> # Remember XML declaration of bigfile.xml
> XMLDECLARATION { header = XmlCopy() }
>
> # Remember root element, define the footer
> XMLSTARTELEM && XMLDEPTH == 1 {
>     header = header XmlCopy()
>     footer = "</" XMLSTARTELEM ">"
> }
>
> # Only care about OfferInfos and their children
> XMLPATH ~ /OfferInfo/ {
>     if (new_chunk) {
>         outfile = "chunk" sprintf("%07d", num) ".xml"
>         printf "%s", header > outfile
>         new_chunk = 0
>     }
>     printf "%s", XmlCopy() > outfile
> }
>
> # Decide if it's time to add a footer and start a new chunk
> XMLENDELEM == "OfferInfo" {
>     num = int(++count/size)
>     if (num > prev_num) {
>         print footer > outfile
>         new_chunk = 1
>     }
>     prev_num = num
> }
>
> # Avoid double footers, if at the end: count%size = 0
> END { if (!new_chunk) print footer > outfile }

In case anyone is interested, here is yet another version of the same
script, where the chunk size is defined in bytes (and checked via
XMLLEN, as suggested by Juergen).

Hermann

$ cat split_big_xmlfile.awk
# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

# new_chunk can be anything here, but not 0 or ""
# size value defines approx. chunk size in bytes
# you might have to worry about XMLCHARSET (or not)
BEGIN {
    new_chunk = size = 250000000
    # XMLCHARSET = "ISO-8859-1"
}

# Remember original XML declaration
XMLDECLARATION { header = XmlCopy() }

# Remember original root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
    header = header ORS XmlCopy() ORS
    footer = ORS "</" XMLSTARTELEM ">"
}

# Only care about these elements and their children
XMLPATH ~ /OfferInfo/ {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        printf "%s", header > outfile
        new_chunk = ""
    }
    printf "%s", XmlCopy() > outfile
    chunk_size += XMLLEN
}

# Decide if it's time to add a footer and start with a new chunk
XMLENDELEM == "OfferInfo" && chunk_size > size {
    printf "%s", footer > outfile
    num++
    new_chunk = "it's time now"
    chunk_size = 0
}

END {
    # Footer for the last chunk, but avoid double footers
    if (!new_chunk) printf "%s", footer > outfile
    # Print XMLERRORs, if any. Xgawk is somewhat lazy in
    # this respect and might silently die, if you don't have:
    if (XMLERROR)
        printf("XMLERROR '%s' at row %d col %d len %d\n",
               XMLERROR, XMLROW, XMLCOL, XMLLEN)
}
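
For anyone who wants to try the size-based rollover logic without xgawk at hand, here is a rough sketch in plain awk. It is not the script above, only an illustration under a few assumptions: each input line stands in for one OfferInfo element, `length($0) + 1` stands in for XMLLEN, and the `<root>` header and footer, the tiny `size=8` and the `demo_chunk` file names are made up for the demo. The `close(outfile)` call is an addition that the script above does not have; it may help when a run produces very many chunks.

```shell
# Hypothetical stand-in for the XML stream: one record per line.
printf '%s\n' aaaa bbbb cccc dddd eeee |
awk -v size=8 '
BEGIN { new_chunk = 1 }

# Every record goes to the current chunk; open a new chunk first if needed.
{
    if (new_chunk) {
        outfile = sprintf("demo_chunk%07d.txt", num)
        print "<root>" > outfile          # made-up header
        new_chunk = ""
    }
    print > outfile
    chunk_size += length($0) + 1          # stand-in for XMLLEN
}

# Size limit exceeded: write the footer and arrange for a new chunk.
chunk_size > size {
    print "</root>" > outfile             # made-up footer
    close(outfile)                        # addition, not in the original
    num++
    new_chunk = 1
    chunk_size = 0
}

# Footer for the last chunk, but avoid double footers.
END { if (!new_chunk) print "</root>" > outfile }
'
```

With these five records the run leaves three files behind, demo_chunk0000000.txt to demo_chunk0000002.txt, each wrapped in the made-up header and footer. As in the script above, a chunk is only closed after the record that pushes it over the limit, so chunks can end up slightly larger than `size`.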

