Malapha wrote:
> On 18 Mar., 00:01, Hermann Peifer <pei...@[EMAIL PROTECTED]
> wrote:
>> Malapha wrote:
>>
>>> As I come from the VBA world, I tried to get familiar with awk. What
>>> I do have is a theoretical solution in the form of a structured process
>>> diagram :-)
>>> Copy Header and Footer from Original to Var
>>> Set Start_Offer = First Offer (from <Offer> to </Offer>)
>>> Set End_Transaction = 0
>>> Set Part = 0
>>> Set FileSize = 0
>>> Set MaxFileSize = 250
>>> while Start_Offer < EOF(OriginalXMLFile)
>>> Part=part+1
>>> Open NewFile OriginalXMLFileName + Part + ".xml"
>>> Paste Header from Var to NewFile
>>> While filesize(NewFile)<MaxFileSize do
>>> Copy Offer (Start_Offer) from OriginalXMLFile to NewFile
>>> Start_Offer=Start_Offer + 1
>>> wend
>>> Paste Footer from Var to NewFile
>>> wend
>>> I am right now trying to translate this into awk. Please don't ask me
>>> how far I am, it's frustrating :-)
>> Below is one solution for splitting into well-formed chunks, here 100
>> OfferInfos each. There might be better solutions (I just don't know
>> them ;-) It only works if the XML data is in "pretty print format", as
>> in the sample data you posted.
>>
>> $ cat split_bigfile.awk
>>
>> BEGIN { new_chunk = 1 ; size = 100 }
>>
>> NR == 1 { header = $0 ; next }
>> NR == 2 { header = header ORS $0 ; footer = "</" substr($1,2) ">" ; next }
>>
>> $0 !~ footer {
>> if (new_chunk) {
>> outfile = "chunk" sprintf("%07d", num) ".xml"
>> print header > outfile
>> new_chunk = 0
>> }
>> print > outfile
>>
>> }
>>
>> /<\/OfferInfo>/ {
>> num = int(++count/size)
>> if (num > prev_num) {
>> print footer > outfile
>> new_chunk = 1
>> }
>> prev_num = num
>>
>> }
>>
>> END { if (!new_chunk) print footer > outfile }
>
> Hermann, you are great. As I wrote to Jürgen, I am unable to
> check it. But as soon as possible I'll give it a try!!
>
> Thanks again
> Malapha
Here is the xgawk version of the same script. It works fine for me with
your test data. No pre-formatting of bigfile.xml is needed. However, for
this solution you need to have xgawk and the library xmlcopy.awk
available. In xmlcopy.awk, I made a minor change at the very end:
# printf( "%s", token )
return token
Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
$ cat split_big_xmlfile.awk
# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy
BEGIN { new_chunk = 1 ; size = 100 }

# Remember XML declaration of bigfile.xml
XMLDECLARATION { header = XmlCopy() }

# Remember root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
    header = header XmlCopy()
    footer = "</" XMLSTARTELEM ">"
}

# Only care about OfferInfos and their children
XMLPATH ~ /OfferInfo/ {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        printf "%s", header > outfile
        new_chunk = 0
    }
    printf "%s", XmlCopy() > outfile
}

# Decide if it's time to add a footer and start a new chunk
XMLENDELEM == "OfferInfo" {
    num = int(++count/size)
    if (num > prev_num) {
        print footer > outfile
        new_chunk = 1
    }
    prev_num = num
}

# Avoid double footers, if at the end: count%size = 0
END { if (!new_chunk) print footer > outfile }
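
In case the footer logic is not obvious: below is only a sketch in plain awk (no xgawk needed) that replays the same new_chunk / num / prev_num bookkeeping without reading any XML. The shell function name chunk_plan and its two arguments (chunk size, total number of OfferInfos) are made up for this illustration and are not part of the scripts above. It prints which OfferInfos would end up in which chunk file.

```shell
# chunk_plan SIZE TOTAL (hypothetical helper name, for illustration only)
# Replays the new_chunk / num / prev_num bookkeeping for TOTAL OfferInfos
# and prints which OfferInfos would land in which chunk file.
chunk_plan() {
    awk -v size="$1" -v total="$2" '
    BEGIN {
        new_chunk = 1
        for (i = 1; i <= total; i++) {
            # A new chunk starts: remember its number and first OfferInfo
            if (new_chunk) {
                chunk = num + 0
                first = i
                new_chunk = 0
            }
            # Same test as in the XMLENDELEM == "OfferInfo" rule
            num = int(++count / size)
            if (num > prev_num) {
                printf "chunk%07d.xml: offers %d-%d\n", chunk, first, i
                new_chunk = 1
            }
            prev_num = num
        }
        # Same test as in the END rule: close a chunk that is still open
        if (!new_chunk)
            printf "chunk%07d.xml: offers %d-%d\n", chunk, first, total
    }'
}

chunk_plan 3 7
```

With size 3 and 7 OfferInfos this should list chunk0000000.xml (offers 1-3), chunk0000001.xml (offers 4-6) and chunk0000002.xml (offer 7 only, closed by the END test). With chunk_plan 3 6 only the first two lines appear, because the last chunk was already closed when the 6th OfferInfo ended; that is the case where the END rule must not write a second footer.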

