Malapha wrote:
> On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@[EMAIL PROTECTED]
> wrote:
>> Malapha wrote:
>>> Hi,
>>> I am kind of depressed :-) I want to split XML files with sizes
>>> greater than 2 GB into smaller chunks. As I don't want to end up with
>>> billions of files, I want the split files to have a configurable
>>> size, like 250 MB. Each file should be well-formed, with an exact
>>> copy of the header (and the footer that closes the header) from the
>>> original file. Furthermore, a table should be generated where I can
>>> see that File X is separated into Part N with a timestamp:
>> A nice and well described little homework with clear requirements.
>>
>> I'd abstain from splitting the file according to file size in MB
>> and suggest a simpler measure for splitting, like the number
>> of XML blocks or the number of lines.
>>
>
> I totally agree with you. Using the number of XML blocks as an
> approximation for file size is good enough.
> The problem I see is that line counts only work when the XML
> document actually contains EOLs. If the input data file has no
> EOL, I run into problems. So I came to the solution of using the xgawk
> framework to make use of the "node hopping" technique. This
> lets me count the Offers without having to solve
> the problems mentioned above.
>
Missing line breaks could be added via a preprocessing step with
$ xmllint --format bigfile.xml > formatted_bigfile.xml
I don't know how xmllint performs with a 2 GB file. On my old laptop, I
run out of memory when trying to reformat a 600 MB file. However, you
might have better hardware available.
There are also other XML command-line tools around that have a
"pretty print" option; xmlstarlet is one of them.
>>> All in all I ended up reading the XML processing docs for gawk,
>>> but it seems I am lacking some deeper programming skills...
>> Given your data above, you can solve all of that with basic awk pattern
>> matching capabilities, no deeper skills required. What have you tried
>> so far?
>
Before going deeper into xgawk: try to reformat the file as suggested
above. Then, as suggested by Janis, you could use regular awk for
the splitting task.
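
For what it's worth, here is a rough sketch of how the splitting could
look with plain awk once the file has been reformatted. It rests on
assumptions of mine, not on your actual data: the record element is
called <Offer> (guessed from your mention of "Offers"), its opening and
closing tags each sit on their own line as in pretty-printed output,
Offer elements are not nested, everything before the first <Offer> is
the header, and everything after the last </Offer> is the footer. The
function name split_offers and the ".partNNN.xml" naming are made up
for illustration.

```shell
#!/bin/sh
# Sketch only: split a pretty-printed XML file into parts that each hold
# at most N <Offer> blocks, copying header and footer into every part.
# Usage: split_offers FORMATTED_FILE [BLOCKS_PER_PART]
split_offers() {
    awk -v max="${2:-1000}" -v base="$1" \
        -v stamp="$(date '+%Y-%m-%d %H:%M:%S')" '
    # Hypothetical naming scheme for the part files.
    function part_name(i) { return sprintf("%s.part%03d.xml", base, i) }

    # Opening tag of a block (assumed element name: Offer).
    /^[ \t]*<Offer[ \t>]/ {
        # Keep anything found between two blocks with the current part.
        if (seen && tail != "") printf "%s", tail > out
        tail = ""
        seen = 1
        if (count % max == 0) {          # this block starts a new part
            if (part) close(out)
            out = part_name(++part)
            printf "%s", header > out    # exact copy of the header
        }
        count++
        in_offer = 1
    }
    in_offer           { print > out }              # lines inside a block
    !in_offer && !seen { header = header $0 "\n" }  # before the first block
    !in_offer && seen  { tail = tail $0 "\n" }      # footer candidate
    /<\/Offer>/        { in_offer = 0 }             # closing tag ends block

    END {
        if (part) close(out)
        # What is left in tail is the footer: append it to every part and
        # print one table row per part: source, part number, file, time.
        for (i = 1; i <= part; i++) {
            out = part_name(i)
            printf "%s", tail >> out
            close(out)
            printf "%s\t%d\t%s\t%s\n", base, i, out, stamp
        }
    }
    ' "$1"
}
```

Something like "split_offers formatted_bigfile.xml 50000 > parts.tsv"
would then write the part files next to the input and collect the table
in parts.tsv. The number of blocks per part is only a stand-in for the
250 MB target, so you would have to tune it to your average Offer size.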
Hermann

