Re: Splitting huge XML Files into fixsized wellformed parts

by Hermann Peifer <peifer@[EMAIL PROTECTED] > Mar 17, 2008 at 08:20 PM

Malapha wrote:
> On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@[EMAIL PROTECTED]> wrote:
>> Malapha wrote:
>>> Hi,
>>> I am kind of depressed :-) I want to split XML files larger than
>>> 2 GB into smaller chunks. As I don't want to end up with
>>> billions of files, I want the split files to have a configurable
>>> size, like 250 MB. Each file should be well-formed, with an exact
>>> copy of the header (and footer, as the closing of the header) from the
>>> original file. Furthermore, a table should be generated where I can
>>> see that file X is separated into part N with timestamp:
>> A nice and well described little homework with clear requirements.
>>
>> I'd abstain from splitting the file by size in MB and suggest
>> a simpler measure for splitting instead, like the number
>> of XML blocks or the number of lines.
>>
> 
> I totally agree with you. Using the number of XML blocks as an
> approximation for file size is good enough.
> The problem I see is that line counts only work when the XML
> document contains line breaks (EOL). If the input file has no
> EOL, I run into problems. So I settled on using the xgawk
> framework in order to make use of the "node hopping" technique. This
> lets me count the Offers without having to solve
> the problems mentioned above.
> 

Missing line breaks could be added via a preprocessing step with
$ xmllint --format bigfile.xml > formatted_bigfile.xml

I don't know how xmllint performs with a 2 GB file. On my old laptop, I
run out of memory when trying to reformat a 600 MB file. However, you
might have better hardware available.

There are also other XML command line tools around that have some 
"pretty print" option. xmlstarlet is one of them.
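
If none of these tools can cope with the file size, counting the blocks does not strictly depend on line breaks. The following is only a sketch, not something from this thread: it uses "<" as the awk record separator, so every tag starts a new record no matter how the file is laid out. The function name count_blocks and the element name passed to it are made up for illustration, and a "<" inside comments or CDATA sections would be miscounted.

```shell
#!/bin/sh
# Sketch: count start tags of one element without relying on line breaks.
# Usage: count_blocks INPUT ELEMENT
# Assumption: "<" only occurs as the start of a tag (no "<" inside
# comments or CDATA sections).
count_blocks() {
    awk -v tag="$2" '
    BEGIN {
        RS = "<"                        # every tag starts a new record
        re = "^" tag "[ \t\r\n/>]"      # the element name, then the end of the name
    }
    $0 ~ re { n++ }                     # start tags and empty-element tags
    END { print n + 0 }                 # "+ 0" prints 0 instead of an empty string
    ' "$1"
}
```

Called as "count_blocks bigfile.xml Offer", it prints a single number. Since awk only ever holds the text between two "<" characters, memory use should stay small even for a file without any line breaks.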

>>> All in all I ended up reading the docs on XML processing with gawk,
>>> but it seems I am lacking some deeper programming skills.
>> Given your data above you can solve that all with basic awk pattern
>> matching capabilities, no deeper skills required. What have you tried
>> so far?
> 

Before going deeper into xgawk: try to reformat the file as suggested
above. Then, as suggested by Janis, you could make use of regular awk for
the splitting task.
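
To give an idea of what the regular awk approach could look like, here is a minimal sketch. It assumes that the file has been reformatted so that each start tag and each end tag of the repeated element sits on a line of its own, and that these elements are not nested inside each other. The function name split_xml, the PREFIX_NNN.xml naming scheme and the layout of the index table are invented for illustration; the timestamp is taken from date(1), because strftime() is not available in every awk.

```shell
#!/bin/sh
# Sketch: split a line-formatted XML file into well-formed parts.
# Usage: split_xml INPUT ELEMENT BLOCKS_PER_PART PREFIX
# Writes PREFIX_001.xml, PREFIX_002.xml, ... and prints one index row
# per part on stdout. All names here are illustrative assumptions.
split_xml() {
    awk -v tag="$2" -v max="$3" -v prefix="$4" \
        -v stamp="$(date '+%Y-%m-%d %H:%M:%S')" '
    function part_name(n) { return sprintf("%s_%03d.xml", prefix, n) }
    function start_part() {
        out = part_name(++part)
        printf "%s", header > out       # each part begins with a copy of the header
        count = 0
    }
    function finish_part() {
        close(out)
        # index row: source file, part number, part file, blocks, timestamp
        printf "%s\t%d\t%s\t%d\t%s\n", FILENAME, part, out, count, stamp
    }
    BEGIN {
        open_re  = "<" tag "[ \t>]"     # start tag, with or without attributes
        close_re = "</" tag ">"
    }
    # Everything before the first block is the header.
    !seen && $0 !~ open_re { header = header $0 "\n"; next }
    # A new block begins; open a new part if the current one is full.
    !inblock && $0 ~ open_re {
        seen = 1
        if (part == 0 || count >= max) {
            if (part > 0) finish_part()
            start_part()
        }
        printf "%s", pending > out      # text found between two blocks
        pending = ""
        inblock = 1
    }
    inblock {
        print > out
        if ($0 ~ close_re) { inblock = 0; count++ }
        next
    }
    # Outside of a block: either text between blocks or the footer.
    { pending = pending $0 "\n" }
    END {
        if (part > 0) finish_part()
        # Whatever followed the last block is the footer; append it to every part.
        for (i = 1; i <= part; i++) {
            f = part_name(i)
            printf "%s", pending >> f
            close(f)
        }
    }' "$1"
}
```

With the reformatted file from above, a call could look like this (the element name and the block count are only examples):

$ split_xml formatted_bigfile.xml Offer 100000 part > parts_index.tsv

The header and footer are kept in memory, which should be fine as long as they are small compared to the blocks themselves.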

Hermann



