On Sun, 24 Feb 2008 19:47:46 -0800, droid wrote:
> I have "received" emails from the early nineties to present. These are
> concatenated together in two huge files.
>
> My problem is that segments of one file need to be inserted into the
> other file. Also, there would be many dupes.
>
> I was hoping to find a utility that could do the sorting after joining
> them with 'cat', then use Thunderbird to remove the dupes.
>
> But from the replies here, it appears this is more complicated than I
> supposed.
A better approach, based on personal experience, is to split the files
into separate year files, then let Thunderbird sort them any way you want
for index display. For quite a few years I've been doing this for people
who never clean out their inboxes. It's fairly easy in awk, provided the
header marker is either constant or comes in a limited number of
variants.
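
A quick way to check that proviso before splitting, offered as a sketch
rather than as part of the original post: tally the first two fields of
every line that starts with "From ". A mailbox with a constant marker
should show one dominant entry, and stray body lines show up as small
counts. The function name marker_variants and the file name Inbox are
placeholders, not anything the tools require.

```shell
# marker_variants FILE: tally the first two fields of every line that
# starts with "From ", most frequent first. (Hypothetical helper name.)
marker_variants() {
    awk '/^From / { n[$1 " " $2]++ }
         END { for (m in n) print n[m], m }' "$1" | sort -rn
}

# Example call, assuming a mailbox file named Inbox:
#   marker_variants Inbox
```

If the list is long, the marker is probably not constant and the split
needs a more careful test than a fixed pattern.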
I haven't written a script for Thunderbird files yet, but I need to -
maybe I can do it today.
....
This script, written in full procedural format for ease of conversion to
other formats, splits Thunderbird mailboxes into year files.
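BEGIN {
    OutFile = ""
}
{
    # First line of header recognition and year extraction (Thunderbird)
    if( $0 ~ /^From - / ) {
        # Verification: 7 fields, the last of which is all numbers.
        if( NF == 7 ) {
            if( $7 !~ /[^0-9]/ ) {
                NewFile = $7 ".mailbox"
                # If the filename changes, it is necessary to close the previous one.
                if( NewFile != OutFile ) {
                    if( OutFile != "" )
                        close( OutFile )
                    OutFile = NewFile
                }
            }
        }
    }
    # Append, so that returning to an earlier year does not truncate its file.
    # Remove old *.mailbox files before re-running the script.
    # Lines before the first recognised header are skipped.
    if( OutFile != "" )
        print $0 >> OutFile
}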
The first line of each message's headers is in this format:
From - Thu Feb 02 09:10:14 2006
To help prevent triggering on spurious lines in messages and forwards from
Outlook, the line is identified, then tested for number of fields and for
a pure number in the seventh field. Since I wrote it in a very open
format, with each test on a separate line, it should be reasonably easy to
convert it for other first line formats.
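
To see the three tests at work, here is a minimal sketch that feeds three
made-up lines through the same checks, collapsed into one awk condition.
Only the first line is a genuine separator; the other two are invented to
fail the field-count test and the all-digits test respectively.

```shell
# Three made-up test lines: a genuine separator, a body line that merely
# starts with "From - ", and a 7-field line whose last field is not a number.
result=$(printf '%s\n' \
    'From - Thu Feb 02 09:10:14 2006' \
    'From - here on the text is a forwarded note' \
    'From - Thu Feb 02 09:10:14 2006x' |
awk '
    # The same three tests as the script, combined into one condition.
    /^From - / && NF == 7 && $7 !~ /[^0-9]/ { print "header " $7; next }
    { print "body" }
')
printf '%s\n' "$result"
# prints:
#   header 2006
#   body
#   body
```

For a different first-line format, the parts most likely to need changing
are the regular expression, the expected field count, and the field that
holds the year.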
--
T.E.D. (tdavis@[EMAIL PROTECTED])

