On May 6, 2:15 pm, p...@[EMAIL PROTECTED]
(Pascal J. Bourguignon)
wrote:
> Vincent SHAO <vincent.shao.b...@[EMAIL PROTECTED]
> writes:
> > Search engine have to record all of the query string. Now i
> > have a search engine log which contains 10 milllion query
> > strings, but almost of them are repeated, not more than 3
> > million of them are non-repeated. My task is to pick the
> > top 10 most popular query string, memory < 1G, the length of
> > the query string is no more than 255.
> > The faster, the better. the principal solutions, algorithm
> > and data structure.
> Sounds like homework... Is it homework?
It sounds like a problem that really could come up
professionally (which doesn't mean it isn't homework---a well
conceived course will use realistic examples).
> In real "search engine" queries, it's not strings, but _sets_ of
> keywords that are searched, so already your data structures will be
> suboptimal... That is, unless it's homework.
What about SQL or LDAP? The problem is that he probably should
consider requests that different only in white space to be the
same; for that matter (considering SQL), something like "SELECT
* FROM table1 WHERE a > 5 AND b < 10" is the same as "select *
from table1 where b < 10 and a > 5". He probably needs to
normalize the input somewhat, but how far depends on the project
requirements.
Anyway, the obvious solution is to use an std::map< Query, long >
for starters, then copy the tabluation into a vector (of
Map::value_type) and sort that on the criteria he's interested
in. If that's not fast enough, replace std::map with a not
yet standard hash_map, and only if that's not fast enough, to
start worrying about other solutions.
> In anycase, have a look athttp://en.wikipedia.org/wiki/Bloom_filter
That saves space, not time, and probably isn't appropriate for
such a small table. (Space savings become time savings when the
result in less disk accesses. In his case, however, the
complete table will almost certainly fit into real memory.)
--
James Kanze (GABI Software) email:james.kanze@[EMAIL PROTECTED]
en informatique orient=E9e objet/
Beratung in objektorientierter Datenverarbeitung
9 place S=E9mard, 78210 St.-Cyr-l'=C9cole, France, +33 (0)1 30 23 00 34


|