Talk About Network

Google


Register and Login
Nick
Password
Register create new account Sign up is FREE and you can post replies, new topics, bookmark posts and more!
Recover lost password


Programming > C++ > Re: Algorithm o...
Latest [ Topics | Posts ] Archive Post A New Topic Post a Reply
<< Topic < Post Post 4 of 4 Topic 45772 of 47017
Post > Topic >>

Re: Algorithm on search

by James Kanze <james.kanze@[EMAIL PROTECTED] > May 7, 2008 at 01:21 AM

On May 6, 2:15 pm, p...@[EMAIL PROTECTED]
 (Pascal J. Bourguignon)
wrote:
> Vincent SHAO <vincent.shao.b...@[EMAIL PROTECTED]
> writes:
> > Search engine have to record all of the query string. Now i
> > have a search engine log which contains 10 milllion query
> > strings, but almost of them are repeated, not more than 3
> > million of them are non-repeated.  My task is to pick the
> > top 10 most popular query string, memory < 1G, the length of
> > the query string is no more than 255.

> > The faster, the better.  the principal solutions, algorithm
> > and data structure.

> Sounds like homework...  Is it homework?

It sounds like a problem that really could come up
professionally (which doesn't mean it isn't homework---a well
conceived course will use realistic examples).

> In real "search engine" queries, it's not strings, but _sets_ of
> keywords that are searched, so already your data structures will be
> suboptimal...  That is, unless it's homework.

What about SQL or LDAP?  The problem is that he probably should
consider requests that different only in white space to be the
same; for that matter (considering SQL), something like "SELECT
* FROM table1 WHERE a > 5 AND b < 10" is the same as "select *
from table1 where b < 10 and a > 5".  He probably needs to
normalize the input somewhat, but how far depends on the project
requirements.

Anyway, the obvious solution is to use an std::map< Query, long >
for starters, then copy the tabluation into a vector (of
Map::value_type) and sort that on the criteria he's interested
in.  If that's not fast enough, replace std::map with a not
yet standard hash_map, and only if that's not fast enough, to
start worrying about other solutions.

> In anycase, have a look athttp://en.wikipedia.org/wiki/Bloom_filter

That saves space, not time, and probably isn't appropriate for
such a small table.  (Space savings become time savings when the
result in less disk accesses.  In his case, however, the
complete table will almost certainly fit into real memory.)

--
James Kanze (GABI Software)             email:james.kanze@[EMAIL PROTECTED]
 en informatique orient=E9e objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place S=E9mard, 78210 St.-Cyr-l'=C9cole, France, +33 (0)1 30 23 00 34
 




 4 Posts in Topic:
Algorithm on search
Vincent SHAO <vincent.  2008-05-06 03:12:21 
Re: Algorithm on search
pjb@[EMAIL PROTECTED] (P  2008-05-06 14:15:56 
Re: Algorithm on search
"Jim Langston"   2008-05-06 05:47:30 
Re: Algorithm on search
James Kanze <james.kan  2008-05-07 01:21:11 

Post A Reply:
  Go here to Signup

AddThis Feed Button


About - Advertising - Contact - Frequently Asked Questions - Privacy Policy - Terms of Use - Signup

Contact
tan12V112 Thu Jul 24 15:56:03 CDT 2008.