[Cdash] ITK appears in GDCM cdash

Jeff Baumes jeff.baumes at kitware.com
Thu Sep 3 14:14:47 UTC 2009

The table in the link below is nice, it describes the number of bits you
need in a hash to keep a low probability of collision given the number of
entries you plan to scale to. The 50% chance of one collision of CRC32, or
any 32-bit hash algorithm, happens at 77K items (assuming it is completely
uniformly distributed), so you need to go well over 32 bits to avoid
collisions on 12 million or more items (as is obvious by the reason this
thread started). 64-bit gives a 1% chance of collision at 600 million
records, which may or may not be acceptable for cdash.
The reason 77K is so much less than the total space of 2^32 possible values
relates to the "birthday paradox" that probability professors love to



On Thu, Sep 3, 2009 at 9:41 AM, David Cole <david.cole at kitware.com> wrote:

> Julien's not only worried about the speed of the SQL query. He's also (and
> probably more) worried about the speed of computing the hash on large input
> data (test output ranges from a few bytes up to several megabytes depending
> on the test...)
> For every test.xml file submitted that contains hundreds (or even
> thousands) of tests, we have to compute the hash on the test output, look it
> up to see if it's already in the db and then insert it if it's not.
> Computing the hash on the test output is the dominant thing to worry about
> at test submission processing time.
> On Thu, Sep 3, 2009 at 9:35 AM, Eric Noulard <eric.noulard at gmail.com>wrote:
>> 2009/9/3 Amitha Perera <amitha.perera at kitware.com>:
>> > On Thu, Sep 3, 2009 at 8:43 AM, Julien Jomier<julien.jomier at kitware.com>
>> wrote:
>> >> Sorry that's what I meant: could we find a hashing algorithm which
>> return
>> >> value fits in one integer?. Also, what are the complexity of the
>> different
>> >> hashing algorithms? in another word, which one will be fast enough with
>> the
>> >> minimal chance of collisions?
>> >
>> > I'm pretty sure the SHA-1 and MD5 algorithms will prove to be fast
>> > enough for your needs. And I'm sure there are implementations of them
>> > in any language you want.
>> >
>> > Also, I suspect that
>> > will run pretty much just as fast as
>> >  SELECT * WHERE HASH_INT1=myInt1 AND HASH_INT2=myInt2 ... AND
>> HASH_INT4=myInt4;
>> >
>> > It's been a while since I did any serious SQL, but I think a SQL CHAR
>> > can represent any value between 0 and 255, so it may be just as easy
>> > to represent the hash as a CHAR(16) or what ever.  And again, I'm
>> > pretty sure the SELECT statement would execute without any noticeable
>> > difference in speed.
>> Julien seems to think
>> that string comparison was slow (compared to int comparison)
>>   >> "we will need to do a string matching in SQL and I'd like to
>> avoid this if we can for speed reasons"
>> However since as far as I understand you don't need cryptographic hash
>> function
>> you may look into some non-cryptographic hash function
>> that has been designed to be fast, like that one
>> http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
>> Seems to be used with MySQL too:
>> http://www.xaprb.com/blog/2008/03/09/a-very-fast-fnv-hash-function-for-mysql/
>> http://www.radwin.org/michael/blog/2007/03/mysql_user_defined_functio.html
>> Note that FNV may not be the sole choice in this area.
>> --
>> Erk
>> Membre de l'April - « promouvoir et défendre le logiciel libre » -
>> http://www.april.org
>> _______________________________________________
>> Cdash mailing list
>> Cdash at public.kitware.com
>> http://public.kitware.com/cgi-bin/mailman/listinfo/cdash
> _______________________________________________
> Cdash mailing list
> Cdash at public.kitware.com
> http://public.kitware.com/cgi-bin/mailman/listinfo/cdash

Jeff Baumes, Ph.D.
R&D Engineer, Kitware Inc.
(518) 881-4932
jeff.baumes at kitware.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://public.kitware.com/pipermail/cdash/attachments/20090903/32badf7a/attachment-0002.htm>

More information about the CDash mailing list