Thu 17 Aug 2006
Short paper on large scale data indexing for retrieval.
When dealing with large amount of data, at times it is necessary to index outside of your current database, particularly when space is an issue.
The new MySQL archive storage engine is perfect this scenario, however you still need to index the data as the archive engine is compressed and does not support indexes (and rightfully so).
There are several camps of thought in this arena:
- hashed indexes
- Bucket systems (a data-structure)
- bloom filters (hash filtering)
with a hashed index, you need to deal with very specific queries and the hash tables will take up large amounts of space.
with a bucketed system the data is highly organized, but space will still be a major factor as will be retreival and again you will need to store an additional database that lists available data. Each piece of an index is broken down. You set up directories of the data like 0-9_a-z_A-Z (62 directories) and in each create an additional set of 0-9_a-z_A-Z, thus when retrieving the data for record number “DJ192″ you would find that item in /hashdirs/D/J/DJ192 (very simply put this is a backet system, there are obviously much more advanced versions of bucketing, even within databases)
Bloom filters were designed (quite a while back) for a different usage, but have proven to be fast, small and can be tweaked for reliability in this area.
A bloom filter is a “ONE-WAY” hash, meaning that you cannot derive the type of data stored in a filter, you can only check for the existance of a piece of data. Traditional bloom filters are used in spellchecking applications, RFID identification for stores and anywhere that you may have enourmous numbers of distinct chunks of data.
De Gan filters is my improvement on bloom filters which has been able to store enourmous amounts of data in a hashed thread-table. It is not perfect and needs to have some tweaks made on it, but for one-way hashing tables in a scripting language, it is well formed and fast (even without bitshifting implemented)
Well, that is great but how is this useful?
First, you will never search your indexes on items that are not in the database, second you can greatly speed up your searches by using this filter and third when you create the record in the filter, you can also store a combine hash (md5 over the sha1/md5 combo) and use that to reference your data. In the case of a text article, this is useful for word pairs/pairing hashing so that you can easily search them and know that your data retreival will not be wasting any milliseconds checking for data that is non-existent or actually combing through the data.
For small(ish) sets mysql fulltext is fine, I have had no problems or cpu issues on a database with over 5,000,000 records running fulltext searches over multiple fields. However, in the case where you go to 50M, 100M or 500M (depending on your hardware) you are going to see some drastic reductions in performance of searches, especially if the database is being heavily used.
With the filters I developed, it is entirely possible to fit 100M records in a one-way hash in about 20-25megs of space, in a mysql database, this would equate to the records and the indexes at most likely 200meg to 500m depending obviously on what is in the table, though for even an indexed lookup it will be much larger than 25meg for a hash. Additionally, you will need to store alongside the table a further index of the hashes, the best way to do this is bucketing the databases and tables (database A-Z etc and tables A-Z) which will be large, but will be much faster than a straight single table.
does this work?
Ask google, the founders original intent and design was published for all to view and study. It is like a roadmap on how to build a monster ever-scalable search engine
edit also found the google spasehash project that google released.