data structures - Organizing a disk file for very fast access
I am developing a key-value database similar to Apache Cassandra and LevelDB.
It uses LSM trees (https://en.wikipedia.org/wiki/log-structured_merge-tree).
Keys are strings and I am using C++.
Currently the data is stored on disk in several immutable "SSTables", each consisting of 2 files:
Data file - a "flat" file: record after record, sorted by key, without disk block alignment.
Index file - contains the number of records and an array of 8-byte numbers (uint64_t) - the offsets of each record. You can position (seek) into it and find the offset of a record in the data file.
I have mmap-ed these 2 files into memory.
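To make the layout concrete, here is a minimal sketch of how a record's key can be fetched through the two mmap-ed files. The struct and the record format (a uint64_t count at the start of the index file, a uint32_t key length in front of each record) are my assumptions, not necessarily your actual format.

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    // Hypothetical view over the two mmap-ed files of one SSTable.
    // Assumed index layout: [uint64_t record_count][uint64_t offsets[record_count]].
    // Assumed data layout per record: [uint32_t key_len][key bytes][value ...].
    struct SSTableView {
        const char*     data;   // start of the mmap-ed data file
        const uint64_t* index;  // start of the mmap-ed index file

        uint64_t record_count() const { return index[0]; }

        // Key of the i-th record, read straight out of the data file mapping.
        std::string_view key_at(uint64_t i) const {
            const char* rec = data + index[1 + i];       // offsets follow the count
            uint32_t key_len;
            std::memcpy(&key_len, rec, sizeof key_len);  // unaligned-safe read
            return std::string_view(rec + sizeof key_len, key_len);
        }
    };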
If the program needs to locate a record, it does a binary search over the index, which means a lot of disk seeks. I did some optimizations, for example jump search, and sequentially reading the last 40-50 records instead of bisecting further. On a file with 1 billion keys this still costs 20-25 seeks (instead of ~30).
This works pretty fast - 4-5 seconds per lookup on 1 billion keys without the virtual memory cache (e.g. the very first request) and way under 1 second once the virtual memory cache is warm.
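As a rough sketch of that lookup strategy (reusing the hypothetical SSTableView above; the 48-record cut-off for the sequential tail scan is just an illustrative value):

    #include <optional>

    // Binary search over the offset array; once the remaining window is small
    // (roughly the last 40-50 records), switch to a sequential scan so the
    // final probes touch one contiguous region instead of doing more seeks.
    std::optional<uint64_t> find_record(const SSTableView& t, std::string_view key) {
        uint64_t lo = 0, hi = t.record_count();        // search in [lo, hi)
        while (hi - lo > 48) {                          // assumed cut-off
            uint64_t mid = lo + (hi - lo) / 2;
            if (t.key_at(mid) < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        for (uint64_t i = lo; i < hi; ++i)              // sequential tail scan
            if (t.key_at(i) == key)
                return i;                               // record index
        return std::nullopt;                            // not found
    }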
However, I am thinking of building an additional data structure on disk that could speed up lookups even more. I thought about using a "level ordered array", e.g. instead of:
1, 2, 3, 4, 5, 6, 7
it would be stored as:
4, 2, 6, 1, 3, 5, 7
In this case the most used keys (the ones probed first by every search) would be at the beginning of the file, but I am not 100% sure how much that would help.
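For reference, this level (BFS) ordering - sometimes called the Eytzinger layout - can be built from a sorted array with one in-order pass and searched without any extra bookkeeping. A sketch on plain integers, matching the example above (on disk the same layout would be applied to the offset array):

    #include <cstdint>
    #include <vector>

    // Place a sorted array into level (BFS) order: the element at eytz[k]
    // has its children at eytz[2k+1] and eytz[2k+2]. An in-order walk of the
    // implicit tree assigns the sorted values in ascending order.
    void build_eytzinger(const std::vector<uint64_t>& sorted,
                         std::vector<uint64_t>& eytz,
                         uint64_t& next, uint64_t k = 0) {
        if (k >= sorted.size()) return;
        build_eytzinger(sorted, eytz, next, 2 * k + 1);  // left subtree
        eytz[k] = sorted[next++];                        // this node
        build_eytzinger(sorted, eytz, next, 2 * k + 2);  // right subtree
    }

    // Search the level-ordered array. The first few probes always hit the
    // first elements (i.e. the first pages of the file), which stay cached.
    bool eytzinger_contains(const std::vector<uint64_t>& eytz, uint64_t key) {
        uint64_t k = 0;
        while (k < eytz.size()) {
            if (eytz[k] == key) return true;
            k = key < eytz[k] ? 2 * k + 1 : 2 * k + 2;
        }
        return false;
    }

    // Usage:
    //   std::vector<uint64_t> sorted = {1, 2, 3, 4, 5, 6, 7};
    //   std::vector<uint64_t> eytz(sorted.size());
    //   uint64_t next = 0;
    //   build_eytzinger(sorted, eytz, next);   // eytz == {4, 2, 6, 1, 3, 5, 7}

The expected win is not fewer comparisons (still ~30 for a billion keys) but that the early comparisons cluster at the start of the file, so only the last few probes are cold seeks.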
My second thought was a B-tree or B+tree, but their creation seems complex, with a lot of disk syncs - or at least that is how I see it.
Apache Cassandra uses key samples - it has every 1024th key in memory. I do not want to go that way because of the memory consumption. However, if I keep the samples on disk in a "flat" file, it still needs lots of seeks to find the "sample" key.
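One middle ground, if RAM is the only concern, would be a fixed-width sample file on disk, mmap-ed like the others: for 1 billion keys, every 1024th key gives roughly one million entries (about 24 MB with the entry below), a small contiguous region whose pages stay cached after a handful of lookups, unlike probes scattered over the whole index. A sketch, assuming for simplicity that keys fit in 16 bytes (longer keys would need full keys in the sample file or a conservatively widened window):

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string_view>
    #include <utility>

    // Hypothetical on-disk sample entry: every 1024th key, zero-padded to a
    // fixed width, plus the position of that key in the main offset array.
    struct SampleEntry {
        char     key[16];       // assumes keys are at most 16 bytes
        uint64_t first_record;  // index into the offset array
    };

    // Binary search the small sample array to get a 1024-record window
    // [begin, end); the existing search then runs only inside that window.
    std::pair<uint64_t, uint64_t> window_for(const SampleEntry* samples,
                                             uint64_t sample_count,
                                             uint64_t record_count,
                                             std::string_view key) {
        char probe[16] = {};
        std::memcpy(probe, key.data(),
                    std::min<size_t>(key.size(), sizeof probe));

        uint64_t lo = 0, hi = sample_count;   // lo ends as count of samples <= probe
        while (lo < hi) {
            uint64_t mid = lo + (hi - lo) / 2;
            if (std::memcmp(samples[mid].key, probe, sizeof probe) <= 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        uint64_t begin = lo == 0 ? 0 : samples[lo - 1].first_record;
        uint64_t end   = lo == sample_count ? record_count
                                            : samples[lo].first_record;
        return {begin, end};
    }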
Is there an alternative I am missing?