data structures - Organizing disk file for very fast access -


i developing key value database similar apache cassandra , leveldb.

it utilize lsm trees (https://en.wikipedia.org/wiki/log-structured_merge-tree)

keys strings , using c++.

currently data stored on disk in several immutable "sstables", each 2 files.

  • data file - "flat" file - record after record, sorted key, without disk block alignment.

  • index file - contains number of records , array of 8 byte numbers (uint64_t) offcets each record. can position (seek) , find offset record data file.

i have mmap-ed these 2 files in memory.

if program need locate record binary search. means program lots of disk seeks. did optimizations, example jump search , sequentially read last 40-50 records. on file 1 billion keys, still 20-25 seeks (instead of 30).

this works pretty fast - 4-5 seconds 1 billion keys without virtual memory cache (e.g. first ever request) , way under 1 second virtual memory cache.

however thinking of building additional data structure on disk can speed lookup more. thought use "level ordered array", e.g. instead of:

1, 2, 3, 4, 5, 6, 7

to be

4, 2, 6, 1, 3, 5, 7

in case, used keys @ beginning of file, not 100% sure much.

second though b tree or b+tree, creation seems complex lot of disk syncs - or @ least how see it.

apache cassandra using key samples - have every 1024th key in memory - not want go way, because of memory consumption. however, if have them on disk in "flat" file, still needs lots of seeks find "sample" key.

is there alternative missing?


Comments