php - How scalable is this file-based DB approach?


I have a simple PHP script that calculates some things about a given string input. It caches the results to a database, and we occasionally delete entries older than a certain number of days.

Our programmers implemented this database as:

    function cachedcalculatething($input) {
      $cachefile = 'cache/' . sha1($input) . '.dat';
      if (file_exists($cachefile)) {
        return json_decode(file_get_contents($cachefile));
      }
      $retval = ...
      // Store the result under the cache file name, then hand it back.
      file_put_contents($cachefile, json_encode($retval));
      return $retval;
    }

    function cleancache() {
      // Anything last changed more than 7 days ago counts as stale.
      $stale = time() - 7*24*3600;
      foreach (new DirectoryIterator('cache/') as $fileinfo) {
        if ($fileinfo->isFile() && $fileinfo->getCTime() < $stale) {
          unlink($fileinfo->getRealPath());
        }
      }
    }

We use Ubuntu LAMP and ext3. At what number of entries does a cache lookup become non-constant or violate a hard limit?

While that particular code is not "scalable"* at all, there are a number of things that can improve it:

  1. sha1 takes only a string. A non-string $input variable will have to be serialized or json_encoded first, before calculating the hash. Do that conversion first, to protect against unexpected inputs.
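  2. Use crc32 instead of sha1; it's faster (see "Fastest hash for non-cryptographic uses?"). Bear in mind that a 32-bit checksum collides far more readily than SHA-1, so this only suits caches with a modest number of entries.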
  3. The directory 'cache/' is relative to the current directory, so as the page's working dir changes, so will the cache dir. There'll be an artificially high number of cache misses.
  4. Every time you store a file in cachedcalculatething(), store the name of the file in an index in /dev/shm/indexedofcaches (or something like that). Check the index before calling file_exists. ext3 is slow, and the caches, along with the kernel's ext3 index, can get paged out. That means a directory scan will hit the disk every time you ask if file_exists. For small caches this is fast enough, but on big ones you'll see a slowdown.
  5. Writes will block, so a server load limit will be hit when the cache is empty, and collisions will occur on the cache filename when two or more PHP writers come along at the same time trying to write a non-existent cache file. You may want to catch those errors and/or do lock-file testing.
  6. We're considering the code in a virgin environment. The truth is that writes will block for an indeterminate amount of time based upon current disk utilization. If your disk is a spinning one, or an older SSD, you may see very slow writes. Check iostat -x 4 and look at your current disk utilization. If it's higher than 25% already, putting disk caching on will spike it to 100% at random times and slow the web service down (because requests to the disk have to be queued and serviced, generally in order, though not always, so don't bank on it).
  7. Depending upon the size of your cache files, maybe store them directly in /dev/shm/my_cache_files/. If they all fit in memory, you gain by keeping the disk entirely out of the service chain. You'd have to put in a cron job to check the overall cache size and make sure it doesn't eat all your memory. Disadvantage: non-persistent. You can back it up on a schedule, though.
  8. Do not call cleancache() in runtime/service code. That directory iteration scan is going to be super slow and will block.
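Items 1, 3 and 5 above can be sketched roughly as follows. This is only a minimal illustration, not the original programmers' code: the helper names cachepathfor() and writecacheatomically() and the $cachedir parameter are made up for the example, and it assumes the cache directory lives on a single local filesystem, so that rename() replaces the target in one step.

```php
<?php
// Hypothetical helper: build the cache path from an absolute directory
// and a serialized form of the input, so non-string inputs hash
// predictably and the path does not depend on the working directory.
function cachepathfor(string $cachedir, $input): string {
    return rtrim($cachedir, '/') . '/' . sha1(serialize($input)) . '.dat';
}

// Hypothetical helper: write to a temporary file in the same directory,
// then rename it into place, so a reader never sees a half-written file
// when two writers race for the same cache entry.
function writecacheatomically(string $cachefile, $value): bool {
    $dir = dirname($cachefile);
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        return false; // could not create the cache directory
    }
    $tmp = tempnam($dir, 'tmp');
    if ($tmp === false) {
        return false;
    }
    if (file_put_contents($tmp, json_encode($value)) === false) {
        unlink($tmp);
        return false;
    }
    return rename($tmp, $cachefile);
}
```

With something like this, the last writer simply wins, and a failed write shows up as a false return value that the caller can log or ignore.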

*Scalability is usually defined in terms of linear request speed or parallel server resources. This code:

  1. (-) Depending upon when/where the cleancache() function is run, it blocks on directory indexing until all items in the cache dir are scanned. It should go into a cron job. If it is in a cron/shell job, there are faster ways to delete expired caches. For instance: find ./cache -type f -mtime +7 -exec rm -f "{}" \;
  2. (-) You're right to mention ext3: ext3's indexing, and the resulting speed for small files and big directory contents, is relatively poor. Google the noatime mount option, and if you can move the cache directory to a separate volume, you can turn off the journal, avoiding double writes, or use a separate filesystem type. Or see if dir_index is available (on ext3 it is a filesystem feature enabled with tune2fs rather than a mount option). Here is a benchmark link: http://fsi-viewer.blogspot.com/2011/10/filesystem-benchmarks-part-i.html
  3. (+) Directory cache entries are easier to distribute to other servers with rsync than with database replication.
  4. (+/-) It depends on how many different cache items you are storing and how often they are accessed. For small numbers of files, say 10-100, each less than 100K, with frequent hits, the kernel will keep the cache files paged in memory and you'll see no serious slowdown at all (if properly implemented).
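If the cleanup has to stay in PHP rather than find, the cron-job suggestion in item 1 above could look something like this. It is a sketch under assumptions: removestalecachefiles() is a made-up name, it compares modification time (like find's -mtime) rather than the ctime used in the question, and the plain seven-day cutoff is not byte-for-byte the same rule as -mtime +7, which rounds ages down to whole days.

```php
<?php
// Hypothetical helper, meant to be called from a script that cron runs,
// not from request-serving code. Returns the number of files removed.
function removestalecachefiles(string $cachedir, int $maxagedays = 7, ?int $now = null): int {
    $now = $now ?? time();
    $stale = $now - $maxagedays * 24 * 3600;
    $removed = 0;
    foreach (new DirectoryIterator($cachedir) as $fileinfo) {
        // Skip directories and dot entries; only plain files are cache entries.
        if ($fileinfo->isFile() && $fileinfo->getMTime() < $stale) {
            // Another process may have removed the file already; ignore that.
            if (@unlink($fileinfo->getPathname())) {
                $removed++;
            }
        }
    }
    return $removed;
}
```

The optional $now argument is only there so the cutoff can be pinned when trying the function out.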

The main takeaway point is that to achieve real scalability and performance out of a caching system, a little more consideration has to be taken than the short block of code shows. There may be more limits than the ones I've enumerated, and even those are subject to variables such as size, number of entries, number of requests/sec, current disk load, file system type, etc., all things that are external to the code. That is to be expected, because a cache persists outside of the code. The code listed can perform for a small boutique set of caching with low numbers of requests, but it may not for the bigger sizes that one comes to need caching for.
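Since the limit depends on so many external variables, the practical way to find it is to measure on your own box. A rough sketch, with the made-up name measurelookup(): it fills a directory with small files named the same way as the cache and times file_exists() probes. Note the assumptions: the files were just written, so the kernel's caches are warm and this shows the best case, and the timing includes the cost of sha1 and clearstatcache().

```php
<?php
// Hypothetical helper: average seconds per file_exists() probe in a
// directory holding $entries small cache-like files.
function measurelookup(string $dir, int $entries, int $lookups = 1000): float {
    $entries = max(1, $entries);
    if (!is_dir($dir)) {
        mkdir($dir, 0775, true);
    }
    for ($i = 0; $i < $entries; $i++) {
        file_put_contents($dir . '/' . sha1((string) $i) . '.dat', '1');
    }
    clearstatcache();
    $start = microtime(true);
    for ($j = 0; $j < $lookups; $j++) {
        $name = sha1((string) random_int(0, $entries - 1)) . '.dat';
        file_exists($dir . '/' . $name);
        clearstatcache(); // stop PHP's own stat cache from answering
    }
    return (microtime(true) - $start) / $lookups;
}
```

Running it for, say, 1,000, 10,000 and 100,000 entries and comparing the averages gives a first idea of where lookups stop looking constant on that filesystem.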

Also, are you running Apache in threaded or prefork mode? It's going to affect how PHP blocks on reads and writes.

-- Um, I should have added that you want to keep track of your object and your key/hash. If $input is already the string, it is in its base form: it has already been computed, retrieved, serialized, etc. If $input is the key, then file_put_contents() needs to put something else (the actual variable/contents). If $input is the object to look up (which could be a long string, or even a short one), then it needs a lookup key, otherwise no computation is being bypassed/saved.
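To make the key/contents distinction concrete, here is a small sketch. The function name cachedcompute() and the $compute callable are hypothetical stand-ins for the real calculation: the hash of the input is only the lookup key, and what gets written to the file is the computed result.

```php
<?php
// Hypothetical wrapper: $input is hashed into the key; the file stores
// the computed value, wrapped so that a cached null or false is still
// recognisable as a hit.
function cachedcompute(string $cachedir, $input, callable $compute) {
    $cachefile = $cachedir . '/' . sha1(serialize($input)) . '.dat';
    if (is_file($cachefile)) {
        $hit = json_decode(file_get_contents($cachefile), true);
        if (is_array($hit) && array_key_exists('value', $hit)) {
            return $hit['value']; // computation bypassed
        }
    }
    $value = $compute($input);
    file_put_contents($cachefile, json_encode(['value' => $value]), LOCK_EX);
    return $value;
}
```

A second call with the same input should return the stored value without invoking $compute again.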

