sql - What declarative language is good at analysis of tree-like data?


I'm thinking about developing a system to perform highly parallel queries on nested (but tree-like) data. The potential users are data analysts (physicists, specifically), not programmers. For the user interface, I want to use a well-known query language to avoid proliferating new languages.

Most of the data would be structured like this (imagine the following schema for billions of event structures):

    event: struct
      |
      +--- timestamp: bigint
      +--- missing energy: float
      +--- tracks: array of struct
      |      |
      |      +--- momentum: float
      |      +--- theta angle: float
      |      +--- hits: array of struct
      |             |
      |             +--- detector id: int
      |             +--- charge: float
      |             +--- time: float
      |             +--- ...
      +--- showers: array of struct
             |
             +--- ...

The database would be read-only, and most of the queries would be things like:

  • momentum of the track with the most hits, with theta between -2.4 and 2.4
  • average charge of all hits with time in 0-10 ps on all tracks with momentum greater than 10 GeV/c
  • weighted average theta of the two tracks with the highest momentum

et cetera. What these queries have in common is that they all resolve to one scalar per event, though they delve into the arrays of structs to do it. They perform "reduce"-like operations (generally fold in Scala, aggregate in Spark, DAF in SQL) across filtered, transformed subsets of those arrays. I could write them in Scala like this:

    // missing check for when 0 tracks passed the filter!
    {event => event.tracks                           // get list of tracks
                   .filter(t => abs(t.theta) < 2.4)  // in theta range
                   .maxBy(_.hits.size)               // take the one with the most hits
                   .momentum                         // return its momentum
    }

    {event => mean(
                event.tracks                    // get list of tracks
                     .filter(_.momentum > 10)   // in momentum range
                     .flatMap(_.hits)           // explode to hits
                     .filter(_.time < 10)       // in time range
                     .map(_.charge)             // return their charges
                  )}                            // ... to the mean function

    // again missing check for fewer than 2 tracks!
    {event => val List(one, two) =              // unpack and assign "one" and "two"
                  event.tracks                  // get list of tracks
                       .sortBy(-_.momentum)     // sort by momentum, highest first
                       .take(2)                 // take the first two
              // now compute the weighted mean of structs "one" and "two"
              (one.theta*one.momentum + two.theta*two.momentum) /
                  (one.momentum + two.momentum)
    }
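For reference, here is a self-contained sketch of the same three queries with the missing checks filled in by returning an Option. The case classes Hit, Track and Event, their camel-cased field names, and the EventQueries object are assumptions made up for illustration from the schema above (showers are left out); they are not part of any existing system.

```scala
// Hypothetical case classes mirroring the schema above (showers omitted).
final case class Hit(detectorId: Int, charge: Double, time: Double)
final case class Track(momentum: Double, theta: Double, hits: List[Hit])
final case class Event(timestamp: Long, missingEnergy: Double, tracks: List[Track])

object EventQueries {
  // Momentum of the track with the most hits among tracks with |theta| < 2.4.
  // None when no track passes the filter.
  def momentumOfBusiestTrack(event: Event): Option[Double] = {
    val inRange = event.tracks.filter(t => math.abs(t.theta) < 2.4)
    if (inRange.isEmpty) None else Some(inRange.maxBy(_.hits.size).momentum)
  }

  // Mean charge of hits with 0 <= time < 10 on tracks with momentum > 10.
  // None when no hit passes both filters.
  def meanCharge(event: Event): Option[Double] = {
    val charges = event.tracks
      .filter(_.momentum > 10)
      .flatMap(_.hits)
      .filter(h => h.time >= 0 && h.time < 10)
      .map(_.charge)
    if (charges.isEmpty) None else Some(charges.sum / charges.size)
  }

  // Momentum-weighted mean theta of the two highest-momentum tracks.
  // None when the event has fewer than two tracks.
  def weightedTheta(event: Event): Option[Double] =
    event.tracks.sortBy(-_.momentum).take(2) match {
      case List(one, two) =>
        Some((one.theta * one.momentum + two.theta * two.momentum) /
          (one.momentum + two.momentum))
      case _ => None
    }
}
```

Each function still resolves to at most one scalar per event; the Option only makes the "no qualifying tracks" case explicit.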

Why not use Scala? My program might be implemented in C and will run on GPUs. Whatever Scala I bring to it would be a reimplemented subset; in other words, an invented language. (The same could be said for Haskell, JavaScript, or any other language that makes heavy use of functions as arguments.)

Also, these queries ought to be declarative. If I implement too much of a general-purpose programming language, details like the order of function calls might become relevant.

Why not use SQL? Is it possible to write queries like the ones above easily, such that they're readable by anyone other than the author? Queries like the ones above are the norm, not the complex extremes.

SQL supports nested arrays of structs, but all the examples I can find of using that substructure are horrendously complicated. One has to explode the table of events into a table of tracks (or double-explode to get hits), and some complex accounting would be needed to unexplode and get back to one scalar per event.
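To make the explode/unexplode accounting concrete, here is a small sketch in plain Scala collections rather than SQL. The names ExplodeSketch, TrackRow, explode and momentumOfBusiest are hypothetical, and events are reduced to an id plus a list of (momentum, number of hits) pairs purely for illustration.

```scala
object ExplodeSketch {
  // One flat row per track, tagged with the id of the event it came from.
  final case class TrackRow(eventId: Long, momentum: Double, nHits: Int)

  // "Explode": flatten events (id, tracks) into a table of track rows.
  // Each track is given as a (momentum, number of hits) pair.
  def explode(events: Seq[(Long, Seq[(Double, Int)])]): Seq[TrackRow] =
    for {
      (id, tracks) <- events
      (momentum, nHits) <- tracks
    } yield TrackRow(id, momentum, nHits)

  // "Unexplode": regroup rows by event id and reduce each group to one scalar,
  // here the momentum of the track with the most hits.
  def momentumOfBusiest(rows: Seq[TrackRow]): Map[Long, Double] =
    rows.groupBy(_.eventId).map { case (id, rs) =>
      id -> rs.maxBy(_.nHits).momentum
    }
}
```

Note that an event with no tracks produces no rows, so it silently disappears from the regrouped result; keeping such events is part of the extra accounting.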

I suppose I could use SQL with new functions like maximal(collection, function), which return a struct from an array, similar to track[12] but using the user-provided function as an objective function for maximizing, minimizing, finding the top/bottom N, etc. I don't think SQL supports passing functions as arguments. If I write an SQL that does, it would be non-standard.
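As a sketch of the semantics such functions would need, here is how they might look as ordinary higher-order functions in Scala. Only the name maximal comes from the text above; minimal, topN, the CollectionReducers object and the curried signatures are assumptions for illustration, not an existing SQL or library API.

```scala
object CollectionReducers {
  // Element of `xs` that maximizes `objective`, or None when `xs` is empty.
  def maximal[A, B](xs: Seq[A])(objective: A => B)(implicit ord: Ordering[B]): Option[A] =
    if (xs.isEmpty) None else Some(xs.maxBy(objective))

  // Element of `xs` that minimizes `objective`, or None when `xs` is empty.
  def minimal[A, B](xs: Seq[A])(objective: A => B)(implicit ord: Ordering[B]): Option[A] =
    if (xs.isEmpty) None else Some(xs.minBy(objective))

  // The `n` elements with the largest objective values, largest first.
  def topN[A, B](xs: Seq[A], n: Int)(objective: A => B)(implicit ord: Ordering[B]): Seq[A] =
    xs.sortBy(objective)(ord.reverse).take(n)
}
```

With these, a call such as CollectionReducers.maximal(tracks)(_.hits.size) would pick the track with the most hits, which is the shape of call the hypothetical SQL extension would have to express.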

Is there a widely used dialect of SQL that supports passing functions as arguments?

Or is there another declarative language I should consider?

I posted this in a comment earlier, but I'm moving it here.

I'm with the others on the use of a graph database. I'm not familiar with Neo4j queries, but I expect them to be capable. Similarly, SPARQL would work well for this kind of thing.

For the first query, a SPARQL query might look like:

    PREFIX : <http://yournamespace.com/accelerator/>

    SELECT ?momentum (MAX(?hitcount) AS ?maxhits)
    WHERE {
        {
            SELECT ?track ?momentum (COUNT(?hits) AS ?hitcount)
            WHERE {
                ?track :momentum ?momentum .
                ?track :theta ?theta .
                FILTER (?theta > -2.4 && ?theta < 2.4)
                ?track :hits ?hits .
            }
            GROUP BY ?track ?momentum
        }
    }
    GROUP BY ?momentum

The identifiers all have a : prefix on them because they need to be encoded as URIs. But that's an internal detail for moving to RDF (the data format for SPARQL databases).

The above query is doing sub-queries because you're looking to aggregate (on the count), and then aggregate again (with the max of the count). But you can see that it's all handled in an SQL-like way, and does not require post-processing.

