Gathering heterogeneous data with Hadoop


We have a system, including Oracle and Microsoft SQL DBMSs, that takes data from different sources, in different formats, stores it, and processes it. "Different formats" means files: DBF, XLS, and others, including binary formats (images), which are imported into the DBMS with different tools, as well as direct access to databases. We want to isolate all incoming data and store it "forever", and we want to be able to get it later by source and creation time. After some study we want to try the Hadoop ecosystem, but we are not quite sure whether it is an adequate solution for this goal. And which parts of the ecosystem should we use? HDFS alone, Hive, maybe something else? Could you give me a piece of advice?

I assume you want to store the files that contain the data, in effect a searchable file archive.

The files themselves can be stored in HDFS ... or you may find a system like Amazon's S3 cheaper and more flexible. As you store the files, you can manage the other data about the data, namely location, source, and creation time, by appending it to a file. A simple tab-separated file, or one of several other formats supported by Hadoop, makes this easy.

You can then manage and query that file with Hive or other SQL-on-Hadoop tools. In effect, you're creating a simple file system with special attributes, so the trick is to make sure that each time you write a file, you also write the metadata. You may have to handle cases like write failures, and decide what happens when you delete, rename, or move files (I know, "never").
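To make the "write the file, then write the metadata" rule concrete, here is a minimal Python sketch. It runs against the local filesystem only; on a real cluster the copy would go through an HDFS or S3 client instead. The function name `archive_file`, its parameters, the column order, and the use of the file's modification time as a stand-in for creation time are all my assumptions, not part of any Hadoop API.

```python
import csv
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_file(src, archive_dir, metadata_file, source, created=None):
    """Copy src into archive_dir, then append one tab-separated metadata row.

    Hypothetical helper: the row layout (location, source, creation time)
    is an assumption chosen to match the attributes named in the text.
    """
    src = Path(src)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / src.name
    if dest.exists():
        # Incoming data is meant to be kept "forever", so never overwrite.
        raise FileExistsError(dest)
    shutil.copy2(src, dest)

    # Append the metadata only after the copy succeeded, so a failed copy
    # never leaves a row that points at a missing file.
    if created is None:
        # Assumption: modification time approximates creation time.
        created = datetime.fromtimestamp(src.stat().st_mtime, tz=timezone.utc)
    with open(metadata_file, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow([str(dest), source, created.isoformat()])
    return dest
```

A tab-separated file like this can then be exposed to Hive as an external table with a tab field delimiter. The reverse failure (copy succeeded, metadata append failed) is not handled here and would need a retry or a reconciliation pass.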

Your solution might be even simpler, depending on your needs: you may find that storing the data in subdirectories within HDFS (or AWS S3) is enough. For example, if you wanted to store DBF files from source "foo" and XLS files from source "bar", both created on December 1, 2015, you could create a directory structure like

/2015/12/01/foo/dbf/myfile.dbf
/2015/12/01/bar/xls/myexcel.xls

This solution has the advantage of being self-maintaining: the file path itself stores the metadata, which makes it portable and simple, requiring nothing more than a shell script to implement.
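The path layout above can be built and read back mechanically. The following is a small Python sketch rather than the shell script mentioned; the function names `build_archive_path` and `parse_archive_path` are hypothetical, and the "unknown" fallback for files without an extension is my own assumption.

```python
from datetime import date
from pathlib import PurePosixPath


def build_archive_path(source, created, filename):
    """Return /YYYY/MM/DD/<source>/<format>/<filename> for one incoming file."""
    # Assumption: the format directory is the lowercased file extension.
    ext = PurePosixPath(filename).suffix.lstrip(".").lower() or "unknown"
    return str(PurePosixPath("/", f"{created:%Y/%m/%d}", source, ext, filename))


def parse_archive_path(path):
    """Recover the metadata that the directory layout encodes."""
    _root, year, month, day, source, ext, filename = PurePosixPath(path).parts
    return {
        "created": date(int(year), int(month), int(day)),
        "source": source,
        "format": ext,
        "filename": filename,
    }


print(build_archive_path("foo", date(2015, 12, 1), "myfile.dbf"))
# prints /2015/12/01/foo/dbf/myfile.dbf
```

Putting the date components first keeps a lookup by creation time down to a directory listing; if you mostly search by source, putting the source directory first may suit you better.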

I don't think there's any reason to make the solution more complicated than necessary. Hadoop and S3 are both fine for long-term, high-durability storage and querying. My company has found that storing the information about the files in Hadoop (which we use for many other purposes) and storing the files themselves on AWS S3 is far simpler, more secure, and cheaper.

