How to join 'n' number of files in an ordered way efficiently using paste/join on Linux or with Perl?
I have thousands of files ending in *.tab. The first column in each file is the row key. Every file has its own headers (so they differ); I don't mind having a single separate header file.
The number of rows is equal across all files, and the rows are in the same order. The desired output should keep that order.
Example files in the directory:
test_1.tab test_2.tab . . . . test_1990.tab test_2000.tab
test_1.tab
pro_01   0 0 0 0 0 1 1 1 0 1 1 0 .....0
pro_02   0 0 0 0 0 1 1 0 0 0 0 0 .....1
pro_03   1 1 1 1 1 0 0 1 0 1 1 0 .....1
.
.
.
pro_200  0 0 0 0 1 1 1 1 1 1 0 .....0
test_2000.tab
pro_1901 1 1 1 1 0 1 1 0 0 0 0 1 .....0
pro_1902 1 1 1 0 0 0 1 0 0 0 0 0 .....1
pro_1903 1 1 0 1 0 1 0 0 0 0 0 1 .....1
.
.
.
pro_2000 1 0 0 0 0 1 1 1 1 1 0 .....0
Desired output:
pro_01   0 0 0 0 0 1 1 1 0 1 1 0 0 ..... 1 1 1 1 0 1 1 0 0 0 0 1 0
pro_02   0 0 0 0 0 1 1 0 0 0 0 0 1 ..... 1 1 1 0 0 0 1 0 0 0 0 0 1
pro_03   1 1 1 1 1 0 0 1 0 1 1 0 1 ..... 1 1 0 1 0 1 0 0 0 0 0 1 1
.
.
.
pro_200  0 0 0 0 1 1 1 1 1 1 0 0 ..... 1 0 0 0 0 1 1 1 1 1 0 0
My code:

touch allcol.tab   # start with an empty file to paste onto
for i in *.tab; do
    paste allcol.tab <(cut -f 2- "$i") > intermediate.csv
    mv intermediate.csv allcol.tab
done
paste <(cut -f1 test_1.tab) allcol.tab > final.tab
rm allcol.tab
It takes quite a long time, about 3 hours. Is there a better way? Also, is there a command to cross-check the output file against the input files, such as diff or wc?
Try this:
#!/bin/bash
tmp=tmp
mkdir "$tmp"
result=result

# Read each file and append the contents of each of its lines
# to a per-row file in the tmp directory.
for f in *.tab; do
    i=1
    while read -r l; do
        echo "$l" >> "$tmp"/"$i"
        ((i++))
    done < <(cut -f2- "$f")
done

# Integrate each per-row file in the tmp dir into a single line of the $result file.
# Iterate by row number instead of globbing "$tmp"/*, since the glob's lexical
# order (1, 10, 100, ..., 2, 20, ...) would scramble the rows.
exec 1>>"$result"
rows=$((i - 1))
for ((j = 1; j <= rows; j++)); do
    while read -r l; do
        printf '%s\t' "$l"
    done < "$tmp"/"$j"
    echo
done
rm -r "$tmp"
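On the cross-checking question: a byte-level diff is awkward here, but row and column counts are cheap to compare with wc and awk. A minimal sketch, assuming the merged file is the result file produced above and that test_1.tab stands in for all inputs (they share a row count):

# All inputs are said to have the same row count, so one file stands in for all.
rows_in=$(wc -l < test_1.tab)
rows_out=$(wc -l < result)
echo "rows: input=$rows_in output=$rows_out"

# Expected width: the data fields (columns 2 onward) of every input, summed.
cols_expected=$(awk 'FNR == 1 { total += NF - 1 } END { print total }' ./*.tab)
cols_out=$(awk 'NR == 1 { print NF; exit }' result)
echo "columns: expected=$cols_expected output=$cols_out"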
This algorithm can also be split across several processors so the task finishes faster.
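The per-row files above cannot safely be appended to by several processes at once, though, so a simple route to parallelism is different: paste the inputs in batches, one background job per batch, then paste the partial results. A rough sketch of that batching idea (the batch size of 100 and the part_/final_all_cols.tab names are illustrative):

#!/bin/bash
# List the inputs in numeric order (test_1.tab, test_2.tab, ..., test_2000.tab).
ls test_*.tab | sort -t_ -k2 -n > filelist

# Split the list into batches of 100 names: batch_aa, batch_ab, ...
split -l 100 filelist batch_

# Paste each batch in a background job.
for b in batch_*; do
    paste $(cat "$b") > "part_$b" &   # unquoted expansion: word splitting is intended
done
wait

# Paste the partial results together, in batch order.
paste part_batch_* > final_all_cols.tab
rm -f filelist batch_* part_batch_*

Note that this keeps the key column of every input; to match the desired output, all files but the first would need a cut -f2- pass beforehand.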
You can add error handling, for example checking that $tmp was created successfully.
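For example, mktemp -d creates a unique directory and makes the failure case explicit; a small variant of the setup above:

# Fail early if the temporary directory cannot be created.
tmp=$(mktemp -d) || { echo "could not create temp dir" >&2; exit 1; }
# Remove it automatically when the script exits.
trap 'rm -r "$tmp"' EXIT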