How to join 'n' number of files in an ordered way efficiently using paste/join on Linux or with Perl?
I have thousands of files ending in *.tab. The first column in each file is the row key. Every file has its own headers (so they differ); I don't mind having a single separate header file.
The number of rows is equal across all files, and the rows are in the same order. The desired output should keep that order.
Example files in the directory:
test_1.tab test_2.tab . . . . test_1990.tab test_2000.tab
test_1.tab
pro_01   0 0 0 0 0 1 1 1 0 1 1 0 .....0
pro_02   0 0 0 0 0 1 1 0 0 0 0 0 .....1
pro_03   1 1 1 1 1 0 0 1 0 1 1 0 .....1
.
.
.
pro_200  0 0 0 0 1 1 1 1 1 1 0 .....0
test_2000.tab
pro_1901 1 1 1 1 0 1 1 0 0 0 0 1 .....0
pro_1902 1 1 1 0 0 0 1 0 0 0 0 0 .....1
pro_1903 1 1 0 1 0 1 0 0 0 0 0 1 .....1
.
.
.
pro_2000 1 0 0 0 0 1 1 1 1 1 0 .....0
Desired output:
pro_01   0 0 0 0 0 1 1 1 0 1 1 0 0 ..... 1 1 1 1 0 1 1 0 0 0 0 1 0
pro_02   0 0 0 0 0 1 1 0 0 0 0 0 1 ..... 1 1 1 0 0 0 1 0 0 0 0 0 1
pro_03   1 1 1 1 1 0 0 1 0 1 1 0 1 ..... 1 1 0 1 0 1 0 0 0 0 0 1 1
.
.
.
pro_200  0 0 0 0 1 1 1 1 1 1 0 0 ..... 1 0 0 0 0 1 1 1 1 1 0 0
My code:

touch allcol.tab   # start with an empty file to paste onto
for i in *.tab; do
    paste allcol.tab <(cut -f 2- "$i") > intermediate.csv
    mv intermediate.csv allcol.tab
done
paste <(cut -f1 test_1.tab) allcol.tab > final.tab
rm allcol.tab
It takes quite a long time, about 3 hours. Is there a better way? Also, is there a command to cross-check the output file against the input files, such as diff or wc?
Try this:
#!/bin/bash
tmp=tmp
mkdir "$tmp"
result=result

# Read each file and append the contents of each of its lines
# to a per-row file in the tmp directory.
for f in *.tab; do
    i=1
    while read -r l; do
        echo "$l" >> "$tmp"/"$i"
        ((i++))
    done < <(cut -f2- "$f")
done

# Integrate each per-row file in the tmp dir into a single line of the $result file.
# Iterate by row number instead of globbing "$tmp"/*, since the glob's lexical
# order (1, 10, 100, ..., 2, 20, ...) would scramble the rows.
exec 1>>"$result"
rows=$((i - 1))
for ((j = 1; j <= rows; j++)); do
    while read -r l; do
        printf '%s\t' "$l"
    done < "$tmp"/"$j"
    echo
done
rm -r "$tmp"
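On the cross-checking question: a byte-level diff is awkward here, but row and column counts are cheap to compare with wc and awk. A minimal sketch, assuming the merged file is the result file produced above and that test_1.tab stands in for all inputs (they share a row count):

# All inputs are said to have the same row count, so one file stands in for all.
rows_in=$(wc -l < test_1.tab)
rows_out=$(wc -l < result)
echo "rows: input=$rows_in output=$rows_out"

# Expected width: the data fields (columns 2 onward) of every input, summed.
cols_expected=$(awk 'FNR == 1 { total += NF - 1 } END { print total }' ./*.tab)
cols_out=$(awk 'NR == 1 { print NF; exit }' result)
echo "columns: expected=$cols_expected output=$cols_out"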
This algorithm can also be split across several processors so the task finishes faster.
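The per-row files above cannot safely be appended to by several processes at once, though, so a simple route to parallelism is different: paste the inputs in batches, one background job per batch, then paste the partial results. A rough sketch of that batching idea (the batch size of 100 and the part_/final_all_cols.tab names are illustrative):

#!/bin/bash
# List the inputs in numeric order (test_1.tab, test_2.tab, ..., test_2000.tab).
ls test_*.tab | sort -t_ -k2 -n > filelist

# Split the list into batches of 100 names: batch_aa, batch_ab, ...
split -l 100 filelist batch_

# Paste each batch in a background job.
for b in batch_*; do
    paste $(cat "$b") > "part_$b" &   # unquoted expansion: word splitting is intended
done
wait

# Paste the partial results together, in batch order.
paste part_batch_* > final_all_cols.tab
rm -f filelist batch_* part_batch_*

Note that this keeps the key column of every input; to match the desired output, all files but the first would need a cut -f2- pass beforehand.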
You can add error handling, for example checking that $tmp was created successfully.
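For example, mktemp -d creates a unique directory and makes the failure case explicit; a small variant of the setup above:

# Fail early if the temporary directory cannot be created.
tmp=$(mktemp -d) || { echo "could not create temp dir" >&2; exit 1; }
# Remove it automatically when the script exits.
trap 'rm -r "$tmp"' EXIT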