Python - Traverse through .tar.gz directories and concatenate files (without uncompressing the folders)


I have a folder of 20000 tar.gz directories, each containing a bunch of files. I want to go into the source folder, traverse the tar.gz directories (without decompressing them), and concatenate the files so that at the end I have 3 big files.

For example, I have a root folder pnoc that has .tar.gz directories, and each compressed folder has 3 folders: kallisto, rsem, and hugo. I have uncompressed one such directory, and it looks like this:

pnoc/
├── c021_0001_20140916_tumor_rnaseq.tar.gz
├── c021_0002_001113_tumor_rnaseq.tar.gz
├── c021_0003_001409_tumor_rnaseq.tar.gz
├── c021_0004_001418_tumor_rnaseq.tar.gz
├── c021_0005_001661_tumor_rnaseq.tar.gz
├── c021_0007_001669_tumor_rnaseq.tar.gz
├── c021_0008_001699_tumor_rnaseq.tar.gz
├── c021_0009_001766_tumor_rnaseq.tar.gz
├── c021_0010_001774_tumor_rnaseq.tar.gz
├── c021_0011_001786_tumor_rnaseq.tar.gz
├── c021_0012_001825_tumor_rnaseq.tar.gz
├── c021_0013_001872_tumor_rnaseq.tar.gz
├── cpbt_0001_1_tumor_rnaseq.tar.gz
├── cpbt_0003_1_tumor_rnaseq.tar.gz
├── cpbt_0004_1_tumor_rnaseq.tar.gz
├── cpbt_0005_1_tumor_rnaseq.tar.gz
├── cpbt_0006_1_tumor_rnaseq.tar.gz
├── cpbt_0007_1_tumor_rnaseq.tar.gz
├── cpbt_0008_1_tumor_rnaseq.tar.gz
├── cpbt_0009_1_tumor_rnaseq.tar.gz
├── improperly_paired.c021_0006_001666_tumor_rnaseq.tar.gz
└── pnoc-manifest

c021_0001_20140916_tumor_rnaseq
├── kallisto
│   ├── c021_0001_20140916_tumor_rnaseq.abundance.h5
│   ├── c021_0001_20140916_tumor_rnaseq.abundance.tsv
│   └── c021_0001_20140916_tumor_rnaseq.run_info.json
└── rsem
    ├── c021_0001_20140916_tumor_rnaseq.rsem.genes.norm_counts.tab
    ├── c021_0001_20140916_tumor_rnaseq.rsem.genes.raw_counts.tab
    ├── c021_0001_20140916_tumor_rnaseq.rsem.isoform.norm_counts.tab
    ├── c021_0001_20140916_tumor_rnaseq.rsem.isoform.raw_counts.tab
    ├── c021_0001_20140916_tumor_rnaseq.rsem_genes.results
    ├── c021_0001_20140916_tumor_rnaseq.rsem_isoforms.results
    └── hugo
        ├── c021_0001_20140916_tumor_rnaseq.rsem.genes.norm_counts.hugo.tab
        ├── c021_0001_20140916_tumor_rnaseq.rsem.genes.raw_counts.hugo.tab
        ├── c021_0001_20140916_tumor_rnaseq.rsem.isoform.norm_counts.hugo.tab
        ├── c021_0001_20140916_tumor_rnaseq.rsem.isoform.raw_counts.hugo.tab
        ├── c021_0001_20140916_tumor_rnaseq.rsem_genes.hugo.results
        └── c021_0001_20140916_tumor_rnaseq.rsem_isoforms.hugo.results

So I want to concatenate all *.abundance.tsv files into one file, all *.rsem.genes.norm_counts.tab files into a second, and all *.rsem_genes.hugo.results files into a third. What's the best and most efficient way to do that? Anything is okay: R, Python, or bash.

$ find --version
find (GNU findutils) 4.5.11
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Features enabled: D_TYPE O_NOFOLLOW(enabled) LEAF_OPTIMISATION SELINUX FTS(FTS_CWDFD) CBO(level=2)

Thanks!

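Use the bash find command as shown below; the cat command in -exec is applied to the files that find returns. The + terminator passes many file names to each cat call, so cat is spawned as few times as possible rather than once per file. Note that find only sees files on disk, so this assumes the archives have already been extracted; it does not look inside .tar.gz files.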

Here {} denotes the files returned by the find command. Refer to the find -exec documentation for more details.

find . -type f -name '*.abundance.tsv' -exec cat "{}" + >> ../abundancetsv.tsv
find . -type f -name '*.rsem.genes.norm_counts.tab' -exec cat "{}" + >> ../genesnormcounts.tab
find . -type f -name '*.rsem_genes.hugo.results' -exec cat "{}" + >> ../hugoresults.results
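Since the question asks to avoid decompressing the folders, here is a rough Python sketch of the same idea using the standard tarfile module, which can stream members straight out of each .tar.gz without extracting anything to disk. The function name concat_from_archives and the three output file names are made up for illustration; only the file patterns come from the question.

```python
import fnmatch
import shutil
import tarfile
from pathlib import Path

# File patterns from the question, mapped to hypothetical output names.
PATTERNS = {
    "*.abundance.tsv": "abundance.tsv",
    "*.rsem.genes.norm_counts.tab": "genes_norm_counts.tab",
    "*.rsem_genes.hugo.results": "genes_hugo.results",
}


def concat_from_archives(src_dir, out_dir):
    """Append matching members of every .tar.gz in src_dir to one file per pattern."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {pat: open(out_dir / name, "wb") for pat, name in PATTERNS.items()}
    try:
        for archive in sorted(src_dir.glob("*.tar.gz")):
            # "r|gz" reads the archive as a forward-only stream;
            # members are read in place and never written to disk.
            with tarfile.open(archive, mode="r|gz") as tar:
                for member in tar:
                    if not member.isfile():
                        continue
                    base = member.name.rsplit("/", 1)[-1]
                    for pat, out in outputs.items():
                        if fnmatch.fnmatchcase(base, pat):
                            # Copy the member's bytes before moving to the next one.
                            shutil.copyfileobj(tar.extractfile(member), out)
                            break
    finally:
        for out in outputs.values():
            out.close()
```

Calling concat_from_archives("pnoc", "merged") would write the three combined files into a merged folder. Like cat, this appends each file as-is, so any header line is repeated once per sample; strip it separately if you only want it once.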
