Split Input SAM Files and Assemble Transcriptomes Using Bioinformatics Pipeline

Import the pipeline and block objects needed for the example.

import bioinfo.pipeline.Pipeline
import bioinfo.pipeline.block.*

Create a pipeline.

P = Pipeline
P = 
  Pipeline with properties:

        Blocks: [0×1 bioinfo.pipeline.Block]
    BlockNames: [0×1 string]

Use a FileChooser block to select the provided SAM files. The files contain aligned reads for Mycoplasma pneumoniae from two samples.

fileChooserBlock = FileChooser([which("Myco_1_1.sam"); which("Myco_1_2.sam")]);

Create a Cufflinks block.

cufflinksBlock = Cufflinks;

Add the blocks to the pipeline.

addBlock(P,[fileChooserBlock,cufflinksBlock]);

Connect the blocks.

connect(P,fileChooserBlock,cufflinksBlock,["Files","GenomicAlignmentFiles"]);

Set SplitDimension to 1 for the GenomicAlignmentFiles input port. A value of 1 corresponds to the row dimension of the input, so the Cufflinks block runs once on each individual SAM file (Myco_1_1.sam and Myco_1_2.sam).

cufflinksBlock.Inputs.GenomicAlignmentFiles.SplitDimension = 1;
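
To make the row-wise split concrete, the following plain MATLAB sketch shows how a 2-by-1 input divides into per-run slices when the split dimension is 1. It is a conceptual illustration only, not the toolbox implementation; the variable names (files, splitDim, perRunInput) are hypothetical, and bare file names stand in for the full paths returned by which.

```matlab
% Conceptual illustration only; this is not how the pipeline is implemented.
% Bare file names stand in for the full paths used in the example.
files = ["Myco_1_1.sam"; "Myco_1_2.sam"];   % 2-by-1 string array
splitDim = 1;                               % 1 = split along rows

nRuns = size(files,splitDim);               % one run per row of the input
perRunInput = cell(nRuns,1);
for k = 1:nRuns
    perRunInput{k} = files(k,:);            % slice that run k would receive
    fprintf("Run %d receives %s\n",k,perRunInput{k});
end
```

With a 2-by-1 input, splitting along dimension 1 yields two slices of one file each, which matches the two independent Cufflinks runs in this example.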

Run the pipeline. The pipeline runs the Cufflinks block two times independently and generates a set of four files for each SAM file.

run(P);

Get the block results.

cufflinksResults = results(P,cufflinksBlock)
cufflinksResults = struct with fields:
           TranscriptsGTFFile: [2×1 bioinfo.pipeline.datatype.File]
             IsoformsFPKMFile: [2×1 bioinfo.pipeline.datatype.File]
                GenesFPKMFile: [2×1 bioinfo.pipeline.datatype.File]
    SkippedTranscriptsGTFFile: [2×1 bioinfo.pipeline.datatype.File]

Use the process table to check the total number of runs for each block. Cufflinks ran two times independently.

t = processTable(P,Expanded=true);

Set SplitDimension to empty [] (the default). In this case, the pipeline does not split the input files and runs Cufflinks just once for both SAM files, processing them one after another.

cufflinksBlock.Inputs.GenomicAlignmentFiles.SplitDimension = [];
deleteResults(P,IncludeFiles=true);
run(P);
cufflinksResults = results(P,cufflinksBlock)
cufflinksResults = struct with fields:
           TranscriptsGTFFile: [2×1 bioinfo.pipeline.datatype.File]
             IsoformsFPKMFile: [2×1 bioinfo.pipeline.datatype.File]
                GenesFPKMFile: [2×1 bioinfo.pipeline.datatype.File]
    SkippedTranscriptsGTFFile: [2×1 bioinfo.pipeline.datatype.File]

Check the process table, which confirms that Cufflinks ran just once.

t2 = processTable(P,Expanded=true);

Tip: If you have Parallel Computing Toolbox™, you can speed up the pipeline run by setting UseParallel=true. The pipeline can then schedule independent executions of blocks on parallel pool workers.

run(P,UseParallel=true)
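
If the same script must run on machines with and without Parallel Computing Toolbox, one possible guard is to ask MATLAB whether a parallel pool is usable before requesting parallel execution. This sketch assumes the MATLAB function canUseParallelPool is available in your release; the variable useParallel is a hypothetical name, and P is the pipeline created earlier in this example.

```matlab
% Sketch: enable parallel execution only when a parallel pool can be used.
% Assumes canUseParallelPool exists in your MATLAB release.
useParallel = canUseParallelPool;

% P is the pipeline built earlier in this example; skip the run if it is absent.
if exist("P","var")
    run(P,UseParallel=useParallel);
end
```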
