Problem 939. DNA Pattern Match: Performance Metric - Speed
The Challenge is to Rapidly find matches of DNA sequences, Length=6, in a 1,800,000 long DNA file.
At IMACST the paper An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern claimed an astounding time improvement from 9.94 seconds to 7.84 seconds, 21% time reduction, to match six segments of length 6 in a 1.8M long DNA file. Basic probability asserts 1.8M/4^6 * 6 = 2637 matches. The paper's test case produced 2346 matches. The method employed used text processing in C++. The paper's L=25 and L=50 cases will be later challenges.
Matlab can achieve matching a six pattern set of L=6 in <15 msec (i5/16GB). This is merely a 99.8% time reduction.
Challenge Description: DNA is made of letters ACGT, wiki DNA, which for the purposes of this Matlab Cody Challenge are given values 0 thru 3. (ACGT= 0123)
Input: [DNA, DNA_ID, Patterns]
- DNA is a 1.8M long uint8 row vector of values 0 thru 3.
- DNA_ID identifies the DNA segment being processed. Multiple calls using the same DNA_ID will be performed. The first call of a DNA_ID is not timed.
- Patterns is an Nx6 uint8 array where each row corresponds to a search pattern.
Locations of all start indices that match any of the patterns
Scoring: Average Time (msec) for a block of L=6 patterns
- DNA= [0 1 2 3 3 2 1 0 1 2 3 3]
- Pattern = [2 3 3 2 1 0]
- Locations = 
- Vectorization (Base 4 to Base 10 of 6 character words)
- Bitshift/Reshape to create all words
- Logical Indexing
Coming soon: Genome DNA sequencing of PhagePhix174 and Haempphilus Influenza
Solution CommentsShow comments
Problem Recent Solvers8