Problem 969. Genome Sequence 003: DNA Sequence with random positioned segments

This Challenge series will gradually increase the complexity of Genome DNA Sequencing. DNA Sequencing and the Shot Gun Method will be naively simplified into Cody Challenges. The Genome sizes wiki page is also worth a look.
DNA is represented by the symbols ACGT, which for MATLAB are encoded as 0123. The basic goal is to reconstruct the original serial string of ACGT from multiple short segments. Segments are gleaned from multiple copies of the Virus/Bacteria/Chromosome, so there are overlapping, duplicative, and flipped segments. The created segments may contain errors and duplicative stretches. Chromosome 20, in its 59,187,298 base pairs, has a segment of 820 that is repeated in at least two locations. Because the data is non-random, duplicative stretches are considerably longer than they would be in random data.
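The ACGT-to-0123 encoding described above can be sketched as follows. This is only an illustration in Python rather than MATLAB; the names `BASE_TO_CODE`, `encode`, and `decode` are hypothetical helpers and not part of the challenge.

```python
# Assumed mapping, taken from the statement "ACGT ... encoded as 0123".
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode(seq):
    """Convert a string of bases into a list of integer codes 0..3."""
    return [BASE_TO_CODE[base] for base in seq.upper()]


def decode(codes):
    """Convert a list of integer codes 0..3 back into a string of bases."""
    return "".join(CODE_TO_BASE[code] for code in codes)


print(encode("ACGT"))        # → [0, 1, 2, 3]
print(decode([1, 3, 0, 1]))  # → CTAC
```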
Given three overlapping pieces, ACGTCGGCCA, CTACAGGTACCG, and GACATTACG, these can be readily seen to overlap and create the original if the middle is recognized as having asymmetric overlaps with its adjacent segments.
ssssCTACAGGTACCGsssss Middle
The Genome_003 Challenge is to reconstruct a genome under near-ideal, error-free segment creation conditions. No segments are reversed. The segments start at random locations.
  1. Segments start at random positions (Genome_003 change from 001)
  2. Genome length is unconstrained
  3. Length of each segment - 48
  4. All segments may overlap by 16 to 47 characters
  5. No errors in the segments
  6. Genome is random (no two segments share the same first 16 or last 16 symbols)
  7. Segment order is scrambled
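For local experimentation, test data that follows the rules above could be generated along these lines. This is a hedged Python sketch, not the author's test generator: `make_segments` and its parameters are hypothetical, the genome length is simply derived from the last start position, and rule 6 (no duplicate 16-symbol starts or ends) is not enforced, only made very likely by the random genome.

```python
import random


def make_segments(n_segs, seg_len=48, min_overlap=16, seed=0):
    """Return (genome, scrambled segments, sorted start positions)."""
    rng = random.Random(seed)
    max_step = seg_len - min_overlap  # largest step that keeps min_overlap
    starts = [0]
    for _ in range(n_segs - 1):
        # A step of 1..max_step yields an overlap of seg_len-1 down to min_overlap.
        starts.append(starts[-1] + rng.randint(1, max_step))
    # The genome ends exactly where the last segment ends.
    genome = [rng.randint(0, 3) for _ in range(starts[-1] + seg_len)]
    segs = [genome[s:s + seg_len] for s in starts]
    rng.shuffle(segs)  # rule 7: segment order is scrambled
    return genome, segs, starts


genome, segs, starts = make_segments(n_segs=10)
print(len(segs), len(segs[0]), len(genome) == starts[-1] + 48)
```

Fixing the seed keeps a run reproducible, which helps when comparing a reconstructed genome against the known original.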
Input: segs, an array of M rows of 48-value segments. Values are [0, 1, 2, 3].
Output: Gout, the genome vector of values [0, 1, 2, 3].
Example: segs = [0 1 2 3 2; 1 2 3 2 2; 2 2 1 1 2] creates Gout = [0 1 2 3 2 2 1 1 2] (W=5, Overlap=varies)
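One possible way to reconstruct the genome is sketched below, in Python for illustration only. It is not the reference solution. It assumes the start segment is the only one whose head is not overlapped by any other segment's tail, and that the next segment is always the unused one with the largest tail-to-head overlap. The names `overlap_len` and `assemble` are hypothetical, and the all-pairs comparison is quadratic in the number of segments, so it only suits small inputs.

```python
def overlap_len(a, b, min_overlap):
    """Longest k >= min_overlap where the last k values of a equal the
    first k values of b; returns 0 if there is no such k."""
    for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def assemble(segs, min_overlap=16):
    """Chain scrambled, error-free, unflipped segments into one genome."""
    segs = [list(s) for s in segs]
    n = len(segs)
    # Start segment: no other segment's tail matches its head.
    start = next(
        i for i in range(n)
        if all(overlap_len(segs[j], segs[i], min_overlap) == 0
               for j in range(n) if j != i)
    )
    genome = list(segs[start])
    current = start
    unused = set(range(n)) - {start}
    while unused:
        # The largest overlap identifies the nearest following segment.
        best = max(
            unused,
            key=lambda j: overlap_len(segs[current], segs[j], min_overlap),
        )
        k = overlap_len(segs[current], segs[best], min_overlap)
        if k == 0:
            raise ValueError("no overlapping successor found")
        genome.extend(segs[best][k:])  # append only the new values
        current = best
        unused.remove(best)
    return genome


# The W=5 example from the problem statement; min_overlap is lowered to 1
# because these toy segments are far shorter than 48.
print(assemble([[0, 1, 2, 3, 2], [1, 2, 3, 2, 2], [2, 2, 1, 1, 2]], min_overlap=1))
# → [0, 1, 2, 3, 2, 2, 1, 1, 2]
```

For the real 48-value segments the default `min_overlap=16` matches the stated minimum overlap, which makes accidental matches between unrelated segments very unlikely in a random genome.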
Future: Flipped segments (002), Random Position of Segment start locations (003), Extra Segments, Phage Phi X174, Parallel Processing Simulation (Shot Gun Approach), Haemophilus influenzae, Sequence with Segment Errors, and Chromosome 20 with its 59M length using 100K 4K-segments

Solution Stats

30.0% Correct | 70.0% Incorrect
Last Solution submitted on Jul 19, 2021
