Darkness 2002-11-05 00:00:00 UTC
Twilight 2002-11-06 00:00:00 UTC
Daylight 2002-11-07 00:00:00 UTC
Finish 2002-11-12 00:00:00 UTC

Protein Folding - Rules

Arrow_down  Download Sample Code

Protein Folding Contest Rules

This contest is similar to the last one in that it deals with molecules, but the problem is more current: protein folding. Proteins, which are initially synthesized in the cell as long chains of amino acid building blocks, automatically fold down into more compact working shapes. Exactly how they do this, and how they do it quickly and consistently, is poorly understood.

The problem is so notoriously difficult that even dramatic simplifications are difficult to solve. One stripped-down version of the protein-folding problem is the so-called HP lattice model introduced by Lau and Dill in 1989 (See the references at the bottom of this page). Our contest is based on a version of this problem.

This contest concerns a simplified two-dimensional version of the protein folding problem. Given the amino acid sequence of a simplified protein, your job is to determine a final "optimally folded" configuration. Each potential arrangement of the protein sequence is associated with a free energy level, which we call the "result" of your folding algorithm. An optimal folding (and there may be more than one) corresponds to the configuration with the lowest possible result, which can be thought of as a minimum energy configuration. In our idealized world, there are only two kinds of amino acids: hydrophobic amino acids (denoted by 1) and hydrophilic amino acids (denoted by 0). Since proteins exist in a watery environment, the hydrophobic amino acids prefer to cling to each other, compact and apart from the water as much as possible.

The sequence for the protein shown in the diagram below is

[0 1 1 1 1 0]
where hydrophobic amino acids are shown in red and hydrophilic amino acids are shown in blue.

Think of the amino acids that make up the protein as links in a chain. Each link may bend by a multiple of 90 degrees, but the distance between links may not change (and is assumed to be 1). No self-intersections are permitted, and the final result is constrained to lie in a single plane. Thus you can think of the chain as a line snaking through a Manhattan-style grid, one amino acid at a time. This is sometimes referred to as a self-avoiding walk.

We will refer to the complex plane to simplify the configuration of any given chain. The folding sequence your function returns will define the turn taken at each link by means of the complex numbers 1 (east), i (north), -1 (west), and -i (south). If you are given the amino acid sequence a

>> a = [0 1 1 1 1 0];
you might return the configuration sequence
>> c = [1 1 1 1 1];
which defines the straight line configuration shown in the diagram above. The configuration sequence is necessarily one element shorter than the sequence itself, since each element defines the relationship between two amino acids. If we assume the first amino acid is always anchored at 0, then the position of each amino acid can be succintly calculated as
>> p = [0 cumsum(c)]
p =

     0     1     2     3     4     5
So c can be interpreted as: "from your starting location, go east five consecutive times."

To introduce a small turn in the configuration, we can put a northward turn (given by i) in the fourth element of the configuration vector. We expect to see a picture like the one below.

And indeed, the following MATLAB code confirms our expectation.

>> c = [1 1 i 1 1];
>> p = [0 cumsum(c)]
p =

     0     1     2     2+1i  3+1i  4+1i

>> plot(p,'.-')
>> axis([-2 6 -2 3])

Complex arithmetic makes the folding calculations straightforward. If we want to fold the entire right half of the original sequence up by 90 degrees, we can multiply the original c vector by a fold vector that bends it like so

>> c = [1 1 1 1 1];
>> f = [1 1 i i i];
>> c2 = c.*f
c2 =

     1     1     i     i    i

>> p2 = [0 cumsum(c2)]
p2 =

     0     1     2     2+1i  2+2i  2+3i
thereby making the protein shown below.

By applying successive folding vectors, your code should try to minimize the result. Here is one final configuration vector that tries to minimize the fold energy by bringing the four red amino acids close together. We apply yet another 90 degree turn to the last two elements.

>> f2 = [1 1 1 i i];
>> c3 = c2.*f2
c3 =

     1     1     i    -1    -1

>> p3 = [0 cumsum(c3)]
p3 =

     0     1     2     2+1i  1+1i  1i


The factors, in order of importance, that determine your score are
  1. Free energy level (which we call the "result")
  2. Runtime of your code
The total score is a combination of these two, with linear dependence on the energy level and an exponential dependence on the runtime. Lower scores are better. Your code will time out and fail after 180 seconds, so beware of combinatoric explosions in your algorithm. Furthermore, the exponential contribution of the runtime penalty gets quite high as you approach 180 seconds.

The free energy (result) is calculated by totaling the distance between all pairs of hydrophobic amino acids. Your best configuration will have the hydrophobic amino acids in the most compact mass. The code to do this calculation is shown below. Suppose you have the last sequence shown above. We ignore the location of the hydrophilic amino acids, since they do not contribute to the result, and then total up the distance matrix for the remaining elements.

The first line removes the hydrophilic elements

>> p3 = p3(logical(a));    
>> [x,y] = meshgrid(1:length(p3));
>> dist = abs(p3(x) - p3(y))        

dist =

         0    1.0000    1.4142    1.0000
    1.0000         0    1.0000    1.4142
    1.4142    1.0000         0    1.0000
    1.0000    1.4142    1.0000         0
Here is the distance matrix. The result is the sum of all remaining distances: 1 + 1 + 1 + 1 + sqrt(2) + sqrt(2)
>> result = sum(sum(dist))/2       
result =

This result, along with the associated runtime, gets passed to our scoring algorithm before returning a final score, according to the equation

score = k1 * result + k2* e (k3*runtime)

We don't publish the values k1, k2, and k3.

Developing your entry

The files you need to get started on the contest are included in a ZIP-file available here at the MATLAB Central File Exchange. If you download and uncompress this zip-file, you will have the files described below.

The routine that does the algorithmic work is solver.m, and the answer it returns is very simple: a list of ones. This means, take the amino acid vector a and string it out eastward.

function c = solver(a)
% Direction: 1 = east, i = north, -1 = west, -i = south
c = [ones(length(a)-1,1)];

A few points to keep in mind about this function and your contest entries:

  • The function must have the right signature: one input argument, one output argument. Variable names are unimportant.
    function c = solver(a)
  • The function must return a configuration vector c of the appropriate length (length(a)-1) containing only 1, -1, i, and -i values.

To test this function with the test suite in the zip-file, run runcontest.m.

 >> [output, message] = runcontest

The first column of results contains your result, and the second column gives you an estimate of the runtime. message returns a string describing the overall statistics of your submission.

It's sometimes useful to visualize what your entry is doing; you can do this by passing runcontest an input of 1.

 >> [output, message] = runcontest(1)
This provides a plot of the resulting "protein" as each test is completed.

Collaboration and editing existing entries

Once an entry has been submitted, it cannot be changed. However, any entry can be viewed, edited, and resubmitted as a new entry. You are free to view and modify any entries in the queue. The contest server maintains a history for each modified entry. If your modification of an existing entry improves its score, then you are the "author" for the purpose of determining the winners of this contest. We encourage you to examine and optimize existing entries.

We also encourage you to discuss your solutions and strategies with others. You can do this by posting to the comp.soft-sys.matlab thread that we've started from our newsreader.

Some notes on MATLAB 6.5

The version of MATLAB used for this contest, MATLAB 6.5, uses new JIT-Accelerator technology for fast execution of code. This can have a fairly dramatic effect on the speed of your code. In particular, some things, like loops, that you would have avoided in the past, might be acceptably fast now. If you are not running MATLAB 6.5, you may notice surprising differences between the runtime on your machine and the performance on the contest machine. Read this analysis for more information on what's change.

Fine Print

The allowable functions are those contained in the basic MATLAB package available in $MATLAB/toolbox/matlab, where $MATLAB is the root MATLAB directory. Functions from other toolboxes will not be available. Entries will be tested against MATLAB version 6.5 (R13).

The following are prohibited:

    Java commands or object creation
    eval, feval, etc.
    Shell escape such as !, dos, unix
    Handle Graphics commands
    ActiveX commands
    File I/O commands
    Debugging commands
    Printing commands
    Simulink commands
    Benchmark commands such as tic, toc, and flops


Check our FAQ for answers to frequently asked questions about the contest.


Lau K.F. and Dill K.A. (1989) A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 22: 3986-3997

Lau K.F. and Dill K.A. (1990) Theory for protein mutability and biogenesis. Proceedings of the National Academy of Sciences USA 87: 638-642.

About named visibility periods

Contests are divided into segments where some or all of the scores and code may be hidden for some users. Here are the segments for this contest:

  • Darkness - You can't see the code or scores for any of the entries.
  • Twilight - You can see scores but no code.
  • Daylight - You can see scores and code for all entries.
  • Finish - Contest end time.