CIS 1068: Homework 11

Handed out: 04/06/10; Due: by 10pm on 04/12/10
Email program to TA

Playing with DNA

We know that DNA is structured as a double helix. If we consider just one of the strands of the helix, we see that it is a sequence of elements called nucleotides. DNA has four possible nucleotides: adenine (A), guanine (G), thymine (T), and cytosine (C). Corresponding elements in the two strands are complementary: if one strand has A, the other has T, and if one has G, the other has C (and vice versa). Thus all the information in the helix is available on just one strand, and we can think of that strand as a string over the characters A, T, C, G.
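To make the complementarity rule concrete, here is a minimal sketch that derives one strand from the other. The class and method names (`StrandComplement`, `complement`) are hypothetical and not part of the assignment; the sketch also assumes that any character other than A, T, C, G is an error.

```java
class StrandComplement {
    // Returns the complementary strand: A pairs with T, G pairs with C.
    static String complement(String strand) {
        StringBuilder sb = new StringBuilder(strand.length());
        for (char c : strand.toUpperCase().toCharArray()) {
            switch (c) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'G': sb.append('C'); break;
                case 'C': sb.append('G'); break;
                default:
                    // Assumption: anything else is invalid input.
                    throw new IllegalArgumentException("Not a nucleotide: " + c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(complement("ATGC")); // prints TACG
    }
}
```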

Let's not worry about the fact that DNA is first transcribed into RNA, or about the other mechanisms involved, before it can be used to synthesize proteins. Instead, let's assume that DNA is used directly to specify proteins.
It goes as follows: proteins are sequences of amino acids. In our discussion we assume a total of 20 possible amino acids, and that each amino acid is identified by a letter. Thus a protein can be seen as a string over these 20 letters.
A sequence of 3 consecutive nucleotides is called a codon. Codons map to amino acids as indicated in the attached table. [A possible use of the information in that table is the following Java variable:

    private static final String[][] CODON_AMINO =
        {
          {"att", "i"}, {"atc", "i"}, {"ata", "i"}, {"ctt", "l"},
          {"ctc", "l"}, {"cta", "l"}, {"ctg", "l"}, {"tta", "l"},
          {"ttg", "l"}, {"gtt", "v"}, {"gtc", "v"}, {"gta", "v"},
          {"gtg", "v"}, {"ttt", "f"}, {"ttc", "f"}, {"atg", "m"},
          {"tgt", "c"}, {"tgc", "c"}, {"gct", "a"}, {"gcc", "a"},
          {"gca", "a"}, {"gcg", "a"}, {"ggt", "g"}, {"ggc", "g"},
          {"gga", "g"}, {"ggg", "g"}, {"cct", "p"}, {"ccc", "p"},
          {"cca", "p"}, {"ccg", "p"}, {"act", "t"}, {"acc", "t"},
          {"aca", "t"}, {"acg", "t"}, {"tct", "s"}, {"tcc", "s"},
          {"tca", "s"}, {"tcg", "s"}, {"agt", "s"}, {"agc", "s"},
          {"tat", "y"}, {"tac", "y"}, {"tgg", "w"}, {"caa", "q"},
          {"cag", "q"}, {"aat", "n"}, {"aac", "n"}, {"cat", "h"},
          {"cac", "h"}, {"gaa", "e"}, {"gag", "e"}, {"gat", "d"},
          {"gac", "d"}, {"aaa", "k"}, {"aag", "k"}, {"cgt", "r"},
          {"cgc", "r"}, {"cga", "r"}, {"cgg", "r"}, {"aga", "r"},
          {"agg", "r"}
        };

] A specific codon, ATG, is called the start codon: the translation from DNA to protein starts at such a codon. Three codons, TAA, TAG, and TGA, are called stop codons. The definition of a protein (a gene) starts just after a start codon and ends just before the first following stop codon for which the enclosed sequence has a length that is a multiple of 3 nucleotides. Neither the start codon nor the stop codon is part of the protein.
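One way to use a codon table of this shape is to load it into a map and translate three nucleotides at a time. The sketch below is only an illustration: it carries a four-entry excerpt of the table (a real program would list all 61 pairs), and the names `CodonLookup`, `AMINO_BY_CODON`, and `translate` are hypothetical. It assumes the input length is a multiple of 3 and lowercases the input because the table keys are lowercase.

```java
import java.util.HashMap;
import java.util.Map;

class CodonLookup {
    // Excerpt only: the full table has 61 codon/amino-acid pairs.
    private static final String[][] CODON_AMINO = {
        {"ccc", "p"}, {"aat", "n"}, {"agg", "r"}, {"att", "i"}
    };

    // Codon -> amino acid letter, built once from the array above.
    private static final Map<String, String> AMINO_BY_CODON = new HashMap<>();
    static {
        for (String[] pair : CODON_AMINO) {
            AMINO_BY_CODON.put(pair[0], pair[1]);
        }
    }

    // Translates a nucleotide string codon by codon.
    // Assumption: a trailing partial codon, if any, is ignored.
    static String translate(String nucleotides) {
        String lower = nucleotides.toLowerCase();
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= lower.length(); i += 3) {
            String codon = lower.substring(i, i + 3);
            String amino = AMINO_BY_CODON.get(codon);
            if (amino == null) {
                throw new IllegalArgumentException("Unknown codon: " + codon);
            }
            protein.append(amino);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("CCCAATAGG")); // prints pnr
    }
}
```

A map gives constant-time lookup per codon; a linear search through the array would also work for a table this small.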

You are to write a program that is given, as a command-line parameter, the name of a file containing DNA information as a string (here is an example of such a file). The string may be broken into multiple lines and may contain spaces. You should ignore such line breaks and spaces. You should:
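As a rough sketch of the input step, the following reads the file named on the command line and removes all whitespace. The names `DnaReader`, `clean`, and `readDna` are hypothetical, and converting the input to upper case is an assumption made here so that later comparisons against ATG, TAA, TAG, and TGA are straightforward.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class DnaReader {
    // Removes spaces, tabs, and line breaks.
    // Assumption: the result is normalized to upper case.
    static String clean(String raw) {
        return raw.replaceAll("\\s+", "").toUpperCase();
    }

    // Reads the whole file and returns the cleaned DNA string.
    static String readDna(String fileName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return clean(new String(bytes));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DnaReader <dna-file>");
            return;
        }
        System.out.println(readDna(args[0]).length() + " nucleotides read");
    }
}
```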

  1. Write to a new file, say proteins.txt, the proteins that are defined in the given file. A protein will just be a string of the single letter codes of its constituent amino acids. This string should be broken across lines to make sure that no line has more than 70 characters. Proteins will be separated by blank lines.
  2. Print to the screen, for each amino acid found in the output file, its total number of occurrences, both as an absolute count and as a percentage of all the amino acids in the output proteins. Also list each codon that encoded that amino acid, together with the number of times it did so (remember, an amino acid may be identified by more than one codon).
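The two output steps above could be sketched as follows: one helper breaks a protein into lines of at most 70 characters, and another tallies amino acid letters and prints counts with percentages. All names here (`ProteinReport`, `wrap`, `countAminoAcids`, `printReport`) are hypothetical, and the per-codon breakdown required in step 2 is deliberately left out of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ProteinReport {
    static final int MAX_LINE = 70;

    // Splits a protein into consecutive pieces of at most MAX_LINE characters.
    static List<String> wrap(String protein) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < protein.length(); i += MAX_LINE) {
            int end = Math.min(i + MAX_LINE, protein.length());
            lines.add(protein.substring(i, end));
        }
        return lines;
    }

    // Counts how often each amino acid letter occurs across all proteins.
    static Map<Character, Integer> countAminoAcids(List<String> proteins) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String protein : proteins) {
            for (char c : protein.toCharArray()) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Prints one line per amino acid: letter, absolute count, percentage.
    static void printReport(List<String> proteins) {
        Map<Character, Integer> counts = countAminoAcids(proteins);
        int total = 0;
        for (int n : counts.values()) {
            total += n;
        }
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            double percent = 100.0 * e.getValue() / total;
            System.out.printf("%c: %d (%.1f%%)%n", e.getKey(), e.getValue(), percent);
        }
    }
}
```

To extend this toward the codon breakdown, one option is a second map keyed by codon that is updated at translation time, when both the codon and its amino acid are known.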

Notice that if we had the DNA string ATGCCCAATAGGTAG, the substring between the start codon ATG and the first stop codon TAG would be CCCAA. Its length, 5, is not a multiple of 3, so it is not a whole number of codons. In this case we continue to the next stop codon, which gives CCCAATAGG. Its length, 9, is a multiple of 3, so this is what we have defined as a protein.
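The rule amounts to stepping forward from the start codon three nucleotides at a time and stopping at the first stop codon found that way, since any stop codon skipped along the way is out of frame. Below is a minimal sketch of that scan. The names `GeneFinder`, `isStop`, and `findGenes` are hypothetical. The sketch also makes three assumptions the text does not settle: the input is already upper case with whitespace removed, the search for the next start codon resumes after the stop codon (so genes do not overlap), and empty genes are skipped.

```java
import java.util.ArrayList;
import java.util.List;

class GeneFinder {
    static boolean isStop(String codon) {
        return codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA");
    }

    // Returns the nucleotide sequences found between start and stop codons.
    static List<String> findGenes(String dna) {
        List<String> genes = new ArrayList<>();
        int start = dna.indexOf("ATG");
        while (start >= 0) {
            int bodyStart = start + 3;
            int stop = -1;
            // Step codon by codon so the enclosed length is a multiple of 3.
            for (int j = bodyStart; j + 3 <= dna.length(); j += 3) {
                if (isStop(dna.substring(j, j + 3))) {
                    stop = j;
                    break;
                }
            }
            if (stop < 0) {
                // No usable stop codon for this start; try the next ATG.
                start = dna.indexOf("ATG", start + 1);
            } else {
                if (stop > bodyStart) { // assumption: skip empty genes
                    genes.add(dna.substring(bodyStart, stop));
                }
                // Assumption: resume the search after the stop codon.
                start = dna.indexOf("ATG", stop + 3);
            }
        }
        return genes;
    }

    public static void main(String[] args) {
        System.out.println(findGenes("ATGCCCAATAGGTAG")); // prints [CCCAATAGG]
    }
}
```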

Send to the TA a case analysis for this problem: problem statement, analysis, design, implementation, and testing.