Haemophilus influenzae Questions

Describe three reasons why the authors chose to sequence H. influenzae as the genome of the first cellular organism.

First, the authors note that the size of this microbe’s genome is typical for bacteria, meaning that if their method can successfully sequence this organism, many more bacterial species could be sequenced the same way. Second, they note that the G+C content of the organism is comparable to that of the human genome, suggesting that the method could be used, at least in part, to sequence the human genome. Third, they note that no physical clone map of this bacterium existed, which means the organism could not have been sequenced at the time by other generally accepted methods.

Why do you think the authors chose to sequence two different size classes, 2 kb and 15-20 kb?

Choosing these two sizes gave the authors a two-pronged approach to covering the full genome. They note that they chose small, 2 kb inserts to avoid capturing a complete gene, which was likely to be expressed and therefore deleterious to the E. coli cells in which the plasmid was grown. This means that the small-insert library was likely to miss gene-dense regions. To cover those regions, they took the alternative approach of building a library of larger lambda fragments, which did not previously exist for this organism, to supplement the plasmid library and cover these difficult-to-sequence regions.

Using Table 2, calculate the coverage achieved in this project.

Genome size (bp): 1,830,137

Successful forward sequencing reactions: 16,240

Average forward sequencing length (bp): 485

Successful reverse sequencing reactions: 7,744

Average reverse sequencing length (bp): 444

Number of base pairs in random assembly: 11,631,485

Total number of base pairs sequenced: 16,240*485 + 7,744*444 + 11,631,485 = 22,946,221

Total coverage: 22,946,221/1,830,137 = 12.54

The total coverage of this approach was therefore 12.54x. However, if we consider only the forward and reverse shotgun sequencing reads (16,240*485 + 7,744*444 = 11,314,736 bp), this number drops to 6.18x.
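The arithmetic above can be checked with a short script. This is only a sketch that reuses the Table 2 figures quoted in this answer; the variable names are my own.

```python
# Figures quoted above from Table 2
genome_size = 1_830_137
forward_reads, forward_len = 16_240, 485
reverse_reads, reverse_len = 7_744, 444
random_assembly_bp = 11_631_485

# Base pairs contributed by the forward and reverse reads alone
read_bp = forward_reads * forward_len + reverse_reads * reverse_len

# Total as computed in this answer (reads plus random assembly)
total_bp = read_bp + random_assembly_bp

total_coverage = total_bp / genome_size
read_coverage = read_bp / genome_size

print(f"Total bp sequenced: {total_bp:,}")        # 22,946,221
print(f"Total coverage: {total_coverage:.2f}x")   # 12.54x
print(f"Read-only coverage: {read_coverage:.2f}x")  # 6.18x
```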

Using the Lander-Waterman coverage calculator (Table 1 HERE), find the percentage of the H. influenzae genome that was sequenced by random shotgun sequencing.

According to the table in the provided link, the percentage of a clone covered by sequence is 99.75% when the sequencing coverage is about 6x. This means that the random shotgun sequencing covered 99.75% of the clones. Assuming a truly random distribution of clones, this suggests that 99.75% of the genome was covered. However, the authors noted that these clones were not truly random and were less likely to contain gene-dense regions. Therefore, 99.75% is an upper bound and is likely an overestimate.
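The 99.75% figure can be reproduced from the usual Poisson approximation behind Lander-Waterman tables, in which the expected fraction of bases covered at coverage c is 1 - e^(-c). I am assuming the linked table uses this formula; the function name below is hypothetical.

```python
import math

def fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read,
    assuming reads land uniformly at random (Poisson model)."""
    return 1.0 - math.exp(-coverage)

# At roughly 6x coverage, as read from the table
print(f"{100 * fraction_covered(6):.2f}%")  # 99.75%
```

Under the same assumption, the bases expected to be missed at 6x would be about e^(-6) times the genome size, which is why some gaps remain even at high coverage.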

In your own words, explain why a region of a genome can be translated conceptually in 6 different ways.

Translation proceeds in codons, which are 3 bp stretches of sequence, and each codon corresponds to a specific amino acid. In the stretch of DNA “GATTACA”, there are five overlapping triplets: “GAT,” “ATT,” “TTA,” “TAC,” and “ACA.” When translation occurs, the machinery reads codons only in non-overlapping three-base chunks: it reads the first three bases, then jumps three bases ahead and reads the next, rather than jumping only one base ahead. This means that if the machinery started at “GAT,” the next codon it would translate is “TAC.” This is important, because it means that whichever base the machinery starts on determines the set of codons to be translated. If instead of starting at “G,” the machinery had started at the first “A,” the first two codons would be “ATT” and “ACA.” Starting at the first “T” would give “TTA” as the first codon. Following this logic, we can see that for any stretch of DNA there are three possible translations, determined by which base the machinery starts on. Starting on the 4th base of the sequence would be the same as starting on the 1st base with the first codon skipped.

This number is doubled from 3 to 6 because every sequence has a reverse complement. The machinery could be on the opposite strand of the DNA, and thus translating a different set of codons as determined by the base-pairing rules. It would also be starting from the opposite end of the sequence.
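The six frames can be listed explicitly for the “GATTACA” example. This is a minimal sketch with function names of my own choosing; it only splits the sequence into codons and does not translate them.

```python
def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def codons(seq, offset):
    """Split seq into complete, non-overlapping codons starting at offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Map (strand, frame number) to the codons read in that frame."""
    frames = {}
    for strand, s in (("forward", seq), ("reverse", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset + 1)] = codons(s, offset)
    return frames

for frame, triplets in six_frames("GATTACA").items():
    print(frame, triplets)
# ('forward', 1) ['GAT', 'TAC']
# ('forward', 2) ['ATT', 'ACA']
# ('forward', 3) ['TTA']
# ('reverse', 1) ['TGT', 'AAT']
# ('reverse', 2) ['GTA', 'ATC']
# ('reverse', 3) ['TAA']
```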

Which of the 6 reading frames do you think is the correct reading frame for the protein encoded in this region?

I believe that “5’3’ Frame 3” is correct, because it contains a single, large protein rather than several smaller proteins.
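The reasoning “pick the frame with one long protein” can be sketched as a search for the longest stretch without a stop codon across all six frames. This is an illustrative heuristic of my own, not the tool used for the assignment: it uses the standard genetic code, expects uppercase A/C/G/T only, does not require a start codon, and the toy sequence is made up.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(seq):
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def translate(seq, offset):
    """Translate complete codons of seq starting at offset."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(offset, len(seq) - 2, 3))

def longest_open_stretch(protein):
    """Longest run of amino acids between stop codons."""
    return max(protein.split("*"), key=len)

def best_frame(seq):
    """Return (frame label, peptide) for the longest stop-free stretch."""
    candidates = []
    for strand, s in (("5'3'", seq), ("3'5'", reverse_complement(seq))):
        for offset in range(3):
            stretch = longest_open_stretch(translate(s, offset))
            candidates.append((f"{strand} Frame {offset + 1}", stretch))
    return max(candidates, key=lambda c: len(c[1]))

# Toy sequence (hypothetical, not the assignment's region)
print(best_frame("ATGCTAGCCAAAGGT"))  # ("5'3' Frame 1", 'MLAKG')
```

A long stop-free stretch is only suggestive; a similarity search against known proteins is the stronger check.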

Paste the amino acid sequence of the protein encoded in this region from the reading frame you selected in the previous question.


Let’s try to identify this protein. Copy the amino acid sequence from Question 7 and go to the NCBI Blast:Protein search tool. This tool will search your amino acid sequence against a database of known proteins to try to identify it by sequence similarity. Paste the amino acid sequence into the large box and press the “Blast” button. Wait while the database is searched.

After performing the search as instructed, I received the prediction of hemoglobin subunit beta (also called beta-globin, depending on the species).