Bioinformatics: Co-Evolution between residues and proteins

I stopped studying biology when I was 15, so my knowledge of the field has been limited to popular culture and whatever I can find the time to read in scientific journals. This is why I decided this year to go deeper by taking a class in bioinformatics.

Today I want to share with you the knowledge I gathered on a promising topic called Co-Evolution through a summary of the most important related scientific papers.

This article is written in a scientific style and requires basic knowledge of biology (DNA, proteins).

Introduction

First evoked by Darwin [1], co-evolution is defined by Yip et al. [5] as “the change of a biological object triggered by the change of a related object”.

In recent years, there has been growing interest within bioinformatics in using co-evolution to infer new data. The number of known protein sequences is growing exponentially, so the knowledge base on genetic variation is becoming more and more substantial. This surge of data is what fuels the growing interest in co-evolution.

The main challenges raised by co-evolution in bioinformatics are to efficiently detect co-evolving entities from a set of their variations, and to use this information to make predictions. In practice this means, for example, looking at a huge number of variations in the DNA sequences of living organisms and pairing up entities that may be related under an evolutionary hypothesis. We present two approaches here. The first is based on the hypothesis that interacting residues within a protein are evolutionarily correlated, and applies the results to predict the 3D structure of proteins from sequence data alone [4]. The second builds on the first and proposes that proteins within protein complexes are evolutionarily correlated, then shows how this can be used to predict the 3D structure of protein complexes [2]. Along the way, we also discuss the differences and limitations of the methods proposed in the two approaches.

Measuring co-evolution between residues

Detecting co-evolved entities is the main challenge arising from the co-evolution approach.

In [4], Marks et al. focus on the evolutionary correlation between a protein’s residues. Their first attempt at detecting possible couplings used Mutual Information (MI), a measure of the mutual dependence of two variables [3], as the co-evolution score between two residues. MI takes two positions and uses the residue frequencies observed at those positions across the sequence data to assign a co-evolution score to the pair. This method produced poor results for their application, and the local nature of MI is proposed as the source of the errors. One plausible explanation is that MI may predict a pair as evolutionarily correlated even when the two residues are not physically close, because evolutionary correlation is transitive: if A co-evolves with B and B co-evolves with C, then A and C appear correlated even though A does not directly co-evolve with C. MI may score A and C as co-evolving because it only looks at A and C when deciding. In fact, any approach that measures correlation locally is sensitive to this issue.
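To make the transitivity problem concrete, here is a minimal sketch of MI computed between two alignment columns. The function name, the toy alignment, and the use of raw frequency counts without any pseudocounts or sequence weighting are my own simplifications, not the procedure of [4].

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns.

    Each column is a sequence of residues, one per aligned sequence.
    Frequencies are plain counts; no pseudocounts or weights are applied.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        p_a = freq_a[a] / n
        p_b = freq_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: 4 sequences, 3 positions (A, B, C).
alignment = ["AKD", "AKD", "GRE", "GRE"]
columns = list(zip(*alignment))

print(mutual_information(columns[0], columns[1]))  # A vs B: ln 2, about 0.693
print(mutual_information(columns[1], columns[2]))  # B vs C: ln 2, about 0.693
print(mutual_information(columns[0], columns[2]))  # A vs C: also ln 2
```

In this toy case A and C get the same score as the two other pairs. Because MI only ever sees two columns at a time, it has no way of telling whether A and C are coupled directly or only through B.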

They therefore turned to a global approach based on maximum entropy modeling, looking for a general probability model that satisfies the following conditions [4]:

“consistency with observed data (pair and single residue frequencies)”
“maximum entropy of the global probability over the set of all possible sequences”

Using the model developed through this approach, they introduce Direct Information (DI), a coupling score that measures direct correlations.
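For readers who prefer formulas, the general shape of such a model and of the DI score can be written as below. This is the usual pairwise maximum entropy formulation as I understand it from Marks et al. [4]; the symbols are my own notation and the exact definitions should be checked against the paper.

```latex
% Global model over a full sequence (A_1, ..., A_L):
% pair couplings e_ij and single-site fields h_i, normalized by Z.
P(A_1, \dots, A_L) = \frac{1}{Z}
  \exp\Big( \sum_{1 \le i < j \le L} e_{ij}(A_i, A_j) + \sum_{i=1}^{L} h_i(A_i) \Big)

% Mutual Information uses the observed pair frequencies f_ij directly:
\mathrm{MI}_{ij} = \sum_{A_i, A_j} f_{ij}(A_i, A_j)
  \ln \frac{f_{ij}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)}

% Direct Information replaces f_ij with the direct pair probability P^dir_ij:
\mathrm{DI}_{ij} = \sum_{A_i, A_j} P^{\mathrm{dir}}_{ij}(A_i, A_j)
  \ln \frac{P^{\mathrm{dir}}_{ij}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)}
```

Here f_i and f_ij are the single and pair residue frequencies observed in the alignment, and P^dir_ij is the pair probability obtained from the coupling e_ij alone, adjusted so that its marginals match f_i and f_j. The only difference between the two scores is which pair probability goes in, which is what lets DI discard the indirect part of a correlation.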

They visually compared the contacts predicted by both approaches with known contact maps, and DI proved superior to MI at predicting contacts.

In the end, whereas the initial MI approach relies on “pair probabilities estimated based on local frequency counts”, the resulting model relies on “doubly constrained pair probabilities” [4].

Measuring co-evolution between proteins

In [2], Hopf et al. adapt the previously presented work [4]. The original probability model was inferred with a mean field approximation, whereas the new method uses pseudo-likelihood maximization; they call the resulting coupling score the EC score. They also exclude all alignment columns with more than 80% gaps and scale the weight of each sequence “to represent its cluster size in the alignment thus reducing the influence of identical or near-identical sequences in the calculation” [2].
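The two preprocessing steps can be sketched in a few lines. The 80% gap cutoff comes from the text above; the function names, the 0.8 identity threshold used to define a cluster, and the choice of weight = 1 / cluster size are assumptions of mine for illustration, not details confirmed by [2].

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.8, gap="-"):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction."""
    n = len(alignment)
    keep = [
        j
        for j, column in enumerate(zip(*alignment))
        if column.count(gap) / n <= max_gap_fraction
    ]
    return ["".join(seq[j] for j in keep) for seq in alignment]


def identity(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences agree."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def sequence_weights(alignment, identity_threshold=0.8):
    """Weight each sequence by 1 / (number of sequences in its cluster).

    A cluster is assumed to be every sequence, including the sequence
    itself, whose identity to it reaches identity_threshold.
    """
    weights = []
    for seq in alignment:
        cluster_size = sum(
            identity(seq, other) >= identity_threshold for other in alignment
        )
        weights.append(1.0 / cluster_size)
    return weights


# Hypothetical toy alignment; the third column contains only gaps.
alignment = ["AK-D", "AK-D", "GR-E", "AR-D"]
filtered = drop_gappy_columns(alignment)
weights = sequence_weights(filtered)

print(filtered)      # ['AKD', 'AKD', 'GRE', 'ARD']
print(weights)       # [0.5, 0.5, 1.0, 1.0]
print(sum(weights))  # 3.0, the effective number of sequences
```

The two identical sequences share one unit of weight between them, so the alignment counts as three effective sequences rather than four.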

But the major difference is the introduction of a new co-evolution score for evaluating inter-protein evolutionary correlation, the EVcomplex score. Whereas [4] was interested in couplings between the residues of a single protein, the interest here is in co-evolved proteins.

They make three hypotheses about the correlations:

“most pairs of positions in an alignment are not coupled, i.e., have an EC score close to zero, and tend to be distant in the 3D structure”
“the background distribution of EC scores between non-coupled positions is approximately symmetric around a zero mean”
“higher-scoring positive score outliers capture 3D proximity more accurately than lower-scoring outliers”

To filter out the noise, they scale the EC score into what they call a “raw reliability score”, then normalize it using the number of sequences in the alignment and the length of the concatenated alignment, so that the resulting EVcomplex score can be compared between protein pairs. The value of the EVcomplex score is then used to classify possible interactions between the proteins. The authors report that among pairs with a score above 0.8, 74% of the predictions were accurate to within 10 Å.
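A rough sketch of this two-step scoring is given below. The exact functional forms are my reading of [2] and should be treated as assumptions to check against the paper: the raw score divides each EC score by the magnitude of the most negative inter-protein EC score (which, under the symmetric-background hypothesis, estimates how large pure noise can get), and the normalization divides by 1 + sqrt(L / N_eff). All names are hypothetical.

```python
import math


def raw_reliability(ec_scores):
    """Scale EC scores by the magnitude of the most negative score.

    Assumes the background is symmetric around zero, so the most negative
    score serves as an estimate of the noise level.
    """
    lowest = min(ec_scores)
    if lowest >= 0:
        raise ValueError("need at least one negative EC score to estimate noise")
    return [ec / abs(lowest) for ec in ec_scores]


def evcomplex_score(raw_score, n_eff, length):
    """Normalize a raw reliability score (assumed form, see lead-in).

    n_eff  -- effective number of sequences in the alignment
    length -- length of the concatenated alignment
    """
    return raw_score / (1.0 + math.sqrt(length / n_eff))


# Hypothetical inter-protein EC scores for one protein pair.
ec_scores = [-0.1, 0.05, 0.3]
raw = raw_reliability(ec_scores)
scores = [evcomplex_score(r, n_eff=400, length=100) for r in raw]

# Keep only the pairs that pass the 0.8 threshold quoted above.
confident = [s for s in scores if s > 0.8]
print(confident)
```

With this form, an alignment with few effective sequences relative to its length pushes every score towards zero, which is what makes scores comparable between protein pairs with very different amounts of data.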

Although the experiment was a success, the method has limitations, for example a high number of false positives, which could stem from assumptions that do not always hold. The authors assume that protein interactions are conserved across species and that the genes encoding interacting proteins are always close on the genome, which might not be entirely true.

Conclusion

We presented different approaches to the challenge of detecting co-evolution. They show that co-evolution approaches have high potential for predicting biological processes.

The presented papers can be hard to follow because of their very technical writing style, but the knowledge required is close to the skill set of a general computer science student, although the background needed in probability theory and biology might make the details hard to understand. In general, the challenge of co-evolution can be framed as an efficient search in a space, the sequence space, which is a common modern computational problem.

References

[1] Charles Darwin. On the origin of species, 1859.
[2] Thomas A Hopf, Charlotta PI Schärfe, João PGLM Rodrigues, Anna G Green, Oliver Kohlbacher, Chris Sander, Alexandre MJJ Bonvin, and Debora S Marks. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife, 3:e03430, 2014.
[3] David S Horner, Walter Pirovano, and Graziano Pesole. Correlated substitution analysis and the prediction of amino acid structural contacts. Briefings in Bioinformatics, 9(1):46-56, 2008.
[4] Debora S Marks, Lucy J Colwell, Robert Sheridan, Thomas A Hopf, Andrea Pagnani, Riccardo Zecchina, and Chris Sander. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE, 6(12):e28766, 2011.
[5] Kevin Y Yip, Prianka Patel, Philip M Kim, Donald M Engelman, Drew McDermott, and Mark Gerstein. An integrated system for studying residue coevolution in proteins. Bioinformatics, 24(2):290-292, 2008.
