Protein Folding
Despite the explosion in protein structural and functional data, our understanding of protein and movement is still very limited. Experimental methods cannot operate at the time scales necessary to record protein folding and motions, and traditional simulation techniques such as molecular dynamics and Monte Carlo methods are too computationally expensive to simulate long enough time periods for anything other than small peptide fragments. However, it is critical that we better understand protein motion and the folding process for several reasons: understanding the folding process can give insight into how to develop better structure prediction algorithms, treatments for diseases such as Alzheimer’s and Mad Cow disease can be found by studying protein misfolding, and many biochemical processes are regulated by protein motion. Predicting protein structures and simulating protein folding motions are two of the most important problems in computational biology today. Modern folding simulation methods rely on a scoring function, which attempts to distinguish the native structure (the most energetically stable 3D structure) from one or more non-native structures. Decoy databases are collections of non-native structures that are widely used to test and verify these scoring functions.
Protein Folding
Our technique for computing protein folding pathways and protein motions is based on the successful probabilistic roadmap (PRM) method for robotics motion planning. We were inspired to apply PRMs to protein folding based on our sccess in applying it to paper folding. The figure below demonstrates the parallelism between paper folding and protein folding.

We have obtained promising results for several small proteins (about 60–100 amino acids), and we have validated our pathways by comparing their secondary structure formation order with known experimental results, examining folding rates, and studying population kinetics. A major feature of our technique is that it produces many folding pathways in just a few hours on a desktop PC. These pathways provide global information about the protein’s energy landscape.

Native 3D structures of proteins G, L, NuG1, and NuG2. Mutated residues in NuG1 and NuG2 are indicated in wireframe.
All four proteins are composed of an alpha-helix (green) and two beta-hairpin 1 (orange) and hairpin 2 (blue). Native state out-exchange experiments, pulse labeling/competition experiments, and phi-value analysis indicate that hairpin 1 forms first in protein L and hairpin 2 forms first in protein G. Protein G was then mutated by submitting a few residues in hairpin 2 to decrease its relative stability. Phi-value analysis indicates that the mutations switch the hairpin formation order from the wildtype to that of the protein L. We were able to successfully replicate these results with our technique.

Figure: Comparison of secondary structure formation order for proteins G, L, NuG1, and NuG2 with experimental results.
In addition to detecting the correct folding behavior of each protein, our rigidity-based technique can also help explain the stability shift in NuG1 and NuG2 from the wildtype.

A rigidity map is similar to a contact map except residues pairs are marked according to their rigidity relationship: black if they are in the same rigid component, green if they are in the same dependently flexible set, and white otherwise. In all four proteins, the central alpha-helix remains completely rigid, as expected. We also increased rigidity from our protein G to NuG1 and NuG2 in hairpin 1 as suggested by experiment. For the other regions, the rigidity maps show more similarity between the mutated forms of protein G and protein L than with protein G, reflecting the similarity/difference in folding behaviors. Our roadmaps provide an approxiamte view of a protein’s folding landscape. In our recent work, we have explored various methodologies for studying the kinetics of protein folding in these approximate models. Our two methodologies, map-based Master Equation (MME) and map-based Monte Carlo (MMC), allow us to study two important features of protein folding kinetics: relative folding rate and population kinetics. While master equation calculation and Monte Carlo simulation have been performed on protein models before, their computational cost have prohibitied them from being used for large protein models. An important benefit of our map-based apporximation approach is that it enables us to study the kinetics of much larger proteins than can be handled by traditional master equation solution or Monte Carlo simulation. We can use the MME approach to study relative folding rate as extracted from our roadmaps. For this study, we looked at Protein G and its variants NuG1 and NuG2. Experimental studies have shown that these two mutants folds 100 times faster than protein G. We applied the MME approach to study the relative folding rates of these proteins.

In the figure above, the simulated folding rates are shown for Protein G and its mutants NuG1 and NuG2. The plotted eigenvalues were calculated using MME. Note that the smalled non-zero eigenvalues correspond to the folding rate (index 2). Our simulations show that Protein G folds much slower than its mutants NuG1 and NuG2. This is what has been seen in lab experiments. Also shown above is the performance of MME roadmaps ranging in size from 2000 to 15000 nodes. The running time of MME scales linearly with roadmap size (i.e. the size of the landscape model).

Displayed in the figure above is the 86 residue all alpha protein, Acyl-coenzyme A Binding Protein (ACBP), we have studied through the MMC technique. This protein has five helices and two tryptophans in the core of the protein. The folding of ACBP has been studied in the lab thourgh tryptophan fluorescence, and it has been shown that it has a fast, two-state folder. From our MMC kinetics, we see that ACBP exhibits a property similar to otehr all alpha proteins we have studied: continual formation of helix contacts and reaching the native state after teh formation of many helix contacts. Also, since ACBP has two tryptophans in the core of the protein, we see a quick increase in the formation of these contacts around the same time we see the native state beginning to be populated, around time step 100. This could correspond to the packing of the structure and the formation of long-range interactions in the core of the protein. We also show this protein has similar kinetics to other all alpha-proteins and other lab experiments on alpha proteins. Through the technique we have been able to study population kinetics for proteins of various lengths and mixed structures. We have also used our roadmaps, or approxiamte landscape models, to simulate relative hydrogen exchange rates and infer the folding core. The folding core is the set of residues that are the first to form structure during folding and the last to lose structure during denaturation. One of the most prevalent experimental methods to infer the folding core is hydrogen exchange. It measures the rate at which residues exchange their hydrogens for Deuterium during either folding or unfolding. We use rigidity analysis to examine the folding pathways on our roadmaps (typically extracted by MMC) and compute an approxiamte exchange rate for each residue. We developed two approaches for approximating exchange: one looks at the indiviual residue’s rigidity/flexibility and one looks at the formation of mutually rigid, non-local subsequences. We found good correlation between the residues identified experimentally as part of the folding core and residues we identify with slow exchange rates:

mRRG: A Multi-Directional Rapidly Exploring Random Graph for Protein Folding
Robotic motion planning algorithms, such as Rapidly Exploring Random Trees (RRTs), have been successful in simulating protein folding pathways. We developed a novel variant, multi-directional Rapidly Exploring Random Graph (mRRG), specifically tailored for proteins. Unlike traditional RRGs which only expand a parent conformation in a single direction, our strategy expands the parent conformation in multiple directions to generate new samples. Resulting samples are connected to the parent conformation and its nearest neighbors. By leveraging multiple directions, mRRG can model the protein motion landscape with reduced computational time compared to several other robotics-based methods for mall to moderate-sized proteins. Our results on several proteins agree with experimental hydrogen out-exchange, pulse-labeling, and phi-value analysis. We also demonstrated that mRRG covers the conformation space better as compared to the other computational methods.

Improving Decoy Databases for Protein Folding Algorithms
Predicting protein structures and simulating protein folding motions are two of the most important problems in computational biology today. Modern folding simulation methods rely on a scoring function, which attempts to distinguish the native structure (the most energetically stable 3D structure) from one or more non-native structures. Decoy databases are collections of non-native structures that are widely used to test and verify these scoring functions. We present a method to evaluate and improve the quality of decoy databases by adding novel structures and/or removing redundant structures. We test our approach on decoy databases of varying size and type and show significant improvement across a variety of metrics. Most improvement comes from the addition of novel structures indicating that our improved databases have more informative structures that are more likely to fool scoring functions. This work can aid the development and testing of better scoring functions, which in turn, will improve the quality of protein folding simulations. We apply our methods to existing decoy sets and show that our algorithms are able to generate sets with lower energies and more diverse structures that are more likely to fool scoring functions of protein folding algorithms. All decoy sets were obtained from the Decoys ‘R’ Us database and are listed in the following table.

Because our methods improve existing decoy sets, we first develop strategies for analyzing the quality of decoy sets. We present several quantitative metrics to compare decoy sets and describe how their values are calculated in the experiments. Z-Score - The z-score (or standard score) indicates the number of standard deviations between the native structure energy adn teh average energy of a decoy set. MinDist - The minimum distance metric measures the average minimum distance from each decoy structure to any other decoy structure in the set. Improvement - Given an original decoy set and an improved decoy set, the improvement score returns the change in z-score per sample between the two sets. There are two main phases in the improvement of decoy sets. First, samples are generated on the protein’s energy landscape. Second, in the decoy selection phase, some structures are chosen from the original set D to be removed and some are chosen from the sample set S to be added. The original decoy set D and the sample set S can be broken down into four subsets: redundant decoy structures DD from D, viable decoy structures DV from D, redundant sampled structures SD from S, and viable sampled structures SV from S. The following figure shows the relationship between these four subsets.

The next three figures summarize the resulting z-score, improvement score, and minimum distance value for each protein studied. For each metric, we show the contribution from each operation (removing redundant decoys (DV), adding new samples (D U SV) and from their combination (DV U SV).

When the z-score approached zero, the native structure energy is harder to distinguish among the energies of the other structures in the set. For every protein in this figure, the z-scores of D and DV are very similar. Thus, simply removing structures does not greatly impact the z-score. However, once we add new structures from our sampling approach (D U SV), the z-score drops drastically with comparable z-scores to the final set (DV U SV). Therefore, the main contributors to z-score improvement are the structures generated by our sampling approach.

Recall that the improvement score shows the change in z-score per sample between two sets. A higher value indicates that the change (either structure addtion, removal, or both) has a greater impact on the z-score. This figure displays the improvement scores across all test proteins. We again see that adding structures provide a decoy set with better quality than simply removing redundant structures. Proteins 1ash and 1gdm with the smallest original sets show the largest improvement scores.

The last metric we examine is the minimum distance between neighboring structures which indicates how varied the structures. A larger distance signifies greater structural diversity and implies a greater ability to fool different scoring functions. This figure shows how this metric changes for each operation. As expected, when decoys are removed (D), the minimum distance increases, and when V adding decoys (D U SV), the minimum distance decreases. For all protein studied, the final decoy set (DV U SV) has smaller minimum distance than the original set (D) yielding a set with greater diversity.
Publications
- Ekenna, C. , Thomas, S. , & Amato, N.M. (2016). Adaptive Local Learning in Sampling Based Motion Planning for Protein Folding. BMC Syst Biol, Special Issue from the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) , 10(49) , 165--179. https://doi.org/10.1186/s12918-016-0297-9
- Yeh, H.(. , Lindsey, A. , Wu, C. , Thomas, S. , & Amato, A.N.M. (2015). Decoy Database Improvement for Protein Folding. Journal of Computational Biology , 22(9) , 823-836. https://doi.org/10.1089/cmb.2015.0116
- Thomas, S. , Tapia, L. , Ekenna, C. , Yeh, H.(. , & Amato, N.M. (2013). Rigidity Analysis for Protein Motion and Folding Core Identification. Association for the Advancement of Artificial Intelligence (AAAI) Workshop , WS-13-06 , 38-43. View publication
- Nath, S. , Thomas, S. , Ekenna, C. , & Amato, N.M. (2012). A Multi-Directional Rapidly Exploring Random Graph (mRRG) for Protein Folding. Proc. ACM Conference on Bioinformatics, Computational Biology and Biomedicine (BCB) , 44-51. https://doi.org/10.1145/2382936.2382942
- Tapia, L. , Thomas, S. , & Amato, N.M. (2010). A Motion Planning Approach to Studying Molecular Motions. Communications in Information and Systems , 10 , 52-68. https://doi.org/10.4310/cis.2010.v10.n1.a4
- Tapia, L. , Tang, X. , Thomas, S. , & Amato, N.M. (2007). Kinetics Analysis Methods for Approximate Folding Landscapes. Bioinformatics , 23(13) , i539--i548. https://doi.org/10.1093/bioinformatics/btm199
- Thomas, S. , Tanase, G. , Dale, L.K. , Moreira, J.E. , Rauchwerger, L. , & Amato, N.M. (2005). Parallel Protein Folding with STAPL. Concurrency and Computation: Practice and Experience , 17(14) , 1643-1656. https://doi.org/https://doi.org/10.1002/cpe.950
- Thomas, S. , Song, G. , & Amato, N.M. (2005). Protein Folding by Motion Planning. Physical Biology , 2(4) , S148--S155. https://doi.org/10.1088/1478-3975/2/4/S09
- Amato, N.M. , Dill, K.A. , & Song, G. (2004). Using Motion Planning to Map Protein Folding Landscapes and Analyze Folding Kinetics of Known Native Structures. Journal of Computational Biology , 10(3-4) , 239-255. https://doi.org/10.1089/10665270360688002
- Thomas, S. & Amato, N.M. (2004). Parallel Protein Folding with STAPL. 18th International Parallel and Distributed Processing Symposium (IPDPS) , 189-. https://doi.org/10.1109/IPDPS.2004.1303204
- Song, G. , Thomas, S. , Dill, K.A. , Scholtz, J.M. , & Amato, N.M. (2003). A Path Planning-based Study of Protein Folding With a Case Study of Hairpin Formation in Protein G and L. In Proc. Pac. Symp. of Biocomputing (PSB) , 8 , 240-251. View publication
- Song, G. & Amato, N.M. (2001). A motion planning approach to folding: from paper craft to protein folding. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA) , 1 , 948-953 vol.1. https://doi.org/10.1109/ROBOT.2001.932672