Margarita Colberg

Hi, and welcome to my home page. I was formerly a student in the Chemical Physics Theory Group at the University of Toronto, where I completed my PhD under the guidance of Prof. Jeremy Schofield.

Education

PhD Research Projects

Configurational entropy, transition rates, and optimal interactions for rapid folding in coarse-grained model proteins

For more information, see the code for this work, the publication, or the preprint.

Proteins are a class of biological macromolecules that participate in almost all biological processes in living systems. Each protein must fold correctly into a unique 3D conformation, known as the native state, to be able to perform its task. This is achieved via a complex folding pathway, which is dictated by the protein’s primary structure (a linear chain of amino acids) and by the surrounding cellular environment. The protein in its final state participates in vital processes such as regulating the speed of reactions via catalysis, transporting molecules such as oxygen, generating movement through muscle contractions, transmitting nerve impulses, and combatting pathogenic viruses and bacteria, among many others.

Investigations into the mechanism by which proteins fold began in the 1960s. In the years that followed, an array of theoretical models was developed to explain this phenomenon, such as the framework, hydrophobic collapse, and nucleation-growth models. Experimental techniques such as fluorescence spectroscopy, together with simulations based on algorithms such as molecular dynamics (MD) and run with packages such as GROMACS, were used to support or refute predictions of the protein folding mechanism. One drawback of these simulation packages and algorithms was their inability to explore more than one folding pathway at a time. To draw conclusions about the nature of a protein’s folding pathway, many trajectories must be simulated in a computationally cheap way so that the results are statistically meaningful. One efficient approach to studying the dynamics of solvated proteins is a simple coarse-grained model with step interactions, which played a central role in our work.

In this work, we developed an adaptive method based on event-driven MD in a massively parallel architecture, which was used to obtain the configurational entropy and mean first passage times of any state of a coarse-grained polypeptide. From these components, the Markov state model (MSM) was constructed. Insight into the way proteins fold relies on our ability to characterize a protein’s free energy landscape. To achieve this, we must first identify how configurations are connected in order to unravel the correct sequence of intermediate structures in a folding pathway. Investigating the conformational changes a protein undergoes while folding, whether experimentally or computationally, is challenging. Most simulations cannot reach the timescales on which proteins fold in biological systems. MSMs offer a solution: in this approach, an MD simulation is used to build a collection of short trajectories that begin at different locations in the free energy landscape and are allowed to evolve. The dynamical progress of these trajectories can be predicted using an MSM.
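As a rough illustration of the MSM idea described above, the following Python sketch estimates a transition matrix by counting jumps in short discrete-state trajectories and then propagates a population vector forward in time. This is the generic count-based construction, not the adaptive event-driven method of the paper; the function name, the `lag` parameter, and the toy trajectories are assumptions made for illustration only.

```python
import numpy as np


def estimate_transition_matrix(trajectories, n_states, lag=1):
    """Row-stochastic transition matrix from discrete-state trajectories."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        # Count every observed jump i -> j separated by `lag` steps.
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i, j] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    visited = row_sums > 0
    # States with no outgoing observations get a self-loop (assumption).
    safe_sums = np.where(visited, row_sums, 1.0)
    return np.where(visited, counts / safe_sums, np.eye(n_states))


# Hypothetical toy data: 0 = unfolded, 1 = intermediate, 2 = native.
trajectories = [[0, 0, 1, 1, 2], [0, 1, 2, 2, 2]]
T = estimate_transition_matrix(trajectories, n_states=3)

# Propagate a population that starts entirely in the unfolded state.
p0 = np.array([1.0, 0.0, 0.0])
p10 = p0 @ np.linalg.matrix_power(T, 10)
print(T)
print(p10)
```

In practice the short trajectories would be started from many different states, so that every row of the count matrix is well sampled even though no single trajectory reaches the folding timescale.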

The simplicity of the interactions in the coarse-grained representation enabled us to address how the choice of each state’s energy can optimize dynamical properties such as the average folding time of a model protein. The energy optimization procedure was demonstrated for two systems: the rapidly folding protein crambin, chosen for its small size and broad range of secondary structures, and a helical protein with metastable trapping states and qualitatively different folding behavior (see below). Our findings showed that the folding pathways of both systems consist of two regimes: the rapid establishment of local bonds, followed by the formation of more distant contacts. The state energies that lead to the most rapid folding encourage multiple pathways, and either penalize folding pathways through kinetic traps by raising the energies of trapping states, or establish an escape route from the trapping states by lowering free energy barriers to other states that rapidly reach the native state. These results should prove useful in understanding why secondary structures in proteins exist, their importance in protein folding, and how they arise from an evolutionary process to minimize misfolding.
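To make the role of the state energies concrete, here is a minimal sketch of how a set of state energies and configurational entropies fixes the equilibrium populations, assuming the standard Boltzmann weighting p_i ∝ exp(S_i/k_B - E_i/(k_B T)). It does not perform the folding-time optimization itself. The function name, the units (entropies in units of k_B, energies in units of k_B T when `beta=1`), and the numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np


def state_populations(entropies, energies, beta=1.0):
    """Equilibrium populations p_i proportional to exp(S_i - beta * E_i)."""
    log_w = np.asarray(entropies, dtype=float) - beta * np.asarray(energies, dtype=float)
    log_w -= log_w.max()  # subtract the maximum for numerical stability
    weights = np.exp(log_w)
    return weights / weights.sum()


# Hypothetical three-state example: unfolded, kinetic trap, native.
entropies = [5.0, 2.0, 0.0]   # fewer bonds -> more configurations
energies = [0.0, -3.0, -6.0]  # more bonds -> lower energy

p_before = state_populations(entropies, energies)

# Raising the energy of the trapping state lowers its population.
p_after = state_populations(entropies, [0.0, -1.0, -6.0])
print(p_before, p_after)
```

Because the step interactions make the energy constant within each state, changing a state's energy only rescales its weight, which is why the energies can be tuned without recomputing the entropies.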

Crambin ribbon model

The ball-and-stick model of the 46-residue protein crambin. The residues participating in the α-helices are in red, β-sheets are in yellow, and disulfide bridges are in blue and cyan.

Helical model

The ball-and-stick model of the 14-residue helical protein with frustration effects. Four pairs of residues participate in an α-helix in yellow, which folds in on itself due to a bond between the residues in red and cyan. This long-range interaction introduces kinetic traps during folding (defined as intermediate states whose bonding constraints prevent the formation of additional bonds required to reach the native state).

Diffusive dynamics of a model protein chain in solution

For more information, see the code for the penetrating, MPCD, or hard-sphere solvent models for this work, the publication, or the preprint.

One main shortcoming of MSMs is that the transition rate matrix, a component central to the MSM framework, applies only under the specific conditions of the dynamical system for which it was built, such as the temperature and force field. This problem can be avoided by using a diffusive MSM, which was the focus of our first paper (described in the previous section). The objective of our second paper was to verify the dynamics of the diffusive MSM using a system in which a protein is suspended in a solvent bath. Unlike the diffusive MSM, this system is computationally costly for all but small models, and requires the simulation to be restarted each time a new bonding energy or temperature is to be sampled. The transition rate matrix was built using three solvent models: the penetrating solvent model, which is implicit, and the multi-particle collision dynamics (MPCD) and hard-sphere solvent models, which are explicit. We found that the dynamics predicted by the diffusive MSM agreed quantitatively with the dynamics observed in all three solvent models, even at low densities with relatively large monomer self-diffusion coefficients. These results suggest that the internal friction arising from the elastic collisions of monomers, which are due to excluded volume and the local geometric constraints that maintain the linear chain, provides sufficient dissipation, when combined with weak solvent collisions, to establish a separation between the timescale of bond forming and breaking events and the decorrelation time of bead velocities. Furthermore, for the solvent densities studied here, hydrodynamic flow and solvent structure are not significant in determining the transition rates.
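To show how a transition rate matrix translates into folding times, the sketch below computes mean first passage times to a target state by solving the standard linear system for a continuous-time Markov chain. It assumes the convention that `K[i, j]` is the rate from state i to state j for i ≠ j and that each row sums to zero; the function name and the three-state example are hypothetical and are not taken from the paper.

```python
import numpy as np


def mean_first_passage_times(K, target):
    """Mean first passage time from every state to `target`.

    Solves sum_j K[i, j] * tau[j] = -1 for all i != target, with tau[target] = 0.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    keep = [i for i in range(n) if i != target]
    # Remove the target row and column, then solve for the remaining states.
    K_sub = K[np.ix_(keep, keep)]
    tau = np.zeros(n)
    tau[keep] = np.linalg.solve(K_sub, -np.ones(len(keep)))
    return tau


# Hypothetical linear chain 0 <-> 1 -> 2 with unit rates; state 2 is native.
K = np.array([
    [-1.0, 1.0, 0.0],
    [1.0, -2.0, 1.0],
    [0.0, 0.0, 0.0],
])
tau = mean_first_passage_times(K, target=2)
print(tau)
```

Rate matrices obtained from different solvent models can be compared through quantities like these, since two models with the same rates necessarily give the same passage times.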

Predicting the configurational entropy of a model protein chain using machine learning methods

Our final work combined the Markov state model with artificial neural networks. The exploration of the protein folding mechanism using machine learning methods first arose in the early 1990s. For nearly three decades afterward, this field saw little success, since the primitive architecture of the neural networks at the time could not address complex biological processes such as protein folding with sufficient accuracy. As the neural networks grew in depth, the quality of their predictions improved until the notable success of DeepMind's AlphaFold program during the CASP competitions of 2018 and 2020, which marked a milestone in single protein chain structure prediction. Despite this achievement, AlphaFold and its analogs do not offer insights into the dynamics of protein folding. Markov state models, in contrast, can be used to explore the folding pathways of proteins. While the adaptive event-driven sampling program outlined in our first paper can be used to obtain the differences between the configurational entropies and mean first passage times of any pair of intermediate states of a small protein fairly easily, the construction of the transition rate matrix may become too computationally intensive for proteins with many native contacts. Machine learning methods provide an alternative means of determining the entropies and mean first passage times quickly and cheaply.

In our work, two artificial neural networks (the multilayer perceptron and the convolutional neural network) were implemented to predict the entropy of each intermediate and native state of a coarse-grained model of crambin. The preliminary results of the multilayer perceptron showed promise, with roughly half of the configurational entropy estimates falling within the acceptable margin of error of 5% of those obtained using event-driven dynamics in our first paper. In contrast, the convolutional neural network performed poorly, yielding an underfitted result which we ascribed to an insufficient number of features in the training set. This work is still ongoing; the future directions we will explore include the application of graph neural networks to those features in the training data which are undirected graphs, and the creation of training data whose coarse-grained protein-like chains feature variable bond lengths. Once the biased entropies for all configurations of crambin are obtained, the predictions of the mean first passage times will follow, and the resulting transition rates can be compared to their analogs from the adaptive method described in our first paper.
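The 5% criterion above can be checked with a few lines of Python. This is a minimal sketch that assumes the margin is a relative error with respect to the reference entropies; the function name and the numbers are made up for illustration and are not results from our work.

```python
import numpy as np


def fraction_within_tolerance(predicted, reference, rel_tol=0.05):
    """Fraction of predictions whose relative error is at most `rel_tol`."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Relative error is measured against the reference value (assumption).
    rel_err = np.abs(predicted - reference) / np.abs(reference)
    return float(np.mean(rel_err <= rel_tol))


# Hypothetical entropies: network predictions vs. event-driven reference values.
predicted = [10.2, 9.0, 20.5, 30.0]
reference = [10.0, 10.0, 20.0, 40.0]
frac = fraction_within_tolerance(predicted, reference)
print(frac)
```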