Hi, and welcome to my home page. I am currently a PhD candidate in the Chemical Physics Theory Group at the University of Toronto.
For more information, see the code for this project, or the preprint.
Proteins are a class of biological macromolecules which participate in almost all biological processes in living systems. Each protein must be folded correctly into its unique 3D conformation to be able to perform its task. This is achieved via a complex folding pathway, which is dictated by the protein’s primary structure and the surrounding cellular environment.
One common and well-known model frequently used to visualize the pathways available for a protein to take as it folds is the protein folding funnel. Initially, at a high free energy, the protein can take on a wide range of structural possibilities. As the free energy decreases, so does the number of available conformations. Eventually, the protein reaches the tip of the funnel at the lowest free energy. This tip represents the protein in its native state, which is stable and biologically active.
The folding funnel shows that there are a large number of pathways that can connect the unfolded state to the folded (or native) state. According to Levinthal's paradox, locating the native state of a protein via a random search through all possible configurations would take an enormously long time. In nature, however, proteins take seconds to fold, suggesting that they do not search through every configuration. Researchers in both the theoretical and experimental communities have attempted to address the mystery of how proteins fold over the last several decades. Although many models and experiments have been developed in an attempt to understand the folding process, currently there is no agreement on which explanation most accurately describes how proteins fold.
To understand why secondary structures in proteins exist, their importance in protein folding, and how they arise from an evolutionary process to minimize misfolding, I developed an event-driven program in C++ to simulate the folding of model proteins with discontinuous potentials using parallel architecture. One protein I used was crambin (pictured below), which I chose because of its small size and the broad range in secondary structures.
I used novel molecular dynamics (MD)/Monte Carlo (MC)-based techniques to estimate the configurational entropy and mean first passage times of the intermediate states of the model proteins in my project. The results from my simulations showed that short-range bonds in proteins such as those in α-helices are rapidly formed first, followed by long-range bonds, such as disulfide bridges.
A protein suspended in a viscous solvent can experience either a transition between two microstates, in which the positions of the amino acids change as a result of collisions with the solvent particles while the protein's bond pattern (the set of turned-on secondary structure bonds) remains unaffected, or a transition between two macrostates, or configurations. A configuration is defined by which bonds in the protein's secondary structure are turned on or off, since the point at which a bond forms or breaks is well-defined in a model with discontinuous potentials. The change of an amino acid's position corresponds to a fast degree of freedom, whereas a transition between two configurations is a slow degree of freedom. The Markov state model is concerned with the slow evolution of a population of configurations of a protein, which occurs at timescales longer than microstate transitions, but faster than the overall time required for a protein to reach the native state, starting from the unfolded state. Thus, the Markov state model can be used to describe the folding process as a series of transitions involving configurations which differ by one bond. The breaking or forming of each bond in the protein as it transitions between states represents a major structural change.
In a Markov state model, a population of configurations, P, evolves according to the equation
dP(t) / dt = K · P(t).
Here, K is the transition rate matrix. Each off-diagonal element in this matrix corresponds to the probability of transitioning from one state to another, where the two states differ by one bond in the secondary structure. The transition rate matrix is computationally expensive to construct for a protein in the presence of a solvent bath, whereas in the MD project described above, the transition rate matrix is much more easily obtained. To ensure that the MD project accurately predicts protein dynamics, and therefore produces a correct transition rate matrix, I designed and implemented an event-driven program in C++ to simulate protein folding in the presence of a solvent bath. The bath was based on one of three models: multi-particle collision dynamics (MPCD), the penetrating solvent model, and the hard sphere solvent model.
In MPCD, the solvent is coarse-grained to eliminate the calculation of solvent-solvent interactions. While hydrodynamic flow is present in the solvent, the collisions between the solvent particles are carried out at specific steps, rather than continuously throughout the simulation. In the penetrating solvent model, the solvent particles are simulated implicitly by specifying only the random forces that occur on the surface of the protein to mimic solvent-protein collisions. The solvent in this model excludes hydrodynamic flow. In the hard sphere model, each solvent particle is represented by a hard sphere which travels in a straight line until it encounters a bead or another solvent particle, with which it collides. This model includes hydrodynamic flow, as well as hard sphere interactions. The aim of these three solvent models is to verify the correctness of the elements of the transition rate matrix from the MD project. This project is still ongoing.
Since the simulation of model proteins with highly bonded secondary structures is computationally expensive, machine learning methods may be a more efficient way to estimate the configurational entropy and mean first passage times of proteins with complicated bonding patterns. I collaborated with, and directed an undergraduate student to use supervised machine learning methods to predict the configurational entropy of crambin's intermediate states, as part of the student's NSERC project. We generated training data using a large range of model proteins with different numbers of beads, bonds, secondary structure types, and distances at which a two beads can be considered bonded. The machine learning methods we used included multilayer perceptrons, convolutional neural nets, and graph neural nets. This project is still ongoing.