[Openchemistry-developers] GSOC Open Babel project 2018

Geoffrey Hutchison geoff.hutchison at gmail.com
Thu Mar 8 22:11:35 EST 2018


> I have recently studied Bayesian theory, parametric and multivariate methods, and dimensionality reduction by PCA and LDA in a Pattern Recognition and ML course, which I think will be quite helpful.
> Further, I would like to know how to get started with the proposal, and whether I should fix some bugs, contact a mentor directly, or do anything else specific.

There's more information on submitting a proposal here:
http://wiki.openchemistry.org/Applying_to_GSoC

I'll give you a summary, but part of your job in the application is to consider how the project would work. The summary is highly generic, since several students have inquired, so apologies if some of this includes things you know.

Finding a distribution of molecular conformer geometries relies on sampling from multiple, possibly correlated degrees of freedom, typically different dihedral angles. We often have initial prior beliefs about the dihedral angles (e.g., from likely angles in crystal structures or other calculations), but in particular molecules those beliefs may be far from optimal. Consider, for example, biphenyl: the preferred dihedral for a bond between sp2-hybridized carbons is typically planar, but the molecule has an optimal angle of about +/- 45 degrees.
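
To make the "prior belief" idea concrete, here is a minimal sketch, assuming a von Mises (circular normal) prior on a single dihedral. The prior center (0 degrees, i.e. planar) and the concentration value are illustrative assumptions, not fitted to any crystal-structure data, and the helper name von_mises_pdf is hypothetical.

```python
import numpy as np

def von_mises_pdf(theta, mu, kappa):
    """Density of a von Mises distribution on the circle (angles in radians)."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))

rng = np.random.default_rng(0)

# Assumed prior: a planar dihedral (0 degrees) with moderate concentration.
prior_mu = 0.0
prior_kappa = 8.0

# Draw dihedral samples from the prior; values lie in [-pi, pi].
samples = rng.vonmises(prior_mu, prior_kappa, size=10000)

# How much of the prior mass falls within 10 degrees of +/- 45 degrees?
abs_deg = np.abs(np.rad2deg(samples))
frac_near_45 = float(np.mean(np.abs(abs_deg - 45.0) < 10.0))

# Prior density at 45 degrees relative to the density at the prior mode.
density_ratio = float(
    von_mises_pdf(np.deg2rad(45.0), prior_mu, prior_kappa)
    / von_mises_pdf(prior_mu, prior_mu, prior_kappa)
)
```

With a prior like this, only a small share of the samples land near +/- 45 degrees, which is the biphenyl problem in miniature: sampling from the prior alone would rarely visit the region that matters.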

Bayesian optimization is a technique for efficiently sampling expensive, unknown black-box functions. It does not require derivatives, although it can use them when they are available. It works well on problems with an intermediate number of degrees of freedom (e.g., fewer than ~30) by using different acquisition functions to balance exploration of under-sampled space against exploitation of existing knowledge.
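
As a rough illustration of that loop, here is a minimal sketch in plain NumPy, assuming a single dihedral, a made-up torsion potential, a periodic Gaussian process kernel with fixed hyperparameters, and expected improvement as the acquisition function. None of the function names come from Open Babel or from any of the packages mentioned in this message; they are all hypothetical.

```python
import math
import numpy as np

def toy_torsion_energy(theta):
    # Hypothetical 1-D torsion potential (arbitrary units);
    # minima at +/-45 and +/-135 degrees.
    return np.cos(4.0 * theta)

def periodic_kernel(a, b, length_scale=0.3):
    # Exp-sine-squared kernel with period 2*pi, so that -180 and +180
    # degrees are treated as the same angle.
    d = a[:, None] - b[None, :]
    return np.exp(-2.0 * np.sin(d / 2.0) ** 2 / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-5):
    # Zero-mean GP posterior mean and standard deviation at x_new.
    K = periodic_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    Ks = periodic_kernel(x_obs, x_new)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

_erf = np.vectorize(math.erf)

def expected_improvement(mean, std, best):
    # Expected improvement for minimization.
    imp = best - mean
    z = imp / std
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return imp * cdf + std * pdf

def bayes_opt(energy, n_init=4, n_iter=25, seed=0):
    rng = np.random.default_rng(seed)
    # Candidate dihedrals on a 1-degree grid.
    candidates = np.deg2rad(np.arange(-180.0, 180.0, 1.0))
    x = rng.uniform(-np.pi, np.pi, size=n_init)
    y = energy(x)
    for _ in range(n_iter):
        mean, std = gp_posterior(x, y, candidates)
        ei = expected_improvement(mean, std, y.min())
        x_next = candidates[np.argmax(ei)]  # most promising angle
        x = np.append(x, x_next)
        y = np.append(y, energy(x_next))
    return x, y

angles, energies = bayes_opt(toy_torsion_energy)
best_angle_deg = float(np.rad2deg(angles[np.argmin(energies)]))
best_energy = float(energies.min())
```

In a real project the toy potential would be replaced by a force field energy evaluation, the grid by a proper optimizer over several dihedrals, and the kernel hyperparameters would be fitted rather than fixed.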

The project would require coding a Bayesian optimization strategy for sampling molecular conformations, using force fields or other free energy calculations. A key test would be to demonstrate good speed and accuracy (e.g., finding 'good conformers' in a reasonable number of energy evaluations). Part of the project would likely include encoding prior distributions, testing different acquisition strategies, etc. Another key component would be evaluating different types of Gaussian process (GP) kernels.
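
One simple way to compare GP kernels is the log marginal likelihood of the same data under each kernel. The sketch below assumes toy energies from a made-up torsion potential, fixed (unfitted) length scales, and two hypothetical kernel helpers; it is meant only to show the shape of such a comparison, not a recommended protocol.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    # Squared-exponential kernel; ignores the periodicity of angles.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * d ** 2 / length_scale ** 2)

def periodic_kernel(a, b, length_scale=0.3):
    # Exp-sine-squared kernel with period 2*pi.
    d = a[:, None] - b[None, :]
    return np.exp(-2.0 * np.sin(d / 2.0) ** 2 / length_scale ** 2)

def log_marginal_likelihood(kernel, x, y, jitter=1e-5):
    # log p(y | x) for a zero-mean GP with the given kernel.
    K = kernel(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    n = len(x)
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)

rng = np.random.default_rng(1)
theta = rng.uniform(-np.pi, np.pi, size=12)  # sampled dihedrals (radians)
energy = np.cos(4.0 * theta)                 # hypothetical toy energies

scores = {
    name: float(log_marginal_likelihood(kernel, theta, energy))
    for name, kernel in [("rbf", rbf_kernel), ("periodic", periodic_kernel)]
}
```

A higher score means the kernel explains the observed energies better at those hyperparameters; a fair evaluation would also optimize the hyperparameters of each kernel and check predictive error on held-out conformers.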


If you write up a draft as indicated in the wiki, I'd be happy to take a look. As a warning, there are a *ton* of Bayesian / Gaussian process packages out there. I have not studied all of them, but a few that look interesting:
- GPFlow / GPFlowOpt (https://gpflowopt.readthedocs.io/en/latest/)
- Phoenics (https://github.com/aspuru-guzik-group/phoenics)
- GPyOpt (https://github.com/SheffieldML/GPyOpt)
- COMBO (https://github.com/tsudalab/combo)
- pyGPGO (http://pygpgo.readthedocs.io/en/latest/)

There are undoubtedly more, and an evaluation of packages would be an important part of the proposal/project.

Hope that helps,
-Geoff

---
Prof. Geoffrey Hutchison
Department of Chemistry
University of Pittsburgh
tel: (412) 648-0492
email: geoffh at pitt.edu
twitter: @ghutchis
web: https://hutchison.chem.pitt.edu/