<div dir="ltr">Hello,<br><br>I had some last-minute queries before I submit my first proposal draft:<br><br>1. I have decided to club together another project with the following project: Machine learning applied to Computational Chemistry data, but I am not sure which other project will benefit the organisation more. <b>Basically, what project should I club together with the ML one?</b><br><br><div>2. I was also thinking of making a Point Group detection library in Python, but then I came across Geoffrey's <a href="http://scicomp.stackexchange.com/questions/135/how-does-one-determine-the-point-group-of-a-molecule" class="mt-detrack-inspected">answer</a> where he mentioned that this was added to Avogadro. So I will be dropping this.<br><br>I will send a link to my first draft by <b>20th March, 4:30 AM GMT.</b></div><div><b><br></b></div><div>Regards,</div><div>Kunal Sharma</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 14, 2017 at 1:47 AM, Karol Langner <span dir="ltr"><<a href="mailto:karol.langner@gmail.com" target="_blank">karol.langner@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Geoff's and Adam's comments are spot on. I would add that both these ideas are on the more challenging/advanced side (which also makes them interesting in my mind). <div><br></div><div>The first because, as Geoff mentioned, it is more of a research project, requiring some scouting of what approaches might be appropriate for the kind of input/output cclib deals with. 
I don't have any specific suggestions here, since I haven't thought about the technical aspects much, but it would definitely involve some combination of supervised learning and constraints based on prior knowledge (relationships in the output space, such as atom charges summing to the molecular charge). The ideal end state would be taking output from a program we don't support (like mpqc) and getting reasonable attribute coverage. I don't think this is unreasonable, since output from all programs has many things in common.<div><br></div><div>The second thing you listed is a design task in addition to coding. Currently, cclib's parsers have a simple design: there are some helper methods, but each parser is mostly one gigantic parse method that goes through the file incrementally. Refactoring a single parser is probably a good way to start coding and testing ideas. But what I think we're really after here is some design concept. How should the parsers be structured so the code is more modular? What steps should we take to make them easier to maintain, test and extend?</div><div><br></div></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 13, 2017 at 11:07 AM, Adam Tenderholt <span dir="ltr"><<a href="mailto:atenderholt@gmail.com" target="_blank">atenderholt@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Kunal,<div><br></div><div>Geoff covered everything pretty well. Other than providing a correct link to our application guidelines (<a href="http://wiki.openchemistry.org/Applying_to_GSoC" target="_blank">http://wiki.openchemistry.org<wbr>/Applying_to_GSoC</a>), I'll just add that the project ideas on the wiki are just that: ideas. 
It's up to you to write a proposal, so feel free to combine ideas into a project that you find exciting and that has realistic milestones.</div><span class="m_7658461636814674650HOEnZb"><font color="#888888"><div><br></div><div>Adam</div></font></span><div><div class="m_7658461636814674650h5"><div><br><div><br><div class="gmail_quote"><div dir="ltr">On Mon, Mar 13, 2017 at 10:47 AM Geoffrey Hutchison <<a href="mailto:geoff.hutchison@gmail.com" target="_blank">geoff.hutchison@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg">Hi Kunal,</div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">Thanks for your message. I think Adam and/or Karol can comment more, but I'll give some in-line comments to your message.</div><br class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"></div></div><div style="word-wrap:break-word" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><blockquote type="cite" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><b class="m_7658461636814674650m_2101459925453690728gmail_msg">1. Machine Learning applied to parsing computational chemistry output: </b>Since a parser is used to get a very specific output from a specific input, what is it that we expect from the final ML pipeline? 
Do we want it to get all the available data from an output file (like most </div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><div dir="ltr" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg">(if not all) of the parameters mentioned in data.py)?</div></div></div></blockquote><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div></div></div><div style="word-wrap:break-word" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg">Right. The question is whether it's possible to teach an ML model to find all available data mentioned in data.py. This is clearly more of a research project than some of the other ideas.</div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><br class="m_7658461636814674650m_2101459925453690728gmail_msg"><blockquote type="cite" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><div dir="ltr" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><h3 style="background-image:none;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;margin:0.3em 0px 0px;overflow:hidden;padding-top:0.5em;padding-bottom:0px;border-bottom:none;line-height:1.6" class="m_7658461636814674650m_2101459925453690728gmail_msg"><font size="2" class="m_7658461636814674650m_2101459925453690728gmail_msg">2. 
Refactoring parser and Implementing new parsers: <span style="font-weight:normal" class="m_7658461636814674650m_2101459925453690728gmail_msg">I was looking into this and saw that you thought about an approach which utilized decorators and partial parsing of the file, but maybe it was dropped? Also, can you please provide a list of the parsers you would like to extend in cclib in this GSoC ...</span></font></h3></div></div></div></blockquote><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">There are a lot of example files in the cclib data repository: </div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><a href="https://github.com/cclib/cclib-data" class="m_7658461636814674650m_2101459925453690728gmail_msg" target="_blank">https://github.com/cclib/cclib<wbr>-data</a></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">I think the idea here is that you would choose which parsers you'd want to refactor and/or add. 
There is, after all, no end to the number of computational packages.</div></div></div><div style="word-wrap:break-word" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><br class="m_7658461636814674650m_2101459925453690728gmail_msg"><blockquote type="cite" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><div dir="ltr" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><b class="m_7658461636814674650m_2101459925453690728gmail_msg">I was thinking that given the overall duration of GSoC I would like to attempt to do more than one project (Combining two projects). What are your thoughts on this? Given the duration, would it be possible?</b><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div></div></div></blockquote><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div></div><div style="word-wrap:break-word" class="m_7658461636814674650m_2101459925453690728gmail_msg"><div class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">It depends a bit on the projects, but I can imagine these two projects could be blended (e.g., refactoring and adding new parsers while trying the ML approach).</div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">As for the application, here's a guide on the wiki:</div><div 
class="m_7658461636814674650m_2101459925453690728gmail_msg"><a href="https://github.com/cclib/cclib-data" class="m_7658461636814674650m_2101459925453690728gmail_msg" target="_blank">https://github.com/cclib/cclib<wbr>-data</a></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">We usually recommend students start a proposal with Google Docs (or something similar) and share with mentors/admins to get feedback.</div><div class="m_7658461636814674650m_2101459925453690728gmail_msg"><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">Hope that helps,</div><div class="m_7658461636814674650m_2101459925453690728gmail_msg">-Geoff</div><br class="m_7658461636814674650m_2101459925453690728gmail_msg"></div></blockquote></div></div></div></div></div></div>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>