[Openchemistry-developers] Regarding GSoC 2017

Karol Langner karol.langner at gmail.com
Mon Apr 3 12:10:01 EDT 2017


Hi,

Sounds good. The ML idea is more on the researchy side, although I don't
think it is unreasonable. Of course I do not expect straight NLP to be
particularly successful here. But some kind of classification with
constraints based on prior knowledge should work to some extent.
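
To make "classification with constraints based on prior knowledge"
concrete, here is a minimal sketch, not a proposed implementation. It
scores each line against per-label word counts and then accepts a value
only if the line contains exactly one negative decimal number. Everything
in it is an assumption made for illustration: the training lines are
made-up stand-ins rather than real Gaussian or QChem output, the label
names are hypothetical, and the "energy must be negative" rule is just
one example of a prior-knowledge constraint.

```python
import re
from collections import Counter

# Hypothetical labelled lines. These are NOT real Gaussian or QChem
# output, only stand-ins that phrase the same quantity differently.
TRAINING = [
    ("Total CCSD energy = -76.2401", "ccsd_energy"),
    ("CCSD total energy: -76.2398 hartree", "ccsd_energy"),
    ("SCF energy = -76.0267", "scf_energy"),
    ("SCF done, energy: -76.0259 hartree", "scf_energy"),
    ("Cycle 3 converged in 12 iterations", "other"),
    ("Memory used: 120 MB", "other"),
]

# Decimal numbers only, so plain integers such as cycle counts are ignored.
FLOAT = re.compile(r"-?\d+\.\d+")


def tokens(line):
    """Lower-cased alphabetic words of a line."""
    return re.findall(r"[a-z]+", line.lower())


def train(examples):
    """Count how often each word occurs under each label."""
    counts = {}
    for line, label in examples:
        counts.setdefault(label, Counter()).update(tokens(line))
    return counts


def classify(line, counts):
    """Pick the label whose word frequencies best match the line."""
    toks = tokens(line)
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(c[t] / total for t in toks)
    best = max(scores, key=scores.get)
    # No known word at all: treat the line as uninteresting.
    return best if scores[best] > 0 else "other"


def extract(line, counts):
    """Return (label, value), or None if the constraint rejects the line."""
    label = classify(line, counts)
    if label == "other":
        return None
    negatives = [float(m) for m in FLOAT.findall(line) if float(m) < 0]
    # Assumed constraint: an energy line carries exactly one negative decimal.
    if len(negatives) != 1:
        return None
    return label, negatives[0]


MODEL = train(TRAINING)
```

Calling extract("Final CCSD energy is -40.3871 hartree", MODEL) labels a
line the model has not seen as a CCSD energy and returns its value, while
extract("CCSD energy change 0.0001", MODEL) is classified the same way
but rejected by the constraint. A real attempt would need far more
labelled data and program-specific constraints.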

On Sat, Apr 1, 2017 at 11:31 AM, Kunal Sharma <ks05111996 at gmail.com> wrote:

> Good Evening,
>
> I am sorry for contacting you so late regarding this, but for the past
> week I have been very busy with my mid-semester exams. For my GSoC 2017
> project I have chosen to refactor the existing parsers and implement new
> parsers. I have doubts regarding the feasibility of the NLP Parser
> project, though.
>
> NLP methods, including terminology extraction, were developed for texts
> written in natural language and are not necessarily well suited to log
> files. That is due to the specific characteristics of log files, such as
> the heterogeneity of their data and structure. The heterogeneity exists
> not only between different file types but also within a single file.
> Often the same keywords repeat many times in one document. There is also
> the problem of specificity: for example, in QChem and Gaussian log files
> the way to extract the value of the CCSD energy is very different, and
> because of this structural dissimilarity we cannot simply apply NLP and
> expect good results.
>
> An approach to this problem was explained here
> <https://pdfs.semanticscholar.org/0653/a5c48bd99bbdffb70a452aa3c207891db228.pdf>
> :
>
> *(Please see page 3 of the paper provided in the link)*
>
> Also, there isn't much literature on applying ML to log file parsing. It
> is mainly used for analysis of log files, where good conclusions can be
> drawn, rather than for extracting the data values themselves.
>
> *We can, however, use regular expressions to try to create a general
> parser for different log files, but since they have such different
> structures the parser will be very complicated and might not yield the
> best results.*
>
> Therefore, I think that this project is not suited for me at my level of
> knowledge and understanding.
>
> I have provided the link for the first draft of my proposal: Open
> Chemistry proposal
> <https://docs.google.com/document/d/1IDIFTmaTjXlIUpY9qsqnECrdqzu7B0EbLk2LwditpXw/edit?usp=sharing>
>
>
> I will finish it by tonight, since I had to delete all the approaches I
> had written for the ML Parsing problem.
> Please let me know your thoughts about this and I will get back to you
> regarding the same.
>
> Thank you,
>
> Kunal Sharma
>
>
>
>
>