READ_ME
#########################################################################
Title:        Chemical shift prediction by exploratory machine-learning
              and quantum chemical calculation
Date of Update: 28/MAY/2018
Version:      1.0
Author(s):    Kengo Ito, Ph.D
              Eisuke Chikayama, Professor, Ph.D
Affiliation:  RIKEN Center for Sustainable Resource Science (CSRS),
              Environmental Metabolic Analysis Research team
URL:          http://dmar.riken.jp/Rscripts/
Depends:      Shell, openbabel, Gaussian software(>= 03) (for Quantum Chemical Calculation)
              Java(>= 8) (for Learning Data Set Generator)
              R(>= 2.10) (for Machine Learning)
License:      GPL-3
Reference:    Ito, K., Obuchi, Y., Chikayama, E., Date, Y. and Kikuchi, J.
              "Exploratory machine-learned theoretical chemical shifts can
              closely predict metabolic mixture signals"
              Chem. Sci. (submitted)
#########################################################################

[Programs and Scripts]

(1) Quantum Chemical Calculation
 -- 5950.sh (CID.sh, Shell Script)
 -- 5950.com (CID.com, Gaussian Command File)
 -- 5922.sh (CID.sh, Shell Script)
 -- 5922.com (CID.com, Gaussian Command File)
 -- dss_d2o.sh (Shell Script)
 -- dss_d2o.com (Gaussian Command File)

(2) Learning Data Set Generator
 -- toolgaussianlearndata.bat (Windows Batch File)
 ---- toolgaussianlearndata.jar (Compiled Java Program)
 ------ FileUtilities.java (Java Source Code)
 ------ SDFAtom.java (Java Source Code)
 ------ SDFBond.java (Java Source Code)
 ------ SDFBondType.java (Java Source Code)
 ------ SDFCompound.java (Java Source Code)
 ------ ToolGaussianLearnData.java (Java Source Code)
 -- CH3_processing.R (R Source Code)

(3) Machine Learning
 -- Several_Predictive_Modeling_for_QM.R (R Source Code)
 -- Applying_Model.R (R Source Code)

#************************************************************************

[Example Data]

(1) Quantum Chemical Calculation
 -- 5950.sdf (CID.sdf, Structure File of Alanine)
 -- 5950.xyz (CID.xyz, Structure File)
 -- 5950.log (CID.log, Gaussian Log File)
 -- 5922.sdf (CID.sdf, Structure File of Isonicotinic Acid)
 -- 5922.xyz (CID.xyz, Structure File)
 -- 5922.log (CID.log, Gaussian Log File)
 -- dss_d2o.sdf (Structure File of DSS as Reference)
 -- dss_d2o.xyz (Structure File)
 -- dss_d2o.log (Gaussian Log File)

(2) Learning Data Set Generator
 -- metid_list.txt (Listed CID File)
 -- experiment_database.txt (Experimental Data File)

(3) Machine Learning
 -- metid_list.txt_H.txt (1H Learning Data Set of 150 Compounds)
 -- metid_list.txt_C.txt (13C Learning Data Set of 150 Compounds)
 -- Results_H.Rdata (1H Predictive Models Trained on 150 Compounds)
 -- Results_C.Rdata (13C Predictive Models Trained on 150 Compounds)

#************************************************************************

[How to Use]

(1) Quantum Chemical Calculation

1-1. Get the structure files (e.g. Alanine, Isonicotinic Acid, DSS) from the
     PubChem website. Type the following on a Unix console:

     curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5950/record/SDF/?record_type=3d&response_type=display" > 5950.sdf
     curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5922/record/SDF/?record_type=3d&response_type=display" > 5922.sdf
     curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/74873/record/SDF/?record_type=3d&response_type=display" > dss_d2o.sdf

1-2. Convert the structure files using openbabel software.
     Type the following on a Unix console:

     babel -isdf 5950.sdf -oxyz 5950.xyz
     babel -isdf 5922.sdf -oxyz 5922.xyz
     babel -isdf dss_d2o.sdf -oxyz dss_d2o.xyz

1-3. Create a shell script (*.sh) and a command file (*.com) for Gaussian
     software. (The com file is made by modifying the xyz file.)

1-4. Run the structure optimization and NMR parameter calculation using
     Gaussian software. Type the following on a Unix console:

     ./5950.sh
     ./5922.sh
     ./dss_d2o.sh

1-5. After successful completion, a log file is generated whose last line
     reads "Normal termination ...".

#**********

(2) Learning Data Set Generator

2-1. List the compound names (CIDs are used here) in metid_list.txt.

2-2. Summarize the necessary information in experiment_database.txt as follows:

     column.1 ... CID.
     column.2 ... Atom No. in sdf file. Only hydrogen and carbon atoms are used here.
     column.3 ... Solvent No. In this study, water is 1 and methanol is 2.
     column.4 ... Experimental chemical shift of 1H and 13C of each atom.
     column.5 ... Theoretical shielding constant of 1H and 13C of DSS from dss_d2o.log.

2-3. Store the files (*.sdf, *.log, metid_list.txt, experiment_database.txt,
     toolgaussianlearndata.jar, toolgaussianlearndata.bat) in the same directory.

2-4. Execute toolgaussianlearndata.bat on Windows.

2-5. metid_list.txt_H.txt (1H learning data set) and metid_list.txt_C.txt
     (13C learning data set) are exported.

2-6. Average the chemical shifts of methyl groups.
     Type the following on an R console:

     source("CH3_processing.R")
     # choose metid_list.txt_H.txt, and choose save directory

* Notice; The file names (*.sdf and *.log), the compound names in
  metid_list.txt, and column.1 in experiment_database.txt are all unified
  by CID here. Users may replace the CID with any other name, as long as
  the same name is used in all three places.

#**********

(3) Machine Learning

3-1. Predictive modeling. Type the following on an R console:

     source("Several_Predictive_Modeling_for_QM.R")
     # choose working directory,
     # and choose file (metid_list.txt_H.txt or metid_list.txt_C.txt).
     # Results are saved as Results.Rdata.
     #
     # After calculation.
     # Main objects (e.g. 231 is ML algorithm "xgbLinear")
     DATA                       # Learning data set from "metid_list.txt_*.txt" (matrix)
     METHODS[231]               # METHODS object holds the ML algorithm names. (vector)
     MODELs[[231]]              # MODELs object holds the predictive models. (list)
     PREDs[[231]]               # PREDs object holds the predicted data (scaling factor). (list)
     PREDs[[231]] + DATA[,2]    # *** Predicted chemical shifts ***
     RMSEs[[231]]               # RMSEs object holds the average error between original and predicted data. (list)
     IMPs[[231]]                # IMPs object holds the importance of the explanatory variables. (list)
     CV_RMSEs[[231]]            # CV_RMSEs object holds the error calculated by 10-fold cross validation. (list)

3-2. Applying a predictive model to your own data.
     Type the following on an R console:

     file.rename("Results_H.Rdata","Results.Rdata")
     # Results_H.Rdata has our 1H predictive models.
     # Results_C.Rdata has our 13C predictive models.
     #
     source("Applying_Model.R")
     # choose saved directory,
     # choose working directory,
     # and choose your file (metid_list.txt_H.txt or metid_list.txt_C.txt).
     # Results are saved as Results_test_data.Rdata.
     #
     # After calculation.
     DATA3                      # Learning data set from "metid_list.txt_*.txt" (matrix)
     PREDs2[[231]]              # PREDs2 object holds the predicted data (scaling factor). (list)
     PREDs2[[231]] + DATA3[,2]  # *** Predicted chemical shifts ***
     RMSEs2[[231]]              # RMSEs2 object holds the average error between original and predicted data. (list)

3-3. For more details on the ML algorithms, see the caret library:
     http://topepo.github.io/caret/index.html

#************************************************************************
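
* Supplementary example for step 1-5 (not part of the original package);
  Step 1-5 says that a successful Gaussian run leaves a log file whose last
  line contains "Normal termination". When many compounds are calculated,
  checking this by eye is tedious, so the check can be sketched as a small
  shell helper. The function names check_gaussian_log and report_logs are
  hypothetical and were chosen only for this illustration; the sketch
  assumes nothing about the log file beyond what step 1-5 states.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with the original scripts.

# Return 0 if the last line of the given Gaussian log file contains
# "Normal termination" (see step 1-5), non-zero otherwise.
# A missing or unreadable file also counts as a failure.
check_gaussian_log() {
    tail -n 1 "$1" 2>/dev/null | grep -q "Normal termination"
}

# Print one status line per log file given as an argument.
report_logs() {
    for log in "$@"; do
        if check_gaussian_log "$log"; then
            echo "OK      $log"
        else
            echo "FAILED  $log"
        fi
    done
}
```

  After step 1-4 has finished, the helpers could be called as
  report_logs 5950.log 5922.log dss_d2o.log
  and any file reported as FAILED should be recalculated before it is passed
  to the Learning Data Set Generator.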