Main Page

Revision as of 16:43, 4 August 2017 by 129.194.8.73 (talk) (Concerning symmetry)

Direct link to page describing the format of the "<NMREDATA_...>" tags.

Direct link to the 1D spectra attributes page.

Possible object structure of NMReDATA.

Introduction

The NMReDATA working group decided to include data extracted from NMR spectra of small molecules using "tags" in SDF files.

More details about SDF files!

An important task of the group is to define the format of the content of the "<NMREDATA_...>" tags. More details here!

Version 1.0 will be decided in September at the "Round table" of the Smash 2017 conference in Baveno, Italy.

NMR records

We call an "NMR record" a folder (or a .zip file containing the folder) or a database record that includes:

1) All the NMR spectra (including FID, acquisition and processing parameters)

2) The .nmredata.sdf file

Pictorial representation of an NMR record and example of an SDF file, from the poster presented in July at Euromar 2017.

Current version of the format of NMReDATA

The format can be found here: NMReDATA tag format

Changes to V 0.98

Certification

When the assignment is made in a computer-assisted manner, the software may want to add a certification of the validity of the data. It is up to the manufacturers to encode it in a way that makes the certification impossible to forge (using a hash, etc.?). Certificate tags could be listed at the end of the .sdf file. They can originate from the CASE software, from the database hosting the data and spectra, or from the journal (to state that the data were peer reviewed). They can be cumulated. If the text of the .sdf file needs to be hashed for certification, the list of tags used for hashing could be given. (I'm not sure what needs to be done to certify the validity of certificates. To be refined by the certificate specialists.)

>  <NMREDATA_CERTIFICATION>
Software=CMC_assist
Author=Bruker
Confidence_level=4.6
Confidence_level_certificate="ADFS678AG67DFG6SD5F7GS5DFGSD8F5GSD7FG7"
Unique_solution=YES
Unique_solution_certificate="ADFS678AG67DFG6SD5F7GS5DFGSD8F5GSD7FG7"
ETC...

This is only a very vague example. The uniqueness of the proposed structure may be understood in the sense of J.-M. Nuzillard's LSD tool. Software producers can tell what needs to be done for their format. Multiple certifications can be listed one after the other in the same <NMREDATA_CERTIFICATION> tag; each "Software=..." assignment starts a new one.
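As an illustration of hashing a listed set of tags, here is a minimal Python sketch. It assumes SHA-256 and a simple "tag name, newline, tag text" serialisation; neither is defined by the format, and the tag contents are hypothetical. A bare hash only detects later modification of the hashed tags; making a certificate hard to forge would additionally require a keyed signature, which is not shown.

```python
import hashlib

def hash_tags(tags, names):
    """Return a SHA-256 hex digest over the listed tags, in the given order."""
    digest = hashlib.sha256()
    for name in names:
        # Hash both the tag name and its text, so editing either changes the digest.
        digest.update(name.encode("utf-8"))
        digest.update(b"\n")
        digest.update(tags[name].encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

# Hypothetical tag contents, for illustration only.
tags = {
    "NMREDATA_ASSIGNMENT": "H1, 7.26, 1\nC1, 128.4, 2",
    "NMREDATA_ORIGIN": "Source=Calculation",
}
hashed = ["NMREDATA_ASSIGNMENT", "NMREDATA_ORIGIN"]  # the list of tags used for hashing
certificate = hash_tags(tags, hashed)
```

Listing the hashed tag names next to the digest lets a reader recompute it from the same tags in the same order.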

Role(s) and scope of the “assignment records”

The NMR record can be generated from experimental data but also from simulations, predictions, etc. Tools to compare, evaluate, validate, and check the consistency of "assignment records" will certainly be developed. Assignment records can be generated by commercial software, but also by diverse tools analysing NMR data, homemade processing tools, simulation software, etc. This is why the data format should include as many options as possible and remain flexible, even if not all possible uses are clearly defined or used immediately. Ideally, the .sdf files should be convertible into other file formats or spectral descriptions without loss.

We should see it as an advantage if databases include multiple "assignment records" associated with the same molecule or the same set of NMR spectra. Some could be old, originating from incomplete literature data. Others could include errors because they originate from bulk data processed automatically. But ultimately a computer could produce a verified and validated record combining all the other data. Such aggregated records could be generated by NMR software or databases scoring the available data for consistency, using calculated chemical shifts and spectral simulations. They could refine chemical shifts, couplings, etc.

SDF files generated from experimental data

When the NMR data originate from experimental spectra, they may be quite crude (simple automated integration, peak picking) or result from complex automated or manual analysis. The data may be partial or incomplete, and may contain inconsistencies, impossible features, etc. The complexity of the content depends on the origin of the data:

- only 1D 1H NMR data (with or without integration, coupling, etc.)

- only 1D 13C data (just from a simple peak picking)

- only 1D data, but for multiple isotopes (from NMRshiftDB?)

- full analysis based on computer-assisted software (such as Bruker cmc-se, ACDLabs Structure Elucidator or Mestrelab Mnova) or a web platform (cheminfo.org)

- 1D and 2D data processed automatically, with a signal assignment that is partial (for example, not all signals are assigned) and/or ambiguous (due to lack of resolution or other problems)

- The file may not contain the actual assignment, only the structure and the list of chemical shifts (the assignment could be added by NMR tools).

- The data may come from a scientific report, i.e. the text describing the spectra. It could be like the text in the following figure (from http://onlinelibrary.wiley.com/doi/10.1002/mrc.4527/full). Scripts could be written to convert such a "pure text" description into an .sdf file and include the .mol file.

- For assignment work made with only "paper and pencil", a simple web tool allowing one to draw a molecule and enter lists of signal names and 2D correlations could easily be made.

We could consider accepting .pdf files or pictures of the spectra when the original files no longer exist.

Synthetic/predicted data

SDF files generated from calculated data

The NMR data may originate from DFT calculations or from any other type of predictor of chemical shifts and/or couplings. In such a case, a general tag is added to provide information about the software. For example:

>  <NMREDATA_ORIGIN>
Source=Calculation
method=DFT
Geometry=method/basis set
Shielding=method_basis set
Coupling=method_basis set
Software=...
Version=...

SDF files generated from literature data

When the NMR data originate from publications, a reference to the published paper/book/thesis is given in the NMREDATA_LITERATURE tag.

>  <NMREDATA_LITERATURE>
Source=Journal
DOI=DOI_HERE (if Reference field is DOI specify it here)
CompoundNumber=label used in the reference to designate the compound (typically a number in boldface)
>  <NMREDATA_LITERATURE>
Source=Book
ISBN=ISBN_HERE (if Reference field is ISBN specify it here)
CompoundNumber=label used in the reference to designate the compound (typically a number in boldface)
>  <NMREDATA_LITERATURE> 
Source=Thesis
Thesis=HTML link here (if available; if not: "LastName, Firstname(s), institution providing the degree, city, country, year of publication")
CompoundNumber=label used in the reference to designate the compound (typically a number in boldface)
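To show how such tags could be read back, here is a minimal Python sketch that collects the data items of one SDF record and splits their "Key=Value" lines. The function names are hypothetical, and the parsing rules (a header line starting with ">", data lines ending at a blank line or at "$$$$") are a simplified reading of the SDF layout. Repeated tags such as NMREDATA_LITERATURE are kept as separate items. The CompoundNumber value is made up.

```python
import re

# A data item header looks like ">  <TAG_NAME>".
TAG_HEADER = re.compile(r"^>.*?<([^<>]+)>")

def read_data_items(sdf_text):
    """Return a list of (tag name, data lines), keeping repeated tags separate."""
    items = []
    current = None
    for line in sdf_text.splitlines():
        match = TAG_HEADER.match(line)
        if match:
            current = (match.group(1), [])
            items.append(current)
        elif line.strip() == "" or line.startswith("$$$$"):
            current = None  # a blank line or the record separator ends the item
        elif current is not None:
            current[1].append(line)
    return items

def key_values(lines):
    """Split 'Key=Value' lines at the first '=' sign."""
    return dict(line.split("=", 1) for line in lines if "=" in line)

sample = (
    ">  <NMREDATA_LITERATURE>\n"
    "Source=Journal\n"
    "DOI=10.1002/mrc.4527\n"
    "CompoundNumber=1\n"
    "\n"
    "$$$$\n"
)
items = read_data_items(sample)
literature = key_values(items[0][1])
```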

SDF files generated after revision of existing SDF files

Assignment records may be generated after revision of experimental, literature, prediction data, etc. Ideally, the original .sdf files should either be provided as well, to facilitate comparison, or exist somewhere and be referred to. In both cases a reference should be given.

>  <NMREDATA_UPDATE>
Source=Record
Record_number=ref_to_the_original_record (multiple references, separated by ",", are allowed for aggregation of records)
Date=date (standard format for date)
Correction="fixed assignments of C(13) and C(15)"

This is also to be refined according to future developments.  
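A minimal Python sketch of writing such a tag follows. It assumes ISO 8601 (YYYY-MM-DD) as the "standard format for date", which the format does not yet fix. The function name and the record identifiers are hypothetical.

```python
from datetime import date

def update_tag(record_numbers, when, correction):
    """Build an NMREDATA_UPDATE data item as text (field names taken from the example above)."""
    fields = [
        ("Source", "Record"),
        ("Record_number", ",".join(record_numbers)),  # several records when aggregating
        ("Date", when.isoformat()),                   # ISO 8601 is an assumption
        ("Correction", f'"{correction}"'),
    ]
    body = "\n".join(f"{key}={value}" for key, value in fields)
    # A blank line terminates an SDF data item.
    return f">  <NMREDATA_UPDATE>\n{body}\n\n"

text = update_tag(
    ["rec-001", "rec-002"],  # hypothetical record identifiers
    date(2017, 8, 4),
    "fixed assignments of C(13) and C(15)",
)
```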

Concerning symmetry

For symmetrical molecules, difficulties may arise when coding couplings and 2D correlations.

Reminder: Couplings are not directly associated with atoms, but with labels (in the NMREDATA_ASSIGNMENT tag). Labels are associated with one or more atoms (in case of symmetry, fast rotation, etc.).

1) Example of difficulty/solution concerning scalar coupling: the 1D 1H spectrum of 1,2-dichlorobenzene shows two multiplets (two different protons in an AA'XX' system). If the SDF file includes two labels (one for A and one for X, each pointing to two atoms), in principle only one coupling can be given: JA,X (no JA,A' or JA,X'). If one wishes to specify all the couplings, give different labels to A and A' (and likewise to X and X'), each pointing to only one atom, so that different couplings can be given for JA,X, JA',X, JA,A' and JX,X'. This may be desired so that the 1D spectrum can be simulated with the correct non-equivalence effects.

2) "Problem" for correlations. Consider 1,4-dichlorobenzene: a 3JC,H HMBC correlation will be visible between a proton and what seems to be its directly bound carbon, because the carbons one bond (1J) and three bonds (3J) away from a given proton are related by symmetry. Software may see the correlation as 1J, but it should be able to analyse the NMREDATA_ASSIGNMENT tag, see that the H and C labels each point to two atoms, and conclude that the correlation may correspond to any of the four possible pairs. Two pairs will appear as the actual 3J and two as 1J.
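The enumeration of candidate atom pairs can be sketched in Python as follows. The atom numbering, the bond list and the label names are hypothetical: a six-membered ring fragment with two labelled carbons and the protons they carry, standing in for what a tool would read from the molblock and from the NMREDATA_ASSIGNMENT tag.

```python
from collections import deque
from itertools import product

def bond_distance(bonds, start, goal):
    """Number of bonds on the shortest path between two atoms (breadth-first search)."""
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == goal:
            return dist
        for nxt in neighbours.get(atom, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the atoms are not connected

# Hypothetical fragment: ring carbons 1-6, proton 7 on carbon 2, proton 8 on carbon 6.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
bonds = ring + [(2, 7), (6, 8)]
labels = {"H": [7, 8], "C": [2, 6]}  # each label points to two atoms

# Every H/C atom pair that the single label-level correlation could correspond to.
pairs = {
    (h, c): bond_distance(bonds, h, c)
    for h, c in product(labels["H"], labels["C"])
}
```

With this fragment, two of the four pairs are one bond apart and two are three bonds apart, so a tool would keep the three-bond pairs as the plausible origin of an HMBC correlation.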

Why not add more data in NMReDATA tags?

We consider that our task is to focus on NMR data. But SDF files could (and should!) also include other experimental data such as:

1) The origin of the molecule. This may include the extraction method and the plant it originates from, in phytochemistry, or the reaction producing it.

2) MS data

3) other spectral data

In principle, authors can add any tag, provided they have the tools to do so and the journals request it... Such data could have the following form...

Software producing SDF files that include NMReDATA should read SDF files and write SDF files, only adding (or modifying/reviewing) the NMReDATA data.
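A minimal Python sketch of this read-and-write behaviour follows. It replaces or appends a single NMREDATA_ tag and copies every other line of the record unchanged. The function name and the MELTING_POINT tag are hypothetical; the molblock is omitted from the sample for brevity, and the record is assumed to be well formed (data items end with a blank line, the record ends with "$$$$").

```python
def update_nmredata(record, name, body):
    """Return the record with tag `name` replaced (or appended); other lines are untouched."""
    if not name.startswith("NMREDATA_"):
        raise ValueError("only NMREDATA_ tags may be added or modified")
    lines = record.splitlines()
    new_item = [f">  <{name}>", *body.splitlines(), ""]
    out, i, done = [], 0, False
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and f"<{name}>" in line:
            # Skip the old item: header, data lines and the terminating blank line.
            i += 1
            while i < len(lines) and lines[i].strip() and not lines[i].startswith("$$$$"):
                i += 1
            if i < len(lines) and not lines[i].strip():
                i += 1
            out += new_item
            done = True
        elif line.startswith("$$$$") and not done:
            # Tag not present: append it just before the record separator.
            out += new_item + [line]
            done = True
            i += 1
        else:
            out.append(line)
            i += 1
    if not done:
        out += new_item
    return "\n".join(out) + "\n"

record = (
    ">  <MELTING_POINT>\n120\n\n"  # hypothetical non-NMR tag, must survive unchanged
    ">  <NMREDATA_ORIGIN>\nSource=Calculation\n\n"
    "$$$$\n"
)
updated = update_nmredata(record, "NMREDATA_ORIGIN", "Source=Calculation\nmethod=DFT")
```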