pubmed.ncbi.nlm.nih.gov

InChIKey collision resistance: an experimental testing

  • Sun Jan 01 2012

Igor Pletnev et al. J Cheminform. 2012.

Abstract

InChIKey is a 27-character compacted (hashed) version of InChI, intended for Internet and database searching and indexing and based on an SHA-256 hash of the InChI character string. The first block of InChIKey encodes the molecular skeleton, while the second block represents various kinds of isomerism (stereo, tautomeric, etc.). InChIKey is designed to be a nearly unique substitute for the parent InChI. However, a single InChIKey may occasionally map to two or more InChI strings (a collision). The occurrence of a collision does not in itself compromise the signature, since collision-free hashing is impossible; the only viable approach is to set and maintain a level of collision resistance that is sufficient for typical applications.

We tested, in computational experiments, how well the real-life InChIKey collision resistance corresponds to the theoretical estimates expected by design. For this purpose, we analyzed the statistical characteristics of InChIKey for datasets of variable size in comparison to the theoretical statistical frequencies. For the relatively short second block, exhaustive direct testing was performed: we computed the numbers of collisions for the stereoisomers of Spongistatin I (using the whole set of 67,108,864 isomers and its subsets) and compared them with theory. For the longer first block, we generated, using custom-made software, InChIKeys for more than 3 × 10^10 chemical structures. The statistical behavior of this block was tested by comparing experimental and theoretical frequencies for the various four-letter sequences that may appear in the first block body.

From the results of our computational experiments, we conclude that the observed characteristics of InChIKey collision resistance are in good agreement with theoretical expectations.
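The "theoretical estimates" mentioned in the abstract can be illustrated with a standard birthday-problem calculation. The sketch below is not the authors' code. It assumes an ideal uniform hash, and it assumes a second-block width of 37 bits, a figure that does not appear in the abstract and should be checked against the InChI documentation. The function names are hypothetical.

```python
import math

# Assumption (not stated in the abstract): the hashed part of the InChIKey
# second block carries 37 bits. Adjust if the InChI documentation differs.
ASSUMED_SECOND_BLOCK_BITS = 37


def expected_colliding_pairs(n_items, hash_bits):
    """Birthday-style estimate: expected number of item pairs sharing a hash."""
    return n_items * (n_items - 1) / 2 / 2 ** hash_bits


def expected_collisions(n_items, hash_bits):
    """Expected value of (number of items) - (number of distinct hash values)."""
    n_bins = 2 ** hash_bits
    # Probability that a given hash value is hit at least once:
    # 1 - (1 - 1/n_bins) ** n_items, computed with log1p/expm1 for accuracy.
    p_hit = -math.expm1(n_items * math.log1p(-1.0 / n_bins))
    return n_items - n_bins * p_hit


if __name__ == "__main__":
    n_isomers = 67_108_864  # size of the full Spongistatin I isomer set
    bits = ASSUMED_SECOND_BLOCK_BITS
    print("colliding pairs:", expected_colliding_pairs(n_isomers, bits))
    print("collisions:     ", expected_collisions(n_isomers, bits))
```

The two functions count slightly different things (pairs versus surplus items), which is why their values differ a little when the dataset is much smaller than the hash space. Which definition the paper uses is not stated in the abstract.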

Figures

Figure 1

Molecular skeleton of Spongistatin I.

Figure 2

The observed (circles) and theoretically expected (curve) average number of InChIKey second block collisions vs. the number of considered stereoisomers of Spongistatin I. a) The whole data range; abscissa values: log(number of isomers); b) the low-collision region; abscissa values: number of isomers.
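A small-scale analogue of the observed-versus-expected comparison in Figure 2 can be run with the Python standard library. This is only a sketch under stated assumptions: the inputs are synthetic strings rather than InChI strings, the hash is SHA-256 truncated to its leading bits (a stand-in for the real InChIKey encoding, which is not reproduced here), and a "collision" is counted as one surplus item per repeated hash value. All function names are hypothetical.

```python
import hashlib
import math


def truncated_sha256(text, bits):
    """Leading `bits` bits of the SHA-256 digest of `text`, as an integer."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def observed_collisions(strings, bits):
    """Number of distinct input strings minus number of distinct hash values."""
    unique_strings = set(strings)
    hashes = {truncated_sha256(s, bits) for s in unique_strings}
    return len(unique_strings) - len(hashes)


def expected_collisions(n_items, bits):
    """Same quantity for an ideal uniform hash with 2**bits possible values."""
    n_bins = 2 ** bits
    return n_items + n_bins * math.expm1(n_items * math.log1p(-1.0 / n_bins))


def collision_table(bits=16, sizes=(500, 1000, 2000, 4000)):
    """Rows of (dataset size, observed collisions, expected collisions)."""
    rows = []
    for n in sizes:
        # Synthetic placeholder inputs; real tests would hash InChI strings.
        strings = [f"synthetic-structure-{i}" for i in range(n)]
        rows.append((n, observed_collisions(strings, bits),
                     expected_collisions(n, bits)))
    return rows


if __name__ == "__main__":
    for n, observed, expected in collision_table():
        print(f"n={n:5d}  observed={observed:4d}  expected={expected:8.2f}")
```

The 16-bit width is chosen only so that collisions appear at small dataset sizes; it has no connection to the actual InChIKey block lengths.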

Figure 3

Dependence of the observed average number of InChIKey second block collisions for 370 000-entry datasets on the number of samplings m.

Figure 4

Normalized frequencies of various letters within the first block of InChIKey. Measured using InChIKeys for 1 097 996 constitutional isomers of C8H8Cl3F5; the values are normalized to the frequency of ‘A’.
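The normalization described in the Figure 4 caption can be sketched as follows. This is an illustration, not the authors' software: it assumes the first block is everything before the first hyphen of each InChIKey, and the function name and any input list are hypothetical.

```python
from collections import Counter


def first_block_letter_frequencies(inchikeys, reference="A"):
    """Letter counts over all first blocks, divided by the count of `reference`."""
    counts = Counter()
    for key in inchikeys:
        # Assumption: the first block is the text before the first hyphen.
        first_block = key.split("-", 1)[0]
        counts.update(first_block)
    reference_count = counts[reference]
    if reference_count == 0:
        raise ValueError(f"reference letter {reference!r} never occurs")
    return {letter: counts[letter] / reference_count
            for letter in sorted(counts)}
```

For a well-mixed hash, one would expect letters that are encoded equally often to have normalized frequencies close to each other; a letter that is systematically rarer or more common would show up as a clear deviation from its neighbors.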
