Data Mining


Data mining may be defined as ‘the non-trivial retrieval of implicit, previously unknown and potentially useful information from various sources of data’. Importantly, data mining complements and expands bioinformatics. At present the two are distinct in nature and each has its own identity, but there is every possibility, and ample scope, that they may eventually merge and become indistinguishable.

Data mining is widely practised in the field of ‘biotechnology’, which involves various branches of the ‘life sciences’, viz., biology, microbiology, agriculture and, above all, the health care system. It may also be legitimately extended to areas not related to the life sciences, such as banking, database providers, engineering, financial institutions, government agencies, manufacturing, marketing, telecommunications, the travel industry and service industries. In fact, the massive volume of information generated from the above sources can be utilized to its fullest extent by means of the many highly specialized software packages already in use across the globe.

The fast-developing ‘biopharmaceutical industry’ makes extensive use of databases that hold a wealth of vital information, retrieved by a variety of data mining methodologies. Such information includes:

  1. Annotated databases of disease profiles
  2. Molecular pathways involved in serious human diseases
  3. Quantitative structure-activity relationships (QSARs)
  4. Precise chemical structures of combinatorial libraries of compounds
  5. Results of mandatory ‘clinical trials’ of new molecules

Data mining helps the ‘pharmaceutical industry’ in general, and the ‘biopharmaceutical industry’ in particular, to exploit this valuable information gainfully.

Applications of Data Mining

The tremendous volume of highly informative and valuable data now generated and stored so efficiently poses a ‘big challenge’ to the biopharmaceutical industry: making critical and precise decisions about the development of viable ‘targets and lead compounds’. Data mining goes a long way towards simplifying these complex data sets and bringing them into focus in an efficient and intuitive manner. In fact, an appreciable number of organizations offer data mining services for a variety of specialized applications. Importantly, there are six predominant and well-known approaches to data mining applications, namely:

(a) Influence-based mining, i.e., an intensive search for cause-and-effect relationships between data sets and pharmacogenomics,

(b) Affinity-based mining, i.e., the data mining system distinctly identifies data points, which makes the approach more meaningful and useful; this is important for distinguishing ‘accidental/incidental’ motifs from those of definite biological significance,

(c) Time-delay data mining, i.e., the identification of patterns that are confirmed or rejected as the data set grows over time,

(d) Trends-based data mining, i.e., the alterations that take place in specific data sets over a certain period of time are investigated minutely, and the resulting trends are examined,

(e) Comparative data mining, i.e., data collected at different sites over different time periods are compared to detect and identify the extent of any dissimilarities, and

(f) Predictive data mining, i.e., mining that largely complements and expands traditional bioinformatics.
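
As a rough illustration of approaches (d) and (e), the short Python sketch below estimates a trend in a series of readings and compares readings collected at two sites. The function names `linear_trend` and `compare_sites`, and the sample readings, are hypothetical and chosen only for illustration; they are not taken from any particular data mining package, and a real system would work on far larger and noisier data sets.

```python
from statistics import mean


def linear_trend(values):
    """Trends-based sketch: least-squares slope of the readings
    against their position in time (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def compare_sites(site_a, site_b):
    """Comparative sketch: per-period differences between two sites
    and the mean absolute difference as a simple dissimilarity score."""
    if len(site_a) != len(site_b):
        raise ValueError("both sites need the same number of periods")
    diffs = [a - b for a, b in zip(site_a, site_b)]
    return diffs, mean(abs(d) for d in diffs)


if __name__ == "__main__":
    # Hypothetical readings for four consecutive periods at two sites.
    site_a = [10, 12, 14, 16]
    site_b = [10, 11, 12, 13]

    print("trend at site A:", linear_trend(site_a))
    print("trend at site B:", linear_trend(site_b))
    diffs, score = compare_sites(site_a, site_b)
    print("per-period differences:", diffs)
    print("mean absolute difference:", score)
```

A positive slope indicates readings that rise over the period examined, while a larger mean absolute difference indicates greater dissimilarity between the two sites. Both measures are deliberately simple stand-ins for the more elaborate statistics that a production data mining system would apply.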