What is data mining?


Data mining is a new discipline which has sprung up at the confluence of several other disciplines, driven chiefly by the growth of large databases. The basic motivation behind data mining is that these large databases contain information which is of value to the database owners, but this information is concealed within the mass of uninteresting data and has to be discovered. That is, the aim is to find and extract information that is surprising, novel, unexpected, or valuable. This means that the subject is closely allied to exploratory data analysis.


One definition describes data mining as the exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

However, issues arising from the sizes of the databases, as well as ideas and tools imported from other areas, mean that there is more to data mining than merely exploratory data analysis.

Perhaps the main economic driver of the development of data mining tools and techniques has been the commercial world; the promise of money to be made from data processing innovations is a familiar one, and commercial databases are now growing rapidly in size as well as in number.

The excitement of data mining is also partly a consequence of its nature: it suggests that there is valuable information concealed within the data one already has, simply waiting for someone to tease it out. Unfortunately, the “simply” part of this exercise is rather misleading.

One of the problems is that large data sets necessarily have a great deal of structure in them, but this structure has three major sources in addition to the target one of “important, real, undiscovered structure.” These three sources are data contamination, chance occurrences of data, and structure which is already known to the database owner (or, if not explicitly articulated as known, sufficiently obvious once it has been pointed out to be of no genuine interest or value, such as the fact that married people come in pairs). The first and second of these are sufficiently important to warrant some discussion.
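The point about chance occurrences can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a method taken from the text: it generates columns of independent random numbers, so by construction no real relationship exists between any two of them, and then counts how many pairs nevertheless show a sizeable sample correlation. The function names (`pearson`, `count_chance_correlations`), the threshold of 0.5, and the table dimensions are all hypothetical choices made for this example.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_vars=40, n_rows=20, threshold=0.5, seed=0):
    """Count variable pairs whose |correlation| meets the threshold.

    Every column is drawn independently from a standard normal
    distribution, so any "structure" found here arises purely by chance.
    Returns (number of strong pairs, total number of pairs).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        for _ in range(n_vars)
    ]
    correlations = [
        pearson(a, b) for a, b in itertools.combinations(columns, 2)
    ]
    strong = sum(1 for r in correlations if abs(r) >= threshold)
    return strong, len(correlations)


strong, total = count_chance_correlations()
print(f"{strong} of {total} pairs of unrelated variables have |r| >= 0.5")
```

With 40 variables there are 780 pairs to examine, so even a small per-pair probability of a large sample correlation can be expected to yield some apparent "discoveries". The more variables a database holds, the more such comparisons a search implicitly makes, which is why patterns found by mining need to be checked against what chance alone would produce.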

