An Introduction to Data Mining

What is data mining?
Data mining is a field of research that has emerged in the 1990s, and is very popular today, sometimes under different names such as “big data” and “data science“, which have a similar meaning. To give a short definition of data mining, it can be defined as a set of techniques for automatically analyzing data to discover interesting knowledge or pasterns in the data. The reasons why data mining has become popular is that storing data electronically has become very cheap and that transferring data can now be done very quickly thanks to the fast computer networks that we have today. Thus, many organizations now have huge amounts of data stored in databases, that needs to be analyzed. Having a lot of data in databases is great. However, to really benefit from this data, it is necessary to analyze the data to understand it. Having data that we cannot understand or draw meaningful conclusions from it is useless. So how to analyze the data stored in large databases? Traditionally, data has been analyzed by hand to discover interesting knowledge. However, this is time-consuming, prone to error, doing this may miss some important information, and it is just not realistic to do this on large databases. To address this problem, automatic techniques have been designed to analyze data and extract interesting patterns, trends or other useful information. This is the purpose of data mining. In general, data mining techniques are designed either to explain or understand the past (e.g. why a plane has crashed) or predict the future (e.g. predict if there will be an earthquake tomorrow at a given location). Data mining techniques are used to take decisions based on facts rather than intuition.


Data mining techniques can be applied to various types of data

Data mining software are typically designed to be applied on various types of data. Below, I give a brief overview of various types of data typically encountered, and which can be analyzed using data mining techniques. Relational databases: This is the typical type of databases found in organizations and companies. The data is organized in tables. While, traditional languages for querying databases such as SQL allow to quickly find information in databases, data mining allow to find more complex patterns in data such as trends, anomalies and association between values. Customer transaction databases: This is another very common type of data, found in retail stores. It consists of transactions made by customers. For example, a transaction could be that a customer has bought bread and milk with some oranges on a given day. Analyzing this data is very useful to understand customer behavior and adapt marketing or sale strategies. Temporal data: Another popular type of data is temporal data, that is data where the time dimension is considered. A sequence is an ordered list of symbols. Sequences are found in many domains, e.g. a sequence of webpages visited by some person, a sequence of proteins in bioinformatics or sequences of products bought by customers. Another popular type of temporal data is time series. A time series is an ordered list of numerical values such as stock-market prices. Spatial data: Spatial data can also be analyzed. This include for example forestry data, ecological data, data about infrastructures such as roads and the water distribution system. Spatio-temporal data: This is data that has both a spatial and a temporal dimension. For example, this can be meteorological data, data about crowd movements or the migration of birds. Text data: Text data is widely studied in the field of data mining. Some of the main challenges is that text data is generally unstructured. Text documents often do no have a clear structure, or are not organized in predefined manner. Some example of applications to text data are (1) sentiment analysis, and (2) authorship attribution (guessing who is the author of an anonymous text). Web data: This is data from websites. It is basically a set of documents (webpages) with links, thus forming a graph. Some examples of data mining tasks on web data are: (1) predicting the next webpage that someone will visit, (2) automatically grouping webpages by topics into categories, and (3) analyzing the time spent on webpages. Graph data: Another common type of data is graphs. It is found for example in social networks (e.g. graph of friends) and chemistry (e.g. chemical molecules). Heterogeneous data. This is some data that combines several types of data, that may be stored in different format. Data streams: A data stream is a high-speed and non-stop stream of data that is potentially infinite (e.g. satellite data, video cameras, environmental data). The main challenge with data stream is that the data cannot be stored on a computer and must thus be analyzed in real-time using appropriate techniques. Some typical data mining tasks on streams are to detect changes and trends.


What are the relationships between data mining and other research fields?

There are some differences between data mining and statistics although both are related and share many concepts. Traditionally, descriptive statistics has been more focused on describing the data using measures, while inferential statistics has put more emphasis on hypothesis testing to draw significant conclusion from the data or create models. On the other hand, data mining is often more focused on the end result rather than statistical significance. Several data mining techniques do not really care about statistical tests or significance, as long as some measures such as profitability, accuracy have good values. Another difference is that data mining is mostly interested by automatic analysis of the data, and often by technologies that can scales to large amount of data. Data mining techniques are sometimes called “statistical learning” by statisticians. Thus, these topics are quite close.