Data Mining Isn't Magical, But It's Not A Cookbook Procedure, Either
Which claims are likely to be fraudulent? Which policies will be profitable? Which insureds can be expected to buy additional products?
These are some of the questions that may be answered by the application of data mining–a process that enables an insurance company to determine the profitability of specific business activities or customer segments.
Data mining, however, is not a cookbook procedure that can be performed by dropping any kind of data into an expensive software package. The involvement of analysts who have subject matter expertise and a firm grasp of the company's business goals is more important than the sophistication of the technology employed.
Data mining–a field of study that lies at the intersection of statistics, database management, machine learning and artificial intelligence–allows companies to extract valuable business information from the mountains of data captured in their systems. The data are used to build models that reveal hidden patterns or relationships, so that the information can be used to predict future customer behavior or financial results, or to make other critical business decisions.
The applications for data mining are quite broad. For example, it can be used to detect fraud, to identify cross-selling opportunities, to identify “spam” e-mail, to recognize handwritten digits electronically, to determine the effect of demographic variables on housing prices, and to predict stock market returns.
But while many insurance companies and some banks have made substantial investments in data-mining technology, they have often been disappointed with the results achieved, particularly if they started out with unrealistic expectations about what data mining could do.
The techniques used in data mining are not magical. They are closely related to traditional statistical techniques such as linear regression or time series analysis, but with a much richer and more flexible set of models. A typical data-mining problem might be formulated as follows.
An insurance company has a dataset with information on past experience–for example, a dataset listing information on past automobile policies sold by the insurer. The company is interested in predicting some variable captured in the dataset. This might be a measure of profitability, such as the loss ratio, or whether or not the customer is likely to buy other products in the future. This variable is called the response. Other variables in the dataset, called predictors, are thought to influence the response. For example, one might conjecture that policyholders with large families or good credit ratings are more profitable than others.
Data mining attempts to construct a model that will reliably estimate the response if we know the predictors. This model can then be used to evaluate new applicants for insurance, to design marketing campaigns or to evaluate future strategy.
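As a minimal sketch of this formulation, the Python example below fits a one-predictor linear regression by ordinary least squares and uses it to estimate the response for a new applicant. The choice of credit score as the predictor, the loss ratio as the response, the function names and all of the numbers are assumptions invented for illustration. A real project would use many predictors, far more data and probably a richer model.

```python
# Toy illustration only: the numbers below are invented, not real insurance data.
from statistics import mean

# Each record: (credit score as the predictor, loss ratio as the response).
past_policies = [
    (600, 0.90),
    (650, 0.80),
    (700, 0.70),
    (750, 0.60),
    (800, 0.50),
]


def fit_simple_regression(records):
    """Ordinary least squares for one predictor: response = intercept + slope * x."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in records)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Estimate the response for a predictor value x."""
    intercept, slope = model
    return intercept + slope * x


# Build the model from past experience, then score a hypothetical new applicant.
model = fit_simple_regression(past_policies)
estimated_loss_ratio = predict(model, 720)
print(f"Estimated loss ratio for the new applicant: {estimated_loss_ratio:.2f}")
```

The model is built once from past experience and then applied to cases it has not seen, which is the pattern described above for evaluating new applicants.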
Dozens of exotic modeling techniques are available to address data-mining problems. Although the rise of these powerful methods is a great step forward, the old tools are still valuable. Varieties of regression techniques, exploratory data analysis and even simple graphs can help reveal hidden patterns. In most cases, no single method will yield the best solution. Rather, successful data mining requires a portfolio of tools.
The CRISP-DM model is a six-step process for performing data-mining projects that has become a standard in the industry. The six steps are business understanding, data understanding, data preparation, modeling, evaluation and deployment.
Business Understanding. Data-mining analysts may become so engaged in the elegance of their tools that they forget the customer wants a data-mining model only if it solves a specific business problem. It is vital for the analyst to work closely with the project sponsor throughout the project to ensure that the models being developed will meet the company's business needs.
Like other projects, data-mining projects are subject to the hazards of “scope creep” and changes in project definition while the work is being done. A clear definition of what is to be accomplished, along with measurable success criteria, should be agreed upon before any analysis begins. Then the business objectives must be converted to a set of data-mining goals that specify what the project is to achieve in technical terms.
Data Understanding. Obtaining usable data that will be reliable for purposes of data mining is likely to be the most difficult and time-consuming part of any data-mining exercise. Further, a lack of understanding of exactly what the data mean and how they are compiled is sure to lead to incorrect results. Typical problems encountered include: multiple or out-of-date systems that are inflexible and do not talk to each other, data-coding conventions that change over time, and inadequate quality control in data coding. For example, the prorating of premiums and removal and addition of exposures can vary from coder to coder. Also, data fields may be completed inconsistently or not at all.
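Two of the problems named above, fields left blank and fields coded inconsistently, can be surfaced with very simple checks before any modeling begins. The Python sketch below is illustrative only. The records, the field names and the helper functions are hypothetical and do not come from any particular system.

```python
from collections import Counter

# Hypothetical policy records; field names and values are invented for illustration.
records = [
    {"policy_id": "A1", "state": "NY", "premium": 1200.0},
    {"policy_id": "A2", "state": "ny", "premium": None},
    {"policy_id": "A3", "state": "New York", "premium": 950.0},
    {"policy_id": "A4", "state": "", "premium": 1100.0},
]


def missing_counts(rows, fields):
    """Count the records in which each field is absent, None or an empty string."""
    counts = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) in (None, ""):
                counts[field] += 1
    return counts


def distinct_codes(rows, field):
    """List the distinct non-empty values of a field to expose inconsistent coding."""
    return sorted({row[field] for row in rows if row.get(field)})


missing = missing_counts(records, ["policy_id", "state", "premium"])
state_codes = distinct_codes(records, "state")

print("Missing values per field:", dict(missing))
print("Distinct state codes:", state_codes)
```

In this toy data the same state appears under three different codes, the kind of inconsistency that would otherwise fragment a single customer segment into several.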
Ensuring the cooperation of the systems group is also crucial. If the systems department does not understand precisely what is needed, irrelevant items are likely to be included and important items omitted.