Data Mining Isn't Magical, But It's Not A Cookbook Procedure, Either


Which claims are likely to be fraudulent? Which policies will be profitable? Which insureds can be expected to buy additional products?

These are some of the questions that may be answered by the application of data mining, a process that enables an insurance company to determine the profitability of specific business activities or customer segments.

Data mining, however, is not a cookbook procedure that can be performed by dropping any kind of data into an expensive software package. The involvement of analysts with subject matter expertise and a firm grasp of the company's business goals is more important than the sophistication of the technology employed.

Data mining, a field of study that lies at the intersection of statistics, database management, machine learning and artificial intelligence, allows companies to extract valuable business information from the mountains of data captured in their systems. The data are used to build models that reveal hidden patterns or relationships, so that the information can be used to predict future customer behavior or financial results, or to make other critical business decisions.

The applications for data mining are quite broad. For example, it can be used to detect fraud, to identify cross-selling opportunities, to identify “spam” e-mail, to recognize handwritten digits electronically, to determine the effect of demographic variables on housing prices, and to predict stock market returns.

But while many insurance companies and some banks have made substantial investments in data-mining technology, they have often been disappointed with the results achieved, particularly if they started out with unrealistic expectations about what data mining could do.

The techniques used in data mining are not magical. They are closely related to traditional statistical techniques such as linear regression or time series analysis, but with a much richer and more flexible set of models. A typical data-mining problem might be formulated as follows.

An insurance company has a dataset with information on past experience, such as a listing of past automobile policies sold by the insurer. The company is interested in predicting some variable captured in the dataset. This might be a measure of profitability, such as the loss ratio, or whether or not the customer is likely to buy other products in the future. This variable is called the response. Other variables in the dataset, called predictors, are thought to influence the response. For example, one might conjecture that policyholders with large families or good credit ratings are more profitable than others.

Data mining attempts to construct a model that will reliably estimate the response if we know the predictors. This model can then be used to evaluate new applicants for insurance, to design marketing campaigns or to evaluate future strategy.
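The response-and-predictor setup described above can be sketched in a few lines of Python. This is a minimal illustration only, not a method the article prescribes: it assumes a single predictor (a credit score) and a single response (the loss ratio), fits a straight line by ordinary least squares, and uses invented figures. The names fit_line and predict, and all of the numbers, are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(model, x):
    """Estimate the response for a new value of the predictor."""
    intercept, slope = model
    return intercept + slope * x

# Invented past policies: credit score (predictor) and loss ratio (response).
credit_scores = [580, 620, 660, 700, 740, 780]
loss_ratios = [0.92, 0.85, 0.78, 0.70, 0.66, 0.59]

model = fit_line(credit_scores, loss_ratios)

# Estimated loss ratio for a hypothetical new applicant with a score of 720.
new_applicant_estimate = predict(model, 720)
print(round(new_applicant_estimate, 3))
```

A real project would involve many predictors and, most likely, richer models than a straight line, but the shape of the problem is the same: learn from past experience, then estimate the response for new cases.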

Dozens of exotic modeling techniques are available to address data-mining problems. Although the rise of these powerful methods is a great step forward, the old tools are still valuable. Varieties of regression techniques, exploratory data analysis and even simple graphs can help reveal hidden patterns. In most cases, no single method will yield the best solution. Rather, successful data mining requires a portfolio of tools.

The CRISP-DM model is a six-step process for performing data-mining projects that has become a standard in the industry. The steps include business understanding, data understanding, data preparation, modeling, evaluation and deployment.

Business Understanding. Data-mining analysts may become so engaged in the elegance of their tools that they forget the customer wants a data-mining model only if it solves a specific business problem. It is vital for the analyst to work closely with the project sponsor throughout the project to ensure that the models being developed will meet the company's business needs.

Like other projects, data-mining projects are subject to the hazards of “scope creep” and changes in project definition while the work is being done. A clear definition of what is to be accomplished, along with measurable success criteria, should be agreed upon before any analysis begins. Then the business objectives must be converted to a set of data-mining goals that specify what the project is to achieve in technical terms.

Data Understanding. Obtaining usable data that will be reliable for purposes of data mining is likely to be the most difficult and time-consuming part of any data-mining exercise. Further, a lack of understanding of exactly what the data mean and how they are compiled is sure to lead to incorrect results. Typical problems encountered include: multiple or out-of-date systems that are inflexible and do not talk to each other, data-coding conventions that change over time, and inadequate quality control in data coding. For example, the prorating of premiums and removal and addition of exposures can vary from coder to coder. Also, data fields may be completed inconsistently or not at all.
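A simple audit can surface the kinds of problems just described before any modeling starts. The sketch below, written under assumed field names and with invented records, counts how often a field is left blank and lists the distinct codings that appear in it. The function audit_field and the sample policies are hypothetical.

```python
from collections import Counter

def audit_field(records, field):
    """Count missing values and tally the distinct codings found in one field."""
    values = [rec.get(field) for rec in records]
    present = [str(v).strip() for v in values if v is not None and str(v).strip() != ""]
    return {
        "missing": len(values) - len(present),
        "codings": dict(Counter(present)),
    }

# Invented policy records showing inconsistent and incomplete coding.
policies = [
    {"policy_id": "A1", "state": "NY", "vehicle_use": "commute"},
    {"policy_id": "A2", "state": "ny", "vehicle_use": "C"},
    {"policy_id": "A3", "state": "New York", "vehicle_use": ""},
    {"policy_id": "A4", "state": "NJ"},
]

state_report = audit_field(policies, "state")
use_report = audit_field(policies, "vehicle_use")
print(state_report)
print(use_report)
```

Here the state field is never blank but carries three different spellings for what is presumably one state, while vehicle_use is missing in half the records. Deciding which codings mean the same thing still calls for someone who knows how the data were compiled.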

Ensuring the cooperation of the systems group is also crucial. If the systems department does not understand precisely what is needed, irrelevant items are likely to be included and important items omitted.

Sometimes the required data are not captured in the existing data systems. In this case, data from other systems, such as installment premium systems, field sales and marketing systems, and third-party databases (e.g., credit-scoring systems) can be used. If no such sources are available, supplementing existing data by a survey or sampling approach may yield enough data to perform the project.

Data Preparation. Once the data have been assembled, their quality must be verified, and they must be put into a format suitable for data mining. Tasks performed here include selecting the variables to be included in the models, throwing out rows and columns that may not be clean enough to use, and enriching the data by creating new variables.
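The three preparation tasks named above (selecting variables, discarding unusable rows and creating new variables) might look like the following in Python. This is a sketch under stated assumptions: the field names, the rule for what counts as an unusable row, and the sample records are all invented for illustration.

```python
def prepare(records, keep_fields):
    """Select variables, drop unusable rows, and add a derived variable."""
    prepared = []
    for rec in records:
        # Select only the variables intended for modeling.
        row = {field: rec.get(field) for field in keep_fields}
        # Drop rows with a missing value or a non-positive earned premium.
        if any(v is None for v in row.values()) or row["earned_premium"] <= 0:
            continue
        # Enrich the data: loss ratio = incurred losses / earned premium.
        row["loss_ratio"] = row["incurred_losses"] / row["earned_premium"]
        prepared.append(row)
    return prepared

# Invented raw extract with one unusable premium and one missing loss amount.
raw = [
    {"policy_id": "A1", "earned_premium": 1000.0, "incurred_losses": 650.0, "agent_notes": "n/a"},
    {"policy_id": "A2", "earned_premium": 0.0, "incurred_losses": 0.0, "agent_notes": ""},
    {"policy_id": "A3", "earned_premium": 800.0, "incurred_losses": None, "agent_notes": ""},
    {"policy_id": "A4", "earned_premium": 1200.0, "incurred_losses": 300.0, "agent_notes": ""},
]

clean = prepare(raw, ["policy_id", "earned_premium", "incurred_losses"])
for row in clean:
    print(row)
```

Only policies A1 and A4 survive, each with a new loss_ratio variable. Whether a row is really "not clean enough to use" is a judgment call that should be reviewed with the people who know the source systems.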

Modeling. Constructing the model is often the easiest step and certainly the most enjoyable. Many techniques should be tried, and simple models sometimes work as well as complex ones.

The various models are ranked on criteria such as how well the model predicts new cases, whether the model can be explained intuitively and whether the rules used in the model make sense to domain experts.
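The first of these criteria, how well a model predicts new cases, is the one that lends itself to a mechanical check: fit each candidate on one set of policies and score it on policies it has not seen. The sketch below assumes two deliberately simple candidates (predict the average, or fit a straight line), mean absolute error as the yardstick, and invented data. All names and numbers are hypothetical.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline candidate: always predict the average response."""
    avg = mean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Straight-line candidate fitted by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def mean_abs_error(model, xs, ys):
    """Average absolute gap between predicted and actual responses."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))

# Invented data: credit score (predictor) and loss ratio (response).
train_x = [580, 620, 660, 700, 740, 780]
train_y = [0.92, 0.85, 0.78, 0.70, 0.66, 0.59]
holdout_x = [600, 680, 760]          # policies held back from fitting
holdout_y = [0.90, 0.74, 0.61]

candidates = {"mean_only": fit_mean, "straight_line": fit_line}
scores = {
    name: mean_abs_error(fit(train_x, train_y), holdout_x, holdout_y)
    for name, fit in candidates.items()
}
ranking = sorted(scores, key=scores.get)   # best (lowest error) first
print(ranking, scores)
```

The other two criteria, whether the model can be explained intuitively and whether its rules make sense to domain experts, cannot be scored this way and remain matters of human review.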

Evaluation. The final model or models should be evaluated against the following criteria:

Does the model meet the defined business objectives?

Have any important business objectives not been addressed?

Does the model make sense to subject matter experts?

How is the model likely to affect profits?

How much variability do we expect in results if the company changes its business processes based on model results?

When these questions have been answered satisfactorily, a decision is made whether to perform further modeling or proceed to deployment.

Deployment. The analyst should be prepared to respond to issues that arise when the model goes live. The deployment plan should include provisions for ongoing monitoring and maintenance to avoid incorrect usage of models, and to evaluate their performance against the previously determined success criteria. Market reaction and changes implemented by competitors need to be carefully considered as well.

Success requires a well-defined business objective, a process for carefully cleansing, enriching, and preparing the data, and the involvement of domain experts throughout the data-mining process. With realistic expectations and careful consideration of all the issues, a company can make data mining an essential weapon in its arsenal of tools for evaluating profitability.

Jim King is a senior consulting actuary in the Actuarial Services Group, and Orin Linden is director of property/casualty actuarial services for New York-based Ernst & Young LLP.

Reproduced from National Underwriter Life & Health/Financial Services Edition, November 25, 2002. Copyright 2002 by The National Underwriter Company in the serial publication. All rights reserved. Copyright in this article as an independent work may be held by the author.