Best Practices in Data Mining: The First Five Commandments

This is the commencement of a quarterly column that will focus on best practices in data mining. We define data mining as all of the analytical methods that are available to transform data into insight. Examples include statistics-based predictive models, homogeneous groupings ("clusters"), cohort analyses such as lifetime value, quantitative approaches to optimizing contact strategies across multiple channels, and the creation of report packages and key-metrics dashboards.

What this Column Will Not Be About

We will not spend a lot of time comparing predictive modeling techniques and software packages. Much has been written, for example, about the merits of regression versus neural networks. Having participated in countless model builds, I can attest first-hand that technique plays only a secondary role in the success or failure of a predictive model.

Discussions about modeling techniques have always reminded me of the theological debate that took place many centuries ago about how many angels can dance on the head of a pin. Today's data miners are fixated on their own pins and angels when they wrangle about techniques!

A by-product of this wrangling is the fantastic claims made by proponents of some of these techniques. Unfortunately, such claims are pabulum for the gullible. The inconvenient truth, to borrow a phrase from a prominent national politician, is that technique has very little impact on results. There is only so much variance in the data, and the stark reality is that new techniques are not going to drastically improve the power of predictive models.

What this Column Will Be About

The focus will be on the truly important issues; namely, just about everything else having to do with data mining. For example, this month's topic is the significant improvement that can be achieved by optimizing the raw inputs to the data mining process. The ultimate goal is to perform data mining off a platform that we at Daystar Wheaton Group refer to as Best Practices Marketing Database Content. This, in turn, supports deep insight into the behavior patterns that form the foundation for data-driven decision-making.

General Characteristics of Best Practices Marketing Database Content

For starters, Best Practices Marketing Database Content provides a consolidated view of all customers and inquirers across all channels. Examples of channels include direct mail, e-commerce, brick-and-mortar retail, telesales and field sales. Sometimes, particularly in Business-to-Business and Business-to-Institution environments, prospects are included.

Best Practices Marketing Database Content is as robust as the underlying methods of data collection can support. The complete history of transactional detail must be captured. Everything within reason must be kept, even if its value is not immediately apparent. For example:

One multi-channel marketer failed to forward non-cash transactions from its brick-and-mortar operation to the marketing database. This became a problem when a test was done to determine the effectiveness of coupons sent to customers, which were good for free samples of selected merchandise. The goal was to determine whether these coupons would economically stimulate store traffic. But, because the corresponding transactions did not involve cash, there was no way to mine the database for insights into which customers had taken advantage of the offer, and what the corresponding effect was on long-term demand.
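
Had the non-cash transactions been captured, the analysis would have been straightforward. Here is a minimal sketch in Python (with hypothetical field names and values, and pandas assumed purely for illustration) of a consolidated event record that keeps zero-dollar activity alongside cash sales:

    import pandas as pd

    # Hypothetical consolidated event table: one row per customer event, across
    # channels, including zero-dollar (non-cash) transactions such as coupon redemptions.
    events = pd.DataFrame([
        {"customer_id": 101, "channel": "retail", "event_type": "purchase",          "amount": 45.00},
        {"customer_id": 101, "channel": "retail", "event_type": "coupon_redemption", "amount": 0.00},
        {"customer_id": 202, "channel": "web",    "event_type": "purchase",          "amount": 89.50},
    ])

    # Because the zero-dollar redemptions were kept, we can identify which
    # customers took advantage of the free-sample offer...
    redeemers = set(events.loc[events["event_type"] == "coupon_redemption", "customer_id"])

    # ...and compare their subsequent demand against that of non-redeemers.
    demand = events[events["event_type"] == "purchase"].groupby("customer_id")["amount"].sum()
    print("Redeemed the free-sample coupon:", redeemers)
    print(demand)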

The Ten Commandments of Best Practices Marketing Database Content

There are Ten Commandments that, if followed, will ensure Best Practices Marketing Database Content. Five are discussed this month, and the balance will be covered in the next column:

1: The Data Must Be Maintained at the Atomic Level

All customer events such as the purchase of products and services must be maintained at the lowest feasible level. This is important because, although you can always aggregate, you can never disaggregate. Robust event detail provides the necessary input for seminal data mining exercises such as product affinity analysis.

"Buckets" and other accumulations created from the data should be avoided. This is particularly important for businesses that are rapidly expanding, where it can be impossible to audit and maintain summary data approaches across ever-increasing numbers of divisions.

One firm learned the hard way about the need to maintain atomic-level detail when it discovered that its aggregated merchandise data did not support deep-dive product affinity analysis. This is because, by definition, it was impossible to understand purchase patterns within each aggregated merchandise category. For example, with no detail beyond "Jewelry," there was no way to identify patterns across subcategories such as Watches, Fine/Fashion Merchandise, Bridal Diamonds, Fashion Diamonds, Pearls/Stones, Accessories and Loose Goods.
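
As a rough illustration of the point, the sketch below (in Python, with a hypothetical line-item table kept at the subcategory level) shows how atomic detail supports a simple co-purchase cross-tab, the raw material of product affinity analysis:

    import pandas as pd

    # Hypothetical atomic line items: one row per item purchased, per order.
    lines = pd.DataFrame([
        {"order_id": 1, "subcategory": "Watches"},
        {"order_id": 1, "subcategory": "Accessories"},
        {"order_id": 2, "subcategory": "Bridal Diamonds"},
        {"order_id": 2, "subcategory": "Pearls/Stones"},
        {"order_id": 3, "subcategory": "Watches"},
        {"order_id": 3, "subcategory": "Accessories"},
    ])

    # Atomic detail supports affinity analysis: which subcategories appear together?
    pairs = lines.merge(lines, on="order_id")
    pairs = pairs[pairs["subcategory_x"] < pairs["subcategory_y"]]
    print(pd.crosstab(pairs["subcategory_x"], pairs["subcategory_y"]))

    # Had every row been rolled up to "Jewelry" before loading, the cross-tab
    # above would collapse to a single, uninformative cell. You can always
    # aggregate, but you can never disaggregate.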

2: The Data Must Not Be Archived or Deleted

Within reason, data must not be archived. Likewise, it must not be deleted except under rare circumstances. Ideally, even ancient data should be retained because you never know when you might need it. Rolling off older data is perhaps the most common shortcoming of today's marketing databases, an ironic development because, unlike ten or twenty years ago, disk space is cheap.

Data mining can be severely hampered when the data does not extend significantly back in time. One database marketing firm experienced this when it tried to build a model to predict which customers would respond to a Holiday promotion. Unfortunately, all data content older than thirty-six months was rolled off the database on a regular basis. Remarkably, it was not even archived. For example, the database would only reflect three years of history for a customer who had been purchasing for ten years.

The only way to build the Holiday model, of course, was to go back to the previous Holiday promotion. This reduced to twenty-four months the historical data available to drive the model. More problematic was the need to validate the model off another Holiday promotion, the most recent of which had, by definition, taken place two years earlier. This, in turn, reduced to twelve months the amount of available data. As you can imagine, the resulting model was far from optimal in its effectiveness!
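
For readers who want to see the arithmetic, a quick sketch (the thirty-six-month retention window and the annual Holiday cadence are taken from the example above):

    # The shrinking-history arithmetic from the example above.
    retention_months = 36            # data older than this was rolled off the database
    model_promotion_age = 12         # the previous Holiday promotion, used to build the model
    validation_promotion_age = 24    # the Holiday promotion before that, used for validation

    print(retention_months - model_promotion_age)        # 24 months of history for the model
    print(retention_months - validation_promotion_age)   # 12 months of history for validation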

3: The Data Must Be Time-Stamped

The use of time-stamped data to describe phenomena such as orders, items and promotions facilitates an understanding of the sequence in which customers have been cross-sold. The same is true when customers are found to have purchased across multiple divisions, as happens when acquired companies are incorporated. Corresponding data mining applications include product affinity analysis and next-most-likely-purchase modeling.
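
A minimal sketch, again in Python with hypothetical field names and dates, of how time-stamped events allow each customer's cross-sell sequence to be reconstructed:

    import pandas as pd

    # Hypothetical time-stamped order events.
    orders = pd.DataFrame([
        {"customer_id": 101, "order_date": "2006-03-01", "category": "Watches"},
        {"customer_id": 101, "order_date": "2006-09-15", "category": "Accessories"},
        {"customer_id": 101, "order_date": "2007-01-10", "category": "Bridal Diamonds"},
        {"customer_id": 202, "order_date": "2006-05-20", "category": "Watches"},
    ])
    orders["order_date"] = pd.to_datetime(orders["order_date"])

    # Because each event carries a time stamp, the purchase sequence per customer
    # can be reconstructed, which is the raw material for next-most-likely-purchase models.
    sequences = (orders.sort_values("order_date")
                       .groupby("customer_id")["category"]
                       .apply(list))
    print(sequences)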

4: The Semantics of the Data Must Be Consistent and Accurate

Descriptive information on products and services must be easily identifiable over time despite any changes that might have taken place in naming conventions. Consider how untenable analysis would be if the data semantics were so inconsistent that, say, "item number 1956" referenced a type of necktie several years ago but umbrellas now. Also, the reconciliation of different product and services coding schemes must be appropriate to the data-driven marketing needs of the overall business, and not merely to the individual divisions.
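
One common way to keep such semantics straight is an effective-dated crosswalk. The sketch below is illustrative only; the item number, descriptions and dates are hypothetical:

    import pandas as pd

    # Hypothetical effective-dated item crosswalk: the same item number can mean
    # different things in different eras, so each meaning gets its own date range.
    crosswalk = pd.DataFrame([
        {"item_number": 1956, "description": "necktie",  "valid_from": "2001-01-01", "valid_to": "2005-12-31"},
        {"item_number": 1956, "description": "umbrella", "valid_from": "2006-01-01", "valid_to": "9999-12-31"},
    ])
    for col in ("valid_from", "valid_to"):
        crosswalk[col] = pd.to_datetime(crosswalk[col])

    def describe(item_number, as_of):
        """Resolve an item number to the description that was in force on a given date."""
        as_of = pd.Timestamp(as_of)
        match = crosswalk[(crosswalk["item_number"] == item_number)
                          & (crosswalk["valid_from"] <= as_of)
                          & (as_of <= crosswalk["valid_to"])]
        return match["description"].iloc[0] if not match.empty else None

    print(describe(1956, "2004-06-01"))   # necktie
    print(describe(1956, "2007-06-01"))   # umbrella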

5: The Data Must Not Be Over-Written

Deep-dive data mining is predicated upon the re-creation of past point-in-time "views." For example, a model to predict who is most likely to respond to a Summer Clearance offer will be based on the historical information available at the time of an earlier Summer Clearance promotion. The re-creation of point-in-time views is problematic when data is overwritten.

A major financial institution learned this in conjunction with a comprehensive database that it built to facilitate prospecting. After months of work, the prospect database was ready to launch. The internal sponsors of the project, anxious to display immediate payback to senior management, convened a two-day summit meeting to develop a comprehensive, data-driven strategy.

One hour into the meeting, the brainstorming came to an abrupt and premature end. The technical folks, in their quest for processing efficiency, had not included in the database a running history of several fields that were critical to the execution of any data mining work. Instead, the values comprising these fields were over-written during each update cycle.

The incorporation of this running history necessitated a redesign of the prospect database. The unfortunate result was a two-month delay, a loss of credibility in the eyes of senior management, and a substantial decline in momentum.
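
For illustration, here is a minimal Python sketch of what such a running history might look like, and of how it supports the re-creation of a past point-in-time view. The field names, segment values and dates are hypothetical:

    import pandas as pd

    # Hypothetical running history of a customer attribute. Each change is a new
    # row rather than an overwrite, so past values remain recoverable.
    history = pd.DataFrame([
        {"customer_id": 101, "field": "segment", "value": "Prospect", "effective_date": "2005-02-01"},
        {"customer_id": 101, "field": "segment", "value": "Active",   "effective_date": "2006-04-15"},
        {"customer_id": 101, "field": "segment", "value": "Lapsed",   "effective_date": "2007-03-01"},
    ])
    history["effective_date"] = pd.to_datetime(history["effective_date"])

    def as_of(customer_id, field, date):
        """Recreate the value of a field as it stood on a past date."""
        snap = history[(history["customer_id"] == customer_id)
                       & (history["field"] == field)
                       & (history["effective_date"] <= pd.Timestamp(date))]
        return None if snap.empty else snap.sort_values("effective_date")["value"].iloc[-1]

    # The view as it stood at the time of an earlier promotion:
    print(as_of(101, "segment", "2006-06-01"))   # "Active"
    # An over-written field could only ever yield the current value ("Lapsed").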

Final Thoughts

The next column will focus on Commandments Six through Ten of Best Practices Marketing Database Content. In the meantime, consider whether your marketing database violates any of the first five Commandments. The extent to which it does is the extent to which your firm's revenues and profits are being artificially limited.