Casting a Wide Net for Potential Predictors when Building a Model

What makes for a high-performing statistics-based predictive model? Often, a key factor is the creation of a large number of potential independent ("predictor") variables. Generally, it is wise to cast as wide a net as possible during the early stages of a model build, even though only a limited number of variables will make the final cut. That way, you maximize your chances of capturing all of the customer dynamics that are inherent in the data. In fact, with analysis files of significant size and complexity, it is common for experienced data miners to define several hundred potential predictors, each of which makes good business sense.

What makes it possible to create several hundred potential predictors? It is our old friend Best-Practices Marketing Database Content℠, a central tenet of which is that the data should be as robust as the underlying methods of collection are capable of supporting. The idea is to capture everything within reason about a customer's relationship with your company, in the form of atomic-level transaction detail. This detail, in turn, acts as the building blocks for the creation of lots and lots of potential predictors.
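To make the "building blocks" idea concrete, here is a minimal sketch of how atomic-level transaction rows might be rolled up into candidate predictors, one row per customer. The record layout, the sample data, and the name build_predictors are assumptions made for illustration only; a real marketing database would carry far more detail and yield far more variables.

```python
from collections import defaultdict
from datetime import date

# Hypothetical atomic-level detail: (customer_id, order_date, category, dollars).
TRANSACTIONS = [
    ("C1", date(2023, 1, 15), "apparel", 80.0),
    ("C1", date(2023, 11, 20), "home", 120.0),
    ("C2", date(2022, 6, 5), "apparel", 40.0),
]


def build_predictors(transactions, as_of):
    """Roll transaction detail up to one dict of candidate predictors per customer."""
    rows = defaultdict(lambda: {"orders": 0, "dollars": 0.0, "last_order": None})
    for customer_id, order_date, category, dollars in transactions:
        row = rows[customer_id]
        row["orders"] += 1                      # frequency
        row["dollars"] += dollars               # monetary
        if row["last_order"] is None or order_date > row["last_order"]:
            row["last_order"] = order_date      # most recent order, for recency
        key = f"dollars_{category}"             # dollars by merchandise category
        row[key] = row.get(key, 0.0) + dollars

    predictors = {}
    for customer_id, row in rows.items():
        out = {k: v for k, v in row.items() if k != "last_order"}
        out["recency_days"] = (as_of - row["last_order"]).days
        out["avg_order_dollars"] = row["dollars"] / row["orders"]
        predictors[customer_id] = out
    return predictors


predictors = build_predictors(TRANSACTIONS, as_of=date(2023, 12, 31))
```

Because the detail is kept at the transaction level, adding another family of predictors (by channel, season, or payment type, for instance) is a matter of adding another roll-up rather than going back to collect new data.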

Sometimes, however, large numbers of potential predictors do not significantly enhance the power of a model. This is true for businesses whose underlying dynamics are so straightforward that virtually all available model power can be captured by a handful of simple RFM-type variables. For example:

Several years ago, a model was built off a Marketing Database that reflected best-practices content. Several hundred potential predictors were created using combinations of: 1) RFM derivations, 2) merchandise categories, 3) initial order characteristics, 4) post-demand activity, 5) order channel distribution, 6) seasonality, 7) full-price vs. discounted merchandise, and 8) payment type. The final model did a great job of predicting customer purchase behavior. For example, the ratio-to-average ("lift," where 100 is average) for Decile 1 was 423 compared with only 16 for Decile 10.
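For readers who want to see how a ratio-to-average figure is derived, the following is a minimal sketch, assuming each customer has a model score and a 0/1 response flag. The function name decile_lift and the toy data are hypothetical and are not taken from the model described above.

```python
def decile_lift(scores, responses):
    """Return lift for Deciles 1-10 (100 = average); Decile 1 holds the top scores.

    Assumes at least 10 customers and at least one responder.
    """
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall_rate = sum(resp for _, resp in ranked) / n
    lifts = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        decile_rate = sum(resp for _, resp in chunk) / len(chunk)
        lifts.append(round(100 * decile_rate / overall_rate))
    return lifts


# Toy data: 20 customers scored 20 (best) down to 1, with made-up responses.
scores = list(range(20, 0, -1))
responses = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
lifts = decile_lift(scores, responses)
```

A lift of 423 for Decile 1 therefore means that customers in the top-scoring tenth of the file responded at 4.23 times the overall average rate.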

However, additional analysis revealed that over 90% of the model's power could be achieved with a handful of variables in which purchase dollars were delineated by several simple recency and seasonality breaks. With results such as these, some might question the ability of best-practices content to drive significant increases in revenue and profit. Next month's e-Letter will explain why, even for companies where models driven by simple RFM variables do a great job of predicting customer behavior, robust atomic-level data content still plays a central role in sophisticated CRM.