Is your marketing data AI-ready?
AI is not new. Many of the retail and CPG marketing solutions that I have been a part of for decades now have leveraged sophisticated AI and machine learning techniques. That said, relative to its potential, AI is still in its adolescence. It still needs some adult supervision for anything dangerous or business-altering. And yes, even unsupervised learning models.
Like many adolescents, overconfident AI-based solutions sometimes act like they ‘know it all’ and fail to consider the relatively small, biased, and siloed experiences on which they are basing their opinions. Sure, over time your AI models will get smarter and figure more stuff out on their own, but until then, you need to inject domain-specific advice and rules in the form of data.
Below are some tips that I picked up over the years conceiving, nurturing, and learning from dozens of industry-leading marketing solutions used by hundreds of CPG, retailer, and agency clients. I hope that you can benefit from both my successes and my mistakes.
AI is the secret weapon helping our industry rethink how it works and deal with generational changes like consumer privacy, digital id signal loss, and retail media. It will fail if the data that AI depends on is not up for the task and you do not solve, avoid, or mitigate the challenges posed below.
Let’s dive in…
Incomplete Data
Your AI is taught to find patterns, so it doesn’t take a PhD Data Scientist to understand that hiding key patterns from your data can be a problem. Unfortunately, missing or unknown data happens all the time and needs to be proactively mitigated. Your data needs to help your AI understand the difference between ‘no’ versus ‘I do not know’. Incomplete data can be data that you usually get but did not get for some reason or another. Missing a few days or stores in your time-series purchase data is an example of this. Not addressing this could result in your AI believing that your customer(s) did not purchase your products within that data gap, thereby suppressing the impact you attribute to the marketing activities that may be going on. For many use cases, it would be better to replace that implied ‘zero’ sales number with an estimated baseline or forecasted sales number. You should flag that data as ‘estimated’ so your AI use cases can choose to use that data or not.
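As a rough illustration of the gap-filling idea above, the sketch below fills missing days in a daily sales series with a baseline and flags them as estimated. The function name `fill_gaps`, the sample numbers, and the naive "mean of observed days" baseline are all assumptions for illustration; a real baseline or forecast would likely account for seasonality and promotions.

```python
from datetime import date, timedelta
from statistics import mean


def fill_gaps(daily_units, start, end):
    """Return one row per day from start to end (inclusive).

    Days missing from daily_units get a baseline estimate and estimated=True,
    so downstream models can tell 'no sales' apart from 'no data'.
    """
    # Naive baseline (assumption): the mean of the observed days.
    baseline = mean(daily_units.values())
    rows = []
    day = start
    while day <= end:
        if day in daily_units:
            rows.append({"date": day, "units": daily_units[day], "estimated": False})
        else:
            rows.append({"date": day, "units": baseline, "estimated": True})
        day += timedelta(days=1)
    return rows


# Hypothetical data: March 3 never arrived from the retailer.
observed = {date(2024, 3, 1): 10, date(2024, 3, 2): 14, date(2024, 3, 4): 12}
rows = fill_gaps(observed, date(2024, 3, 1), date(2024, 3, 4))
for row in rows:
    print(row["date"], row["units"], "estimated" if row["estimated"] else "reported")
```

A day with a genuinely reported zero stays in the input with a value of 0 and is never overwritten, which keeps ‘no’ and ‘I do not know’ separate.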
Another common incompleteness gap is where you never get data that you know exists. This is common for brands using purchase data from just a subset of retailers that sell their products. Regardless of the type of incomplete data, blindly assuming something like a purchase or marketing exposure did not happen because that data says so is troublesome for many use cases. This can result in media budget waste, loyal shopper purchase subsidization, or worse, costly strategic mistakes.
Estimating how complete your data is by using historical or third-party data can help your AI-based solutions adapt to, adjust for, or avoid misleading incomplete data. Some marketing use cases work with incomplete data just fine. Identifying customers who have bought a given product or seen an ad works well despite missing some data. However, identifying customers who did not buy a given product, or who bought a given amount of a product, is far less certain because they may have bought the product in one of your blind spots. The bigger the blind spot, the less certain your AI predictions can be, which is why injecting your estimated level of completeness into the data may help.
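One way to picture injecting completeness into the data is sketched below. It assumes you can estimate each retailer's share of category sales from a third-party source; the retailer names, the shares, and the idea of using completeness directly as a confidence score are simplifying assumptions, not a calibrated method.

```python
def estimate_completeness(category_share_by_retailer, observed_retailers):
    """Share of category sales that flows through retailers you receive data from."""
    return sum(
        share
        for retailer, share in category_share_by_retailer.items()
        if retailer in observed_retailers
    )


def label_customer(seen_buying, completeness):
    """Attach a rough confidence to a buyer / non-buyer label."""
    if seen_buying:
        # An observed purchase is evidence regardless of blind spots.
        return {"label": "buyer", "confidence": 1.0}
    # Crude proxy (assumption): trust a 'non-buyer' label only as far as
    # the data is complete.
    return {"label": "non_buyer", "confidence": completeness}


# Hypothetical category shares; data is only received from retailers a and b.
shares = {"retailer_a": 0.40, "retailer_b": 0.25, "retailer_c": 0.35}
completeness = estimate_completeness(shares, {"retailer_a", "retailer_b"})
print(label_customer(True, completeness))
print(label_customer(False, completeness))
```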
Biased Data
To borrow from the garbage data quality adage, ‘bias in, bias out’. AI excels at helping you get better at what it has seen you do before. So, if you train your model on Every Day Low Price retailer data, you will get great regular price advice, but that model will struggle to understand the effects of weekly circulars, coupons, and other promotion tactics. Similarly, if you train your digital bid optimization algorithms only on prior ad impression or ‘win log’ data, you will get even better at buying the same media or consumer segments, but you will struggle if you shift your spending to new media providers, publishers, context/content types, and audience segments.
Unfortunately, it is difficult to de-bias your raw training data without some unbiased data. The best solution, of course, is to inject more varied data to eliminate the bias! That is not an option for most of us because the data we need is closely guarded or financially impractical to source. That leaves you with a few options.
First, you can exclude or customize your AI use cases around each known bias. Not awesome.
Second, you could replace your current biased data sources with a third-party nationally representative panel data set. This will work for many, but not all of your AI and marketing use cases. Panel data will increasingly become mandatory as other data gets ‘walled’ off. Panels will however introduce sample size and other quirks that will need to be managed due to panel recruitment challenges, panelist churn, obtrusive data collection, proprietary taxonomies, etc.
A third option is the best of both worlds. Use representative panel data to adjust, calibrate, or filter your current data to make it less biased. Panels can be your ‘truth’ set to help mitigate your data’s bias by comparing it to industry category purchase volume norms, retailer market share dynamics, typical purchase cycles, and audience segment skews. There are limits to this, unfortunately, as you cannot adjust or scale data that you do not have. For example, if you cannot afford to buy media on professional sports content, no amount of calibrating your ad impression data can tell you if your brand buyers like professional sports.
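The calibration idea can be sketched as simple reweighting: compare the mix in your own data to the mix a representative panel reports, and weight each segment accordingly. The channel names and numbers below are hypothetical, and real calibration usually involves more dimensions than one.

```python
def calibration_weights(own_counts, panel_share):
    """Weight per segment so that own data matches the panel's segment mix.

    Segments the panel knows about but own data lacks get None:
    you cannot scale data that you do not have.
    """
    total = sum(own_counts.values())
    weights = {}
    for segment, target_share in panel_share.items():
        count = own_counts.get(segment, 0)
        if count == 0:
            weights[segment] = None
        else:
            weights[segment] = target_share / (count / total)
    return weights


# Hypothetical: own data over-represents grocery and has no club channel at all.
own = {"grocery": 800, "mass": 200}
panel = {"grocery": 0.5, "mass": 0.3, "club": 0.2}
weights = calibration_weights(own, panel)
print(weights)
```

The `None` for the missing segment is deliberate: it surfaces the blind spot instead of silently pretending the reweighted data covers it.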
Complicated Data
As odd as it sounds, some data is just too complicated for AI. This includes continuous numerical ranges, granular attributes, and even highly normalized database structures. Take numerical data like price, time, size, or age that have hundreds or thousands of potential values. It is a best practice to group and aggregate these high cardinality data elements, most often into quartiles, deciles, or similar groupings. Most industries, however, beg for the data ranges to be based on human domain knowledge or specialized clustering techniques. Deciling purchase quantity or time of day may make sense, but likely not age and price. Placing newborns and 9-year-olds into the same age decile does not make sense. It is better to leverage proven segmentations in many cases. That said, hiding the fine details from AI is not always the right choice. For example, retailers know that pricing something at $98 or $99 is far different from pricing it just one dollar more at $100.
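To make the contrast with blind deciling concrete, here is a minimal sketch of domain-driven bands plus an explicit price-threshold flag. The age edges, band labels, and the 25-dollar bucket width are illustrative assumptions, not an established segmentation.

```python
from bisect import bisect_right

# Hypothetical life-stage boundaries (lower edge of each band after the first).
AGE_EDGES = [2, 5, 13, 18, 35, 55]
AGE_LABELS = ["infant", "toddler", "child", "teen", "young_adult", "adult", "senior"]


def age_band(age):
    """Map an age in years to a domain-defined band instead of a decile."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


def price_features(price):
    """Coarse price bucket plus a flag that preserves the $100 threshold."""
    return {
        "price_bucket": int(price // 25) * 25,  # lower edge of a 25-dollar bucket
        "under_100": price < 100,
    }


print(age_band(0), age_band(9))
print(price_features(99), price_features(100))
```

Keeping `under_100` as its own feature is one way to group prices without hiding the detail that matters at the threshold.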
Having too many descriptive attribute values for products, customers, prospects, media, etc. can also create too much noise for your AI models. Take something like color or flavor with dozens of variations. That fidelity is required to describe a product accurately, but it could overwhelm a model, making it difficult for useful insights or trends to emerge. Summarizing attributes is much trickier than grouping number ranges as there are typically several ways to summarize attributes. Take fruit flavors. You may want to group oranges, navel oranges, blood oranges, tangerines, and clementines together. You might also want something higher level like citrus, berries, apples, etc. Grouping high-antioxidant foods like strawberries, blueberries, pomegranates, dark chocolate, etc. might also prove insightful. Suffice it to say that effectively classifying data is very domain-specific and requires close collaboration with those who know your product, category, and customers well.
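Because one attribute can be summarized several ways at once, a lookup that returns multiple groupings per value is a reasonable starting point. The taxonomy below is a hypothetical fragment built from the flavor example above; the group names and flags are illustrative only and would come from your category experts in practice.

```python
# flavor -> (variety group, flavor family); hypothetical taxonomy fragment
FLAVOR_GROUPS = {
    "navel orange": ("orange", "citrus"),
    "blood orange": ("orange", "citrus"),
    "tangerine": ("orange", "citrus"),
    "clementine": ("orange", "citrus"),
    "strawberry": ("strawberry", "berries"),
    "blueberry": ("blueberry", "berries"),
    "pomegranate": ("pomegranate", "other fruit"),
}
# A second, cross-cutting grouping taken from the example in the text.
HIGH_ANTIOXIDANT = {"strawberry", "blueberry", "pomegranate"}


def summarize_flavor(flavor):
    """Return several alternative summaries for one raw flavor value."""
    key = flavor.strip().lower()
    variety, family = FLAVOR_GROUPS.get(key, ("unmapped", "unmapped"))
    return {
        "flavor": key,
        "variety_group": variety,
        "family": family,
        "high_antioxidant": key in HIGH_ANTIOXIDANT,
    }


print(summarize_flavor("Blood Orange"))
print(summarize_flavor("Blueberry"))
```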
Simplifying your database structures for AI may also be necessary. Your data scientists need close collaboration with your data architects and engineers as the data attributes, segmentations, and taxonomies they need are (or should be) maintained in a normalized master data management repository. Depending on the sophistication of your data cloud technology and AI stack, you likely will need to flatten this type of data for your AI.
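Flattening can be as simple as resolving the foreign keys of a normalized product table into one wide row per product. The tables, keys, and column names below are hypothetical stand-ins for whatever your master data repository actually holds.

```python
# Hypothetical normalized master data: products reference brands and categories by id.
brands = {1: {"brand": "BrandA", "manufacturer": "MakerCo"}}
categories = {10: {"category": "soda", "department": "beverages"}}
products = [
    {"sku": "A-100", "brand_id": 1, "category_id": 10, "size_oz": 12},
    {"sku": "A-200", "brand_id": 1, "category_id": 10, "size_oz": 67},
]


def flatten(products, brands, categories):
    """Join lookup tables onto each product so a model sees one flat row per SKU."""
    flat = []
    for product in products:
        row = {"sku": product["sku"], "size_oz": product["size_oz"]}
        row.update(brands[product["brand_id"]])
        row.update(categories[product["category_id"]])
        flat.append(row)
    return flat


flat_rows = flatten(products, brands, categories)
print(flat_rows[0])
```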
Inconsistent Data
In some ways, data inconsistency has many of the same adverse effects that overly complicated data has. The difference is that inconsistent data is often wrong, versus just too granular, due to missing data quality practices, lack of standards, and poor cross-functional collaboration. In many cases, these inconsistencies can be easily fixed by aligning definitions and centrally managing acceptable values. For example, did you buy your morning coffee at a Gas Station or Convenience Store? Was it a 7-11, 7-ELEVEn, 7-ELEVEN, Seven-Eleven, etc.? Are the sales represented in cups or ounces sold?
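Centrally managing acceptable values often comes down to a normalization step plus an alias table. The sketch below handles the retailer-name variants mentioned above; the alias table and the choice of "7-Eleven" as the canonical spelling are assumptions for illustration.

```python
import re

# Hypothetical alias table, keyed by a normalized form of the raw name.
ALIASES = {
    "711": "7-Eleven",
    "7eleven": "7-Eleven",
    "seveneleven": "7-Eleven",
}


def normalize_key(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def canonical_retailer(name):
    """Return the canonical retailer name, or None if the value needs review."""
    return ALIASES.get(normalize_key(name))


for raw in ["7-11", "7-ELEVEn", "7-ELEVEN", "Seven-Eleven", "Corner Mart"]:
    print(raw, "->", canonical_retailer(raw))
```

Returning `None` for unknown values, rather than passing them through, makes it easy to route new spellings to a person for review.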
In some cases, the data is technically correct but just represented wrong. Finding patterns in your data does not work well if you mix U.S. dollars with euros, metric with imperial measurements, or pounds with units sold. This problem can be mitigated in most situations by normalizing the data to a single aligned standard. Some conversions are easy. An inch is always 25.4 millimeters. Some are a little harder. As of today, 1 euro = 1.06 U.S. dollars, but it will be different by the time you read this. Still others are harder. I once had to know what an average banana weighed to convert pounds sold into units sold.
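The three kinds of conversion differ mainly in where the conversion factor comes from, which a short sketch can make explicit. The inch factor is fixed; the exchange rate must be supplied for the date in question; and the average unit weight is an estimate you have to source yourself. The 0.4 lb banana weight below is a placeholder assumption, not a measured figure.

```python
MM_PER_INCH = 25.4  # fixed by definition


def inches_to_mm(inches):
    """Easy case: a constant factor."""
    return inches * MM_PER_INCH


def eur_to_usd(amount_eur, usd_per_eur):
    """Harder case: the caller must supply the rate for the relevant date."""
    return amount_eur * usd_per_eur


def pounds_to_units(pounds_sold, avg_unit_weight_lb):
    """Hardest case: depends on an estimated average weight per unit."""
    return round(pounds_sold / avg_unit_weight_lb)


print(inches_to_mm(2))
print(eur_to_usd(100, usd_per_eur=1.06))  # rate quoted in the text, valid only as of writing
print(pounds_to_units(120, avg_unit_weight_lb=0.4))  # 0.4 lb is a placeholder
```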
Unfortunately, many data inconsistencies are out of your control and arrive inconsistent from your outside vendors or partners. Take media ad impressions and campaign conversions, for example. Media platforms and publishers too often vary on what qualifies as being ‘seen’ or can be considered a success. Harmonizing data where you do not have the underlying raw details is sometimes impossible, but that does not mean that you do not need to intervene to keep your AI from coming to the wrong conclusions. How to mitigate third-party inconsistencies is very situation-dependent and difficult to comment on here, but it could be anything from calibrating to some universe norm, mapping to a common taxonomy, or adding confidence weights to just caveating results.
Contextless Data
Perhaps the biggest data challenge that humans understand that AI does not is context. What influences what, when, where, why, and how we do things in life is complex. Context can be almost anything… pandemics, new babies, new jobs, new homes, morning commutes, holiday weekends, dinner time, bedtime, sporting events, political commentary, retailer circulars, in-store displays, etc. Injecting life context into your data can sometimes be the difference between your AI surfacing which marketing investments truly moved the needle versus giving credit to the distant second reason.
Data scientists as consumers know that context matters, but they too often suppress their less geeky human knowledge and are content with their ‘all else being equal’ model caveats. We know however that ‘all else’ is rarely equal. If we expect AI to replace more expert human reason, we will need to inject more human-relatable context into our data.
All said, it is impractical to understand, collect, or integrate every type of context down to the time, place, and person it influenced. My suggestion is to start with a good data model and populate it with what you already have. Gather and label key timeframes, moments, locations, places, events, media content, etc. Most companies have done this already, albeit scattered across their enterprise. Piggyback on other established marketing processes and analytics that may have already summarized this disparate data for you. A great example is the data many have already gathered and neatly time-aligned for their marketing mix models. It is a great basecoat that should cover your broader marketing stimuli.
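A first pass at labeling context can be very modest: enrich each record with a few flags derived from its date and a shared event calendar. The calendar contents and field names below are hypothetical; in practice they might come from the time-aligned inputs already built for marketing mix models.

```python
from datetime import date

# Hypothetical event calendar maintained alongside the data model.
EVENTS = {date(2024, 7, 4): "us_independence_day"}


def add_context(transaction):
    """Return a copy of the transaction with simple context labels attached."""
    day = transaction["date"]
    enriched = dict(transaction)
    enriched["is_weekend"] = day.weekday() >= 5  # Saturday=5, Sunday=6
    enriched["event"] = EVENTS.get(day)  # None when no labeled event applies
    return enriched


holiday_sale = add_context({"date": date(2024, 7, 4), "units": 30})
weekend_sale = add_context({"date": date(2024, 7, 6), "units": 18})
print(holiday_sale)
print(weekend_sale)
```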
Classified media content is becoming a critical form of context to understand as more advertising dollars shift from addressable audience targeting to non-addressable targeting of media context that disproportionately attracts your desired audience. Unfortunately, accurately and relevantly classifying media content is not as commonplace as you would think. Industry commercial and financial factors hamper the availability, granularity, and accuracy of media and publisher inventory data. The industry’s major content taxonomies (e.g. IAB, Google Topics, vendors) will need to be more granular, relevant, and accurate to adequately replace audience targeting. You may need to do this yourself. If so, you will also need a strong sample of URL and program-level ad and media exposure data to train your models, optimize bidding algorithms, and measure results. There is no way around it. Your media AI predictions need data that knows both what you buy and what you watch.
And a few more
Stale data – Your model training and execution data needs to reflect the current world you are trying to predict. People, households, categories, and societies all change over time, and some can change in an instant. Major events like the global pandemic forced many models to be adjusted and adjusted again. How stale is stale is very company, category, and brand specific. For some (e.g. baby product brands), near real-time and person-specific data is the ultimate goal.
Low Sample data – Making big decisions on too few observations can be dangerous. If you cannot augment your data to boost sample size, your data needs behaviorally relevant segmentations and hierarchies to allow your AI to find more confident patterns by aggregating similar data points into larger observations.
Modeled data – Modeling already modeled data is not ideal, yet this happens more than we would like, often due to ignorance of how the upstream data was created by the third-party data suppliers. This will only get more challenging as our reliance on probabilistic data increases due to consumer privacy and digital ID signal loss. Increased penetration of Gen AI-generated content will also impact the data and reliability of future large language models.
Illegally Used data – Permissible data use is getting tighter and penalties for abusing it are getting stronger. Promises made to consumers and contractual commitments made with data partners are being scrutinized more than ever, both by the business parties and our governments. In some cases, it may be prudent to prepare and enforce use case specific data to avoid accidental misuse of seemingly better data that could lead to costly penalties, redos, and business consequences.
Fraudulent data – It is unfortunate that we even need to address this one, but the advertising industry is littered with fraudulent, bogus, and misrepresented data. You will need to invest to analyze and remove data that could mislead your AI. This includes data from bots, clickbait, made-for-advertising sites, and data that is just plain unrealistic (e.g. spending thousands a day, clicking on hundreds of ads, watching TV 24x7, spanning time zones hourly).
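The ‘just plain unrealistic’ checks above lend themselves to simple rule-based screening before any modeling. The sketch below flags a daily activity profile against a handful of plausibility limits; the field names and every threshold are assumptions to be tuned to your own data, and rules like these complement rather than replace dedicated fraud detection.

```python
def fraud_flags(daily, max_spend=1000.0, max_ad_clicks=100,
                max_tv_hours=20.0, max_time_zones=3):
    """Return the names of plausibility rules that a daily profile violates."""
    flags = []
    if daily.get("spend", 0) > max_spend:
        flags.append("spend")
    if daily.get("ad_clicks", 0) > max_ad_clicks:
        flags.append("ad_clicks")
    if daily.get("tv_hours", 0) > max_tv_hours:
        flags.append("tv_hours")
    if daily.get("time_zones", 1) > max_time_zones:
        flags.append("time_zones")
    return flags


# Hypothetical profiles: one ordinary household, one that looks like a bot.
ordinary = {"spend": 62.0, "ad_clicks": 3, "tv_hours": 2.5, "time_zones": 1}
bot_like = {"spend": 4200.0, "ad_clicks": 350, "tv_hours": 24.0, "time_zones": 12}
print(fraud_flags(ordinary))
print(fraud_flags(bot_like))
```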
I could go on, but then again, you can always hire me :-)
Net, healthy AI solution performance is only as good as the data it eats. Your data food pyramid should be clean, insight-rich, and free from bias, inconsistency, and other bad stuff. To quote industry data guru and friend Scott Taylor, ‘good decisions made on bad data are just bad decisions you don’t know about…yet’. This is a timeless statement, but it is more important than ever because AI has the potential to proliferate, accelerate, obscure, and automate bad data decisions like never before.
Reach out to me at dan@harmonymartech.com if you want my help making your data more AI-ready so you can enable more confident and effective marketing campaigns, shopper loyalty, audience insights, media buys, creative content, data monetization, etc.