Identifying Substitutable Goods using Large-scale Shopping Cart Basket Data

Introduction

A fundamental challenge in analyzing consumer behavior data is identifying substitutable goods (substitutes). Large-scale shopping cart data provides an opportunity to understand consumer purchase behavior at a granular level and to uncover insights such as which items are substitutes. This kind of data is complex and captures many interrelated forces that drive a consumer’s purchase decisions. For example, on any given shopping trip a consumer weighs the purpose of the trip, the season of the year, promotions seen in store, and their own personal preferences, among many other factors. To capture this complexity, we implement a sequential probabilistic model of shopping baskets known as “SHOPPER,” developed by Ruiz, Athey, and Blei [1]. We apply the model to hundreds of thousands of shopping trips comprising millions of purchases to uncover consumer insights and, in particular, substitutes.

The next section briefly describes the large-scale shopping cart data we analyze and the SHOPPER model used to identify substitutes. The section on “Identifying Substitutes and Examples” first explains the construction of substitution scores based on model estimates and then presents several examples of substitute items across different food categories. The conclusion summarizes the insights gained from the model about identifying substitutes.

Brief Description of Data and Model

The large-scale shopping cart data we analyze is a subset of Numerator’s extensive consumer panel data, which covers over 1,000,000 shoppers and 1,000,000,000 shopping trips (baskets) [2]. For this study, we analyze a small slice of Numerator’s data containing over 6,000 items, 100,000 shoppers, 600,000 shopping trips (baskets), and 2,000,000 purchases over the timeframe 2021-2022 for one retailer (anonymized for data privacy). These data track consumers over time, so we observe repeated shopping trips, the date of each purchase, and the other items co-purchased along with their prices. The size and granularity of these data allow us to uncover insights at the shopper level, including the identification of substitutable goods for any specific item.

We employ a random utility model to estimate a consumer’s mean utility for each item during a shopping trip. The main assumption is that, on each shopping trip, a consumer seeks to maximize the utility of each item in their basket of goods, accounting for the consumer’s idiosyncratic preference for each item. The mean utility is specified as:

$\psi_{t,c} = \lambda_{c} + \theta_{u_t}^{T}\alpha_{c} - \gamma_{u_t}^{T}\beta_{c}\log\tau_{t,c} + \delta_{w_t}^{T}\mu_{c} \quad (1)$

$c = \text{item},\ t = \text{trip},\ w = \text{week},\ u = \text{customer}$

Here $\psi_{t,c}$ represents the mean utility for item $c$ during shopping trip $t$, made by consumer $u_{t}$ in week $w_{t}$, and $\tau_{t,c}$ is the price of item $c$ on trip $t$. We are interested in estimating item popularities $\lambda_{c}$, latent consumer preferences $\theta_{u}$, latent item variables $\alpha_{c}$, price sensitivity factors for each consumer $\gamma_{u}$ and each item $\beta_{c}$, and seasonal effects, modeled as the interaction of a week vector $\delta_{w}$ with each item’s individual response to the week $\mu_{c}$.
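To make equation (1) concrete, here is a minimal sketch that evaluates the mean utility for a single item on a single trip. It reads $\tau_{t,c}$ as the item's price, as in SHOPPER [1]. The function name, the two-dimensional vectors, and every numeric value are made up for illustration; they are not estimates from the model.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_utility(lam_c, theta_u, alpha_c, gamma_u, beta_c, price, delta_w, mu_c):
    """Equation (1): popularity + preference - price effect + seasonal effect."""
    return (
        lam_c                                   # item popularity
        + dot(theta_u, alpha_c)                 # consumer preference x item attributes
        - dot(gamma_u, beta_c) * math.log(price)  # price sensitivity x log price
        + dot(delta_w, mu_c)                    # week effect x item's seasonal response
    )


# Hypothetical parameter values, chosen only to exercise the formula.
params = dict(
    lam_c=0.5,
    theta_u=[1.0, 0.0], alpha_c=[0.2, 0.7],
    gamma_u=[0.5, 0.5], beta_c=[1.0, 1.0],
    delta_w=[0.1, 0.0], mu_c=[1.0, 2.0],
)

psi_at_unit_price = mean_utility(price=1.0, **params)   # log(1) = 0, price term vanishes
psi_at_double_price = mean_utility(price=2.0, **params)
print(psi_at_unit_price, psi_at_double_price)
```

With a positive price sensitivity term ($\gamma_{u}^{T}\beta_{c} > 0$), raising the price lowers the mean utility, which is the behavior the minus sign in equation (1) encodes.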

Next, for each shopping trip we assume a consumer selects items sequentially and at each step $i$ chooses over items not included in their current basket. The sequence ends when the consumer chooses the last item called the “checkout item.” The mean utility for item $c$ then depends on the mean utility $\psi_{t,c}$ from equation (1) above and interactions with other items already included in the basket as described below:

$\Psi(c,y_{t,i-1})=\psi_{t,c} +\rho_{c}^{T}(\frac{1}{i-1} \sum_{j=1}^{i-1}\alpha_{y_{t,j}})\ (2)$

$c = \text{item},\ t = \text{trip}$

$y_{t,i-1} = (y_{t,1}, y_{t,2}, \ldots, y_{t,i-1})$: items in the basket before step $i$

$y_{t} = (y_{t,1}, y_{t,2}, \ldots, y_{t,n_t})$: items in the basket, where $y_{t,n_t}$ is the checkout item

Here $\rho_{c}$ represents the interaction coefficients for item $c$, and $\alpha_{y}$ are the same latent item variables that appear in equation (1). If $\rho_{c}^{T}\alpha_{y}$ is positive, then item $c$ is more likely to be a complement of item $y$; if $\rho_{c}^{T}\alpha_{y}$ is negative, then item $c$ is more likely to be a substitute for item $y$. From equation (2), we are interested in estimating $\rho_{c}$.
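The following sketch shows how equation (2) adjusts an item's utility given what is already in the basket, and how those utilities could be turned into choice probabilities. The softmax choice rule is an assumption taken from the SHOPPER paper [1]; this article does not state it explicitly. The item names, vectors, and the handling of an empty basket are likewise hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def basket_utility(psi_c, rho_c, basket_alphas):
    """Equation (2): psi plus rho_c dotted with the average alpha of basket items."""
    if not basket_alphas:
        # Assumption: with an empty basket there is no interaction term.
        return psi_c
    n = len(basket_alphas)
    avg_alpha = [sum(a[k] for a in basket_alphas) / n for k in range(len(rho_c))]
    return psi_c + dot(rho_c, avg_alpha)


def choice_probabilities(utilities):
    """Softmax over candidate items (assumed choice rule, see lead-in)."""
    peak = max(utilities.values())              # subtract the max for numerical stability
    exps = {c: math.exp(v - peak) for c, v in utilities.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical basket holding one bag of chips with latent vector alpha.
basket = [[1.0, 0.0]]

utilities = {
    # rho . alpha > 0: behaves like a complement of the chips already in the basket
    "salsa": basket_utility(0.0, [0.8, 0.0], basket),
    # rho . alpha < 0: behaves like a substitute for the chips already in the basket
    "other_chips": basket_utility(0.0, [-0.8, 0.0], basket),
}
probs = choice_probabilities(utilities)
print(probs)
```

In the full model the candidate set at each step would contain every item not yet in the basket plus the checkout item; two candidates are used here only to keep the example readable.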

Once these parameters are estimated, we can use them to identify goods that are substitutes for a specific item (among other insights such as complements, seasonal effects, and price sensitivities). Substitutes interact similarly with the other items in a basket but are not typically purchased together (for example, two different brands of chocolate bars). For each item, a ranked list of other items and their substitution scores can be constructed to provide insights into (1) how substitutable an item is and (2) the next best item to recommend if that specific item is unavailable.
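As a small illustration of the ranked list, the sketch below sorts substitution scores and picks the next best available item. The scores are the Cinnamon Toast Crunch values reported in Table 1 of this article; the function names and the in-stock set are hypothetical.

```python
def rank_substitutes(scores, top_n=5):
    """Return (item, score) pairs sorted from most to least substitutable."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def next_best(scores, in_stock):
    """Return the highest-scoring substitute that is in stock, or None."""
    for item, score in rank_substitutes(scores, top_n=len(scores)):
        if item in in_stock:
            return item, score
    return None


# Substitution scores for Cinnamon Toast Crunch, copied from Table 1.
cereal_scores = {
    "Trix": 0.914,
    "Frosted Flakes": 0.932,
    "Apple Jacks": 0.896,
    "Lucky Charms": 0.915,
    "Fruit Loops": 0.896,
}

top_three = rank_substitutes(cereal_scores, top_n=3)

# Hypothetical shelf state: Frosted Flakes is out of stock.
fallback = next_best(cereal_scores, in_stock={"Lucky Charms", "Trix", "Apple Jacks"})
print(top_three, fallback)
```

How flat or steep the ranked scores are gives a rough sense of point (1), how substitutable the item is, while `next_best` corresponds to point (2).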

Identifying Substitutes and Examples

The substitution scores previously mentioned range from 1 (most substitutable) to 0 (not substitutable). To uncover which items are more substitutable than others, the model estimates the conditional probabilities of purchasing a specific item, say “k”, given another item “c” is already included in the basket. These conditional probabilities are constructed for every item in the dataset.

For example, consider a set of three items (c, c’, and k) where we want to estimate how substitutable the pair of items c and c’ are for each other. Item c would be perfectly substitutable with item c’, with a substitution score of 1, if the conditional probability of choosing item k given that item c is in the basket equals the probability of choosing item k given that item c’ is in the basket. The more similar these probabilities are, the higher the substitution score and the more substitutable the pair of items c and c’ are. The results provide an unrestricted view of what an item’s substitutes could be, since the model produces a substitution score for every other item, including the same product in a different package size. Table 1 below shows a few examples of queried items and their top five most substitutable items according to their substitution scores. We display results for three items from different categories to illustrate the variation in substitutes.
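The three-item construction above can be sketched as a comparison of two conditional distributions over items k. This article does not spell out the similarity function, so the sketch uses 1 minus the total variation distance purely as a stand-in: it equals 1 when the two distributions match and falls to 0 when they share no probability mass. The item names and probabilities are hypothetical.

```python
def substitution_score(p_given_c, p_given_c_prime):
    """Stand-in score: 1 - total variation distance between P(k | c) and P(k | c').

    Assumption: the article's actual similarity function may differ.
    """
    items = set(p_given_c) | set(p_given_c_prime)
    total_variation = 0.5 * sum(
        abs(p_given_c.get(k, 0.0) - p_given_c_prime.get(k, 0.0)) for k in items
    )
    return 1.0 - total_variation


# Hypothetical P(k | item already in basket) over three other items k.
p_given_doritos = {"salsa": 0.5, "soda": 0.3, "milk": 0.2}
p_given_ruffles = {"salsa": 0.4, "soda": 0.4, "milk": 0.2}
p_given_strawberries = {"salsa": 0.1, "soda": 0.1, "milk": 0.8}

chips_vs_chips = substitution_score(p_given_doritos, p_given_ruffles)
chips_vs_fruit = substitution_score(p_given_doritos, p_given_strawberries)
print(chips_vs_chips, chips_vs_fruit)
```

In this toy example the two chip items lead to similar downstream purchase probabilities and therefore score higher with each other than chips do with strawberries, which mirrors the pattern the article describes.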

Table 1: Substitution scores for three queried items across various categories (cereal, chips, fruit), listing the top five substitutable items according to substitution score

Cinnamon Toast Crunch		Doritos Nacho Cheese		Fresh Strawberries
Item Substitutes:	Scores:	Item Substitutes:	Scores:	Item Substitutes:	Scores:
Frosted Flakes	0.932	Ruffles Original	0.944	Premium Strawberries	0.908
Lucky Charms	0.915	Doritos Spicy Nacho	0.935	Red Seedless Grapes	0.839
Trix	0.914	Fritos Chili Cheese	0.934	Red Seedless Table Grapes	0.817
Apple Jacks	0.896	Cheetos Crunchy	0.928	Watermelon Chunks	0.810
Fruit Loops	0.896	Fritos Original	0.927	Raspberries	0.802

Qualitatively, these results align with our expectations. First, for each of the three queried items, the top five identified substitutes belong to the same category: Cinnamon Toast Crunch yielded other cereals, Doritos Nacho Cheese yielded other chips, and Fresh Strawberries yielded other fruit. Second, within each category we observe intuitively related substitutes. Several cheese-flavored chips were suggested as substitutes for Doritos Nacho Cheese; Premium Strawberries had the highest substitution score for Fresh Strawberries; and many other high-sugar cereals were suggested as substitutes for Cinnamon Toast Crunch (instead of healthier cereals such as Raisin Bran). Third, for Fresh Strawberries, the gap in substitution scores between Premium Strawberries and Red Seedless Grapes (0.908 vs. 0.839, a difference of 0.069) is much larger than the gap between Red Seedless Grapes and Red Seedless Table Grapes (0.839 vs. 0.817, a difference of 0.022). This is not surprising, since Premium Strawberries are the same fruit as the queried item while the other suggestions are different fruits. It is reassuring that many of the substitutes obtained from the model make intuitive sense.

Conclusion

This article discusses a method for identifying substitutes through analyzing large-scale shopping cart data using a sequential probabilistic model (SHOPPER), which captures the many interrelated forces influencing a consumer’s purchase decisions. These results can help uncover insights to improve assortment recommendations or suggest what to display if a particular item is out of stock. For a specific product, we can identify whether the next best product is the same product in a different package size (for example, standard vs. jumbo), slightly different flavor (for example, ranch vs. blue cheese), or another competitor’s product. In some cases, we might find that a particular product does not have as many substitutable products given the unique nature of the product.

Furthermore, we are actively extending the SHOPPER model to include retailer and banner effects and consumer demographic data. These variables will add further nuance to the model and let us examine questions such as whether substitute items vary across retailers and whether particular groups of consumers have different sets of substitutes for a specific item. If you’re interested in learning more about our substitutes solutions, how substitutes are constructed, and what insights we can uncover, please feel free to reach out at info@tickr.com. We hope you found this article helpful and look forward to hearing from you!

Citations

[1] Ruiz, F. J. R., Athey, S., & Blei, D. M. (2020). SHOPPER: A probabilistic model of consumer choice with substitutes and complements. The Annals of Applied Statistics, 14(1). https://doi.org/10.1214/19-aoas1265

[2] Numerator. (2024). Numerator OmniPanel Data. https://www.numerator.com/omnipanels/