Tell me what you buy, and I will tell you who you are. A data story.

Applied Data Analysis EPFL, Autumn 2019

12 Dec 2019

Preface.

Following rumours about the troubling ability of shopping centers to infer information about consumers from their shopping patterns, a private investigator, Detective Duck, was hired by a consumer advocacy group to investigate the matter. Here are the details of his investigation.

Living in a day and age when every piece of our data is stored and analysed, the consumer advocacy group wondered what information retailers can gather and infer about consumers. To answer this question, however, the group could only provide Detective Duck with data on two years of shopping by the clients of an unknown shopping center located somewhere in the US. Based on this data, Duck was tasked with identifying possible links, if any exist, between demographic information (e.g. marital status, income, number of children, etc.) and purchase patterns. In other words, he needed to find out how easy it is for retailers to infer a specific customer profile from shopping habits. This was crucial to the consumer advocacy group, as precise customer profiles can lead to targeted marketing aimed at increasing consumers' spending, and to a loss of privacy. Luckily, Detective Duck had followed the autumn 2019 Applied Data Analysis course at EPFL and was determined to apply his new skills diligently to this case.

Chapter 1. A rocky start to a rocky investigation.

A Dunnhumby dataset.

The consumer advocacy group provided Detective Duck with a dataset owned by Dunnhumby, a UK-based data science company. It contained the results of a two-year study of 2500 volunteer households. When looking at the dataset, Duck quickly realised that it was rather big and held a lot of miscellaneous information. Nevertheless, our favourite detective had learned that, when faced with this kind of challenge, he has to start by pre-processing and cleaning the available information. To do so, he kept only the households with coherent and sufficient demographic data. Furthermore, he labelled all 10,000 products with precise grocery categories, as shown below. In this graph, Duck plotted the number of occurrences of each label across all transactions. That is how he noticed that most transactions fell under five major labels: vegetables, meat & seafood, cookies & snacks & candy, dairy and beverages. This was not a major surprise to him, as it looked like a typical diet.

Know thy households.

Before starting to hunt for clues, Detective Duck first had to know what kind of households were present in his data. That is when he got the first shock of his investigation. Indeed, after cleaning, only 750 of the 2500 households studied by Dunnhumby had provided their demographic information. At that moment, Detective Duck began to suspect that the consumer advocacy group had handed him quite a challenge. Undoubtedly, such a small dataset would cause trouble when trying to make predictions. Duck regretted ever leaving his cozy pond.

Still, Duck decided to extract the clients' main consumption patterns and to correlate them with their demographic features. For this, he developed several strategies that are explained below.

Chapter 2. Profiling is not a piece of cake.

Once the data was ready for analysis, Detective Duck was excited to face the real challenge: customer profile hunting.

A hunt for correlations.

Detective Duck looked for major correlation patterns in the whole dataset. After some coding, he produced the following matrix illustrating the strength of the relationships between demographic features, spending and product quantities for the most common labels.

From this, he learned that some demographic features are highly correlated with each other. Specifically, household size is highly correlated with the number of kids and with marital status. This did not surprise Duck, as any good detective could have deduced it. He did note, however, that most households with kids in this study were married couples. Furthermore, mean and yearly spending were also correlated (0.7) with the quantities of the different products. This did not alarm Duck, as spending more in a grocery store usually means buying more products. Unfortunately, he could not find any direct link between demographics and product quantities. Unsatisfied with this result, he decided to pursue the matter further. Duck supposed that the absence of correlation between labels and demographics could mean that the labels were not specific enough. Indeed, all households probably buy roughly the same amounts of vegetables, meat, dairy products, etc. Thus, except for outliers with extreme diets, all households seem to buy from the same grocery categories.
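Duck's own code is not part of this story. A minimal sketch of how such a correlation matrix could be computed is given here, assuming Pearson correlation and purely synthetic households (the feature names `kids`, `household_size`, `yearly_spend` and `dairy_qty` are illustrative, not columns of the Dunnhumby data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # synthetic households, NOT the Dunnhumby data

# Illustrative features: household size is built from kids + adults,
# and dairy quantity is built to grow with yearly spending.
kids = rng.integers(0, 4, size=n)
adults = rng.integers(1, 3, size=n)
household_size = kids + adults
yearly_spend = rng.normal(3000, 500, size=n)
dairy_qty = yearly_spend / 100 + rng.normal(0, 5, size=n)

names = ["kids", "household_size", "yearly_spend", "dairy_qty"]
data = np.column_stack([kids, household_size, yearly_spend, dairy_qty])

# Pearson correlation between every pair of columns.
corr = np.corrcoef(data, rowvar=False)

for name, row in zip(names, corr):
    print(f"{name:>15}", np.round(row, 2))
```

In this toy setup the number of kids and the household size are strongly correlated by construction, while kids and dairy quantity are not, which mirrors the kind of pattern Duck describes without reproducing his numbers.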

A duck in a random forest.

Our favourite duck is not the kind of duck that gives up when discovering weak correlations. Thus, he decided to follow another path and seek help from Machine Learning, more specifically from Random Forests. He built a predictive model that would hopefully find things that he, as a simple duck, could not. First, he fitted a decision tree on the monthly product quantities bought by all households with available demographic information. The results were regrettably bad. So bad that Duck might as well have assigned a random demographic parameter to each household (the ROC AUC score barely exceeded the baseline score). The same occurred when fitting a random forest model. Thus, he concluded that, using only labels, there was not enough data to train a good machine learning model and infer satisfying customer profiles. He was disappointed but not surprised (see section Know thy households.). Duck resigned himself to pursuing another path of investigation. He had not yet said his last word.
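To see what "barely exceeding the baseline" looks like, here is a hedged sketch rather than Duck's actual model. It assumes scikit-learn, a binary demographic target, a chance-level ROC AUC of 0.5 as the baseline, and synthetic quantities that carry no information about the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_households, n_labels = 300, 12  # illustrative sizes, not the real dataset

# Synthetic monthly quantities per label, and a binary demographic target
# drawn independently of them (so there is nothing to learn).
X = rng.poisson(lam=5.0, size=(n_households, n_labels))
y = rng.integers(0, 2, size=n_households)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated ROC AUC; 0.5 corresponds to random guessing.
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
mean_auc = auc_scores.mean()

print(f"mean ROC AUC: {mean_auc:.2f} (chance level is 0.50)")
```

When the features say nothing about the target, the cross-validated score hovers around 0.5. A score close to that value on real data is a sign that the model has found little or no usable signal.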

Smart clustering by a strategic duck.

Duck needed to change his plan of action as he had promised to deliver. Thus, he decided to search for typical grocery carts present in the data. From this, he hoped to be able to relate them to demographic parameters. To look for general carts, he pursued several ideas.

Product clusters:

His first idea was to toss aside labels and focus on specific products. So, he looked to cluster households by the exact weekly product quantities they bought. He was hoping to identify different typical weekly shopping carts (e.g. a weekly shopping cart with bananas and cheddar cheese). To find the ideal number of clusters in the data, he used the elbow method heuristic and found a number between 5 and 9.

He then ran the k-means algorithm with this in mind to create 7 household clusters. To identify what those clusters bought weekly, he decided that for a product to characterize a cluster, it had to be bought by at least a third of the households in the cluster (a very generous threshold). Looking at the results, Duck saw that two clusters contained only a single household and that one contained around 1000 households with no product in common. He dismissed those. Furthermore, the remaining clusters had either bananas or dairy as their characteristic products. From this, Duck started to wonder whether this dataset was a cruel joke that someone had played on him, or whether this shopping center was frequented by what seemed to be a troop of monkeys. Detective Duck concluded that, even though the labels were too vague, clustering on products was probably too precise, as there were several product IDs for the same type of product (e.g. 5 different milk IDs). So, although some households may have bought the same weekly quantity of milk, their carts were not treated as the same because they bought different brands.
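The one-third rule and the milk-ID problem can be illustrated with a small sketch. Everything in it is hypothetical: the function `characteristic_products`, the household names and the product IDs such as `milk_A` are made up for the example.

```python
from collections import Counter


def characteristic_products(baskets, threshold=1 / 3):
    """Return products bought by at least `threshold` of the households."""
    n_households = len(baskets)
    if n_households == 0:
        return []
    counts = Counter()
    for products in baskets.values():
        counts.update(set(products))  # count each household once per product
    return sorted(p for p, c in counts.items() if c / n_households >= threshold)


# Toy cluster: five households buy milk, but each under a different product ID.
cluster = {
    "hh1": ["banana", "milk_A"],
    "hh2": ["banana", "milk_B"],
    "hh3": ["banana", "milk_C"],
    "hh4": ["milk_D"],
    "hh5": ["milk_E"],
    "hh6": ["banana"],
}

by_id = characteristic_products(cluster)

# Merge product IDs into a product type by dropping the brand suffix.
merged = {hh: [p.split("_")[0] for p in products] for hh, products in cluster.items()}
by_type = characteristic_products(merged)

print(by_id)    # ['banana']
print(by_type)  # ['banana', 'milk']
```

At the level of product IDs, milk never reaches the threshold even though five of the six households buy it. Once the IDs are merged into one product type, it does.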

Department clusters:

Barely recovered from his banana and milk experience, Duck moved on to his second idea: looking for clusters in the labels and departments of the products. As a reminder, the departments include "Produce", "Grocery", "Drug and General Merchandise", etc. After running an SVD on the weekly average quantity of products bought per department and per label, the detective found that two dimensions explained 96% of the total variance for the departments and 70% for the labels. So, he plotted the data along these two dimensions and colored it according to different demographic labels to search for clusters.
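How the share of variance explained by two dimensions can be read off an SVD is sketched below. This is an illustration under assumptions: the data is synthetic, the columns are mean-centered before the decomposition (the story does not say whether Duck did this), and the department shares are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_households, n_departments = 200, 6  # illustrative sizes

# Synthetic weekly quantities: every household splits its total volume
# across departments in roughly the same proportions, plus some noise.
volume = rng.gamma(shape=2.0, scale=10.0, size=(n_households, 1))
shares = np.array([[0.40, 0.25, 0.15, 0.10, 0.06, 0.04]])
X = volume * shares + rng.normal(0, 0.5, size=(n_households, n_departments))

# Center each column, then decompose.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance along each direction.
explained = s**2 / np.sum(s**2)

# Coordinates of each household along the first two directions (for plotting).
coords_2d = Xc @ Vt[:2].T

print("variance explained by two dimensions:", round(float(explained[:2].sum()), 3))
```

In this toy data almost all the variance lies along a single direction, the overall quantity bought, which is one way a very high explained-variance figure can arise.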

Though he could not see any cluster visually, he still wanted to be sure that none remained hidden. He decided to apply the elbow heuristic for cluster numbers from 2 to 20. This method suggested 4 clusters for the departments but no clear number for the labels. Full of hope, Duck applied k-means with 4 clusters to the department data and took a closer look at them, hoping to discover something.
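The elbow heuristic over 2 to 20 clusters can be sketched as follows. The assumptions are loud ones: scikit-learn's `KMeans`, synthetic points drawn around four well-separated centers, and an "elbow" defined as the point of largest second difference of the inertia curve, which is only one of several ways to locate it (the story does not say how Duck read his curve).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2-D data with four well-separated groups (not the real households).
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(0, 1, size=(100, 2)) for c in centers])

# Within-cluster sum of squares (inertia) for k = 2..20.
ks = list(range(2, 21))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]

# Assumed elbow rule: largest second difference of the inertia curve.
curvature = [
    inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
    for i in range(1, len(ks) - 1)
]
best_k = ks[1 + int(np.argmax(curvature))]

print("elbow at k =", best_k)
```

On real data the curve is rarely this clean, which may be why the heuristic found no clear number of clusters for the labels.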

Interestingly, Duck noticed that the clusters were driven by the quantity of groceries bought. So, the only thing that seemed to distinguish households was how many groceries they bought weekly at this retailer. To see whether these clusters held any other indicators, Detective Duck decided to look for correlations within the biggest cluster. Note: though there were around 1000 households in this cluster, demographic information was available for only 22 of them. All correlations between demographics and anything else were therefore computed on only 22 datapoints.
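To give a feel for how little 22 datapoints can support, the sketch below computes how large a Pearson correlation must be before it is distinguishable from zero. It assumes a two-sided test at the 5% level, a conventional choice that the story does not mention, and takes the critical t value from a standard table.

```python
import math

n = 22  # households with demographic information in the biggest cluster

# Two-sided 5% critical value of Student's t with n - 2 = 20 degrees of
# freedom, taken from a standard t table (assumed significance level).
t_crit = 2.086

# Invert t = r * sqrt(n - 2) / sqrt(1 - r^2) to get the smallest |r|
# that would count as significant at that level.
r_crit = t_crit / math.sqrt(t_crit**2 + (n - 2))

print(round(r_crit, 2))  # 0.42
```

Under these assumptions, any correlation weaker than roughly 0.42 in absolute value could plausibly be noise with only 22 households.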

To Duck's delight, some stronger correlations appeared within the cluster. Notably, there seemed to be some indication of a link between the weekly quantity of dairy bought and the number of kids or household size. This reminded the detective of the weird dairy and banana clusters that appeared when evaluating products. These clusters could indicate that separating households according to how much they buy weekly at this retailer gives a better idea of their shopping patterns, and thus a better way to relate them to their demographics. Nevertheless, to conclude anything solid, Duck would have needed far more demographic data.

The correlations had to be interpreted with care because of the scarcity of demographic data. Still, not wanting to give up on his newly found clusters, Detective Duck decided to look at the proportions of label consumption in each cluster. He was interested to see whether anything separated them other than the quantity of weekly groceries bought. Disappointingly, the proportions were more or less similar. Duck suspected that this was because, as seen previously, the labels were too general and most households bought the same things. Instead of looking at average proportions, he suspected he should have looked at the proportions of outliers in the clusters. But again, there was not enough data for that, and our beloved detective was short on time.

A last idea before closing the file.

Just before handing in his report, Duck had one last idea: maybe it was not the composition of the basket that differentiated the customers, but their habits, such as the time at which they went shopping. He calculated the average transaction time per household and looked at how strongly it was correlated with the other features.

These calculations were not very conclusive: the correlation coefficients were too low for Duck to conclude anything. The two features with the highest correlations, age group and owner status, were investigated a bit further.

Though the correlation coefficients were really low, he still observed the following:

Older people tended to make their purchases earlier in the day than younger age groups. This seemed quite logical, as older age groups tend to be retired and have more free time during the day.
Probable renters and probable owners tended to make their purchases later than others.
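The average transaction time per household, and its correlation with a demographic feature, can be sketched as below. The transactions, the hour-of-day encoding and the ordinal `age_rank` are all made up for the example; the toy numbers are deliberately clear-cut and do not reproduce the weak correlations Duck actually found.

```python
from collections import defaultdict

import numpy as np

# Toy transactions: (household id, hour of day as a decimal number).
transactions = [
    (1, 9.5), (1, 10.0),
    (2, 18.0), (2, 19.5),
    (3, 14.0), (3, 13.0),
    (4, 11.0), (4, 12.0),
]

# Hypothetical ordinal age group per household (higher means older).
age_rank = {1: 5, 2: 1, 3: 3, 4: 4}

# Group transaction times by household and average them.
times = defaultdict(list)
for household, hour in transactions:
    times[household].append(hour)
mean_time = {household: sum(v) / len(v) for household, v in times.items()}

# Pearson correlation between average shopping time and age group.
households = sorted(mean_time)
r = np.corrcoef(
    [mean_time[h] for h in households],
    [age_rank[h] for h in households],
)[0, 1]

print(mean_time)
print("correlation with age group:", round(float(r), 2))
```

A negative coefficient here means that households in older age groups shop earlier on average, which is the direction of the first observation above.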

Chapter 3. A reassuring report to the Consumer Advocacy Group.

After this investigation full of twists and turns, Duck's report to the C.A.G was the following:

“Given the results, there should be no cause for immediate worry. Though armed with extensive data analysis resources, I did not discover anything big from clients' shopping habits, at least not at the scale of this single retailer. To find anything significant, retailers would need to collect their data cleverly. To infer who their customers are, they would need to harvest a lot more demographic information from a much bigger and more diverse client sample. Indeed, households overall behave quite similarly, no matter their social class. The main danger right now is retailers collecting personal shopping habits and targeting their clients individually. Nevertheless, it seems that it is hard (at least it was for me) to predict who you are from what you buy.”

What Duck could have done better.

The dataset the detective analysed was quite atypical. Since he had no information about the location or the identity of the shopping center, he could not claim that it represented an average American grocery store. Moreover, the number of households that gave their demographic features was very limited. As the accuracy of machine learning models relies heavily on dataset size, Duck suspects that his results were probably biased by too small a dataset. Furthermore, he chose the product labels arbitrarily, based on the common classes of food he personally knew. This could have been done with higher precision, using an existing labelling scheme or a machine learning classifier.

An open investigation.

To be completely reassured, Duck advised the C.A.G to run a meta-analysis across several shopping centers, in order to confirm or refute his conclusions. For each center, the set of households sharing their personal data should preferably be bigger, and the product classification should be comparable across the different grocery stores.

Good job, Detective Duck. But our favourite detective does not even hear us. The smart little guy is already on his way to a new investigation.