From this he saw that the demographic data was not distributed uniformly. Indeed:
After that, he noticed some inconsistencies. His conclusions were the following:
The products table originally contained 44 product departments, and Duck noticed that some of them were redundant. He cleaned the departments and labels, and ended up with the following categories, shown with the number of products in each category:
Note that most of the products fell under cookies and snacks, meat and households.
Duck found out that there were 2,595,732 transactions over the year. When he looked deeper into what products were sold in the transactions, he ended up with the following graph:
Because he wanted to look at how much households spent on groceries, he decided to keep only the transactions that belonged to the following departments: Groceries, Drug and General Merchandise, Produce Meat, Deli and Packaged Meat.
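A minimal sketch of this filtering step, using plain Python and toy data. The field names (`household`, `product_id`, `sales`), the product-to-department mapping and the short list of kept labels are assumptions made for illustration only; the real tables and cleaned labels may look different.

```python
def keep_departments(transactions, product_department, kept):
    """Return only the transactions whose product belongs to a kept department."""
    return [
        t for t in transactions
        if product_department.get(t["product_id"]) in kept
    ]


# Hypothetical lookup: product_id -> cleaned department label.
product_department = {1: "Groceries", 2: "Deli", 3: "Fuel"}

# Hypothetical transactions, one row per product sold.
transactions = [
    {"household": "A", "product_id": 1, "sales": 3.50},
    {"household": "A", "product_id": 3, "sales": 40.00},
    {"household": "B", "product_id": 2, "sales": 7.25},
]

# Illustrative subset of the grocery-related departments to keep.
kept = {"Groceries", "Deli"}

grocery_transactions = keep_departments(transactions, product_department, kept)
```

With this toy data, the fuel purchase is dropped and the two grocery-related rows remain.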
After that, he wanted to go deeper and look at the purchase behaviour of the customers. First, he looked at the distribution of the number of purchases per year.
At this point, he suspected that about half of the households did not use this retailer as their principal retailer, because they shopped there less than once a week. To explore this further, he thought that looking at their spending would give a good indication.
Indeed, when he plotted the average yearly and weekly spending against income, he saw that quite a lot of households spent less than $50 a week. That seemed too low for a full weekly grocery budget, since the weekly amount reported by "Business Insider" for the US is around $70. In addition, according to the "Bureau of Labor Statistics", each household should have spent at least around $2,500 per year. Households that spent less than that presumably did not participate fully in the study and bought some of their groceries elsewhere. He thus hypothesized that a "loyal" shopper should have spent at least $2,500 per year. But when he tried to filter the data to keep only the loyal shoppers, he ended up with only 286 households, which was far too few for a relevant analysis. His conclusion was: "let's keep everybody for now".
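The loyalty check described above can be sketched as follows. This is only an illustration under stated assumptions: the field names and toy amounts are hypothetical, and the weekly average assumes the totals cover exactly one year (52 weeks).

```python
from collections import defaultdict

LOYALTY_THRESHOLD = 2500.0  # dollars per year, the cut-off used in the text
WEEKS_PER_YEAR = 52         # assumption: totals cover one full year


def yearly_spend(transactions):
    """Sum the sales of each household over all of its transactions."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t["household"]] += t["sales"]
    return dict(totals)


def loyal_households(totals, threshold=LOYALTY_THRESHOLD):
    """Return the households whose yearly spend reaches the threshold."""
    return {h for h, total in totals.items() if total >= threshold}


# Hypothetical, heavily aggregated toy data.
transactions = [
    {"household": "A", "sales": 1800.0},
    {"household": "A", "sales": 900.0},
    {"household": "B", "sales": 1200.0},
]

totals = yearly_spend(transactions)
weekly = {h: total / WEEKS_PER_YEAR for h, total in totals.items()}
loyal = loyal_households(totals)
```

Here household A (2,700 dollars over the year) counts as loyal, while household B (1,200 dollars, or about 23 dollars a week) does not. Applied to the real data, the same kind of filter is what left only 286 households.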
To complete his understanding of the data, he looked at the distribution of participation in the study:
He observed that most households participated for only 20 to 80 weeks. Only 3 households participated for more than 100 weeks. But for the same reason as before, he decided to keep all households anyway, because after filtering he would not have had much left to analyze. And that's the story of his pre-processing.
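For readers who want to reproduce the participation count, here is one possible way to compute it, assuming each transaction carries a week number and that a household "participates" in a week if it has at least one transaction in it. Both the `week` field and this definition are assumptions; the original analysis may have measured participation differently.

```python
from collections import defaultdict


def weeks_active(transactions):
    """Count, per household, the number of distinct weeks with a purchase."""
    weeks = defaultdict(set)
    for t in transactions:
        weeks[t["household"]].add(t["week"])
    return {household: len(seen) for household, seen in weeks.items()}


# Hypothetical toy data: household A shops twice in week 1 and once in week 5.
transactions = [
    {"household": "A", "week": 1},
    {"household": "A", "week": 1},
    {"household": "A", "week": 5},
    {"household": "B", "week": 2},
]

participation = weeks_active(transactions)
```

Using a set per household means that several purchases in the same week are counted only once.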