Preparing an upload
Only two database columns (fields) are required to perform a minimal basket analysis:
- A unique identifier for a sales transaction, often called transaction_id
- The name of a product or product category of the item being sold
If you are lucky enough to have access to sales transaction data that includes these two columns in every row, you can upload that file directly (even if there are many other columns).
If you have SQL database access to a sales transaction table which includes these two fields, you can extract 10,000 records of this data with a very simple query, and save it as a .csv file:
SELECT * FROM SALES_TRANSACTIONS LIMIT 10000;
It is very common to store the product (or category) names separately in a different database table, which needs to be “joined” to the sales transaction table in order to have the two required columns (otherwise, the product would most likely be some unreadable product identifier). If this is the case, the query would be more like this (just use your table/column names):
SELECT SALES_TRANSACTIONS.transaction_id, PRODUCTS_TABLE.product_name FROM SALES_TRANSACTIONS JOIN PRODUCTS_TABLE ON SALES_TRANSACTIONS.product_id = PRODUCTS_TABLE.id LIMIT 10000;
This query also limits the output to just the two required columns. Feel free to experiment and become comfortable with how these queries work on this website: DB Fiddle
General data guidelines
The best basket analysis results will come from data with these characteristics:
- 10,000 – 100,000 rows (records) of data or total transactions, probably 20-100Mb
- Correct and human-readable product/category names
- Representative of a broad sampling of dates, stores and products or categories
- “Clean” sales transactions, meaning: no “shop-keeping” non-product sale transactions; no free/unprofitable products; and no refunds, exchanges, other non-sales activities