Black Friday Sales Analysis and Prediction


Black Friday is an informal name for the Friday following Thanksgiving Day in the United States, which is celebrated on the fourth Thursday of November. Many stores offer highly promoted sales on Black Friday and open very early, for example at midnight, or even start their sales on Thanksgiving itself. The major challenge for a retail store or eCommerce business is to set product prices so that profit is maximized by the end of the sale. The goal of this project is to predict the purchase amount based on historical purchase patterns.

The Data

The dataset is obtained from an online data analytics hackathon hosted by Analytics Vidhya. Features include customer information, categories of products purchased, city demographics and purchase amount. The data has 12 columns, 550068 transaction rows, 5891 unique users, and 3631 unique products. There are 7 numeric variables and 5 object variables.

Technical Aspect

This project builds a multiple linear regression model (several predictors, one continuous response) to predict the amount a customer is expected to spend on Black Friday. It is a supervised machine learning problem because the model is trained on known target values: how much a customer spent on a specific product. The response is therefore a continuous value. The intended outcome is to give retailers an estimate of the expected purchase amount.
Tools: Python (Flask, NumPy, pandas, Matplotlib, Seaborn)

The project is divided into the following sections.
  1. Data Understanding.
    • The dataset has ~550K rows and 12 columns.
    • There are 7 numeric variables and 5 object variables.
    • All features are categorical variables (some stored as numeric codes); the target is a continuous variable.
    • The data includes the transaction history of 5891 unique customers and 3631 unique products.
    • Only the "Product_Category_2" and "Product_Category_3" columns have missing values.
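
The checks in this section can be sketched with pandas as follows. This is a minimal illustration on a tiny synthetic table, not the project's actual code; apart from Product_Category_2 and Product_Category_3, the column names are assumptions about the dataset.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the hackathon data. Column names other than
# Product_Category_2 and Product_Category_3 are assumptions.
df = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "Product_ID": ["P1", "P2", "P1", "P3"],
    "Gender": ["M", "M", "F", "M"],
    "Product_Category_1": [5, 8, 5, 1],
    "Product_Category_2": [np.nan, 6.0, np.nan, 2.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 15.0],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Shape, unique customers and unique products.
n_rows, n_cols = df.shape
n_users = df["User_ID"].nunique()
n_products = df["Product_ID"].nunique()

# Fraction of missing values per column, and the columns that have any.
missing_share = df.isna().mean()
cols_with_missing = missing_share[missing_share > 0].index.tolist()

print(n_rows, n_cols, n_users, n_products)
print(df.dtypes)
print(cols_with_missing)
```

On the real data, the same calls would be run on the DataFrame loaded from the hackathon file.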

  2. Exploratory Data Analysis
    • Male customers outnumber female customers almost three to one and contribute 76% of the total purchase amount. One possible explanation is that men are more likely to go out and buy gifts for their partners when more deals are available.
    • The largest group of customers is aged 26 to 35, and this group also contributes the most to the total purchase amount. Although the 46-50 and 55+ age groups visit less often, they contribute 20% and 18% of the total purchase amount. Based on these results, the store should focus most of its assortment on products aimed at people in their late twenties to early thirties; increasing the number of such products could raise profits.
    • The plot of mean purchase amount by age group shows that all age groups have similar means. Together with their lower visit frequency, this suggests that the 46-50 and 55+ age groups spend more in a single purchase.
    • Customers in occupation category 4 visit most often, while customers in occupation categories 20, 12 and 10 spend more. The mean purchase amounts of the different occupations fall in the range 8000 to 10000.
    • The three city categories are almost equally represented in the store on Black Friday. Perhaps the store lies between the three cities and is easily accessible, with good road connections from each. Customers from category C cities account for more than half of Black Friday sales, even though the most frequent customers come from category B cities. By contrast, relatively few customers come from category A cities, and they spend the least. This is worth noting when making future marketing plans.
    • The most frequent customers are new residents (one year or less in the city), so the store appears to be popular among newcomers. People who have lived in their current city for longer spend slightly more than newcomers. Since they chose to stay with the store, it is worth finding out what kept them loyal, so that better plans can be made to retain customers instead of losing them over time.
    • Single customers outnumber married customers and spend more on Black Friday.
    • Within product category 1, categories 5 and 3 sell the most, contributing 40% of the total purchase amount. The category with the highest purchase amount is category 10. Although categories 5 and 8 sell more, their mean purchase amount is lower.
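
Comparisons like these (visit frequency, share of total purchase amount, mean purchase per group) come down to grouped aggregations. The sketch below shows the pattern on made-up numbers; the column names Gender, Age and Purchase are assumptions, and the figures are illustrative only.

```python
import pandas as pd

# Made-up transactions for illustration only (not the real data).
df = pd.DataFrame({
    "Gender": ["M", "M", "M", "F"],
    "Age": ["26-35", "26-35", "46-50", "26-35"],
    "Purchase": [6000, 4000, 5000, 5000],
})

# Number of transactions, total and mean purchase amount per gender.
by_gender = df.groupby("Gender")["Purchase"].agg(["count", "sum", "mean"])

# Share of the total purchase amount contributed by each gender.
by_gender["share_of_total"] = by_gender["sum"] / df["Purchase"].sum()

# Mean purchase amount per age group.
mean_by_age = df.groupby("Age")["Purchase"].mean()

print(by_gender)
print(mean_by_age)
```

The same `groupby` pattern applies to occupation, city category, years in current city, marital status and product category.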

  3. Data Preprocessing
    • The columns Product_Category_2 and Product_Category_3 have missing values, and the missing percentage in each is above 20%. If these values were filled with a summary statistic, more than 70% of the data would be artificial. It is also unrealistic to expect every product to have a second and third product category. These columns are therefore removed.
    • Encoded the categorical features using label encoding and one-hot encoding.
    • Checked the correlation between features. There were no multicollinear features. No single feature shows a strong direct correlation with the target column, so the purchase amount depends on the combination of all features.
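
A minimal sketch of these preprocessing steps, assuming pandas and a small synthetic table. Which columns received label encoding and which received one-hot encoding is not stated above, so the choice here (Gender as integer codes, City_Category as dummy columns) is an assumption, as are those two column names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; Gender and City_Category are assumed column names.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "City_Category": ["A", "B", "C", "B"],
    "Product_Category_1": [5, 8, 1, 5],
    "Product_Category_2": [np.nan, 6.0, np.nan, 2.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 15.0],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Drop the two sparsely populated category columns.
clean = df.drop(columns=["Product_Category_2", "Product_Category_3"])

# Label-style encoding: map each category to an integer code.
clean["Gender"] = clean["Gender"].map({"F": 0, "M": 1})

# One-hot encoding: one 0/1 indicator column per city category.
clean = pd.get_dummies(clean, columns=["City_Category"], dtype=int)

# Correlation of every feature with the target.
corr_with_target = clean.corr()["Purchase"].drop("Purchase")

print(clean.head())
print(corr_with_target)
```

The full matrix from `clean.corr()` is also what a multicollinearity check between features would inspect.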

  4. Modeling
    • Split the dataset into random train and test subsets in an 80:20 ratio.
    • Implemented multiple supervised models: Linear Regression, Decision Tree Regressor and Random Forest Regressor.
    • Created baseline models, models after standard scaling, models after min-max scaling, models after clustering customer features, models after clustering and PCA, and models after both customer clustering and product clustering.
    • The best model was the random forest model obtained after customer clustering on the clean data.
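
The baseline comparison can be sketched with scikit-learn as below. The data is randomly generated, the hyperparameters are arbitrary, and RMSE is an assumed evaluation metric (none is named above), so this only illustrates the 80:20 split and the three model types rather than reproducing the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic encoded features and a synthetic continuous target.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 4)).astype(float)
y = 1000 * X[:, 0] + 500 * X[:, 1] * X[:, 2] + rng.normal(0, 100, size=500)

# Random 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The three baseline model types, with arbitrary settings.
models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its RMSE on the held-out test set.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

for name, score in rmse.items():
    print(f"{name}: RMSE = {score:.1f}")
```

Scaling, clustering or PCA variants would be added as extra steps before `fit`, for example with a scikit-learn `Pipeline`.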
