Sales forecasting for the sales department

[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]

Predictive models can be very useful in departments like Sales: they forecast future sales based on historical data.

The goal is to predict future daily sales from the characteristics provided in the dataset, which contains information on 1,115 company stores.

The company has supplied us with 3 .csv files:

  • train.csv – Historical data including Sales.
  • test.csv – Historical data excluding Sales.
  • store.csv – Supplemental information about the stores.
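A minimal sketch of how these files can be loaded (file names as supplied; train.csv and store.csv are loaded again step by step below, and test.csv is not used further in this walkthrough):

import pandas as pd

# Load the three files supplied by the company.
train_df = pd.read_csv("train.csv")   # historical data including Sales
test_df  = pd.read_csv("test.csv")    # historical data excluding Sales
store_df = pd.read_csv("store.csv")   # supplemental store information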

These are some of the features:

  • Id: An Id that represents a (Store, Date) pair within the test set.

  • Store: A unique Id for each store.

  • Sales: The turnover for any given day (this is what you are predicting).

  • Customers: The number of customers on a given day.

  • Open: An indicator for whether the store was open: 0 = closed, 1 = open.

  • StateHoliday: Indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.

  • SchoolHoliday: Indicates if the (Store, Date) was affected by the closure of public schools.

  • StoreType: Differentiates between 4 different store models: a, b, c, d.

  • Assortment: Describes an assortment level: a = basic, b = extra, c = extended.

  • CompetitionDistance: Distance in meters to the nearest competitor store.

  • CompetitionOpenSince[Month/Year]: Gives the approximate year and month of the time the nearest competitor was opened.

  • Promo: Indicates whether a store is running a promo on that day.

  • Promo2: Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.

  • Promo2Since[Year/Week]: Describes the year and calendar week when the store started participating in Promo2.

  • PromoInterval: Describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.

Source: https://www.kaggle.com/c/cs3244-rossmann-store-sales/data

1 – Import libraries and dataset

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime

Let’s load train.csv:

df = pd.read_csv("train.csv")
df.shape
(1017209, 9)
  • We have just over 1 million observations.
  • 1115 stores
  • ‘Sales’ is our target variable
df.head()
       Store  DayOfWeek        Date  Sales  Customers  Open  Promo StateHoliday  SchoolHoliday
    0      1          5  2015-07-31   5263        555     1      1            0              1
    1      2          5  2015-07-31   6064        625     1      1            0              1
    2      3          5  2015-07-31   8314        821     1      1            0              1
    3      4          5  2015-07-31  13995       1498     1      1            0              1
    4      5          5  2015-07-31   4822        559     1      1            0              1
df.tail()

[table id=24 /]

df.info()
<class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1017209 entries, 0 to 1017208
    Data columns (total 9 columns):
     #   Column         Non-Null Count    Dtype 
    ---  ------         --------------    ----- 
     0   Store          1017209 non-null  int64 
     1   DayOfWeek      1017209 non-null  int64 
     2   Date           1017209 non-null  object
     3   Sales          1017209 non-null  int64 
     4   Customers      1017209 non-null  int64 
     5   Open           1017209 non-null  int64 
     6   Promo          1017209 non-null  int64 
     7   StateHoliday   1017209 non-null  object
     8   SchoolHoliday  1017209 non-null  int64 
    dtypes: int64(7), object(2)
    memory usage: 69.8+ MB
  • 9 columns.
  • 8 features (each with 1017209 points).
  • 1 target variable (Sales).
df.describe()

[table id=25 /]

  • Average daily sales = 5,773 euros.
  • Minimum daily sales = 0; maximum daily sales = 41,551.
  • Average customers per day = 633; minimum = 0; maximum = 7,388.

Let’s load store.csv:

store_df = pd.read_csv("store.csv")
store_df.head()

[table id=26 /]

store_df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1115 entries, 0 to 1114
    Data columns (total 10 columns):
     #   Column                     Non-Null Count  Dtype  
    ---  ------                     --------------  -----  
     0   Store                      1115 non-null   int64  
     1   StoreType                  1115 non-null   object 
     2   Assortment                 1115 non-null   object 
     3   CompetitionDistance        1112 non-null   float64
     4   CompetitionOpenSinceMonth  761 non-null    float64
     5   CompetitionOpenSinceYear   761 non-null    float64
     6   Promo2                     1115 non-null   int64  
     7   Promo2SinceWeek            571 non-null    float64
     8   Promo2SinceYear            571 non-null    float64
     9   PromoInterval              571 non-null    object 
    dtypes: float64(5), int64(2), object(3)
    memory usage: 87.2+ KB
store_df.describe()

[table id=27 /]

  • The average CompetitionDistance is about 5.4 km.

2 – Missing data and data visualization

sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
df.hist(bins = 30, figsize=(20,20), color = 'r')
  • Average customers per day: around 600; from the histogram most days stay below roughly 4,500 (the exact maximum is checked below).
  • Observations are spread evenly across the days of the week (~150,000 observations × 7 days ≈ 1.05M, matching the dataset size).
  • Stores are open about 80% of the time.
  • The data are similarly distributed across stores.
  • Average sales are about 5,000-6,000 €.
  • School holidays affect about 18% of the observations.
df["Customers"].max()
7388
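We can also check the share of open days directly (a sketch using the df already loaded):

# Fraction of observations where the store was open (roughly 0.83, i.e. ~80%).
print(df['Open'].mean())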

– Open and Closed stores:

closed_train_df = df[df['Open'] == 0]
open_train_df   = df[df['Open'] == 1]
print("Total = {} ".format(len(df)))
print("Stores Open = {}".format(len(open_train_df)))
print("Stores Closed = {}".format(len(closed_train_df)))
print("Percentage of stores closed = {}%".format(100.0*len(closed_train_df)/len(df)))
    Total = 1017209 
    Stores Open = 844392
    Stores Closed = 172817
    Percentage of stores closed = 16.98933060954042%
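Before dropping the closed days we can confirm that they add nothing to the target (a sketch; closed days are expected to record zero sales):

# If closed days carry no turnover, this should print 0.
print(closed_train_df['Sales'].max())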

– We keep only the open stores and drop the rest:

df = df[df['Open'] == 1]
df

[table id=28 /]

– We remove the ‘Open’ column:

df.drop(['Open'], axis = 1, inplace = True)
df

[table id=29 /]

df.describe()

[table id=30 /]

  • Average sales = 6,955 €.
  • Average customers = 762 (both figures rise once closed days are removed).

Missing values

sns.heatmap(store_df.isnull(), yticklabels=False, cbar=False, cmap = "Blues")
store_df[store_df['CompetitionDistance'].isnull()]

[table id=31 /]

store_df[store_df['CompetitionOpenSinceMonth'].isnull()]

[table id=32 /]

  • Missing data from 3 observations in ‘CompetitionDistance’.
  • Missing data from 354 observations in ‘CompetitionOpenSinceMonth’ (almost a third of 1115 stores).
store_df[store_df['Promo2'] == 0]

[table id=33 /]

  • It seems that if ‘Promo2’ is 0, then ‘Promo2SinceWeek’, ‘Promo2SinceYear’ and ‘PromoInterval’ are missing (NaN). This is logical: if a store runs no Promo2 promotion, there is no promotion date to record (see the quick check below).
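We can verify this with a quick check (a sketch): for stores with Promo2 == 0, those three columns should be entirely missing rather than zero.

no_promo2 = store_df[store_df['Promo2'] == 0]
print(len(no_promo2))  # expected: 1115 - 571 = 544, matching the non-null counts in store_df.info()
print(no_promo2[['Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']].isnull().all())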

– We have 354 missing values in ‘CompetitionOpenSinceMonth’ and ‘CompetitionOpenSinceYear’; we set them (and the missing Promo2 fields) to zero:

str_cols = ['Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'CompetitionOpenSinceYear', 'CompetitionOpenSinceMonth']

for col in str_cols:
    store_df[col].fillna(0, inplace = True)
sns.heatmap(store_df.isnull(), yticklabels=False, cbar=False, cmap = "Blues")

– We fill the missing values of ‘CompetitionDistance’ with the column mean:

store_df['CompetitionDistance'].fillna(store_df['CompetitionDistance'].mean(), inplace=True)
sns.heatmap(store_df.isnull(), yticklabels=False, cbar=False, cmap = "Blues")
store_df.hist(bins = 30, figsize=(20,20), color = 'r')
  • About half of the stores participate in Promo2.
  • About half of the stores have their nearest competitor within 0-3,000 m.
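A few quick checks on these readings (a sketch using the cleaned store_df):

print(store_df.isnull().sum().sum())                     # expected: 0 after the imputation above
print(store_df['Promo2'].mean())                         # share of stores in Promo2 (571/1115 ≈ 0.51)
print((store_df['CompetitionDistance'] <= 3000).mean())  # share of stores with a competitor within 3,000 m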

Let’s explore the combined dataset (train.csv + store.csv)

– We merge both datasets with an inner join:

inner_df = pd.merge(df, store_df, how = 'inner', on = 'Store') # 'on' sets the key column to merge on
inner_df.to_csv('test.csv', index = False)
inner_df

[table id=34 /]
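Since store.csv contains all 1,115 stores, the inner join should keep every open-store row. A quick check (a sketch):

# The merge on 'Store' should not drop any rows.
print(len(df), len(inner_df))       # both expected to be 844392
print(inner_df['Store'].nunique())  # expected: 1115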

– Correlation:

correlations = inner_df.corr()['Sales'].sort_values()
correlations
    DayOfWeek                   -0.178736
    Promo2SinceYear             -0.127621
    Promo2                      -0.127596
    Promo2SinceWeek             -0.058476
    CompetitionDistance         -0.036343
    CompetitionOpenSinceMonth   -0.018370
    CompetitionOpenSinceYear     0.005266
    Store                        0.007710
    SchoolHoliday                0.038617
    Promo                        0.368145
    Customers                    0.823597
    Sales                        1.000000
    Name: Sales, dtype: float64

Positive values indicate a positive correlation with sales, negative values an inverse one, and values near zero tell us little.

  • Customers and promotion are positively correlated with sales.
  • ‘Promo2’ doesn’t seem very effective.
  • Sales decrease as the week goes on (‘DayOfWeek’ has a correlation of -0.18 with ‘Sales’).
correlations = inner_df.corr()
f, ax = plt.subplots(figsize = (20,20))
sns.heatmap(correlations, annot=True)
  • Customers and Promo show the strongest correlation with Sales.

Let’s split the date into year, month and day, each in its own column:

inner_df['Year'] = pd.DatetimeIndex(inner_df['Date']).year
inner_df['Month'] = pd.DatetimeIndex(inner_df['Date']).month
inner_df['Day'] = pd.DatetimeIndex(inner_df['Date']).day
inner_df

[table id=35 /]
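An equivalent idiom (a sketch, assuming inner_df as above) converts ‘Date’ once with pd.to_datetime and uses the .dt accessor:

# Convert once, then extract the parts through the .dt accessor.
dates = pd.to_datetime(inner_df['Date'])
inner_df['Year']  = dates.dt.year
inner_df['Month'] = dates.dt.month
inner_df['Day']   = dates.dt.day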

Average sales vs. number of customers per month:

axis = inner_df.groupby('Month')[['Sales']].mean().plot(figsize = (10, 5), marker = 'o', color = 'r')
axis.set_title("Average sales per month")

plt.figure()
axis = inner_df.groupby('Month')[['Customers']].mean().plot(figsize = (10, 5), marker = '^', color = 'b')
axis.set_title("Average customers per month")
  • It seems that sales and customers reach their lowest point around Christmas.

Average sales vs. customers per day of the month:

axis = inner_df.groupby('Day')[['Sales']].mean().plot(figsize = (10, 5), marker = 'o', color = 'r')
axis.set_title("Average sales per day of the month")

axis = inner_df.groupby('Day')[['Customers']].mean().plot(figsize = (10, 5), marker = '^', color = 'b')
axis.set_title("Average customers per day of the month")
  • The minimum number of customers occurs around the 24th of the month.
  • Customers and sales peak between the 30th and the 1st of the month.

Average sales vs. customers per day of the week (7 = Sunday)

axis = inner_df.groupby('DayOfWeek')[['Sales']].mean().plot(figsize = (10, 5), marker = 'o', color = 'r')
axis.set_title("Average sales per day of the week")

axis = inner_df.groupby('DayOfWeek')[['Customers']].mean().plot(figsize = (10, 5), marker = '^', color = 'b')
axis.set_title("Average customers per day of the week")
  • The day with the highest sales and customer counts is Sunday.
  • Looking at sales, there is not much difference between Sunday and Monday.

Average sales per StoreType over time

fig, ax = plt.subplots(figsize = (20, 10))
inner_df.groupby(['Date', 'StoreType']).mean()['Sales'].unstack().plot(ax = ax)
plt.figure(figsize=[15,10])

plt.subplot(211)
sns.barplot(x = 'Promo', y = 'Sales', data = inner_df)

plt.subplot(212)
sns.barplot(x = 'Promo', y = 'Customers', data = inner_df)
  • When there is a promotion, sales and customers go up.
plt.figure(figsize=[15,10])

plt.subplot(211)
sns.violinplot(x = 'Promo', y = 'Sales', data = inner_df)

plt.subplot(212)
sns.violinplot(x = 'Promo', y = 'Customers', data = inner_df)

3 – Forecasting with ‘Facebook Prophet’

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.

Source: https://facebook.github.io/prophet/

Training Model A

!pip install fbprophet
from fbprophet import Prophet
def sales_predictions(Store_ID, sales_df, periods): # periods: number of future days to forecast
    '''
    This function renames the columns to 'ds' and 'y', because Prophet's API requires these column names.
    '''
    sales_df = sales_df[sales_df['Store'] == Store_ID]
    sales_df = sales_df[['Date', 'Sales']].rename(columns = {'Date': 'ds', 'Sales': 'y'})
    sales_df = sales_df.sort_values('ds')

    model    = Prophet()
    model.fit(sales_df)
    
    #prediction
    future   = model.make_future_dataframe(periods = periods)
    forecast = model.predict(future) 
    
    figure   = model.plot(forecast, xlabel = "Date", ylabel = "Sales")
    figure2  = model.plot_components(forecast)  
sales_predictions(10, inner_df, 60)
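The function above only plots the forecast. If we also want the numbers, a lightly modified variant (a sketch; sales_predictions_values is a hypothetical name) can return Prophet's standard output columns yhat, yhat_lower and yhat_upper:

def sales_predictions_values(Store_ID, sales_df, periods):
    # Same preparation as above, but return the forecast instead of only plotting it.
    sales_df = sales_df[sales_df['Store'] == Store_ID]
    sales_df = sales_df[['Date', 'Sales']].rename(columns = {'Date': 'ds', 'Sales': 'y'})
    sales_df = sales_df.sort_values('ds')

    model    = Prophet()
    model.fit(sales_df)
    future   = model.make_future_dataframe(periods = periods)
    forecast = model.predict(future)
    return forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]

# Hypothetical usage: last rows of a 60-day forecast for store 10.
print(sales_predictions_values(10, inner_df, 60).tail())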

Training Model B

  • StateHoliday: Tells us whether the day was a state holiday (a = public holiday, b = Easter holiday, c = Christmas, 0 = none).
  • SchoolHoliday: Tells us whether the (Store, Date) was affected by the closure of public schools.

– We manually add the holidays to the algorithm:

def sales_predictions(Store_ID, sales_df, holidays, periods):
    sales_df = sales_df[sales_df['Store'] == Store_ID]
    sales_df = sales_df[['Date', 'Sales']].rename(columns = {'Date': 'ds', 'Sales': 'y'})
    sales_df = sales_df.sort_values('ds')

    model    = Prophet(holidays=holidays)
    model.fit(sales_df)
    future   = model.make_future_dataframe(periods = periods)
    forecast = model.predict(future)
    figure   = model.plot(forecast, xlabel = "Date", ylabel = "Sales")
    figure2  = model.plot_components(forecast)

– We get all the dates related to school holidays:

school_holidays = inner_df[inner_df['SchoolHoliday'] == 1].loc[:, 'Date'].values
school_holidays.shape
(163457,)
school_holidays = np.unique(school_holidays)
school_holidays.shape
(477,)

– We get all the dates corresponding to State holidays:

state_holidays = inner_df[(inner_df['StateHoliday'] == 'a') | (inner_df['StateHoliday'] == 'b') | (inner_df['StateHoliday'] == 'c')].loc[:, 'Date'].values
state_holidays.shape
(910,)
state_holidays = np.unique(state_holidays)
state_holidays.shape
(35,)
  • 35 distinct state-holiday dates.
school_holidays = pd.DataFrame({'ds': pd.to_datetime(school_holidays), 
                                'holiday': 'school_holiday'})
school_holidays

[table id=36 /]

state_holidays = pd.DataFrame({'ds': pd.to_datetime(state_holidays), 
                                'holiday': 'state_holiday'})
state_holidays

[table id=37 /]

– We concatenate the school holidays and the state holidays:

school_state_holidays = pd.concat((state_holidays, school_holidays), axis = 0)
school_state_holidays

[table id=38 /]
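Prophet's holidays dataframe also accepts optional lower_window and upper_window columns to extend a holiday's effect to neighbouring days. A sketch (not used in this case study):

# Hypothetical: let each state holiday also affect the day before and the day after.
state_holidays_windowed = state_holidays.assign(lower_window = -1, upper_window = 1)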

– Let’s make predictions using holidays for a specific store:

sales_predictions(6, inner_df, school_state_holidays, 90)

Looking at the sales forecast, we can observe a downward trend according to Prophet’s prediction.