Predicting Avocado Prices

We are going to predict avocado prices using Facebook's Prophet forecasting tool.

The dataset represents weekly 2018 retail scan data for national retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the dataset reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLUs) in the dataset are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this data.

Some relevant columns in the dataset:

  • Date: The date of the observation.
  • AveragePrice: The average price of a single avocado.
  • type: Conventional or organic.
  • year: The year.
  • region: The city or region of the observation.
  • Total Volume: Total number of avocados sold.
  • 4046: Total number of avocados with PLU 4046 sold.
  • 4225: Total number of avocados with PLU 4225 sold.
  • 4770: Total number of avocados with PLU 4770 sold.

Data Source: https://www.kaggle.com/neuromusic/avocado-prices

Prophet

Prophet is open source software released by Facebook’s Core Data Science team.

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.
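To make the idea of an additive model more concrete, here is a minimal sketch with NumPy that builds a toy weekly series as the sum of a trend, a yearly seasonal term, and noise. It only illustrates the decomposition and is not how Prophet is implemented; the function name `synthetic_series` and every coefficient in it are hypothetical.

```python
import numpy as np

def synthetic_series(n_weeks=156, seed=0):
    """Return (trend, seasonality, noise, y) for a toy additive weekly series."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_weeks)
    trend = 1.0 + 0.002 * t                          # slow linear growth (made-up slope)
    seasonality = 0.2 * np.sin(2 * np.pi * t / 52)   # one cycle every 52 weeks
    noise = rng.normal(0.0, 0.05, size=n_weeks)      # random fluctuations
    y = trend + seasonality + noise                  # the additive combination
    return trend, seasonality, noise, y

trend, seasonality, noise, y = synthetic_series()
print(y[:5])
```

A forecasting tool working on such a series has to do the reverse: start from `y` alone and estimate the separate components.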

More information about using Prophet with Python is available at this link:

1 – Import libraries and data exploration

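# Data handling and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import seaborn as sns
# Prophet 1.0 and later is published as the `prophet` package (from prophet import Prophet)
from fbprophet import Prophet

# Load the avocado dataset and preview the first rows
df = pd.read_csv('avocado.csv')
df.head()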

[table id=95 /]

df = df.sort_values("Date")
df.info() 
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 18249 entries, 11569 to 8814
    Data columns (total 14 columns):
     #   Column        Non-Null Count  Dtype  
    ---  ------        --------------  -----  
     0   Unnamed: 0    18249 non-null  int64  
     1   Date          18249 non-null  object 
     2   AveragePrice  18249 non-null  float64
     3   Total Volume  18249 non-null  float64
     4   4046          18249 non-null  float64
     5   4225          18249 non-null  float64
     6   4770          18249 non-null  float64
     7   Total Bags    18249 non-null  float64
     8   Small Bags    18249 non-null  float64
     9   Large Bags    18249 non-null  float64
     10  XLarge Bags   18249 non-null  float64
     11  type          18249 non-null  object 
     12  year          18249 non-null  int64  
     13  region        18249 non-null  object 
    dtypes: float64(9), int64(2), object(3)
    memory usage: 2.1+ MB

Missing values

# Count the null values in each column
total = df.isnull().sum().sort_values(ascending=False)
# Percentage of missing values in each column
percent = df.isnull().sum() * 100 / len(df)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'], sort=False).sort_values('Total', ascending=False)
missing_data.head(40)

[table id=96 /]

Price trend during the year

plt.figure(figsize=(10,10))
# Parse the dates so the x-axis is a real time axis rather than text labels
plt.plot(pd.to_datetime(df['Date']), df['AveragePrice'])
  • The plot suggests that avocado prices tend to rise around September.
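One way to check this seasonal pattern more directly than by eye is to average the price by calendar month. The sketch below does this on a small made-up frame that reuses the `Date` and `AveragePrice` column names from the dataset. The helper name `monthly_mean_price` is hypothetical, and the toy values are not real avocado prices.

```python
import pandas as pd

def monthly_mean_price(frame):
    """Mean of AveragePrice for each calendar month that appears in the data."""
    months = pd.to_datetime(frame["Date"]).dt.month
    return frame["AveragePrice"].groupby(months).mean()

# Toy data standing in for the real file (values are invented)
toy = pd.DataFrame({
    "Date": ["2016-08-07", "2016-09-04", "2016-09-11", "2017-09-03"],
    "AveragePrice": [1.20, 1.50, 1.70, 1.60],
})
print(monthly_mean_price(toy))
```

On the real data, `monthly_mean_price(df).idxmax()` would return the month with the highest average price, which can then be compared against the impression from the plot.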

Regions

df['region'].value_counts()
    Jacksonville           338
    Tampa                  338
    BuffaloRochester       338
    Portland               338
    SanDiego               338
    NorthernNewEngland     338
    HarrisburgScranton     338
    SouthCentral           338
    PhoenixTucson          338
    RaleighGreensboro      338
    Indianapolis           338
    Plains                 338
    Orlando                338
    Houston                338
    SouthCarolina          338
    West                   338
    Midsouth               338
    CincinnatiDayton       338
    LasVegas               338
    Boston                 338
    Charlotte              338
    Albany                 338
    Nashville              338
    Southeast              338
    Columbus               338
    Philadelphia           338
    Chicago                338
    Louisville             338
    GrandRapids            338
    Atlanta                338
    BaltimoreWashington    338
    Roanoke                338
    Denver                 338
    NewYork                338
    Pittsburgh             338
    TotalUS                338
    Syracuse               338
    Spokane                338
    HartfordSpringfield    338
    RichmondNorfolk        338
    Boise                  338
    DallasFtWorth          338
    Sacramento             338
    California             338
    SanFrancisco           338
    Detroit                338
    GreatLakes             338
    StLouis                338
    MiamiFtLauderdale      338
    Northeast              338
    NewOrleansMobile       338
    Seattle                338
    LosAngeles             338
    WestTexNewMexico       335
    Name: region, dtype: int64
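The counts above show 338 rows for every region except WestTexNewMexico, which has 335. A possible way to find out which observations are missing is to compare the (date, type) pairs of one region against all pairs present in the data. This is a minimal sketch on invented toy rows; the helper name `missing_observations` is hypothetical, and the lowercase type labels are an assumption about how the values are spelled in the file.

```python
import pandas as pd

def missing_observations(frame, region):
    """Return the (Date, type) pairs present in the data but absent for `region`."""
    all_pairs = set(zip(frame["Date"], frame["type"]))
    subset = frame[frame["region"] == region]
    region_pairs = set(zip(subset["Date"], subset["type"]))
    return sorted(all_pairs - region_pairs)

# Toy data: Boise has no organic rows here (values are invented)
toy = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-04", "2015-01-04",
             "2015-01-11", "2015-01-11", "2015-01-11"],
    "type": ["conventional", "organic", "conventional",
             "conventional", "organic", "conventional"],
    "region": ["Albany", "Albany", "Boise", "Albany", "Albany", "Boise"],
})
print(missing_observations(toy, "Boise"))
```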

Year

plt.figure(figsize=[15,5])
sns.countplot(x = 'year', data = df)
plt.xticks(rotation = 45)
  • There are fewer observations for 2018 because the data only covers the beginning of that year.

2 – Data Preparation

df_prophet = df[['Date', 'AveragePrice']] 
df_prophet.tail()

[table id=97 /]

3 – Predictions with Prophet

df_prophet = df_prophet.rename(columns={'Date':'ds', 'AveragePrice':'y'})
df_prophet.head()

[table id=98 /]
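Prophet expects a dataframe with a `ds` column (dates) and a `y` column (the value to forecast). Because the frame used here contains every region and both avocado types, each date appears many times. As a possible alternative, and not what this post does, the sketch below parses the dates and averages the price per date so that there is one row per week. The helper name `to_prophet_frame` is hypothetical, and averaging across regions is an assumption about what a sensible single series would be.

```python
import pandas as pd

def to_prophet_frame(frame, date_col="Date", value_col="AveragePrice"):
    """Return a frame with columns ds (datetime) and y (mean value per date)."""
    parsed = frame.assign(**{date_col: pd.to_datetime(frame[date_col])})
    per_date = parsed.groupby(date_col, as_index=False)[value_col].mean()
    out = per_date.rename(columns={date_col: "ds", value_col: "y"})
    return out.sort_values("ds").reset_index(drop=True)

# Toy data with a repeated date (values are invented)
toy = pd.DataFrame({
    "Date": ["2015-01-11", "2015-01-11", "2015-01-04"],
    "AveragePrice": [1.0, 2.0, 1.5],
})
print(to_prophet_frame(toy))
```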

m = Prophet()
m.fit(df_prophet)
# Forecast 365 days into the future
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
forecast

[table id=99 /]
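Note that `periods=365` adds 365 daily rows, while the training data is weekly. `make_future_dataframe` also accepts a `freq` argument, so weekly steps are an option. The pandas-only sketch below illustrates the difference between the two choices without calling Prophet. The helper name `future_dates` is hypothetical, and the last date used is only a stand-in value.

```python
import pandas as pd

def future_dates(last_date, periods, freq):
    """Return `periods` dates after `last_date`, spaced by a pandas offset alias."""
    dates = pd.date_range(start=last_date, periods=periods + 1, freq=freq)
    dates = dates[dates > last_date]   # drop the last observed date itself
    return dates[:periods]

last = pd.Timestamp("2018-03-25")      # stand-in for the last date in the data (a Sunday)
daily = future_dates(last, 365, "D")   # one row per day for a year
weekly = future_dates(last, 52, "W")   # one row per week for a year
print(len(daily), len(weekly))
```

With weekly steps the forecast table stays on the same grid as the observations, which can make the forecast easier to compare against the history.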

figure = m.plot(forecast, xlabel='Date', ylabel='Price')
figure2 = m.plot_components(forecast)

4 – Nashville data analysis

df_nashville = df[df['region']=='Nashville']
df_nashville

[table id=100 /]

df_nashville = df_nashville.sort_values("Date")
df_nashville

[table id=101 /]

df_nashville = df_nashville.rename(columns={'Date':'ds', 'AveragePrice':'y'})
m = Prophet()
m.fit(df_nashville)
# Forecast 365 days into the future
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
fig = m.plot(forecast, xlabel='Date', ylabel='Price')
fig2 = m.plot_components(forecast)
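The Nashville subset may still mix both avocado types, since the `type` column distinguishes conventional from organic and the region has 338 rows. A possible refinement, sketched below on invented toy rows, is to keep a single type before fitting. The helper name `region_series` is hypothetical, and the lowercase type labels are an assumption about how the values are spelled in the file.

```python
import pandas as pd

def region_series(frame, region, avocado_type):
    """Rows for one region and one type, as a frame with ds and y columns."""
    mask = (frame["region"] == region) & (frame["type"] == avocado_type)
    out = frame.loc[mask, ["Date", "AveragePrice"]].rename(
        columns={"Date": "ds", "AveragePrice": "y"}
    )
    out["ds"] = pd.to_datetime(out["ds"])   # real dates instead of strings
    return out.sort_values("ds").reset_index(drop=True)

# Toy data (values are invented)
toy = pd.DataFrame({
    "Date": ["2015-01-11", "2015-01-04", "2015-01-04", "2015-01-04"],
    "AveragePrice": [1.10, 1.00, 1.80, 1.30],
    "type": ["conventional", "conventional", "organic", "conventional"],
    "region": ["Nashville", "Nashville", "Nashville", "Albany"],
})
print(region_series(toy, "Nashville", "conventional"))
```

The resulting frame has one row per date and could be passed to `m.fit(...)` in place of `df_nashville`.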