Chicago Crime Prediction

We are going to predict the crime rate in Chicago with Facebook Prophet.

Our dataset contains a summary of the reported crimes that occurred in the City of Chicago from 2001 to 2017. It has the following columns:

  • ID: Unique identifier for the record.
  • Case Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
  • Date: Date when the incident occurred.
  • Block: Address where the incident occurred.
  • IUCR: The Illinois Uniform Crime Reporting code.
  • Primary Type: The primary description of the IUCR code.
  • Description: The secondary description of the IUCR code, a subcategory of the primary description.
  • Location Description: Description of the location where the incident occurred.
  • Arrest: Indicates whether an arrest was made.
  • Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
  • Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car.
  • District: Indicates the police district where the incident occurred.
  • Ward: The ward (City Council district) where the incident occurred.
  • Community Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
  • FBI Code: Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
  • X Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Y Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Year: Year the incident occurred.
  • Updated On: Date and time the record was last updated.
  • Latitude: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
  • Longitude: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
  • Location: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

Data source: https://www.kaggle.com/currie32/crimes-in-chicago

Prophet

Prophet is open source software released by Facebook’s Core Data Science team.

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.
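
The additive model described above is usually written as follows. The notation is taken from the Prophet paper (Taylor and Letham, "Forecasting at Scale"), not from this post:

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Here g(t) is the trend, s(t) the periodic (seasonal) component, h(t) the holiday effects, and \varepsilon_t the error term.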

In this link you have more information about Prophet with Python:

1 – Import libraries and dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import seaborn as sns
# Note: from version 1.0 the package is named prophet (from prophet import Prophet)
from fbprophet import Prophet
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions use on_bad_lines='skip' instead
df_1 = pd.read_csv('Chicago_Crimes_2001_to_2004.csv', error_bad_lines=False)
df_2 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv', error_bad_lines=False)
df_3 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)
df_4 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)
# Concatenating all datasets
df = pd.concat([df_1, df_2, df_3, df_4], ignore_index=False, axis=0)
df.head()
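
To illustrate what ignore_index=False does in the concatenation above, here is a minimal sketch on two tiny stand-in frames. The data and the names part_a, part_b, kept and renumbered are made up for this example and are not part of the Chicago files:

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data, not the Chicago files)
part_a = pd.DataFrame({"ID": [1, 2], "Primary Type": ["THEFT", "BATTERY"]})
part_b = pd.DataFrame({"ID": [3, 4], "Primary Type": ["ROBBERY", "THEFT"]})

# ignore_index=False keeps each frame's own row labels, so labels repeat
kept = pd.concat([part_a, part_b], ignore_index=False, axis=0)

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([part_a, part_b], ignore_index=True, axis=0)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

In this post the repeated labels are harmless, because the index is later replaced by the Date column.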

2 – Missing values

# Count the null values in each column
total = df.isnull().sum()
# Percentage of missing values in each column
percent = df.isnull().mean() * 100
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).sort_values('Total', ascending=False)
missing_data.head(40)
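
The same kind of missing-value summary can be tried on a small frame first. This is only a sketch: the toy values below are invented, and only the column names are borrowed from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with some gaps (hypothetical values)
toy = pd.DataFrame({
    "Latitude": [41.88, np.nan, np.nan, 41.90],
    "Ward": [1.0, 2.0, np.nan, 4.0],
    "Arrest": [True, False, False, True],
})

total = toy.isnull().sum()            # number of nulls per column
percent = toy.isnull().mean() * 100   # share of nulls per column, in percent

summary = pd.DataFrame({"Total": total, "Percent": percent})
summary = summary.sort_values("Total", ascending=False)
print(summary)
```

Latitude has two gaps out of four rows (50%), Ward has one (25%), and Arrest has none.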

– Dropping the columns we will not use:

df.drop(['Unnamed: 0', 'Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate','Updated On','Year', 'FBI Code', 'Beat','Ward','Community Area', 'Location', 'District', 'Latitude' , 'Longitude'], inplace=True, axis=1)
df

– Converting the Date column to datetime:

df.Date = pd.to_datetime(df.Date, format='%m/%d/%Y %I:%M:%S %p')
df.head()
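
As a quick check of the format string used above, the sketch below parses a few made-up timestamps written in the same 12-hour style (the sample strings are assumptions, not rows from the dataset):

```python
import pandas as pd

# Made-up timestamps in month/day/year, 12-hour clock with AM/PM
raw = pd.Series([
    "01/15/2001 02:30:00 PM",
    "12/31/2016 11:59:00 PM",
    "03/01/2005 12:05:00 AM",
])

# %I is the 12-hour clock hour and %p the AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed)
```

Passing an explicit format avoids day/month ambiguity and is usually faster than letting pandas infer the format for every row.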

– Change the index to the date:

df.index = pd.DatetimeIndex(df.Date)
df.head()

Primary Type visualization

df['Primary Type'].value_counts().iloc[:15]
    THEFT                         1640506
    BATTERY                       1442716
    CRIMINAL DAMAGE                923000
    NARCOTICS                      885431
    OTHER OFFENSE                  491922
    ASSAULT                        481661
    BURGLARY                       470958
    MOTOR VEHICLE THEFT            370548
    ROBBERY                        300453
    DECEPTIVE PRACTICE             280931
    CRIMINAL TRESPASS              229366
    PROSTITUTION                    86401
    WEAPONS VIOLATION               77429
    PUBLIC PEACE VIOLATION          58548
    OFFENSE INVOLVING CHILDREN      51441
    Name: Primary Type, dtype: int64
df['Primary Type'].value_counts().iloc[:15].index
    Index(['THEFT', 'BATTERY', 'CRIMINAL DAMAGE', 'NARCOTICS', 'OTHER OFFENSE',
           'ASSAULT', 'BURGLARY', 'MOTOR VEHICLE THEFT', 'ROBBERY',
           'DECEPTIVE PRACTICE', 'CRIMINAL TRESPASS', 'PROSTITUTION',
           'WEAPONS VIOLATION', 'PUBLIC PEACE VIOLATION',
           'OFFENSE INVOLVING CHILDREN'],
          dtype='object')
plt.figure(figsize = (15, 10))
sns.countplot(y= 'Primary Type', data = df, order = df['Primary Type'].value_counts().iloc[:15].index)

Location Description visualization

plt.figure(figsize = (15, 10))
sns.countplot(y= 'Location Description', data = df, order = df['Location Description'].value_counts().iloc[:15].index)

3 – Data resample

Resample is a convenience method for frequency conversion and resampling of time series.

More info here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html
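
Here is a minimal sketch of how resample counts events per period, using five invented timestamps. It uses the 'MS' (month start) alias, which is spelled the same across pandas versions; the 'M' alias used in this post labels each bin by month end and was deprecated in favour of 'ME' in pandas 2.2:

```python
import pandas as pd

# Five made-up incidents: three in January, one in February, one in April
stamps = pd.to_datetime([
    "2001-01-03 10:00", "2001-01-20 22:15", "2001-01-31 23:59",
    "2001-02-14 08:30",
    "2001-04-01 00:00",
])
events = pd.DataFrame(
    {"Primary Type": ["THEFT", "BATTERY", "THEFT", "ASSAULT", "THEFT"]},
    index=stamps,
)

# size() counts the rows that fall into each monthly bin
monthly_counts = events.resample("MS").size()
print(monthly_counts)
```

March has no incidents, so its bin is still present with a count of 0.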

– Per year:

df.resample('Y').size()
    Date
    2001-12-31    568518
    2002-12-31    490879
    2003-12-31    475913
    2004-12-31    388205
    2005-12-31    455811
    2006-12-31    794684
    2007-12-31    621848
    2008-12-31    852053
    2009-12-31    783900
    2010-12-31    700691
    2011-12-31    352066
    2012-12-31    335670
    2013-12-31    306703
    2014-12-31    274527
    2015-12-31    262995
    2016-12-31    265462
    2017-12-31     11357
    Freq: A-DEC, dtype: int64
plt.plot(df.resample('Y').size())
plt.title('Crimes Count Per Year')
plt.xlabel('Years')
plt.ylabel('Number of Crimes')

– Per month:

df.resample('M').size()
    Date
    2001-01-31    74995
    2001-02-28    66288
    2001-03-31    53122
    2001-04-30    40166
    2001-05-31    41876
                  ...  
    2016-09-30    23235
    2016-10-31    23314
    2016-11-30    21140
    2016-12-31    19580
    2017-01-31    11357
    Freq: M, Length: 193, dtype: int64
plt.plot(df.resample('M').size())
plt.title('Crimes Count Per Month')
plt.xlabel('Months')
plt.ylabel('Number of Crimes')

5 – Data Preparation

df_prophet = df.resample('M').size().reset_index()
df_prophet

df_prophet.columns = ['Date', 'Crime Count']
df_prophet.head()

# reset_index() already returned a DataFrame, so this conversion changes nothing
df_prophet = pd.DataFrame(df_prophet)
df_prophet

6 – Predictions with Prophet

df_prophet.columns
    Index(['Date', 'Crime Count'], dtype='object')
df_prophet_final = df_prophet.rename(columns={'Date':'ds', 'Crime Count':'y'})
df_prophet_final.head()
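
Prophet expects a frame with a date column named ds and a value column named y. As an alternative sketch, the resample, reset_index and rename steps can be chained into one expression. The toy data and the name prophet_input are assumptions made for this example:

```python
import pandas as pd

# Three made-up incidents: two in January, one in February
stamps = pd.to_datetime(["2001-01-03", "2001-01-20", "2001-02-14"])
events = pd.DataFrame({"ID": [1, 2, 3]}, index=stamps)

prophet_input = (
    events.resample("MS").size()   # monthly counts, indexed by month start
    .rename_axis("ds")             # name the date index 'ds'
    .reset_index(name="y")         # move it to a column; counts become 'y'
)
print(prophet_input)
```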

m = Prophet()
m.fit(df_prophet_final)
# Forecasting into the future: periods=365 with the default freq='D' adds 365 daily rows (about one year)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)

forecast
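
The series passed to Prophet is monthly, while make_future_dataframe uses a daily frequency unless told otherwise, so the call with periods=365 adds 365 daily dates. The pandas-only sketch below illustrates the difference between daily and month-end future dates. It does not call Prophet, and future_dates is a hypothetical helper written for this illustration:

```python
import pandas as pd

def future_dates(last_date, periods, freq):
    """Hypothetical helper: the `periods` dates that follow `last_date` at step `freq`."""
    dates = pd.date_range(start=last_date, periods=periods + 1, freq=freq)
    return dates[dates > last_date][:periods]

# Last month-end label in the monthly series used in this post
last_observed = pd.Timestamp("2017-01-31")

daily = future_dates(last_observed, periods=365, freq="D")
monthly = future_dates(last_observed, periods=12, freq=pd.offsets.MonthEnd())

print(len(daily), daily[0].date(), daily[-1].date())
print(len(monthly), monthly[0].date(), monthly[-1].date())
```

Both horizons end on 2018-01-31, but the daily one has 365 rows and the monthly one has 12. With Prophet itself, a monthly horizon can be requested through the freq argument of make_future_dataframe, which may be a better match for data that was fitted at monthly resolution.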

figure = m.plot(forecast, xlabel='Date', ylabel='Crime Rate')
figure2 = m.plot_components(forecast)

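The first plot shows Prophet's fit to the monthly crime counts together with its projection for the following year, and the second shows the trend and seasonal components it estimated. Judging by these plots, Prophet could be used to predict crime counts in Chicago for the coming years with some precision, although the accuracy was not measured against held-out data here.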
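
One way to put a number on that precision is to hold out the last year and compare it with a forecast made without it. The sketch below does this on a synthetic monthly series and uses a seasonal naive baseline (repeat the last observed year) as a stand-in forecast. The series, the baseline and all names are assumptions made for this example and are not results from the Chicago data:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (hypothetical numbers): downward trend plus a yearly cycle
dates = pd.date_range("2010-01-01", periods=48, freq="MS")
months = np.arange(48)
y = 1000 - 5 * months + 100 * np.sin(2 * np.pi * months / 12)
series = pd.DataFrame({"ds": dates, "y": y})

# Hold out the last 12 months; a model would be fitted on `train` only
train, test = series.iloc[:-12], series.iloc[-12:]

# Stand-in forecast: repeat the last observed year (seasonal naive baseline).
# With Prophet, the yhat values predicted for test["ds"] would go here instead.
y_true = test["y"].to_numpy()
y_pred = train["y"].iloc[-12:].to_numpy()

mae = np.mean(np.abs(y_true - y_pred))                      # mean absolute error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100    # mean absolute percentage error
print(f"MAE: {mae:.1f}, MAPE: {mape:.1f}%")
```

By construction the baseline misses each held-out month by the 60 units the trend dropped over the year, so the MAE here is 60. The same split and metrics can be applied to a Prophet forecast to see whether it beats such a simple baseline.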