Marketing Campaign

Review: Miftahur Rahman, Ph.D.

July 24, 2022

In this review article I analyze a marketing campaign dataset in Python using a machine learning algorithm called the random forest. A random forest model is made up of multiple decision trees; each tree seeks the best splits with which to subset the data, and the trees are typically trained through the Classification and Regression Tree (CART) algorithm. Metrics such as Gini impurity, information gain, or mean squared error (MSE) can be used to evaluate the quality of a split. These topics are discussed further later in the article. To make the discussion accessible to general readers, the article also includes a section on marketing campaign strategies.

Marketing Campaign Strategies:

Ref: https://flypchart.co/marketing-campaign-ideas/

1. Go multi-channel

An effective way to strengthen a marketing campaign is to go multi-channel: spreading the campaign across several channels builds your brand awareness significantly and improves overall performance.

Before launching new products, keep the following channels in mind and try to invest time and money in as many of them as are relevant to your business:

  • Blogging
  • Podcasts or webinars
  • Social media
  • Social events (Facebook or YouTube Live, Reddit AMAs)
  • Email newsletters
  • Advertising (on and offline, if relevant)
  • In-person events

Even if you don’t have the budget to invest in everything, make a concerted effort to cast a wide net with your marketing channels. This gives you more opportunities to reach a new audience and to remind your existing audience of your brand on several platforms.

2. Diversify your content

If you really want to go multichannel with your marketing, you’ll need to diversify your content to make it happen. Different content types work better for different platforms. For example:

  • You’ll need to create videos for Facebook Live or YouTube
  • Audio for your podcasts
  • Infographics to share on social media and your blog etc.

You can also diversify the kind of blog content you create. Instead of sticking to your regular evergreen posts, create ultimate guides, case studies, reports, how-to walkthroughs, and other content that is ultra-useful (and ultra-shareable).

Just make sure the kind of content you create aligns well with your marketing campaign goal. Different content types perform better for different points in the sales funnel (e.g. viral videos in the awareness phase and case studies near the closing phase):

3. Collaborate

Building your audience from scratch is a challenge. But there’s no reason why you can’t get a leg up by collaborating with other brands, happy customers, or influencers. Happy customers are ready and willing to offer testimonials or online reviews. Or you can find social media or blogging influencers to review your product to reach a new audience as well.

Brand collaboration can be something as simple as pairing up with another industry player to host a webinar or author a new report. Or you could get really creative with your collaboration, like big brands Uber and Pandora did.

When a driver downloads the Uber Partner app, Uber riders can choose which Pandora music they want to hear on their trip and play it through the car radio. Once the passenger leaves, the driver’s own music resumes playing.

The partnership solved an important pain point for Uber passengers: Not digging their driver’s music. The campaign also turned a lot of Uber drivers and passengers into Pandora users and generated buzz in the tech world for its ingenuity.

4. Create a real deadline

What’s the goal of your marketing campaign? Driving sales? Finding new leads? Clearing out your stock for a new shipment?

Whatever action you want your audience to take as a result of your marketing needs to have a deadline. If you’re running a sale, make it clear in your advertising exactly how long it lasts (e.g. until Sunday at 10 PM).

Say you’re running a social media giveaway to generate new leads. Keep reminding your audience when the deadline is to enter and win.

A real deadline (even if it’s arbitrary) creates urgency and pushes your audience to act now. Even if they’re interested in what you have to offer, they might not act unless you tell them the offer won’t be around forever.

Stick to your deadline even if you don’t want to. You can always down sell to the people who missed out later on or repeat your marketing campaign in the future.

5. Build suspense

Build suspense as part of your marketing campaign strategy to capture the attention of your target audience, industry players, and (hopefully) news outlets.

A great example of effective suspense marketing is Pinterest’s original strategy. When the site first launched, it was actually invitation-only. You had to register and wait several weeks to get an invitation to beta test the site.

That’s pretty novel for a social networking site. People started talking about the secrecy online.

When the site finally did open up to the public, sites and news outlets like Huffington Post, Gizmodo, BBC News, TechCrunch and CNET gave them tons of coverage.

Would that have happened if they hadn’t launched the site as an exclusive membership? No.

Exclusivity isn’t the only way to build suspense. Discussing your product well before launch and using sneak peeks in your marketing can create the pre-launch suspense you need to build extra buzz. You can create a countdown to your product launch and share it on social media to keep your audience thinking about your brand.

6. Use custom research and data

Custom research is a powerful tool if you want to stand out from the competition in your industry. And offering industry advice is way more valuable if you have original data to back up your point. Conduct original research within your business and publish these statistics to help build thought leadership.

Most commonly, this will take the form of a case study or report. According to a recent CMI/MarketingProfs study, B2B content marketers say case studies are effective in their campaigns 70% of the time.

One of the main reasons original research is so powerful in marketing is that other industry leaders can use your data to back up points in their own content as well.

Getting data is easier than you may think. You can simply use your marketing analytics, or survey your audience to get insights.

7. Give something away

Giveaways can help you achieve all sorts of marketing goals. Here are a few different examples:

  • Offering a free trial of your software to generate leads (and potentially sales).
  • Creating a social media competition (e.g. a photo contest) to generate buzz about your brand and gain new followers.
  • Running a sweepstakes to collect contact information from potential leads.

Just make sure your giveaway appeals to your target audience, but not too broadly. For example, say your business develops health and dieting apps, and you decide to run a sweepstakes to win an iPad for lead generation. A lot of the email sign-ups you get won’t necessarily be interested in health apps – they just wanted to win an iPad. Premium workout gear would be a better-targeted prize to help generate relevant leads.

If you’re promoting your giveaway on social media (which you should), create a branded hashtag to help generate buzz:

8. Track key metrics

Ultimately, you’re never going to know which marketing campaign tactics work best with your target audience unless you monitor and track results. Create an analytics dashboard so you can track key conversions and goal outcomes.

If your goal is social media followers/engagement, most of these key metrics will be available right on the platform (e.g. Twitter Analytics).

If you want to track on-site behavior, you can use Google Analytics. It has a Goal Tracking feature that helps you keep track of “events” you define, such as new subscriptions, contact form completions, click through on a page, content engagement, etc.

It also tracks the source of your traffic, so you can see which platforms bring the most conversions to your site.

9. Repeat and improve

By tracking key metrics during your marketing campaign, you’ll be able to identify which types of content, which platforms, and which marketing messages resonate most with your audience and drive your marketing goals.

Learn from this information to improve future marketing campaigns. Build a cadence into your marketing strategy so that you regularly repeat and improve your campaigns every quarter, every year, or every time you launch a new product.

Small business marketing is an art as much as it is a science. You have to always be on your toes to stand out from your competitors and grow your business in a crowded market.

But if you put to work even a few of these marketing campaign ideas, you’ll start growing the buzz you need to really broaden brand reach, accelerate sales, or achieve any other marketing campaign goal.

In [62]:
import os
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Image
from sklearn import metrics
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz

%matplotlib inline
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)  # show every column of this wide dataset

# List the input files available in the Kaggle environment.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# /kaggle/input/arketing-campaign/marketing_campaign.csv
# /kaggle/input/arketing-campaign/marketing_campaign.xlsx

About this file

A response model can provide a significant boost to the efficiency of a marketing campaign by increasing responses or reducing expenses. The objective is to predict who will respond to an offer for a product or service.

Content

AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise
Complain - 1 if customer complained in the last 2 years
DtCustomer - date of customer’s enrolment with the company
Education - customer’s level of education
Marital - customer’s marital status
Kidhome - number of small children in customer’s household
Teenhome - number of teenagers in customer’s household
Income - customer’s yearly household income
MntFishProducts - amount spent on fish products in the last 2 years
MntMeatProducts - amount spent on meat products in the last 2 years
MntFruits - amount spent on fruit products in the last 2 years
MntSweetProducts - amount spent on sweet products in the last 2 years
MntWines - amount spent on wine products in the last 2 years
MntGoldProds - amount spent on gold products in the last 2 years
NumDealsPurchases - number of purchases made with a discount
NumCatalogPurchases - number of purchases made using the catalogue
NumStorePurchases - number of purchases made directly in stores
NumWebPurchases - number of purchases made through the company’s web site
NumWebVisitsMonth - number of visits to the company’s web site in the last month
Recency - number of days since the last purchase

Acknowledgements

O. Parr-Rud. Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner. SAS Institute, 2014.

Statistical Analysis System (SAS):

SAS (Statistical Analysis System) is statistical software used mainly for data management, analytics, and business intelligence. It is written in the C language and runs on most operating systems. SAS can be used through a programming language or through a graphical interface. It was developed by Anthony James Barr and can read data from spreadsheets and databases; output can be produced as tables, graphs, and documents. SAS is used to report, retrieve, and analyze statistical data, and it can also run SQL queries.

What is SAS?

The Statistical Analysis System exists to work with data obtained from numerous sources. Data from these various sources are gathered together and statistical analysis is performed on them to retrieve the expected outcome. As noted above, this analysis can be done through the SAS software itself or through the SAS programming language.

How does SAS make working so easy?

SAS makes it easy for an organization to work with raw, loosely organized data and transform it into useful results that help the business in several ways. Here is how statistical analysis is used in various industries:

IT Management

In the world of information technology, data analysis is used heavily to design solutions based on the outcomes of data processing. IT solution delivery would not be possible without a close view of changing trends in the data.

In CRM

For any business, customer relationship management plays a crucial role, as it is a key driver of business development. Businesses handling a large number of customers need to understand how their customers want to engage with them, and this understanding can be achieved through analysis.

Business Intelligence

In business intelligence, analysis turns raw, seemingly random data into valuable information. It is all about the analysis of data, which SAS supports through its applications and platforms.

In Finance management

In managing financial data, the representatives involved mostly work with visual analysis, since they are usually non-technical. SAS lets us work either with a graphical interface or with a programming language, which makes things easy for anyone regardless of technical background.

Inspiration

The main objective is to train a predictive model which allows the company to maximize the profit of the next marketing campaign.

In [63]:
df = pd.read_csv('./Marketing-Campaign/marketing_campaign.csv', delimiter=';')
df.head()
Out[63]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 2014-01-19 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 3 11 0
In [64]:
df.shape
Out[64]:
(2240, 29)
In [65]:
df.isnull().sum()
Out[65]:
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64
In [66]:
# Drop constant columns that carry no information.
df.drop(['Z_CostContact', 'Z_Revenue'], axis=1, inplace=True)

# Account age in months, counted from enrolment to December 2014.
df.Dt_Customer = pd.to_datetime(df['Dt_Customer'])
df['Dt_Customer_Month'] = df['Dt_Customer'].dt.to_period("M")
df['acc_age'] = (pd.to_datetime("2014-12").year - df['Dt_Customer_Month'].dt.year)*12 + (pd.to_datetime("2014-12").month - df['Dt_Customer_Month'].dt.month)
df.drop(['Dt_Customer_Month', 'Dt_Customer'], axis=1, inplace=True)

# Customer age as of 2014.
df['Age'] = 2014 - df["Year_Birth"]
df.drop(['Year_Birth'], axis=1, inplace=True)

# Impute the 24 missing Income values with the mean income.
df['Income'] = df['Income'].fillna(df.Income.mean(), axis=0)
df.head()
Out[66]:
ID Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response acc_age Age
0 5524 Graduation Single 58138.0 0 0 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1 27 57
1 2174 Graduation Single 46344.0 1 1 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 0 9 60
2 4141 Graduation Together 71613.0 0 0 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 0 16 49
3 6182 Graduation Together 26646.0 1 0 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 0 10 30
4 5324 PhD Married 58293.0 1 0 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 0 11 33
In [67]:
df.describe()
Out[67]:
ID Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response acc_age Age
count 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000
mean 5592.159821 52247.251354 0.444196 0.506250 49.109375 303.935714 26.302232 166.950000 37.525446 27.062946 44.021875 2.325000 4.084821 2.662054 5.790179 5.316518 0.072768 0.074554 0.072768 0.064286 0.013393 0.009375 0.149107 17.195089 45.194196
std 3246.662198 25037.797168 0.538398 0.544538 28.962453 336.597393 39.773434 225.715373 54.628979 41.280498 52.167439 1.932238 2.778714 2.923101 3.250958 2.426645 0.259813 0.262728 0.259813 0.245316 0.114976 0.096391 0.356274 6.639904 11.984069
min 0.000000 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 18.000000
25% 2828.250000 35538.750000 0.000000 0.000000 24.000000 23.750000 1.000000 16.000000 3.000000 1.000000 9.000000 1.000000 2.000000 0.000000 3.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 12.000000 37.000000
50% 5458.500000 51741.500000 0.000000 0.000000 49.000000 173.500000 8.000000 67.000000 12.000000 8.000000 24.000000 2.000000 4.000000 2.000000 5.000000 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 17.000000 44.000000
75% 8427.750000 68289.750000 1.000000 1.000000 74.000000 504.250000 33.000000 232.000000 50.000000 33.000000 56.000000 3.000000 6.000000 4.000000 8.000000 7.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 23.000000 55.000000
max 11191.000000 666666.000000 2.000000 2.000000 99.000000 1493.000000 199.000000 1725.000000 259.000000 263.000000 362.000000 15.000000 27.000000 28.000000 13.000000 20.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 29.000000 121.000000
In [68]:
one_hot = OneHotEncoder(handle_unknown='ignore')
one_hot_edu_df = pd.DataFrame(one_hot.fit_transform(df[['Education']]).toarray())
df = df.join(one_hot_edu_df)
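
The encoded education columns join onto the frame with generic integer names (0-4), which is how they appear in the tables below. As an optional aside, here is a minimal sketch of how the columns could be given readable names instead; it assumes scikit-learn >= 1.0 (which provides get_feature_names_out) and is not used in the rest of the notebook:

# Optional sketch, not used below: readable names for the one-hot columns.
# Assumes scikit-learn >= 1.0, which provides get_feature_names_out.
named_edu = pd.DataFrame(
    one_hot.transform(df[['Education']]).toarray(),
    columns=one_hot.get_feature_names_out(['Education']),
    index=df.index,
)
named_edu.head()
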
In [69]:
df.Marital_Status.unique()
Out[69]:
array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
       'Absurd', 'YOLO'], dtype=object)
In [70]:
ms_label = {value: key for key, value in enumerate(df.Marital_Status.unique())}
df.Marital_Status = df.Marital_Status.map(ms_label)
edu_label = {value: key for key, value in enumerate(df.Education.unique())}
df.Education = df.Education.map(edu_label)
In [71]:
df["total_Mnt"] = df["MntWines"] + df["MntFruits"] + df["MntMeatProducts"]+ df['MntFishProducts'] + df["MntSweetProducts"] + df["MntGoldProds"]
df['MntWines_pct'] = df['MntWines'] / df['total_Mnt']
df['MntFruits_pct'] = df["MntFruits"] / df['total_Mnt']
df["MntMeatProducts_pct"] = df["MntMeatProducts"] / df['total_Mnt']
df["MntFishProducts_pct"] = df["MntFishProducts"] / df['total_Mnt']
df["MntSweetProducts_pct"] = df["MntSweetProducts"] / df['total_Mnt']
df["MntGoldProds_pct"] = df["MntGoldProds"] / df['total_Mnt']
In [72]:
df.head()
Out[72]:
ID Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response acc_age Age 0 1 2 3 4 total_Mnt MntWines_pct MntFruits_pct MntMeatProducts_pct MntFishProducts_pct MntSweetProducts_pct MntGoldProds_pct
0 5524 0 0 58138.0 0 0 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1 27 57 0.0 0.0 1.0 0.0 0.0 1617 0.392703 0.054422 0.337662 0.106370 0.054422 0.054422
1 2174 0 0 46344.0 1 1 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 0 9 60 0.0 0.0 1.0 0.0 0.0 27 0.407407 0.037037 0.222222 0.074074 0.037037 0.222222
2 4141 0 1 71613.0 0 0 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 0 16 49 0.0 0.0 1.0 0.0 0.0 776 0.548969 0.063144 0.163660 0.143041 0.027062 0.054124
3 6182 0 1 26646.0 1 0 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 0 10 30 0.0 0.0 1.0 0.0 0.0 53 0.207547 0.075472 0.377358 0.188679 0.056604 0.094340
4 5324 1 2 58293.0 1 0 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 0 11 33 0.0 0.0 0.0 0.0 1.0 422 0.409953 0.101896 0.279621 0.109005 0.063981 0.035545
In [73]:
X = df.drop(['Response'], axis=1, inplace=False)
y = df['Response']
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.4, random_state = 123)

What is random forest?

Random forest is a commonly-used machine learning algorithm trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.

Decision trees

Since the random forest model is made up of multiple decision trees, it would be helpful to start by describing the decision tree algorithm briefly. Decision trees start with a basic question, such as, “Should I surf?” From there, you can ask a series of questions to determine an answer, such as, “Is it a long period swell?” or “Is the wind blowing offshore?”. These questions make up the decision nodes in the tree, acting as a means to split the data. Each question helps an individual to arrive at a final decision, which would be denoted by the leaf node. Observations that fit the criteria will follow the “Yes” branch and those that don’t will follow the alternate path. Decision trees seek to find the best split to subset the data, and they are typically trained through the Classification and Regression Tree (CART) algorithm. Metrics, such as Gini impurity, information gain, or mean square error (MSE), can be used to evaluate the quality of the split.

[Figure: decision tree diagram for the "should I surf?" example]

This decision tree is an example of a classification problem, where the class labels are "surf" and "don't surf."
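
To make the split-quality metric concrete, here is a minimal, self-contained sketch (toy labels for the surfing example, not part of the campaign analysis) that scores a candidate split by Gini impurity; CART prefers the split with the lowest weighted impurity of the child nodes:

import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy "should I surf?" data: 1 = surf, 0 = don't surf.
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left = np.array([1, 1, 1, 1])    # e.g. observations with a long-period swell
right = np.array([0, 0, 0, 0])   # everything else

n = len(parent)
weighted_children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
print("parent impurity:", gini_impurity(parent))        # 0.5
print("weighted child impurity:", weighted_children)    # 0.0, a perfect split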

While decision trees are common supervised learning algorithms, they can be prone to problems, such as bias and overfitting. However, when multiple decision trees form an ensemble in the random forest algorithm, they predict more accurate results, particularly when the individual trees are uncorrelated with each other.

Ensemble methods

Ensemble learning methods are made up of a set of classifiers—e.g. decision trees—and their predictions are aggregated to identify the most popular result. The most well-known ensemble methods are bagging, also known as bootstrap aggregation, and boosting. In 1996, Leo Breiman introduced the bagging method; in this method, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. After several data samples are generated, these models are then trained independently, and depending on the type of task—i.e. regression or classification—the average or majority of those predictions yield a more accurate estimate. This approach is commonly used to reduce variance within a noisy dataset.
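
As a brief illustration of bootstrap aggregation (a sketch only, reusing the train/test split created above; the analysis below uses RandomForestClassifier instead), scikit-learn's BaggingClassifier fits each base tree on a bootstrap sample and combines their votes:

from sklearn.ensemble import BaggingClassifier

# Each of the 30 base estimators (decision trees by default) is trained on a
# bootstrap sample of X_train drawn with replacement; for classification the
# individual predictions are combined by majority vote.
bag = BaggingClassifier(n_estimators=30, bootstrap=True, random_state=1)
bag.fit(X_train, y_train)
print("Bagging test accuracy:", metrics.accuracy_score(y_test, bag.predict(X_test)))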

Random forest algorithm

The random forest algorithm is an extension of the bagging method as it utilizes both bagging and feature randomness to create an uncorrelated forest of decision trees. Feature randomness, also known as feature bagging or “the random subspace method”, generates a random subset of features, which ensures low correlation among decision trees. This is a key difference between decision trees and random forests. While decision trees consider all the possible feature splits, random forests only select a subset of those features.

If we go back to the “should I surf?” example, the questions that I may ask to determine the prediction may not be as comprehensive as someone else’s set of questions. By accounting for all the potential variability in the data, we can reduce the risk of overfitting, bias, and overall variance, resulting in more precise predictions.
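
A small sketch of that difference (illustration only, again reusing the split created earlier): with max_features=None each tree considers every feature at every split, which amounts to plain bagging of trees, while max_features='sqrt' samples a random subset of features per split, i.e. the random subspace method:

# Bagged trees: all features are candidates at each split.
rf_all_features = RandomForestClassifier(n_estimators=30, max_features=None, random_state=1)
# Random forest proper: a random subset (sqrt of the feature count) per split.
rf_subspace = RandomForestClassifier(n_estimators=30, max_features='sqrt', random_state=1)

for name, model in [('all features per split', rf_all_features),
                    ('sqrt(p) features per split', rf_subspace)]:
    model.fit(X_train, y_train)
    print(name, '-> test accuracy:', metrics.accuracy_score(y_test, model.predict(X_test)))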

How it works

Random forest algorithms have three main hyperparameters, which need to be set before training. These include node size, the number of trees, and the number of features sampled. From there, the random forest classifier can be used to solve for regression or classification problems.

The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample. Roughly one-third of the training sample is left out of each bootstrap sample and serves as test data, known as the out-of-bag (oob) sample, which we’ll come back to later. Another instance of randomness is then injected through feature bagging, adding more diversity to the dataset and reducing the correlation among decision trees. Depending on the type of problem, the determination of the prediction will vary. For a regression task, the individual decision trees will be averaged, and for a classification task, a majority vote—i.e. the most frequent categorical variable—will yield the predicted class. Finally, the oob sample is then used for cross-validation, finalizing that prediction.

[Figure: random forest diagram]
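
The three hyperparameters above map directly onto scikit-learn arguments; the following is a minimal sketch with illustrative settings (not the tuned model used later) that also requests the out-of-bag estimate described in the previous paragraph:

# Illustrative settings only; the tuned model appears later in the notebook.
rf_oob = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features='sqrt',   # number of features sampled at each split
    min_samples_leaf=5,    # node size: minimum samples required in a leaf
    oob_score=True,        # score each row using only the trees that never saw it
    random_state=1,
)
rf_oob.fit(X_train, y_train)
print("Out-of-bag accuracy estimate:", rf_oob.oob_score_)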

Benefits and challenges of random forest

There are a number of key advantages and challenges that the random forest algorithm presents when used for classification or regression problems. Some of them include:

Key Benefits

  • Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit all the samples within the training data. However, when there’s a robust number of decision trees in a random forest, the classifier won’t overfit the model, since averaging uncorrelated trees lowers the overall variance and prediction error.
  • Provides flexibility: Since random forest can handle both regression and classification tasks with a high degree of accuracy, it is a popular method among data scientists. Feature bagging also makes the random forest classifier an effective tool for estimating missing values, as it maintains accuracy when a portion of the data is missing.
  • Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model. There are a few ways to evaluate feature importance. Gini importance and mean decrease in impurity (MDI) are usually used to measure how much the model’s accuracy decreases when a given variable is excluded. Permutation importance, also known as mean decrease accuracy (MDA), is another importance measure: MDA identifies the average decrease in accuracy from randomly permuting the feature values in oob samples (see the sketch after this list).
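
A brief sketch of the two importance measures just described (illustration only: it fits its own forest on the train/test split defined earlier, and it measures permutation importance on the held-out test set rather than on oob samples):

from sklearn.inspection import permutation_importance

rf_imp = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Gini importance / mean decrease in impurity (MDI), built into the fitted forest.
mdi = pd.Series(rf_imp.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(mdi.head(10))

# Permutation importance / mean decrease accuracy (MDA), here computed on the test set.
perm = permutation_importance(rf_imp, X_test, y_test, n_repeats=10, random_state=1)
mda = pd.Series(perm.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(mda.head(10))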

Key Challenges

  • Time-consuming process: Since random forest algorithms can handle large data sets, they can provide more accurate predictions, but they can be slow to process data because they compute predictions for each individual decision tree.
  • Requires more resources: Since random forests process larger data sets, they’ll require more resources to store that data.
  • More complex: The prediction of a single decision tree is easier to interpret than that of a whole forest.

Random forest applications

The random forest algorithm has been applied across a number of industries, allowing them to make better business decisions. Some use cases include:

Finance:

It is a preferred algorithm over others as it reduces time spent on data management and pre-processing tasks. It can be used to evaluate customers with high credit risk, to detect fraud, and to solve option-pricing problems.

Healthcare:

The random forest algorithm has applications within computational biology, allowing doctors to tackle problems such as gene expression classification, biomarker discovery, and sequence annotation. As a result, doctors can make estimates around drug responses to specific medications.

E-commerce:

It can be used for recommendation engines for cross-sell purposes.

In [74]:
rfc = RandomForestClassifier(n_estimators=30,random_state=1)
max_depth_range = range(1,16)
param_grid = dict(max_depth=max_depth_range)
grid = GridSearchCV(rfc,param_grid,cv = 10,scoring = 'accuracy')
grid.fit(X_train, y_train)
Out[74]:
GridSearchCV(cv=10,
             estimator=RandomForestClassifier(n_estimators=30, random_state=1),
             param_grid={'max_depth': range(1, 16)}, scoring='accuracy')
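
The tuned depth can also be read back off the fitted grid object rather than eyeballed from the plot; a short sketch (the cells below plot the scores and then set max_depth = 12 by hand):

# Best hyperparameter and its cross-validated accuracy, from the fitted search.
print("Best max_depth:", grid.best_params_['max_depth'])
print("Best cross-validated accuracy:", grid.best_score_)
# grid.best_estimator_ is the forest refit on all of X_train with that depth.
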
In [76]:
grid_mean_scores = grid.cv_results_["mean_test_score"]
# plot the cross-validated mean accuracy for each candidate max_depth
sns.mpl.pyplot.plot(max_depth_range, grid_mean_scores)
sns.mpl.pyplot.xlabel('max_depth')
sns.mpl.pyplot.ylabel('Cross-Validated Mean Accuracy');
In [77]:
best_rfc = RandomForestClassifier(n_estimators=50,random_state=1,max_depth = 12)
best_rfc.fit(X_train, y_train)
rfc_pred = best_rfc.predict(X_test)
accuracy_train = metrics.accuracy_score(y_train, best_rfc.predict(X_train))
accuracy_test = metrics.accuracy_score(y_test, rfc_pred)
print("Accuracy of Random Forest train is: ", accuracy_train)
print("Accuracy of Random Forest test is: ", accuracy_test)
Accuracy of Random Forest train is:  0.9903273809523809
Accuracy of Random Forest test is:  0.8671875
In [78]:
y_test_pred_rcf = best_rfc.predict(X_test)
print("Classification Report:\n", metrics.classification_report(y_test,y_test_pred_rcf))
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.98      0.93       760
           1       0.66      0.26      0.37       136

    accuracy                           0.87       896
   macro avg       0.77      0.62      0.65       896
weighted avg       0.85      0.87      0.84       896

In [79]:
pred_probs_rcf =  best_rfc.predict_proba(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, pred_probs_rcf[:,1])
sns.mpl.pyplot.plot(fpr, tpr, label="Random Forest")
sns.mpl.pyplot.xlim([0, 1])
sns.mpl.pyplot.ylim([0, 1.05])
sns.mpl.pyplot.legend(loc="lower right")
sns.mpl.pyplot.xlabel('False Positive Rate (1 - Specificity)')
sns.mpl.pyplot.ylabel('True Positive Rate (Sensitivity)')
Out[79]:
Text(0, 0.5, 'True Positive Rate (Sensitivity)')
In [80]:
print("Test AUC: ",metrics.roc_auc_score(y_test, best_rfc.predict(X_test)))
Test AUC:  0.6168343653250774
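
Note that this AUC is computed from hard 0/1 predictions. A more conventional alternative (a sketch; its output is not shown here) is to score the positive-class probabilities already computed for the ROC curve above, which generally gives a more informative AUC:

# AUC from predicted probabilities rather than hard class labels.
print("Test AUC (from probabilities):", metrics.roc_auc_score(y_test, pred_probs_rcf[:, 1]))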

License

This Notebook has been released under the Apache 2.0 open source license.
