Overview
After a few more meetings, Beatriz has assigned your team to address the following issues raised by the stakeholders:
Miguel Ferreira, Bank President asks:
Like I said the other day, the core task we're interested in is identifying those customers most likely to subscribe to a term deposit.
This way, we can build a targeted marketing campaign that focuses primarily on those customers.
Francisco, VP of Marketing says:
And I'd like you to find any actionable patterns in our results. Should we only call single people on Saturdays? Does it make sense to call students at all?
Things like that.
Miguel Ferreira, Bank President adds:
One other thing we should probably address: does contacting people too frequently for these marketing campaigns have an adverse effect on the outcome?
Beatriz, Senior Data Scientist says:
One last thing, there are a bunch of social and economic indicators in the data.
We should be careful about how we consider these. We may want to see separate models for times when, for example, the consumer confidence index is high compared to when it is low.
Miguel Ferreira, Bank President adds:
Good thought Beatriz. Different customer segments tend to react to economic changes differently.
We'll definitely want to know if it's better to use a particular model during different economic situations.
Beatriz, Senior Data Scientist says:
One more thing: since we're planning on deploying these models, we'll want to make sure that once you have the models trained and tested, you persist those trained models to a file so we can load them into our production systems.
Francisco, VP of Marketing says:
I have an additional request. We just received the results of our last marketing campaign and want to test your model against the list to see what impact it would have had against our bottom line. Could you run the list through your model and make some predictions for us?
You can access the data here: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bank_holdout_test.csv
Then, please submit a csv file that has a single column, with the header "predictions" and a prediction (one per row) for each individual in this file. If we should contact the individual, predict a 1. If we shouldn't contact the individual, predict a 0 for that row. There should be 4,119 predictions in the csv file when completed.
Here is an example of what the csv should look like when finished (this example assumes a model that predicted not to contact anyone; hopefully your csv has more 1s in it): https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bank_csv_answers_sample.csv
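The submission Francisco describes can be sketched roughly as follows. This is a minimal sketch: the tiny training frame and the two feature columns are placeholders standing in for the real bank data, and in practice your preprocessing of the holdout file must match whatever transformations you applied when training.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Placeholder training data standing in for the real bank dataset
train = pd.DataFrame({
    "age": [25, 40, 60, 35, 50, 28],
    "campaign": [1, 3, 2, 5, 1, 2],
    "y": [1, 0, 1, 0, 1, 0],
})
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(train[["age", "campaign"]], train["y"])

# In practice, load the real holdout file:
# holdout = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bank_holdout_test.csv")
holdout = pd.DataFrame({"age": [30, 55], "campaign": [2, 4]})

# One prediction per individual: 1 = contact, 0 = don't contact
predictions = model.predict(holdout[["age", "campaign"]])

# A single column with the header "predictions", no index column
pd.DataFrame({"predictions": predictions}).to_csv("team_predictions.csv", index=False)
```

With the real holdout file, the resulting csv should contain 4,119 rows plus the header.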
Miguel Ferreira, Bank President adds:
Sounds like we have enough to get started. If you could send us your write up on this, that would be great.
Team Project Expectations
Be sure to read over the Team Project Expectations guide to know what the expectations are for this and future projects.
Tips from Johnny
Johnny, the data science intern, whispers to you after the meeting:
Hey, I put together a list of tips and ideas that might help us out:
Data Dictionary
Our database analyst put together this data dictionary to help explain the values and sources of different columns in the bank dataset, so be sure to review that.
Target Variable
One oddity here is that our target feature is simply labeled y, but it's a boolean ("y" or "n") indicating whether the client subscribed to a term deposit.
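Since scikit-learn models want a numeric target, one common first step is to map those labels to 1 and 0. A minimal sketch, assuming the column holds string labels like "yes"/"no" (check the actual file — it may use "y"/"n" instead, in which case adjust the mapping):

```python
import pandas as pd

# Placeholder frame; verify the real label values before mapping
df = pd.DataFrame({"y": ["yes", "no", "no", "yes"]})

# Any label not present in the dict would become NaN, so confirm
# the mapping covers every value in the real column
df["y"] = df["y"].map({"yes": 1, "no": 0})
```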
Binning
Just as you did with the Titanic dataset when you reduced the number of titles, you may find it useful to "bin" categorical features into discrete groups in order to address some of the questions above. There are multiple ways to do this, but previously we used the map() function.
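That same map() approach can bin a categorical column into a smaller set of groups. The job values and the grouping below are hypothetical — choose bins that fit the questions you're actually answering:

```python
import pandas as pd

df = pd.DataFrame({"job": ["admin.", "technician", "student", "retired", "services"]})

# Hypothetical grouping of job categories into coarser bins
job_bins = {
    "admin.": "white-collar",
    "technician": "white-collar",
    "services": "blue-collar",
    "student": "not-working",
    "retired": "not-working",
}
df["job_group"] = df["job"].map(job_bins)
```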
Decision Trees
You can find documentation on how to use decision trees with scikit-learn on these pages:
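The basic fit/predict workflow looks something like this sketch (the features and labels here are made up stand-ins for the bank data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder features and target standing in for the real dataset
X = pd.DataFrame({
    "age": [22, 35, 58, 41, 30, 63, 27, 49],
    "campaign": [1, 2, 4, 1, 3, 2, 5, 1],
})
y = [0, 1, 1, 0, 0, 1, 0, 1]

# Hold out a test set so you can measure generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```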
Model Persistence
When you train a model, a large amount of information is stored in memory. That model can then be used to make predictions for new instances at a later time.
You'll want to save these trained models using Python's pickle module, as shown here.
However, rather than using pickle's default protocol version, you should use protocol version 5, which was introduced in Python 3.8 and is optimized for structures that contain NumPy arrays and pandas DataFrames.
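Saving and reloading with protocol 5 is a one-argument change to pickle.dump. A minimal sketch, using a toy model in place of your real trained one:

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Toy model standing in for your real trained model
model = DecisionTreeClassifier(random_state=0)
model.fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0, 1, 1, 0])

# Persist with protocol 5 (Python 3.8+)
with open("bank_model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=5)

# Later (e.g. in production), restore the trained model
with open("bank_model.pkl", "rb") as f:
    restored = pickle.load(f)
```

One caveat worth noting: a pickled model should generally be loaded with the same scikit-learn version that saved it.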
Model Ensembles, Bagging, and Boosting
Often, we can get better results by using a set of models, each using a slightly different set of training data, or other parameters. These are called "Model Ensembles" and it's very common to use an ensemble of decision trees (often called a "Random Forest") rather than a single tree.
Two popular techniques used in the creation of ensembles are "boosting" and "bagging". You can read more about these topics on pages 163 - 167 of your textbook.
For details on how to use these techniques with scikit-learn, see this page.
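In scikit-learn, both flavors come prepackaged: RandomForestClassifier is a bagging-style ensemble of trees, and GradientBoostingClassifier builds trees sequentially, each one correcting the errors of the previous. A quick sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for the bank data
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Bagging-style: many trees trained on bootstrap samples, votes averaged
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Boosting: trees built one after another, each focusing on prior mistakes
boosted = GradientBoostingClassifier(random_state=42).fit(X, y)
```

On real data, compare these against your single tree using a held-out test set, not training accuracy.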
Avoiding overfitting through pruning
It is very easy to overfit a decision tree. The text discusses strategies to avoid this problem in section 4.4.4 (pages 158 - 163).
In scikit-learn, you can use parameters such as max_depth and min_samples_leaf to control tree complexity and overfitting.
Alternatively, you can use something more elaborate, such as cost complexity pruning.
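Both approaches can be sketched briefly. Pre-pruning just means passing complexity limits to the constructor; cost complexity (post-)pruning uses cost_complexity_pruning_path to get candidate ccp_alpha values, then refits with one of them (the choice of alpha below is arbitrary for illustration — in practice you'd pick it via cross-validation):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Pre-pruning: cap depth and require a minimum number of samples per leaf
shallow = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
shallow.fit(X, y)

# Cost complexity pruning: compute the pruning path, then refit with a
# chosen alpha (larger alpha => more aggressive pruning)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
pruned.fit(X, y)
```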
Johnny, the Data Science Intern, drops by your hotel room around midnight:
Okay, just one last thing, if you need any more help at all, I put together this collection of Google Colab notebooks that might be useful.