Module 02 - Overview
This week you'll learn about supervised learning, decision trees, and more about model evaluation.
Subject to Change
Keep in mind that your instructor may deviate somewhat from the following guide, and they have final say on assignment requirements, delivery methods, and due dates. So be sure to pay attention to both in-class and Canvas announcements.
Module 02 Assignments
For your convenience, here are links to the module 02 readings and assignments:
Readings
Data
- Bank Dataset
- Bank Data Dictionary
- Bank Holdout Dataset
- Bank Holdout MINI Dataset
- Sample CSV predictions
- Google Colab Notebook
Use a decision tree
If you're really struggling with how to make a decision tree, you should try reading through the documentation a couple more times.
If that still doesn't help, this notebook can give you some more guidance:
Use a random forest
If you want to try different machine learning algorithms on this dataset, you may want to try a random forest.
This notebook can give you an example of how to build a random forest:
Holdout Dataset
Most of the modules contain a "holdout dataset" which is used by the instructor to see how well your model performs. This dataset has the same fields as the training set except we have removed the value you will be predicting. Your model will not have seen this information before, and the results will indicate how well the model has "learned" during training.
Once your team is confident your model has been adequately trained, load the data from the holdout dataset and make predictions on it. You are required to make predictions for ALL rows in the holdout dataset.
Don't Forget:
- Perform the same transformations on the dataset. (MinMaxScaling, add/remove features, etc.)
- Do NOT remove any rows from the holdout dataset
- Do NOT sort or shuffle the holdout dataset
Holdout Mini Dataset
This module has a mini holdout dataset. You can test your model against this mini holdout dataset as many times as you'd like. It is here to
- Verify that your CSV is in the correct format
- Verify that your model is making good predictions
- Give you an idea of what your grade might be on the final holdout set
Once your team is confident your model has been adequately trained, load the data from the mini holdout dataset and make predictions on it. Save the predictions as a CSV.
Open and run this colab (follow the instructions at the top of the notebook)
Don't Forget:
- Perform the same transformations on the dataset. (MinMaxScaling, add/remove features, etc.)
- Do NOT remove any rows from the holdout dataset
- Do NOT sort or shuffle the holdout dataset
- You can test the mini holdout as many times as you'd like, but be careful not to overfit to the mini holdout set. Ideally, you should get similar results to the test dataset you created.
Sample code to help with the mini holdout
test = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bank_holdout_test_mini.csv")
# Do same transformations as on the training set
predictions = clf.predict(test)
# Convert the predictions to a dataframe and label the column 'y'
my_predictions = pd.DataFrame(predictions, columns = ['y'])
# Replace PUTTEAMNUMBERHERE with your team
my_predictions.to_csv("teamPUTTEAMNUMBERHERE-module2-predictions.csv",index=False)