CSE 450 - Machine Learning

Module 03 — Housing Estimates, Project

Overview

After a few more meetings with the executive team, the head of the data science division has assigned your team to address the following questions raised by the stakeholders:

Cecil, the VP of Customer Relations, says:

The biggest thing I want to see is quantifiable evidence that the predictions we come up with are reliable.

William, the VP of Finance, asks:

I'd like to know which property types are weighing most heavily in the house prices predicted by your model. My Excel spreadsheets can tell me that information for our current methodology... can your so-called artificial intelligence do the same?

Devon, the CEO, adds:

Yes...thank you William. These are all great questions.

One other thing the board was wondering is whether there are additional factors about these areas, ones we aren't taking into account, that might be affecting prices.

That may be a little above and beyond what your team is planning, since it would require finding more data from an external source and correlating it with the data we already have, but if you have the time, I know they'd appreciate it.

If you could send us your team's write-up on this by Saturday night, that would be great.

Cecil, the VP of Customer Relations, says:

Oh, one more thing. We actually just received a batch of new home data. Could you run your model on it to make some predictions for us? We are anxious to see your model in action.

You can access the data here: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing_holdout_test.csv

Then, please include with your write-up a CSV file that has a single column with the header "price" and one prediction per row for each of the homes in this file.
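One way to produce a file in that shape with pandas is sketched below. The placeholder predictions and the output filename team_predictions.csv are assumptions for illustration; in practice the values would come from your trained model's predictions on the holdout data.

```python
import pandas as pd

# Placeholder predictions (assumption): in practice these would be the
# values your trained model predicts for the holdout homes.
predictions = [450000.0, 312500.0, 789000.0]

# Build a single-column frame whose header is exactly "price".
submission = pd.DataFrame({"price": predictions})

# index=False keeps pandas from writing the row index as an extra column.
submission.to_csv("team_predictions.csv", index=False)

# Sanity check: read the file back and confirm its header and row count.
check = pd.read_csv("team_predictions.csv")
print(check.columns.tolist(), len(check))
```

Reading the file back is a cheap way to catch a stray index column or a row count that does not match the number of homes in the holdout file.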

Team Project Expectations

Be sure to read over the Team Project Expectations guide to know what the expectations are for this and future projects.

Tips from Johnny

Johnny, the data science intern, whispers to you after the meeting:

Hey, I put together a list of tips and ideas that might help us out:

Data Dictionary

Our database analyst put together this data dictionary to help explain the values and sources of different columns in the housing dataset, so be sure to review that.

Binning

Just as you did with the Titanic dataset when you reduced the number of titles, you may find it useful to "bin" certain features into discrete groups in order to address some of the questions above. There are multiple ways to do this, but previously we used the map() function.
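Here is a minimal sketch of two common binning approaches in pandas. The column names, values, and group boundaries are made up for illustration, so check the data dictionary before choosing real ones.

```python
import pandas as pd

# Toy data (assumption): column names and values are illustrative only.
homes = pd.DataFrame({
    "condition": [1, 2, 3, 4, 5],
    "sqft_living": [800, 1500, 2200, 3100, 5000],
})

# Approach 1: map() with an explicit dictionary, as with the Titanic titles.
condition_groups = {1: "poor", 2: "poor", 3: "average", 4: "good", 5: "good"}
homes["condition_group"] = homes["condition"].map(condition_groups)

# Approach 2: pd.cut() bins a numeric column by value ranges.
# Intervals are right-inclusive by default: (0, 1000], (1000, 2500], (2500, 10000].
homes["size_group"] = pd.cut(
    homes["sqft_living"],
    bins=[0, 1000, 2500, 10000],
    labels=["small", "medium", "large"],
)

print(homes)
```

Note that map() returns NaN for any value missing from the dictionary, so it is worth checking for unexpected missing values after binning.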

Adding External Data

If you do decide to add additional data from another source, such as data you find related to a particular zip code, you might find this Pandas tutorial on combining tables to be useful.
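As a rough sketch of what combining tables can look like, the example below joins a hypothetical external table onto housing rows by zip code. Both tables, the zipcode key, and the median_income column are assumptions for illustration, not part of the provided dataset.

```python
import pandas as pd

# Toy housing rows (assumption): the real dataset has many more columns.
homes = pd.DataFrame({
    "zipcode": [98001, 98002, 98001, 98999],
    "price": [300000, 250000, 320000, 410000],
})

# Hypothetical external table keyed by zip code.
zip_info = pd.DataFrame({
    "zipcode": [98001, 98002],
    "median_income": [72000, 65000],
})

# A left join keeps every home, even when the external source has no match.
merged = homes.merge(zip_info, on="zipcode", how="left")

# Rows without a match get NaN, so count them before modeling.
missing = int(merged["median_income"].isna().sum())
print(merged)
print("homes without external data:", missing)
```

A left join is used here so that no homes are silently dropped; an inner join would discard every home whose zip code is absent from the external table.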

XGBoost

You may find the documentation on how to use xgboost useful, particularly the sections on the sklearn wrapper.

Johnny, the Data Science Intern, drops by your hotel room around midnight:

Okay, just one last thing: if you need any more help at all, I put together this collection of Google Colab notebooks that might be useful.


  1. CEO photo by Oz Seyrek on Unsplash  

  2. VP of HR photo by Christina @ wocintechchat.com 

  3. VP of Finance photo by steffen Wienberg on Unsplash 

  4. Data Science Intern photo by Fábio Lucas on Unsplash