Questions
You're about to go into a strategy meeting with the CEO, Vice President of Customer Relations, and Vice President of Finance. They want to make sure you have the data required to answer the questions they're most interested in.
Be prepared to answer the following questions:
Problem Type
Devon, the CEO, says:
I just sat through four hours of machine learning training with the board of directors this past week, so I'm curious to get your take on this.
Looking at the data and our business model, what kind of machine learning problem do you think we're looking at here?
Model Confidence
Cecil, the VP of Customer Relations, asks:
My biggest concern right now is making sure that, whatever method we come up with to predict housing prices, we can also attach some kind of empirical confidence metric to it.
Based on your initial analysis of the data, your team feels you can best show confidence in your model by using:
- The sum of squares error (SSE).
- The mean squared error (MSE).
- The root mean squared error (RMSE).
- The $R^2$ value.
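To make the four options concrete, here is a minimal sketch of how each metric could be computed from actual and predicted prices. The function name `regression_metrics` and the sample prices are hypothetical, chosen only for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return SSE, MSE, RMSE and R^2 for paired actual/predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    residuals = [a - p for a, p in zip(y_true, y_pred)]
    sse = sum(r * r for r in residuals)            # sum of squared errors
    mse = sse / n                                  # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error
    mean_y = sum(y_true) / n
    sst = sum((a - mean_y) ** 2 for a in y_true)   # total sum of squares
    r2 = 1.0 - sse / sst if sst > 0 else float("nan")
    return {"sse": sse, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical sale prices in dollars (not taken from the real data set).
actual = [200_000, 250_000, 300_000, 350_000]
predicted = [210_000, 240_000, 310_000, 340_000]
metrics = regression_metrics(actual, predicted)
```

With these made-up numbers every prediction is off by $10,000, so the RMSE comes out to 10,000. Note the units: SSE and MSE are in squared dollars, RMSE is in dollars, and $R^2$ is unitless.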
Insurance Question
William, the VP of Finance, asks:
Our insurance customers are particularly interested in making sure that homes in… unsavory neighborhoods aren't estimated too high.
Is there a way we can easily identify properties in low income areas and have the model lower those estimates to protect our insurance customers' interests?
Based on your initial analysis of the data, your team feels:
- We can lower the predicted price for specific neighborhoods before training the model.
- We can add average income or other demographic information for the area as features.
- For specific ZIP codes, we could add a pipeline step that reduces the predicted price by a set percentage before the final price is output, so that those properties aren't overvalued.
- Taking this kind of action would be a violation of federal laws and/or ethics.
Data Analysis
Johnny, the data science intern, asks:
The head of data science says we should use gradient boosted trees for this analysis.
I've noticed that a lot of the features span very different ranges.
For example, how should we handle square footage?
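One way to explore Johnny's question is to check whether rescaling a feature changes the split a tree would choose. The sketch below is a simplified single regression stump, not a gradient boosting implementation, and the helper `best_split` and the data are hypothetical. It relies on the fact that a tree split depends only on the ordering of the feature values, not on their magnitude.

```python
def best_split(xs, ys):
    """Find the split of a one-feature regression stump that minimizes SSE.

    Returns the set of row indices sent to the left child.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float("inf"), None
    for k in range(1, len(order)):
        # Only split between distinct feature values.
        if xs[order[k - 1]] == xs[order[k]]:
            continue
        left, right = order[:k], order[k:]
        sse = 0.0
        for group in (left, right):
            mean = sum(ys[i] for i in group) / len(group)
            sse += sum((ys[i] - mean) ** 2 for i in group)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


# Hypothetical data: square footage and sale price in dollars.
sqft = [850, 1200, 1500, 2100, 2600, 3400]
price = [150_000, 190_000, 210_000, 320_000, 360_000, 450_000]

# Min-max scale square footage to the range [0, 1].
lo, hi = min(sqft), max(sqft)
sqft_scaled = [(x - lo) / (hi - lo) for x in sqft]

# The stump sends the same rows left whether or not the feature is scaled.
same_partition = best_split(sqft, price) == best_split(sqft_scaled, price)
```

Here `same_partition` is `True`: the raw and scaled square footage produce the same grouping of homes. Each tree in a boosted ensemble chooses its splits the same way, so this suggests feature scaling matters much less for tree-based models than for distance-based or gradient-descent-based ones.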