




How to solve this real business problem as a data scientist?



Lyndon D'Arcy

How would a data scientist solve this business problem?

Original question details: “Suppose I have a dataset of all the boat owners that sold a boat for the last 7 years....

If I'm trying to create a predictive model of people that own boats and are likely to be selling their boats...in the near future

What sort of data points and data sets as well as software tools would be needed”

There are two issues with your dataset.

You are trying to solve a two-outcome (binomial) classification problem.

You want to predict, based on who owns a boat today, what will be the outcome in the future - sell, or don't sell.






Unfortunately, everyone in your dataset sold their boat. That means that your model will always predict a sale outcome, because that's all it knows.

What you need is a dataset of all boat owners, regardless of whether they sold in a given year or not. Then you can start to build a meaningful classifier.
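As a rough sketch of what that dataset could look like, the snippet below labels every owner in a registry with a binary target for a given year. The `owners` and `sales` records, the field names, and the `label_owners` helper are all hypothetical and only illustrate the idea of keeping non-sellers in the data.

```python
from datetime import date

# Hypothetical inputs: a registry of ALL owners, plus a log of sales.
owners = [
    {"owner_id": 1, "boat_age_years": 12},
    {"owner_id": 2, "boat_age_years": 3},
    {"owner_id": 3, "boat_age_years": 7},
]
sales = [
    {"owner_id": 1, "sale_date": date(2023, 5, 14)},
]


def label_owners(owners, sales, year):
    """Attach a binary target: 1 if the owner sold during `year`, else 0."""
    sellers = {s["owner_id"] for s in sales if s["sale_date"].year == year}
    return [{**o, "sold": int(o["owner_id"] in sellers)} for o in owners]


labeled = label_owners(owners, sales, 2023)
for row in labeled:
    print(row)
```

Both outcomes now appear in the data, so a classifier has something to distinguish between.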

The second issue to be wary of is that if your dataset is based on sales data, it may contain information that only became known after the sale event, for example the sale price. Make sure that this sort of information is not included in the model, since it is not known at the time that we are making our predictions (this problem is often called target leakage).
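One simple way to guard against this is to strip post-sale fields before any modelling. This is a minimal sketch: the field names in `POST_SALE_FIELDS` and the `drop_post_sale_fields` helper are assumptions, and which columns actually count as post-sale depends on your data.

```python
# Hypothetical field names; decide the real list by asking
# "would I know this value before the sale happened?"
POST_SALE_FIELDS = {"sale_price", "sale_date", "buyer_id"}


def drop_post_sale_fields(record, post_sale_fields=POST_SALE_FIELDS):
    """Return a copy of `record` without fields only known after a sale."""
    return {k: v for k, v in record.items() if k not in post_sale_fields}


record = {"owner_id": 1, "boat_age_years": 12, "sale_price": 18500, "sold": 1}
features = drop_post_sale_fields(record)
print(features)
```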

A final thought on other datasets that might be useful.

Why do people sell boats? Too expensive? They don't use it any more? Upgrade to a better boat? Moving city and can't take it with them?

If you can get an expert to give you a breakdown of the main reasons that people sell boats, that will help point you to the data sources that will give you the most predictive value.







Ricardo Vladimiro

I'll try to go through all of the points you raised.

Features don't really apply here. You don't need any modelling to solve this problem; you could use it, but it isn't required. Picking the top words is a straightforward statistical test. I don't know your dataset, though, or whether you are able to create the one you need.

The target value depends on what you want to test. This depends on a number of variables.

In the end you mention both models and problems, which suggests to me that you don't really know what you want to do. My best advice is: do not do it yourself.

The best solution is to hire a statistician or data analyst, preferably one who is able to handle observational studies.






Colleen Farrelly

As a data scientist, what's the most complex real-life problem you have ever solved (using data science)?

By far, problems involving human behavior and biological responses given a very incomplete set of predictors. At Kaplan, we've been able to predict which students will drop out at a given point in time and which students are at risk of failing exit exams with a high degree of accuracy (usually >95% with current models), using predictors I wouldn't have guessed were associated with the behavior at the start of the projects.



Jason T Widjaja

How can a data scientist negotiate with other parties to get access to their data?

Originally Answered: What is a data scientist without data? How can you negotiate with other parties to gain access to data?

A data scientist without data is like a printer without ink - full of capability but unable to function without the required raw material.

But printers need more than ink to function well. Just as a printer is tasked with printing useful content rather than spraying ink randomly on a page, data scientists also need use cases and well-formed projects to investigate.

The art of translating business problems into data science use cases, and getting the requisite data to do so, is a crucial but underrated skill. I suspect the lack of this skill is the cause of many failures in data science investments. Having senior executives hire a team of data scientists and declare with much fanfare ‘go forth and do AI and machine learning!’ is neither necessary nor sufficient.






Having gone through this use-case hunting and data acquisition process dozens of times with my team, here are five things that I have found helpful in getting access to data:

Don’t show off tech - show off the potential to solve problems.

Align every project to business strategy. There are multiple possible projects in every business department, but the ones that ultimately get support are usually those that most directly and significantly impact business metrics.

Run open learning sessions using data generated to approximate real data sets. There is a lot of interest in data science, machine learning and AI at the moment.

And most importantly:

Communicate with empathy to your audience.
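For the learning sessions mentioned above, generated data can be very simple. The sketch below is only an illustration: the `make_synthetic_records` helper, the field names, and the assumed relationship between `usage_hours` and `churned` are all made up, and a real session would shape the data to resemble the audience's own.

```python
import random


def make_synthetic_records(n, seed=0):
    """Generate n fake customer records with a binary outcome."""
    rng = random.Random(seed)  # fixed seed so every session sees the same data
    rows = []
    for customer_id in range(n):
        usage_hours = rng.randint(0, 40)
        # Assumed relationship, for illustration only:
        # customers who use the product less are more likely to churn.
        p_churn = max(0.05, 0.6 - 0.015 * usage_hours)
        rows.append({
            "customer_id": customer_id,
            "usage_hours": usage_hours,
            "churned": int(rng.random() < p_churn),
        })
    return rows


rows = make_synthetic_records(1000)
print(rows[:3])
```

Because no real customer appears in the data, it can be shared freely while the conversation about access to the real data is still under way.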






