[龍騰網] How to solve this real business problem as a data scientist?

Author: 龍騰網看世界

Translation of the original post

How to solve this real business problem as a data scientist?

Translation of the comments

Lyndon D'Arcy

How would a data scientist solve this business problem?

Original question details: “Suppose I have a dataset of all the boat owners that sold a boat for the last 7 years....

If I'm trying to create a predictive model of people that own boats and are likely to be selling their boats...in the near future

What sort of data points and data sets as well as software tools would be needed”

There are two issues with your dataset.

You are trying to solve a two-outcome (binomial) classification problem.

You want to predict, based on who owns a boat today, what will be the outcome in the future - sell, or don't sell.

Unfortunately, everyone in your dataset sold their boat. That means that your model will always predict a sale outcome, because that's all it knows.
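To make this concrete, here is a minimal Python sketch. It uses a deliberately naive majority-class baseline as a stand-in for a real classifier (my illustration, not something from the answer), and the labels and the `fit_majority_baseline` helper are hypothetical.

```python
from collections import Counter


def fit_majority_baseline(labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return lambda owner: most_common_label


# Hypothetical training labels from a sales-only dataset: everyone sold.
sales_only_labels = ["sell"] * 500
predict = fit_majority_baseline(sales_only_labels)

# Whichever owner we ask about, the answer is the same.
predictions = {predict({"owner_id": i}) for i in range(10)}
print(predictions)
```

A more sophisticated model would likely end up in the same place, since it has no example of an owner who kept their boat to learn from.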

What you need is a dataset of all boat owners, regardless of whether they sold in a given year or not. Then you can start to build a meaningful classifier.
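One possible way to build such a dataset is to pick a snapshot date, take everyone who owned a boat on that date, and label each owner by whether they sold within some window afterwards. The sketch below assumes a hypothetical owner list and sales log; the field names, the one-year horizon, and the `label_owners` helper are all illustrative choices, not part of the original question.

```python
from datetime import date

# Hypothetical records: every boat owner (not only sellers) plus a sales log.
owners = [
    {"owner_id": 1, "purchased": date(2015, 5, 1)},
    {"owner_id": 2, "purchased": date(2018, 3, 10)},
    {"owner_id": 3, "purchased": date(2012, 8, 20)},
]
sales = {1: date(2021, 6, 15)}  # owner_id -> sale date; owners 2 and 3 never sold


def label_owners(owners, sales, snapshot, horizon_days=365):
    """Label each owner as of `snapshot`: 1 if they sold within the horizon, else 0."""
    rows = []
    for owner in owners:
        sold_on = sales.get(owner["owner_id"])
        # Skip people who did not yet own a boat, or who had already sold it.
        if owner["purchased"] > snapshot:
            continue
        if sold_on is not None and sold_on <= snapshot:
            continue
        sold_in_window = (
            sold_on is not None and (sold_on - snapshot).days <= horizon_days
        )
        rows.append({
            "owner_id": owner["owner_id"],
            "years_owned": (snapshot - owner["purchased"]).days / 365.25,
            "sold_within_horizon": int(sold_in_window),
        })
    return rows


training_rows = label_owners(owners, sales, snapshot=date(2021, 1, 1))
for row in training_rows:
    print(row)
```

With both sellers and non-sellers in the table, a classifier has two outcomes to tell apart.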

The second issue to be wary of is that if your dataset is based on sales data, it may contain information that only became known after the sale event, for example the sale price. Make sure this sort of information is not included in the model, since it is not known at the time the predictions are made.
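A simple guard against this kind of leakage is to strip every post-sale field before the data reaches the model. The sketch below is only an illustration: the field names and the `drop_post_sale_fields` helper are hypothetical, and which columns count as post-sale depends on the real schema.

```python
# Hypothetical field names; adjust to whatever the real schema contains.
POST_SALE_FIELDS = {"sale_price", "sale_date", "days_on_market"}


def drop_post_sale_fields(record):
    """Keep only the fields that would be known at prediction time."""
    return {
        key: value for key, value in record.items()
        if key not in POST_SALE_FIELDS
    }


raw_record = {
    "owner_id": 1,
    "boat_age_years": 12,
    "marina_fees_per_year": 4200,
    "sale_price": 18500,       # only known after the sale
    "sale_date": "2021-06-15",  # only known after the sale
}
features = drop_post_sale_fields(raw_record)
print(features)
```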

A final thought on other datasets that might be useful.

Why do people sell boats? Is it too expensive? Do they no longer use it? Are they upgrading to a better boat? Are they moving to another city and can't take it with them?

If you can get an expert to give you a breakdown of the main reasons that people sell boats, that will help point you to the data sources that will give you the most predictive value.

Ricardo Vladimiro

I'll try to go through all of the points you raised.

Features don't really apply here. No modelling is needed to solve this problem; it might help, but you don't actually need it. Picking the top words is a straightforward statistical test. I don't know your dataset, though, or whether you are able to create the one you need.

The target value depends on what you want to test. This depends on a number of variables.

In the end you mention both models and problems, which suggests to me that you don't really know what you want to do. My best advice is not to do it yourself.

The best solution to this is to hire a statistician or data analyst, preferably one that is able to handle observational studies.

Colleen Farrelly

As a data scientist, what's the most complex real-life problem you have ever solved (using data science)?

By far, problems involving human behavior and biological responses, given a very incomplete set of predictors. At Kaplan, we've been able to predict with a high degree of accuracy (usually >95% with current models) which students will drop out at a given point in time and which students are at risk of failing exit exams, using predictors that, at the start of the projects, I wouldn't have guessed were associated with the behavior.

Jason T Widjaja

How can a data scientist negotiate with other parties to get access to their data?

Originally Answered: What is a data scientist without data? How can you negotiate with other parties to gain access to data?

A data scientist without data is like a printer without ink - full of capability but unable to function without the required raw material.

But printers need more than ink to function well. Just as a printer is tasked to print useful content rather than spraying ink randomly on a page, data scientists also need use cases and well formed projects to investigate.

The art of translating business problems into data science use cases, and getting the requisite data to do so, is a crucial but underrated skill. I suspect the lack of this skill is the cause of many failures in data science investments. Having senior executives hire a team of data scientists and declare with much fanfare ‘go forth and do AI and machine learning!’ is neither necessary nor sufficient.

Having gone through this use case hunting and data acquisition dozens of times with my team, here are five things that I have found helpful in getting access to data:

Don’t show off tech - show off the potential to solve problems.

Align every project to business strategy. There are multiple possible projects in every business department, but the ones that ultimately get support are usually those that most directly and significantly impact business metrics.

Run open learning sessions using data generated to approximate real data sets. There is a lot of interest in data science, machine learning and AI at the moment.

And most importantly:

Communicate with empathy to your audience.

