IMDB 5000 Movie Dataset (a dataset of 5000 movies from IMDB): Description

Description

Background

How can we tell how good a movie is before it is released in cinemas?

This question puzzled me for a long time, since there is no universal way to judge the quality of a movie. Many people rely on critics to gauge the quality of a film, while others use their instincts. But it takes time to collect a reasonable number of critic reviews after a movie is released, and human instinct is sometimes unreliable.

Question

  1. Given that thousands of movies are produced each year, is there a better way for us to tell how good a movie is without relying on critics or our own instincts?

  2. Does the number of human faces in a movie poster correlate with the movie's rating?
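Question 2 comes down to measuring the correlation between two columns, “facenumber_in_poster” and “imdb_score”. A minimal sketch of a Pearson correlation using only the standard library follows. The function name `pearson` and the toy values are assumptions made for illustration; the values are made up and are not rows from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only, NOT taken from the dataset.
faces = [0, 1, 2, 3, 4]
scores = [7.5, 7.0, 6.8, 6.1, 5.9]
r = pearson(faces, scores)
print(f"r = {r:.3f}")
```

A value of r near 0 would suggest little linear relationship between the two columns; a correlation on its own says nothing about causation.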

Method

To answer these questions, I scraped data on 5000+ movies from the IMDB website using a Python library called “scrapy”.

The scraping process took 2 hours to finish. In the end, I was able to obtain all 28 variables I needed for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors and actresses. The 28 variables are listed below:

“movie_title”: movie title

“color”: color

“num_critic_for_reviews”: number of critic reviews

“movie_facebook_likes”: Facebook likes for the movie

“duration”: running time

“director_name”: director's name

“director_facebook_likes”: Facebook likes for the director

“actor_3_name”: name of actor 3

“actor_3_facebook_likes”: Facebook likes for actor 3

“actor_2_name”: name of actor 2

“actor_2_facebook_likes”: Facebook likes for actor 2

“actor_1_name”: name of actor 1

“actor_1_facebook_likes”: Facebook likes for actor 1

“gross”: box office gross

“genres”: genres

“num_voted_users”: number of users who voted

“cast_total_facebook_likes”: total Facebook likes for the cast

“facenumber_in_poster”: number of faces in the poster

“plot_keywords”: plot keywords

“movie_imdb_link”: IMDB link for the movie

“num_user_for_reviews”: number of user reviews

“language”: language

“country”: country

“content_rating”: content rating

“budget”: budget

“title_year”: release year

“imdb_score”: IMDB score

“aspect_ratio”: aspect ratio
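Before analyzing a copy of the data, it can help to confirm that all 28 variables are actually present. The sketch below checks a CSV header against the variable names listed above. The helper name `missing_columns` is hypothetical, and the in-memory sample (with placeholder row values) stands in for whatever file the dataset is stored in.

```python
import csv
import io

# The 28 variable names exactly as listed in the dataset description.
EXPECTED_COLUMNS = [
    "movie_title", "color", "num_critic_for_reviews", "movie_facebook_likes",
    "duration", "director_name", "director_facebook_likes", "actor_3_name",
    "actor_3_facebook_likes", "actor_2_name", "actor_2_facebook_likes",
    "actor_1_name", "actor_1_facebook_likes", "gross", "genres",
    "num_voted_users", "cast_total_facebook_likes", "facenumber_in_poster",
    "plot_keywords", "movie_imdb_link", "num_user_for_reviews", "language",
    "country", "content_rating", "budget", "title_year", "imdb_score",
    "aspect_ratio",
]


def missing_columns(header):
    """Return the expected column names that are absent from a CSV header."""
    present = set(header)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# In-memory stand-in for a real file; the row values are placeholders.
sample = io.StringIO("movie_title,imdb_score\nExample Movie,7.0\n")
reader = csv.DictReader(sample)
absent = missing_columns(reader.fieldnames)
print(f"{len(absent)} of {len(EXPECTED_COLUMNS)} expected columns are missing")
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.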

To answer question 2, I applied a human face detection algorithm to all the posters using a Python library called dlib and extracted the number of faces in each poster.
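The face-counting step can be sketched roughly as follows. This assumes dlib's built-in frontal face detector; the text does not say which dlib detector or settings were actually used, and the helper names `load_dlib_detector`, `count_faces`, and `count_faces_in_file` are hypothetical.

```python
def load_dlib_detector():
    """Return dlib's built-in frontal face detector (an assumed choice)."""
    import dlib  # imported lazily so the rest of the sketch runs without dlib
    return dlib.get_frontal_face_detector()


def count_faces(image, detector, upsample=1):
    """Count the detections that `detector` returns for one image.

    `detector` is any callable taking (image, upsample) and returning a
    sequence of detections, which is how dlib's frontal face detector is
    called.
    """
    return len(detector(image, upsample))


def count_faces_in_file(path, upsample=1):
    """Load one poster image from disk and count the faces found in it."""
    import dlib
    image = dlib.load_rgb_image(path)
    return count_faces(image, load_dlib_detector(), upsample)


# Hypothetical usage, assuming dlib is installed and the file exists:
#   n = count_faces_in_file("poster.jpg")
```

Keeping `count_faces` independent of dlib makes it easy to swap in a different detector or to test the counting logic with a stub.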

Blog and GitHub code

For more details about the scraping steps, the EDA, and the predictions, see: https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/

GitHub page: https://github.com/sundeepblue/movie_rating_prediction

Important notes

  1. This dataset is by no means a comprehensive scrape of all attributes relating to movies. It stemmed from one of my projects, built from scratch and finished in around one week. So please do not be surprised if you find that something is off.

  2. This dataset is a proof of concept. It can be used for experimental and learning purposes, to get your hands dirty with web scraping, basic EDA (exploratory data analysis), and learning algorithms in R or Python. For comprehensive movie analysis and accurate movie rating prediction, 28 attributes from 5000 movies might not be enough. A decent dataset could contain hundreds of attributes from 50K or more movies and would require tons of feature engineering.

  3. There are around 800 “0”s in the “gross” attribute. Each was caused either by (a) no gross number being found on the movie's page, or (b) the scrapy HTTP request returning nothing within a short period of time. So please use your own judgement when analyzing this attribute.

  4. There are around 908 directors whose “director_facebook_likes” attribute is 0. If somebody analyzed the “director_facebook_likes” attribute, the results could be off; for instance, a ranking of the top 10 or top 50 directors could be inaccurate. Thanks to user Kryslor for pointing this out. This is interesting, since the code I used to scrape everybody’s Facebook likes was identical. See function parse_facebook_likes_number(). It was hard to scrape this data directly from the IMDB website (due to a dynamically embedded div frame), so I had to use a hacky approach and send requests directly to the Facebook website (see line 38 of this file). Perhaps for some directors, Facebook did not respond with a reasonable result within the short timespan (< 0.25 second) and returned “None” in Python (translated to 0 in my code).

  5. You might want to treat those 0s as missing values when using certain machine learning algorithms.

  6. Thanks to user “Quinton”, who found a bug in the dataset on 11/23/2016:

    (November 23, 2016 at 12:08 am) We actually used your IMDB dataset for an Advanced Data Mining class at Rockhurst University in Kansas City, MO. We love the data set and we really appreciate the time it took to create it. However, we believe we found a small flaw in the data. Not all of the IMDB movie budget numbers are in US dollars; for example, the South Korean movie “The Host” has its budget numbers in S. Korean Won (Korean currency). But there is no data in the dataset that tells you the currency. The existence of foreign currencies skews the budget data for foreign films, particularly for currencies with extreme exchange rates when compared to USD. For instance, many could assume the data set shows “The Host” cost $12 billion to make when it truthfully cost only 12 billion Won, but the dataset doesn’t make the distinction. It is not just an issue with Korean movies; we found Turkish and Japanese movies with the same issue.

    Quinton was right. When I parsed the currency, I didn’t take the Korean currency into consideration. Therefore, please be cautious when analyzing currency-related attributes such as “budget” for movies whose figures are not in US dollars (for example, Korean, Japanese, and Turkish movies). The fix is actually quite simple in the corresponding Python code.

  7. Please be mindful that analyzing currency-related attributes, such as “gross” or “budget”, is actually more complicated than it seems. For a really thorough and accurate analysis (EDA or prediction), we may want to do some systematic feature engineering on those attributes. For example, one US dollar in 1920 is worth something different from one US dollar in 2010. So we need to take inflation across years into account and normalize all US dollar amounts to one basis (a certain year). The same goes for all other currencies (British pound, Chinese RMB yuan, etc.). If you also consider the exchange rate between two currencies and want to convert everything into dollars, things become trickier, because those rates also vary over time: $1 was worth roughly RMB 8.3 in 2000 but roughly RMB 6.2 in 2015.
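The notes above suggest two cleaning steps: treating 0 as a missing value, and putting money amounts on a common basis (one currency, one reference year). The sketch below illustrates both. All function names are hypothetical, and the exchange rate and price index values are made-up placeholders, not real figures; a real analysis would need per-year exchange rates and an inflation index from a trusted source.

```python
def zeros_to_missing(values):
    """Replace 0 with None so zeros are not averaged in as real values."""
    return [None if v == 0 else v for v in values]


def mean_ignoring_missing(values):
    """Mean of the values that are not None, or None if nothing is left."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def to_base_year_usd(amount, usd_per_unit, price_index, base_index):
    """Convert a local-currency amount to base-year US dollars.

    usd_per_unit: US dollars per one unit of local currency in the film's year.
    price_index:  US price index in the film's year.
    base_index:   US price index in the chosen base year.
    """
    return amount * usd_per_unit * (base_index / price_index)


# Placeholder "gross" values; two of them are the suspicious zeros.
gross = [100.0, 0, 300.0, 0]
cleaned = zeros_to_missing(gross)
print(mean_ignoring_missing(gross))    # 100.0 (zeros drag the mean down)
print(mean_ignoring_missing(cleaned))  # 200.0 (zeros treated as missing)

# A 12 billion won budget with a made-up rate and made-up index values.
budget_usd = to_base_year_usd(12e9, usd_per_unit=0.001,
                              price_index=100.0, base_index=125.0)
print(f"{budget_usd:,.0f} base-year USD")
```

Since the dataset does not record the currency of each budget figure, the currency for every movie would have to be looked up or inferred separately before a conversion like this could be applied.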
