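Background
Netflix is one of the most popular media and video streaming platforms. It hosts more than 8,000 movies and TV shows, and as of mid-2021 it had over 200 million subscribers worldwide. This tabular dataset lists all movies and TV shows available on Netflix, with details such as cast, director, rating, release year, and duration.
This article is a simple exploratory data analysis (EDA) of the dataset; later posts will dig deeper into Netflix.
Import the required packages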
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
Data loading and preprocessing
netflix_overall = pd.read_csv('./netflix-shows/netflix_titles.csv')
netflix_overall.head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
netflix_overall.shape
(8807, 12)
netflix_overall.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
netflix_overall.nunique()
show_id 8807
type 2
title 8807
director 4528
cast 7692
country 748
date_added 1767
release_year 74
rating 17
duration 220
listed_in 514
description 8775
dtype: int64
The `type` column has only two values, so let's start by analyzing it.
# plt.rcParams['figure.dpi'] = 200
# plt.rcParams['figure.figsize'] = [6, 3.0]
sns.set(style="darkgrid")
ax = sns.countplot(x="type", data=netflix_overall, palette="Set3")
plt.figure(figsize=(12,6))
plt.title('netflix type')
plt.pie(netflix_overall.type.value_counts(), labels=netflix_overall.type.value_counts().index, autopct='%1.1f%%', startangle=180);
Observations:
- Netflix content is still dominated by movies
- Movies account for nearly 70% of titles, more than 6,000 in total (6,131)
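The split can also be read off numerically instead of from the pie chart. The sketch below is only illustrative: `toy` is a hypothetical stand-in for `netflix_overall`, and `summary` is a name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for netflix_overall; only the `type` column matters here.
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

counts = toy["type"].value_counts()                # absolute count per type
shares = toy["type"].value_counts(normalize=True)  # fractions that sum to 1

# Both Series are indexed by the type label, so they align automatically.
summary = pd.DataFrame({"count": counts, "share_pct": (shares * 100).round(1)})
print(summary)
```

Run on the full dataset, the same two calls should reproduce the 6,131 movies out of 8,807 titles reported in this article, which is about 69.6%.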
Missing-value analysis
netflix_overall.isnull().sum()
show_id 0
type 0
title 0
director 2634
cast 825
country 831
date_added 10
release_year 0
rating 4
duration 3
listed_in 0
description 0
dtype: int64
total = netflix_overall.isnull().sum().sort_values(ascending = False)
percent = (netflix_overall.isnull().sum()/netflix_overall.isnull().count()*100).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(7)
| | Total | Percent |
---|---|---|
director | 2634 | 29.908028 |
country | 831 | 9.435676 |
cast | 825 | 9.367549 |
date_added | 10 | 0.113546 |
rating | 4 | 0.045418 |
duration | 3 | 0.034064 |
show_id | 0 | 0.000000 |
plt.figure(figsize=(12,6))
plt.title('Percentage of missing values')
plt.pie(missing_data.Total[:4], labels=missing_data.Total.index[:4], autopct='%1.2f%%', startangle=180);
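The chart shows how the missing values are distributed. Missing directors and cast members cannot be filled in arbitrarily, so dropping them is one option; columns with only a few gaps can be filled with the median, the mode, or a similar statistic.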
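The strategy described above can be sketched as two small helpers. Both function names (`missing_summary`, `fill_with_mode`) and the `toy` frame are hypothetical and introduced only for illustration. Because the columns with gaps in this dataset are text columns, the sketch fills with the mode rather than the median.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    total = df.isnull().sum()
    percent = total / len(df) * 100
    out = pd.DataFrame({"Total": total, "Percent": percent})
    return out.sort_values("Total", ascending=False)

def fill_with_mode(df, columns):
    """Return a copy in which gaps in the listed columns are filled with each column's most frequent value."""
    filled = df.copy()
    for col in columns:
        filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled

# Hypothetical toy data standing in for netflix_overall.
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "rating": ["TV-MA", "TV-MA", np.nan, "PG"],
})

report = missing_summary(toy)
clean = fill_with_mode(toy, ["rating"])  # director is deliberately left alone
print(report)
```

Returning a copy keeps the raw frame available for comparison, which helps when checking how much a fill changes later counts.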
2446
2676
188
6131
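Clearly, dropping the rows with `nan` values outright would wreck the 'TV Show' data: 2,446 of the 2,676 TV shows have no director listed (compared with 188 of the 6,131 movies). So instead of dropping rows, one column is left out of the analysis entirely (`cast` no longer appears in the outputs that follow), and the missing directors are filled with a placeholder such as `unknown` (the code below uses 'No Director').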
netflix_overall['country'] = netflix_overall['country'].fillna(netflix_overall['country'].mode()[0])
netflix_overall['date_added'] = netflix_overall['date_added'].fillna(netflix_overall['date_added'].mode()[0])
netflix_overall['rating'] = netflix_overall['rating'].fillna(netflix_overall['rating'].mode()[0])
netflix_overall['duration'] = netflix_overall['duration'].fillna(netflix_overall['duration'].mode()[0])
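# Assign the result back: fillna(..., inplace=True) on a selected column may not update the frame in recent pandas
netflix_overall['director'] = netflix_overall['director'].fillna('No Director')
# `cast` is absent from every output below, so it was evidently dropped at this point (assumed step)
netflix_overall = netflix_overall.drop(columns=['cast'])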
netflix_overall.isnull().any()
show_id False
type False
title False
director False
country False
date_added False
release_year False
rating False
duration False
listed_in False
description False
dtype: bool
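Check for duplicate values
Splitting the dataset
Every title in the dataset is either a movie or a TV show, so it helps to have a separate dataset for each. That way we can dig into Netflix movies or Netflix TV shows on their own. We therefore create two new datasets: one for movies and one for TV shows.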
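# .copy() makes each subset an independent frame, so later column assignments do not hit a slice of netflix_overall
netflix_movies = netflix_overall[netflix_overall['type'] == 'Movie'].copy()
netflix_shows = netflix_overall[netflix_overall['type'] == 'TV Show'].copy()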
display(netflix_movies.head())
netflix_shows.head()
| | show_id | type | title | director | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
6 | s7 | Movie | My Little Pony: A New Generation | Robert Cullen, José Luis Ucha | United States | September 24, 2021 | 2021 | PG | 91 min | Children & Family Movies | Equestria's divided. But a bright-eyed hero be... |
7 | s8 | Movie | Sankofa | Haile Gerima | United States, Ghana, Burkina Faso, United Kin... | September 24, 2021 | 1993 | TV-MA | 125 min | Dramas, Independent Movies, International Movies | On a photo shoot in Ghana, an American model s... |
9 | s10 | Movie | The Starling | Theodore Melfi | United States | September 24, 2021 | 2021 | PG-13 | 104 min | Comedies, Dramas | A woman adjusting to life after a loss contend... |
12 | s13 | Movie | Je Suis Karl | Christian Schwochow | Germany, Czech Republic | September 23, 2021 | 2021 | TV-MA | 127 min | Dramas, International Movies | After most of her family is murdered in a terr... |
| | show_id | type | title | director | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | s2 | TV Show | Blood & Water | No Director | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
2 | s3 | TV Show | Ganglands | Julien Leclercq | United States | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
3 | s4 | TV Show | Jailbirds New Orleans | No Director | United States | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
4 | s5 | TV Show | Kota Factory | No Director | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
5 | s6 | TV Show | Midnight Mass | Mike Flanagan | United States | September 24, 2021 | 2021 | TV-MA | 1 Season | TV Dramas, TV Horror, TV Mysteries | The arrival of a charismatic young priest brin... |
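Exploratory data analysis and visualization
Rating analysis
- The TV-MA rating you see on many Netflix TV shows means the show is intended for mature audiences only. Ratings can be assigned by Netflix or, at Netflix's request, by the TV Parental Guidelines (TVPG)
- A TV-MA rating indicates that a TV show contains graphic violence, crude language, graphic sexual scenes, or a combination of these. It is roughly equivalent to the NC-17 and R ratings that the MPAA's Classification and Rating Administration assigns to films
- Netflix also uses a set of ratings for shows aimed at younger audiences. Shows suitable for teenagers are rated TV-14. In the "older kids" category you will find TV-Y7, TV-Y7-FV, and TV-PG, while shows for young children can be rated TV-Y or TV-G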
plt.figure(figsize=(16,6))
sns.scatterplot(x='rating',y='type',data = netflix_overall);
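This shows that TV shows use fewer rating categories than movies, which span the full range. Next, let's look at how many titles fall under each rating.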
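The scatter plot's message can also be checked with a count of distinct ratings per type. This is a minimal sketch on hypothetical toy data; `toy`, `ratings_per_type`, and `only_in_movies` are names introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for netflix_overall with just the two relevant columns.
toy = pd.DataFrame({
    "type":   ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["R", "PG-13", "TV-MA", "TV-MA", "TV-14"],
})

# Number of distinct rating labels used by each type.
ratings_per_type = toy.groupby("type")["rating"].nunique()

# Ratings that appear on movies but never on TV shows.
movie_ratings = set(toy.loc[toy["type"] == "Movie", "rating"])
show_ratings = set(toy.loc[toy["type"] == "TV Show", "rating"])
only_in_movies = movie_ratings - show_ratings

print(ratings_per_type)
print(sorted(only_in_movies))
```

On the real data, the set difference would list exactly which rating labels are movie-only, which the scatter plot shows only visually.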
plt.figure(figsize = (12,8))
sns.countplot(x='rating',data = netflix_overall,hue='type',order = netflix_overall.rating.value_counts().index);
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(x="rating", data=netflix_movies, palette="Set3", order=netflix_movies['rating'].value_counts().index)
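Observations:
Movies:
- The most common rating is "TV-MA", intended for mature audiences only
- The second most common is "TV-14", which marks content that may be unsuitable for children under 14
- The third most common is the very popular "R" rating. An R-rated film is one that the Motion Picture Association of America judges may contain material unsuitable for children under 17; in the MPAA's words, "Under 17 requires accompanying parent or adult guardian"
TV shows:
- TV shows are concentrated in 'TV-MA', 'TV-14', and 'TV-PG', with some in the older-kids ratings. Apart from 'TV-MA', essentially all of them are rated as suitable for minors
Release year analysis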
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=netflix_movies, palette="Set2", order=netflix_movies['release_year'].value_counts().index[0:15])
Observations:
- 2018 is the release year with the most movies
- The 2021 data is incomplete, so that year ranks lower and has fewer titles
plt.figure(figsize = (35,6))
sns.countplot(x='release_year',data = netflix_overall);
As we can see, most movies and TV shows on Netflix were released in the past decade; far fewer were released earlier.
Countries with the most content on Netflix
def split_multicolumns(col_series):
    """One-hot encode a comma-separated column: one boolean column per distinct option."""
    result_df = col_series.to_frame()
    options = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for idx, value in col_series[col_series.notnull()].items():
        for option in value.replace(' ', '').split(','):
            if option not in result_df.columns:
                result_df[option] = False
                options.append(option)
            result_df.loc[idx, option] = True
    return result_df[options]
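cou_totals = split_multicolumns(netflix_overall['country']).sum().sort_values(ascending=False)  # assumed definition; not shown in the original
cou_totals
UnitedStates 3192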
India 962
UnitedKingdom 534
Canada 319
France 303
...
Bermuda 1
Ecuador 1
Armenia 1
Mongolia 1
Montenegro 1
Length: 118, dtype: int64
plt.figure(figsize = (12,6))
plt.xticks(rotation = 75)
plt.title('number of countries')
sns.barplot(x = cou_totals.index[:10], y = cou_totals[:10]);
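Observations:
Here we explore which countries/regions have the most content on Netflix. As the raw dataset shows, a title is often listed under several countries/regions, so we need to split out all the countries of each title before analyzing the data. After splitting, we plot the top 10 to see which countries/regions account for the most titles on Netflix. Unsurprisingly, since Netflix is an American company, the United States stands out. India comes second, somewhat surprisingly, followed by the United Kingdom and Canada. China's position is also intriguing: do those titles come from Taiwan, Hong Kong, or the mainland? That is worth exploring.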
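The country-splitting step described above can also be done without building a one-hot table. The sketch below is an alternative, not the original notebook's method: it uses `str.split` and `explode` on a hypothetical toy Series (`toy`), and it strips surrounding whitespace instead of removing every space, so names such as "United States" stay readable.

```python
import pandas as pd

# Hypothetical stand-in for netflix_overall['country'].
toy = pd.Series([
    "United States",
    "United States, Ghana",
    None,
    "India, United States",
], name="country")

exploded = (
    toy.dropna()        # skip titles with no country listed
       .str.split(",")  # "A, B" -> ["A", " B"]
       .explode()       # one row per (title, country) pair
       .str.strip()     # remove the space left after each comma
)
exploded = exploded[exploded != ""]  # guard against stray leading or trailing commas

country_counts = exploded.value_counts()
print(country_counts.head(10))
```

A title listed under several countries is counted once for each of them, which matches the counting logic of the one-hot approach.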
Number of TV shows added to Netflix over time
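netflix_date = netflix_shows[['date_added']].copy()  # assumed definition; not shown in the original
netflix_date.head()
| | date_added |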
---|---|
1 | September 24, 2021 |
2 | September 24, 2021 |
3 | September 24, 2021 |
4 | September 24, 2021 |
5 | September 24, 2021 |
netflix_date['year'] = netflix_date['date_added'].apply(lambda x : x.split(', ')[-1])
netflix_date['month'] = netflix_date['date_added'].apply(lambda x : x.lstrip().split(' ')[0])
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'][::-1]
df = netflix_date.groupby('year')['month'].value_counts().unstack().fillna(0)[month_order].T
df
year | 2008 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|---|---|---|
month | ||||||||||
December | 0.0 | 0.0 | 1.0 | 7.0 | 44.0 | 38.0 | 61.0 | 47.0 | 68.0 | 0.0 |
November | 0.0 | 0.0 | 2.0 | 2.0 | 18.0 | 30.0 | 36.0 | 68.0 | 51.0 | 0.0 |
October | 0.0 | 2.0 | 0.0 | 4.0 | 19.0 | 29.0 | 45.0 | 65.0 | 51.0 | 0.0 |
September | 0.0 | 1.0 | 0.0 | 1.0 | 19.0 | 32.0 | 43.0 | 37.0 | 53.0 | 65.0 |
August | 0.0 | 1.0 | 0.0 | 0.0 | 11.0 | 38.0 | 34.0 | 44.0 | 47.0 | 61.0 |
July | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 | 34.0 | 27.0 | 59.0 | 43.0 | 88.0 |
June | 0.0 | 0.0 | 0.0 | 2.0 | 7.0 | 29.0 | 28.0 | 46.0 | 41.0 | 83.0 |
May | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 23.0 | 27.0 | 48.0 | 52.0 | 38.0 |
April | 0.0 | 0.0 | 1.0 | 4.0 | 8.0 | 27.0 | 28.0 | 43.0 | 50.0 | 53.0 |
March | 0.0 | 1.0 | 0.0 | 2.0 | 3.0 | 38.0 | 35.0 | 53.0 | 44.0 | 37.0 |
February | 1.0 | 0.0 | 1.0 | 1.0 | 6.0 | 17.0 | 24.0 | 45.0 | 42.0 | 44.0 |
January | 0.0 | 0.0 | 0.0 | 0.0 | 28.0 | 14.0 | 24.0 | 37.0 | 63.0 | 36.0 |
plt.figure(figsize=(10, 7), dpi=200)
plt.pcolor(df, cmap='gist_heat_r', edgecolors='white', linewidths=2) # heatmap
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns, fontsize=7, fontfamily='serif')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index, fontsize=7, fontfamily='serif')
plt.title('Netflix Contents Update', fontsize=12, fontfamily='calibri', fontweight='bold', position=(0.20, 1.0+0.02))
cbar = plt.colorbar()
cbar.ax.tick_params(labelsize=8)
cbar.ax.minorticks_on()
plt.show()
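Observations:
- Netflix adds more titles every year. In 2021 the additions are concentrated in June through September, so January through May may be a better window for scheduling new releases
- October through December 2021 are empty, most likely because the data had not been updated to cover those months
- What about movies? The same method can be applied to them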
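To repeat the analysis for movies, the month-by-year table can be wrapped in a reusable function. This is a sketch under stated assumptions: `monthly_additions` and `MONTHS` are hypothetical names, and the date strings are assumed to look like "September 24, 2021", possibly with stray whitespace. Values that do not parse are dropped.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def monthly_additions(date_added):
    """Count titles per (month, year); rows are months from December down to January."""
    cleaned = date_added.str.strip()
    # Unparseable or missing values become NaT and are dropped.
    dates = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce").dropna()
    parts = pd.DataFrame({"month": dates.dt.month_name(), "year": dates.dt.year})
    table = parts.groupby(["month", "year"]).size().unstack(fill_value=0)
    # Ensure all twelve months appear, in the same reversed order used for the heatmap.
    return table.reindex(MONTHS[::-1], fill_value=0)

# Hypothetical toy input standing in for netflix_movies['date_added'].
toy = pd.Series(["September 24, 2021", " September 1, 2021", "January 5, 2020", None])
table = monthly_additions(toy)
print(table)
```

Passing `netflix_movies['date_added']` in place of `toy` should give a table with the same layout as the one plotted for TV shows.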
Netflix movie duration and TV show season analysis
netflix_movies['duration']
0 90 min
6 91 min
7 125 min
9 104 min
12 127 min
...
8801 96 min
8802 158 min
8804 88 min
8805 88 min
8806 111 min
Name: duration, Length: 6131, dtype: object
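The analysis needs integers, so we strip the extra text such as 'min' and convert the values to integers, which makes sorting and plotting easier.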
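# Keep only rows whose duration is in minutes: the mode-based fill above can put a value like '1 Season' into a movie row
netflix_movies = netflix_movies[netflix_movies['duration'].str.endswith(' min')].copy()
netflix_movies['duration'] = netflix_movies['duration'].str.replace(' min', '', regex=False).astype(int)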
netflix_movies['duration']
0 90
6 91
7 125
9 104
12 127
...
8801 96
8802 158
8804 88
8805 88
8806 111
Name: duration, Length: 6128, dtype: int32
sns.set(style="darkgrid")
sns.kdeplot(data=netflix_movies['duration'], fill=True);  # fill=True replaces the deprecated shade=True
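So a large share of Netflix movies run between 75 and 120 minutes. Three-hour blockbusters like those shown in cinemas appear to be the minority; is that because audiences cannot sit through such long runtimes?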
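The density plot suggests the 75 to 120 minute range, and the share of movies in that range can be computed directly. A minimal sketch on hypothetical values: `minutes` stands in for the integer `netflix_movies['duration']` column.

```python
import pandas as pd

# Hypothetical durations in minutes, standing in for netflix_movies['duration'].
minutes = pd.Series([45, 80, 95, 110, 130, 200], name="duration")

in_range = minutes.between(75, 120)  # boolean mask, inclusive on both ends
share_in_range = in_range.mean()     # mean of booleans = fraction of True values

print(f"{share_in_range:.0%} of titles run 75-120 minutes")
```

The same mask with `minutes >= 180` would quantify how rare three-hour films actually are in the catalogue.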
features = ['title', 'duration']
durations = netflix_shows[features].copy()  # copy so the new column is not written to a slice
# Strip the ' Season' / ' Seasons' suffix and convert the remainder to an integer
durations['no_of_seasons'] = durations['duration'].str.replace(r' Seasons?$', '', regex=True).astype(int)
t=['title','no_of_seasons']
top=durations[t]
top=top.sort_values(by='no_of_seasons', ascending=False)
top
| | title | no_of_seasons |
---|---|---|
548 | Grey's Anatomy | 17 |
2423 | Supernatural | 15 |
4798 | NCIS | 15 |
1354 | Heartland | 13 |
4220 | COMEDIANS of the world | 13 |
... | ... | ... |
3853 | I Have a Script | 1 |
3852 | Abyss | 1 |
3851 | Unchained Fate | 1 |
3850 | The Missing Menu | 1 |
3696 | Record of Grancrest War | 1 |
2676 rows × 2 columns
top20=top[0:20]
plt.figure(figsize = (12,6))
plt.xticks(rotation = 90)
sns.barplot(x='title',y='no_of_seasons', data = top20);
So NCIS, Grey's Anatomy, and Supernatural are among the TV shows with the most seasons.
plt.figure(figsize = (20,6))
plt.xticks(rotation = 90)
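no_of_duration = netflix_overall['duration'].value_counts()  # assumed definition; not shown in the original
sns.barplot(x = no_of_duration.index, y = no_of_duration.values, order = netflix_overall['duration'].value_counts().index[:100]);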
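We can infer that single-season TV shows are by far the most common, well ahead of any other duration among TV shows or movies; perhaps that is the optimal duration.
The oldest movies and TV shows on Netflix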
old = netflix_movies.sort_values("release_year", ascending = True)
old = old[old['duration'] != ""]
old[['title', "release_year"]][:15]
| | title | release_year |
---|---|---|
8205 | The Battle of Midway | 1942 |
7790 | Prelude to War | 1942 |
8763 | WWII: Report from the Aleutians | 1943 |
8739 | Why We Fight: The Battle of Russia | 1943 |
8660 | Undercover: How to Operate Behind Enemy Lines | 1943 |
8419 | The Memphis Belle: A Story of a\nFlying Fortress | 1944 |
8436 | The Negro Soldier | 1944 |
8640 | Tunisian Victory | 1944 |
7219 | Know Your Enemy - Japan | 1945 |
7575 | Nazi Concentration Camps | 1945 |
7930 | San Pietro | 1945 |
7294 | Let There Be Light | 1946 |
8587 | Thunderbolt | 1947 |
1699 | White Christmas | 1954 |
2375 | The Blazing Sun | 1954 |
old = netflix_shows.sort_values("release_year", ascending = True) # oldest TV shows available on Netflix
old = old[old['duration'] != ""]
old[['title', "release_year"]][:15]
| | title | release_year |
---|---|---|
4250 | Pioneers: First Women Filmmakers* | 1925 |
1331 | Five Came Back: The Reference Films | 1945 |
7743 | Pioneers of African-American Cinema | 1946 |
8541 | The Twilight Zone (Original Series) | 1963 |
8189 | The Andy Griffith Show | 1967 |
4550 | Monty Python's Fliegender Zirkus | 1972 |
4551 | Monty Python's Flying Circus | 1974 |
6549 | Dad's Army | 1977 |
6674 | El Chavo | 1979 |
7588 | Ninja Hattori | 1981 |
7878 | Robotech | 1985 |
2740 | Saint Seiya | 1986 |
7993 | Shaka Zulu | 1986 |
5299 | High Risk | 1988 |
6970 | Highway to Heaven | 1988 |
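The oldest movie on Netflix dates from 1942, while the oldest TV show is even earlier, from 1925. Next we can compare how TV shows and movies have grown over time.
Year-over-year growth of movies and TV shows on Netflix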
netflix_overall['year_added'] = pd.DatetimeIndex(netflix_overall['date_added']).year
netflix_movies['year_added'] = pd.DatetimeIndex(netflix_movies['date_added']).year
netflix_shows['year_added'] = pd.DatetimeIndex(netflix_shows['date_added']).year
netflix_overall['month_added'] = pd.DatetimeIndex(netflix_overall['date_added']).month
netflix_movies['month_added'] = pd.DatetimeIndex(netflix_movies['date_added']).month
netflix_shows['month_added'] = pd.DatetimeIndex(netflix_shows['date_added']).month
# rename_axis + reset_index(name=...) yields 'year' and 'count' columns regardless of how value_counts names its result
netflix_year = netflix_overall['year_added'].value_counts().rename_axis('year').reset_index(name='count')
netflix_year = netflix_year[netflix_year.year != 2021]
netflix_year
| | year | count |
---|---|---|
0 | 2019 | 2016 |
1 | 2020 | 1889 |
2 | 2018 | 1649 |
4 | 2017 | 1188 |
5 | 2016 | 429 |
6 | 2015 | 82 |
7 | 2014 | 24 |
8 | 2011 | 13 |
9 | 2013 | 11 |
10 | 2012 | 3 |
11 | 2009 | 2 |
12 | 2008 | 2 |
13 | 2010 | 1 |
netflix_year2 = netflix_overall[['type','year_added']]
movie_year = netflix_year2[netflix_year2['type']=='Movie'].year_added.value_counts().rename_axis('year').reset_index(name='count')
movie_year = movie_year[movie_year.year != 2021]
show_year = netflix_year2[netflix_year2['type']=='TV Show'].year_added.value_counts().rename_axis('year').reset_index(name='count')
show_year = show_year[show_year.year != 2021]
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data = netflix_year, x = 'year', y = 'count')
sns.lineplot(data = movie_year, x = 'year', y = 'count')
sns.lineplot(data = show_year, x = 'year', y = 'count')
ax.set_xticks(np.arange(2008, 2021, 1))
plt.title("Total content added each year (up to 2020)")
plt.legend(['Total','Movie','TV Show'])
plt.ylabel("Releases")
plt.xlabel("Year")
plt.show()
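From the timeline above, we can see that the streaming platform started gaining traction after 2014, and growth was rapid from then on. After peaking in 2019 (2,016 titles), however, additions fell in 2020 (1,889). Why?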
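The size of that change can be quantified from the yearly totals in the table above. The sketch below copies the 2014 to 2020 counts from that table into a Series (`added`, a name introduced here) and computes the year-over-year difference and percentage change.

```python
import pandas as pd

# Titles added per year, copied from the yearly totals table in this article.
added = pd.Series(
    {2014: 24, 2015: 82, 2016: 429, 2017: 1188, 2018: 1649, 2019: 2016, 2020: 1889},
    name="count",
)

change = added.diff()                  # absolute change versus the previous year
pct_change = added.pct_change() * 100  # relative change in percent

growth = pd.DataFrame({"count": added, "change": change, "pct_change": pct_change.round(1)})
print(growth)
```

On these numbers, 2020 is the only year in the range with a negative change: 127 fewer titles than 2019, a drop of roughly 6.3%.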
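Next steps
- Whether Netflix has focused more on TV shows than on movies in recent years
- Answering the questions raised in this article
- Identifying similar content by matching text-based features, i.e. a recommender system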
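As a rough illustration of the last item, similarity between titles can be approximated by the overlap of their `listed_in` genres. This is a deliberately simplified stand-in, not a full recommender: it uses Jaccard similarity on genre sets rather than any model of the description text, and the helper names (`genre_set`, `jaccard`, `most_similar`) are hypothetical. The three catalogue entries are taken from the sample rows shown earlier.

```python
def genre_set(listed_in):
    """Split a comma-separated genre string into a set of trimmed genre names."""
    return {g.strip() for g in listed_in.split(",") if g.strip()}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(title, catalog):
    """Return the closest other title by genre overlap, plus all similarity scores."""
    target = genre_set(catalog[title])
    scores = {
        other: jaccard(target, genre_set(genres))
        for other, genres in catalog.items()
        if other != title
    }
    return max(scores, key=scores.get), scores

# title -> listed_in, taken from rows shown earlier in this article.
catalog = {
    "Blood & Water": "International TV Shows, TV Dramas, TV Mysteries",
    "Midnight Mass": "TV Dramas, TV Horror, TV Mysteries",
    "Jailbirds New Orleans": "Docuseries, Reality TV",
}

best, scores = most_similar("Blood & Water", catalog)
print(best, scores)
```

Genre overlap is coarse because many titles share the same few labels; matching on the `description` text would give finer-grained results at the cost of a heavier pipeline.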