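Netflix Movies and TV Shows: Exploratory Data Analysis

Contents: Background · Importing the required packages · Reading and preprocessing the data · Exploratory analysis and visualization · Future work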

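Background

Netflix is one of the most popular media and video streaming platforms. Its catalog holds more than 8,000 movies and TV shows, and as of mid-2021 it had more than 200 million subscribers worldwide. This tabular dataset lists every movie and TV show on Netflix, with details such as cast, director, rating, release year, and duration.

This article is a simple exploratory data analysis of the dataset; follow-up work will build on it toward a deeper understanding of Netflix.

Importing the required packages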

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
           

Reading and preprocessing the data

netflix_overall = pd.read_csv('./netflix-shows/netflix_titles.csv')
netflix_overall.head()
           
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States September 25, 2021 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans NaN NaN NaN September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...
netflix_overall.shape
           
(8807, 12)
netflix_overall.info()
           
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
netflix_overall.nunique()
           
show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64
           

The type column takes only two values, so let's start the analysis there.

# plt.rcParams['figure.dpi'] = 200
# plt.rcParams['figure.figsize'] = [6, 3.0]
sns.set(style="darkgrid")
ax = sns.countplot(x="type", data=netflix_overall, palette="Set3")
           
plt.figure(figsize=(12,6))
plt.title('netflix type')
plt.pie(netflix_overall.type.value_counts(), labels=netflix_overall.type.value_counts().index, autopct='%1.1f%%', startangle=180);
           

Observations:

  1. Netflix's catalog is still dominated by movies
  2. Movies make up nearly 70% of the titles: 6,131 of the 8,807 entries
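
The share behind a pie chart like this can also be read off as numbers. The following is a minimal sketch on a tiny made-up frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the original notebook); the same two `value_counts` calls should apply to the `type` column of the real data.

```python
import pandas as pd

# Toy stand-in for the `type` column; the values are made up for illustration.
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

counts = toy["type"].value_counts()                 # absolute counts per type
shares = toy["type"].value_counts(normalize=True)   # fractions that sum to 1

# Put counts and percentages side by side.
summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(1)})
print(summary)
```

Printing the numbers next to the chart makes it easier to quote an exact share instead of estimating it from the plot.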

Missing-value analysis

netflix_overall.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
           
total = netflix_overall.isnull().sum().sort_values(ascending = False)
percent = (netflix_overall.isnull().sum()/netflix_overall.isnull().count()*100).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(7)
           
            Total    Percent
director     2634  29.908028
country       831   9.435676
cast          825   9.367549
date_added     10   0.113546
rating          4   0.045418
duration        3   0.034064
show_id         0   0.000000
plt.figure(figsize=(12,6))
plt.title('Percentage of missing values')
plt.pie(missing_data.Total[:4], labels=missing_data.Total.index[:4], autopct='%1.2f%%', startangle=180);
           

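The chart shows how the missing values are distributed across columns. Missing directors and cast members cannot be filled in arbitrarily, so dropping them is one option; columns with only a few missing values can be filled with the mode (or, for numeric columns, the median).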

2446
           
2676
           
188
           
6131
           

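Clearly, simply dropping rows with nan values would wreck the 'TV Show' data: 2,446 of its 2,676 rows have no director (for movies it is only 188 of 6,131). Rows are therefore not dropped. Instead, the cast column is removed and left out of the analysis, and missing directors are filled with a placeholder value ('No Director' in the code that follows).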
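
The notebook cell that produced the four counts is not shown in the source. As a hedged sketch of how per-type missing counts of this kind could be computed, the example below uses a tiny made-up frame; the `toy` data and the `per_type` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: some titles have no director.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "TV Show"],
    "director": ["A", np.nan, np.nan, np.nan, "B"],
})

# Flag missing directors, then count missing and total rows per type.
per_type = (
    toy.assign(missing=toy["director"].isna())
       .groupby("type")["missing"]
       .agg(missing="sum", total="count")
)
per_type["percent_missing"] = (per_type["missing"] / per_type["total"] * 100).round(1)
print(per_type)
```

Looking at the missing rate per type rather than overall is what reveals that dropping rows would hit one type much harder than the other.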

# Drop the cast column (not analyzed further), then fill the remaining gaps
netflix_overall = netflix_overall.drop(columns=['cast'])
# Columns with few missing values: fill with the most frequent value
for col in ['country', 'date_added', 'rating', 'duration']:
    netflix_overall[col] = netflix_overall[col].fillna(netflix_overall[col].mode()[0])
# Assign the result back instead of calling fillna(inplace=True) on a column selection
netflix_overall['director'] = netflix_overall['director'].fillna('No Director')
netflix_overall.isnull().any()
           
show_id         False
type            False
title           False
director        False
country         False
date_added      False
release_year    False
rating          False
duration        False
listed_in       False
description     False
dtype: bool
           

Checking for duplicates
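
The source shows no code for this step. A minimal sketch of one way to check, assuming duplicates are judged either on whole rows or on the pair (`title`, `type`); the `toy` frame and the variable names are made up for illustration. For the real dataset, the unique counts printed earlier (8,807 distinct `show_id` and `title` values for 8,807 rows) already suggest there are no duplicates by those keys.

```python
import pandas as pd

# Made-up rows: "Alpha" appears twice with different show_id values.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "title": ["Alpha", "Beta", "Alpha", "Gamma"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
})

# Rows that are identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows that share title and type, even if show_id differs (keep=False marks all copies).
same_title_type = toy[toy.duplicated(subset=["title", "type"], keep=False)]

# Keep the first occurrence of each (title, type) pair.
deduped = toy.drop_duplicates(subset=["title", "type"], keep="first")

print(full_row_dupes, len(same_title_type), len(deduped))
```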

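Splitting the dataset

Since the dataset contains only movies and TV shows, it helps to have a separate dataset for each, so that Netflix movies and Netflix TV shows can be studied on their own. We therefore create two new datasets: one for movies and one for TV shows.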

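# .copy() gives each subset its own data, so later column assignments do not act on a view of netflix_overall
netflix_movies = netflix_overall[netflix_overall['type'] == 'Movie'].copy()
netflix_shows = netflix_overall[netflix_overall['type'] == 'TV Show'].copy()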
           
display(netflix_movies.head())
netflix_shows.head()
           
show_id type title director country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson United States September 25, 2021 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
6 s7 Movie My Little Pony: A New Generation Robert Cullen, José Luis Ucha United States September 24, 2021 2021 PG 91 min Children & Family Movies Equestria's divided. But a bright-eyed hero be...
7 s8 Movie Sankofa Haile Gerima United States, Ghana, Burkina Faso, United Kin... September 24, 2021 1993 TV-MA 125 min Dramas, Independent Movies, International Movies On a photo shoot in Ghana, an American model s...
9 s10 Movie The Starling Theodore Melfi United States September 24, 2021 2021 PG-13 104 min Comedies, Dramas A woman adjusting to life after a loss contend...
12 s13 Movie Je Suis Karl Christian Schwochow Germany, Czech Republic September 23, 2021 2021 TV-MA 127 min Dramas, International Movies After most of her family is murdered in a terr...
show_id type title director country date_added release_year rating duration listed_in description
1 s2 TV Show Blood & Water No Director South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq United States September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans No Director United States September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory No Director India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...
5 s6 TV Show Midnight Mass Mike Flanagan United States September 24, 2021 2021 TV-MA 1 Season TV Dramas, TV Horror, TV Mysteries The arrival of a charismatic young priest brin...

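Exploratory analysis and visualization

Rating analysis

  • The TV-MA rating you see on many Netflix TV shows means the program is intended for mature audiences only. Ratings may be assigned by Netflix or, at Netflix's request, under the TV Parental Guidelines (TVPG)
  • A TV-MA rating indicates that a program contains graphic violence, crude language, explicit sexual scenes, or some combination of these. It is roughly comparable to the NC-17 and R ratings that the MPAA's Classification and Rating Administration assigns to films
  • Netflix also uses a set of ratings for programs aimed at younger viewers. Programs suitable for teenagers are rated TV-14. In the "older kids" category you will find TV-Y7, TV-Y7-FV, and TV-PG, while programs for young children carry TV-Y or TV-G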
plt.figure(figsize=(16,6))
sns.scatterplot(x='rating',y='type',data = netflix_overall);
           

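The plot shows that TV shows use fewer rating categories than movies, which cover almost the whole range. Next, let's look at how many titles fall under each rating.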

plt.figure(figsize = (12,8))
sns.countplot(x='rating',data = netflix_overall,hue='type',order = netflix_overall.rating.value_counts().index);
           
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(x="rating", data=netflix_movies, palette="Set3", order=netflix_movies['rating'].value_counts().index)
           

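Observations:

Movies:

  • The most common rating is "TV-MA", intended for mature audiences only
  • The second most common is "TV-14", for content that may be unsuitable for children under 14
  • The third is the very common "R" rating. An R-rated film is one that the Motion Picture Association of America judges to contain material that may be unsuitable for children under 17; its definition reads: "Under 17 requires accompanying parent or adult guardian"

TV shows:

  • TV shows are concentrated in 'TV-MA', 'TV-14', and 'TV-PG', plus a few older-kids ratings; apart from 'TV-MA', every rating in use is meant to be suitable for at least some viewers under 18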
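
The counts behind a grouped count plot can be tabulated directly. This is a minimal sketch using `pd.crosstab` on made-up rows; the `toy` frame and the `rating_by_type` name are illustrative assumptions, and on the real data the two arguments would be the `rating` and `type` columns.

```python
import pandas as pd

# Made-up titles with a type and a rating each.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["TV-MA", "R", "TV-MA", "TV-14", "TV-MA"],
})

# Rows are ratings, columns are types, cells are title counts.
rating_by_type = pd.crosstab(toy["rating"], toy["type"])

# Order ratings by their total count, most common first.
order = rating_by_type.sum(axis=1).sort_values(ascending=False).index
rating_by_type = rating_by_type.loc[order]
print(rating_by_type)
```

A rating that is used by only one type shows up as a zero in the other column, which is harder to see in a bar chart.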

Release year analysis

plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=netflix_movies, palette="Set2", order=netflix_movies['release_year'].value_counts().index[0:15])
           

Observations:

  1. 2018 is the release year with the most movies
  2. The data for 2021 is incomplete, so that year ranks lower and its count is small
plt.figure(figsize = (35,6))
sns.countplot(x='release_year',data = netflix_overall);
           

As we can see, most movies and TV shows on Netflix were released in the past decade; far fewer come from earlier years.

Countries with the most titles on Netflix

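def split_multicolumns(col_series):
    """Expand a comma-separated column into one boolean column per option."""
    result_df = col_series.to_frame()
    options = []
    # Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
    for idx, value in col_series[col_series.notnull()].items():
        # All spaces are stripped, so 'United States' becomes the column 'UnitedStates'
        for option in value.replace(' ', '').split(','):
            if option not in result_df.columns:
                result_df[option] = False
                options.append(option)
            result_df.loc[idx, option] = True
    return result_df[options]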
           
UnitedStates     3192
India             962
UnitedKingdom     534
Canada            319
France            303
                 ... 
Bermuda             1
Ecuador             1
Armenia             1
Mongolia            1
Montenegro          1
Length: 118, dtype: int64
           
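# cou_totals is the per-country count Series printed above; the cell that defines it is missing from the source.
# One plausible definition (an assumption): split_multicolumns(<country column>).sum().sort_values(ascending=False)
plt.figure(figsize = (12,6))
plt.xticks(rotation = 75)
plt.title('Number of titles per country (top 10)')
sns.barplot(x = cou_totals.index[:10], y = cou_totals[:10]);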
           

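Observations:

Here we explore which countries/regions have the most content on Netflix. As the raw data shows, a single title is often associated with several countries/regions, so the countries in each entry have to be separated before counting. After splitting, we plot the top 10 to see which countries/regions have the most titles on Netflix. Unsurprisingly, since Netflix is an American company, the United States stands out. India is a surprising second, followed by the United Kingdom and Canada. China's position is also intriguing: do those titles come from Taiwan, Hong Kong, or mainland China? That is worth exploring.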
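
The splitting step described here can also be written without an explicit loop. The following is a hedged alternative sketch, not the author's function: it uses `str.split` and `explode` on a made-up Series (`toy_country` and `country_counts` are illustrative names) and, unlike the version that strips every space, it keeps names such as "United States" intact.

```python
import pandas as pd

# Made-up country entries, including a missing value and a stray leading comma.
toy_country = pd.Series([
    "United States",
    "United States, Ghana",
    "India",
    None,
    "United Kingdom, United States",
    ", South Korea",
], name="country")

# One row per (title, country) pair, with surrounding whitespace removed.
parts = toy_country.dropna().str.split(",").explode().str.strip()

# Drop empty strings left by stray commas, then count titles per country.
country_counts = parts[parts != ""].value_counts()
print(country_counts)
```

Because each title contributes one row per listed country, a co-production is counted once for every country it names.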

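Number of TV shows added to Netflix over time

# Reconstructed cell (an assumption; the original definition of netflix_date is missing from the source)
netflix_date = netflix_shows[['date_added']].copy()
netflix_date.head()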

date_added
1 September 24, 2021
2 September 24, 2021
3 September 24, 2021
4 September 24, 2021
5 September 24, 2021
netflix_date['year'] = netflix_date['date_added'].apply(lambda x : x.split(', ')[-1])
netflix_date['month'] = netflix_date['date_added'].apply(lambda x : x.lstrip().split(' ')[0])
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'][::-1]
df = netflix_date.groupby('year')['month'].value_counts().unstack().fillna(0)[month_order].T
           
df
           
year 2008 2013 2014 2015 2016 2017 2018 2019 2020 2021
month
December 0.0 0.0 1.0 7.0 44.0 38.0 61.0 47.0 68.0 0.0
November 0.0 0.0 2.0 2.0 18.0 30.0 36.0 68.0 51.0 0.0
October 0.0 2.0 0.0 4.0 19.0 29.0 45.0 65.0 51.0 0.0
September 0.0 1.0 0.0 1.0 19.0 32.0 43.0 37.0 53.0 65.0
August 0.0 1.0 0.0 0.0 11.0 38.0 34.0 44.0 47.0 61.0
July 0.0 0.0 0.0 2.0 9.0 34.0 27.0 59.0 43.0 88.0
June 0.0 0.0 0.0 2.0 7.0 29.0 28.0 46.0 41.0 83.0
May 0.0 0.0 0.0 1.0 4.0 23.0 27.0 48.0 52.0 38.0
April 0.0 0.0 1.0 4.0 8.0 27.0 28.0 43.0 50.0 53.0
March 0.0 1.0 0.0 2.0 3.0 38.0 35.0 53.0 44.0 37.0
February 1.0 0.0 1.0 1.0 6.0 17.0 24.0 45.0 42.0 44.0
January 0.0 0.0 0.0 0.0 28.0 14.0 24.0 37.0 63.0 36.0
plt.figure(figsize=(10, 7), dpi=200)
plt.pcolor(df, cmap='gist_heat_r', edgecolors='white', linewidths=2) # heatmap
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns, fontsize=7, fontfamily='serif')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index, fontsize=7, fontfamily='serif')

plt.title('Netflix Contents Update', fontsize=12, fontfamily='calibri', fontweight='bold', position=(0.20, 1.0+0.02))
cbar = plt.colorbar()

cbar.ax.tick_params(labelsize=8) 
cbar.ax.minorticks_on()
plt.show()
           

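Observations:

  1. Netflix adds more shows every year. In 2021 the additions cluster in June through September, so January through May may be a less crowded window for scheduling new releases
  2. October through December 2021 are empty, most likely because the dataset was collected before those months (the most recent date_added values in the sample rows are from late September 2021)
  3. What about movies? The same method can be applied to them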
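
To repeat the month-by-year count for movies, the steps can be wrapped in a small function. This is a sketch under assumptions: `additions_by_month` is a hypothetical helper name, the date strings are assumed to follow the "September 24, 2021" pattern seen in the sample rows, and the input here is a made-up Series rather than the real `date_added` column.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def additions_by_month(date_added: pd.Series) -> pd.DataFrame:
    """Return a month-by-year table counting how many titles were added."""
    # Strip stray whitespace before parsing dates such as "September 24, 2021".
    parsed = pd.to_datetime(date_added.str.strip(), format="%B %d, %Y")
    months = parsed.dt.month_name().rename("month")
    years = parsed.dt.year.rename("year")
    table = pd.crosstab(months, years)
    # December first, January last; months with no additions become rows of zeros.
    return table.reindex(MONTHS[::-1], fill_value=0)

# Made-up dates; one has a leading space on purpose.
toy_dates = pd.Series([" August 4, 2017", "September 25, 2021",
                       "September 24, 2021", "January 1, 2020"])
movie_table = additions_by_month(toy_dates)
print(movie_table)
```

Calling the function once on the movie dates and once on the TV show dates would give two tables with the same layout, which can then be plotted side by side.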

Movie durations and TV show season counts on Netflix

0        90 min
6        91 min
7       125 min
9       104 min
12      127 min
         ...   
8801     96 min
8802    158 min
8804     88 min
8805     88 min
8806    111 min
Name: duration, Length: 6131, dtype: object
           

The analysis needs integers, so we strip the ' min' suffix and convert the values to integers, which makes sorting and plotting easier.

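# Strip the ' min' suffix; anything that is not a plain minute count becomes NaN
minutes = pd.to_numeric(netflix_movies['duration'].str.replace(' min', '', regex=False), errors='coerce')
# Keep only movies with a valid minute count (for example, a season count introduced by the mode fill would not convert)
netflix_movies = netflix_movies[minutes.notna()].copy()
netflix_movies['duration'] = minutes[minutes.notna()].astype(int)
netflix_movies['duration']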
           
0        90
6        91
7       125
9       104
12      127
       ... 
8801     96
8802    158
8804     88
8805     88
8806    111
Name: duration, Length: 6128, dtype: int32
           
sns.set(style="darkgrid")
# fill=True replaces the deprecated shade=True
sns.kdeplot(data=netflix_movies['duration'], fill=True);
           

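So most movies on Netflix run between 75 and 120 minutes, and the three-hour blockbusters seen in theaters appear to be a minority. Is that because viewers find such a long running time hard to sit through?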
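
A density plot suggests the 75 to 120 minute range, and the share of titles inside it can be checked numerically. This is a minimal sketch on made-up minute counts (`toy_minutes` and `share` are illustrative names); on the real data the Series would be the integer `duration` column of the movies frame.

```python
import pandas as pd

# Made-up movie lengths in minutes.
toy_minutes = pd.Series([45, 88, 90, 104, 127, 158, 96, 111])

# between() is inclusive on both ends by default.
in_range = toy_minutes.between(75, 120)
share = in_range.mean()          # fraction of True values

# Quartiles give a plot-free summary of the same distribution.
quartiles = toy_minutes.quantile([0.25, 0.5, 0.75])
print(f"{share:.1%} of titles run 75 to 120 minutes")
print(quartiles)
```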

features=['title','duration']
durations = netflix_shows[features].copy()

# '2 Seasons' -> '2': remove ' Season', then the trailing 's'
durations['no_of_seasons'] = durations['duration'].str.replace(' Season', '', regex=False).str.replace('s', '', regex=False)
durations['no_of_seasons'] = durations['no_of_seasons'].astype(int)
           
t=['title','no_of_seasons']

top=durations[t]

top=top.sort_values(by='no_of_seasons', ascending=False)
           
top
           
title no_of_seasons
548 Grey's Anatomy 17
2423 Supernatural 15
4798 NCIS 15
1354 Heartland 13
4220 COMEDIANS of the world 13
... ... ...
3853 I Have a Script 1
3852 Abyss 1
3851 Unchained Fate 1
3850 The Missing Menu 1
3696 Record of Grancrest War 1

2676 rows × 2 columns

top20=top[0:20]
plt.figure(figsize = (12,6))
plt.xticks(rotation = 90)
sns.barplot(x='title',y='no_of_seasons', data = top20);
           

So Grey's Anatomy, Supernatural, and NCIS are among the TV shows with the most seasons.

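# no_of_duration is not defined in the source; assumed here to be the value counts of the duration column
no_of_duration = netflix_overall['duration'].value_counts()
plt.figure(figsize = (20,6))
plt.xticks(rotation = 90)
sns.barplot(x = no_of_duration.index, y = no_of_duration.values, order = no_of_duration.index[:100]);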
           

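We can infer that single-season TV shows are by far the most numerous, well ahead of any other duration value among TV shows and movies; perhaps this is the optimal duration.

The oldest movies and TV shows on Netflix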

old = netflix_movies.sort_values("release_year", ascending = True) 
old = old[old['duration'] != ""]
old[['title', "release_year"]][:15]
           
title release_year
8205 The Battle of Midway 1942
7790 Prelude to War 1942
8763 WWII: Report from the Aleutians 1943
8739 Why We Fight: The Battle of Russia 1943
8660 Undercover: How to Operate Behind Enemy Lines 1943
8419 The Memphis Belle: A Story of a\nFlying Fortress 1944
8436 The Negro Soldier 1944
8640 Tunisian Victory 1944
7219 Know Your Enemy - Japan 1945
7575 Nazi Concentration Camps 1945
7930 San Pietro 1945
7294 Let There Be Light 1946
8587 Thunderbolt 1947
1699 White Christmas 1954
2375 The Blazing Sun 1954
old = netflix_shows.sort_values("release_year", ascending = True) # oldest TV shows available on Netflix
old = old[old['duration'] != ""]
old[['title', "release_year"]][:15]
           
title release_year
4250 Pioneers: First Women Filmmakers* 1925
1331 Five Came Back: The Reference Films 1945
7743 Pioneers of African-American Cinema 1946
8541 The Twilight Zone (Original Series) 1963
8189 The Andy Griffith Show 1967
4550 Monty Python's Fliegender Zirkus 1972
4551 Monty Python's Flying Circus 1974
6549 Dad's Army 1977
6674 El Chavo 1979
7588 Ninja Hattori 1981
7878 Robotech 1985
2740 Saint Seiya 1986
7993 Shaka Zulu 1986
5299 High Risk 1988
6970 Highway to Heaven 1988

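The oldest movie on Netflix dates from 1942, while the oldest TV show is even earlier, from 1925. Next, let's compare how movies and TV shows have grown over time.

Yearly growth of movies and TV shows on Netflix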

# Parse date_added once per frame; strip() removes stray leading spaces before parsing
for frame in (netflix_overall, netflix_movies, netflix_shows):
    added = pd.to_datetime(frame['date_added'].str.strip())
    frame['year_added'] = added.dt.year
    frame['month_added'] = added.dt.month
           
# rename_axis + reset_index(name=...) yields the columns 'year' and 'count' on both pandas 1.x and 2.x
netflix_year = netflix_overall['year_added'].value_counts().rename_axis('year').reset_index(name='count')
netflix_year = netflix_year[netflix_year.year != 2021]
netflix_year
           
year count
0 2019 2016
1 2020 1889
2 2018 1649
4 2017 1188
5 2016 429
6 2015 82
7 2014 24
8 2011 13
9 2013 11
10 2012 3
11 2009 2
12 2008 2
13 2010 1
netflix_year2 = netflix_overall[['type','year_added']]
# Same version-independent pattern as above: columns 'year' and 'count'
movie_year = netflix_year2[netflix_year2['type']=='Movie'].year_added.value_counts().rename_axis('year').reset_index(name='count')
movie_year = movie_year[movie_year.year != 2021]
show_year = netflix_year2[netflix_year2['type']=='TV Show'].year_added.value_counts().rename_axis('year').reset_index(name='count')
show_year = show_year[show_year.year != 2021]
           
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data = netflix_year, x = 'year', y = 'count')
sns.lineplot(data = movie_year, x = 'year', y = 'count')
sns.lineplot(data = show_year, x = 'year', y = 'count')
ax.set_xticks(np.arange(2008, 2021, 1))
plt.title("Total content added each year (up to 2020)")
plt.legend(['Total','Movie','TV Show'])
plt.ylabel("Releases")
plt.xlabel("Year")
plt.show()
           

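The timeline above shows the streaming platform gaining traction after 2014 and growing rapidly from then on. After the 2019 peak, however, the number of added titles dropped in 2020 (from 2,016 to 1,889). Why?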
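
The size of each yearly change can be computed from the counts in the table above. The sketch below copies the 2016 to 2020 totals from that table into a small Series (the names `added`, `change`, and `pct` are illustrative) and uses `diff` and `pct_change`.

```python
import pandas as pd

# Titles added per year, copied from the yearly count table (2016 to 2020).
added = pd.Series({2016: 429, 2017: 1188, 2018: 1649, 2019: 2016, 2020: 1889},
                  name="count")

change = added.diff()                      # absolute change from the previous year
pct = (added.pct_change() * 100).round(1)  # relative change in percent

growth = pd.DataFrame({"count": added, "change": change, "pct_change": pct})
print(growth)
```

Seen this way, 2020 is the first year in the series with a negative change, while the largest relative jump is from 2016 to 2017.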

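Future work

  1. Has Netflix focused more on TV shows than on movies in recent years?
  2. Answering the questions raised in this article
  3. Identifying similar content by matching text-based features, i.e. a recommender system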
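
As a rough illustration of the third item, here is a minimal sketch of text-based similarity using TF-IDF weights and cosine similarity, written with the standard library only. Everything in it is an assumption for illustration: the titles and descriptions are made up, the whitespace tokenizer is deliberately naive, and a real recommender over the `description` column would likely use a dedicated library and better preprocessing.

```python
import math
from collections import Counter

# Made-up titles and descriptions, for illustration only.
descriptions = {
    "Title A": "a young priest arrives in a small island town",
    "Title B": "a priest and a detective investigate a small town mystery",
    "Title C": "students prepare for engineering exams in a coaching city",
}

def tfidf_vectors(texts):
    """Map each title to a {term: tf-idf weight} dict (smoothed idf)."""
    tokens = {title: text.lower().split() for title, text in texts.items()}
    n_docs = len(tokens)
    # Document frequency: in how many descriptions each term occurs.
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for title, words in tokens.items():
        term_freq = Counter(words)
        vectors[title] = {
            term: (count / len(words))
                  * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(title, vectors):
    """Rank all other titles by similarity to the given one, best first."""
    scores = [(other, cosine(vectors[title], vec))
              for other, vec in vectors.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

vectors = tfidf_vectors(descriptions)
ranking = most_similar("Title A", vectors)
print(ranking)
```

Titles whose descriptions share rarer words score higher than titles that only share common words, which is the basic idea behind content-based recommendations.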