A fun little data visualization example: analyzing and visualizing Taylor Swift's song lyrics

Original article

Data Visualization and Analysis of Taylor Swift’s Song Lyrics

English study time

Taylor Swift

She is the youngest person to single-handedly write and perform a number-one song on the Hot Country Songs chart published by Billboard magazine in the United States.
Apart from that she is also the recipient of 10 Grammys, one Emmy Award, 23 Billboard Music Awards, and 10 Country Music Association Awards.
Vocabulary: song lyrics = 歌詞 (the words of a song)

Dataset

Lyrics of 96 songs from 6 Taylor Swift albums

7 columns

artist name (artist)
album name
track title
track number within the album
lyric (one lyric line per row)
line number of the lyric within the song
year of release of the album

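Main analyses

Exploratory data analysis

Word count of each song and each album
Change in word count over the years
Frequency distribution of word counts

Text mining

Word cloud
Bigram network (I don't fully understand what this means yet)
Sentiment analysis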
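
For context, a bigram is a pair of adjacent words, and a bigram network draws words as nodes and connects the pairs that occur next to each other, usually weighting each connection by how often the pair appears. The counting step can be sketched in base R as below. The input sentence is made up for illustration and is not taken from the dataset, and the variable names are arbitrary.

```r
# Toy input (hypothetical text, not taken from the lyrics dataset)
line <- "the rain came down and the rain came back"

# Split into lowercase words on whitespace
words <- strsplit(tolower(line), "\\s+")[[1]]

# Pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))

# Count each pair; in a bigram network these counts become edge weights
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

A sentence of 9 words yields 8 bigrams, and the pairs "the rain" and "rain came" each appear twice here.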

The tool used is R

Exploratory data analysis

I came across a new function, from the stringr package:

str_count()

An example from the help documentation:

library(stringr)
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")
# Output:
# [1] 1 3 1 1

It counts how many times a given pattern matches in each string

For example:

str_count("A B C","\\S+")

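This returns 3. \S+ is a regular expression (something I have not mastered yet) that matches each run of one or more non-whitespace characters, so the call counts the whitespace-separated words in "A B C" rather than the individual characters.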
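
The difference between counting characters and counting words only shows up on longer input, so here is a small sketch that compares the two patterns. The test string is made up for illustration.

```r
library(stringr)

x <- "hello  big world"   # note the double space between the first two words

# "\\S" matches one non-whitespace character at a time
n_chars <- str_count(x, "\\S")

# "\\S+" matches a whole run of non-whitespace characters, i.e. one word
n_words <- str_count(x, "\\S+")

print(c(chars = n_chars, words = n_words))
```

Because \S+ consumes a whole run at once, extra spaces between words do not inflate the word count.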

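Read in the data

# stringsAsFactors = FALSE keeps the lyrics as character strings (already the default since R 4.0)
lyrics<-read.csv("taylor_swift_lyrics_1.csv",header=TRUE,stringsAsFactors=FALSE)
head(lyrics)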

Compute the word count of each lyric line

library(stringr)
# Count the whitespace-separated words in each lyric line
lyrics$length<-str_count(lyrics$lyric,"\\S+")
head(lyrics)

Compute the total word count of each song

library(dplyr)
length_df<-lyrics%>%
  group_by(track_title)%>%
  summarise(length=sum(length))
head(length_df)
dim(length_df)
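
To make the aggregation step concrete, here is the same group_by() and summarise() pattern on a tiny made-up table. The data frame `toy` and its values are hypothetical and only mirror the two columns used above.

```r
library(dplyr)

# Hypothetical miniature of the lyrics table: one row per lyric line
toy <- data.frame(
  track_title = c("Song A", "Song A", "Song B", "Song B", "Song B"),
  length      = c(5, 7, 3, 4, 6)
)

# One row per song, with the per-line word counts added up
toy_totals <- toy %>%
  group_by(track_title) %>%
  summarise(length = sum(length))

print(toy_totals)
```

Five lyric lines collapse into two rows, one per song, which is the shape that the plots in the following parts rely on.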

Part 1: the 10 songs with the most words

# Sort by word count, largest first, and keep the first 10 rows
Top10wordCount<-arrange(length_df,desc(length))%>%
  slice(1:10)
library(ggplot2)
# reorder() sorts the bars by word count; coord_flip() makes them horizontal
ggplot(Top10wordCount,aes(x=reorder(track_title,length),y=length))+
  geom_col(aes(fill=track_title))+coord_flip()+
  ylab("Word count") + xlab ("") + 
  ggtitle("Top 10 songs in terms of word count") + 
  theme_minimal()+
  theme(legend.position = "none")

The resulting chart shows that the song with the most words is End Game

Out of the Woods comes second

Part 2: the 10 songs with the fewest words

# Sort by word count, smallest first, and keep the first 10 rows
Bottom10wordCount<-arrange(length_df,length)%>%
  slice(1:10)
library(RColorBrewer)  # not needed for rainbow(); provides brewer.pal() for the word cloud later
color<-rainbow(10)
ggplot(Bottom10wordCount,aes(x=reorder(track_title,-length),y=length))+
  geom_col(aes(fill=track_title))+coord_flip()+
  ylab("Word count") + xlab ("") + 
  ggtitle("Bottom 10 songs in terms of word count") + 
  theme_minimal()+scale_fill_manual(values = color)+
  theme(legend.position = "none")

The song with the fewest words is Sad Beautiful Tragic, released in 2012 on the album Red

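Part 3: the frequency distribution of word counts

ggplot(length_df, aes(x=length)) + 
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_histogram(bins=30,aes(fill = after_stat(count))) + 
  # Dashed vertical line at the mean word count
  geom_vline(aes(xintercept=mean(length)),
             color="#FFFFFF", linetype="dashed", size=1) +
  # Density curve multiplied by 25 so that it sits on the count axis
  geom_density(aes(y=25 * after_stat(count)),alpha=.2, fill="#1CCCC6") +
  ylab("Count") + xlab ("Length") + 
  ggtitle("Distribution of word count") + 
  theme_minimal()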
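
The factor 25 in geom_density() presumably rescales the density curve to the height of the histogram bars. One common way to make such a factor explicit is to fix the bin width and multiply the density's count by that same width. The sketch below does this on simulated data; the data frame `toy`, the bin width `bw` and the colours are illustrative choices, not values from the original article.

```r
library(ggplot2)

set.seed(1)
# Simulated word counts (hypothetical, standing in for length_df$length)
toy <- data.frame(length = round(rnorm(100, mean = 400, sd = 100)))

bw <- 25  # bin width chosen for illustration

p <- ggplot(toy, aes(x = length)) +
  geom_histogram(binwidth = bw, fill = "grey70") +
  # For the density stat, after_stat(count) is density * number of points;
  # multiplying by the bin width puts the curve on the scale of the bar heights
  geom_density(aes(y = after_stat(count) * bw), colour = "#1CCCC6") +
  theme_minimal()

p
```

With the bin width stated once and reused, the curve stays aligned with the bars if the bin width is changed later.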

Part 4: word count per album

# Total words per album; the year is kept for the trend plot in Part 5
length_df_album <- lyrics %>% 
  group_by(album,year) %>% 
  summarise(length = sum(length))%>%
  na.omit()
length_df_album
# geom_col() is equivalent to geom_bar(stat='identity')
ggplot(length_df_album, aes(x= reorder(album,-length), y=length)) +
  geom_col(fill="#1CCCC6") + 
  ylab("Word count") + xlab ("Album") + 
  ggtitle("Word count based on albums") + 
  theme_minimal()

Part 5: how the word count per album changes over time

length_df_album %>% 
  arrange(desc(year)) %>% 
  ggplot(., aes(x= factor(year), y=length, group = 1)) +
  geom_line(colour="#1CCCC6", size=1) + 
  ylab("Word count") + xlab ("Year") + 
  ggtitle("Word count change over the years") + 
  theme_minimal()+
  geom_point(aes(x=factor(year),y=length,
                 size=length,color=factor(year)),
             alpha=0.5)+
  scale_size_continuous(range=c(5,15))+
  theme(legend.position = "none")

Part 6: word cloud

library("tm")
library("wordcloud")
lyrics_text <- lyrics$lyric
# Strip punctuation
lyrics_text<- gsub('[[:punct:]]+', '', lyrics_text)
# The original also called gsub("([[:alpha:]])\1+", "", lyrics_text) here.
# In an R string "\1" is a control character, not a regex backreference,
# so that call changed nothing and is left out.
docs <- Corpus(VectorSource(lyrics_text))
# Lowercase everything, then drop common English stop words
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
# Term-document matrix: one row per word, one column per lyric line
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
# Total frequency of each word, most frequent first
word_freqs <- sort(rowSums(m), decreasing=TRUE)
lyrics_wc_df <- data.frame(word=names(word_freqs), freq=word_freqs)
lyrics_wc_df <- head(lyrics_wc_df, 300)
set.seed(1234)
wordcloud(words = lyrics_wc_df$word, freq = lyrics_wc_df$freq,
          min.freq = 1,scale=c(1.8,.5),
          max.words=200, random.order=FALSE, rot.per=0.15,
          colors=brewer.pal(8, "Dark2"))
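
The cleaning step above touches on regex backreferences, which are easy to get wrong in R because the backslash must itself be escaped inside a string literal. The sketch below compares the two spellings on a made-up string. It passes perl = TRUE only to pin down the regex engine, which is a choice made for this illustration.

```r
# In an R string literal, "\1" is a single control character (an octal escape),
# while "\\1" is backslash + 1, which the regex engine reads as a
# backreference to capture group 1.
nchar("\1")    # 1
nchar("\\1")   # 2

x <- "all good"

# Pattern built with "\1": looks for a letter followed by control characters,
# so ordinary text is left untouched
no_op <- gsub("([[:alpha:]])\1+", "", x, perl = TRUE)

# Pattern built with "\\1": removes every run of an immediately repeated letter
stripped <- gsub("([[:alpha:]])\\1+", "", x, perl = TRUE)

print(c(no_op = no_op, stripped = stripped))
```

Note that the backreference version deletes ordinary doubled letters, turning "all good" into "a gd", so applying it to lyrics would damage many common words.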

Sentiment analysis

I will come back and fill in the remaining parts when I have time