A network visualization tutorial with Gephi and Sigma.js

Here’s a preview of what we’ll be making today: the programming languages influence graph. Check out the link to explore the “design influence” relationships between over 250 programming languages past and present!

Your turn!

In today’s hyper-connected world, networks are a ubiquitous aspect of modern life.

Take the start of my day so far — I used London’s transport network to travel into town. Then I went into a branch of my favourite coffee shop and used my Chromebook to connect to their Wi-Fi network. Next, I logged in to the various social networking sites I frequent.

It’s no secret that some of the most influential companies of the last few decades owe their success to the power of networks.

Facebook, Twitter, Instagram, LinkedIn and other social media platforms rely on the small-world properties of social networks. This lets them connect their users with each other (and advertisers) effectively.

Google owes much of its current success to its early dominance of the search engine market, enabled in part by its ability to return relevant results with the help of its PageRank network algorithm.

Amazon’s efficient distribution network allows them to offer same-day delivery in some major cities.

Networks are also super-important in fields such as Artificial Intelligence and Machine Learning. Neural networks are a very active field of research. Many feature detection algorithms, essential in Computer Vision, rely heavily on using networks to model different parts of images.

A wide range of scientific phenomena can also be understood in terms of network models. This includes quantum mechanics, biochemical pathways, and ecological and socio-economic systems.

Given their undeniable importance, then, how can we better understand networks and their properties?

The mathematical study of networks is known as “graph theory”, and is one of the more accessible branches of mathematics. This article aims to provide an introduction, assuming little prior knowledge or experience.

We’ll be using Python 3.x and some awesome open-source software called Gephi to put together a network visualization of how a range of programming languages past and present are linked by influence.

But first…

What exactly is a network?

The examples described above give us some clues. Transport networks are made up of destinations connected by routes. Social networks are made up of individuals, connected through their relationships to one another. Google’s search engine algorithms evaluate the “rank” of different webpages by looking at which pages link out to others.

More generally, a network is any system that can be described in terms of nodes and edges, or in colloquial terms, “dots and lines”.

Some systems are readily abstracted in this manner. Social networks are perhaps the most obvious example. Computer filesystems are another — folders and files are linked by their “parent” and “child” relationships.

But the real power of networks comes from the fact that many, many systems can be abstracted and modelled in network terms, even if at first it isn’t obvious how.

Representing networks

We need to go a little beyond pen-and-paper sketches to analyze and describe networks mathematically. How can we turn pictures of dots and lines into numbers we can crunch?

One solution is to draw up an adjacency matrix to represent our network.

Matrices are one of those concepts that might sound a little intimidating if you’re not familiar with them, but fear not. Think of them as grids of numbers which can be used to perform many calculations all at once. Here’s an example below:

           Python  Java  Scala  C#
Python        0      1     0     0
Java          0      0     0     1
Scala         0      1     0     0
C#            0      1     0     0
           

In this matrix, the intersection of each row and column is either 0 or 1, depending on whether or not the respective languages are linked. Read alongside the edge list later in this section, a 1 means the column’s language influenced the row’s language. You can check this against the illustration above!

For most purposes, the adjacency matrix is a good way of representing a network mathematically. From a computational perspective, however, it can sometimes be a bit cumbersome.

For instance, with even a relatively modest number of nodes (say 1000), there will be a much larger number of elements in the matrix (e.g., 1000² = 1,000,000).

Many real-world systems yield sparse networks. In these networks, most nodes only connect to a small proportion of all the others.

If we represented a 1000-node sparse network in computer memory as an adjacency matrix, we’d be storing 1,000,000 entries in RAM (about a megabyte, even at just one byte per entry). Most will be zeros. There’s got to be a more efficient way of going about this.

An alternative approach is to work with edge lists instead. These are exactly what they say they are. They are simply a list of which node pairs link to each other.

For example, the programming languages network above can be represented as follows:

Java, Python
Java, Scala
Java, C#
C#, Java
           

For larger networks, this is a much more computationally efficient means of representing them. It is of course possible to generate an adjacency matrix from an edge list (and vice versa). It’s not like we have to pick one or the other.

Another means of representing networks is the adjacency list. This lists every node followed by the nodes it links to. For example:

Java: Python, Scala, C#
C#: Java
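
As a minimal sketch of how the three representations relate, the snippet below rebuilds the adjacency matrix and the adjacency list from the edge list shown earlier. The function names (to_adjacency_matrix, to_adjacency_list) are made up for this illustration and do not come from any library. The matrix follows the convention that a 1 in a row means the column’s language influenced that row’s language.

```python
from collections import defaultdict

# Edge list from the example above: (source, target) means "source influenced target".
edges = [("Java", "Python"), ("Java", "Scala"), ("Java", "C#"), ("C#", "Java")]
order = ["Python", "Java", "Scala", "C#"]  # row/column order of the matrix


def to_adjacency_matrix(edges, order):
    """Return a matrix where matrix[row][col] == 1 if order[col] influenced order[row]."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for source, target in edges:
        matrix[index[target]][index[source]] = 1
    return matrix


def to_adjacency_list(edges):
    """Return a dict mapping each source node to the list of nodes it links to."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


matrix = to_adjacency_matrix(edges, order)
adjacency = to_adjacency_list(edges)

for name, row in zip(order, matrix):
    print(f"{name:<8}", row)
print(adjacency)
```

Note how the edge list needs only four entries, while the matrix needs sixteen, most of them zero.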
           

Collecting data, making connections

Any network model and visualisation will only be as good as the data used to construct it. This means, as well as ensuring the data is both accurate and complete, we also need to justify a means of inferring edges between nodes.

In many respects, this is the critical step. Any subsequent analysis and inferences made about the network depend on being able to justify the “linkage criterion”.

For example, in social network analysis, you might link people based upon whether they follow one another on social media. In molecular biology, you might link genes based upon their co-expression.

Often, the method used to link nodes will allow for weights to be assigned to the edges, giving a measure of “strength”.

For instance, in the context of online retail, you could link products based upon how often they are purchased together. Products that are frequently bought together would be linked by a higher weighted edge than products which are only sometimes bought together. Products that are bought together no more often than would be expected by chance wouldn’t be linked at all.
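
To make the idea of weighted edges concrete, here is a rough sketch that counts how often pairs of products share a basket. The baskets are invented for illustration, the function name co_purchase_edges is hypothetical, and the fixed min_count cut-off is only a crude stand-in for a proper "more often than chance" test.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"tent", "sleeping bag", "torch"},
    {"tent", "sleeping bag"},
    {"tent", "torch"},
    {"kettle", "torch"},
]


def co_purchase_edges(baskets, min_count=2):
    """Count how often each pair of products shares a basket.

    Pairs seen fewer than min_count times are dropped. This fixed cut-off is a
    crude stand-in for "more often than expected by chance".
    """
    counts = Counter()
    for basket in baskets:
        # sorted() gives each unordered pair a single canonical key
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: weight for pair, weight in counts.items() if weight >= min_count}


weighted_edges = co_purchase_edges(baskets)
for (a, b), weight in sorted(weighted_edges.items()):
    print(a, b, weight)
```

Each surviving pair becomes an edge, and its count becomes the edge weight.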

As you might imagine, the methods for linking nodes to one another can be as sophisticated as you like.

However, for this tutorial we’ll be using a simpler means of connecting programming languages. We’re gonna rely on the accuracy of Wikipedia.

For our purposes, this should be fine. Wikipedia’s success is testament that it must be doing something right. The open-source, collaborative method by which articles are written should ensure some degree of objectivity.

Also, its relatively consistent page structure makes it a convenient playground for trying out web-scraping techniques.

Another bonus is the extensive, well-documented Wikipedia API, which makes information retrieval easier still. Let’s get started.

Step 1: Installing Gephi

Gephi is available on Linux, Mac and Windows. You can download it here.

For this project, I was using Lubuntu. If you’re on Ubuntu/Debian, then you can follow the steps below to get Gephi up and running. Otherwise, the installation process will likely be much the same as whatever you’re familiar with.

Download the latest version (at the time of writing this was v.0.9.1) of Gephi for your system. When it’s ready, you’ll need to extract the files.

cd Downloads
tar -xvzf gephi-0.9.1-linux.tar.gz
cd gephi-0.9.1/bin
./gephi
           

You may need to check your version of the Java JRE. Gephi requires a recent version. On my relatively fresh install of Lubuntu, I simply installed the default-jre, and everything worked from there.

sudo apt install default-jre
./gephi
           

There’s one more step before you’re ready to get underway. In order to export the graph to the Web, you can use the Sigma.js plugin for Gephi.

From Gephi’s menu bar, choose the “Tools” option, and select “Plugins”.

Click on the “Available Plugins” tab and select “SigmaExporter” (I also installed JSON Exporter, because it’s another useful plugin to have around).

Hit the “Install” button and you’ll be walked through the process. You’ll need to restart Gephi once you’re done.

Step 2: Writing the Python script

This tutorial will use Python 3.x, plus a few modules to make life easier. Using the pip module installer, run the following command:

pip3 install wikipedia
           

Now, in a new directory, create a file called something like script.py, and open it up in your favourite code editor/IDE. Below is an outline of the main logic:

  1. First, you’ll need a list of programming languages to include.

  2. Next, go through that list and retrieve the HTML of the relevant Wikipedia article.

  3. From this, extract a list of programming languages that each language has influenced. This will be a rough-and-ready linkage criterion.

  4. While you’re at it, it’d be nice to grab some metadata about each language.

  5. Finally, you’ll want to write all the data you’ve collected to a .csv file.

The full script can be found in this gist.

Import some modules

In script.py, start by importing a few modules which will make things easier:

import csv
import wikipedia
import urllib.request
from bs4 import BeautifulSoup as BS
import re
           

OK — begin by making a list of nodes to include. This is where the Wikipedia module comes in handy. It makes accessing the Wikipedia API super-easy.

Add the following code:

pageTitle = "List of programming languages"
nodes = list(wikipedia.page(pageTitle).links)
print(nodes)
           

If you save and run this script, you’ll see it prints out all the links from the “List of programming languages” Wikipedia article. Nice!

However, it’s always sensible to manually inspect any automatically collected data. A quick glance will reveal that, as well as many actual programming languages, the script has also picked up a few extra links.

For example, you might see “List of markup languages”, “Comparison of programming languages” and others in there.

Although Gephi lets you remove nodes you’d rather not include, it wouldn’t hurt to “clean” the data before proceeding. If anything, this will save time later on.

removeList = [
    "List of",
    "Lists of",
    "Timeline",
    "Comparison of",
    "History of",
    "Esoteric programming language"
    ]

nodes = [i for i in nodes if not any(r in i for r in removeList)]
           

These lines define a list of substrings to be removed from the data. The script then goes through the data, removing any elements that contain any of the unwanted substrings.

In Python, this requires just one line of code!
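
To see the filter in action without calling the Wikipedia API, here is a small sketch that applies the same list comprehension to a hand-written sample. The sample_nodes list is invented for illustration; the real list of links will be much longer.

```python
# A small hand-written sample standing in for the real list of page links.
sample_nodes = [
    "Python (programming language)",
    "List of markup languages",
    "Comparison of programming languages",
    "Scala (programming language)",
    "Timeline of programming languages",
]

removeList = ["List of", "Lists of", "Timeline", "Comparison of", "History of",
              "Esoteric programming language"]

# Keep a title only if none of the unwanted substrings occurs in it.
cleaned = [i for i in sample_nodes if not any(r in i for r in removeList)]
print(cleaned)
```

Only the two actual language pages survive the filter.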

Some helper functions

Now you can start scraping Wikipedia to build up an edge list (and collect any metadata). To make this easier, first define a few functions.

Grabbing HTML

The first function uses the BeautifulSoup module to get hold of the HTML for each language’s Wikipedia page.

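base = "https://en.wikipedia.org/wiki/"

def getSoup(n):
    # Fetch the article for page title n and return its infobox table,
    # or None if the page cannot be fetched or has no such table.
    try:
        # Percent-encode the title so spaces and brackets are safe in a URL
        # (urllib.parse is loaded as part of importing urllib.request).
        url = base + urllib.parse.quote(n.replace(" ", "_"))
        with urllib.request.urlopen(url) as response:
            soup = BS(response.read(),'html.parser')
            table = soup.find_all("table",class_="infobox vevent")[0]
            return table
    except Exception:
        return None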
           

This function uses the urllib.request module to get hold of the HTML for the page at "https://en.wikipedia.org/wiki/" + "programming language", where "programming language" stands for the page title of the language in question.

This is then passed to BeautifulSoup, which reads and parses the HTML into an object we can use to search for information.

Next, use the find_all() method to extract the HTML element you’re interested in.

Here, this will be the summary table at the top of each programming language article. How can these be identified?

The easiest way is to visit one of the programming language pages. Here, you can simply use the browser’s Developer Tools to inspect the elements of interest.

The summary table has the HTML tag <table> and the CSS classes "infobox" and "vevent", so you can use these to identify the table in the HTML.

Specify this with the arguments:

  • "table"

  • class_="infobox vevent"

find_all() returns a list of all elements that match the criteria. In order to actually specify the element you’re interested in, add the index [0]. If the function is successful, it returns the table object. Otherwise, it returns None.

With any automated data collection procedure, it’s always important to handle exceptions thoroughly. If not, then in the best case scenario the script crashes and you’ll need to start over.

In the worst case, you’ll end up with a data set riddled with inconsistencies and errors. This will make it a nightmare to work with down the line.
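
One way to steer between those two outcomes is to record each failure instead of silently ignoring it. The sketch below is only an illustration of that idea: safe_call, parse_year and the failures list are hypothetical names, not part of the tutorial’s script.

```python
failures = []  # (input, error description) pairs collected during the run


def safe_call(func, arg):
    """Call func(arg); on any exception, record it and return None."""
    try:
        return func(arg)
    except Exception as error:
        failures.append((arg, repr(error)))
        return None


def parse_year(text):
    """Toy parser that raises ValueError on non-numeric input."""
    return int(text)


results = [safe_call(parse_year, t) for t in ["1995", "unknown", "1972"]]
print(results)
print(failures)
```

The script keeps running, and afterwards the failures list shows exactly which inputs need a second look.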

Retrieve metadata

The next function uses the table object to look for some metadata. Here, it searches the table for the year the language first appeared.

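def getYear(t):
    # Return the year the language first appeared, or a placeholder string.
    try:
        t = t.get_text()
        # Take a 30-character window starting at the word "appear"
        year = t[t.find("appear"):t.find("appear")+30]
        # The greedy .* means the last four-digit year in the window is captured
        year = re.match(r'.*([1-3][0-9]{3})',year).group(1)
        return int(year)
    except Exception:
        return "Could not determine"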
           

This short function takes the table object as its argument, and uses BeautifulSoup’s get_text() function to produce a string.

The next step is to create a substring called year. This takes 30 characters of the text, starting at the first appearance of the word "appear". This string should contain the year the language first appeared.

In order to extract just the year, use a regular expression (courtesy of the re module) to match a run of four digits whose first digit is between 1 and 3:

re.match(r'.*([1-3][0-9]{3})',year)
           

If this is successful, the function returns year as an integer. Otherwise, it returns a sad-looking “Could not determine”. You might wish to scrape further metadata, such as paradigm, designer or typing discipline.
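
To get a feel for how the slicing and the regular expression behave, the sketch below applies the same two steps to plain strings. The helper name extract_year and the sample strings are made up for illustration. One thing it shows: because .* is greedy, the last four-digit year inside the 30-character window is the one that gets captured.

```python
import re

YEAR_PATTERN = r'.*([1-3][0-9]{3})'


def extract_year(text):
    """Mirror the slicing and matching done in getYear() on a plain string."""
    start = text.find("appear")
    window = text[start:start + 30]
    match = re.match(YEAR_PATTERN, window)
    return int(match.group(1)) if match else None


# Made-up infobox text, not copied from any real article.
print(extract_year("First appeared 1995; many years ago"))
print(extract_year("First appeared 1972, revised 1978"))
print(extract_year("First appeared recently"))
```

If a page lists more than one year close to the word "appear", it is worth checking by hand which one ended up in the metadata.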

Collecting links

One more function for you. This time, you’ll feed in the table object for a given language, and hopefully get back a list of other programming languages.

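def getLinks(t):
    # Return the titles linked from the row after "Influenced", or None.
    try:
        table_rows = t.find_all("tr")
        for i in range(0,len(table_rows)-1):
            try:
                # Look for the header row whose text is exactly "Influenced"
                if table_rows[i].get_text() == "\nInfluenced\n":
                    out = []
                    # The following row holds the links to influenced languages
                    for j in table_rows[i+1].find_all("a"):
                        try:
                            out.append(j['title'])
                        except KeyError:
                            # Skip links without a title attribute
                            continue
                    return out
            except Exception:
                continue
        return
    except Exception:
        return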
           

Woah, look at all that nesting… What is actually going on here then?

This function makes use of the fact that the table objects have a consistent structure. The information in the table is stored in rows (the relevant HTML tag is <tr>). One of these rows will contain the text "\nInfluenced\n". The first part of the function finds which row this is.

Once this row has been found, you can then be pretty sure the next row contains links to each of the programming languages influenced by the current one. Find these links using find_all("a"), where the argument "a" corresponds to the HTML tag <a>.

For each link j, append its ["title"] attribute to a list called out. The ["title"] attribute is the one of interest because it will exactly match the language’s name as stored in nodes.

For example, Java is stored in nodes as "Java (programming language)", so you need to use this exact name throughout the data set.

If successful, getLinks() returns a list of programming languages. The rest of the function deals with exception handling, in case something should go wrong at any stage.

Collecting the data

At last, you’re almost ready to sit back and let the script do its thing. It will collect the data and store it in two list objects.

edgeList = [["Source,Target"]]
meta = [["Id","Year"]]
           

Now write a loop that will apply the functions defined earlier to every item in nodes, and store the outputs in edgeList and meta.

for n in nodes:
    try:
        temp = getSoup(n)
    except:
        continue
    try:
        influenced = getLinks(temp)
        for link in influenced:
            if link in nodes:
                edgeList.append([n+","+link])
                print([n+","+link])
    except:
        continue
    year = getYear(temp)
    meta.append([n,year])
           

This loop takes each language in nodes and attempts to retrieve the summary table from its Wikipedia page.

Then, it retrieves all the languages the table lists as having been influenced by the language in question.

For each language that also appears in the nodes list, append an element to edgeList in the form of ["source,target"]. In this way, you’ll build up an edge list to feed into Gephi.

For debugging purposes, print each element added to edgeList, just to be sure everything’s working as it should. If you were being extra thorough, you could add print statements to the except clauses, too.

Next, get the language’s name and year, and append these to the meta list.
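
The linking logic can be tried out offline with mock data. In the sketch below, fake_influenced is a hypothetical stand-in for what getLinks() might return for each page, and each edge is stored as a two-element list rather than a pre-joined string, which is a variation on the script above rather than a copy of it. The set used for membership checks is also an addition, made only for speed.

```python
# Hypothetical stand-in for what getLinks() might return for each page title.
fake_influenced = {
    "Java (programming language)": ["Scala (programming language)", "Kotlin"],
    "Scala (programming language)": None,  # no "Influenced" row found
    "C (programming language)": ["Java (programming language)"],
}

nodes = list(fake_influenced)
node_set = set(nodes)  # set membership checks are faster than scanning a list

edge_rows = [["Source", "Target"]]
for n in nodes:
    influenced = fake_influenced[n] or []  # treat None as "no links"
    for link in influenced:
        # Skip targets that are not in our node list (here, "Kotlin").
        if link in node_set:
            edge_rows.append([n, link])

for row in edge_rows:
    print(row)
```

Only links whose target is itself in the node list become edges, which is the same filter the main loop applies.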

Writing to CSV

Once the loop has run, the final step is to write the contents of edgeList and meta to comma-separated value (CSV) files. This is easily done with the csv module imported earlier.

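# newline="" stops the csv module's line endings being translated a second time on Windows;
# encoding="utf-8" keeps language names with non-ASCII characters intact.
with open("edge_list.csv","w",newline="",encoding="utf-8") as f:
    wr = csv.writer(f)
    for e in edgeList:
        wr.writerow(e)

with open("metadata.csv","w",newline="",encoding="utf-8") as f2:
    wr = csv.writer(f2)
    for m in meta:
        wr.writerow(m)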
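
One detail worth knowing about csv.writer: a field that itself contains a comma is wrapped in quotes. The sketch below writes to an in-memory buffer to compare a pre-joined "source,target" string with a two-element row. The helper name rows_to_csv is made up for this illustration, and how a given Gephi version reads the quoted form is something to check in its import preview rather than assume.

```python
import csv
import io


def rows_to_csv(rows):
    """Write rows with csv.writer into an in-memory buffer and return the text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# One pre-joined string per row, as in edgeList above.
joined = rows_to_csv([["Source,Target"], ["Java,Scala"]])
# One list element per column.
split = rows_to_csv([["Source", "Target"], ["Java", "Scala"]])

print(repr(joined))
print(repr(split))
```

If the import preview shows a single column instead of separate Source and Target columns, storing each edge as a two-element list is one way to get the unquoted form.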
           

Done! Save the script, and from the terminal run:

$ python3 script.py

You should see the script printing out each source-target pair as it builds up the edge list. Make sure your internet connection is steady, and sit back while the script does its magic.

Step 3: Graph building with Gephi

Hopefully you got Gephi installed and running earlier. Now you can create a new project and use the data you gathered to build a directed graph. This will show how different programming languages have influenced one another!

Start by creating a new project in Gephi, and switch to the “Data Laboratory” view. This provides a spreadsheet-like interface for handling data in Gephi. The first thing to do is import the edge list.

  • Click “Import spreadsheet”.

  • Choose the edge_list.csv file generated by the Python script. Ensure that Gephi knows to use the commas as the separator.

  • Choose “Edge List” from the List type.

  • Click “Next” and check that you are importing both Source and Target columns as strings.

This should update the Data Lab with a list of nodes. Now, import the metadata.csv file. This time, make sure to choose “Nodes list” from the List type.

Switch over to the “Preview” tab, and see how the network looks.

Ah… It’s just a little bit… monochrome. And messy. Like a plate of spaghetti. Let’s fix this.

Making it pretty

There are all sorts of ways you can work on the presentation, and here’s where a little bit of creative freedom comes in. With network visualisations, there are essentially three things to take into consideration:

  1. Positioning: There are several algorithms which can generate layout patterns for a network. A popular choice is the Fruchterman-Reingold algorithm, which is available in Gephi.

  2. Sizing: The size of nodes in a graph can be used to represent some interesting property. Often, this is a centrality measure. There are many ways of measuring centrality, but they all reflect the “importance” of a given node, in terms of how well-connected it is to the rest of the network.

  3. Coloring: It is also possible to use color to show some property of a node. Often, color is used to indicate community structure. This is broadly defined as a “group of nodes which are more connected with each other than with the rest of the graph”. In a social network, this can reveal friendship, family or professional groups. There are several algorithms which can detect community structure. Gephi comes with the Louvain method built-in.

To make these changes, you will need to calculate some statistics. Switch to the “Overview” window. Here you will see a panel on the right. It should contain a “Statistics” tab. Open this, and you will see a range of options.

Gephi comes with many inbuilt statistical capabilities. For each of them, clicking “Run” will generate a report that will reveal insights about the network.

Some useful ones to know include:

  • Average degree: The average language is connected to about four others. The report also shows a degree distribution graph. This reveals that most languages have very few connections, while a small proportion have many. This suggests that this is a scale-free network. Much research has been done on scale-free networks, and the processes that generate them.

  • Diameter: This network has a diameter of 12, meaning that the longest of the shortest paths between any two languages runs through 12 connections. The average path length is just under four. This means that, on average, any two languages are separated by about four edges. These figures give a measure of the “size” of the network.

  • Modularity: This is a score that shows how “compartmentalized” the network is. Here, the modularity score is about 0.53. This is relatively high, suggesting there are distinct modules within this network. Again, this indicates something interesting about the underlying system. Languages tend to fall into distinct “influence groups”.
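
To make these statistics less of a black box, here is a rough sketch that computes average degree, diameter and average path length for a tiny invented graph. It is not how Gephi computes its reports, and Gephi may use slightly different conventions (for example, in how it counts degree in a directed graph), so treat the numbers as an illustration of the definitions only.

```python
from collections import Counter, deque

# Toy directed edge list, invented for illustration.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
nodes = sorted({n for edge in edges for n in edge})

# Degree here counts both incoming and outgoing edges of a node.
degree = Counter()
for source, target in edges:
    degree[source] += 1
    degree[target] += 1
average_degree = sum(degree.values()) / len(nodes)


def shortest_path_lengths(start, edges):
    """Breadth-first search from start, following edge direction."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, []).append(target)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, []):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return distance


# Diameter: the longest shortest path between any reachable pair of nodes.
all_lengths = [
    d for n in nodes for d in shortest_path_lengths(n, edges).values() if d > 0
]
diameter = max(all_lengths)
average_path_length = sum(all_lengths) / len(all_lengths)

print(average_degree, diameter, average_path_length)
```

On the real languages graph, the same definitions are what lie behind the figures reported above.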

Anyhow, to modify the appearance of the network, head over to the left panel.

In the “Layout” tab, you can select which layout algorithm to use. Hit “Run” and watch the graph shift about in real-time! See which layout algorithm you think works best.

Above the Layout tab is the “Appearance” tab. Here, you can play with different settings for the node and edge colors, sizes and labels. These can be configured based upon attributes (including the stats you get Gephi to calculate).

As a suggestion, you could:

  • Color the nodes by their Modularity attribute. This colors them according to their community membership.

  • Size the nodes by their Degree. Better connected nodes will appear larger than less connected ones.

However, you should experiment and come up with a layout you like best.

Once you’re happy with the appearance of your graph, it is time to move on to the final step — exporting to Web!

Step 4: Sigma.js

Already you have built a network visualisation that can be explored in Gephi. You could choose to take a screenshot, or save the graph in SVG, PDF or PNG format.

However, if you installed the Sigma.js plugin earlier, then why not export the graph to HTML? This will create an interactive visualisation that you can host online, or upload to GitHub and share with others.

To do this, select “Export > Sigma.js template…” from Gephi’s menu bar.

Fill in the details as required. Make sure to choose which directory you export the project to. You can change the title, legend, description, hover behavior and many other details. When you’re ready, click “OK”.

Now, if you navigate to the directory you exported the project to, you will see a folder containing all the files generated by Sigma.js.

Open up index.html in your favorite browser. Ta-da! There’s your network! If you know a little CSS and JavaScript, you can dive into the various generated files to tweak the output as you wish.

And that concludes this tutorial!

Summary

  • Many systems can be modelled and visualised as networks. Graph theory is a branch of math that provides tools to help understand network structures and properties.

  • You used Python to scrape data from Wikipedia to build a programming languages influence graph. The linkage criterion was whether a given language was listed as an influence on another’s design.

  • Gephi and Sigma.js are open-source tools that allow you to analyze and visualize networks. They allow you to export the network in image, PDF or Web formats.

Thanks for reading — I look forward to any comments or questions you might have! For a fantastic resource to learn more about graph theory, see Albert-László Barabási’s interactive online book.

The full code for this tutorial can be found here.

Translated from: https://www.freecodecamp.org/news/how-to-visualize-the-programming-language-influence-graph-7f1b765b44d1/
