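Background
1. I recently needed to use the GitHub REST API to fetch a repository's file list, file contents, and the commit information for each file.
2. I worked from the official documentation: https://docs.github.com/en/rest?apiVersion=2022-11-28
3. The following APIs are enough to cover all of this:
a. get_a_branch : fetch information about a given branch
b. get_a_tree : fetch the file list from the repository's tree structure
c. get_a_blob : fetch the contents of a given file
d. list_commits : fetch the commit information for a given file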
Preparation
- Register a GitHub account
- Create a repository on GitHub
- Generate a personal access token on GitHub for use with the REST API
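Once the token exists, it is worth keeping it out of the notebook itself. Below is a minimal sketch, assuming the token has been exported in an environment variable; the name GITHUB_TOKEN and the helpers load_token and build_headers are my own illustrative choices, not part of the setup used in the rest of this post.

```python
import os


def load_token(env_var='GITHUB_TOKEN'):
    """Read the personal access token from the environment.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f'environment variable {env_var} is not set')
    return token


def build_headers(token):
    """Build the three request headers shown in the GitHub REST API examples."""
    return {
        'Accept': 'application/vnd.github+json',
        'Authorization': f'Bearer {token}',
        'X-GitHub-Api-Version': '2022-11-28',
    }
```

Calling build_headers(load_token()) would then give a headers dict that can be passed to requests.get.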
API calls
get_a_branch
- Define the repository information
- Wrap an http_get helper
import requests
import json
import base64
# config in .ipython/profile_default/startup/00-github_api_config.py
# owner = '<owner>'
# repo = '<repo>'
# github_personal_access_token = '<github personal access token>'
owner = owner
repo = repo
github_personal_access_token = github_personal_access_token
domain = 'https://api.github.com'
def http_get(uri=None, **kwargs):
    # Extra keyword arguments are sent as URL query parameters.
    url = f"{domain}{uri}"
    print(f'get {url} ...')
    res = requests.get(url=url, headers={
        'Accept': 'application/vnd.github+json',
        'Authorization': f'Bearer {github_personal_access_token}',
        'X-GitHub-Api-Version': '2022-11-28'
    }, params=kwargs)
    # Anything other than 200 is treated as a failure.
    if res.status_code != 200:
        print(f"{res.status_code=}, {res.text=}")
        raise Exception(f"{res.status_code=}, {res.text=}")
    result = json.loads(res.text)
    # print(res.text)
    # print(f"{type(result)=}")
    return result
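The helper above reads its token and domain from globals and always goes over the network, which makes it awkward to try out offline. Here is a hedged variant, assuming you are happy to pass in the session and token explicitly; http_get_with and its timeout default are names and values I made up for this sketch. The session only needs a requests-style get method.

```python
import json

DOMAIN = 'https://api.github.com'


def http_get_with(session, token, uri, timeout=30, **params):
    """GET a GitHub REST API path through a caller-supplied session.

    `session` is anything with a requests-style
    .get(url, headers=..., params=..., timeout=...) method.
    """
    res = session.get(
        f'{DOMAIN}{uri}',
        headers={
            'Accept': 'application/vnd.github+json',
            'Authorization': f'Bearer {token}',
            'X-GitHub-Api-Version': '2022-11-28',
        },
        params=params,    # sent as URL query parameters
        timeout=timeout,  # seconds; avoids hanging forever on a dead connection
    )
    if res.status_code != 200:
        raise RuntimeError(f'GET {uri} failed: {res.status_code} {res.text}')
    return json.loads(res.text)
```

In a notebook, passing requests.Session() as the session should behave like the original helper, while a small fake object can stand in for it in tests.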
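- Define the get_a_branch function
- Fetch information about the 'main' branch; the key value is the tree SHA in the response (commit.commit.tree.sha), which the next step, get_a_tree, needs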
# curl -L \
# -H "Accept: application/vnd.github+json" \
# -H "Authorization: Bearer <YOUR-TOKEN>" \
# -H "X-GitHub-Api-Version: 2022-11-28" \
# https://api.github.com/repos/OWNER/REPO/branches/BRANCH
def get_a_branch(branch):
    uri = f"/repos/{owner}/{repo}/branches/{branch}"
    return http_get(uri=uri)

result = get_a_branch('main')
# The root tree SHA is nested under commit -> commit -> tree -> sha.
tree_sha = result['commit']['commit']['tree']['sha']
print(f"{tree_sha=}")
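To make the nesting of the branch response easier to see, here is a small offline sketch. The payload is a trimmed, made-up example that only mirrors the commit.commit.tree.sha nesting used above, and extract_tree_sha is a hypothetical helper name.

```python
def extract_tree_sha(branch_info):
    """Return the root tree SHA from a get_a_branch response dict."""
    try:
        return branch_info['commit']['commit']['tree']['sha']
    except (KeyError, TypeError) as exc:
        # Fail with one readable message instead of a bare KeyError.
        raise ValueError('branch response has no commit.commit.tree.sha') from exc


# Trimmed, made-up payload; real responses carry many more fields.
sample_branch = {
    'name': 'main',
    'commit': {
        'sha': 'c0ffee',
        'commit': {'tree': {'sha': 'abc123'}},
    },
}
print(extract_tree_sha(sample_branch))
```

Note that the outer commit.sha is the commit's own SHA; the tree SHA sits one level deeper.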
get_a_tree
- Recursively fetch the file list of the tree identified by tree_sha
- The key values in the response are path and sha, which the next step, get_a_blob, needs
# curl -L \
# -H "Accept: application/vnd.github+json" \
# -H "Authorization: Bearer <YOUR-TOKEN>" \
# -H "X-GitHub-Api-Version: 2022-11-28" \
# https://api.github.com/repos/OWNER/REPO/git/trees/TREE_SHA
def get_a_tree(tree_sha):
    # recursive=true returns entries from every subdirectory in one response
    uri = f"/repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=true"
    return http_get(uri)

result = get_a_tree(tree_sha)
file_list = []
for e in result.get('tree'):
    path = e.get('path')
    entry_type = e.get('type')  # renamed so the built-in type() is not shadowed
    sha = e.get('sha')
    # print('\n---------------------------------')
    # print(f"{path=}, {entry_type=}, {sha=}")
    # 'blob' entries are files; 'tree' entries are directories
    if entry_type == 'blob':
        file_list.append({
            'path': path,
            'sha': sha
        })
print(file_list)
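One thing the loop above does not look at is the truncated flag that the tree response carries; when it is true, the recursive listing is incomplete. The sketch below pulls out the blobs and reports that flag, using a made-up sample payload; blobs_from_tree is a hypothetical helper name.

```python
def blobs_from_tree(tree_response):
    """Return (files, truncated) from a get_a_tree response dict.

    files is a list of {'path', 'sha'} dicts for entries of type 'blob'.
    """
    files = [
        {'path': entry['path'], 'sha': entry['sha']}
        for entry in tree_response.get('tree', [])
        if entry.get('type') == 'blob'
    ]
    return files, bool(tree_response.get('truncated', False))


# Made-up payload: one directory, two files.
sample_tree = {
    'sha': 'abc123',
    'truncated': False,
    'tree': [
        {'path': 'README.md', 'type': 'blob', 'sha': 'b1'},
        {'path': 'src', 'type': 'tree', 'sha': 't1'},
        {'path': 'src/main.py', 'type': 'blob', 'sha': 'b2'},
    ],
}
files, truncated = blobs_from_tree(sample_tree)
print(files, truncated)
```

If truncated comes back true for a large repository, one option is to drop recursive=true and walk the sub-trees one at a time.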
get_a_blob
- Fetch the contents of a given file
- The key value is the content property, which must be base64-decoded to get the file contents
# curl -L \
# -H "Accept: application/vnd.github+json" \
# -H "Authorization: Bearer <YOUR-TOKEN>" \
# -H "X-GitHub-Api-Version: 2022-11-28" \
# https://api.github.com/repos/OWNER/REPO/git/blobs/FILE_SHA
def get_a_blob(file_sha):
    uri = f'/repos/{owner}/{repo}/git/blobs/{file_sha}'
    return http_get(uri=uri)

for file in file_list:
    result = get_a_blob(file.get('sha'))
    content = result.get('content')
    # content is base64-encoded; decoding as UTF-8 assumes a text file
    content = base64.b64decode(content).decode(encoding='utf-8')
    file.setdefault('content', content)
print(file_list)
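The decode step above assumes every file is UTF-8 text, so a binary file (an image, say) would raise UnicodeDecodeError. Here is a minimal sketch of a more forgiving decoder, assuming you want the raw bytes back in that case; decode_blob is a hypothetical helper and the sample payloads are made up.

```python
import base64


def decode_blob(blob_response):
    """Return (text, raw_bytes) for a get_a_blob response dict.

    text is None when the bytes are not valid UTF-8.
    """
    encoding = blob_response.get('encoding', 'base64')
    if encoding != 'base64':
        raise ValueError(f'unexpected blob encoding: {encoding}')
    # b64decode skips the line breaks embedded in the content string.
    raw = base64.b64decode(blob_response['content'])
    try:
        return raw.decode('utf-8'), raw
    except UnicodeDecodeError:
        return None, raw


# Made-up payloads: one text file, one with bytes that are not valid UTF-8.
text_blob = {'encoding': 'base64', 'content': 'aGVs\nbG8=\n'}
binary_blob = {'encoding': 'base64',
               'content': base64.b64encode(b'\xff\xfe').decode('ascii')}
print(decode_blob(text_blob))
print(decode_blob(binary_blob))
```

In the loop over file_list, the text value could be stored as content, and files where it is None could be skipped or handled separately.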
list_commits
# curl -L \
# -H "Accept: application/vnd.github+json" \
# -H "Authorization: Bearer <YOUR-TOKEN>" \
# -H "X-GitHub-Api-Version: 2022-11-28" \
# https://api.github.com/repos/OWNER/REPO/commits
def list_commits(path=None):
    uri = f'/repos/{owner}/{repo}/commits'
    # path is sent as a query parameter, so only commits touching that file are returned
    return http_get(uri=uri, path=path)

for file in file_list:
    path = file.get('path')
    # keep only the commit message and committer of each commit
    commits = [
        {
            'message': item.get('commit').get('message'),
            'committer': item.get('commit').get('committer')
        }
        for item in list_commits(path)
    ]
    # print(commits)
    file.setdefault('commit', commits)
print(json.dumps(file_list, indent=4))
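Keep in mind that list_commits as written only sees the first page of results: the endpoint returns 30 commits per page by default and accepts per_page (up to 100) and page query parameters. Below is a rough sketch of collecting every page, assuming a page shorter than per_page marks the end. collect_all_pages, its max_pages cap, and the fake fetcher are all names invented here for illustration.

```python
def collect_all_pages(fetch_page, per_page=100, max_pages=50):
    """Call fetch_page(page=..., per_page=...) until a short page comes back.

    max_pages is a safety cap so a misbehaving fetcher cannot loop forever.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page=page, per_page=per_page)
        items.extend(batch)
        if len(batch) < per_page:
            break
    return items


# Stand-in for the real API: serves 7 fake commits, one page at a time.
fake_commits = [{'sha': f'sha{i}'} for i in range(7)]


def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return fake_commits[start:start + per_page]


all_commits = collect_all_pages(fake_fetch, per_page=3)
print(len(all_commits))
```

With the functions in this post, fetch_page could be a small lambda that calls http_get with the commits uri and passes path, page, and per_page through as query parameters.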
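Summary
- With the steps above, the REST API gives you the full file list, the file contents, and the commit information for a given branch of a repository
- Anything that needs to collect source-code data can be built on top of this
- Pair it with JupyterLab and the coding gets even smoother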