laitimes

[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

author:Pemba duck

MovingPandas is an open-source geospatio-temporal data processing library based on Python and GeoPandas, which is used to process the trajectory data of moving objects. For more information about the use of MovingPandas, see the article: Getting Started with MovingPandas, which mainly introduces three drawing examples of MovingPandas. The official address of the MovingPandas repository is: movingpandas. The address of the official sample code repository for MovingPandas is: movingpandas-examples. All experimental data in this article comes from: movingpandas-examples-data.

All the code in this article can be found at: Python-Study-Notes

The author recommends installing MovingPandas in Python 3.8 or above, and recommends using conda for installation. MovingPandas can be installed using the following command:

conda install -c conda-forge movingpandas           

Due to the complexity of MovingPandas' dependency environment, pip is not recommended for installation. If you insist on using pip for installation, you can install MovingPandas by following the following command:

pip install movingpandas
# 本文必安装第三方库
pip install contextily
pip install seaborn
# 以下第三方库可选
pip install hvplot
pip install cartopy
pip install geoviews           

The following code shows the version of MovingPandas, which is Python 3.10.

# jupyter notebook环境去除warning
import warnings
warnings.filterwarnings("ignore")

# 查看movingpandas版本及依赖库版本信息
import movingpandas as mpd
mpd.show_versions()           
MovingPandas 0.16.1

SYSTEM INFO
-----------
python     : 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]
executable : /opt/conda/envs/python35-paddle120-env/bin/python
machine    : Linux-5.4.0-109-generic-x86_64-with-glibc2.31

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : None
GEOS lib   : None
GDAL       : 3.6.4
GDAL data dir: /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/fiona/gdal_data
PROJ       : 9.2.1
PROJ data dir: /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pyproj/proj_dir/share/proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.13.2
pandas     : 2.0.2
fiona      : 1.9.4.post1
numpy      : 1.24.3
shapely    : 2.0.1
rtree      : 1.0.1
pyproj     : 3.6.0
matplotlib : 3.7.1
mapclassify: None
geopy      : 2.3.0
holoviews  : None
hvplot     : None
geoviews   : None
stonesoup  : None           

The following code is used to load the third-party libraries required for the drawing.

import numpy as np
import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import shapely as shp
import matplotlib.pyplot as plt

from geopandas import GeoDataFrame, read_file
from shapely.geometry import Point, LineString, Polygon
from datetime import datetime, timedelta
from os.path import exists
from urllib.request import urlretrieve
import contextily as cx
import warnings
import seaborn as sns
warnings.filterwarnings('ignore')

plot_defaults = {'linewidth':5, 'capstyle':'round', 'figsize':(9,3), 'legend':True}           

1 Example of ship data analysis

The Automatic Identification System (AIS) is an automatic communication system used between ships. It transmits information between ships through radio signals, allowing other nearby ships and shore monitoring stations to obtain information about the ship, such as ship name, flag country, ship type, ship location, heading, speed, etc. This tutorial uses AIS data published by the Danish Maritime Authority, and the extracted sample of AIS records covers vessel traffic data near Gothenburg on July 5, 2017. This chapter provides valuable insights into vessel traffic by analyzing trajectory data from AIS data.

Step1 Load the data

# 加载数据
df = read_file('data/ais.gpkg')
# 查看数据
# Timestamp: 时间戳,指示AIS数据记录的时间。它表示了记录被创建或接收的日期和时间。
# MMSI: 船舶识别码(Maritime Mobile Service Identity),是一个唯一的数字标识符。
# NavStatus: 导航状态,指示船舶当前的导航状态或活动状态。
# SOG: 对地航速(Speed Over Ground),指船舶相对于地面的速度。
# COG: 对地航向(Course Over Ground),表示船舶相对于地面的航向。
# Name: 船名,是船舶的名称或标识符。
# ShipType: 船舶类型,表示船舶的类别或类型。
# geometry: 几何信息,表示船舶的空间位置。
df.head()           
Timestamp MMSI NavStatus UNDERTOW COG Name ShipType geometry
05/07/2017 00:00:03 219632000 Under way using engine 0.0 270.4 In Undefined POINT (11.85958 57.68817)
1 05/07/2017 00:00:05 265650970 Under way using engine 0.0 0.5 In Undefined POINT (11.84175 57.66150)
2 05/07/2017 00:00:06 265503900 Under way using engine 0.0 0.0 In Undefined POINT (11.90650 57.69077)
3 05/07/2017 00:00:14 219632000 Under way using engine 0.0 188.4 In Undefined POINT (11.85958 57.68817)
4 05/07/2017 00:00:19 265519650 Under way using engine 0.0 357.2 In Undefined POINT (11.87192 57.68233)
# 查看数据的坐标系
df_crs = df.crs
df_crs           
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich           
# 可视化数据点
df.plot()           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example
# 查看各船舶的对地航速
df['SOG'].hist(bins=100, figsize=(12,3))           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

For data analysis, first delete those samples with a ground speed of 0:

# 打印数据维度
print("原始数据样本数:{}".format(len(df)))
df = df[df.SOG>0]
print("删除对地航速为0后数据样本数:{}".format(len(df)))           
原始数据样本数:84702
删除对地航速为0后数据样本数:33593           

See what types of vessels there are:

df['ShipType'].value_counts().plot(kind='bar', figsize=(12,3))           
<Axes: xlabel='ShipType'>           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example
df['t'] = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M:%S')
# 创建轨迹,轨迹数据点数最少为100
traj_collection = mpd.TrajectoryCollection(df, 'MMSI', t='t', min_length=100)
# 压缩轨迹
traj_collection = mpd.MinTimeDeltaGeneralizer(traj_collection).generalize(tolerance=timedelta(minutes=1))
traj_collection           
TrajectoryCollection with 77 trajectories           

Step 2 Data visualization

We can show the trajectory of different types of vessels. For details on how to use the background map added in it, see GeoPandas Overlay Background Map.

shiptype_to_color = {'Passenger': 'blue', 'HSC': 'green', 'Tanker': 'red', 'Cargo': 'orange'}
ax = traj_collection.plot(column='ShipType', column_to_color=shiptype_to_color, linewidth=1, capstyle='round')
# 添加背景地图,zoom越大越精细,这里使用自适应zoom。
cx.add_basemap(ax = ax , source=cx.providers.Stamen.TonerLite,crs=traj_collection.trajectories[0].crs, zoom='auto',attribution="")           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example
# 单独展示某种类型船只的航线轨迹
passenger = traj_collection.filter('ShipType', 'Passenger')
passenger.plot()           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

You can also display a track individually.

my_traj = traj_collection.trajectories[0]
my_traj.df.head()           
Timestamp MMSI NavStatus UNDERTOW COG Name ShipType geometry
t
2017-07-05 17:32:18 05/07/2017 17:32:18 210035000 Under way using engine 9.8 52.8 NORDIC HAMBURG Cargo POINT (11.80462 57.67612)
2017-07-05 17:33:18 05/07/2017 17:33:18 210035000 Under way using engine 9.5 58.9 NORDIC HAMBURG Cargo POINT (11.80875 57.67773)
2017-07-05 17:34:18 05/07/2017 17:34:18 210035000 Under way using engine 9.3 70.5 NORDIC HAMBURG Cargo POINT (11.81311 57.67879)
2017-07-05 17:35:28 05/07/2017 17:35:28 210035000 Under way using engine 9.5 71.1 NORDIC HAMBURG Cargo POINT (11.81855 57.67968)
2017-07-05 17:36:28 05/07/2017 17:36:28 210035000 Under way using engine 9.4 71.3 NORDIC HAMBURG Cargo POINT (11.82334 57.68044)
# 绘制该轨迹
ax = my_traj.plot() 
ax.set_title('Trajectory {}'.format(my_traj.id))           
Text(0.5, 1.0, 'Trajectory 210035000')           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

We can also look at the trajectory of the region of interest.

area_of_interest = Polygon([(11.89935, 57.69270), (11.90161, 57.68902), (11.90334, 57.68967), (11.90104, 57.69354)
, (11.89935, 57.69270)])
# 返回与感兴趣区域相交的轨迹
intersecting = traj_collection.get_intersecting(area_of_interest)
intersecting           
TrajectoryCollection with 20 trajectories           
trips = mpd.ObservationGapSplitter(passenger).split(gap=timedelta(minutes=5))
trips.plot()           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

STEP3 Region of Interest Analysis

We can identify an area of interest, such as the Maritime Administration. The time and type of vessels leaving or arriving in the area are then analyzed.

# 对于单个轨迹,如果其两个连续观测超过间隔gap,如5分钟,则认为该轨迹需要拆分为
trips = mpd.ObservationGapSplitter(traj_collection).split(gap=timedelta(minutes=5))
# 设置感兴趣区域,如海事局的坐标
area_of_interest = Polygon([(11.86815, 57.68273), (11.86992, 57.68047), (11.87419, 57.68140), (11.87288, 57.68348), (11.86815, 57.68273)])
# 获得轨迹起点在感兴趣区域的轨迹
departures = [traj for traj in trips if traj.get_start_location().intersects(area_of_interest) and traj.get_length() > 100]           
# 合并轨迹
tc = mpd.TrajectoryCollection(departures)
ax = tc.plot()
gpd.GeoSeries(area_of_interest).plot(ax=ax,color='red')           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

This way we can see when the ships depart from the area.

for traj in departures:
    print(f"{traj.df['ShipType'].iloc[0]} vessel '{traj.df['Name'].iloc[0]}' departed at {traj.get_start_time()}")           
Law enforcement vessel 'KBV 010' departed at 2017-07-05 10:36:03
Law enforcement vessel 'KBV 010' departed at 2017-07-05 14:33:02
Law enforcement vessel 'KBV 048' departed at 2017-07-05 10:20:44
Pilot vessel 'PILOT 794 SE' departed at 2017-07-05 01:21:07
Pilot vessel 'PILOT 794 SE' departed at 2017-07-05 04:15:04
Pilot vessel 'PILOT 794 SE' departed at 2017-07-05 06:58:56
Pilot vessel 'PILOT 794 SE' departed at 2017-07-05 08:45:08
Pilot vessel 'PILOT 794 SE' departed at 2017-07-05 12:02:18
Pilot vessel 'PILOT 794 SE' departed at 2017-07-05 13:34:42
Pilot vessel 'PILOT 794 SE' departed at 2017-07-05 22:32:47
Pilot vessel 'PILOT 218 SE' departed at 2017-07-05 09:27:24
Pilot vessel 'PILOT 218 SE' departed at 2017-07-05 16:10:29           

In addition, we can also get the time of the ships arriving in the area.

arrivals = [traj for traj in trips if traj.get_end_location().intersects(area_of_interest) and traj.get_length() > 100]
print(f"Found {len(arrivals)} arrivals")

for traj in arrivals:
    print(f"{traj.df['ShipType'].iloc[0]} vessel '{traj.df['Name'].iloc[0]}' arrived at {traj.get_end_time()}")           
Found 12 arrivals
Law enforcement vessel 'KBV 010' arrived at 2017-07-05 10:51:03
Law enforcement vessel 'KBV 048' arrived at 2017-07-05 10:26:44
Pilot vessel 'PILOT 794 SE' arrived at 2017-07-05 01:36:56
Pilot vessel 'PILOT 794 SE' arrived at 2017-07-05 04:45:36
Pilot vessel 'PILOT 794 SE' arrived at 2017-07-05 08:16:46
Pilot vessel 'PILOT 794 SE' arrived at 2017-07-05 08:54:34
Pilot vessel 'PILOT 794 SE' arrived at 2017-07-05 13:06:37
Pilot vessel 'PILOT 794 SE' arrived at 2017-07-05 16:44:06
Pilot vessel 'PILOT 794 SE' arrived at 2017-07-05 23:58:49
Pilot vessel 'PILOT 218 SE' arrived at 2017-07-05 10:07:23
Pilot vessel 'PILOT 218 SE' arrived at 2017-07-05 17:46:12
Tanker vessel 'DANA' arrived at 2017-07-05 08:35:42           

Step4 Cluster analysis at the beginning of the trajectory

# 加载sklearn
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint           

Firstly, the XY coordinates of the starting point of each trajectory are extracted.

origins = trips.get_start_locations()
origins['lat'] = origins.geometry.y
origins['lon'] = origins.geometry.x
matrix = origins[['lat','lon']].values
matrix.shape           
(302, 2)           

The individual starting points are then clustered.

# 经纬度距离换算:每弧度距离约为6371.0088公里
kms_per_radian = 6371.0088

# DBSCAN的邻域半径设置:根据实际情况,将0.1公里转换为对应的弧度值
epsilon = 0.1 / kms_per_radian

# 使用Ball Tree算法和haversine(球面距离)作为度量方式,对经纬度数据进行DBSCAN聚类
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(matrix))

# 获取DBSCAN聚类结果的标签
cluster_labels = db.labels_

# 计算聚类的数量
num_clusters = len(set(cluster_labels))

# 将聚类的数据划分到不同的簇中,保存为pandas的Series数据结构
clusters = pd.Series([matrix[cluster_labels == n] for n in range(num_clusters)])

# 输出聚类的数量
print(f'聚类的数量:{num_clusters}')           
聚类的数量:69           

Extract the center of the starting point clustering results.

# 将聚类标签添加到origins数据中
origins['cluster'] = cluster_labels

# 定义一个函数,用于获取簇中心点
def get_centermost_point(cluster):
    # 计算簇的质心
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    # 找到距离质心最近的点作为簇的中心点
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    # 返回中心点的坐标(经度,纬度)构成的Point对象
    return Point(tuple(centermost_point)[1], tuple(centermost_point)[0])

# 对每个簇应用get_centermost_point函数,得到各个簇的中心点集合
centermost_points = clusters.map(get_centermost_point)
centermost_points.shape           
(69,)           

Summarize the information for the clustered result center.

# 通过cluster列对origins数据框进行分组,得到一个按照簇(cluster)标签进行分组的DataFrame
origins_by_cluster = pd.DataFrame(origins).groupby(['cluster'])

# 创建一个新的DataFrame,用于存储每个簇的汇总信息
summary = origins_by_cluster['ShipType'].unique().to_frame(name='types')

# 在汇总DataFrame中添加n列,表示每个簇中数据点的数量
summary['n'] = origins_by_cluster.size()

# 在汇总DataFrame中添加sog列,表示每个簇中数据点的SOG(Speed Over Ground)平均值
summary['sog'] = origins_by_cluster['SOG'].mean()

# 在汇总DataFrame中添加geometry列,表示每个簇的中心点坐标
summary['geometry'] = centermost_points

# 从汇总DataFrame中移除数据点数量小于1的簇,并按照n列进行降序排序
summary = summary[summary['n']>1].sort_values(by='n', ascending=False)

# 显示汇总DataFrame中前几行的数据
summary.head()           
types n undertow geometry
cluster
5 [Tanker, Passenger, Undefined, Fishing, Cargo] 52 9.217308 POINT (11.911787 57.69663)
28 [Passenger, Undefined, HSC] 47 0.804255 POINT (11.84232 57.661593)
[Cargo, Tanker, Tug, Passenger] 28 11.946429 POINT (11.80495 57.676108)
27 [Passenger, Undefined, HSC] 24 15.9875 POINT (11.819332 57.648027)
11 [SAR, Passenger] 19 10.736842 POINT (11.804653 57.654408)

We can view information about individual vessels in a cluster.

cluster_of_interest_id = 28
ax = origins[origins['cluster']==cluster_of_interest_id].plot( column='ShipType')           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

Finally, we can plot the trajectory information and the center of the clustering of the starting points of each trajectory in the graph.

fig, ax = plt.subplots(figsize=(9, 6))
ax = trips.plot(ax=ax, color='gray', line_width=1) 
# 添加簇中心
ax = GeoDataFrame(summary,crs=df_crs).plot(ax=ax, column='sog', cmap='RdYlGn', markersize=summary['n'].values*3,zorder=2)
# 添加背景地图,zoom越大越精细,这里使用自适应zoom。
cx.add_basemap(ax = ax , source=cx.providers.Stamen.TonerLite,crs=my_traj.df.crs, zoom='auto',attribution="")
cx.add_basemap(ax = ax , source=cx.providers.Stamen.TonerLabels,crs=my_traj.df.crs, zoom='auto',attribution="")
# 保存图片
fig.savefig("res.jpg",dpi=300)           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

2 Example of data analysis of horse collars

A horse collar is a device used to tame and control horses. It is usually a ring made of leather, nylon, or other sturdy material to wrap around the horse's neck. Horse collars are designed to make it easier for horses to be controlled by the rider and move according to the rider's instructions. This chapter analyzes horse behavior based on horse collar tracking data provided by the University of Copenhagen and a municipal center for technology and the environment in Denmark, and its analysis methods can be used for other similar datasets.

Step 1Data Import

Load the data, and remove the columns you don't need.

df = read_file('data/horse_collar.gpkg')
df['t'] = pd.to_datetime(df['timestamp'])
df = df.set_index('t').tz_localize(None)
# 去除不使用的列
df = df.drop(columns=['LMT_Date', 'LMT_Time',
       'Origin', 'SCTS_Date', 'SCTS_Time', 'Latitude [?]', 'Longitude [?]',
       'FixType', 'Main [V]', 'Beacon [V]', 'Sats', 'Sat',
       'C/N', 'Sat_1', 'C/N_1', 'Sat_2', 'C/N_2', 'Sat_3', 'C/N_3', 'Sat_4',
       'C/N_4', 'Sat_5', 'C/N_5', 'Sat_6', 'C/N_6', 'Sat_7', 'C/N_7', 'Sat_8',
       'C/N_8', 'Sat_9', 'C/N_9', 'Sat_10', 'C/N_10', 'Sat_11', 'C/N_11',
       'Easting', 'Northing',], axis=1)

# 查看数据
# No: 每条记录的序号。
# CollarID: 马颈圈的唯一标识符,每一个马颈圈代表一匹马
# UTC_Date: 记录数据的日期,使用世界协调时间(UTC)标准表示。
# UTC_Time: 记录数据的时间,使用世界协调时间(UTC)标准表示。
# lat: 数据点的纬度信息。
# long: 数据点的经度信息。
# Mort. Status: 
# Temp [?C]: 马颈圈数据点的温度。
# Activity: 马颈圈数据点的活动级别。
# timestamp: 记录数据的时间戳。
# geometry: 马颈圈数据点的GeoPandas位置几何信息。
df.head()           
No CollarID UTC_Date UTC_Time years long Mort. Status Temp [? C] Activity timestamp geometry
t
2018-11-14 12:01:08 299 30788 14-11-2018 12:01:08 54.743331 11.916987 In 22.0 In 2018-11-14 12:01:08 POINT (687757.574 6070134.334)
2018-11-14 12:15:09 300 30788 14-11-2018 12:15:09 54.676884 11.910876 In 22.0 In 2018-11-14 12:15:09 POINT (687671.088 6062727.428)
2018-11-14 12:30:08 301 30788 14-11-2018 12:30:08 54.627018 11.957852 In 21.0 In 2018-11-14 12:30:08 POINT (690932.614 6057307.716)
2018-11-14 13:00:33 302 30788 14-11-2018 13:00:33 54.625893 11.953686 In 17.0 In 2018-11-14 13:00:33 POINT (690669.038 6057171.248)
2018-11-14 13:30:09 303 30788 14-11-2018 13:30:09 54.626171 11.954280 In 17.0 In 2018-11-14 13:30:09 POINT (690706.056 6057203.814)
# 查看数据维度
df.shape           
(17517, 11)           
# 查看数据的坐标系
original_crs  = df.crs
original_crs           
<Projected CRS: EPSG:25832>
Name: ETRS89 / UTM zone 32N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Europe between 6°E and 12°E: Austria; Belgium; Denmark - onshore and offshore; Germany - onshore and offshore; Norway including - onshore and offshore; Spain - offshore.
- bounds: (6.0, 38.76, 12.01, 84.33)
Coordinate Operation:
- name: UTM zone 32N
- method: Transverse Mercator
Datum: European Terrestrial Reference System 1989 ensemble
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich           
# 只有一个马颈圈标识符
print(df['CollarID'].unique())
collar_id = df['CollarID'].unique()[0]           
[30788]           
# 只有一个马颈圈运行状态
df['Activity'].unique()           
array([nan])           

Step 2 Data Overview

This step checks whether the data is logical.

Location at a glance

The coordinates of the data points in the horse collar dataset should be relatively close. Following this rule, positional outliers can be removed.

# EPSG:4326是常用的经纬度坐标系WGS84
df.to_crs({'init': 'EPSG:4326'}).plot()           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

We can see that there are two coordinates of the collar that are far apart from the coordinates of the other collar, and we need to delete these two data to remove the outliers.

# 将数据按照数据点经度从小到大排列
# 经度最高的两个点即为离群点
df.sort_values('lat').tail(2)           
No CollarID UTC_Date UTC_Time years long Mort. Status Temp [? C] Activity timestamp geometry
t
2018-11-14 12:15:09 300 30788 14-11-2018 12:15:09 54.676884 11.910876 In 22.0 In 2018-11-14 12:15:09 POINT (687671.088 6062727.428)
2018-11-14 12:01:08 299 30788 14-11-2018 12:01:08 54.743331 11.916987 In 22.0 In 2018-11-14 12:01:08 POINT (687757.574 6070134.334)

Remove these two outliers in this way.

df = df.sort_values('lat')[:-2]
# 按照编号排序
df = df.sort_values('No')
df.head()           
No CollarID UTC_Date UTC_Time years long Mort. Status Temp [? C] Activity timestamp geometry
t
2018-11-14 12:30:08 301 30788 14-11-2018 12:30:08 54.627018 11.957852 In 21.0 In 2018-11-14 12:30:08 POINT (690932.614 6057307.716)
2018-11-14 13:00:33 302 30788 14-11-2018 13:00:33 54.625893 11.953686 In 17.0 In 2018-11-14 13:00:33 POINT (690669.038 6057171.248)
2018-11-14 13:30:09 303 30788 14-11-2018 13:30:09 54.626171 11.954280 In 17.0 In 2018-11-14 13:30:09 POINT (690706.056 6057203.814)
2018-11-14 14:00:38 304 30788 14-11-2018 14:00:38 54.626167 11.954662 In 17.0 In 2018-11-14 14:00:38 POINT (690730.771 6057204.431)
2018-11-14 14:30:08 305 30788 14-11-2018 14:30:08 54.626427 11.955650 In 18.0 In 2018-11-14 14:30:08 POINT (690793.288 6057235.998)

Plot the data after removing outliers.

df.to_crs({'init': 'EPSG:4326'}).plot()           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

Finally, we can roughly calculate the area of the herd.

temp = df.to_crs(original_crs)
# 膨胀数据
# 对temp DataFrame中的几何对象进行缓冲处理。
# 对geometry列中的每个几何对象应用buffer(5),即在每个几何对象周围创建一个半径为5的缓冲区。
temp['geometry'] = temp['geometry'].buffer(5)
# dissolve将所有项按照CollarID列的信息合并
# area计算面积,original_crs使用的单位是米,那么这里面积的单位应该是平方米
total_area = temp.dissolve(by='CollarID').area 
# 将平方米转换为平方公顷(平方百米)
total_area = total_area[collar_id]/10000
print("数据覆盖的总面积为: {:,.2f} 公顷".format(total_area))           
数据覆盖的总面积为: 65.19 公顷           

Time overview

# 查看数据点的时间范围
print("数据点的时间范围为 {} 到 {}.".format(df.index.min(), df.index.max()))           
数据点的时间范围为 2018-11-14 12:30:08 到 2019-11-07 05:30:10.           
# 查看数据点的时间跨度
print("数据点的时间跨度为 {}".format(df.index.max() - df.index.min()))           
数据点的时间跨度为 357 days 17:00:02           
# 统计每一天的马颈圈数据
df['No'].resample('1d').count()           
t
2018-11-14    23
2018-11-15    66
2018-11-16    58
2018-11-17    70
2018-11-18    79
              ..
2019-11-03    48
2019-11-04    48
2019-11-05    48
2019-11-06    48
2019-11-07    12
Freq: D, Name: No, Length: 359, dtype: int64           
# 可视化数据量
df['No'].resample('1d').count().plot()           
<Axes: xlabel='t'>           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

In addition, you can further count the number of samples for each month, and the code is as follows:

# 添加年份-月份列
df['Y-M'] = df.index.to_period('M')    
for i in df['Y-M'].unique():
    print("时间{},样本数{}".format(i,len(df[df['Y-M']==i])))           
时间2018-11,样本数938
时间2018-12,样本数1486
时间2019-01,样本数1489
时间2019-02,样本数1346
时间2019-03,样本数1488
时间2019-04,样本数1441
时间2019-05,样本数1650
时间2019-06,样本数1474
时间2019-07,样本数1556
时间2019-08,样本数1499
时间2019-09,样本数1439
时间2019-10,样本数1407
时间2019-11,样本数302           

Step3 Data analysis

Sampling interval analysis

For sampling data, it is important to determine whether the data is spaced or not. If it's interval sampling, what is the sampling interval?

# 重置索引并提取't'列
t = df.reset_index()['t']
# 计算时间间隔,当每一项数据与前一项数据的时间差
df['delta_t'] = t.diff().values
# 将时间间隔转换为分钟
df['delta_t'] = df['delta_t'].dt.total_seconds() / 60
df['delta_t']           
t
2018-11-14 12:30:08          NaN
2018-11-14 13:00:33    30.416667
2018-11-14 13:30:09    29.600000
2018-11-14 14:00:38    30.483333
2018-11-14 14:30:08    29.500000
                         ...    
2019-11-07 03:30:38    30.466667
2019-11-07 04:00:09    29.516667
2019-11-07 04:30:08    29.983333
2019-11-07 05:00:10    30.033333
2019-11-07 05:30:10    30.000000
Name: delta_t, Length: 17515, dtype: float64           

Draw a time interval histogram below to see which value is in the interval set. The result is that the sampling method is uniform, and the sampling interval is usually about 30 minutes.

# 绘制时间间隔柱状图
# bins=60表示将数据范围划分成60个等宽的区间,每一个bin表示一个分钟
#range=(0, 60)表示设置直方图显示的数据范围在0到60之间。
df['delta_t'].hist(bins=60, range=(0, 60))           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

Movement speed and direction analysis

The maximum speed of movement is used to determine whether unattainable speeds are included in the data.

# 'CollarID'只有一个,所以轨迹只有一个
tc = mpd.TrajectoryCollection(df, 'CollarID')
traj = tc.trajectories[0]
# 添加速度,单位是坐标系单位/s
traj.add_speed()
# 计算速度的最大值
max_speed = traj.df.speed.max()
print("The highest computed speed is {:,.2f} m/s ({:,.2f} km/h)".format(max_speed, max_speed*3600/1000))           
The highest computed speed is 0.82 m/s (2.94 km/h)           

Based on the velocity, we can look at the velocity distribution as follows:

pd.DataFrame(traj.df)['speed'].hist(bins=100)           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

In turn, we can also analyze the distribution of the direction of motion.

traj.add_direction(overwrite=True)
pd.DataFrame(traj.df)['direction'].hist(bins=360)           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

Time trend analysis

If we want to understand the relationship between horse movement trends and time, we can do so with the following code.

# 分析时间和速度的关系
tmp = pd.DataFrame(traj.df)[['speed']]
tmp = tmp.reindex(pd.DataFrame(traj.df).index)
# 生成小时和月份列
tmp["hour"] = tmp.index.hour
tmp["month"] = tmp.index.month
tmp.head()           
speed hour month
t
2018-11-14 12:30:08 0.162635 12 11
2018-11-14 13:00:33 0.162635 13 11
2018-11-14 13:30:09 0.027761 13 11
2018-11-14 14:00:38 0.013517 14 11
2018-11-14 14:30:08 0.039568 14 11

Then we can aggregate the average speed of horses in different months and hours. It was found that the speed of movement by hour showed a significant difference in the number of years in which horses moved earlier and over a larger time span in the summer and later and less in the winter.

# 使用pivot_table函数计算平均速度在不同小时和月份的汇总数据
pivot_table = tmp.pivot_table(values='speed', index='hour', columns='month', aggfunc='mean')
# 创建热图
plt.figure(figsize=(9, 12))
# 横轴表示月份,纵轴表示小时,
sns.heatmap(pivot_table, annot=False, cmap='coolwarm')           
<Axes: xlabel='month', ylabel='hour'>           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

Temperature trend analysis

In addition to time, the dataset also contains temperature information for each record, which we can also analyze.

# 分析温度和速度的关系
tmp = pd.DataFrame(traj.df)[['Temp [?C]', 'speed']]
tmp = tmp.reindex(pd.DataFrame(traj.df).index)
# 生成月份列
tmp["month"] = tmp.index.month
tmp.head()           
Temp [? C] speed month
t
2018-11-14 12:30:08 21.0 0.162635 11
2018-11-14 13:00:33 17.0 0.162635 11
2018-11-14 13:30:09 17.0 0.027761 11
2018-11-14 14:00:38 17.0 0.013517 11
2018-11-14 14:30:08 18.0 0.039568 11

As you can see in the data above, each row represents a certain temperature at a certain time in a month. In this case, you can add a column to indicate the number of times, but it is all once.

tmp['n'] = 1
tmp.head()           
Temp [? C] speed month n
t
2018-11-14 12:30:08 21.0 0.162635 11 1
2018-11-14 13:00:33 17.0 0.162635 11 1
2018-11-14 13:30:09 17.0 0.027761 11 1
2018-11-14 14:00:38 17.0 0.013517 11 1
2018-11-14 14:30:08 18.0 0.039568 11 1

Based on the temperature and month, we can count the number of times each temperature occurs in each month, as follows:

# 使用pivot_table函数统计不同月份下不同温度值出现次数
pivot_table = tmp.pivot_table(values='n', index='Temp [?C]', columns='month', aggfunc='sum')
# 创建热图
plt.figure(figsize=(9, 12))
# 横轴表示月份,纵轴表示温度值
sns.heatmap(pivot_table, annot=False, cmap='coolwarm')           
<Axes: xlabel='month', ylabel='Temp [?C]'>           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

Similarly, we can calculate the average of the velocity at different temperatures in different months.

# 使用pivot_table函数统计不同月份下不同温度值下速度平均值
pivot_table = tmp.pivot_table(values='speed', index='Temp [?C]', columns='month', aggfunc='mean')
# 创建热图
plt.figure(figsize=(9, 12))
# 横轴表示月份,纵轴表示温度值
sns.heatmap(pivot_table, annot=False, cmap='coolwarm')           
<Axes: xlabel='month', ylabel='Temp [?C]'>           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

STEP3 Trajectory analysis

This step focuses on the analysis of individual trajectories. Therefore, it is necessary to split the continuous trajectories into separate trajectories. The results of the analysis depend on how the continuous flow is divided into trajectories, locations, and events.

Trajectory length analysis

# 按天为单位分割轨迹
daily = mpd.TemporalSplitter(tc).split(mode='day')
daily           
TrajectoryCollection with 358 trajectories           
# 计算每条轨迹的长度
daily_lengths = [traj.get_length() for traj in daily]
# 计算每条轨迹的开始时间
daily_t = [traj.get_start_time() for traj in daily]           
# 建立轨迹开始时间和长度的dataframe
daily_lengths = pd.DataFrame(daily_lengths, index=daily_t, columns=['length'])
daily_lengths.head()           
length
2018-11-14 12:30:08 1090.598526
2018-11-15 00:00:08 4219.980813
2018-11-16 00:00:10 3198.209140
2018-11-17 00:00:09 4307.483500
2018-11-18 00:00:09 3548.902314

Visualizing the relationship between the start time and the length of the trajectory, we can see that the seasonal trend is very consistent with the previously discovered pattern of seasonal movement speed: winter tracks tend to be shorter than summer tracks.

daily_lengths.plot()           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

Dwell detection analysis

We can analyze the trajectory of a certain day separately and extract the dwell segment of the trajectory by dwell detection as follows:

# 设置最大直径为100
MAX_DIAMETER = 100

# 设置最小持续时间为3小时
MIN_DURATION = timedelta(hours=3)

# 获取指定日期(2018年11月17日)的轨迹数据,并赋值给变量one_day
one_day = daily.get_trajectory('30788_2018-11-17 00:00:00')

# 使用停留段检测器(mpd.TrajectoryStopDetector)获取指定轨迹数据的停留段,
# 并根据最小持续时间和最大直径进行筛选,结果存储在变量one_day_stops中
one_day_stops = mpd.TrajectoryStopDetector(one_day).get_stop_segments(
    min_duration=MIN_DURATION, max_diameter=MAX_DIAMETER)           

The code for drawing the trajectory stop visualization is as follows:

# 绘制轨迹
# 创建一个绘图区域(ax),并将指定日期(one_day)的轨迹数据以灰色(slategray)进行绘制
ax = one_day.plot(color='slategray')
# 将停留段数据(one_day_stops)以深粉色(deeppink)的颜色进行绘制
ax = one_day_stops.plot(ax=ax, color='deeppink')
# 获取停留段的起始位置,并将这些起始位置以红色(red)进行绘制
ax = one_day_stops.get_start_locations().plot(ax=ax, color='red')           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

We can also perform stop recognition for all time periods, the code is as follows:

# 获得停留轨迹点
stops = mpd.TrajectoryStopDetector(tc).get_stop_points(min_duration=MIN_DURATION, max_diameter=MAX_DIAMETER)
len(stops)           
362           
# 绘制这些停留的轨迹点
stops.plot(color='deeppink',alpha=0.2)           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

In addition, you can plot the number of dwells for each dwell time:

stops['duration_h'] = (stops['end_time']-stops['start_time']).dt.total_seconds() / 3600
# 横轴为停留时间,纵轴为停留次数
pd.DataFrame(stops)['duration_h'].hist(bins=12)           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

3 Examples of football analysis

This tutorial uses data extracted from video footage of Liverpool's football matches, with each row representing a point in time for the Liverpool team in a match in 2019. The dataset can be downloaded from liverpool_2019.csv.

STEP1 Data processing

input_file = "data/liverpool_2019.csv"
df = pd.read_csv(input_file)
# 删除列
df.drop(columns=['Unnamed: 0'], inplace=True)
print("样本数",len(df))
df.tail()           
样本数 74936           
bgcolor dx two edgecolor frame play player player_num team x and with
74931 blue 0.0 0.0 white 120 Leicester 0 - [3] Liverpool 10267 In defense 98.724826 53.720353 0.0
74932 blue 0.0 0.0 white 121 Leicester 0 - [3] Liverpool 10267 In defense 98.724826 53.720353 0.0
74933 blue 0.0 0.0 white 122 Leicester 0 - [3] Liverpool 10267 In defense 98.724826 53.720353 0.0
74934 blue 0.0 0.0 white 123 Leicester 0 - [3] Liverpool 10267 In defense 98.724826 53.720353 0.0
74935 blue 0.0 0.0 white 124 Leicester 0 - [3] Liverpool 10267 In defense 98.724826 53.720353 0.0

The column names of the read data are explained as follows:

  • play: The score situation after the goal is scored. The team that scored the goal is placed next to the brackets.
  • frame: the frame number of the current position. The data provided has 20 frames per second.
  • player: The player's number. This number is consistent from race to match, but may vary from race to match.
  • player_num: The player's jersey number. This number is the official number and has not changed in the 2019 Liverpool team.
  • x, y: the coordinates of the player/football. The court coordinates range from 0 to 100 on each axis.
  • dx, dy: the change in (x,y) coordinates from the previous frame to the current frame.
  • z: height, from 0 to 1.5 (only the height information of the football is filled).
  • bgcolor: The main color of the team (used as the background color).
  • edgecolor: The secondary color (used as the edge color).

Further we can process the data as follows:

# 获取唯一plays列表
plays = list(df.play.unique())
# 获得时间戳
def to_timestamp(row):
    day = plays.index(row.play) + 1
    start_time = datetime(2019, 1, day, 12, 0, 0)
    # 将frame转换为时间戳
    td = timedelta(milliseconds=1000 / 20 * row.frame)
    return start_time + td

# 将frame转换为时间戳,并添加到DataFrame的新列'time'中
df['time'] = df.apply(to_timestamp, axis=1)
# 将'time'列设为索引,以便后续使用时间作为索引来访问数据
df.set_index('time', inplace=True)

# 根据足球场地的标准尺寸,将x和y坐标缩放到合适的范围
# 许多职业球队的足球场地标准尺寸为105米乘以68米
pitch_length = 105
pitch_width = 68
df.x = df.x / 100 * pitch_length 
df.y = df.y / 100 * pitch_width

# 将以下列转化为分类数据
df['team'] = df['team'].astype('category').cat.as_ordered()
df['player'] = df['player'].astype('category').cat.as_ordered()
df['player_num'] = df['player_num'].astype('category').cat.as_ordered()

df.tail()           
bgcolor dx two edgecolor frame play player player_num team x and with
time
2019-01-19 12:00:06.000 blue 0.0 0.0 white 120 Leicester 0 - [3] Liverpool 10267 In defense 103.661067 36.52984 0.0
2019-01-19 12:00:06.050 blue 0.0 0.0 white 121 Leicester 0 - [3] Liverpool 10267 In defense 103.661067 36.52984 0.0
2019-01-19 12:00:06.100 blue 0.0 0.0 white 122 Leicester 0 - [3] Liverpool 10267 In defense 103.661067 36.52984 0.0
2019-01-19 12:00:06.150 blue 0.0 0.0 white 123 Leicester 0 - [3] Liverpool 10267 In defense 103.661067 36.52984 0.0
2019-01-19 12:00:06.200 blue 0.0 0.0 white 124 Leicester 0 - [3] Liverpool 10267 In defense 103.661067 36.52984 0.0

Once you have the data, you can convert it into a trajectory:

CRS = None
tc = mpd.TrajectoryCollection(df, 'player', x='x', y='y', crs=CRS)
tc           
TrajectoryCollection with 364 trajectories           

Trajectory analysis

We can analyze the trajectory of a single player as follows:

# 获得球员轨迹
PLAY = 2
play_trajs = tc.filter('play', plays[PLAY])
play_trajs           
TrajectoryCollection with 20 trajectories           
# 绘制其进攻或防守轨迹
play_trajs.plot(column='team', colormap={'attack':'hotpink', 'defense':'turquoise'})           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example
# 简化轨迹
generalized = mpd.MinTimeDeltaGeneralizer(play_trajs).generalize(tolerance=timedelta(seconds=0.5))
generalized.add_speed()
# 绘制轨迹速度
generalized.plot(column='speed',legend=True)           
<Axes: >           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example
# 添加完整轨迹数据
ax = generalized.plot(column='team', colormap={'attack':'limegreen', 'defense':'purple'})
ax = generalized.get_start_locations().plot(ax = ax ,label='start', color='orange',zorder=2)
import matplotlib.image as mpimg

# 添加背景图片
# 下载地址https://github.com/movingpandas/movingpandas/raw/main/tutorials/data/soccer_field.png
img = mpimg.imread('data/soccer_field.png')
ax.imshow(img, extent=[0, pitch_length, 0, pitch_width])           
<matplotlib.image.AxesImage at 0x7f0a747a1ed0>           
[Data Analysis & Visualization] Python Plotting Data Map 5 - MovingPandas Plot Example

4 See also

  • ​​MovingPandas入门指北​​
  • ​​liverpool_2019.csv​​
  • ​​movingpandas​​
  • ​​movingpandas-examples​​
  • ​​movingpandas-examples-data​​
  • GeoPandas overlay background map

Read on