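E-commerce Transaction Data: Cleaning and Analysis

Data source: a CSV file of transaction data from an e-commerce platform, which needs to be cleaned and analyzed.

Tools: Python (matplotlib, NumPy, pandas), run in Jupyter.

Data cleaning

Load the libraries needed for the analysis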

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

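Load the data. Before loading, open the file in a text editor to check its format: the header row, the delimiter, and so on.

df = pd.read_csv('./order_info_2016.csv', index_col='id')  # index_col='id' uses the id column as the row index
df.index.name = None  # drop the name of the row index
df.head()  # look at the first 5 rows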

After loading, use the describe and info methods to look at the value distribution and the column types.
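
The two calls can be sketched as follows. The small DataFrame is invented stand-in data, not the real order_info_2016.csv; only the column names come from the dataset, and the channelId values are assumed to be strings purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real file; every value here is invented.
toy = pd.DataFrame({
    'orderId': [1001, 1002, 1003, 1004],
    'userId': [1, 2, 2, 3],
    'price': [1200, 3500, 800, 2600],        # in fen
    'payMoney': [1200, 3000, -800, 2600],    # in fen, with one negative value
    'channelId': ['a', None, 'b', 'a'],      # one missing value
})

summary = toy.describe()   # count, mean, std, min, quartiles, max of the numeric columns
print(summary)
toy.info()                 # dtype and non-null count per column; prints directly

# The missing-value counts from info(), as a Series that can be inspected in code
missing = toy.isnull().sum()
print(missing)
```

In this toy frame, describe exposes the negative payMoney through its min row, and info (or isnull().sum()) exposes the missing channelId, which is the kind of signal the cleaning steps rely on.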

The dataset has 104,557 rows in total. channelId has missing values, which will be handled later.

First, check the dataset as a whole for duplicate rows.
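
A minimal sketch of a whole-row duplicate check, using invented data. Note that DataFrame.duplicated compares column values only and ignores the row index, so with id as the index two rows count as duplicates when all other columns match.

```python
import pandas as pd

# Invented example rows; the third row repeats the second.
toy = pd.DataFrame({
    'orderId': [1001, 1002, 1002, 1003],
    'userId': [1, 2, 2, 3],
    'payMoney': [12.0, 30.0, 30.0, 26.0],
})

dup_mask = toy.duplicated()        # True for each row identical to an earlier row
n_dup_rows = int(dup_mask.sum())   # number of fully duplicated rows
print(n_dup_rows)
```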

orderId

An orderId should be unique within a system, so check whether this column actually is unique.
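
One way to run this check, sketched on an invented Series rather than the real column:

```python
import pandas as pd

# Invented order ids; 1002 appears twice.
order_ids = pd.Series([1001, 1002, 1002, 1003], name='orderId')

is_unique = order_ids.is_unique                       # False as soon as any value repeats
n_repeated = int(order_ids.duplicated().sum())        # repeats beyond the first occurrence
repeated_rows = order_ids[order_ids.duplicated(keep=False)]  # every row involved in a repeat
print(is_unique, n_repeated)
print(repeated_rows)
```

Passing keep=False marks all copies of a repeated value, which makes it easier to compare the conflicting rows side by side before deciding which one to keep.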

This can be verified another way.

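# Comparing the number of distinct values with the total also reveals duplicates
df['orderId'].unique().size  # 104530
df['orderId'].size           # 104557

Duplicates are usually handled last, because cleaning the other columns may remove some of the duplicated rows. The other columns are therefore handled first.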

userId

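For userId, it is enough to confirm from the describe and info output above that its values fall within a normal range. In order data one user may place several orders, so repeated values are legitimate.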

productId

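The minimum productId is 0, so first count the records where it is 0.

There are 177 such records. That is a small number, possibly caused by products being listed and delisted. They will be deleted once the other columns have been handled.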
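
The count mentioned above can be obtained with a boolean filter. This is a sketch on invented values; on the real data the same expression would be applied to df.

```python
import pandas as pd

# Invented productId values; two of them are 0.
toy = pd.DataFrame({'productId': [0, 15, 0, 27, 33]})

zero_rows = toy[toy['productId'] == 0]   # the suspicious records themselves
n_zero = len(zero_rows)                  # same as (toy['productId'] == 0).sum()
print(n_zero)
```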

cityId

cityId is similar to userId: all values are in a normal range, so no processing is needed.

price

price has no null values and every value is greater than 0. Note that the unit is fen (1/100 of a yuan), so convert it to yuan.
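
The checks and the conversion can be sketched as below, mirroring the way payMoney is converted later. The prices are invented; on the real data the same statements would be applied to df.

```python
import pandas as pd

# Invented prices in fen.
toy = pd.DataFrame({'price': [600, 129900, 45000]})

no_nulls = bool(toy['price'].notnull().all())   # no missing prices
all_positive = bool((toy['price'] > 0).all())   # every price above zero

toy['price'] = toy['price'] / 100               # fen -> yuan
print(no_nulls, all_positive)
print(toy)
```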

payMoney

payMoney contains negative values, which make no sense, so these records must be deleted.

# Show the records with a negative payMoney
df[df['payMoney']<0]

# Delete the records with a negative payMoney
df.drop(df[df['payMoney']<0].index,inplace=True)

# Check again
df[df['payMoney']<0]  # empty

# Convert from fen to yuan
df.payMoney = df.payMoney/100

channelId

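According to the info output, channelId has some null values, possibly because a client-side bug or a similar issue left the field unset when the order was placed. Dropping a small number of null rows will not affect the analysis, so they are deleted here.

# Show the rows
df[df['channelId'].isnull()]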

# Delete
df.drop(df[df['channelId'].isnull()].index,inplace=True)

# Check
df[df['channelId'].isnull()]   # empty

createTime & payTime

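Neither createTime nor payTime has null values. The analysis covers 2016 only, so records from before and after 2016 must be deleted. Only the order creation time is used to decide which year a record belongs to.

# Convert createTime and payTime to datetime
df['createTime'] = pd.to_datetime(df['createTime'])
df['payTime'] = pd.to_datetime(df['payTime'])
df[['createTime','payTime']]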

# Define the time window used to filter out-of-range records
import datetime
startTime = datetime.datetime(2016,1,1)
endTime = datetime.datetime(2016,12,31,23,59,59)

# Records created before 2016 must be deleted
df[df['createTime']<startTime]
df.drop(df[df['createTime']<startTime].index,inplace=True)

# Check
df[df['createTime']<startTime] # empty

# Records whose payTime is earlier than createTime are illogical and must be deleted too
df[df['payTime']<df['createTime']]
df.drop(df[df['payTime']<df['createTime']].index,inplace=True)

# Check
df[df['payTime']<df['createTime']] # empty

# Records created after 2016 would also be deleted, but there are none here
df[df['createTime']>endTime]

# Delete records with a repeated orderId (the first occurrence is kept)
df.drop(df[df['orderId'].duplicated()].index,inplace=True)

# Check
df[df['orderId'].duplicated()] # empty

# Delete the records whose productId is 0 as well
df.drop(df[df['productId']==0].index,inplace=True)

# Check
df[df['productId']==0] # empty

Data analysis

Start with an overview

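# Total number of orders
print('Total orders:',df['orderId'].count())
# Total number of customers
print('Total customers:',df['userId'].unique().size)
# Total sales revenue
print('Total revenue:',df['payMoney'].sum())
# Number of products
print('Products:',df['productId'].unique().size)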

Start with the product dimension, grouping by productId

# Top 10 and bottom 10 products by number of orders
productId_orderCount = df.groupby('productId')['orderId'].count().sort_values(ascending=False)
print(productId_orderCount.head(10)) # top 10
print(productId_orderCount.tail(10)) # bottom 10

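Sales revenue

# Sales revenue per product
# Select payMoney before summing: summing every column fails on the datetime columns in recent pandas versions
productId_money = df.groupby('productId')['payMoney'].sum().sort_values(ascending=False)
print(productId_money.head(10))
print(productId_money.tail(10))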

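Look at the intersection of the bottom 100 products by order count and the bottom 100 by revenue. Products that do poorly on both are candidates for improvement or delisting.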

problem_productId = productId_orderCount.tail(100).index.intersection(productId_money.tail(100).index)
print(problem_productId)

The city analysis follows the same pattern as the product dimension

cityId_orderCount = df.groupby('cityId')['orderId'].count().sort_values(ascending=False)
cityId_money = df.groupby('cityId')['payMoney'].sum().sort_values(ascending=False)
print(cityId_orderCount.head(10))
print(cityId_orderCount.tail(10))
print(cityId_money.head(10))
print(cityId_money.tail(10))

price

For price, look at the distribution of prices across all orders to see which price ranges sell best. A histogram shows this well.

The lowest price is 6 and the highest is 22956, so choose bins that cover this range.

bin_arr = np.arange(0,25000,1000)
bins = pd.cut(df['price'],bin_arr)
price_count = df['price'].groupby(bins).count()
print(price_count)

# Histogram
plt.figure(figsize=(16,8))
plt.hist(df['price'],bin_arr)
plt.show()

Many price ranges contain no products. If competitor data is available, check whether products should be added to fill those ranges.

channelId

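The channelId analysis is similar to the productId analysis: it can identify the channels with the highest transaction volume and the most orders. Traffic from a channel often has to be paid for, so each channel's revenue should be weighed against its cost. Channels can also be analyzed together with other dimensions such as product.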
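
A minimal sketch of such a per-channel summary, on invented data. The column names orderCount, revenue, and revenuePerOrder are made up for this example. Channel cost is not part of the dataset, so comparing revenue against cost would need an external source that is not shown here.

```python
import pandas as pd

# Invented orders; channel ids are placeholders.
toy = pd.DataFrame({
    'orderId': [1, 2, 3, 4, 5],
    'channelId': ['a', 'a', 'b', 'c', 'b'],
    'payMoney': [10.0, 20.0, 5.0, 50.0, 5.0],
})

# One row per channel: number of orders and total revenue, highest revenue first
channel_stats = (
    toy.groupby('channelId')
       .agg(orderCount=('orderId', 'count'), revenue=('payMoney', 'sum'))
       .sort_values('revenue', ascending=False)
)
channel_stats['revenuePerOrder'] = channel_stats['revenue'] / channel_stats['orderCount']
print(channel_stats)
```

Putting order count and revenue side by side makes it visible when a channel brings many orders but little money, which matters once traffic cost is taken into account.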

Order time analysis

df['orderHour'] = df['createTime'].dt.hour  # add a column holding the hour of the order
df.groupby('orderHour').count()['orderId'].plot()
plt.show()

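Orders peak at 12:00, 13:00, and 14:00, presumably during the lunch break. Around 20:00 is an ordering peak for almost every internet product. System availability and stability must be ensured during these peaks.

# By day of week (dayofweek: Monday=0 ... Sunday=6)
df['orderWeek'] = df['createTime'].dt.dayofweek
df.groupby('orderWeek').count()['orderId'].sort_values(ascending=False)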

Saturday has the most orders, followed by Sunday and Friday.
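
Reading the result requires knowing that dayofweek numbers the days from Monday=0 to Sunday=6. As a sketch on invented timestamps, day_name can be used to get readable labels instead:

```python
import pandas as pd

# Invented creation times; 2016-01-01 was a Friday.
create_time = pd.Series(pd.to_datetime([
    '2016-01-01 12:30:00',   # Friday
    '2016-01-02 20:05:00',   # Saturday
    '2016-01-02 13:10:00',   # Saturday
    '2016-01-03 09:00:00',   # Sunday
]))

weekday_num = create_time.dt.dayofweek     # Monday=0 ... Sunday=6
weekday_name = create_time.dt.day_name()   # 'Friday', 'Saturday', ...
counts = weekday_name.value_counts()       # orders per weekday, most frequent first
print(counts)
```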

Time from order to payment

# Time between order creation and payment, in seconds
def get_seconds(x):
    return x.total_seconds()
df['payDelta'] = (df['payTime']-df['createTime']).apply(get_seconds)

# apply calls get_seconds on each element and returns the results as a Series
# total_seconds converts a time difference into seconds

bins = [0,50,100,1000,10000,100000]
pd.cut(df['payDelta'],bins).value_counts().plot(kind='pie',autopct='%d%%',shadow=True,figsize=(10,8))
plt.show()

Most users complete payment within ten to twenty minutes, which suggests they buy with a clear purpose.
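
A claim like this can be backed by a single number rather than read off the pie chart. The sketch below uses invented timestamps and the vectorized .dt.total_seconds() accessor in place of apply. The 1000-second threshold matches one of the bin edges used above and is otherwise an arbitrary choice.

```python
import pandas as pd

# Invented order and payment times.
toy = pd.DataFrame({
    'createTime': pd.to_datetime(['2016-03-01 10:00:00', '2016-03-01 11:00:00',
                                  '2016-03-01 12:00:00', '2016-03-01 13:00:00']),
    'payTime': pd.to_datetime(['2016-03-01 10:00:30', '2016-03-01 11:05:00',
                               '2016-03-01 12:10:00', '2016-03-01 15:00:00']),
})

pay_delta = (toy['payTime'] - toy['createTime']).dt.total_seconds()  # seconds per order
share_within_1000s = (pay_delta <= 1000).mean()   # fraction of orders paid within 1000 s
print(share_within_1000s)
```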

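Monthly sales revenue

# resample() regroups time-series data at a new frequency; it is a convenient way to resample regular time series and convert their frequency
# 'M' is the resampling frequency (month end; pandas 2.2+ spells it 'ME'); other examples are '5min' or Second(15). sum() is the aggregation applied to each period
# on='createTime' is required because the row index here is the id column, not a datetime
tur = df.resample('M', on='createTime')['payMoney'].sum()
print(tur)
order_count = df.resample('M', on='createTime')['orderId'].count()
print(order_count)
tur.plot()
plt.show()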

July has the highest revenue and November the lowest.
