pandas Intensive Practice
This article covers the same material better: http://wittyfans.com/coding/%E5%88%A9%E7%94%A8Pandas%E5%88%86%E6%9E%90%E7%BE%8E%E5%9B%BD%E4%BA%A4%E8%AD%A6%E5%BC%80%E6%94%BE%E7%9A%84%E6%90%9C%E6%9F%A5%E6%95%B0%E6%8D%AE.html
import pandas as pd
import matplotlib.pyplot as plt
# The magic below must be declared for plots to render inside the notebook
%matplotlib inline
# Downloaded police stop data; ri stands for the Rhode Island police data
ri = pd.read_csv('police.csv')
ri.head()
stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
ri.shape
(91741, 15)
ri.isnull().sum()
stop_date 0
stop_time 0
county_name 91741
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
Removing a column
ri.head()
stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
# Equivalent to ri.drop('county_name', axis=1, inplace=True)
# Drop the county_name column, which contains only missing values
ri.drop('county_name', axis='columns', inplace=True)
ri.shape
(91741, 14)
ri.columns
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
'driver_age', 'driver_race', 'violation_raw', 'violation',
'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
'stop_duration', 'drugs_related_stop'],
dtype='object')
# Drop every column whose values are all missing (axis='columns' with how='all' acts on columns, not rows)
ri.dropna(axis='columns',how='all').shape
(91741, 14)
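A minimal sketch of the difference between dropping a column by name and dropping columns by missingness. The toy frame and its column names are made up for illustration; column "b" plays the role of county_name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "b" is entirely missing, "c" is partly missing.
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
})

dropped_by_name = toy.drop("b", axis="columns")          # remove one named column
dropped_all_nan = toy.dropna(axis="columns", how="all")  # remove columns that are all NaN
dropped_any_nan = toy.dropna(axis="columns", how="any")  # remove columns with any NaN

print(list(dropped_by_name.columns))
print(list(dropped_all_nan.columns))
print(list(dropped_any_nan.columns))
```

None of these calls modifies toy, because inplace=True is not passed.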
Filtering in pandas
Keep the rows for which a boolean condition is True; here we keep the rows whose violation equals 'Speeding'
ri[ri.violation=='Speeding'].head()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
6 | 2005-04-01 | 17:30 | M | 1969.0 | 36.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
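The filter above works in two steps: the comparison builds a boolean Series, and indexing with that Series keeps the True rows. A small sketch on made-up data (the toy frame is an assumption, not part of police.csv):

```python
import pandas as pd

# Hypothetical toy data with one missing violation.
toy = pd.DataFrame({
    "violation": ["Speeding", "Other", "Speeding", None],
    "driver_gender": ["M", "F", "F", "M"],
})

mask = toy["violation"] == "Speeding"  # boolean Series, False for the missing value
speeding = toy[mask]                   # keeps only rows where mask is True

print(mask.tolist())
print(speeding.index.tolist())
```

The original row labels are preserved, which is why the filtered table above jumps from index 2 to 4 and then 6.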
value_counts
## How many male and female drivers were stopped for speeding
print(ri[ri.violation=='Speeding'].driver_gender.value_counts())
M 32979
F 15482
Name: driver_gender, dtype: int64
# Share of men and women among speeding stops; normalize=True returns proportions instead of counts
print(ri[ri.violation=='Speeding'].driver_gender.value_counts(normalize=True))
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
ri.loc[ri.violation=='Speeding','driver_gender'].value_counts(normalize=True)
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
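A sketch of the two spellings used above, on made-up data: bracket filtering followed by attribute access, and a single .loc call with a row mask and a column label. Both select the same values, so the counts and proportions agree.

```python
import pandas as pd

# Hypothetical toy data: four speeding stops and one other stop.
toy = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Speeding", "Other"],
    "driver_gender": ["M", "M", "M", "F", "F"],
})

is_speeding = toy["violation"] == "Speeding"

counts = toy[is_speeding]["driver_gender"].value_counts()
shares = toy.loc[is_speeding, "driver_gender"].value_counts(normalize=True)

print(counts)
print(shares)
```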
# Share of each violation type among male drivers
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)
Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
# Share of each violation type among female drivers
ri[ri.driver_gender=='F'].violation.value_counts(normalize=True)
Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
Name: violation, dtype: float64
The groupby method
Show the share of each violation value within each driver_gender
# Compare with the two results above
ri.groupby('driver_gender').violation.value_counts(normalize=True)
driver_gender violation
F Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
M Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
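A small sketch, on made-up data, showing that the proportions from a grouped value_counts(normalize=True) are computed within each group, so they sum to 1 per gender rather than across the whole table.

```python
import pandas as pd

# Hypothetical toy data: two stops for F, four for M.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Equipment", "Speeding", "Speeding", "Speeding", "Equipment"],
})

by_gender = toy.groupby("driver_gender")["violation"].value_counts(normalize=True)
totals = by_gender.groupby(level=0).sum()  # sum of shares inside each gender

print(by_gender)
print(totals)
```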
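The mean method
Calling mean() on a boolean column returns the proportion of True values
# True means a search was conducted, False means no search was conducted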
print(ri.search_conducted.value_counts(normalize=True))
False 0.965163
True 0.034837
Name: search_conducted, dtype: float64
# Here mean() gives the proportion of True values
print(ri.search_conducted.mean())
0.03483720473942948
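Why this works: True counts as 1 and False as 0, so the mean is the number of True values divided by the number of rows. A minimal sketch with made-up flags:

```python
import pandas as pd

# Hypothetical flags: one search in four stops.
flags = pd.Series([True, False, False, False])

share_via_mean = flags.mean()               # treats True as 1, False as 0
share_by_hand = flags.sum() / len(flags)    # count of True over number of rows

print(share_via_mean, share_by_hand)
```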
Group by gender and compare the search rates
ri.groupby('driver_gender').search_conducted.mean()
driver_gender
F 0.020033
M 0.043326
Name: search_conducted, dtype: float64
Men are searched at a higher rate than women
Next, look at the search rate for men and women under a multi-level grouping
ri.groupby(['violation','driver_gender']).search_conducted.mean()
violation driver_gender
Equipment F 0.042622
M 0.070081
Moving violation F 0.036205
M 0.059831
Other F 0.056522
M 0.047146
Registration/plates F 0.066140
M 0.110376
Seat belt F 0.012598
M 0.037980
Speeding F 0.008720
M 0.024925
Name: search_conducted, dtype: float64
ri.isnull().sum()
stop_date 0
stop_time 0
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
# Is search_type always missing when search_conducted is False?
ri.search_conducted.value_counts()
False 88545
True 3196
Name: search_conducted, dtype: int64
The count of False (88545) is the same as the number of missing search_type values above
Verify this once more
ri[ri.search_conducted==False].search_type.value_counts()
Series([], Name: search_type, dtype: int64)
# value_counts() ignores missing values by default
ri[ri.search_conducted==False].search_type.value_counts(dropna=False)
NaN 88545
Name: search_type, dtype: int64
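A sketch of the dropna switch on a made-up Series (the values are assumptions for illustration): the default skips missing entries, while dropna=False counts them as their own category.

```python
import numpy as np
import pandas as pd

# Hypothetical search types with two missing entries.
s = pd.Series(["Inventory", np.nan, np.nan, "Inventory", "Probable Cause"])

default_counts = s.value_counts()             # missing values are skipped
with_missing = s.value_counts(dropna=False)   # missing values get their own row

print(default_counts)
print(with_missing)
```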
# When search_conducted is True, search_type is never missing
ri[ri.search_conducted==True].search_type.value_counts(dropna=False)
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Inventory,Protective Frisk 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Protective Frisk 2
Inventory,Probable Cause,Reasonable Suspicion 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri[ri.search_conducted==True].search_type.isnull().sum()
0
Inspect the search types
ri.search_type.value_counts(dropna=False)
NaN 88545
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Inventory,Protective Frisk 11
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Reasonable Suspicion 2
Inventory,Probable Cause,Protective Frisk 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri['frisk']=ri.search_type=='Protective Frisk'
ri.frisk.dtype
dtype('bool')
ri.frisk.sum()
161
ri.frisk.mean()
0.0017549405391264537
ri.frisk.value_counts()
False 91580
True 161
Name: frisk, dtype: int64
161/(91580+161)
0.0017549405391264537
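String operations
# The step above assigned the result of ri.search_type == 'Protective Frisk' to the ri['frisk'] column
# Now use a string "contains" test instead, so that combined search types also match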
ri['frisk']=ri.search_type.str.contains('Protective Frisk')
ri.frisk.sum()
274
ri.frisk.mean()
0.08573216520650813
# mean() gives the share of rows that match; value_counts() shows the counts behind it
ri.frisk.value_counts()
False 2922
True 274
Name: frisk, dtype: int64
# Check that the manual calculation gives the same result as mean()
274/(2922+274)
0.08573216520650813
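The section above uses string methods to find partial matches
Compute the rate against the correct denominator: here the 3196 stops in which a search was conducted, not all 91741 stops
pandas calculations ignore missing values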
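A sketch of exact versus partial matching and of the denominator choice, on made-up data. Unlike the notebook, this variant passes na=False to str.contains and selects the denominator explicitly with notna(), so it does not rely on how missing values are skipped; that is a deliberate substitution, not the author's code.

```python
import numpy as np
import pandas as pd

# Hypothetical search types: three searches, two stops without a search.
search_type = pd.Series([
    "Protective Frisk",
    "Inventory,Protective Frisk",
    "Inventory",
    np.nan,
    np.nan,
])

exact = search_type == "Protective Frisk"                          # misses the combined entry
partial = search_type.str.contains("Protective Frisk", na=False)   # substring match, NaN -> False

searched = search_type.notna()              # stops where a search took place
rate_all = partial.mean()                   # denominator: all stops
rate_searched = partial[searched].mean()    # denominator: searched stops only

print(exact.sum(), partial.sum(), rate_all, rate_searched)
```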
# Which year has the fewest stops?
ri.stop_date.str.slice(0,4).value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_date, dtype: int64
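# Convert ri.stop_date to datetime and store the result in a new column, stop_datetime
ri['stop_datetime'] = pd.to_datetime(ri.stop_date)
# Note the .dt accessor here, analogous to the .str accessor used above
# After .dt you can read attributes such as year and month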
ri.stop_datetime.dt.year.value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_datetime, dtype: int64
ri.stop_datetime.dt.month.value_counts()
1 8479
5 7935
11 7877
10 7745
3 7742
6 7630
8 7615
7 7568
4 7529
9 7427
12 7152
2 7042
Name: stop_datetime, dtype: int64
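A sketch comparing the two ways of getting the year used above, on made-up dates: slicing the text with .str, and parsing to datetime and reading .dt attributes. The string route returns text, the datetime route returns integers.

```python
import pandas as pd

# Hypothetical ISO-formatted dates.
dates = pd.Series(["2005-01-02", "2005-02-20", "2015-12-31"])

year_as_text = dates.str.slice(0, 4)   # first four characters, still strings
as_datetime = pd.to_datetime(dates)    # parsed datetime64 values
years = as_datetime.dt.year            # integer years
months = as_datetime.dt.month          # integer months

print(year_as_text.tolist(), years.tolist(), months.tolist())
```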
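# Drug-related stops
ri.drugs_related_stop.dtype
dtype('bool')
# Baseline rate
ri.drugs_related_stop.mean()
0.008883705213590434
# You cannot group by hour until the hour has been extracted
# Parse the stop_time column as datetime, then group by its hour component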
ri['stop_time_datetime']=pd.to_datetime(ri.stop_time)
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean()
stop_time_datetime
0 0.019728
1 0.013507
2 0.015462
3 0.017065
4 0.011811
5 0.004762
6 0.003040
7 0.003281
8 0.002687
9 0.006288
10 0.005714
11 0.006976
12 0.004467
13 0.010326
14 0.007810
15 0.006416
16 0.005723
17 0.005517
18 0.010148
19 0.011596
20 0.008084
21 0.013342
22 0.013533
23 0.016344
Name: drugs_related_stop, dtype: float64
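A sketch of grouping by a derived Series instead of a stored column, on made-up data. It assumes the times are in HH:MM form and passes that format explicitly; the notebook lets pandas infer it.

```python
import pandas as pd

# Hypothetical stops with HH:MM times.
toy = pd.DataFrame({
    "stop_time": ["01:55", "01:10", "08:15", "23:15"],
    "drugs_related_stop": [True, False, False, True],
})

# Hour of day as an integer Series aligned with toy's index.
hours = pd.to_datetime(toy["stop_time"], format="%H:%M").dt.hour

# groupby accepts the Series directly; no extra column is needed.
rate_by_hour = toy.groupby(hours)["drugs_related_stop"].mean()

print(rate_by_hour)
```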
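# Plot the rate of drug-related stops by hour
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x9d72d30>
# Plot the number of stops by hour; value_counts() orders by frequency, so the hours are out of order here
ri.stop_time_datetime.dt.hour.value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5460710>
# Sort by hour before plotting the number of stops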
ri.stop_time_datetime.dt.hour.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5420860>
ri.groupby(ri.stop_time_datetime.dt.hour).stop_date.count().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x557c2e8>
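Why sort_index() matters before a line plot: value_counts() returns rows ordered by frequency, so the x-axis would follow that order. A sketch on made-up hours:

```python
import pandas as pd

# Hypothetical stop hours.
hours = pd.Series([23, 1, 23, 8, 23, 1])

by_frequency = hours.value_counts()            # most frequent hour first
by_hour = hours.value_counts().sort_index()    # ordered 1, 8, 23

print(by_frequency.index.tolist())
print(by_hour.index.tolist(), by_hour.tolist())
```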
# Mark the meaningless values as missing
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
# This chained indexing assigns to a copy, so ri itself is not changed (see the warning below)
ri[(ri.stop_duration=='1')|(ri.stop_duration=='2')].stop_duration='NaN'
C:\Anaconda3\lib\site-packages\pandas\core\generic.py:4401: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
# .loc assigns to ri itself, but 'NaN' here is a string, not a real missing value
ri.loc[(ri.stop_duration=='1')|(ri.stop_duration=='2'),'stop_duration']='NaN'
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5333
30+ Min 3228
NaN 2
Name: stop_duration, dtype: int64
# Replace the string 'NaN' with a real np.nan value
import numpy as np
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5335
30+ Min 3228
Name: stop_duration, dtype: int64
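The three steps above can be collapsed into one .loc assignment that writes np.nan directly. A sketch on made-up data; using isin for the mask is my choice here, not the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical durations with two junk codes.
toy = pd.DataFrame({"stop_duration": ["0-15 Min", "1", "30+ Min", "2"]})

bad = toy["stop_duration"].isin(["1", "2"])   # rows holding the junk codes
toy.loc[bad, "stop_duration"] = np.nan        # one indexing call, assigns in place

print(toy["stop_duration"].value_counts(dropna=False))
```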
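# Alternative: replace() does the same job in one step; assign the result back to the column
ri['stop_duration'] = ri.stop_duration.replace(['1', '2'], np.nan)
# Convert each stop_duration category to a representative number of minutes
# Series.map accepts either a function or a dict-like object that holds the mapping
# It applies the mapping to every value in the column; here it is used for bulk replacement
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}
# map() does not modify the original data in place, so store the result in a new column
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Count the mapped value for each category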
ri.stop_minutes.value_counts()
8.0 69543
23.0 13635
45.0 3228
Name: stop_minutes, dtype: int64
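A sketch of how map() treats values that are missing from the dict, on made-up data: both real missing values and unmapped strings come out as NaN, which is why the result above is float.

```python
import numpy as np
import pandas as pd

# Hypothetical durations, including a missing value and an unmapped code.
durations = pd.Series(["0-15 Min", "16-30 Min", "30+ Min", np.nan, "1"])
mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

minutes = durations.map(mapping)   # returns a new Series; durations is unchanged

print(minutes)
```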
ri.groupby('violation_raw').stop_minutes.mean()
violation_raw
APB 20.987342
Call for Service 22.034669
Equipment/Inspection Violation 11.460345
Motorist Assist/Courtesy 16.916256
Other Traffic Violation 13.900265
Registration Violation 13.745629
Seatbelt Violation 9.741531
Special Detail/Directed Patrol 15.061100
Speeding 10.577690
Suspicious Person 18.750000
Violation of City/Town Ordinance 13.388626
Warrant 21.400000
Name: stop_minutes, dtype: float64
# agg applies one or more methods, such as mean and count, to the data.
# agg used to work only on grouped data; it now also works on DataFrame and Series objects.
ri.groupby('violation_raw').stop_minutes.agg(['mean','count'])
mean | count | |
---|---|---|
violation_raw | ||
APB | 20.987342 | 79 |
Call for Service | 22.034669 | 1298 |
Equipment/Inspection Violation | 11.460345 | 11020 |
Motorist Assist/Courtesy | 16.916256 | 203 |
Other Traffic Violation | 13.900265 | 16223 |
Registration Violation | 13.745629 | 3432 |
Seatbelt Violation | 9.741531 | 2952 |
Special Detail/Directed Patrol | 15.061100 | 2455 |
Speeding | 10.577690 | 48462 |
Suspicious Person | 18.750000 | 56 |
Violation of City/Town Ordinance | 13.388626 | 211 |
Warrant | 21.400000 | 15 |
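A sketch of agg on a grouped column and on a plain Series, with made-up numbers. Showing count next to mean helps judge how much data sits behind each average.

```python
import pandas as pd

# Hypothetical stops with mapped minutes.
toy = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Warrant"],
    "stop_minutes": [8.0, 23.0, 45.0],
})

summary = toy.groupby("violation_raw")["stop_minutes"].agg(["mean", "count"])
overall = toy["stop_minutes"].agg(["mean", "count"])   # agg directly on a Series

print(summary)
print(overall)
```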
plot() draws a line chart by default
ri.groupby('violation_raw').stop_minutes.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10873ef0>
# Switch to a bar chart
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1092eb38>
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x10a4a5f8>
ri.groupby('violation').driver_age.describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
violation | ||||||||
Equipment | 11007.0 | 31.781503 | 11.400900 | 16.0 | 23.0 | 28.0 | 38.0 | 89.0 |
Moving violation | 16164.0 | 36.120020 | 13.185805 | 15.0 | 25.0 | 33.0 | 46.0 | 99.0 |
Other | 4204.0 | 39.536870 | 13.034639 | 16.0 | 28.0 | 39.0 | 49.0 | 87.0 |
Registration/plates | 3427.0 | 32.803035 | 11.033675 | 16.0 | 24.0 | 30.0 | 40.0 | 74.0 |
Seat belt | 2952.0 | 32.206301 | 11.213122 | 17.0 | 24.0 | 29.0 | 38.0 | 77.0 |
Speeding | 48361.0 | 33.530097 | 12.821847 | 15.0 | 23.0 | 30.0 | 42.0 | 90.0 |
ri.driver_age.plot(kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x1003a518>
ri.driver_age.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10088080>
ri.hist('driver_age', by='violation')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000100D8438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010111208>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000000001013B898>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010163F28>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000101945F8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010194630>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000102C6C50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103243C8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010346908>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010370E80>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1470>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True,sharey=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000104C4F98>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001059D358>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105C0748>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105E9B38>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F28>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F60>]],
dtype=object)
ri.head()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-02 | 2019-04-05 01:55:00 | 8.0 | 20.0 |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-18 | 2019-04-05 08:15:00 | 8.0 | 40.0 |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-23 | 2019-04-05 23:15:00 | 8.0 | 33.0 |
3 | 2005-02-20 | 17:15 | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False | NaN | 2005-02-20 | 2019-04-05 17:15:00 | 23.0 | 19.0 |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-03-14 | 2019-04-05 10:00:00 | 8.0 | 21.0 |
ri.tail()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
91736 | 2015-12-31 | 20:27 | M | 1986.0 | 29.0 | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:27:00 | 8.0 | 29.0 |
91737 | 2015-12-31 | 20:35 | F | 1982.0 | 33.0 | White | Equipment/Inspection Violation | Equipment | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:35:00 | 8.0 | 33.0 |
91738 | 2015-12-31 | 20:45 | M | 1992.0 | 23.0 | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:45:00 | 8.0 | 23.0 |
91739 | 2015-12-31 | 21:42 | M | 1993.0 | 22.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 21:42:00 | 8.0 | 22.0 |
91740 | 2015-12-31 | 22:46 | M | 1959.0 | 56.0 | Hispanic | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 22:46:00 | 8.0 | 56.0 |
ri['new_age']=ri.stop_datetime.dt.year-ri.driver_age_raw
ri[['driver_age','new_age']].hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000107FE7F0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001083C2E8>]],
dtype=object)
ri[['driver_age','new_age']].describe()
driver_age | new_age | |
---|---|---|
count | 86120.000000 | 86414.000000 |
mean | 34.011333 | 39.784294 |
std | 12.738564 | 110.822145 |
min | 15.000000 | -6794.000000 |
25% | 23.000000 | 24.000000 |
50% | 31.000000 | 31.000000 |
75% | 43.000000 | 43.000000 |
max | 99.000000 | 2015.000000 |
ri[(ri.new_age<15)|(ri.new_age>99)].shape
(294, 19)
ri.driver_age_raw.isnull().sum()
5327
ri.driver_age.isnull().sum()
5621
5621-5327
294
ri[(ri.driver_age_raw.notnull())&(ri.driver_age.isnull())].head()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
146 | 2005-10-05 | 08:50 | M | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-05 | 2019-04-05 08:50:00 | 8.0 | 2005.0 |
281 | 2005-10-10 | 12:05 | F | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-10 | 2019-04-05 12:05:00 | 8.0 | 2005.0 |
331 | 2005-10-12 | 07:50 | M | 0.0 | NaN | White | Motorist Assist/Courtesy | Other | False | NaN | No Action | False | 0-15 Min | False | NaN | 2005-10-12 | 2019-04-05 07:50:00 | 8.0 | 2005.0 |
414 | 2005-10-17 | 08:32 | M | 2005.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-17 | 2019-04-05 08:32:00 | 8.0 | 0.0 |
455 | 2005-10-18 | 18:30 | F | 0.0 | NaN | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-18 | 2019-04-05 18:30:00 | 8.0 | 2005.0 |
ri.loc[(ri.new_age<15)|(ri.new_age>99),'new_age']=np.nan
ri.new_age.equals(ri.driver_age)
True
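The age reconstruction above, as a sketch on made-up rows: subtract the recorded birth year from the stop year, then blank out ages outside 15 to 99. Note that Series.equals treats NaN values in the same positions as equal, which is why the check above can return True.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one valid birth year and two junk values (0 and the stop year).
toy = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-10-05", "2005-10-17"],
    "driver_age_raw": [1985.0, 0.0, 2005.0],
})

stop_year = pd.to_datetime(toy["stop_date"]).dt.year
toy["new_age"] = stop_year - toy["driver_age_raw"]   # 20, 2005, 0

out_of_range = (toy["new_age"] < 15) | (toy["new_age"] > 99)
toy.loc[out_of_range, "new_age"] = np.nan            # keep only plausible ages

expected = pd.Series([20.0, np.nan, np.nan], name="new_age")
print(toy["new_age"].equals(expected))
```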