Pandas String Operations and a Worked Example
String Operations
String Object Methods
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first,second,third = pieces
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
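Note the difference between find and index: if the substring is not found, index raises an exception (ValueError) instead of returning -1.
# find does not raise an error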
val.find(':')
-1
# count returns the number of occurrences of the given substring
val.count(',')
2
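The difference between find and index described above can be checked directly. The following is a minimal sketch that reuses the val string from this section; the names missing_pos, index_failed and comma_pos are introduced only for illustration.

```python
val = 'a,b, guido'

# find returns -1 when the substring is absent
missing_pos = val.find(':')

# index raises ValueError in the same situation
try:
    val.index(':')
    index_failed = False
except ValueError:
    index_failed = True

# when the substring is present, both methods return the same position
comma_pos = val.index(',')

print(missing_pos, index_failed, comma_pos)
```

In practice, find suits cases where a missing substring is an expected outcome, while index suits cases where it should be treated as an error.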
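Regular Expressions
To avoid unwanted escaping with backslashes (\) in a regular expression, use a raw string literal such as r'C:\x' instead of the equivalent 'C:\\x'.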
import re
text = 'foo bar\t baz \tqux'
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
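As a small illustration of the raw string advice in this section, the sketch below compares a raw pattern with its manually escaped equivalent and reuses the text value from above. The variable names raw_pattern, escaped_pattern, same_pattern, parts and path are assumptions made for the example.

```python
import re

text = 'foo bar\t baz \tqux'

# r'\s+' keeps the backslash as written; '\\s+' is the manually escaped equivalent
raw_pattern = r'\s+'
escaped_pattern = '\\s+'
same_pattern = raw_pattern == escaped_pattern

# compiling once is convenient when the same pattern is applied to many strings
regex = re.compile(raw_pattern)
parts = regex.split(text)

# in a raw string the backslash and the following character stay two separate characters
path = r'C:\x'

print(same_pattern, parts, len(path))
```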
Vectorized String Functions in pandas
import numpy as np
from pandas import Series
data = {'Davae':'dave@google.com','Steve':'steve@gmail.com','Rob':'rob@gmail.com','Wes':np.nan}
data
{'Davae': 'dave@google.com',
'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com',
'Wes': nan}
data2 = Series(data)
data2
Davae dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
data2.isnull()
Davae False
Steve False
Rob False
Wes True
dtype: bool
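With data2.map, any string or regular expression method can be applied to each value by passing a lambda or another function, but the call fails on missing (NA) values.
The str attribute of a Series exposes the string methods in a form that skips missing values.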
data2.str.contains('gmail')
Davae False
Steve True
Rob True
Wes NaN
dtype: object
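The contrast between map and the str attribute can be sketched on a small Series. This is an illustration only: the emails Series is a shortened copy of the data above, and the na=False argument to str.contains is an optional extra that the original code does not use.

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan})

# a plain string method applied through map fails on the missing value,
# because NaN is a float and has no string methods
try:
    emails.map(lambda x: x.upper())
    map_failed = False
except AttributeError:
    map_failed = True

# the str accessor passes the missing value through instead of failing
upper = emails.str.upper()

# na=False replaces the missing result with False, giving a clean boolean mask
has_gmail = emails.str.contains('gmail', na=False)

print(map_failed)
print(upper)
print(has_gmail)
```

A mask without missing values is convenient when the result is used for boolean indexing, for example emails[has_gmail].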
# matching pattern: user name, domain name, domain suffix
regex = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
# apply the pattern to every element
res = data2.str.findall(regex,flags = re.IGNORECASE)
res
Davae [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
# use str.get to index into each element
res_first = res.str.get(0)
res_first
Davae (dave, google, com)
Steve (steve, gmail, com)
Rob (rob, gmail, com)
Wes NaN
dtype: object
res_second = res_first.str.get(1)
res_second
Davae google
Steve gmail
Rob gmail
Wes NaN
dtype: object
res_second.str[:2]
Davae go
Steve gm
Rob gm
Wes NaN
dtype: object
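The chain of findall and str.get calls above pulls out one capture group at a time. As an alternative sketch (not the approach used in this article), Series.str.extract returns all capture groups of the first match as DataFrame columns. The column names user, domain and suffix are assumptions chosen for the example.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({'Davae': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# one column per capture group; rows without a match (or with NaN input) stay NaN
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['user', 'domain', 'suffix']

# slicing through the str accessor works on the extracted column as before
domain_prefix = parts['domain'].str[:2]

print(parts)
print(domain_prefix)
```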
Example: USDA Food Database
import json
import pandas as pd
# open the file in a with block so that it is closed after loading
with open(r'C:\Users\1\Desktop\Python\练习代码\基础模块面向对象网络编程\day2\food.json') as f:
    db = json.load(f)
len(db)
db[0].keys()
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
db[0]['nutrients'][0]
{'value': 25.18,
'units': 'g',
'description': 'Protein',
'group': 'Composition'}
nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients[:7]
                   description        group  units    value
0                      Protein  Composition      g    25.18
1            Total lipid (fat)  Composition      g    29.20
2  Carbohydrate, by difference  Composition      g     3.06
3                          Ash        Other      g     3.28
4                       Energy       Energy   kcal   376.00
5                        Water  Composition      g    39.28
6                       Energy       Energy     kJ  1573.00
info_keys = ['description','group','id','manufacturer']
info = pd.DataFrame(db, columns=info_keys)
info.head()
                          description                   group    id  manufacturer
0                     Cheese, caraway  Dairy and Egg Products  1008
1                     Cheese, cheddar  Dairy and Egg Products  1009
2                        Cheese, edam  Dairy and Egg Products  1018
3                        Cheese, feta  Dairy and Egg Products  1019
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028
info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
description 6636 non-null object
group 6636 non-null object
id 6636 non-null int64
manufacturer 5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB
# distribution of foods across food groups
info['group'].value_counts()
Vegetables and Vegetable Products 812
Beef Products 618
Baked Products 496
Breakfast Cereals 403
Legumes and Legume Products 365
Fast Foods 365
Lamb, Veal, and Game Products 345
Sweets 341
Fruits and Fruit Juices 328
Pork Products 328
Beverages 278
Soups, Sauces, and Gravies 275
Finfish and Shellfish Products 255
Baby Foods 209
Cereal Grains and Pasta 183
Ethnic Foods 165
Snacks 162
Nut and Seed Products 128
Poultry Products 116
Sausages and Luncheon Meats 111
Dairy and Egg Products 107
Fats and Oils 97
Meals, Entrees, and Sidedishes 57
Restaurant Foods 51
Spices and Herbs 41
Name: group, dtype: int64
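nutrients = []
for rec in db:
    # one DataFrame of nutrients per food, tagged with the food id
    fnuts = pd.DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)
# concatenate the nutrients of all foods into one table
nutrients = pd.concat(nutrients, ignore_index=True)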
nutrients.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 389355 entries, 0 to 389354
Data columns (total 5 columns):
description 389355 non-null object
group 389355 non-null object
units 389355 non-null object
value 389355 non-null float64
id 389355 non-null int64
dtypes: float64(1), int64(1), object(3)
memory usage: 14.9+ MB
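The pattern of building one DataFrame per record and concatenating them can be tried on a tiny input. The sketch below uses two made-up records shaped like the entries of db; the values and the names records, frames, combined and flat are hypothetical. It also shows pd.json_normalize as one possible alternative, which is not what this article uses.

```python
import pandas as pd

# two made-up records with the same shape as the entries of db
records = [
    {'id': 1, 'nutrients': [{'description': 'Protein', 'value': 25.18},
                            {'description': 'Ash', 'value': 3.28}]},
    {'id': 2, 'nutrients': [{'description': 'Protein', 'value': 24.9}]},
]

# loop version: one frame per record, tagged with the record id
frames = []
for rec in records:
    frame = pd.DataFrame(rec['nutrients'])
    frame['id'] = rec['id']
    frames.append(frame)
combined = pd.concat(frames, ignore_index=True)

# alternative: flatten the nested list in one call, carrying id along as metadata
flat = pd.json_normalize(records, record_path='nutrients', meta=['id'])

print(combined)
print(flat)
```

With ignore_index=True the combined table gets a fresh 0..n-1 index instead of repeating each small frame's own index.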
# count the duplicated rows
nutrients.duplicated().sum()
14179
# drop the duplicated rows
nutrients = nutrients.drop_duplicates()
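How duplicated and drop_duplicates relate can be seen on a three-row table. This is a sketch with made-up rows; the name rows and the values are only for illustration.

```python
import pandas as pd

rows = pd.DataFrame({
    'nutrient': ['Protein', 'Protein', 'Ash'],
    'value': [25.18, 25.18, 3.28],
    'id': [1008, 1008, 1008],
})

# duplicated marks every repeat of an earlier row; the first occurrence is not marked
dupe_mask = rows.duplicated()
n_dupes = int(dupe_mask.sum())

# drop_duplicates keeps the first occurrence and the original index labels
deduped = rows.drop_duplicates()

print(n_dupes)
print(deduped)
```

Because the original index labels are kept, a call to reset_index(drop=True) afterwards may be useful when a contiguous index matters.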
# these are the description and group of each nutrient; the info table above holds the description and group of each food
nutrients.head()
                   description        group  units    value    id
0                      Protein  Composition      g    25.18  1008
1            Total lipid (fat)  Composition      g    29.20  1008
2  Carbohydrate, by difference  Composition      g     3.06  1008
3                          Ash        Other      g     3.28  1008
4                       Energy       Energy   kcal   376.00  1008
# both tables have description and group columns, so rename them to tell them apart
col_mapping = {'description': 'food',
               'group': 'fgroup'}
# rename the food columns
info = info.rename(columns=col_mapping, copy=False)
info.head()
                                 food                  fgroup    id  manufacturer
0                     Cheese, caraway  Dairy and Egg Products  1008
1                     Cheese, cheddar  Dairy and Egg Products  1009
2                        Cheese, edam  Dairy and Egg Products  1018
3                        Cheese, feta  Dairy and Egg Products  1019
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028
# rename the nutrient columns
col_mapping = {'description': 'nutrient',
               'group': 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients.head()
                      nutrient     nutgroup  units    value    id
0                      Protein  Composition      g    25.18  1008
1            Total lipid (fat)  Composition      g    29.20  1008
2  Carbohydrate, by difference  Composition      g     3.06  1008
3                          Ash        Other      g     3.28  1008
4                       Energy       Energy   kcal   376.00  1008
# merge the two tables on the id column they share, using an outer join
ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.head()
                      nutrient     nutgroup  units    value    id             food                  fgroup  manufacturer
0                      Protein  Composition      g    25.18  1008  Cheese, caraway  Dairy and Egg Products
1            Total lipid (fat)  Composition      g    29.20  1008  Cheese, caraway  Dairy and Egg Products
2  Carbohydrate, by difference  Composition      g     3.06  1008  Cheese, caraway  Dairy and Egg Products
3                          Ash        Other      g     3.28  1008  Cheese, caraway  Dairy and Egg Products
4                       Energy       Energy   kcal   376.00  1008  Cheese, caraway  Dairy and Egg Products
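An outer join keeps rows from both tables even when an id appears in only one of them. The sketch below shows this on made-up data; the names nut_small and info_small, the values, and the use of indicator=True are assumptions added for illustration and are not part of the merge above.

```python
import pandas as pd

# made-up nutrient rows: ids 1008 and 1009 only
nut_small = pd.DataFrame({
    'nutrient': ['Protein', 'Ash', 'Protein'],
    'value': [25.18, 3.28, 24.9],
    'id': [1008, 1008, 1009],
})
# made-up food rows: id 1018 has no nutrient rows
info_small = pd.DataFrame({
    'food': ['Cheese, caraway', 'Cheese, cheddar', 'Cheese, edam'],
    'id': [1008, 1009, 1018],
})

# indicator=True adds a _merge column that records where each row came from
merged = pd.merge(nut_small, info_small, on='id', how='outer', indicator=True)

# rows that exist only in the food table have NaN in the nutrient columns
food_only = merged[merged['_merge'] == 'right_only']

print(merged)
```

With how='inner' the unmatched food would be dropped instead, so the choice depends on whether foods without nutrient rows should remain visible.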
# group by nutrient and food group, then find the food with the highest value of each nutrient within each food group
by_nutrient = ndata.groupby(['nutrient','fgroup'])
get_maximum = lambda x:x.xs(x.value.idxmax())
max_foods = by_nutrient.apply(get_maximum)
max_foods.head()
# look only at the value and food columns
max_foods[['value','food']].head()
                                                     value                                          food
nutrient         fgroup
Adjusted Protein Sweets                             12.900        Baking chocolate, unsweetened, squares
                 Vegetables and Vegetable Products   2.180                         Mushrooms, white, raw
Alanine          Baby Foods                          0.911                   Babyfood, meat, ham, junior
                 Baked Products                      2.320  Leavening agents, yeast, baker's, active dry
                 Beef Products                       2.254         Beef, cured, breakfast strips, cooked
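The same "row with the largest value per group" result can also be reached without apply, by asking each group for the index label of its maximum and then selecting those rows. This is an alternative sketch on made-up data, not the method used above; the frame small and its values are hypothetical.

```python
import pandas as pd

small = pd.DataFrame({
    'nutrient': ['Protein', 'Protein', 'Protein', 'Ash'],
    'fgroup': ['Dairy', 'Dairy', 'Beef', 'Dairy'],
    'food': ['Cheese, edam', 'Cheese, feta', 'Beef, cured', 'Cheese, feta'],
    'value': [24.99, 14.21, 31.3, 5.2],
})

# idxmax returns, for every (nutrient, fgroup) pair, the row label of the largest value
top_labels = small.groupby(['nutrient', 'fgroup'])['value'].idxmax()

# select those rows and index them by the grouping keys
top_rows = small.loc[top_labels]
result = top_rows.set_index(['nutrient', 'fgroup'])[['value', 'food']]

print(result)
```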