Pandas String Operations and a Worked Example

String operations

String object methods

val = 'a,b, guido'

val.split(',')

['a', 'b', ' guido']

pieces = [x.strip() for x in val.split(',')]

pieces

['a', 'b', 'guido']

first,second,third = pieces

'::'.join(pieces)

'a::b::guido'

'guido' in val

True

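Note the difference between find and index: if the substring is not found, index raises an exception (instead of returning -1).

# find does not raise an error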

val.find(':')

-1
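The contrast between the two methods can be checked directly. The sketch below reuses the same val string; the names found_at and raised are only illustrative.

```python
val = 'a,b, guido'

# find signals "not found" with -1 and never raises
found_at = val.find(':')

# index raises ValueError when the substring is missing
try:
    val.index(':')
    raised = False
except ValueError:
    raised = True

print(found_at, raised)
```

Use find when a missing substring is an expected case, and index when it should be treated as an error.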

# Return the number of occurrences of the given substring

val.count(',')

2

Regular expressions

To avoid unwanted escaping with \ in a regular expression, use a raw string literal such as r'C:\x'.
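A short sketch of what the raw string prefix changes. The variable names plain, raw, and parts are illustrative only.

```python
import re

# In a normal literal the backslash has to be doubled
plain = 'C:\\x'
# In a raw literal the backslash is taken as-is
raw = r'C:\x'
same = plain == raw

# Raw literals keep regex escapes such as \s readable
pattern = re.compile(r'\s+')
parts = pattern.split('foo bar\t baz \tqux')

print(same, parts)
```

Writing regex patterns as raw literals also avoids the invalid-escape warnings that recent Python versions emit for sequences like '\s' in normal strings.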

import re

text = 'foo bar\t baz \tqux'

re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

regex = re.compile(r'\s+')

regex.split(text)

['foo', 'bar', 'baz', 'qux']

regex.findall(text)

[' ', '\t ', ' \t']

Vectorized string functions in pandas

import numpy as np

from pandas import Series

data = {'Davae':'dave@google.com','Steve':'steve@gmail.com','Rob':'rob@gmail.com','Wes':np.nan}

data

{'Davae': 'dave@google.com',

 'Steve': 'steve@gmail.com',

 'Rob': 'rob@gmail.com',

 'Wes': nan}

data2 = Series(data)

data2

Davae    dave@google.com

Steve    steve@gmail.com

Rob        rob@gmail.com

Wes                  NaN

dtype: object

data2.isnull()

Davae    False

Steve    False

Rob      False

Wes       True

dtype: bool

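With data2.map, any string or regular expression method can be applied to each value (by passing a lambda or another function), but it fails on NA values.

The str attribute of a Series exposes the string methods in a form that skips NA values.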
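The difference can be seen on a small Series with a missing value. This is a sketch with made-up data (emails); the na=False argument is an extra option not used in the original post, added here so that the missing entry comes out as False.

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com',
                    'Wes': np.nan})

# map calls the lambda on NaN as well, and "'gmail' in nan" is a TypeError
try:
    emails.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True

# The str accessor handles the missing value itself
has_gmail = emails.str.contains('gmail', na=False)

print(map_failed)
print(has_gmail)
```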

data2.str.contains('gmail')

Davae    False

Steve     True

Rob       True

Wes        NaN

dtype: object

# Matching pattern

regex = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Apply the pattern to every element

res = data2.str.findall(regex,flags = re.IGNORECASE)

res

Davae    [(dave, google, com)]

Steve    [(steve, gmail, com)]

Rob        [(rob, gmail, com)]

Wes                        NaN

dtype: object

# Use str.get to index into each element

res_first = res.str.get(0)

res_first

Davae    (dave, google, com)

Steve    (steve, gmail, com)

Rob        (rob, gmail, com)

Wes                      NaN

dtype: object

res_second = res_first.str.get(1)

res_second

Davae    google

Steve     gmail

Rob       gmail

Wes         NaN

dtype: object

res_second.str[:2]

Davae     go

Steve     gm

Rob       gm

Wes      NaN

dtype: object
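As an alternative to chaining findall and str.get, the capture groups can be pulled into columns in one step with str.extract. The sketch below is one possible way to do it, not the approach used above; the group names user, domain, and suffix are made up for illustration.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com',
                    'Steve': 'steve@gmail.com',
                    'Wes': np.nan})

# Same pattern as above, with named groups so the columns get labels
pattern = (r'(?P<user>[A-Z0-9._%+-]+)@'
           r'(?P<domain>[A-Z0-9.-]+)\.'
           r'(?P<suffix>[A-Z]{2,4})')

# One row per element, one column per capture group; NaN rows stay NaN
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

print(parts)
```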

Example: the USDA food database

import json

import pandas as pd

with open(r'C:\Users\1\Desktop\Python\练习代码\基础模块面向对象网络编程\day2\food.json') as f:
    db = json.load(f)

len(db)

db[0].keys()

dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])

db[0]['nutrients'][0]

{'value': 25.18,

 'units': 'g',

 'description': 'Protein',

 'group': 'Composition'}

nutrients = pd.DataFrame(db[0]['nutrients'])

nutrients[:7]

   description                  group        units    value
0  Protein                      Composition  g        25.18
1  Total lipid (fat)            Composition  g        29.20
2  Carbohydrate, by difference  Composition  g         3.06
3  Ash                          Other        g         3.28
4  Energy                       Energy       kcal    376.00
5  Water                        Composition  g        39.28
6  Energy                       Energy       kJ     1573.00

info_keys = ['description','group','id','manufacturer']

info = pd.DataFrame(db, columns=info_keys)

info.head()

   description                         group                   id    manufacturer
0  Cheese, caraway                     Dairy and Egg Products  1008
1  Cheese, cheddar                     Dairy and Egg Products  1009
2  Cheese, edam                        Dairy and Egg Products  1018
3  Cheese, feta                        Dairy and Egg Products  1019
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028

info.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 6636 entries, 0 to 6635

Data columns (total 4 columns):

description     6636 non-null object

group           6636 non-null object

id              6636 non-null int64

manufacturer    5195 non-null object

dtypes: int64(1), object(3)

memory usage: 207.5+ KB

# Look at how the foods are distributed across groups

info['group'].value_counts()

Vegetables and Vegetable Products    812

Beef Products                        618

Baked Products                       496

Breakfast Cereals                    403

Legumes and Legume Products          365

Fast Foods                           365

Lamb, Veal, and Game Products        345

Sweets                               341

Fruits and Fruit Juices              328

Pork Products                        328

Beverages                            278

Soups, Sauces, and Gravies           275

Finfish and Shellfish Products       255

Baby Foods                           209

Cereal Grains and Pasta              183

Ethnic Foods                         165

Snacks                               162

Nut and Seed Products                128

Poultry Products                     116

Sausages and Luncheon Meats          111

Dairy and Egg Products               107

Fats and Oils                         97

Meals, Entrees, and Sidedishes        57

Restaurant Foods                      51

Spices and Herbs                      41

Name: group, dtype: int64

nutrients = []

for rec in db:

    fnuts = pd.DataFrame(rec['nutrients'])

    fnuts['id'] = rec['id']

    nutrients.append(fnuts)

# Concatenate the nutrient tables of all foods

nutrients = pd.concat(nutrients, ignore_index=True)

nutrients.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 389355 entries, 0 to 389354

Data columns (total 5 columns):

description    389355 non-null object

group          389355 non-null object

units          389355 non-null object

value          389355 non-null float64

id             389355 non-null int64

dtypes: float64(1), int64(1), object(3)

memory usage: 14.9+ MB
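The loop-and-concat step above can also be written with pd.json_normalize, which flattens a nested list of records and carries selected top-level keys along. This is a sketch on a tiny hypothetical sample_db with the same shape as db, not a replacement tested against the real file.

```python
import pandas as pd

# Hypothetical records shaped like the entries of db
sample_db = [
    {'id': 1, 'description': 'food A',
     'nutrients': [
         {'value': 1.0, 'units': 'g', 'description': 'Protein', 'group': 'Composition'},
         {'value': 2.0, 'units': 'g', 'description': 'Ash', 'group': 'Other'},
     ]},
    {'id': 2, 'description': 'food B',
     'nutrients': [
         {'value': 3.0, 'units': 'kcal', 'description': 'Energy', 'group': 'Energy'},
     ]},
]

# One row per nutrient record, with the owning food's id attached
flat = pd.json_normalize(sample_db, record_path='nutrients', meta=['id'])

print(flat)
```

Only id is listed in meta because the nutrient records have their own description and group keys, which would clash with the food-level keys of the same name.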

# Count the duplicated rows

nutrients.duplicated().sum()

14179

# Drop the duplicates directly

nutrients = nutrients.drop_duplicates()
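How duplicated and drop_duplicates relate is easier to see on a few rows. The frame below is made-up data for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    'nutrient': ['Protein', 'Protein', 'Ash'],
    'value': [25.18, 25.18, 3.28],
    'id': [1008, 1008, 1008],
})

# duplicated marks every repeat of an earlier row, so the sum counts extra copies
n_dupes = int(frame.duplicated().sum())

# drop_duplicates keeps the first occurrence and preserves its index label
deduped = frame.drop_duplicates()

print(n_dupes)
print(deduped)
```

The row count after dropping equals the original count minus the number reported by duplicated().sum().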

# These are the nutrient descriptions and groups; the info table above holds the food descriptions and groups

nutrients.head()

   description                  group        units    value  id
0  Protein                      Composition  g        25.18  1008
1  Total lipid (fat)            Composition  g        29.20  1008
2  Carbohydrate, by difference  Composition  g         3.06  1008
3  Ash                          Other        g         3.28  1008
4  Energy                       Energy       kcal    376.00  1008

# Rename the columns so the two tables can be told apart

col_mapping = {'description':'food',

                'group':'fgroup'

              }

# Rename the food columns

info = info.rename(columns=col_mapping, copy=False)

info.head()

   food                                fgroup                  id    manufacturer
0  Cheese, caraway                     Dairy and Egg Products  1008
1  Cheese, cheddar                     Dairy and Egg Products  1009
2  Cheese, edam                        Dairy and Egg Products  1018
3  Cheese, feta                        Dairy and Egg Products  1019
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028

# Rename the nutrient columns

col_mapping = {'description':'nutrient',

               'group':'nutgroup'

              }

nutrients = nutrients.rename(columns=col_mapping, copy=False)

nutrients.head()

   nutrient                     nutgroup     units    value  id
0  Protein                      Composition  g        25.18  1008
1  Total lipid (fat)            Composition  g        29.20  1008
2  Carbohydrate, by difference  Composition  g         3.06  1008
3  Ash                          Other        g         3.28  1008
4  Energy                       Energy       kcal    376.00  1008

# Merge the two tables; on names the column present in both, and how='outer' requests an outer join

ndata = pd.merge(nutrients, info, on='id', how='outer')

ndata.head()

   nutrient                     nutgroup     units    value  id    food             fgroup                  manufacturer
0  Protein                      Composition  g        25.18  1008  Cheese, caraway  Dairy and Egg Products
1  Total lipid (fat)            Composition  g        29.20  1008  Cheese, caraway  Dairy and Egg Products
2  Carbohydrate, by difference  Composition  g         3.06  1008  Cheese, caraway  Dairy and Egg Products
3  Ash                          Other        g         3.28  1008  Cheese, caraway  Dairy and Egg Products
4  Energy                       Energy       kcal    376.00  1008  Cheese, caraway  Dairy and Egg Products
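What the outer join keeps can be checked on two tiny frames. The sketch below uses made-up data and adds indicator=True, an option not used in the original merge, to show where each row came from.

```python
import pandas as pd

nutrients_demo = pd.DataFrame({'id': [1, 1, 3],
                               'nutrient': ['Protein', 'Ash', 'Energy']})
info_demo = pd.DataFrame({'id': [1, 2],
                          'food': ['Cheese, caraway', 'Cheese, cheddar']})

# An outer join keeps ids found in either table; unmatched cells become NaN
merged = pd.merge(nutrients_demo, info_demo, on='id', how='outer', indicator=True)

# The _merge column reports both, left_only, or right_only for every row
origin_counts = merged['_merge'].value_counts()

print(merged)
```

With how='inner' only the two rows for id 1 would remain, so an outer join is the safer choice when some foods might lack nutrient rows or the reverse.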

# Group by nutrient and food group, then find the food with the highest value of each nutrient within each food group

by_nutrient = ndata.groupby(['nutrient','fgroup'])

get_maximum = lambda x:x.xs(x.value.idxmax())

max_foods = by_nutrient.apply(get_maximum)

max_foods.head()

# Look only at the value and food columns

max_foods[['value','food']].head()

                                                     value  food
nutrient         fgroup
Adjusted Protein Sweets                             12.900  Baking chocolate, unsweetened, squares
                 Vegetables and Vegetable Products   2.180  Mushrooms, white, raw
Alanine          Baby Foods                          0.911  Babyfood, meat, ham, junior
                 Baked Products                      2.320  Leavening agents, yeast, baker's, active dry
                 Beef Products                       2.254  Beef, cured, breakfast strips, cooked
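The same "row with the largest value per group" result can be obtained without apply, by asking each group for the index label of its maximum and selecting those rows. This is a sketch on made-up data (demo), offered as an alternative rather than as the method used above.

```python
import pandas as pd

demo = pd.DataFrame({
    'nutrient': ['Protein', 'Protein', 'Protein', 'Ash'],
    'fgroup': ['Dairy', 'Dairy', 'Beef', 'Dairy'],
    'food': ['Cheese, edam', 'Cheese, feta', 'Beef, cured', 'Cheese, feta'],
    'value': [25.0, 14.2, 31.1, 5.2],
})

# Index label of the largest value inside every (nutrient, fgroup) group
winners = demo.groupby(['nutrient', 'fgroup'])['value'].idxmax()

# Select those rows, then index them by the grouping keys
top = demo.loc[winners].set_index(['nutrient', 'fgroup'])[['value', 'food']]

print(top)
```

On large tables this form is usually faster than a Python-level lambda passed to apply, because idxmax runs per group inside pandas.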
