Data Science: the pandas Library

pandas has two main data structures: Series and DataFrame. With these two types you can load, visualize, and analyze data.

Installing pandas: pip install pandas

import numpy as np

import pandas as pd

a = np.array([1,5,3,4,10,0,9])

b = pd.Series([1,5,3,4,10,0,9])

print(a)

print(b)

[ 1  5  3  4 10  0  9]

0     1

1     5

2     3

3     4

4    10

5     0

6     9

dtype: int64

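A Series is a one-dimensional, array-like object that holds a sequence of values, much like a list, with each value paired with an index label. For the list [9, 3, 8], for example, the values 9, 3, and 8 are written next to the index labels 0, 1, and 2.

A Series has two attributes: values and index. It is usually displayed vertically, so a Series can be thought of as an array "stood on end".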

import pandas as pd

b = pd.Series([1,5,3,4,10,0,9])

print (b.values)

print (b.index)

print (type(b.values))

[ 1  5  3  4 10  0  9]

RangeIndex(start=0, stop=7, step=1)

<class 'numpy.ndarray'>

import pandas as pd

s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])

print (s)

张三    21

李四    19

王五    20

赵六    50

dtype: int64

s['赵六']

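Building a Series from a list
A Series consists of data and an index
1. The index is on the left, the data on the right
2. The index is created automatically
Getting the data and the index

ser_obj.index, ser_obj.values
Previewing the data

ser_obj.head(n)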

import pandas as pd

countries = ['中国','美国','日本','德国']

countries_s = pd.Series(countries)

print (countries_s)

0    中国

1    美国

2    日本

3    德国

dtype: object

import pandas as pd

country_dicts = {'CH': '中国', 'US': '美国', 'AU': '澳大利亚'}

country_dict_s = pd.Series(country_dicts)

country_dict_s.index.name = 'Code'

country_dict_s.name = 'Country'

print(country_dict_s)

print(country_dict_s.values)

print(country_dict_s.index)

Code

CH      中国

US      美国

AU    澳大利亚

Name: Country, dtype: object

['中国' '美国' '澳大利亚']

Index(['CH', 'US', 'AU'], dtype='object', name='Code')

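Note: the dictionary keys are used as the index labels.

A list can only be indexed by integers starting from 0, and by default a Series is indexed the same way. Unlike a list, however, a Series can be given a custom index.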

import pandas as pd

data = [1,2,3,4,5]

ind = ['a','b','c','d','e']

s = pd.Series (data, index = ind )

print (s)

a    1

b    2

c    3

d    4

e    5

dtype: int64

Converting a Series to a dictionary

import pandas as pd

s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])

s1 = s.to_dict ()

print (s1)

{'张三': 21, '李四': 19, '王五': 20, '赵六': 50}

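Vectorized operations

Vectorized thinking on Series is important in data analysis and artificial intelligence: instead of processing scalars one at a time, operate on whole vectors (arrays).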

import numpy as np

import pandas as pd

s = range(11)

s1 = pd.Series(s)

total = np.sum(s1)

print('total = ',total)

total =  55
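
Beyond summing, arithmetic and comparisons also apply to a whole Series at once. The following is a minimal sketch of that idea; the variable names (doubled, shifted, evens) are chosen here for illustration and do not come from the original post.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Arithmetic is applied to every element at once, with no explicit loop.
doubled = s * 2
shifted = s + 10

# Comparisons are vectorized too and produce a boolean Series,
# which can be used as a mask to filter the original.
evens = s[s % 2 == 0]

print(doubled.tolist())  # [2, 4, 6, 8]
print(shifted.tolist())  # [11, 12, 13, 14]
print(evens.tolist())    # [2, 4]
```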

DataFrame

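A Series resembles a one-dimensional array, while a DataFrame is a two-dimensional data structure, similar to a spreadsheet. It has both a row index (index) and a column index (the column labels), and can be viewed as a dictionary of Series.

Each column is a Series, and all columns share one row index. A DataFrame is column-oriented, each column can hold a different data type, and because rows and columns are labeled, data is easy to extract.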

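DataFrame objects and operations

Building a DataFrame from Series
Building a DataFrame from a dict
Getting a column by its label (returns a Series)
- df_obj[label] or df_obj.label
Adding a column, like adding a key-value pair to a dict
- df_obj[new_label] = data
Deleting a column
- del df_obj[col_idx]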
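
The column operations listed above can be sketched in a few lines. This is only an illustration: the frame contents and the names df_obj, name_col, Rank, and Location are made up for the example.

```python
import pandas as pd

# Each dict key becomes a column; the Series share one row index.
df_obj = pd.DataFrame({
    'Name': pd.Series(['China', 'USA'], index=['CH', 'US']),
    'Rank': pd.Series([79, 14], index=['CH', 'US']),
})

name_col = df_obj['Name']       # a single column comes back as a Series
df_obj['Location'] = 'Earth'    # a scalar is broadcast to every row
del df_obj['Rank']              # removes the column in place

print(type(name_col).__name__)  # Series
print(list(df_obj.columns))     # ['Name', 'Location']
```

Attribute access (df_obj.Name) works only when the label is a valid Python identifier and does not clash with an existing DataFrame attribute, so bracket access is the safer default.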

# Create a DataFrame from Series

import pandas as pd

country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})

country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})

country3 = pd.Series({'Name': '澳大利亚','Language': 'English (AU)', 'Area':'7.692M km2','Happiness Rank': 9})

df = pd.DataFrame([country1, country2, country3], index=['CH', 'US', 'AU'])

print(df)

    Name      Language        Area  Happiness Rank

CH    中国       Chinese  9.597M km2              79

US    美国  English (US)  9.834M km2              14

AU  澳大利亚  English (AU)  7.692M km2               9

# Add a column

import pandas as pd

country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})

country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})

df = pd.DataFrame([country1, country2], index=['CH', 'US'])

df['Location'] = '地球'

print(df)

   Name      Language        Area  Happiness Rank Location

CH   中国       Chinese  9.597M km2              79       地球

US   美国  English (US)  9.834M km2              14       地球

# Create a DataFrame from a dict

import pandas as pd

dt = {0: [9, 8, 7, 6], 1: [3, 2, 1, 0]}

a = pd.DataFrame(dt)

print (a)

import pandas as pd

df1 =pd.DataFrame ([[1,2,3],[4,5,6]],index = ['A','B'],columns = ['C1','C2','C3'])

print (df1)

   C1  C2  C3

A   1   2   3

B   4   5   6

df1.T

	A	B
C1	1	4
C2	2	5
C3	3	6

df1.shape

(2, 3)

df1.size

df1.head(1)

	C1	C2	C3
A	1	2	3

df1.tail(1)

	C1	C2	C3
B	4	5	6

df1.describe()

	C1	C2	C3
count	2.00000	2.00000	2.00000
mean	2.50000	3.50000	4.50000
std	2.12132	2.12132	2.12132
min	1.00000	2.00000	3.00000
25%	1.75000	2.75000	3.75000
50%	2.50000	3.50000	4.50000
75%	3.25000	4.25000	5.25000
max	4.00000	5.00000	6.00000

df1.loc['B']

C1    4

C2    5

C3    6

Name: B, dtype: int64

df1.loc['B'].loc['C2']

df1.loc['B', 'C1']

df1.iloc[1, 2]
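
The lookups above are shown without their results. As a hedged sketch of the difference between the two accessors, the block below rebuilds the same 2x3 frame and compares label-based loc with position-based iloc; the variable names by_label, by_position, and col_slice are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                   index=['A', 'B'], columns=['C1', 'C2', 'C3'])

by_label = df1.loc['B', 'C2']        # row label 'B', column label 'C2'
by_position = df1.iloc[1, 2]         # second row, third column (0-based)
col_slice = df1.loc['A':'B', 'C1']   # label slices include both endpoints

print(by_label)            # 5
print(by_position)         # 6
print(col_slice.tolist())  # [1, 4]
```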

import pandas as pd

data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2014,2015,2016,2017,2018],'Points':[4,25,6,2,3]}

# Specify the row index

df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])

print (df)

      name  year  Points

Day1   Joe  2014       4

Day2   Cat  2015      25

Day3  Mike  2016       6

Day4   Kim  2017       2

Day5   Amy  2018       3

# Select a column

print(df['Points'])

Day1     4

Day2    25

Day3     6

Day4     2

Day5     3

Name: Points, dtype: int64

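Operations on a DataFrame

Listing distinct values
Grouping data
Merging data
Cleaning data

Listing distinct values

unique is a method that lists the distinct values in a pandas column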

import pandas as pd

data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2012,2012,2013,2018,2018],'Points':[4,25,6,2,3]}

df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])

print (df)

      name  year  Points

Day1   Joe  2012       4

Day2   Cat  2012      25

Day3  Mike  2013       6

Day4   Kim  2018       2

Day5   Amy  2018       3

First, get the column by indexing the DataFrame with its label

Then call unique on that column to get its distinct values

df['year']

Day1    2012

Day2    2012

Day3    2013

Day4    2018

Day5    2018

Name: year, dtype: int64

df['year'].unique()

array([2012, 2013, 2018], dtype=int64)
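
When the number of distinct values or their frequencies is also of interest, nunique and value_counts are natural companions to unique. A small sketch using the same year values as above (the variable names are illustrative):

```python
import pandas as pd

years = pd.Series([2012, 2012, 2013, 2018, 2018])

distinct = years.unique()       # NumPy array, in order of first appearance
n_distinct = years.nunique()    # number of distinct values
counts = years.value_counts()   # Series mapping each value to its count

print(distinct.tolist())   # [2012, 2013, 2018]
print(n_distinct)          # 3
print(counts.loc[2012])    # 2
```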

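Grouping data

Split the data into groups by some criterion
Apply a function (method) to each group separately
Combine the results into a single data structure

groupby is the most commonly used and effective grouping function in pandas; the resulting groups support aggregation functions such as sum(), count(), and mean()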

df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],

                'key2':['one', 'two', 'one', 'two', 'one'],

                'data1':np.random.randn(5),

                'data2':np.random.randn(5)})

print(df)

  key1 key2     data1     data2

0    a  one  1.600927 -0.876908

1    a  two  0.159591  0.288545

2    b  one  0.919900 -0.982536

3    b  two  1.158895  1.787031

4    a  one  0.116526  0.795206

grouped = df.groupby(df['key1'])

print(grouped.mean(numeric_only=True))

         data1     data2

key1

a     0.625681  0.068948

b     1.039398  0.402248
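
Because the example above uses random numbers, its output changes on every run. The sketch below repeats the split, apply, and combine steps on fixed, made-up values so the result can be checked by hand; agg with a list of function names is used here to compute several statistics in one pass.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Split by key1, apply mean/sum/count to data1, combine into one table.
summary = df.groupby('key1')['data1'].agg(['mean', 'sum', 'count'])

print(summary.loc['a', 'mean'])    # 3.0  (1 + 2 + 6) / 3
print(summary.loc['b', 'sum'])     # 7.0  3 + 4
print(summary.loc['a', 'count'])   # 3
```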

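Merging data

Merging combines columns from different DataFrames based on a shared column

Example: suppose there are two DataFrames:

(1) One contains student IDs and names

(2) The other contains student IDs and scores for three courses: Math, Python, and Computational Thinking

Goal: create a new DataFrame containing student ID, name, and the three course scores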

df2 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],

                'Math':[91, 88, 75, 68],

                'Python':[81, 82, 87, 76],

                'Computational thinking':[94, 81, 85, 86]})

print(df2)

       Key  Math  Python  Computational thinking

0  2015308    91      81                      94

1  2016312    88      82                      81

2  2017301    75      87                      85

3  2017303    68      76                      86

df3 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],

                'Name':['张三', '李四', '王五', '赵六']})

print(df3)

       Key Name

0  2015308   张三

1  2016312   李四

2  2017301   王五

3  2017303   赵六

dfnew = pd.merge(df3, df2, on='Key')
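
By default pd.merge keeps only the keys present in both frames (an inner join); the how parameter changes that. The sketch below uses two small made-up frames, names and scores, in which one student ID appears on only one side, to show the difference.

```python
import pandas as pd

names = pd.DataFrame({'Key': ['2015308', '2016312', '2017301'],
                      'Name': ['Zhang San', 'Li Si', 'Wang Wu']})
scores = pd.DataFrame({'Key': ['2015308', '2016312', '2017303'],
                       'Math': [91, 88, 68]})

inner = pd.merge(names, scores, on='Key')             # keys found in both frames
left = pd.merge(names, scores, on='Key', how='left')  # every row of names is kept

print(inner['Key'].tolist())          # ['2015308', '2016312']
print(len(left))                      # 3
print(left['Math'].isnull().sum())    # 1, Wang Wu has no score
```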

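Cleaning data

Handling missing data
1. Detect missing data with ser_obj.isnull() and df_obj.isnull(); the inverse is notnull()
2. Handle missing data
  1. df.fillna() fills missing values and df.dropna() drops them
  2. df.ffill() fills with the previous value
  3. df.bfill() fills with the next value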
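
The methods listed above can be tried on a small Series with gaps. This is a minimal sketch on made-up values; the same methods exist on DataFrame.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

missing = s.isnull()     # True where a value is missing
filled = s.fillna(0)     # replace missing values with a constant
dropped = s.dropna()     # discard missing values
forward = s.ffill()      # copy the previous valid value forward
backward = s.bfill()     # copy the next valid value backward

print(missing.tolist())        # [False, True, False, True]
print(filled.tolist())         # [1.0, 0.0, 3.0, 0.0]
print(dropped.tolist())        # [1.0, 3.0]
print(forward.tolist())        # [1.0, 1.0, 3.0, 3.0]
print(backward.tolist()[:3])   # [1.0, 3.0, 3.0]
```

The last element stays missing after bfill because no later value exists to copy from.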

df2

	Key	Math	Python	Computational thinking
0	2015308	91	81	94
1	2016312	88	82	81
2	2017301	75	87	85
3	2017303	68	76	86

df2.drop([0, 3])

	Key	Math	Python	Computational thinking
1	2016312	88	82	81
2	2017301	75	87	85

# axis selects the axis: 0 is rows, 1 is columns; the default is 0

df2.drop('Math', axis=1)

	Key	Python	Computational thinking
0	2015308	81	94
1	2016312	82	81
2	2017301	87	85
3	2017303	76	86

Quiz

Q1 For the following code, which of the following statements will not return True?

import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj1 = pd.Series(sdata)

states = ['California', 'Ohio', 'Oregon', 'Texas']

obj2 = pd.Series(sdata, index=states)

obj3 = pd.isnull(obj2)

import math

math.isnan(obj2['California'])

True

obj2

California        NaN

Ohio          35000.0

Oregon        16000.0

Texas         71000.0

dtype: float64

obj2['California'] == None

False

x = obj2['California']

obj2['California'] != x

True

obj3['California']

True

Q2 In the below python code, the keys of the dictionary d represent student ranks and the value for each key is a student name. Which of the following can be used to extract rows with student ranks that are lower than or equal to 3?

import pandas as pd

d = {

    '1': 'Alice',

    '2': 'Bob',

    '3': 'Rita',

    '4': 'Molly',

    '5': 'Ryan'

}

S = pd.Series(d)

S.iloc[0:3]

1    Alice

2      Bob

3     Rita

dtype: object

Q3 Suppose we have a DataFrame named df. We want to change the original DataFrame df in a way that all the column names are cast to upper case. Which of the following expressions is incorrect to perform the same?

from pandas import DataFrame

score = {'gre_score':[337, 324, 316, 322, 314], 'toefl_score':[118, 107, 104, 110, 103]}

score_df = DataFrame(score, index = [1, 2, 3, 4, 5])

print(score_df)

   gre_score  toefl_score

1        337          118

2        324          107

3        316          104

4        322          110

5        314          103

score_df.where(score_df['toefl_score'] > 105).dropna()

	gre_score	toefl_score
1	337.0	118.0
2	324.0	107.0
4	322.0	110.0

score_df[score_df['toefl_score'] > 105]

	gre_score	toefl_score
1	337	118
2	324	107
4	322	110

score_df.where(score_df['toefl_score'] > 105)

	gre_score	toefl_score
1	337.0	118.0
2	324.0	107.0
3	NaN	NaN
4	322.0	110.0
5	NaN	NaN

Q5 Which of the following can be used to create a DataFrame in Pandas?

Python dict

Pandas Series object

2D ndarray

Q6 Which of the following is an incorrect way to drop entries from the Pandas DataFrame named df shown below?

city_dict = {'one':[0, 4, 8, 12], 'two':[1, 5, 9, 13], 'three':[2, 6, 10, 14], 'four':[3, 7, 11, 15]}

city_df = DataFrame(city_dict, index=['Ohio', 'Colorado', 'Utah', 'New York'])

print(city_df)

          one  two  three  four

Ohio        0    1      2     3

Colorado    4    5      6     7

Utah        8    9     10    11

New York   12   13     14    15

print(city_df.drop('two', axis=1))

          one  three  four

Ohio        0      2     3

Colorado    4      6     7

Utah        8     10    11

New York   12     14    15

print(city_df.drop(['Utah', 'Colorado']))

          one  two  three  four

Ohio        0    1      2     3

New York   12   13     14    15

Q7 For the Series s1 and s2 defined below, which of the following statements will give an error?

import pandas as pd

s1 = pd.Series({1: 'Alice', 2: 'Jack', 3: 'Molly'})

s2 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})

print(s1)

print(s2)

1    Alice

2     Jack

3    Molly

dtype: object

Alice    1

Jack     2

Molly    3

dtype: int64

s2.iloc[1]

s1.loc[1]

'Alice'

s2[1]

s2.loc[1]

Q8 Which of the following statements is incorrect?

We can use s.iteritems() on a pd.Series object s to iterate on it
If s and s1 are two pd.Series objects, we can't use s.append(s1) to directly append s1 to the existing series s.
If s is a pd.Series object, then we can use s.loc[label] to get all data where the index is equal to label.
loc and iloc are two useful and commonly used Pandas methods.

s = pd.Series([1, 2, 3])

s

0    1

1    2

2    3

dtype: int64

s1 = pd.Series([4, 5, 6])

s1

0    4

1    5

2    6

dtype: int64

s.append(s1)

s

0    1

1    2

2    3

dtype: int64
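
The session above shows that s.append(s1) leaves s unchanged: it returns a new Series rather than modifying s in place. Note also that Series.append was removed in pandas 2.0, so on current versions pd.concat is the way to combine Series. A minimal sketch (the names combined and renumbered are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s1 = pd.Series([4, 5, 6])

combined = pd.concat([s, s1])                       # keeps the original labels
renumbered = pd.concat([s, s1], ignore_index=True)  # assigns fresh labels 0..5

print(combined.index.tolist())   # [0, 1, 2, 0, 1, 2]
print(renumbered.tolist())       # [1, 2, 3, 4, 5, 6]
print(len(s))                    # 3, s itself is unchanged
```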

Q9 For the given DataFrame df shown above, we want to get all records with a toefl score greater than 105 but smaller than 115. Which of the following expressions is incorrect to perform the same?

print(score_df)

   gre_score  toefl_score

1        337          118

2        324          107

3        316          104

4        322          110

5        314          103

score_df[(score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)]

	gre_score	toefl_score
2	324	107
4	322	110

score_df[(score_df['toefl_score'].isin(range(106, 115)))]

	gre_score	toefl_score
2	324	107
4	322	110

(score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)

1    False

2     True

3    False

4     True

5    False

Name: toefl_score, dtype: bool

score_df[score_df['toefl_score'].gt(105) & score_df['toefl_score'].lt(115)]

	gre_score	toefl_score
2	324	107
4	322	110

Q10 Which of the following is the correct way to extract all information related to the student named Alice from the DataFrame df given below:

stu_dict = {'Name':['Alice', 'Jack'], 'Age':[20, 22], 'Gender':['F', 'M']}

stu_df = DataFrame(stu_dict, index=['Mathematics', 'Sociology'])

print(stu_df)

              Name  Age Gender

Mathematics  Alice   20      F

Sociology     Jack   22      M

stu_df.loc['Mathematics']

Name      Alice

Age          20

Gender        F

Name: Mathematics, dtype: object
