Data Science: the pandas Library

pandas has two main data structures: Series and DataFrame. With these two structures you can load, visualize, and analyze data.

Installing pandas: pip install pandas

import numpy as np
import pandas as pd
a = np.array([1,5,3,4,10,0,9])
b = pd.Series([1,5,3,4,10,0,9])
print(a)
print(b)
[ 1  5  3  4 10  0  9]
0     1
1     5
2     3
3     4
4    10
5     0
6     9
dtype: int64

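A Series is much like a list: a one-dimensional, array-like object that holds a sequence of values, each paired with an index label. Take the list [9, 3, 8], for example: a Series stores each value together with its index.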
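As a small illustration of the [9, 3, 8] example, the sketch below wraps the list in a Series and prints each index label next to its value. It assumes the default integer index that pandas creates when none is given.

```python
import pandas as pd

# Wrap the list in a Series; pandas assigns the default index 0, 1, 2.
s = pd.Series([9, 3, 8])

# Walk through the (index label, value) pairs.
for label, value in s.items():
    print(label, value)
```

Each printed line pairs one index label with the value stored at that label.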

A Series has two attributes: values and index. Sometimes an array needs to be laid out vertically, and a Series can be thought of as an array "stood on end".

import pandas as pd
b = pd.Series([1,5,3,4,10,0,9])
print (b.values)
print (b.index)
print (type(b.values))
[ 1  5  3  4 10  0  9]
RangeIndex(start=0, stop=7, step=1)
<class 'numpy.ndarray'>
import pandas as pd
s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])
print (s)
张三    21
李四    19
王五    20
赵六    50
dtype: int64
s['赵六']
50
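  • Build a Series from a list
  • A Series consists of data and an index
    1. The index is on the left, the data on the right
    2. By default, the index is created automatically
  • Get the data and the index

    ser_obj.index, ser_obj.values
  • Preview the data

    ser_obj.head(n)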
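A minimal sketch of the attributes and the preview method listed above. The name ser_obj and the sample values are placeholders chosen for illustration.

```python
import pandas as pd

# Ten values with the default index 0..9.
ser_obj = pd.Series(range(10, 20))

print(ser_obj.index)    # the index object
print(ser_obj.values)   # the data as a NumPy array

# head(n) returns the first n rows; without an argument it returns 5.
first_three = ser_obj.head(3)
print(first_three)
```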
import pandas as pd
countries = ['中国','美国','日本','德国']
countries_s = pd.Series(countries)
print (countries_s)
0    中国
1    美国
2    日本
3    德国
dtype: object
import pandas as pd
country_dicts = {'CH': '中国', 'US': '美国', 'AU': '澳大利亚'}
country_dict_s = pd.Series(country_dicts)
country_dict_s.index.name = 'Code'
country_dict_s.name = 'Country'
print(country_dict_s)
print(country_dict_s.values)
print(country_dict_s.index)
Code
CH      中国
US      美国
AU    澳大利亚
Name: Country, dtype: object
['中国' '美国' '澳大利亚']
Index(['CH', 'US', 'AU'], dtype='object', name='Code')

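Note: the dictionary keys become the index.

A list can only be indexed by integers starting from 0, and by default a Series index works the same way. Unlike a list, however, a Series can be given a custom index.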

import pandas as pd
data = [1,2,3,4,5]
ind = ['a','b','c','d','e']
s = pd.Series (data, index = ind )
print (s)
a    1
b    2
c    3
d    4
e    5
dtype: int64

Converting a Series to a dictionary

import pandas as pd
s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])
s1 = s.to_dict ()
print (s1)
{'张三': 21, '李四': 19, '王五': 20, '赵六': 50}

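Vectorized operations

Vectorized operations (and vectorized thinking) on a Series are very important in data analysis and artificial intelligence: instead of processing scalars one at a time, express the computation on whole vectors (arrays).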

import numpy as np
import pandas as pd
s = range(11)
s1 = pd.Series(s)
total = np.sum(s1)
print('total = ',total)
total =  55
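To make the contrast between scalar and vector thinking concrete, here is a short sketch that squares every element twice: once with an explicit loop and once with a single vectorized expression. The variable names are illustrative only.

```python
import pandas as pd

s1 = pd.Series(range(11))  # 0, 1, ..., 10

# Scalar style: visit the elements one at a time.
squares_loop = [x ** 2 for x in s1]

# Vectorized style: one expression applied to the whole Series.
squares_vec = s1 ** 2

# Aggregations are vectorized too; this matches np.sum(s1).
total = s1.sum()
print('total = ', total)
print(squares_vec.tolist() == squares_loop)
```

Both approaches produce the same values; the vectorized form is shorter and leaves the looping to pandas and NumPy.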

DataFrame

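A Series resembles a one-dimensional array, whereas a DataFrame is a two-dimensional data structure, similar to a spreadsheet. It has both a row index (index) and a column index (column labels), and can be viewed as a dictionary of Series.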

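Each column is a Series, and all columns share one row index. A DataFrame is column-oriented: each column can hold a different data type, and because rows and columns are labeled, values are easy to retrieve.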
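The "dictionary of Series" view can be sketched as follows. The column names, row labels, and values are made up for illustration.

```python
import pandas as pd

# Two Series that share the same row index.
names = pd.Series(['Joe', 'Cat'], index=['r1', 'r2'])
points = pd.Series([4, 25], index=['r1', 'r2'])

# A dict of Series becomes a DataFrame: keys are column labels.
df = pd.DataFrame({'name': names, 'points': points})

col = df['points']   # selecting one column gives back a Series
print(type(col))
print(df.dtypes)     # each column keeps its own data type
```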

DataFrame objects and operations

  • Build a DataFrame from Series
  • Build a DataFrame from a dict
  • Get a column (as a Series) by its column label
    • df_obj[label] or df_obj.label
  • Add a column, much like adding a key-value pair to a dict
    • df_obj[new_label] = data
  • Delete a column
    • del df_obj[col_idx]
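The column operations in the list above can be sketched in a few lines. The frame df_obj and its column names are hypothetical sample data.

```python
import pandas as pd

df_obj = pd.DataFrame({'name': ['Joe', 'Cat'], 'points': [4, 25]})

# Get a column; df_obj.points works too when the label is a valid identifier.
col = df_obj['points']

# Add a column, like assigning a new key in a dict.
df_obj['bonus'] = df_obj['points'] * 2

# Delete a column.
del df_obj['name']

print(df_obj)
```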
# Create a DataFrame from Series
import pandas as pd
country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})
country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})
country3 = pd.Series({'Name': '澳大利亚','Language': 'English (AU)', 'Area':'7.692M km2','Happiness Rank': 9})
df = pd.DataFrame([country1, country2, country3], index=['CH', 'US', 'AU'])
print(df)
    Name      Language        Area  Happiness Rank
CH    中国       Chinese  9.597M km2              79
US    美国  English (US)  9.834M km2              14
AU  澳大利亚  English (AU)  7.692M km2               9
# Add data
import pandas as pd
country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})
country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})
df = pd.DataFrame([country1, country2], index=['CH', 'US'])
df['Location'] = '地球'
print(df)
   Name      Language        Area  Happiness Rank Location
CH   中国       Chinese  9.597M km2              79       地球
US   美国  English (US)  9.834M km2              14       地球
# Create a DataFrame from a dict
import pandas as pd
dt = {0: [9, 8, 7, 6], 1: [3, 2, 1, 0]}
a = pd.DataFrame(dt)
print (a)
   0  1
0  9  3
1  8  2
2  7  1
3  6  0
import pandas as pd
df1 =pd.DataFrame ([[1,2,3],[4,5,6]],index = ['A','B'],columns = ['C1','C2','C3'])
print (df1)
   C1  C2  C3
A   1   2   3
B   4   5   6
df1.T
    A  B
C1  1  4
C2  2  5
C3  3  6
df1.shape
(2, 3)
df1.size
6
df1.head(1)
   C1  C2  C3
A   1   2   3
df1.tail(1)
   C1  C2  C3
B   4   5   6
df1.describe()
            C1       C2       C3
count  2.00000  2.00000  2.00000
mean   2.50000  3.50000  4.50000
std    2.12132  2.12132  2.12132
min    1.00000  2.00000  3.00000
25%    1.75000  2.75000  3.75000
50%    2.50000  3.50000  4.50000
75%    3.25000  4.25000  5.25000
max    4.00000  5.00000  6.00000
df1.loc['B']
C1    4
C2    5
C3    6
Name: B, dtype: int64
df1.loc['B'].loc['C2']
5
df1.loc['B', 'C1']
4
df1.iloc[1, 2]
6
import pandas as pd
data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2014,2015,2016,2017,2018],'Points':[4,25,6,2,3]}
# Specify the row index
df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])
print (df)
      name  year  Points
Day1   Joe  2014       4
Day2   Cat  2015      25
Day3  Mike  2016       6
Day4   Kim  2017       2
Day5   Amy  2018       3
# Select a column
print(df['Points'])
Day1     4
Day2    25
Day3     6
Day4     2
Day5     3
Name: Points, dtype: int64

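DataFrame operations

  • List distinct values
  • Group data
  • Merge data
  • Clean data

Listing distinct values

unique is a method that lists the distinct values in a pandas column (a Series).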

import pandas as pd
data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2012,2012,2013,2018,2018],'Points':[4,25,6,2,3]}
df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])
print (df)
      name  year  Points
Day1   Joe  2012       4
Day2   Cat  2012      25
Day3  Mike  2013       6
Day4   Kim  2018       2
Day5   Amy  2018       3

First, select the column by passing its label to the DataFrame in square brackets.

Then call unique on that column to get its distinct values.

df['year']
Day1    2012
Day2    2012
Day3    2013
Day4    2018
Day5    2018
Name: year, dtype: int64
df['year'].unique()
array([2012, 2013, 2018], dtype=int64)

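Grouping data

  • Split the data into groups according to some criterion
  • Apply a function (method) to each group separately
  • Combine the results into a single data structure

groupby is the most commonly used and effective grouping function in pandas; the grouped object supports aggregation functions such as sum(), count(), and mean().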

import numpy as np
import pandas as pd
df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
                   'key2':['one', 'two', 'one', 'two', 'one'],
                   'data1':np.random.randn(5),
                   'data2':np.random.randn(5)})
print(df)
  key1 key2     data1     data2
0    a  one  1.600927 -0.876908
1    a  two  0.159591  0.288545
2    b  one  0.919900 -0.982536
3    b  two  1.158895  1.787031
4    a  one  0.116526  0.795206
grouped = df.groupby(df['key1'])
# key2 holds strings, so restrict the mean to the numeric columns
print(grouped.mean(numeric_only=True))
         data1     data2
key1
a     0.625681  0.068948
b     1.039398  0.402248
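The three steps (split, apply, combine) can be sketched with fixed numbers so the result is reproducible. The values below are made up for illustration; they are not the random data printed above.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
})

# groupby performs all three steps in one call.
means = df.groupby('key1')['data1'].mean()
print(means)

# The same thing spelled out: iterate over the groups (split),
# take each group's mean (apply), and collect into a dict (combine).
manual = {key: group['data1'].mean() for key, group in df.groupby('key1')}
print(manual)
```

Group a contains 1.0, 2.0, and 6.0, so its mean is 3.0; group b contains 3.0 and 4.0, so its mean is 3.5.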

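Merging data

Merging combines columns from different DataFrames based on a shared column.

Example: suppose there are two DataFrames:

(1) one contains student IDs and names
(2) the other contains student IDs and the grades for three courses: Math, Python, and Computational Thinking

Task: create a new DataFrame containing the student ID, name, and the three course grades.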

df2 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],
                    'Math':[91, 88, 75, 68],
                    'Python':[81, 82, 87, 76],
                    'Computational thinking':[94, 81, 85, 86]})
print(df2)
       Key  Math  Python  Computational thinking
0  2015308    91      81                      94
1  2016312    88      82                      81
2  2017301    75      87                      85
3  2017303    68      76                      86
df3 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],
                    'Name':['张三', '李四', '王五', '赵六']})
print(df3)
       Key Name
0  2015308   张三
1  2016312   李四
2  2017301   王五
3  2017303   赵六
# Merge the names (df3) with the grades (df2) on the shared Key column
dfnew = pd.merge(df3, df2, on='Key')
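The original does not show the merged result, so here is a self-contained sketch of the task stated above. It assumes the intended inputs are the name table and the grade table; the variable names names and scores are chosen for this example.

```python
import pandas as pd

names = pd.DataFrame({'Key': ['2015308', '2016312', '2017301', '2017303'],
                      'Name': ['张三', '李四', '王五', '赵六']})
scores = pd.DataFrame({'Key': ['2015308', '2016312', '2017301', '2017303'],
                       'Math': [91, 88, 75, 68],
                       'Python': [81, 82, 87, 76],
                       'Computational thinking': [94, 81, 85, 86]})

# Rows with the same Key are joined; columns from both frames are kept.
dfnew = pd.merge(names, scores, on='Key')
print(dfnew.columns.tolist())
print(dfnew.shape)
```

The result has one row per student and five columns: Key, Name, and the three grades.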

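Data cleaning

  • Handling missing data

    1. Detect missing values with ser_obj.isnull() and df_obj.isnull(); the inverse is notnull().
    2. Handle missing values
      1. df.fillna() fills missing values; df.dropna() drops them.
      2. df.ffill() fills each gap with the preceding value.
      3. df.bfill() fills each gap with the following value.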
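The cells that follow demonstrate drop rather than the missing-data methods listed above, so here is a brief sketch of those methods on a small frame with two gaps. The frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Math': [91.0, np.nan, 75.0],
                   'Python': [81.0, 82.0, np.nan]})

print(df.isnull())        # True where a value is missing

filled = df.fillna(0)     # replace every gap with 0
dropped = df.dropna()     # keep only rows without any gap
forward = df.ffill()      # copy the previous row's value into each gap
backward = df.bfill()     # copy the next row's value into each gap

print(forward)
print(backward)
```

Note that bfill cannot fill the last Python value, because no later row exists to copy from.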
df2
       Key  Math  Python  Computational thinking
0  2015308    91      81                      94
1  2016312    88      82                      81
2  2017301    75      87                      85
3  2017303    68      76                      86
df2.drop([0, 3])
       Key  Math  Python  Computational thinking
1  2016312    88      82                      81
2  2017301    75      87                      85
# axis selects the axis: 0 is rows, 1 is columns; the default is 0
df2.drop('Math', axis=1)
       Key  Python  Computational thinking
0  2015308      81                      94
1  2016312      82                      81
2  2017301      87                      85
3  2017303      76                      86

Quiz

Q1 For the following code, which of the following statements will not return True?

import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj1 = pd.Series(sdata)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj2 = pd.Series(sdata, index=states)
obj3 = pd.isnull(obj2)
import math

math.isnan(obj2['California'])
True
obj2
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
obj2['California'] == None
False
x = obj2['California']
obj2['California'] != x
True
obj3['California']
True

Q2 In the below python code, the keys of the dictionary d represent student ranks and the value for each key is a student name. Which of the following can be used to extract rows with student ranks that are lower than or equal to 3?

import pandas as pd
d = {
'1': 'Alice',
'2': 'Bob',
'3': 'Rita',
'4': 'Molly',
'5': 'Ryan'
}
S = pd.Series(d)
S.iloc[0:3]
1    Alice
2      Bob
3     Rita
dtype: object

Q3 Suppose we have a DataFrame named df. We want to change the original DataFrame df in a way that all the column names are cast to upper case. Which of the following expressions is incorrect to perform the same?

from pandas import DataFrame
score = {'gre_score':[337, 324, 316, 322, 314], 'toefl_score':[118, 107, 104, 110, 103]}
score_df = DataFrame(score, index = [1, 2, 3, 4, 5])
print(score_df)
   gre_score  toefl_score
1        337          118
2        324          107
3        316          104
4        322          110
5        314          103
score_df.where(score_df['toefl_score'] > 105).dropna()
   gre_score  toefl_score
1      337.0        118.0
2      324.0        107.0
4      322.0        110.0
score_df[score_df['toefl_score'] > 105]
   gre_score  toefl_score
1        337          118
2        324          107
4        322          110
score_df.where(score_df['toefl_score'] > 105)
   gre_score  toefl_score
1      337.0        118.0
2      324.0        107.0
3        NaN          NaN
4      322.0        110.0
5        NaN          NaN

Q5 Which of the following can be used to create a DataFrame in Pandas?

Python dict

Pandas Series object

2D ndarray

Q6 Which of the following is an incorrect way to drop entries from the Pandas DataFrame named df shown below?

city_dict = {'one':[0, 4, 8, 12], 'two':[1, 5, 9, 13], 'three':[2, 6, 10, 14], 'four':[3, 7, 11, 15]}
city_df = DataFrame(city_dict, index=['Ohio', 'Colorado', 'Utah', 'New York'])
print(city_df)
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
print(city_df.drop('two', axis=1))
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
print(city_df.drop(['Utah', 'Colorado']))
          one  two  three  four
Ohio        0    1      2     3
New York   12   13     14    15

Q7 For the Series s1 and s2 defined below, which of the following statements will give an error?

import pandas as pd
s1 = pd.Series({1: 'Alice', 2: 'Jack', 3: 'Molly'})
s2 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})
print(s1)
print(s2)
1    Alice
2     Jack
3    Molly
dtype: object
Alice    1
Jack     2
Molly    3
dtype: int64
s2.iloc[1]
2
s1.loc[1]
'Alice'
s2[1]
2
s2.loc[1]

Q8 Which of the following statements is incorrect?

  • We can use s.iteritems() on a pd.Series object s to iterate on it
  • If s and s1 are two pd.Series objects, we cannot use s.append(s1) to directly append s1 to the existing series s.
  • If s is a pd.Series object, then we can use s.loc[label] to get all data where the index is equal to label.
  • loc and iloc are two useful and commonly used Pandas methods.
s = pd.Series([1, 2, 3])
s
0    1
1    2
2    3
dtype: int64
s1 = pd.Series([4, 5, 6])
s1
0    4
1    5
2    6
dtype: int64
s.append(s1)
s
0    1
1    2
2    3
dtype: int64
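The cell above shows that s is unchanged after s.append(s1): append returns a new Series instead of modifying s in place. Series.append was removed in pandas 2.0, so on recent versions the same check can be sketched with pd.concat, which is swapped in here as the replacement rather than taken from the original.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s1 = pd.Series([4, 5, 6])

# pd.concat returns a new Series; s and s1 are left untouched.
combined = pd.concat([s, s1])

print(combined.tolist())
print(len(s))  # still 3
```

The combined Series keeps the original index labels (0, 1, 2, 0, 1, 2); passing ignore_index=True to pd.concat renumbers them from 0.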

Q9 For the given DataFrame df shown above, we want to get all records with a toefl score greater than 105 but smaller than 115. Which of the following expressions is incorrect to perform the same?

print(score_df)
   gre_score  toefl_score
1        337          118
2        324          107
3        316          104
4        322          110
5        314          103
score_df[(score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)]
   gre_score  toefl_score
2        324          107
4        322          110
score_df[(score_df['toefl_score'].isin(range(106, 115)))]
   gre_score  toefl_score
2        324          107
4        322          110
(score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)
1    False
2     True
3    False
4     True
5    False
Name: toefl_score, dtype: bool
score_df[score_df['toefl_score'].gt(105) & score_df['toefl_score'].lt(115)]
   gre_score  toefl_score
2        324          107
4        322          110

Q10 Which of the following is the correct way to extract all information related to the student named Alice from the DataFrame df given below:

stu_dict = {'Name':['Alice', 'Jack'], 'Age':[20, 22], 'Gender':['F', 'M']}
stu_df = DataFrame(stu_dict, index=['Mathematics', 'Sociology'])
print(stu_df)
              Name  Age Gender
Mathematics  Alice   20      F
Sociology     Jack   22      M
stu_df.loc['Mathematics']
Name      Alice
Age          20
Gender        F
Name: Mathematics, dtype: object
