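Cleaning Data in R

The goal of cleaning is to make the data easier to use.

Finding data and cleaning it are both important steps in data mining.

Three common functions for inspecting data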

# View the first 6 rows of data
head(weather)
# View the last 6 rows of data
tail(weather)
# View a condensed summary of the data
str(weather)
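
The weather data set is not included in these notes, so here is a minimal sketch of the same three calls on mtcars, a data frame that ships with base R. Using mtcars as a stand-in and passing n = 3 are assumptions made only for illustration.

```r
# mtcars comes with base R (datasets package); it stands in for `weather` here
first_rows <- head(mtcars, n = 3)   # first 3 rows instead of the default 6
last_rows  <- tail(mtcars, n = 3)   # last 3 rows instead of the default 6

print(first_rows)
print(last_rows)

# One line per column: its type and the first few values
str(mtcars)
```

The n argument is handy when six rows are too many or too few to judge the layout of a table.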

Exploring raw data

> # Check the class of bmi
> class(bmi)
[1] "data.frame"
>
> # Check the dimensions of bmi
> dim(bmi)
[1] 199 30
>
> # View the column names of bmi
> names(bmi)
[1] "Country" "Y1980" "Y1981" "Y1982" "Y1983" "Y1984" "Y1985"
[8] "Y1986" "Y1987" "Y1988" "Y1989" "Y1990" "Y1991" "Y1992"
[15] "Y1993" "Y1994" "Y1995" "Y1996" "Y1997" "Y1998" "Y1999"
[22] "Y2000" "Y2001" "Y2002" "Y2003" "Y2004" "Y2005" "Y2006"
[29] "Y2007" "Y2008"

Use the glimpse() function from the dplyr package to view the structure of the data

> # Load dplyr
> library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats': filter, lag
The following objects are masked from 'package:base': intersect, setdiff, setequal, union
>
> # Check the structure of bmi, the dplyr way
>
> glimpse(bmi)
Observations: 199
Variables: 30
$ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "...
$ Y1980 <dbl> 21.48678, 25.22533, 22.25703, 25.66652, 20.94876, 23.31424,...
$ Y1981 <dbl> 21.46552, 25.23981, 22.34745, 25.70868, 20.94371, 23.39054,...
$ Y1982 <dbl> 21.45145, 25.25636, 22.43647, 25.74681, 20.93754, 23.45883,...
$ Y1983 <dbl> 21.43822, 25.27176, 22.52105, 25.78250, 20.93187, 23.53735,...
$ Y1984 <dbl> 21.42734, 25.27901, 22.60633, 25.81874, 20.93569, 23.63584,...
$ Y1985 <dbl> 21.41222, 25.28669, 22.69501, 25.85236, 20.94857, 23.73109,...
$ Y1986 <dbl> 21.40132, 25.29451, 22.76979, 25.89089, 20.96030, 23.83449,...
$ Y1987 <dbl> 21.37679, 25.30217, 22.84096, 25.93414, 20.98025, 23.93649,...
$ Y1988 <dbl> 21.34018, 25.30450, 22.90644, 25.98477, 21.01375, 24.05364,...
$ Y1989 <dbl> 21.29845, 25.31944, 22.97931, 26.04450, 21.05269, 24.16347,...
$ Y1990 <dbl> 21.24818, 25.32357, 23.04600, 26.10936, 21.09007, 24.26782,...
$ Y1991 <dbl> 21.20269, 25.28452, 23.11333, 26.17912, 21.12136, 24.36568,...
$ Y1992 <dbl> 21.14238, 25.23077, 23.18776, 26.24017, 21.14987, 24.45644,...
$ Y1993 <dbl> 21.06376, 25.21192, 23.25764, 26.30356, 21.13938, 24.54096,...
$ Y1994 <dbl> 20.97987, 25.22115, 23.32273, 26.36793, 21.14186, 24.60945,...
$ Y1995 <dbl> 20.91132, 25.25874, 23.39526, 26.43569, 21.16022, 24.66461,...
$ Y1996 <dbl> 20.85155, 25.31097, 23.46811, 26.50769, 21.19076, 24.72544,...
$ Y1997 <dbl> 20.81307, 25.33988, 23.54160, 26.58255, 21.22621, 24.78714,...
$ Y1998 <dbl> 20.78591, 25.39116, 23.61592, 26.66337, 21.27082, 24.84936,...
$ Y1999 <dbl> 20.75469, 25.46555, 23.69486, 26.75078, 21.31954, 24.91721,...
$ Y2000 <dbl> 20.69521, 25.55835, 23.77659, 26.83179, 21.37480, 24.99158,...
$ Y2001 <dbl> 20.62643, 25.66701, 23.86256, 26.92373, 21.43664, 25.05857,...
$ Y2002 <dbl> 20.59848, 25.77167, 23.95294, 27.02525, 21.51765, 25.13039,...
$ Y2003 <dbl> 20.58706, 25.87274, 24.05243, 27.12481, 21.59924, 25.20713,...
$ Y2004 <dbl> 20.57759, 25.98136, 24.15957, 27.23107, 21.69218, 25.29898,...
$ Y2005 <dbl> 20.58084, 26.08939, 24.27001, 27.32827, 21.80564, 25.39965,...
$ Y2006 <dbl> 20.58749, 26.20867, 24.38270, 27.43588, 21.93881, 25.51382,...
$ Y2007 <dbl> 20.60246, 26.32753, 24.48846, 27.53363, 22.08962, 25.64247,...
$ Y2008 <dbl> 20.62058, 26.44657, 24.59620, 27.63048, 22.25083, 25.76602,...
> # View a summary of bmi
> summary(bmi)
Country Y1980 Y1981 Y1982
Length:199 Min. :19.01 Min. :19.04 Min. :19.07
Class :character 1st Qu.:21.27 1st Qu.:21.31 1st Qu.:21.36
Mode :character Median :23.31 Median :23.39 Median :23.46
Mean :23.15 Mean :23.21 Mean :23.26
3rd Qu.:24.82 3rd Qu.:24.89 3rd Qu.:24.94
Max. :28.12 Max. :28.36 Max. :28.58
Y1983 Y1984 Y1985 Y1986
Min. :19.10 Min. :19.13 Min. :19.16 Min. :19.20
1st Qu.:21.42 1st Qu.:21.45 1st Qu.:21.47 1st Qu.:21.49
Median :23.57 Median :23.64 Median :23.73 Median :23.82
Mean :23.32 Mean :23.37 Mean :23.42 Mean :23.48
3rd Qu.:25.02 3rd Qu.:25.06 3rd Qu.:25.11 3rd Qu.:25.20
Max. :28.82 Max. :29.05 Max. :29.28 Max. :29.52
Y1987 Y1988 Y1989 Y1990
Min. :19.23 Min. :19.27 Min. :19.31 Min. :19.35
1st Qu.:21.50 1st Qu.:21.52 1st Qu.:21.55 1st Qu.:21.57
Median :23.87 Median :23.93 Median :24.03 Median :24.14
Mean :23.53 Mean :23.59 Mean :23.65 Mean :23.71
3rd Qu.:25.27 3rd Qu.:25.34 3rd Qu.:25.37 3rd Qu.:25.39
Max. :29.75 Max. :29.98 Max. :30.20 Max. :30.42
Y1991 Y1992 Y1993 Y1994
Min. :19.40 Min. :19.45 Min. :19.51 Min. :19.59
1st Qu.:21.60 1st Qu.:21.65 1st Qu.:21.74 1st Qu.:21.76
Median :24.20 Median :24.19 Median :24.27 Median :24.36
Mean :23.76 Mean :23.82 Mean :23.88 Mean :23.94
3rd Qu.:25.42 3rd Qu.:25.48 3rd Qu.:25.54 3rd Qu.:25.62
Max. :30.64 Max. :30.85 Max. :31.04 Max. :31.23
Y1995 Y1996 Y1997 Y1998
Min. :19.67 Min. :19.71 Min. :19.74 Min. :19.77
1st Qu.:21.83 1st Qu.:21.89 1st Qu.:21.94 1st Qu.:22.00
Median :24.41 Median :24.42 Median :24.50 Median :24.49
Mean :24.00 Mean :24.07 Mean :24.14 Mean :24.21
3rd Qu.:25.70 3rd Qu.:25.78 3rd Qu.:25.85 3rd Qu.:25.94
Max. :31.41 Max. :31.59 Max. :31.77 Max. :31.95
Y1999 Y2000 Y2001 Y2002
Min. :19.80 Min. :19.83 Min. :19.86 Min. :19.84
1st Qu.:22.04 1st Qu.:22.12 1st Qu.:22.22 1st Qu.:22.29
Median :24.61 Median :24.66 Median :24.73 Median :24.81
Mean :24.29 Mean :24.36 Mean :24.44 Mean :24.52
3rd Qu.:26.01 3rd Qu.:26.09 3rd Qu.:26.19 3rd Qu.:26.30
Max. :32.13 Max. :32.32 Max. :32.51 Max. :32.70
Y2003 Y2004 Y2005 Y2006
Min. :19.81 Min. :19.79 Min. :19.79 Min. :19.80
1st Qu.:22.37 1st Qu.:22.45 1st Qu.:22.54 1st Qu.:22.63
Median :24.89 Median :25.00 Median :25.11 Median :25.24
Mean :24.61 Mean :24.70 Mean :24.79 Mean :24.89
3rd Qu.:26.38 3rd Qu.:26.47 3rd Qu.:26.53 3rd Qu.:26.59
Max. :32.90 Max. :33.10 Max. :33.30 Max. :33.49
Y2007 Y2008
Min. :19.83 Min. :19.87
1st Qu.:22.73 1st Qu.:22.83
Median :25.36 Median :25.50
Mean :24.99 Mean :25.10
3rd Qu.:26.66 3rd Qu.:26.82
Max. :33.69 Max. :33.90

Extracting a single column with $

# Histogram of BMIs from 2008
hist(bmi$Y2008)
# Scatter plot comparing BMIs from 1980 to those from 2008
plot(bmi$Y1980, bmi$Y2008)
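
To make the $ operator concrete without the bmi data, here is a small sketch on a hypothetical three-row data frame (the column values are made up). It shows that $ returns the column as a plain vector, which is why it can be passed straight to hist(), plot(), or mean().

```r
# Hypothetical stand-in for bmi; the values are invented for illustration
df <- data.frame(Country = c("A", "B", "C"),
                 Y2008 = c(20.6, 26.4, 24.6))

by_dollar  <- df$Y2008        # extract the column as a numeric vector
by_bracket <- df[["Y2008"]]   # equivalent extraction by name

# Both forms return the same vector
print(identical(by_dollar, by_bracket))

# A vector can go directly into summary functions
print(mean(df$Y2008))
```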

Introduction to tidyr

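For detailed notes on tidyr and descriptions of its function arguments, see the tidyr documentation.

gather()

gather() is similar to the unpivot feature in Excel (2016 and later): it takes a two-dimensional table whose column names are themselves values of a variable and converts it into a tidy table, like a relation in a database (see the example below).

Arguments of gather()

The first argument is the original data, which must be a data frame.

The next two arguments form a key-value pair. You choose the names yourself; they become the headers of the two new columns in the converted table, that is, the two new variable names.

The fourth argument selects the columns to gather. If it is omitted, all columns are gathered.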

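library(tidyr)
stu <- data.frame(grade = c("A", "B", "C", "D", "E"), female = c(5, 4, 1, 2, 3), male = c(1, 2, 3, 4, 5))

# Gather every column except grade into gender (key) / count (value) pairs
gather(stu, gender, count, -grade)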
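
As a rough check of what the call above produces, the sketch below runs it and compares it with pivot_longer(), the newer tidyr interface that supersedes gather(). It assumes tidyr 1.0.0 or later is installed; the names long_gather and long_pivot are chosen here for illustration.

```r
library(tidyr)

stu <- data.frame(grade = c("A", "B", "C", "D", "E"),
                  female = c(5, 4, 1, 2, 3),
                  male = c(1, 2, 3, 4, 5))

# The call from the notes: key = gender, value = count, gather all but grade
long_gather <- gather(stu, gender, count, -grade)

# The same reshaping with the newer pivot_longer() interface
long_pivot <- pivot_longer(stu, cols = -grade,
                           names_to = "gender", values_to = "count")

# Both results have one row per grade/gender combination
print(dim(long_gather))
print(long_pivot)
```

The two results hold the same rows but in a different order: gather() stacks the female column on top of the male column, while pivot_longer() works row by row.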

spread()

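spread() widens a table: it takes a key column and a value column and spreads them into multiple columns, one for each distinct key.

# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_val)
# View the head of bmi_wide
head(bmi_wide)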
Country Y1980 Y1981 Y1982 Y1983 Y1984 Y1985
1 Afghanistan 21.48678 21.46552 21.45145 21.43822 21.42734 21.41222
2 Albania 25.22533 25.23981 25.25636 25.27176 25.27901 25.28669
3 Algeria 22.25703 22.34745 22.43647 22.52105 22.60633 22.69501
4 Andorra 25.66652 25.70868 25.74681 25.78250 25.81874 25.85236
5 Angola 20.94876 20.94371 20.93754 20.93187 20.93569 20.94857
6 Antigua and Barbuda 23.31424 23.39054 23.45883 23.53735 23.63584 23.73109
Y1986 Y1987 Y1988 Y1989 Y1990 Y1991 Y1992 Y1993
1 21.40132 21.37679 21.34018 21.29845 21.24818 21.20269 21.14238 21.06376
2 25.29451 25.30217 25.30450 25.31944 25.32357 25.28452 25.23077 25.21192
3 22.76979 22.84096 22.90644 22.97931 23.04600 23.11333 23.18776 23.25764
4 25.89089 25.93414 25.98477 26.04450 26.10936 26.17912 26.24017 26.30356
5 20.96030 20.98025 21.01375 21.05269 21.09007 21.12136 21.14987 21.13938
6 23.83449 23.93649 24.05364 24.16347 24.26782 24.36568 24.45644 24.54096
Y1994 Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001
1 20.97987 20.91132 20.85155 20.81307 20.78591 20.75469 20.69521 20.62643
2 25.22115 25.25874 25.31097 25.33988 25.39116 25.46555 25.55835 25.66701
3 23.32273 23.39526 23.46811 23.54160 23.61592 23.69486 23.77659 23.86256
4 26.36793 26.43569 26.50769 26.58255 26.66337 26.75078 26.83179 26.92373
5 21.14186 21.16022 21.19076 21.22621 21.27082 21.31954 21.37480 21.43664
6 24.60945 24.66461 24.72544 24.78714 24.84936 24.91721 24.99158 25.05857
Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008
1 20.59848 20.58706 20.57759 20.58084 20.58749 20.60246 20.62058
2 25.77167 25.87274 25.98136 26.08939 26.20867 26.32753 26.44657
3 23.95294 24.05243 24.15957 24.27001 24.38270 24.48846 24.59620
4 27.02525 27.12481 27.23107 27.32827 27.43588 27.53363 27.63048
5 21.51765 21.59924 21.69218 21.80564 21.93881 22.08962 22.25083
6 25.13039 25.20713 25.29898 25.39965 25.51382 25.64247 25.76602
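
Since bmi_long is not reproduced in these notes, here is a minimal sketch of spread() on a small hypothetical long table, next to pivot_wider(), its newer replacement in tidyr (assuming tidyr 1.0.0 or later). The data and the names stu_long, stu_wide, and stu_wide2 are made up for illustration.

```r
library(tidyr)

# Hypothetical long table: one row per grade/gender pair
stu_long <- data.frame(grade = rep(c("A", "B", "C"), times = 2),
                       gender = rep(c("female", "male"), each = 3),
                       count = c(5, 4, 1, 1, 2, 3))

# spread(): key column = gender, value column = count
stu_wide <- spread(stu_long, gender, count)

# The same reshaping with the newer pivot_wider() interface
stu_wide2 <- pivot_wider(stu_long, names_from = gender, values_from = count)

print(stu_wide)
print(stu_wide2)
```

Each distinct value of gender becomes its own column, so the six long rows collapse into three wide ones.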

separate()

The separate() function allows you to separate one column into multiple columns. Unless you tell it otherwise, it will attempt to separate on any character that is not a letter or number. You can also specify a specific separator using the sep argument.

# View the head of census_long3
head(census_long3)
# Separate the yr_month column into two
census_long4 <- separate(census_long3, yr_month, c("year", "month"))
# View the first 6 rows of the result
head(census_long4)

That was a simple example.

The next example spells out the arguments more explicitly.


> # Apply separate() to bmi_cc
> bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")
>
> # Print the head of the result
> head(bmi_cc_clean)
Country ISO year bmi_val
1 Afghanistan AF Y1980 21.48678
2 Albania AL Y1980 25.22533
3 Algeria DZ Y1980 22.25703
4 Andorra AD Y1980 25.66652
5 Angola AO Y1980 20.94876
6 Antigua and Barbuda AG Y1980 23.31424
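
Neither census_long3 nor bmi_cc is included here, so the following sketch replays both styles of call on a tiny hypothetical data frame. The mixed separators in yr_month are deliberate: they illustrate the default behaviour of splitting on any non-alphanumeric character.

```r
library(tidyr)

# Hypothetical data; note the three different separators in yr_month
df <- data.frame(yr_month = c("2015_01", "2015-02", "2016.03"),
                 Country_ISO = c("Albania/AL", "Algeria/DZ", "Angola/AO"))

# Default sep: split on any run of characters that are not letters or digits
step1 <- separate(df, yr_month, c("year", "month"))

# Explicit arguments: which column, the new names, and the separator
step2 <- separate(step1, col = Country_ISO, into = c("Country", "ISO"), sep = "/")

print(step2)
```

The new columns stay character vectors unless convert = TRUE is passed, which is why the month values keep their leading zeros.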

unite()

The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in the new column, but this behavior can be altered via the sep argument. (Source: DataCamp)

> # Apply unite() to bmi_cc_clean
> bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")
>
> # View the head of the result
> head(bmi_cc)
Country_ISO year bmi_val
1 Afghanistan-AF Y1980 21.48678
2 Albania-AL Y1980 25.22533
3 Algeria-DZ Y1980 22.25703
4 Andorra-AD Y1980 25.66652
5 Angola-AO Y1980 20.94876
6 Antigua and Barbuda-AG Y1980 23.31424
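
The session above sets sep explicitly, so the default underscore is never shown. Here is a small sketch on a hypothetical two-row data frame that shows the default, plus the remove = FALSE option for keeping the input columns.

```r
library(tidyr)

# Hypothetical stand-in for bmi_cc_clean
df <- data.frame(Country = c("Albania", "Algeria"),
                 ISO = c("AL", "DZ"))

# Default separator is an underscore; the input columns are dropped
default_sep <- unite(df, Country_ISO, Country, ISO)

# Custom separator, keeping the original columns next to the new one
keep_inputs <- unite(df, Country_ISO, Country, ISO, sep = "/", remove = FALSE)

print(default_sep)
print(keep_inputs)
```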

Common data types

# Make this evaluate to "character"
class("TRUE")
# Make this evaluate to "numeric"
class(8484.00)
# Make this evaluate to "integer"
class(99L)
# Make this evaluate to "factor"
class(factor("factor"))
# Make this evaluate to "logical"
class(FALSE)
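
Knowing the class of a column matters mostly because cleaning often means converting between these types. The sketch below uses base R's as.*() functions on made-up values; it is an illustration added here, not part of the original exercises.

```r
# Numbers read in as text, with one entry that is not a number
x <- c("1", "2", "three")
print(class(x))

# Entries that cannot be parsed become NA (R also raises a warning,
# which is silenced here to keep the output short)
nums <- suppressWarnings(as.numeric(x))
print(nums)

# A factor stores integer codes plus a set of labels (levels)
f <- factor(c("low", "high", "low"))
labels <- as.character(f)   # back to the original text
codes  <- as.integer(f)     # the internal level codes, NOT the labels
print(labels)
print(codes)
```

Levels are sorted alphabetically by default, so "high" gets code 1 and "low" gets code 2. Calling as.integer() or as.numeric() directly on a factor is a common source of silent errors.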

lubridate

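lubridate handles two kinds of data:

1. Points in time (time instants)

2. Lengths of time (time spans)

  • ymd("...", tz=NULL) / dmy() / mdy(): parse dates whose components are written in different orders and return them in year-month-day form
  • dym() / ydm()
  • hms("...", roll=FALSE) / hm() / ms(): parse times made up of different components (hours, minutes, seconds)
  • ymd_hms("...", tz="UTC", locale=Sys.getlocale("LC_TIME"), truncated = 0) / ymd_hm / ymd_h: parse date-times whose components are written in different orders
  • dmy_hms / dmy_hm / dmy_h
  • mdy_hms / mdy_hm / mdy_h

tz = "UTC": Coordinated Universal Time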

> # Load the lubridate package
> library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:base': date
>
> # Parse as date
> dmy("17 Sep 2015")
[1] "2015-09-17"
>
> # Parse as date and time (with no seconds!)
> mdy_hm("July 15, 2012 12:56")
[1] "2012-07-15 12:56:00 UTC"
>
> # Coerce dob to a date (with no time)
> students2$dob <- ymd(students2$dob)
>
> # Coerce nurse_visit to a date and time
> students2$nurse_visit <- ymd_hms(students2$nurse_visit)
>
> # Look at students2 once more with str()
> str(students2)
'data.frame': 395 obs. of 33 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ school : chr "GP" "GP" "GP" "GP" ...
$ sex : chr "F" "F" "F" "F" ...
$ dob : Date, format: "2000-06-05" "1999-11-25" ...
$ address : chr "U" "U" "U" "U" ...
$ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
$ Pstatus : chr "A" "T" "T" "T" ...
$ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
$ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
$ Mjob : chr "at_home" "at_home" "at_home" "health" ...
$ Fjob : chr "teacher" "other" "other" "services" ...
$ reason : chr "course" "course" "other" "home" ...
$ guardian : chr "mother" "father" "mother" "mother" ...
$ traveltime : int 2 1 1 1 1 1 1 2 1 1 ...
$ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
$ failures : int 0 0 3 0 0 0 0 0 0 0 ...
$ schoolsup : chr "yes" "no" "yes" "no" ...
$ famsup : chr "no" "yes" "no" "yes" ...
$ paid : chr "no" "no" "yes" "yes" ...
$ activities : chr "no" "no" "no" "yes" ...
$ nursery : chr "yes" "no" "yes" "yes" ...
$ higher : chr "yes" "yes" "yes" "yes" ...
$ internet : chr "no" "yes" "yes" "yes" ...
$ romantic : chr "no" "no" "no" "yes" ...
$ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
$ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
$ goout : int 4 3 2 2 2 2 4 4 2 1 ...
$ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
$ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
$ health : int 3 3 3 5 5 5 3 1 1 5 ...
$ nurse_visit: POSIXct, format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
$ absences : int 6 4 10 2 4 10 0 6 0 0 ...
$ Grades : chr "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...
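
The session above only covers time instants. For the time-span side, here is a minimal sketch using lubridate periods; the timestamp is taken from the nurse_visit column shown above, while the one-day-two-hour offset is made up for illustration.

```r
library(lubridate)

# A time instant, parsed as UTC by default
start <- ymd_hms("2014-04-10 14:59:54")

# A time span: hms() returns a period of 2 hours, 30 minutes, 15 seconds
span <- hms("02:30:15")
print(span)
print(period_to_seconds(span))

# Periods can be added to instants
later <- start + days(1) + hours(2)
print(later)

# Elapsed time between two instants, expressed in hours
elapsed <- difftime(later, start, units = "hours")
print(elapsed)
```

Note that hms() produces a period (a length of time), not a clock time attached to a date. That is the distinction between the two kinds of data listed at the start of this section.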

The stringr package

Trimming and padding strings

str_trim()

The str_trim() function from stringr makes it easy to remove leading and trailing whitespace while leaving intact the part of the string that you actually want.

str_pad()

> library(stringr)
>
> # Trim all leading and trailing whitespace
> str_trim(c(" Filip ", "Nick ", " Jonathan"))
[1] "Filip" "Nick" "Jonathan"
>
> # Pad these strings with leading zeros
> str_pad(c("23485W", "8823453Q", "994Z"), width = 9, side = "left", pad = "0")
[1] "00023485W" "08823453Q" "00000994Z"
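
Both functions take a side argument that the session above only partly exercises. The sketch below tries the other settings on made-up strings.

```r
library(stringr)

# Trim only one side; the default is side = "both"
left_only <- str_trim("  Filip  ", side = "left")
print(left_only)

# Pad on the right instead of the left
right_pad <- str_pad("994Z", width = 9, side = "right", pad = "0")
print(right_pad)

# Strings already longer than width are returned unchanged, never truncated
too_long <- str_pad("8823453Q12", width = 9, side = "left", pad = "0")
print(too_long)
```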

Missing data

Missing values

# Call is.na() on the full social_df to spot all NAs (shows which positions are missing)
is.na(social_df)
# Use anyNA() to ask whether there are any NAs in the data (same as any(is.na(social_df)))
anyNA(social_df)
# View a summary() of the dataset to check its overall structure
summary(social_df)
# Call table() on the status column to tabulate its values
table(social_df$status)
# Replace all empty strings in status with NA
social_df$status[social_df$status == ""] <- NA
# Print social_df to the console
social_df
# Use complete.cases() to see which rows have no missing values
complete.cases(social_df)
# Use na.omit() to remove all rows with any missing values
na.omit(social_df)
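
Since social_df is not shown in these notes, here is a runnable sketch of the same steps on a hypothetical four-row version. The names and values are invented, and colSums(is.na(...)) is an extra step added here to count NAs per column.

```r
# Hypothetical stand-in for social_df; empty strings mark missing statuses
social_df <- data.frame(name = c("Sarah", "Tom", "David", "Alice"),
                        n_friends = c(244, NA, 145, 43),
                        status = c("Going out!", "", "Movie night...", ""),
                        stringsAsFactors = FALSE)

# Empty strings are not NA yet, so recode them first
social_df$status[social_df$status == ""] <- NA

# Number of missing values in each column
na_counts <- colSums(is.na(social_df))
print(na_counts)

# TRUE only for rows with no NA in any column
complete <- complete.cases(social_df)
print(complete)

# Keep only the complete rows
clean <- na.omit(social_df)
print(clean)
```

Recoding the empty strings first matters: before that step, is.na() would report only the one missing n_friends value and na.omit() would keep three rows instead of two.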

Outliers and obvious errors

# Review distributions for all variables
summary(weather6)
# Find row with Max.Humidity of 1000
ind <- which(weather6$Max.Humidity == 1000)
# Look at the data for that day
weather6[ind, ]
# Change 1000 to 100
weather6$Max.Humidity[ind] <- 100
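
The code above searches for the exact value 1000, which only works once summary() has revealed it. A slightly more general sketch, on a hypothetical five-row table, flags anything outside the valid range for a percentage. Treating the bad value as 100 with an extra zero is an assumption carried over from the example above, not something the data can confirm.

```r
# Hypothetical stand-in for weather6
weather_demo <- data.frame(day = 1:5,
                           Max.Humidity = c(93, 100, 1000, 87, 76))

# Humidity is a percentage, so anything outside 0-100 is suspect
bad <- which(weather_demo$Max.Humidity < 0 | weather_demo$Max.Humidity > 100)
print(weather_demo[bad, ])

# Assumption: 1000 is a data-entry error for 100 (an extra zero)
weather_demo$Max.Humidity[bad] <- 100

# Confirm that every value is now within range
print(range(weather_demo$Max.Humidity))
```

A range check like this catches any impossible value, not just the one already spotted by eye. The correction itself still needs a judgment call about what the true value was.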
