[R] [Johns Hopkins] R Programming Assignment Week 2 - Air Pollution
Introduction
For this first programming assignment you will write three functions that are meant to interact with the dataset that accompanies this assignment. The dataset is contained in a zip file, specdata.zip, that you can download from the Coursera web site.
Data
The zip file containing the data can be downloaded here:
- specdata.zip [2.4MB]
The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:
- Date: the date of the observation in YYYY-MM-DD format (year-month-day)
- sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
- nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)
For this programming assignment you will need to unzip this file and create the directory ‘specdata’. Once you have unzipped the zip file, do not make any modifications to the files in the ‘specdata’ directory. In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.
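To get a feel for the missing values described above, here is a minimal sketch in R. It builds a tiny stand-in for one monitor file in a temporary location rather than reading the real data; the values and the names `toy`, `tmp`, `na_per_column`, and `n_complete` are made up for illustration.

```r
# Hypothetical miniature of one monitor file (the real files live in 'specdata').
toy <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04"),
  sulfate = c(NA, 3.2, 4.1, NA),
  nitrate = c(NA, 0.9, NA, 1.1)
)
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

monitor <- read.csv(tmp)                      # "NA" in the file is read back as a missing value
na_per_column <- colSums(is.na(monitor))      # number of missing values in each column
n_complete <- sum(complete.cases(monitor))    # rows with no missing value in any column

print(na_per_column)
print(n_complete)
```

Only the second row of the toy data has both sulfate and nitrate observed, so it is the only complete case.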
Part 1
pollutantmean <- function(directory, pollutant, id = 1:332) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'pollutant' is a character vector of length 1 indicating
## the name of the pollutant for which we will calculate the
## mean; either "sulfate" or "nitrate".
## 'id' is an integer vector indicating the monitor ID numbers
## to be used
## Return the mean of the pollutant across all monitors listed
## in the 'id' vector (ignoring NA values)
## NOTE: Do not round the result!
}
You can see some example output from this function. The function that you write should be able to match this output. Please save your code to a file named pollutantmean.R.
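One point that is easy to miss: the solutions in the answer section pool every observation from the selected monitors before averaging, rather than averaging the per-monitor means. The two can differ when monitors have different numbers of valid readings. A small sketch with made-up numbers (the vectors `monitor_a` and `monitor_b` are hypothetical):

```r
# Two hypothetical monitors with different numbers of valid readings.
monitor_a <- c(1, 2, 3, NA)   # three valid values, mean 2
monitor_b <- c(10, NA)        # one valid value, mean 10

# Pooled mean: combine all readings first, then average (NA values ignored).
pooled_mean <- mean(c(monitor_a, monitor_b), na.rm = TRUE)

# Mean of per-monitor means: each monitor counts equally, regardless of size.
mean_of_means <- mean(c(mean(monitor_a, na.rm = TRUE),
                        mean(monitor_b, na.rm = TRUE)))

print(pooled_mean)     # 4
print(mean_of_means)   # 6
```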
Part 2
Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. A prototype of this function follows
complete <- function(directory, id = 1:332) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'id' is an integer vector indicating the monitor ID numbers
## to be used
## Return a data frame of the form:
## id nobs
## 1 117
## 2 1041
## ...
## where 'id' is the monitor ID number and 'nobs' is the
## number of complete cases
}
You can see some example output from this function. The function that you write should be able to match this output. Please save your code to a file named complete.R. To run the submit script for this part, make sure your working directory has the file complete.R in it.
Part 3
Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows
corr <- function(directory, threshold = 0) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'threshold' is a numeric vector of length 1 indicating the
## number of completely observed observations (on all
## variables) required to compute the correlation between
## nitrate and sulfate; the default is 0
## Return a numeric vector of correlations
## NOTE: Do not round the result!
}
For this function you will need to use the ‘cor’ function in R which calculates the correlation between two vectors. Please read the help page for this function via ‘?cor’ and make sure that you know how to use it.
You can see some example output from this function. The function that you write should be able to match this output. Please save your code to a file named corr.R. To run the submit script for this part, make sure your working directory has the file corr.R in it.
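As a quick illustration of how cor() treats missing values, the sketch below uses two short made-up vectors. By default cor() returns NA if any value is missing; either pass use = "complete.obs" or drop the incomplete pairs yourself. The variable names here are illustrative only.

```r
sulfate <- c(1.0, 2.0, NA, 4.0, 5.0)
nitrate <- c(2.1, 3.9, 6.0, NA, 10.2)

r_default <- cor(sulfate, nitrate)                        # NA: the default is use = "everything"
r_builtin <- cor(sulfate, nitrate, use = "complete.obs")  # drops pairs with a missing value

keep <- complete.cases(sulfate, nitrate)                  # TRUE where both values are present
r_manual <- cor(sulfate[keep], nitrate[keep])             # same result, filtering by hand

print(c(r_default, r_builtin, r_manual))
```

Both approaches use the same three complete pairs, so r_builtin and r_manual agree.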
--------------------------------------------------------------Answer section------------------------------------------------------------------------
You can click the download link and unzip the file by hand,
or define a custom R function, get_specdata(), to carry out those steps.
# Define get_specdata()
get_specdata <- function(dest_file) {
  specdata_url <- "https://storage.googleapis.com/jhu_rprg/specdata.zip" # URL of the zip file
  download.file(specdata_url, destfile = dest_file) # download the zip; destfile is where it is saved. Note: R expands ~ to the user's home directory (on Windows this is usually the Documents folder)
  unzip(dest_file) # with no exdir given, the files are extracted into the current working directory
}
get_specdata("~/specdata.zip")
# A version of get_specdata() that lets you choose where to extract
get_specdata <- function(dest_file, ex_dir) {
  specdata_url <- "https://storage.googleapis.com/jhu_rprg/specdata.zip"
  download.file(specdata_url, destfile = dest_file)
  unzip(dest_file, exdir = ex_dir) # exdir is the directory to extract into
}
get_specdata("~/specdata.zip", "D:/R/Project")
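If you rerun your script often, it can be convenient to skip the download when the data are already in place. The following is only a sketch: get_specdata_once() is a hypothetical name, and it assumes the archive extracts into a folder called 'specdata' inside ex_dir, as the assignment text describes. The explicit mode = "wb" requests a binary transfer, which is the safe choice for zip files on Windows.

```r
get_specdata_once <- function(dest_file, ex_dir = ".") {
  # Same URL as in the functions above.
  specdata_url <- "https://storage.googleapis.com/jhu_rprg/specdata.zip"
  target <- file.path(ex_dir, "specdata")   # assumed name of the extracted folder
  if (!dir.exists(target)) {
    download.file(specdata_url, destfile = dest_file, mode = "wb")
    unzip(dest_file, exdir = ex_dir)
  }
  target                                    # path to pass on as 'directory'
}

# Demonstration without network access: pretend the data were already extracted.
demo_root <- tempfile("project_")
dir.create(file.path(demo_root, "specdata"), recursive = TRUE)
data_dir <- get_specdata_once(file.path(demo_root, "specdata.zip"), demo_root)
print(data_dir)
```

Because the 'specdata' folder already exists in the demonstration, nothing is downloaded and the function simply returns the path.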
pollutantmean()
pollutantmean <- function(directory, pollutant, id = 1:332) {
  CSV_files_dir <- list.files(directory, full.names = TRUE) # full paths of the files in the target folder (a character vector sorted by name)
  dataf <- data.frame()
  for (i in id) {
    dataf <- rbind(dataf, read.csv(CSV_files_dir[i])) # rbind appends each monitor's rows to dataf
  }
  mean(dataf[, pollutant], na.rm = TRUE) # mean of the chosen column over all rows, ignoring NA
}
An alternative version
pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutants <- c() # empty vector to collect the values
  filenames <- list.files(directory) # full.names is not set here, so only the file names are returned
  for (i in id) {
    filepath <- paste(directory, "/", filenames[i], sep = "") # join the directory and file name into a complete path
    data <- read.csv(filepath, header = TRUE) # read the target file, including its header row
    pollutants <- c(pollutants, data[, pollutant]) # append this file's values to the pollutants vector
  }
  pollutants_mean <- mean(pollutants, na.rm = TRUE) # compute the mean, ignoring NA
  pollutants_mean # return the result
}
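Both versions above pick files by position in the list.files() result, which works only as long as the folder holds exactly the monitor files in sorted order. A sketch of an alternative is to build each path from the monitor ID itself. This assumes the files are named with three-digit, zero-padded IDs (001.csv, 034.csv, 200.csv); monitor_path() is a hypothetical helper, not part of the assignment.

```r
# Build the path for each monitor ID, assuming zero-padded names such as "001.csv".
monitor_path <- function(directory, id) {
  file.path(directory, sprintf("%03d.csv", id))
}

paths <- monitor_path("specdata", c(1, 34, 200))
print(paths)
```

With this helper, read.csv(monitor_path(directory, i)) would read monitor i even if other files were present in the folder.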
Exercises
pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064
pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706
pollutantmean("specdata", "sulfate", 34)
[1] 1.477
pollutantmean("specdata", "nitrate")
[1] 1.703
complete()
complete <- function(directory, id = 1:332) {
  CSV_files <- list.files(directory, full.names = TRUE)
  datadf <- data.frame()
  for (i in id) {
    moni_i <- read.csv(CSV_files[i])
    nobs <- sum(complete.cases(moni_i)) # complete.cases() gives a logical vector marking complete rows; sum() counts the TRUE values
    tmpdf <- data.frame(i, nobs) # store the monitor ID and its count as a one-row data frame
    datadf <- rbind(datadf, tmpdf) # append it as a new row
  }
  colnames(datadf) <- c("id", "nobs") # name the columns
  datadf # return the result
}
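Growing a data frame with rbind() inside a loop is fine for 332 files, but the same result can be produced in one pass with vapply(). The sketch below is one possible variant, not the required solution; complete_vapply() is a hypothetical name, and the demonstration uses two made-up files in a temporary folder instead of the real data.

```r
complete_vapply <- function(directory, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)
  # One integer per requested monitor: the number of rows without any NA.
  nobs <- vapply(files[id], function(path) {
    sum(complete.cases(read.csv(path)))
  }, integer(1), USE.NAMES = FALSE)
  data.frame(id = id, nobs = nobs)
}

# Demonstration on two made-up monitor files in a temporary folder.
demo_dir <- tempfile("specdata_demo_")
dir.create(demo_dir)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(1.5, NA), nitrate = c(0.4, 0.7)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(2.0, 3.0), nitrate = c(0.5, 0.6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- complete_vapply(demo_dir, 2:1)   # rows follow the order of 'id'
print(result)
```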
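Returns a data frame.
Exercises
Count the complete observations for the specified monitors
cc <- complete("specdata", c(6, 10, 20, 34, 100, 200, 310)) # cc has two columns, "id" and "nobs"
print(cc$nobs) # the nobs vector
[1] 228 148 124 165 104 460 232
Count the complete observations for a single monitor
cc <- complete("specdata", 54) # cc has two columns, "id" and "nobs"
print(cc$nobs) # the nobs vector
[1] 219
Randomly sample 10 monitors and show their complete-observation counts
set.seed(42)
cc <- complete("specdata", 332:1) # cc has two columns, "id" and "nobs"; the rows are in reverse ID order, so row k holds monitor 333 - k
use <- sample(332, 10) # draw 10 random row indices from 1 to 332 into the use vector
print(cc[use, "nobs"]) # "nobs" for the rows listed in use
[1] 711 135 74 445 178 73 49 0 687 237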
corr()
corr <- function(directory, threshold = 0) { # threshold defaults to 0
  CSV_files <- list.files(directory, full.names = TRUE)
  dat <- vector(mode = "numeric", length = 0) # empty numeric vector to collect the results
  for (i in seq_along(CSV_files)) { # there is no id argument here, so loop over every file found
    moni_i <- read.csv(CSV_files[i])
    csum <- sum((!is.na(moni_i$sulfate)) & (!is.na(moni_i$nitrate))) # number of rows where both sulfate and nitrate are observed
    if (csum > threshold) { # only monitors above the threshold
      tmp <- moni_i[which(!is.na(moni_i$sulfate)), ] # keep rows where sulfate is not NA
      submoni_i <- tmp[which(!is.na(tmp$nitrate)), ] # then keep rows where nitrate is not NA
      dat <- c(dat, cor(submoni_i$sulfate, submoni_i$nitrate)) # append the correlation to the dat vector
    }
  }
  dat
}
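To see the threshold behaviour without the real data, here is a compact sketch run on two made-up files. corr_sketch() is a hypothetical variant that filters with complete.cases() on the two pollutant columns; like the function above, it counts a row as complete when both sulfate and nitrate are present.

```r
corr_sketch <- function(directory, threshold = 0) {
  out <- numeric(0)                       # returned as-is when no monitor qualifies
  for (path in list.files(directory, full.names = TRUE)) {
    monitor <- read.csv(path)
    ok <- complete.cases(monitor$sulfate, monitor$nitrate)
    if (sum(ok) > threshold) {
      out <- c(out, cor(monitor$sulfate[ok], monitor$nitrate[ok]))
    }
  }
  out
}

# Two made-up monitors: the first has 3 complete rows, the second only 1.
demo_dir <- tempfile("corr_demo_")
dir.create(demo_dir)
write.csv(data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 8)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(1, 2)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

some <- corr_sketch(demo_dir, threshold = 2)   # only the first monitor passes
none <- corr_sketch(demo_dir, threshold = 5)   # no monitor passes
print(some)
print(length(none))
```

When no monitor exceeds the threshold, the result is a numeric vector of length 0, as the assignment requires.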
Returns a numeric vector.
Exercises
Sort the correlations, randomly sample 5 of them, and round to four decimal places
cr <- corr("specdata")
cr <- sort(cr)
set.seed(868)
out <- round(cr[sample(length(cr), 5)], 4)
print(out)
[1] 0.2688 0.1127 -0.0085 0.4586 0.0447
Count the monitors with more than 129 complete observations, then sort their correlations, randomly sample 5 of them, and round to four decimal places
cr <- corr("specdata", 129)
cr <- sort(cr)
n <- length(cr)
set.seed(197)
out <- c(n, round(cr[sample(n, 5)], 4))
print(out)
[1] 243.0000 0.2540 0.0504 -0.1462 -0.1680 0.5969
Count the monitors with more than 2000 complete observations, then show the sorted correlations, rounded to four decimal places, for the monitors with more than 1000 complete observations
cr <- corr("specdata", 2000)
n <- length(cr)
cr <- corr("specdata", 1000)
cr <- sort(cr)
print(c(n, round(cr, 4)))
[1] 0.0000 -0.0190 0.0419 0.1901