JS Crawler

1. Packages used for crawling

(1) const request = require('superagent'); // lightweight HTTP request library for GET, POST, PUT, DELETE, and HEAD requests; can mimic a browser login

(2) const cheerio = require('cheerio'); // loads and parses the fetched HTML
(3) const fs = require('fs'); // file system module, needed when saving data to a file

fs.writeFile('saveFiles/zybl.txt', content, (error1) => { // arguments: file path, content to save, error callback
  if (error1) throw error1;
  // console.log(' text save ');
});

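(4) const fs = require('graceful-fs'); // drop-in replacement for fs, used here to save data to an .xlsx file

const writeStream = fs.createWriteStream('saveFiles/trader.xlsx'); // create the .xlsx file

writeStream.write(title); // write content into the file (written as-is; the stream does no Excel encoding)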

(5) const Promise = require('bluebird'); // promise library for asynchronous handling

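(6) const Nightmare = require('nightmare'); // a high-level browser automation library; install phantomjs first, then nightmare (this applies to Nightmare 1.x; 2.x and later run on Electron instead)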

(7) const co = require('co'); // generator-based control flow: lets a generator function yield promises
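
To see what co does with a generator, here is a minimal sketch of the idea, not co's actual source. runGenerator is a hypothetical helper that resumes the generator each time a yielded promise settles. It treats every yielded value as a promise, whereas co also accepts thunks, arrays, objects, and nested generators.

```javascript
'use strict';

// Hypothetical helper: a stripped-down stand-in for co().
// Every yielded value is treated as a promise.
function runGenerator(generatorFn) {
  return new Promise((resolve, reject) => {
    const gen = generatorFn();

    function step(method, input) {
      let result;
      try {
        // Resume the generator with either a value ('next') or an error ('throw')
        result = gen[method](input);
      } catch (err) {
        reject(err); // the generator threw and did not catch it
        return;
      }
      if (result.done) {
        resolve(result.value); // the generator returned
        return;
      }
      // Wait for the yielded promise, then feed its outcome back in
      Promise.resolve(result.value).then(
        (value) => step('next', value),
        (err) => step('throw', err)
      );
    }

    step('next', undefined);
  });
}

// Usage: the yields read like synchronous code but run asynchronously
const demo = runGenerator(function* () {
  const a = yield Promise.resolve(1);
  const b = yield new Promise((resolve) => setTimeout(() => resolve(2), 10));
  return a + b;
});

demo.then((sum) => console.log('sum =', sum)); // prints "sum = 3"
```

This is also why a callback-style call such as fs.writeFile(path, data, callback) cannot be yielded directly: it returns undefined rather than a promise, so there is nothing for the runner to wait on. Wrapping it in a promise first solves that.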

2. Crawler code

'use strict';

const co = require('co');
const fs = require('fs');
const Nightmare = require('nightmare'); // browser automation; can show a visible browser window

const url = 'http://sports.qq.com/isocce/';

const onError = function (err) {
  console.log(err);
};

const getHtml = function (pageUrl) {
  const pageScraper = new Nightmare(); // open a separate browser instance for this sub-page

  return co(function* run() {
    yield pageScraper.goto(pageUrl.url).wait();
    console.log('Loaded sub-page: ' + pageUrl.url);
    const content = yield pageScraper.evaluate(() => document.querySelector('body').innerHTML);
    yield pageScraper.end(); // close the browser so instances do not pile up

    // fs.writeFile is callback-based and returns undefined, which co cannot yield,
    // so wrap it in a promise that settles when the write finishes
    yield new Promise((resolve, reject) => {
      fs.writeFile('../../saveFiles/' + pageUrl.title + '.html', content, (err) => {
        if (err) return reject(err);
        console.log('Saved page content to ' + pageUrl.title + '.html');
        return resolve();
      });
    });
  }).catch(onError);
};

co(function* run() {
  const scraper = new Nightmare({
    show: true
  }); // open a visible browser window
  let counter = 0;
  let links = [];

  yield scraper
    .goto(url) // page to visit
    .wait();
  // Click '#feed-laliga > a' five times (presumably the "load more" link),
  // waiting 2 seconds before each click
  for (let i = 0; i < 5; i++) {
    yield scraper.wait(2000)
      .click('#feed-laliga > a');
  }

  // Collect the title and URL of every article link in the feed
  links = yield scraper
    .evaluate(() => {
      const temp = document.querySelectorAll('#feed-laliga h3 > a');
      const list = [];
      for (const each of temp) {
        list.push({
          title: each.innerText,
          url: each.href,
        });
      }
      return list;
    });

  console.log('Collected links:');
  console.dir(links);

  for (const link of links) {
    if (link !== null && link.url !== 'javascript:void(0)') {
      counter += 1;
      // Stagger the sub-page fetches so they do not all start at once
      setTimeout(() => {
        getHtml(link);
      }, counter * links.length * 250);
    }
  }
  yield scraper.end();
}).catch(onError);
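
The filtering and timing logic in the final loop can be checked without opening a browser. The sketch below pulls it into plain functions; filterLinks, delayFor, and buildSchedule are hypothetical names introduced for illustration, and the sample URLs are made up. The delay formula is the same counter * links.length * 250 used above.

```javascript
'use strict';

// Hypothetical helper: keep only links that point at a real page
function filterLinks(links) {
  return links.filter((link) =>
    link !== null && link.url !== 'javascript:void(0)');
}

// Hypothetical helper: delay in ms for the n-th kept link (1-based),
// using the same formula as the crawler loop
function delayFor(position, totalLinks, stepMs = 250) {
  return position * totalLinks * stepMs;
}

// Pair every kept link with the delay the crawler would assign to it.
// Note that the total count includes the filtered-out entries, as in the loop.
function buildSchedule(links) {
  return filterLinks(links).map((link, index) => ({
    title: link.title,
    url: link.url,
    delayMs: delayFor(index + 1, links.length),
  }));
}

// Made-up sample data
const sample = [
  { title: 'A', url: 'http://example.com/a' },
  { title: 'B', url: 'javascript:void(0)' },
  { title: 'C', url: 'http://example.com/c' },
];

const schedule = buildSchedule(sample);
console.log(schedule.map((item) => item.delayMs)); // [ 750, 1500 ]
```

One thing the numbers make visible: the gap between two consecutive fetches is links.length * 250 ms, so the longer the link list, the slower every fetch starts. A fixed gap per link may be easier to reason about if the list grows.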
