Twitter数据抓取的方法(三)
Scraping Tweets Directly from Twitters Search – Update
Published August 1, 2015
Sorry for my delayed response to this as I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address this!
As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something that Twitter has done with their recent site redesign. Having gone over the changes, there are two that effect this scraping script.
The first change is tiny. Originally, to get all tweets rather than “top tweet”, we used the type_param “f” to denote “realtime”. However, the value for this has changed to just “tweets”.
Second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we get a different parameter:
max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The highlighted parameter there, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike the scroll_cursor which existed in the response to be extracted, we have to create this one ourself.
As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.
“TWEET” will always stay the same, but the two sets of numbers are actually tweet ID’s, representing the oldest to most recently created tweets you’ve extracted.
For our newest tweet (2nd number set), we only need to extract this once as we can keep it the same for all calls, as Twitter does.
The oldest tweet (1st number set), we need to extract the last tweet id in our results each time to change our max_position value.
So, lets take a look at some of the code I’ve changed:
String minTweet = null; |
Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have one to begin with. On our first call though, we get the first tweet in our response, and set the ID to minTweet, if minTweet is still null.
Next, we need to get the maxTweet. As previously said before, we get this by getting the last tweet in our results, and returning that ID. So we don’t repeat results, we need to make sure that the minTweet does not equal the maxTweet ID, and if not, we construct our “max_position” query with the format “TWEET-{maxTweetId}-{minTweetId}”.
You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.
Also, in constructUrl, the TYPE_PARAM value has also been set to “tweets”.
Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters that are returned by the JSON file.
All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:
private boolean has_more_items; |
Twitter数据抓取的方法(三)的更多相关文章
- Twitter数据抓取的方法(一)
Scraping Tweets Directly from Twitters Search Page – Part 1 Published January 8, 2015 EDIT – Since I ...
- Twitter数据抓取的方法(二)
Scraping Tweets Directly from Twitters Search Page – Part 2 Published January 11, 2015 In the previo ...
- Twitter数据抓取
说明:这里分三个系列介绍Twitter数据的非API抓取方法.有兴趣的QQ群交流: BitCrawler网络爬虫QQ群 322937592 1.Twitter数据抓取(一) 2.Twitter数据抓取 ...
- Twitter数据非API采集方法
说明:这里分三个系列介绍Twitter数据的非API抓取方法. 在一个老外的博看上看到的,想详细了解的可以自己去看原文. 这种方法可以采集基于关键字在twitter上搜索的结果推文,已经实现自动翻页功 ...
- python爬虫数据抓取方法汇总
概要:利用python进行web数据抓取方法和实现. 1.python进行网页数据抓取有两种方式:一种是直接依据url链接来拼接使用get方法得到内容,一种是构建post请求改变对应参数来获得web返 ...
- 数据抓取的艺术(三):抓取Google数据之心得
本来是想把这部分内容放到前一篇<数据抓取的艺术(二):数据抓取程序优化>之中.但是随着任务的完成,我越来越感觉到其中深深的趣味,现总结如下: (1)时间 时间是一个与抓取规模相形而 ...
- 数据抓取的艺术(一):Selenium+Phantomjs数据抓取环境配置
数据抓取的艺术(一):Selenium+Phantomjs数据抓取环境配置 2013-05-15 15:08:14 分类: Python/Ruby 数据抓取是一门艺术,和其他软件不同,世界上 ...
- 【Python入门只需20分钟】从安装到数据抓取、存储原来这么简单
基于大众对Python的大肆吹捧和赞赏,作为一名Java从业人员,我本着批判与好奇的心态买了本python方面的书<毫无障碍学Python>.仅仅看了书前面一小部分的我......决定做一 ...
- Python爬虫工程师必学——App数据抓取实战 ✌✌
Python爬虫工程师必学——App数据抓取实战 (一个人学习或许会很枯燥,但是寻找更多志同道合的朋友一起,学习将会变得更加有意义✌✌) 爬虫分为几大方向,WEB网页数据抓取.APP数据抓取.软件系统 ...
随机推荐
- Android中Handler使用浅析
1. Handler使用引出 现在作为客户,有这样一个需求,当打开Activity界面时,开始倒计时,倒计时结束后跳转新的界面(思维活跃的朋友可能立马想到如果打开后自动倒计时,就类似于各个APP的欢迎 ...
- ES6新属性笔记
一.destructuring--解构赋值 1.数组解构赋值 (1)完全解构 let [a,b,c] = [1,2,3];//普通 console.log(a+":"+b+&quo ...
- 问题 : lang.NoClassDefFoundError: org/springframework/core/annotation/AnnotatedElementUtils,的解决方法
今天在做junit 测试的时候 出现了一个问题,花了一段时间 才解决. java.lang.NoClassDefFoundError: org/springframework/core/annota ...
- SQL SERVER - 谈死锁的监控分析解决思路
1 背景 1.1 报警情况 最近整理笔记,打算全部迁移到EVERNOTE.整理到锁这一部分,里边刚好有个自己记录下来的案例,重新整理分享下给大家. 某日中午,收到报警短信,DB死锁异常,单分钟死锁12 ...
- 寻找与疾病相关的SNP位点——R语言从SNPedia批量提取搜索数据
是单核苷酸多态性,人的基因是相似的,有些位点上存在差异,这种某个位点的核苷酸差异就做单核苷酸多态性,它影响着生物的性状,影响着对某些疾病的易感性.SNPedia是一个SNP调査百科,它引用各种已经发布 ...
- JMX简单入门
在一个系统中常常会有一些配置信息,比如服务的IP地址,端口号什么的,那么如何来来处理这些可配置项呢? 程序新手一般是写死在程序里,到要改变时就去改程序,然后再编译发布: 程序熟手则一般把这些信息写在一 ...
- C++ 11 学习3:显示虚函数重载(override)
5.显示虚函数重载 在 C++ 里,在子类中容易意外的重载虚函数.举例来说: struct Base { virtual void some_func(); }; struct Derived : B ...
- spring exception
Spring MVC异常处理SimpleMappingExceptionResolver[转] (2012-12-07 13:45:33) 转载▼ 标签: 杂谈 分类: 技术分享 Spring3.0中 ...
- 【2-26】string/math/datetime类的定义及其应用
一string类 (1)字符串.Length Length作用于求字符串的长度,返回一个int值 (2)字符串.TrimStart(); TrimStart():可删除前空格,返回一个stri ...
- wemall app商城源码机器人检测
wemall-mobile是基于WeMall的Android app商城,只需要在原商城目录下上传接口文件即可完成服务端的配置,客户端可定制修改.本文分享wemall app商城源码Android之 ...