Scraping Tweets Directly from Twitters Search – Update

Published August 1, 2015

Sorry for my delayed response to this as I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address this!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something that Twitter has done with their recent site redesign. Having gone over the changes, there are two that effect this scraping script.

The first change is tiny. Originally, to get all tweets rather than “top tweet”, we used the type_param “f” to denote “realtime”. However, the value for this has changed to just “tweets”.

Second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we get a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The highlighted parameter there, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike the scroll_cursor which existed in the response to be extracted, we have to create this one ourself.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet ID’s, representing the oldest to most recently created tweets you’ve extracted.

For our newest tweet (2nd number set), we only need to extract this once as we can keep it the same for all calls, as Twitter does.

The oldest tweet (1st number set), we need to extract the last tweet id in our results each time to change our max_position value.

So, lets take a look at some of the code I’ve changed:

String minTweet = null;
while((response = executeSearch(url))!=null && continueSearch && !response.getTweets().isEmpty()) {
if(minTweet==null) {
minTweet = response.getTweets().get(0).getId();
}
continueSearch = saveTweets(response.getTweets());
String maxTweet = response.getTweets().get(response.getTweets().size()-1).getId();
if(!minTweet.equals(maxTweet)) {
try {
Thread.sleep(rateDelay);
} catch (InterruptedException e) {
e.printStackTrace();
}
String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
url = constructURL(query, maxPosition);
}
}
 
...
 
public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
if(query==null || query.isEmpty()) {
throw new InvalidQueryException(query);
}
try {
URIBuilder uriBuilder;
uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
uriBuilder.addParameter(QUERY_PARAM, query);
uriBuilder.addParameter(TYPE_PARAM, "tweets");
if (maxPosition != null) {
uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
}
return uriBuilder.build().toURL();
} catch(MalformedURLException | URISyntaxException e) {
e.printStackTrace();
throw new InvalidQueryException(query);
}
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have one to begin with. On our first call though, we get the first tweet in our response, and set the ID to minTweet, if minTweet is still null.

Next, we need to get the maxTweet. As previously said before, we get this by getting the last tweet in our results, and returning that ID. So we don’t repeat results, we need to make sure that the minTweet does not equal the maxTweet ID, and if not, we construct our “max_position” query with the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

Also, in constructUrl, the TYPE_PARAM value has also been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters that are returned by the JSON file.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;

Twitter数据抓取的方法(三)的更多相关文章

  1. Twitter数据抓取的方法(一)

    Scraping Tweets Directly from Twitters Search Page – Part 1 Published January 8, 2015 EDIT – Since I ...

  2. Twitter数据抓取的方法(二)

    Scraping Tweets Directly from Twitters Search Page – Part 2 Published January 11, 2015 In the previo ...

  3. Twitter数据抓取

    说明:这里分三个系列介绍Twitter数据的非API抓取方法.有兴趣的QQ群交流: BitCrawler网络爬虫QQ群 322937592 1.Twitter数据抓取(一) 2.Twitter数据抓取 ...

  4. Twitter数据非API采集方法

    说明:这里分三个系列介绍Twitter数据的非API抓取方法. 在一个老外的博看上看到的,想详细了解的可以自己去看原文. 这种方法可以采集基于关键字在twitter上搜索的结果推文,已经实现自动翻页功 ...

  5. python爬虫数据抓取方法汇总

    概要:利用python进行web数据抓取方法和实现. 1.python进行网页数据抓取有两种方式:一种是直接依据url链接来拼接使用get方法得到内容,一种是构建post请求改变对应参数来获得web返 ...

  6. 数据抓取的艺术(三):抓取Google数据之心得

    本来是想把这部分内容放到前一篇<数据抓取的艺术(二):数据抓取程序优化>之中.但是随着任务的完成,我越来越感觉到其中深深的趣味,现总结如下: (1)时间     时间是一个与抓取规模相形而 ...

  7. 数据抓取的艺术(一):Selenium+Phantomjs数据抓取环境配置

     数据抓取的艺术(一):Selenium+Phantomjs数据抓取环境配置 2013-05-15 15:08:14 分类: Python/Ruby     数据抓取是一门艺术,和其他软件不同,世界上 ...

  8. 【Python入门只需20分钟】从安装到数据抓取、存储原来这么简单

    基于大众对Python的大肆吹捧和赞赏,作为一名Java从业人员,我本着批判与好奇的心态买了本python方面的书<毫无障碍学Python>.仅仅看了书前面一小部分的我......决定做一 ...

  9. Python爬虫工程师必学——App数据抓取实战 ✌✌

    Python爬虫工程师必学——App数据抓取实战 (一个人学习或许会很枯燥,但是寻找更多志同道合的朋友一起,学习将会变得更加有意义✌✌) 爬虫分为几大方向,WEB网页数据抓取.APP数据抓取.软件系统 ...

随机推荐

  1. 每天一个linux命令(35)--free命令

    free命令可以显示Linux系统中空闲的.易用的物理内存及swap内存,及被内核使用的buffer.在Linux系统监控的工具中,free 命令是最经常使用的命令之一. 1.命令格式: free [ ...

  2. 2017-3-2 C#基础 集合

    要使用集合必须先引用命名空间,using System.Collections; 集合与数组的不同: 数组:同一类型,固定长度集合:不同类型,不固定长度 集合主要分为六大类:普通集合,泛型集合,哈希表 ...

  3. 用ASP.NET创建网站

    ASP.NET提供三种框架来创建web应用:WebForms,ASP.NET MVC和ASP.NET WebPages.这三种框架都是稳定成熟的,你可以用任何一种方式开发一个很棒的web应用.不管你选 ...

  4. Linux服务器下Java环境搭建

    前言: 在centOS下,像阿里云等都预先设置了jdk,不过不是SUN的java JDK,一般情况要重新装jdk,而且一般情况下自己装的Jdk相对来说易控制版本,稳定性更高.所以以下是我卸载预装jdk ...

  5. JS获取网站StatusCode,若存在写入文件

    JS获取网站状态码,若网站存在,写入TXT文件,适用于IE. <script> //写文件      function writeFile(filename,filecontent){   ...

  6. 【原】cookie小结

    前记:前段时间搞一个活动,开发的时间被严重压缩,忙到飞起,以致于都没怎么写文章了,内疚. 2月份参加了一场面试,有一些关于cookie的问题回答的不是很好,所以这篇文章我们来对cooKie做一个探讨和 ...

  7. 唐伯猫的 sql server 2008 的安装和操作记录

    在服务器win 2008 server r2 上安装sql 首先下载sql  server 2008  ,云盘有存储sql,很多论坛也有下载SQLEXPRADV_x64_CHS.exe 双击sql s ...

  8. 关于使用lazytag的线段树两种查询方式的比较研究

    说到线段树,想来大家并不陌生——最基本的思路就是将其规划成块,然后只要每次修改时维护一下即可. 但是尤其是涉及到区间修改时,lazytag的使用往往能够对于程序的质量起到决定性作用(Ex:一般JSOI ...

  9. 同步 VS 异步

    同步请求资源 请求msdn上的一个页面计算页面大小 static void Main(string[] args) { string url = "https://docs.microsof ...

  10. Xamarin自定义布局系列——PivotPage,多页面切换控件

    PivotPage ---- 多页面切换控件 PivotPage是一个多页面切换控件,类似安卓中的ViewPager和UWP中的Pivot枢轴控件. 起初打算直接通过ScrollView+StackL ...