Twitter数据抓取的方法(一)
Scraping Tweets Directly from Twitters Search Page – Part 1
EDIT – Since I wrote this post, Twitter has updated how you get the next list of tweets for your result. Rather than using scroll_cursor, it uses max_position. I’ve written a bit more in detail here.
In fairly recent news,
Twitter has started indexing it’s entire history of Tweets going all
the way back to 2006. Hurrah for data scientists! However, even with
this news (at time of writing), their search API is still restricted to
the past seven days of Tweets. While I doubt this will be the case
permanently, as a useful exercise this post presents how we can search
for Tweets from Twitter without necessarily using their API. Besides the
indexing, there is also the advantage that Twitter is a little more
liberal with rate limits, and you don’t require any authentication keys.
The post will be split up into two parts, this first part looking at
what we can extract from Twitter and how we might start to go about it,
and the second a tutorial on how we can implement this in Java.
Right, to begin, lets say we want to search Twitter for all tweets
related to the query “Babylon 5”. You can access Twitters advanced
search without being logged in: https://twitter.com/search-advanced
If we take a look at the URL that’s constructed when we perform the search we get:
https://twitter.com/search?q=Babylon%205&src=typd
As we can see, there are two query parameters, q (our query encoded) and src (assumed to be the source of the query, i.e. typed). However, by default, Twitter returns top results, rather than all, so on the displayed page, if you click on All the URL changes to:
https://twitter.com/search?f=realtime&q=Babylon%205&src=typd
The difference here is the f=realtime parameter that appears to specify we receive Tweets in realtime as opposed to a subset of top Tweets. Useful to know, but currently we’re only getting the first 25 Tweets back. If we scroll down though, we notice that more Tweets are loaded on the page via AJAX. Logging all XMLHttpRequests in whatever dev tool you choose to use, we can see that everytime we reach the bottom of the page, Twitter makes an AJAX call a URL similar to:
https://twitter.com/i/search/timeline?f=realtime&q=Babylon%205&src=typd&include_available_features=1&include_entities=1&last_note_ts=85&scroll_cursor=TWEET-553069642609344512-553159310448918528-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
On further inspection, we see that it is also a JSON response, which is very useful! Before we look at the response though, let’s have a look at that URL and some of it’s parameters.
First off, it’s slightly different to the default search URL. The path is /i/search/timeline as opposed to /search. Secondly, while we notice our familiar parameters q, f, and src, from before, there are several additional ones. The most important and essential new one though is scroll_cursor. This is what Twitter uses to paginate the results. If you remove scroll_cursor from that URL, you end up with your first page of results again.
Now lets take a look now at the JSON response that Twitter provides:
{
has_more_items: boolean,
items_html: "...",
is_scrolling_request: boolean,
is_refresh_request: boolean,
scroll_cursor: "...",
refresh_cursor: "...",
focused_refresh_interval: int
}
Again, not all parameters for this post are important to take note of, but the ones that are include: has_more_items, items_html, and scroll_cursor.
has_more_items – This lets you know with a boolean value whether or not there are any more results after this query.
items_html – This is where all the tweets are which Twitter uses to append to the bottom of their timeline. It requires parsing, but there is a good amount of information in there to be extracted which we will look at in a minute.
scroll_cursor – A pagination value that allows us to extract the next page of results.
Remember our scroll_cursor parameter from earlier on? Well for each search request you make to twitter, the value of this key in the response provides you with the next set of tweets, allowing you to recursively call Twitter until either has_more_items is false, or your previous scroll_cursor equals the last scroll_cursor you had.
Now that we know how to access Twitters own search functionality, lets turn our attention to the tweets themselves. As mentioned before, items_html in the response is where all the tweets are at. However, it comes in a block of HTML as Twitter injects that block at the bottom of the page each time that call is made. The HTML inside is a list of li elements, each element a Tweet. I won’t post the HTML for one here, as even one tweet has a lot of HTML in it, but if you want to look at it, copy the items_html value (omiting the quotes around the HTML content) and paste it into something like JSBeautifier to see the formatted results for yourself.
If we look over the HTML, aside from the tweets text, there is actually a lot of useful information encapsulated in this data packet. The most important item is the Tweet id itself. If you check, it’s actually in the root li element. Now, we could stop here as with that ID, you can query Twitters official API, and if it’s a public Tweet, you can get all kinds of information. However, that’d defeat the purpose of not using the API, so lets see what we can extract from what we already have.
The table below shows various CSS selector queries that you can use to extract the information with.
| Selector | Value |
|---|---|
| div.original-tweet[data-tweet-id] | The authors twitter handle |
| div.original-tweet[data-name] | The name of the author |
| div.original-tweet[data-user-id] | The user ID of the author |
| span._timestamp[data-time] | Timestamp of the post |
| span._timestamp[data-time-ms] | Timestamp of the post in ms |
| p.tweet-text | Text of Tweet |
| span.ProfileTweet-action–retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Retweets |
| span.ProfileTweet-action–favorite > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Favourites |
That’s quite a sizeable amount of information in that HTML. From looking through, we can extract a bunch of stuff about the author, the time stamp of the tweet, the text, and number of retweets and favourites.
What have we learned here? Well, to summarize, we know how to construct a Twitter URL query, the response we get from said query, and the information we can extract from said response. The second part of this tutorial (to follow shortly) will introduce some code as to how we can implement the above.
Twitter数据抓取的方法(一)的更多相关文章
- Twitter数据抓取的方法(二)
Scraping Tweets Directly from Twitters Search Page – Part 2 Published January 11, 2015 In the previo ...
- Twitter数据抓取的方法(三)
Scraping Tweets Directly from Twitters Search – Update Published August 1, 2015 Sorry for my delayed ...
- Twitter数据抓取
说明:这里分三个系列介绍Twitter数据的非API抓取方法.有兴趣的QQ群交流: BitCrawler网络爬虫QQ群 322937592 1.Twitter数据抓取(一) 2.Twitter数据抓取 ...
- Twitter数据非API采集方法
说明:这里分三个系列介绍Twitter数据的非API抓取方法. 在一个老外的博看上看到的,想详细了解的可以自己去看原文. 这种方法可以采集基于关键字在twitter上搜索的结果推文,已经实现自动翻页功 ...
- python爬虫数据抓取方法汇总
概要:利用python进行web数据抓取方法和实现. 1.python进行网页数据抓取有两种方式:一种是直接依据url链接来拼接使用get方法得到内容,一种是构建post请求改变对应参数来获得web返 ...
- Phantomjs+Nodejs+Mysql数据抓取(2.抓取图片)
概要 这篇博客是在上一篇博客Phantomjs+Nodejs+Mysql数据抓取(1.抓取数据) http://blog.csdn.net/jokerkon/article/details/50868 ...
- Java实现多种方式的http数据抓取
前言: 时下互联网第一波的浪潮已消逝,随着而来的基于万千数据的物联网时代,因而数据成为企业的重要战略资源之一.基于数据抓取技术,本文介绍了java相关抓取工具,并附上demo源码供感兴趣的朋友测试! ...
- 数据抓取的艺术(一):Selenium+Phantomjs数据抓取环境配置
数据抓取的艺术(一):Selenium+Phantomjs数据抓取环境配置 2013-05-15 15:08:14 分类: Python/Ruby 数据抓取是一门艺术,和其他软件不同,世界上 ...
- [nodejs,expressjs,angularjs2] LOL英雄列表数据抓取及查询显示应用
新手练习,尝试使用angularjs2 [angularjs2 数据绑定,监听数据变化自动修改相应dom值,非常方便好用,但与传统js(jquery)的使用方法会很不同,Dom操作也不太习惯] 应用效 ...
随机推荐
- FFmpeg入门,简单播放器
一个偶然的机缘,好像要做直播相关的项目 为了筹备,前期做一些只是储备,于是开始学习ffmpeg 这是学习的第一课 做一个简单的播放器,播放视频画面帧 思路是,将视频文件解码,得到帧,然后使用定时器,1 ...
- 关于ImageLoader的一些东西
网络图片异步加载 其实有关图片加载存在这样一个问题,图片的下载始终是一个耗时的操作,这个时候如果把图片加载放在主线程中话的是不明智的,模拟一个这样的场景, 假如在一个listview或Recycler ...
- Android业务组件化之Gradle和Sonatype Nexus搭建私有maven仓库
前言: 公司的业务组件化推进的已经差不多三四个月的时间了,各个业务组件之间的解耦工作已经基本完成,各个业务组件以module的形式存在项目中,然后项目依赖本地的module,多少有点不太利于项目的并行 ...
- Lambda表达式和Java集合框架
本文github地址 前言 我们先从最熟悉的Java集合框架(Java Collections Framework, JCF)开始说起. 为引入Lambda表达式,Java8新增了java.util. ...
- Java日常总结之LinkedList、ArrayList的效率分析
前言: 在我们平常开发中难免会用到List集合来存储数据,一般都会选择ArrayList和LinkedList,以前只是大致知道ArrayList查询效率高LinkedList插入删除效率高,今天来实 ...
- 在Windows平台搭建轻巧的Python开发环境——面向工程和科研的扩展包配置
首先,下载最新版本的Python. 为什么强调最新版本呢,因为新版本的漏洞通常会少得多,而且反映了未来的趋势. 既然要学,何不起点高一点? 官方下载地址:https://www.python.org/ ...
- Java进程通讯
管道(Pipe):管道可用于具有亲缘关系进程间的通信,允许一个进程和另一个与它有共同祖先的进程之间进行通信. 创建子进程Java有两种方式 //第一种 Runtime rt = Runtime.get ...
- 笔记本win10关机异常解决
自从使用了win10 以后,小编已经情不自禁的爱上了她——迄今为止最NB的windows系统 但令人头疼的问题也随之而来,前几天购置了一款三星的SSD固态硬盘,马上就装了win10,某天晚上关机以后发 ...
- HttpURLConnection实现两个服务端的对接
在企业开发中,很多时候需要用到两个服务端的对接,在java类中进行连接并传递参数,其中的HttpURLConnection是一种轻量化,并且简单的方法! package httptest; impor ...
- 读书笔记 effective c++ Item 37 永远不要重新定义继承而来的函数默认参数值
从一开始就让我们简化这次的讨论.你有两类你能够继承的函数:虚函数和非虚函数.然而,重新定义一个非虚函数总是错误的(Item 36),所以我们可以安全的把这个条款的讨论限定在继承带默认参数值的虚函数上. ...