Twitter数据抓取的方法(一)

Scraping Tweets Directly from Twitters Search Page – Part 1

Published January 8, 2015

EDIT – Since I wrote this post, Twitter has updated how you get the next list of tweets for your result. Rather than using scroll_cursor, it uses max_position. I’ve written a bit more in detail here.

In fairly recent news,
Twitter has started indexing it’s entire history of Tweets going all
the way back to 2006. Hurrah for data scientists! However, even with
this news (at time of writing), their search API is still restricted to
the past seven days of Tweets. While I doubt this will be the case
permanently, as a useful exercise this post presents how we can search
for Tweets from Twitter without necessarily using their API. Besides the
indexing, there is also the advantage that Twitter is a little more
liberal with rate limits, and you don’t require any authentication keys.

The post will be split up into two parts, this first part looking at
what we can extract from Twitter and how we might start to go about it,
and the second a tutorial on how we can implement this in Java.

Right, to begin, lets say we want to search Twitter for all tweets
related to the query “Babylon 5”. You can access Twitters advanced
search without being logged in: https://twitter.com/search-advanced

If we take a look at the URL that’s constructed when we perform the search we get:

https://twitter.com/search?q=Babylon%205&src=typd

As we can see, there are two query parameters, q (our query encoded) and src (assumed to be the source of the query, i.e. typed). However, by default, Twitter returns top results, rather than all, so on the displayed page, if you click on All the URL changes to:

https://twitter.com/search?f=realtime&q=Babylon%205&src=typd

The difference here is the f=realtime parameter that appears to specify we receive Tweets in realtime as opposed to a subset of top Tweets. Useful to know, but currently we’re only getting the first 25 Tweets back. If we scroll down though, we notice that more Tweets are loaded on the page via AJAX. Logging all XMLHttpRequests in whatever dev tool you choose to use, we can see that everytime we reach the bottom of the page, Twitter makes an AJAX call a URL similar to:

https://twitter.com/i/search/timeline?f=realtime&q=Babylon%205&src=typd&include_available_features=1&include_entities=1&last_note_ts=85&scroll_cursor=TWEET-553069642609344512-553159310448918528-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

On further inspection, we see that it is also a JSON response, which is very useful! Before we look at the response though, let’s have a look at that URL and some of it’s parameters.

First off, it’s slightly different to the default search URL. The path is /i/search/timeline as opposed to /search. Secondly, while we notice our familiar parameters q, f, and src, from before, there are several additional ones. The most important and essential new one though is scroll_cursor. This is what Twitter uses to paginate the results. If you remove scroll_cursor from that URL, you end up with your first page of results again.

Now lets take a look now at the JSON response that Twitter provides:

{

has_more_items: boolean,

items_html: "...",

is_scrolling_request: boolean,

is_refresh_request: boolean,

scroll_cursor: "...",

refresh_cursor: "...",

focused_refresh_interval: int

}

Again, not all parameters for this post are important to take note of, but the ones that are include: has_more_items, items_html, and scroll_cursor.

has_more_items – This lets you know with a boolean value whether or not there are any more results after this query.

items_html – This is where all the tweets are which Twitter uses to append to the bottom of their timeline. It requires parsing, but there is a good amount of information in there to be extracted which we will look at in a minute.

scroll_cursor – A pagination value that allows us to extract the next page of results.

Remember our scroll_cursor parameter from earlier on? Well for each search request you make to twitter, the value of this key in the response provides you with the next set of tweets, allowing you to recursively call Twitter until either has_more_items is false, or your previous scroll_cursor equals the last scroll_cursor you had.

Now that we know how to access Twitters own search functionality, lets turn our attention to the tweets themselves. As mentioned before, items_html in the response is where all the tweets are at. However, it comes in a block of HTML as Twitter injects that block at the bottom of the page each time that call is made. The HTML inside is a list of li elements, each element a Tweet. I won’t post the HTML for one here, as even one tweet has a lot of HTML in it, but if you want to look at it, copy the items_html value (omiting the quotes around the HTML content) and paste it into something like JSBeautifier to see the formatted results for yourself.

If we look over the HTML, aside from the tweets text, there is actually a lot of useful information encapsulated in this data packet. The most important item is the Tweet id itself. If you check, it’s actually in the root li element. Now, we could stop here as with that ID, you can query Twitters official API, and if it’s a public Tweet, you can get all kinds of information. However, that’d defeat the purpose of not using the API, so lets see what we can extract from what we already have.

The table below shows various CSS selector queries that you can use to extract the information with.

Embedded Tweet Data
Selector	Value
div.original-tweet[data-tweet-id]	The authors twitter handle
div.original-tweet[data-name]	The name of the author
div.original-tweet[data-user-id]	The user ID of the author
span._timestamp[data-time]	Timestamp of the post
span._timestamp[data-time-ms]	Timestamp of the post in ms
p.tweet-text	Text of Tweet
span.ProfileTweet-action–retweet > span.ProfileTweet-actionCount[data-tweet-stat-count]	Number of Retweets
span.ProfileTweet-action–favorite > span.ProfileTweet-actionCount[data-tweet-stat-count]	Number of Favourites

That’s quite a sizeable amount of information in that HTML. From looking through, we can extract a bunch of stuff about the author, the time stamp of the tweet, the text, and number of retweets and favourites.

What have we learned here? Well, to summarize, we know how to construct a Twitter URL query, the response we get from said query, and the information we can extract from said response. The second part of this tutorial (to follow shortly) will introduce some code as to how we can implement the above.

Twitter数据抓取的方法(一)的更多相关文章

Twitter数据抓取的方法(二)
Scraping Tweets Directly from Twitters Search Page – Part 2 Published January 11, 2015 In the previo ...
Twitter数据抓取的方法(三)
Scraping Tweets Directly from Twitters Search – Update Published August 1, 2015 Sorry for my delayed ...
Twitter数据抓取
说明:这里分三个系列介绍Twitter数据的非API抓取方法.有兴趣的QQ群交流: BitCrawler网络爬虫QQ群 322937592 1.Twitter数据抓取(一) 2.Twitter数据抓取 ...
Twitter数据非API采集方法
说明:这里分三个系列介绍Twitter数据的非API抓取方法. 在一个老外的博看上看到的,想详细了解的可以自己去看原文. 这种方法可以采集基于关键字在twitter上搜索的结果推文,已经实现自动翻页功 ...
python爬虫数据抓取方法汇总
概要:利用python进行web数据抓取方法和实现. 1.python进行网页数据抓取有两种方式:一种是直接依据url链接来拼接使用get方法得到内容,一种是构建post请求改变对应参数来获得web返 ...
Phantomjs+Nodejs+Mysql数据抓取（2.抓取图片）
概要这篇博客是在上一篇博客Phantomjs+Nodejs+Mysql数据抓取(1.抓取数据) http://blog.csdn.net/jokerkon/article/details/50868 ...
Java实现多种方式的http数据抓取
前言: 时下互联网第一波的浪潮已消逝,随着而来的基于万千数据的物联网时代,因而数据成为企业的重要战略资源之一.基于数据抓取技术,本文介绍了java相关抓取工具,并附上demo源码供感兴趣的朋友测试! ...
数据抓取的艺术（一）：Selenium+Phantomjs数据抓取环境配置
数据抓取的艺术(一):Selenium+Phantomjs数据抓取环境配置 2013-05-15 15:08:14 分类: Python/Ruby 数据抓取是一门艺术,和其他软件不同,世界上 ...
[nodejs,expressjs,angularjs2] LOL英雄列表数据抓取及查询显示应用
新手练习,尝试使用angularjs2 [angularjs2 数据绑定,监听数据变化自动修改相应dom值,非常方便好用,但与传统js(jquery)的使用方法会很不同,Dom操作也不太习惯] 应用效 ...

随机推荐

HTML复习
JavaScript基础——变量、语句、注释
一.变量的命名规则 1.变量名由数字.字母.下划线组成 2.变量名的首字母不能是数字,只能是字母或者下划线 3.不能使用关键字和保留字作为变量名 4.变量严格区分大小写,例如在JavaScript中o ...
MyBatis从入门到放弃一：从SqlSession实现增删改查
前言开博客这是第一次写系列文章,从内心上讲是有点担心自己写不好,写不全,毕竟是作为java/mybatis学习的过程想把学习的路线和遇到的问题都总结下来,也让知识点在脑海里能形成一个体系. 开发环境 ...
PCB行业版特色功能展示
普实PCB行业版,专为PCB行业需求而定制.秉承一体化.集团化.移动化为设计理念,采用互联网技术.云计算技术.移动应用技术开发的新一代系统帮助PCB企业创新管理模式.引领商业变革!系统从接到订单开始, ...
Java工程师：四个月小白变大咖，你能做到吗？
你眼中的Java工程师是什么样子? 技术大牛?闷骚男?IT民工?没有女朋友?全是汉子?很邋遢?贼眉鼠眼? 今天,中软国际卓越工程师,Java精英班正式开课啦.你想看看他们都是一群怎样的人吗? 今天的武 ...
Selenium 基本操作--元素定位
对页面元素进行操作 1. 输入框输入 driver.findElement(By.id("id号")).sendKeys(“输入框输入内容”): 例:
8个超炫酷的jQuery相册插件欣赏
在网页中,相册应用十分常见,如果你经常逛一些社交网站,那么你应该会注意到很多各式各样的网页相册应用.今天我们要来分享一些最新收集的jQuery相册插件,这些精美的jQuery相册插件可以帮助你快速搭建 ...
UI 设计模式手势识别器
1> target / action 设计模式 : target ['tɑːgɪt] 1>什么是耦合 : 耦合是衡量一个程序呢写的好坏的标准之一耦合是衡量模块与模块之间关 ...
学生选课数据库SQL语句45道练习题整理及mysql常用函数（20161019）
学生选课数据库SQL语句45道练习题: 一. 设有一数据库,包括四个表:学生表(Student).课程表(Course).成绩表(Score)以及教师信息表(Teacher).四 ...
ES4：ElasticSearch 使用C#添加和更新文档
这是ElasticSearch 2.4 版本系列的第四篇: 第一篇:ES1:Windows下安装ElasticSearch 第二篇:ES2:ElasticSearch 集群配置第三篇:ES3:Ela ...

Twitter数据抓取的方法(一)

Twitter数据抓取的方法(一)的更多相关文章

随机推荐

热门专题