目录

Twitter is a popular social network where users can share short SMS-like messages called tweets. Users share thoughts, links and pictures on Twitter, journalists comment on live events, companies promote products and engage with customers. The list of different ways to use Twitter could be really long, and with 500 millions of tweets per day, there’s a lot of data to analyse and to play with.

This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we’ll see different options to collect data from Twitter. Once we have built a data set, in the next episodes we’ll discuss some interesting data applications.

1. 收集数据 Collecting data

1.1 注册应用 Register Your App

In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is the registration of your app. In particular, you need to point your browser to http://apps.twitter.com, log-in to Twitter (if you’re not already logged in) and register a new application. You can now choose a name and a description for your app (for example “Mining Demo” or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also require an access token and an access token secret. Similarly to the consumer keys, these strings must also be kept private: they provide the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change your permission to provide writing features in your app, you must negotiate a new access token.

Important Note: there are rate limits in the use of the Twitter API, as well as limitations in case you want to provide a downloadable data-set, see:

https://dev.twitter.com/overview/terms/agreement-and-policy
https://dev.twitter.com/rest/public/rate-limiting

1.2 访问数据 Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There is also a bunch of Python-based clients out there that we can use without re-inventing the wheel. In particular, Tweepy in one of the most interesting and straightforward to use, so let’s install it:


pip install tweepy==3.5.0

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:


import tweepy
from tweepy import OAuthHandler consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET' auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret) api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.

For example, we can read our own timeline (i.e. our Twitter homepage) with:

for status in tweepy.Cursor(api.home_timeline).items(10):
# Process a single status
print(status.text)

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.

  • So the code above can be re-written to process/store the JSON:
for status in tweepy.Cursor(api.home_timeline).items(10):
# Process a single status
process_or_store(status._json)
  • What if we want to have a list of all our followers? There you go:
for friend in tweepy.Cursor(api.friends).items():
process_or_store(friend._json)
  • And how about a list of all our tweets? Simple:
for tweet in tweepy.Cursor(api.user_timeline).items():
process_or_store(tweet._json)

In this way we can easily collect tweets (and more) and store them in the original JSON format, fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).

The function process_or_store() is a place-holder for your custom implementation. In the simplest form, you could just print out the JSON, one tweet per line:

def process_or_store(tweet):
print(json.dumps(tweet))

1.3 使用数据流 Streaming

In case we want to “keep the connection open”, and gather all the upcoming tweets about a particular event, the streaming API is what we need. We need to extend the StreamListener() to customise the way we process the incoming data. A working example that gathers all the new tweets with the #python hashtag:


from tweepy import Stream
from tweepy.streaming import StreamListener class MyListener(StreamListener): def on_data(self, data):
try:
with open('python.json', 'a') as f:
f.write(data)
return True
except BaseException as e:
print("Error on_data: %s" % str(e))
return True def on_error(self, status):
print(status)
return True twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])

Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with a world-wide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to know how many tweets you’ve gathered.

You can see a minimal working example of the Twitter Stream API in the following Gist:


##config.py
consumer_key = 'your-consumer-key'
consumer_secret = 'your-consumer-secret'
access_token = 'your-access-token'
access_secret = 'your-access-secret'
##twitter_stream_download.py
# To run this code, first edit config.py with your configuration, then:
#
# mkdir data
# python twitter_stream_download.py -q apple -d data
#
# It will produce the list of tweets for the query "apple"
# in the file data/stream_apple.json import tweepy
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import argparse
import string
import config
import json def get_parser():
"""Get parser for command line arguments."""
parser = argparse.ArgumentParser(description="Twitter Downloader")
parser.add_argument("-q",
"--query",
dest="query",
help="Query/Filter",
default='-')
parser.add_argument("-d",
"--data-dir",
dest="data_dir",
help="Output/Data Directory")
return parser class MyListener(StreamListener):
"""Custom StreamListener for streaming data.""" def __init__(self, data_dir, query):
query_fname = format_filename(query)
self.outfile = "%s/stream_%s.json" % (data_dir, query_fname) def on_data(self, data):
try:
with open(self.outfile, 'a') as f:
f.write(data)
print(data)
return True
except BaseException as e:
print("Error on_data: %s" % str(e))
time.sleep(5)
return True def on_error(self, status):
print(status)
return True def format_filename(fname):
"""Convert file name into a safe string.
Arguments:
fname -- the file name to convert
Return:
String -- converted file name
"""
return ''.join(convert_valid(one_char) for one_char in fname) def convert_valid(one_char):
"""Convert a character into '_' if invalid.
Arguments:
one_char -- the char to convert
Return:
Character -- converted char
"""
valid_chars = "-_.%s%s" % (string.ascii_letters, string.digits)
if one_char in valid_chars:
return one_char
else:
return '_' @classmethod
def parse(cls, api, raw):
status = cls.first_parse(api, raw)
setattr(status, 'json', json.dumps(raw))
return status if __name__ == '__main__':
parser = get_parser()
args = parser.parse_args()
auth = OAuthHandler(config.consumer_key, config.consumer_secret)
auth.set_access_token(config.access_token, config.access_secret)
api = tweepy.API(auth) twitter_stream = Stream(auth, MyListener(args.data_dir, args.query))
twitter_stream.filter(track=[args.query])

使用Python对Twitter进行数据挖掘(Mining Twitter Data with Python)的更多相关文章

  1. Mining Twitter Data with Python

    目录 1.Collecting data 1.1 Register Your App 1.2 Accessing the Data 1.3 Streaming 2.Text Pre-processin ...

  2. 数据挖掘(二)用python实现数据探索:汇总统计和可视化

    今天我们来讲一讲有关数据探索的问题.其实这个概念还蛮容易理解的,就是我们刚拿到数据之后对数据进行的一个探索的过程,旨在了解数据的属性与分布,发现数据一些明显的规律,这样的话一方面有助于我们进行数据预处 ...

  3. Python 中的实用数据挖掘

    本文是 2014 年 12 月我在布拉格经济大学做的名为‘ Python 数据科学’讲座的笔记.欢迎通过 @RadimRehurek 进行提问和评论. 本次讲座的目的是展示一些关于机器学习的高级概念. ...

  4. python快速入门——进入数据挖掘你该有的基础知识

    这篇文章是用来总结python中重要的语法,通过这些了解你可以快速了解一段python代码的含义 Python 的基础语法来带你快速入门 Python 语言.如果你想对 Python 有全面的了解请关 ...

  5. 精通 Oracle+Python,第 7 部分:面向服务的 Python 架构

    面向服务的架构 (SOA) 在当今的业务战略中具有至关重要的作用.混搭企业组件已成为所有任务关键的企业应用程序的标准要求,从而确保在企业架构的各层实现顺畅的服务编排.对此,Python 是一个不错的选 ...

  6. 【Python五篇慢慢弹】快速上手学python

    快速上手学python 作者:白宁超 2016年10月4日19:59:39 摘要:python语言俨然不算新技术,七八年前甚至更早已有很多人研习,只是没有现在流行罢了.之所以当下如此盛行,我想肯定是多 ...

  7. 《精通Python网络爬虫》|百度网盘免费下载|Python爬虫实战

    <精通Python网络爬虫>|百度网盘免费下载|Python爬虫实战 提取码:7wr5 内容简介 为什么写这本书 网络爬虫其实很早就出现了,最开始网络爬虫主要应用在各种搜索引擎中.在搜索引 ...

  8. windows和linux中搭建python集成开发环境IDE——如何设置多个python环境

    本系列分为两篇: 1.[转]windows和linux中搭建python集成开发环境IDE 2.[转]linux和windows下安装python集成开发环境及其python包 3.windows和l ...

  9. python成长之路【第一篇】:python简介和入门

    一.Python简介 Python(英语发音:/ˈpaɪθən/), 是一种面向对象.解释型计算机程序设计语言. 二.安装python windows: 1.下载安装包 https://www.pyt ...

随机推荐

  1. ubuntu搭建php开发环境记录

    这两天自己在阿里云上面买了一个ecs,系统选的是ubuntu16.04,第一件事就是先搭环境,这次准备使用lamp组合. Apache安装 首先安装apache服务器,ubuntu下面使用apt-ge ...

  2. Python 多线程、进程

    本节内容 操作系统发展史介绍 进程.与线程区别 python GIL全局解释器锁 线程 语法 join 线程锁之Lock\Rlock\信号量 将线程变为守护进程 Event事件 queue队列 生产者 ...

  3. ubuntu16下用QT5实现对话框应用

    ubuntu16下用QT5,实现对话框程序,步骤:生成界面Dialog.ui,将它应用到主程序,通过主程序显示. 一 界面练习 1 Dialog.ui界面生成 在命令行输入:designer 进入界面 ...

  4. DOCKER解析(转)

    Docker基本概念详解 本文只是对Docker的概念做了较为详细的介绍,并不涉及一些像Docker环境的安装以及Docker的一些常见操作和命令. 阅读本文大概需要15分钟,通过阅读本文你将知道一下 ...

  5. C语言递归函数讲解

    递归函数是什么? 是函数.................... 你可以把它理解成是for循环与死循环的结合的函数.简单的说:递归函数是有条件终止的死循环函数: 死循环函数这里是指在函数体中调用自身: ...

  6. C++中绝对值的运算

    首先,输入-42333380005结果取出来的绝对值却是616292955. 开始我以为是long型的取值范围有问题,就把long型全部改为long long型的了,结果还是一样,就觉得绝对值这个函数 ...

  7. eclipse中tomcat可以start启动,无法debug启动的解决

    设置断点,进行程序调试,但是debug启动tomcat,却无法启动,并且会报超时异常. 原因可能是eclipse和tomcat启动时读取文件发生冲突 去掉所有的断点,然后重新debug启动,再设置断点 ...

  8. Jmeter获取redis数据

    背景:jmeter写注册登录接口时,需要获取验短信验证码,一般都是存在数据库,但我们的开发把验证码存到redis里面了 步骤:1.写个redis工具类 2.打成jar包,导入jmeter lib\ex ...

  9. linux svnserver的安装使用备用

    先说一下初弄者的误区,svn上传到svnserver的文件是变化了的,会被打包加入svn的版本库里边一般存在db 文件下 每次提交会生成0,1,2 这样排序的文件,在  /var/svn/apple/ ...

  10. android中一些特殊字符的使用(如:←↑→↓等箭头符号)

    在项目中,有时候在一些控件(如Button.TextView)中要添加一些符号,如下图所示:                         这个时候可以使用图片的方式来显示,不过这些可以直接使用Un ...