【337】Text Mining Using Twitter Streaming API and Python
Reference: An Introduction to Text Mining using Twitter Streaming API and Python
Reference: How to Register a Twitter App in 8 Easy Steps
- Getting Data from Twitter Streaming API
- Reading and Understanding the data
- Mining the tweets
Key Methods:
- map()
- lambda
- set()
- pandas.DataFrame()
- matplotlib
1. Getting Data from Twitter Streaming API
Create a file named twitter_streaming.py; it is used to extract information from Twitter.
#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filters Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])
You can run the following command from the command line to store the streamed tweets in a file:
python twitter_streaming.py > twitter_data.txt
Then we will read the tweets back from this text file, parsing each line as JSON.
import json

tweets_data_path = r"..\twitter_data.txt"
tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except ValueError:
            # skip lines that are not valid JSON (e.g. truncated tweets)
            continue
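As a quick sanity check, comparing the number of lines in the file with len(tweets_data) shows how many lines failed to parse. A minimal line-counting helper (a sketch; the helper name is our own):

```python
def count_lines(path):
    """Count the lines (one JSON record per line) in the streamed file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

# e.g. count_lines(tweets_data_path) - len(tweets_data) gives the number of unparseable lines
```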
The parsed tweets are stored in tweets_data, and we can extract specific fields with the following scripts.
Reference: python JSON only get keys in first level
# get the text content and language from the first few tweets
num = 0
for tweet in tweets_data:
    num += 1
    if num == 10:
        break
    else:
        tweet_text = tweet["text"]
        tweet_lang = tweet["lang"]
        print(str(num))
        print(tweet_lang)
        print(tweet_text)
        print()

# get all the keys from the JSON object
tweets_data[0].keys()
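Individual tweets are nested dictionaries, so accessing a missing key raises KeyError. A hedged sketch of safer access with dict.get() (the helper name is our own; 'user', 'screen_name', and 'place' are field names from the Twitter JSON format):

```python
def safe_get(tweet, *keys):
    """Walk nested dict keys, returning None if any level is missing or None."""
    current = tweet
    for key in keys:
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current

# e.g. safe_get(tweet, 'user', 'screen_name') or safe_get(tweet, 'place', 'country')
```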
2. Reading and Understanding the data
Now we can extract a specific key from every tweet with list(), map(), and a lambda expression, as in the following scripts.
Reference: Using map together with lambda in Python
>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'
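The same extraction can also be written as a list comprehension, which many find more readable than map() with a lambda (the sample records here are illustrative, not from the real stream):

```python
# illustrative sample records with the same shape as the parsed tweets
tweets_data = [{'text': 'hello', 'lang': 'en'}, {'text': 'hola', 'lang': 'es'}]

# equivalent to list(map(lambda tweet: tweet['text'], tweets_data))
texts = [tweet['text'] for tweet in tweets_data]
```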
We can also use the set() function to get the unique values of the list.
Reference: Python set() function
Reference: How to count occurrences of duplicate items in a Python list
>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
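Beyond set(), collections.Counter counts how often each language occurs, which is a quick check before moving to pandas (the sample list here is illustrative):

```python
from collections import Counter

# illustrative sample of language codes like the ones above
langs = ['en', 'ja', 'en', 'es', 'en']
lang_counts = Counter(langs)

# most_common() sorts entries by frequency, descending
top_lang, top_count = lang_counts.most_common(1)[0]
```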
Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.
>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] is not None else None, tweets_data))
>>> tweets['lang'].value_counts()
en 1119
ja 278
es 113
pt 36
und 26
...
Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.
>>> tweets_by_lang = tweets['lang'].value_counts()
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()

Next, we will create a chart describing the Top 5 countries from which the tweets were sent.
>>> tweets_by_country = tweets['country'].value_counts()
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()

3. Mining the tweets
Our main goals in this text mining task are to compare the popularity of the Python, Ruby, and JavaScript programming languages and to retrieve links to programming tutorials. We will do this in 3 steps:
- Add tags to our tweets DataFrame so that the data is easy to manipulate.
- Target tweets that contain the keyword "programming" or "tutorial".
- Extract links from the relevant tweets.
Adding Python, Ruby, and Javascript tags
First, we will create a function that checks whether a specific keyword is present in a text. We will do this using regular expressions.
Python provides a regular-expression library called re. We will start by importing this library.
Next, we will create a function called word_in_text(word, text). This function returns True if word is found in text, otherwise it returns False.
>>> import re
>>> def word_in_text(word, text):
...     word = word.lower()
...     text = text.lower()
...     match = re.search(word, text)
...     if match:
...         return True
...     return False
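Note that re.search matches substrings, so word_in_text('ruby', ...) also fires on words like 'rubygate'. A stricter variant using word boundaries (\b) is sketched below; this refinement and its name are our own, not part of the original tutorial:

```python
import re

def word_in_text_strict(word, text):
    """Match the word only as a whole word, case-insensitively."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text, re.IGNORECASE) is not None
```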
Next, we will add 3 columns to our tweets DataFrame by pandas.DataFrame.apply().
>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))
We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:
>>> print(tweets['python'].value_counts()[True])
447
>>> print(tweets['ruby'].value_counts()[True])
529
>>> print(tweets['javascript'].value_counts()[True])
275
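Be aware that value_counts()[True] raises a KeyError when no tweet matches a keyword. Since the columns are boolean, summing them is a safer equivalent (a sketch on a toy DataFrame, not the real stream data):

```python
import pandas as pd

# toy boolean columns shaped like tweets['python'] / tweets['ruby']
df = pd.DataFrame({'python': [True, False, True], 'ruby': [False, False, False]})

# True counts as 1, so .sum() counts the matches and never raises KeyError
python_count = int(df['python'].sum())
ruby_count = int(df['ruby'].sum())
```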
We can make a simple comparison chart by executing the following:
>>> prg_langs = ['python', 'ruby', 'javascript']
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

This shows that the keyword ruby is the most popular, followed by python and then javascript. However, the tweets DataFrame contains all tweets that mention one of the 3 keywords and is not restricted to the programming languages. For example, many tweets containing the keyword ruby are related to the Rubygate political scandal. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.
Targeting relevant tweets
We are interested in targeting tweets that are related to programming languages. Such tweets often contain one of these 2 keywords: "programming" or "tutorial". We will add 2 more columns to our tweets DataFrame to capture this information.
>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))
We will add an additional column called relevant that takes the value True if the tweet contains either the "programming" or the "tutorial" keyword, and False otherwise.
>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))
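Since the 'programming' and 'tutorial' columns already exist, the same relevant column can also be computed without re-scanning the text, using the element-wise | operator on the boolean columns (sketched on a toy DataFrame):

```python
import pandas as pd

# toy boolean columns shaped like tweets['programming'] / tweets['tutorial']
df = pd.DataFrame({'programming': [True, False, False], 'tutorial': [False, True, False]})

# element-wise OR of the two boolean Series
df['relevant'] = df['programming'] | df['tutorial']
```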
We can print the counts of relevant tweets by executing the commands below.
>>> print(tweets['programming'].value_counts()[True])
55
>>> print(tweets['tutorial'].value_counts()[True])
22
>>> print(tweets['relevant'].value_counts()[True])
74
We can now compare the popularity of the programming languages by executing the commands below.
tweets[tweets['relevant'] == True]['python'] # selects the 'python' values for the rows where 'relevant' is True
>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])
11
Python is the most popular with a count of 31, followed by javascript with a count of 11, and ruby with a count of 8. We can make a comparison chart by executing the following:
>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],
...                       tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],
...                       tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

Extracting links from the relevant tweets
Now that we have extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses a regular expression to retrieve links starting with "http://", "https://", or "www." from a text. This function returns the url if one is found, otherwise it returns an empty string.
>>> def extract_link(text):
...     regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
...     match = re.search(regex, text)
...     if match:
...         return match.group()
...     return ''
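extract_link returns only the first match. If a tweet contains several links, re.findall returns all of them with the same regex (the plural function name is our own):

```python
import re

def extract_links(text):
    """Return every http(s) or www link found in the text."""
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    return re.findall(regex, text)
```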
Next, we will add a column called link to our tweets DataFrame. This column will contain the urls information.
>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of tweets DataFrame and contains all relevant tweets that have a link.
Here we take a subset of the original DataFrame.
>>> tweets_relevant = tweets[tweets['relevant'] == True]
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']
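The two filtering steps can also be combined into a single boolean mask; note that parentheses are required around each condition when combining with & (sketched on a toy DataFrame):

```python
import pandas as pd

# toy DataFrame shaped like the relevant columns of tweets
tweets = pd.DataFrame({
    'relevant': [True, True, False],
    'link': ['https://t.co/x', '', 'https://t.co/y'],
})

# keep rows that are relevant AND have a non-empty link
mask = (tweets['relevant'] == True) & (tweets['link'] != '')
tweets_relevant_with_link = tweets[mask]
```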
We can now print out all links for python, ruby, and javascript by executing the commands below:
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])
40 https://t.co/zoAgyQuMAZ
105 https://t.co/ogaPbuIbEW
274 https://t.co/y4sUmovFOn
329 https://t.co/A030fqWeWA
339 https://t.co/LaaVc5T2rQ
391 https://t.co/8bYvlziCZb
413 https://t.co/8bYvlziCZb
436 https://t.co/EByqxT1qyN
444 https://t.co/8bYvlziCZb
445 https://t.co/5Jujg6h31B
462 https://t.co/UrFHlOaJYf
476 https://t.co/5Jujg6h31B
477 https://t.co/EByqxT1qyN
589 https://t.co/UrFHlOaJYf
603 https://t.co/5Jujg6h31B
822 https://t.co/Oc21FrzQc5
1060 https://t.co/qOAIuKfyD0
1097 https://t.co/qOAIuKfyD0
1248 https://t.co/V3ZNKuYsK7
1278 https://t.co/qOAIuKfyD0
1411 https://t.co/szHRHavQKy
1594 https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])
782 https://t.co/JgY40r2NSo
833 https://t.co/JgY40r2NSo
1177 https://t.co/xycOG3ndi9
1254 https://t.co/xycOG3ndi9
1293 https://t.co/LMHW050TGs
1328 https://t.co/SS4DzEnSBZ
1393 https://t.co/NZlUce5Ne8
1619 https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])
130 https://t.co/AbJFaSI0B8
286 https://t.co/7dNBIsQ5Gq
467 https://t.co/3YIK588j8t
471 https://t.co/vjBJWWzvfv
830 https://t.co/T4mUjwUcgL
1093 https://t.co/wvLZLjuVKF
1180 https://t.co/luxL2qbxte
1526 https://t.co/G3ZTFL0RKv
Name: link, dtype: object