Reference: An Introduction to Text Mining using Twitter Streaming API and Python

Reference: How to Register a Twitter App in 8 Easy Steps

  • Getting Data from Twitter Streaming API
  • Reading and Understanding the data
  • Mining the tweets

Key Methods:

  • Map()
  • Lambda()
  • Set()
  • Pandas.DataFrame()
  • matplotlib

1. Getting Data from Twitter Streaming API

twitter_streaming.py, this file is used to extract information from Twitter.

#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream #Variables that contains the user credentials to access Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET" #This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener): def on_data(self, data):
print(data)
return True def on_error(self, status):
print(status) if __name__ == '__main__': #This handles Twitter authetification and the connection to Twitter Streaming API
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l) #This line filter Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
stream.filter(track=['python', 'javascript', 'ruby'])

You can use the following command to store information in the specific file. (By CMD)

python twitter_streaming.py > twitter_data.txt

Then we will get the information from the above text file and store them in JSON format.

import json
tweets_data_path = r"..\twitter_data.txt"
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
try:
tweet = json.loads(line)
tweets_data.append(tweet)
except:
continue

Data are stored in tweets_data, and we can get the specific information by the following scripts.

Reference: python JSON only get keys in first level

# get the text content, language from the specific tweets
num = 0
for tweet in tweets_data:
num += 1
if num == 10:
break
else:
tweet_text = tweet["text"]
tweet_lang = tweet["lang"]
print(str(num))
print(tweet_lang)
print(tweet_text)
print() # get all the keys from json
tweets_data[0].keys()

2. Reading and Understanding the data

Now we can also get the specific key by list(), map() and lambda() with the following scripts.

Reference: Python中map与lambda的结合使用

>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'

Or we can also use set() method to get the unique values of the list.

Reference: Python set() 函数

Reference: Python统计列表中的重复项出现的次数的方法

>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}

Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.

>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] != None else None, tweets_data))
>>> tweets['lang'].value_counts()
en 1119
ja 278
es 113
pt 36
und 26
...

Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.

>>> tweets_by_lang = tweets['lang'].value_counts()

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()

Next, we will create a chart describing the Top 5 countries from which the tweets were sent.

>>> tweets_by_country = tweets['country'].value_counts()

>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()

3. Mining the tweets

Out main goals in these text mining tasks are: compare the popularity of Python, Ruby and Javascript programming languages and to retrieve programming tutorial links. We will do this in 3 steps:

  • We will add tags to our tweets DataFrame in order to be able to manipulate the data easily.
  • Target tweets that have "programming" or tutorial" keywords.
  • Extract links from the relevant tweets.

Adding Python, Ruby, and Javascript tags

First, we will create a function that checks if a specific keyword is present in a text. We will do this by using regular expression (正则表达式).

Python provides a library for regular expression called re. We will start by importing this library.

Next, we will create a function called word_in_text(word, text). This function return True if a word is found in text, otherwise it returns False.

>>> import re
>>> def word_in_text(word, text):
word = word.lower()
text = text.lower()
match = re.search(word, text)
if match:
return True
return False

Next, we will add 3 columns to our tweets DataFrame by pandas.DataFrame.apply().

>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))

We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:

>>> print(tweets['python'].value_counts()[True])
447
>>> print(tweets['ruby'].value_counts()[True])
529
>>> print(tweets['javascript'].value_counts()[True])
275

We can make a simple comparison chart by executing the following:

>>> prg_langs = ['python', 'ruby', 'javascript']
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

This shows, that the keyword ruby is the most popular, followed by python then javascript. However, the tweets DataFrame contains information about all tweets that contains one of the 3 keywords and doesn't restrict the information to the programming languages. For example, there are a lot of tweets that contains the keyword ruby and that are related to a political scandal Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.

Targeting relevant tweets

We are interested in targeting tweets that are related to programming languages. Such tweets often have one of the 2 keywords: "programming" or "tutorial". We will create 2 additional columns to our tweets DataFrame where we will add this information.

>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))

We will add an additional column called relevant that take value True if the tweet has either "programming" or "tutorial" keyword, otherwise it takes value False.

>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))

We can print the counts of relevant tweet by executing the commands below.

>>> print(tweets['programming'].value_counts()[True])
55
>>> print(tweets['tutorial'].value_counts()[True])
22
>>> print(tweets['relevant'].value_counts()[True])
74

We can compare now the popularity of the programming languages by executing the commands below.

tweets[tweets['relevant'] == True]['python'] # 将 relevant 为 True 的索引对应 Python 组成一个新的列
>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])
11

Python is the most popular with a count of 31, followed by javascript by a count of 11, and ruby by a count of 185. We can make a comparison

>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],
tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],
tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

Extracting links from the relevants tweets

Now that we extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses regular expressions for retrieving link that start with "http://" or "https:" from a text. This function will return the url if found, otherwise it returns an empty string.

>>> def extract_link(text):
regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
match = re.search(regex, text)
if match:
return match.group()
return ''

Next, we will add a column called link to our tweets DataFrame. This column will contain the urls information.

>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))

Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of tweets DataFrame and contains all relevant tweets that have a link.

将原有 DataFrame 进行截取。

>>> tweets_relevant = tweets[tweets['relevant'] == True]
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']

We can now print out all links for python, ruby, and javascript by executing the commands below:

>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])
40 https://t.co/zoAgyQuMAZ
105 https://t.co/ogaPbuIbEW
274 https://t.co/y4sUmovFOn
329 https://t.co/A030fqWeWA
339 https://t.co/LaaVc5T2rQ
391 https://t.co/8bYvlziCZb
413 https://t.co/8bYvlziCZb
436 https://t.co/EByqxT1qyN
444 https://t.co/8bYvlziCZb
445 https://t.co/5Jujg6h31B
462 https://t.co/UrFHlOaJYf
476 https://t.co/5Jujg6h31B
477 https://t.co/EByqxT1qyN
589 https://t.co/UrFHlOaJYf
603 https://t.co/5Jujg6h31B
822 https://t.co/Oc21FrzQc5
1060 https://t.co/qOAIuKfyD0
1097 https://t.co/qOAIuKfyD0
1248 https://t.co/V3ZNKuYsK7
1278 https://t.co/qOAIuKfyD0
1411 https://t.co/szHRHavQKy
1594 https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])
782 https://t.co/JgY40r2NSo
833 https://t.co/JgY40r2NSo
1177 https://t.co/xycOG3ndi9
1254 https://t.co/xycOG3ndi9
1293 https://t.co/LMHW050TGs
1328 https://t.co/SS4DzEnSBZ
1393 https://t.co/NZlUce5Ne8
1619 https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])
130 https://t.co/AbJFaSI0B8
286 https://t.co/7dNBIsQ5Gq
467 https://t.co/3YIK588j8t
471 https://t.co/vjBJWWzvfv
830 https://t.co/T4mUjwUcgL
1093 https://t.co/wvLZLjuVKF
1180 https://t.co/luxL2qbxte
1526 https://t.co/G3ZTFL0RKv
Name: link, dtype: object

【337】Text Mining Using Twitter Streaming API and Python的更多相关文章

  1. An Introduction to Text Mining using Twitter Streaming

    Text mining is the application of natural language processing techniques and analytical methods to t ...

  2. 【LeetCode】299. Bulls and Cows 解题报告(Python)

    [LeetCode]299. Bulls and Cows 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ 题 ...

  3. 【LeetCode】743. Network Delay Time 解题报告(Python)

    [LeetCode]743. Network Delay Time 解题报告(Python) 标签(空格分隔): LeetCode 作者: 负雪明烛 id: fuxuemingzhu 个人博客: ht ...

  4. 【LeetCode】518. Coin Change 2 解题报告(Python)

    [LeetCode]518. Coin Change 2 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ 题目 ...

  5. 【LeetCode】474. Ones and Zeroes 解题报告(Python)

    [LeetCode]474. Ones and Zeroes 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ ...

  6. 【LeetCode】731. My Calendar II 解题报告(Python)

    [LeetCode]731. My Calendar II 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ 题 ...

  7. 【LeetCode】785. Is Graph Bipartite? 解题报告(Python)

    [LeetCode]785. Is Graph Bipartite? 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu. ...

  8. 【LeetCode】895. Maximum Frequency Stack 解题报告(Python)

    [LeetCode]895. Maximum Frequency Stack 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxueming ...

  9. 【LeetCode】764. Largest Plus Sign 解题报告(Python)

    [LeetCode]764. Largest Plus Sign 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn ...

随机推荐

  1. sklearn的画图

    from sklearn.metrics import roc_curve fpr, tpr, thresholds=roc_curve(y_train_5, y_scores) fpr, tpr & ...

  2. 虚拟机 VMware Tools 安装

    Ubuntu 或具有图形用户界面的 Ubuntu Server 要挂载 CD 镜像并解压,请按以下步骤操作: 启动此虚拟机. 使用具有管理员权限或 root 用户权限的帐户登录此虚拟机. 选择:对于F ...

  3. Microsoft Dynamics CRM 2011 中 SQL 语句总结

    一.查询用户和团队 插入的类型:OwnerIdType 可以为用户或团队: 1.团队: select top 1 ObjectTypeCode from metadataschema.entity w ...

  4. go test 初始化--- TestMain的使用

    go test 功能,提高了开发和测试的效率. 有时会遇到这样的场景: 进行测试之前需要初始化操作(例如打开连接),测试结束后,需要做清理工作(例如关闭连接)等等.这个时候就可以使用TestMain( ...

  5. sql 判断 数据库 表 字段 是否存在

    select * From master.dbo.sysdatabases where name='数据库名'select * from sysobjects where id = object_id ...

  6. 开启postgresql的远程权限

    cd /etc/postxxxx/版本号/main vim postgresql.conf 修改 #listen_addresses ='localhost'为 listen_addresses =' ...

  7. php 内容插入数据库需要mysql_escape_string处理一下 展示内容时候用htmlentities

    php 内容插入数据库需要mysql_escape_string处理一下 mysql_escape_string (PHP 4 >= 4.0.3, PHP 5, 注意:在PHP5.3中已经弃用这 ...

  8. 学习笔记之Unit testing/Integration testing/dotnet test and xUnit

    source code https://github.com/haotang923/dotnet/tree/master/src Unit testing C# code in .NET Core u ...

  9. 更喜欢从一而终?bing测试在新窗口打开链接遭美国网友痛批

                  原链接地址:http://www.cnbeta.com/articles/186529.htm 我们都知道在中国网站点击一个链接之后,默认在新窗口或新标签打开,大家也很熟悉 ...

  10. 长短时记忆网络(LSTM)

    长短时记忆网络 循环神经网络很难训练的原因导致它的实际应用中很处理长距离的依赖.本文将介绍改进后的循环神经网络:长短时记忆网络(Long Short Term Memory Network, LSTM ...