Text mining is the application of natural language processing techniques and analytical methods to text data in order to derive relevant information.

Text mining is getting a lot attention these last years, due to an exponential increase in digital text data from web pages, google's projects such as google books and google ngram, and social media services such as Twitter.

Twitter data constitutes a rich source that can be used for capturing information about any topic imaginable. This data can be used in different use cases such as finding trends related to a specific keyword, measuring brand sentiment, and gathering feedback about new products and services.

In this tutorial, I will use Twitter data to compare the popularity of 3 programming languages: Python, Javascript and Ruby, and to retrieve links to programming tutorials. In the first paragraph, I will explaing how to connect to Twitter Streaming API and how to get the data. In the second paragraph, I will explain how to structure the data for analysis, and in the last paragraph, I will explain how to filter the data and extract links from tweets.

Using only 2 days worth of Twitter data, I could retrieve 644 links to python tutorials, 413 to javascript tutorials and 136 to ruby tutorials. Furthermore, I could confirm that python is 1.5 times more popular than javascript and 4 times more popular than ruby.

1. Getting Data from Twitter Streaming API

API stands for Application Programming Interface. It is a tool that makes the interaction with computer programs and web services easy. Many web services provides APIs to developers to interact with their services and to access data in programmatic way. For this tutorial, we will use Twitter Streaming API to download tweets related to 3 keywords: "python", "javascript", and "ruby".

Step 1: Getting Twitter API keys

In order to access Twitter Streaming API, we need to get 4 pieces of information from Twitter: API key, API secret, Access token and Access token secret. Follow the steps below to get all 4 elements:

  • Create a twitter account if you do not already have one.
  • Go to https://apps.twitter.com/ and log in with your twitter credentials.
  • Click "Create New App"
  • Fill out the form, agree to the terms, and click "Create your Twitter application"
  • In the next page, click on "API keys" tab, and copy your "API key" and "API secret".
  • Scroll down and click "Create my access token", and copy your "Access token" and "Access token secret".

Step 2: Connecting to Twitter Streaming API and downloading data

We will be using a Python library called Tweepy to connect to Twitter Streaming API and downloading the data. If you don't have Tweepy installed in your machine, go to this link, and follow the installation instructions.

Next create, a file called twitter_streaming.py, and copy into it the code below. Make sure to enter your credentials into access_token, access_token_secret, consumer_key, and consumer_secret.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#twitter_streamming.py
# Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream # Variables that contains the user credentials to access Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET" # This is a basic listener that just prints received tweets to stdout. class StdOutListener(StreamListener): def on_data(self, data):
print(data)
return True def on_error(self, status):
print(status) if __name__ == '__main__':
# This handles Twitter authetification and the connection to Twitter
# Streaming API
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l)
# This line filter Twitter Streams to capture data by the keywords:
# 'python', 'javascript', 'ruby'
stream.filter(track=['python', 'javascript', 'ruby'])

If you run the program from your terminal using the command: python twitter_streaming.py, you will see data flowing like the picture below.

You can stop the program by pressing Ctrl-C.

We want to capture this data into a file that we will use later for the analysis. You can do so by piping the output to a file using the following command: python twitter_streaming.py > twitter_data.txt.

I run the program for 2 days (from 2014/07/15 till 2014/07/17) to get a meaningful data sample. This file size is 242 MB.

  1. Reading and Understanding the data

The data that we stored twitter_data.txt is in JSON format. JSON stands for JavaScript Object Notation. This format makes it easy to humans to read the data, and for machines to parse it. Below is an example for one tweet in JSON format. You can see that the tweet contains additional information in addition to the main text which in this example: "Yaayyy I learned some JavaScript today! #thatwasntsohard #yesitwas #stoptalkingtoyourself #hashbrown #hashtag".

{"created_at":"Tue Jul 15 14:19:30 +0000 2014","id":489051636304990208,"id_str":"489051636304990208","text":"Yaayyy I learned some JavaScript today! #thatwasntsohard #yesitwas #stoptalkingtoyourself #hashbrown #hashtag","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2301702187,"id_str":"2301702187","name":"Toni Barlettano","screen_name":"itsmetonib","location":"Greater NYC Area","url":"http:\/\/www.tonib.me","description":"So Full of Art   |   \nToni Barlettano Creative Media + Design","protected":false,"followers_count":8,"friends_count":25,"listed_count":0,"created_at":"Mon Jan 20 16:49:46 +0000 2014","favourites_count":6,"utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":20,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/425313048320958464\/Z2GcderW_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/425313048320958464\/Z2GcderW_normal.jpeg","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"thatwasntsohard","indices":[40,56]},{"text":"yesitwas","indices":[57,66]},{"text":"stoptalkingtoyourself","indices":[67,89]},{"text":"hashbrown","indices":[90,100]},{"text":"hashtag","indices":[101,109]}],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}

For the remaining of this tutorial, we will be using 4 Python libraries json for parsing the data, pandas for data manipulation, matplotlib for creating charts, adn re for regular expressions. The json and re libraries are installed by default in Python. You should install pandas and matplotlib if you don't have them in your machine.

We will start first by uploading json and pandas using the commands below:

import json
import pandas as pd
import matplotlib.pyplot as plt tweets_data_path = '../data/twitter_data.txt' tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
try:
tweet = json.loads(line)
tweets_data.append(tweet)
except:
continue print(len(tweets_data)) # we will structure the tweets data into a pandas DataFrame to simplify
# the data manipulation. We will start by creating an empty DataFrame
# called tweets using the following command. tweets = pd.DataFrame() # Next, we will add 3 columns to the tweets DataFrame called text, lang,
# and country. text column contains the tweet, lang column contains the
# language in which the tweet was written, and country the country from
# which the tweet was sent. tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
tweets['lang'] = map(lambda tweet: tweet['lang'], tweets_data)
tweets['country'] = map(lambda tweet: tweet['place']['country'] if tweet[
'place'] is not None else None, tweets_data) # Next, we will create 2 charts: The first one describing the Top 5
# languages in which the tweets were written, and the second the Top 5
# countries from which the tweets were sent. tweets_by_lang = tweets['lang'].value_counts() fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Languages', fontsize=15)
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red') # and the second the Top 5 countries from which the tweets were sent.
tweets_by_country = tweets['country'].value_counts() fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Countries', fontsize=15)
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')

转发自adilmoujahid

An Introduction to Text Mining using Twitter Streaming的更多相关文章

  1. 【337】Text Mining Using Twitter Streaming API and Python

    Reference: An Introduction to Text Mining using Twitter Streaming API and Python Reference: How to R ...

  2. 正则表达式和文本挖掘(Text Mining)

    在进行文本挖掘时,TSQL中的通配符(Wildchar)显得功能不足,这时,使用“CLR+正则表达式”是非常不错的选择,正则表达式看似非常复杂,但,万变不离其宗,熟练掌握正则表达式的元数据,就能熟练和 ...

  3. coursera 公开课 文本挖掘和分析(text mining and analytics) week 1 笔记

    一.课程简介: text mining and analytics 是一门在coursera上的公开课,由美国伊利诺伊大学香槟分校(UIUC)计算机系教授 chengxiang zhai 讲授,公开课 ...

  4. (Deep) Neural Networks (Deep Learning) , NLP and Text Mining

    (Deep) Neural Networks (Deep Learning) , NLP and Text Mining 最近翻了一下关于Deep Learning 或者 普通的Neural Netw ...

  5. [转帖]Introduction to text manipulation on UNIX-based systems

    Introduction to text manipulation on UNIX-based systems https://www.ibm.com/developerworks/aix/libra ...

  6. Unsupervised Learning and Text Mining of Emotion Terms Using R

    Unsupervised learning refers to data science approaches that involve learning without a prior knowle ...

  7. Text Mining and Analytics WEEK1

    第一周目标 解释自然语言处理中的一些基本概念 解释不同的方式来表示文本数据 解释的两种基本的词联想以及如何从文本数据挖掘聚合关系 尝试回答以下问题 为了理解一个自然语言句子,计算机必须做些什么? 什么 ...

  8. Azure平台 对Twitter 推文关键字进行实时大数据分析

    Learn how to do real-time sentiment analysis of big data using HBase in an HDInsight (Hadoop) cluste ...

  9. OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification-paper

    这篇论文的related work非常详尽地介绍了各种readability的语料 abstract这个paper描述了onestopengilish这个三个level的文本语料的收集和整理,阐述了再 ...

随机推荐

  1. CentOS之NTP服务器配置

    本文使用CentOS 6.5作为搭建环境 一.服务器端配置 1.安装所需软件包 yum -y install ntp ntpdate---------------------------------- ...

  2. LCD RGB 控制技术 时钟篇(下)

    我们先回顾一下之前的典型时序图 在这个典型的时序图里面,除了上篇博文讲述的HSYNC VSYNC VDEN VCLK这几信号外,我们还能看见诸如HSPW. VSPW,HBPD. HFPD,VBPD. ...

  3. python-web 创建一个输入链接生成的网站

    第一步:写一个自定义程序 #coding=utf-8 import os #Python的标准库中的os模块包含普遍的操作系统功能import re #引入正则表达式对象import urllib # ...

  4. CFGym 101490E 题解

    一.题目链接 http://codeforces.com/gym/101490 二.题面 三.题意 给你一个图,n个点,m条边,一个x,从顶点1走到顶点n.假设从顶点1走到顶点n的最短路为d,x代表你 ...

  5. javascript如何判断是手机还是电脑访问本网页

    var system ={}; var p = navigator.platform; system.win = p.indexOf("Win") == 0; system.mac ...

  6. application/json 和 application/x-www-form-urlencoded的区别

    public static string HttpPost(string url, string body) { //ServicePointManager.ServerCertificateVali ...

  7. 利用MessageFormat实现短信模板的匹配

    其实没什么技术含量,因为老是想不起来,所以在此文做下记录. 通常我们的应用系统中都会有很多短信的发送,或者是信息邮件等的推送,而这些信息却有着相同的共性,比如只是用户名换了下. 像下面这条,除了红色字 ...

  8. 熟悉下apple 马甲包

    一.什么是马甲包 马甲包是利用App store 规则漏洞,通过技术手段,多次上架同一款产品的方法.马甲包和主产品包拥有同样的内容和功能,除了icon和应用名称不能完全一致,其他基本一致. 二.为什么 ...

  9. 好记性不如烂笔头-linux学习笔记2kickstart自动化安装和cacti

    kickstart自动化安装的逻辑梳理 主要是安装tftp nfs dhcp 然后配置kickstart 原来就是先安装tftp 可实现不同机器的文件下载 然后在安装nfs 就是主服务器的文件系统 然 ...

  10. Pthreads n 体问题

    ▶ <并行程序设计导论>第六章中讨论了 n 体问题,分别使用了 MPI,Pthreads,OpenMP 来进行实现,这里是 Pthreads 的代码,分为基本算法和简化算法(引力计算量为基 ...