【443】Tweets Analysis Q&A
【Question 01】
When converting Tweets to a csv file, commas inside a field (e.g. location: Sydney, NSW) break the file, because each comma is read as a column separator and creates extra columns.
The solution is to wrap the content in double quotation marks, like this:
fo.write("\"" + str(tweet["user"]["location"]) + "\"")
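As an alternative to adding the quotation marks by hand, Python's standard csv module can do the quoting. The following is only a minimal sketch: write_rows and the sample row are hypothetical names and data made up for illustration, not part of the original script.

```python
import csv
import io

def write_rows(out, rows):
    """Write rows as CSV, quoting every field so embedded commas stay in one column."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)

# Hypothetical sample row standing in for a user id and tweet["user"]["location"].
buffer = io.StringIO()
write_rows(buffer, [["user_1", "Sydney, NSW"]])
csv_text = buffer.getvalue()

# Reading the text back shows that the location is still a single column.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

With csv.QUOTE_ALL every field is wrapped in double quotation marks, which has the same effect as the manual fo.write() call above.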
【Question 02】
When opening the csv file with Excel, it sometimes shows garbled characters, although the same file displays correctly in Notepad.
One solution is to open the file with Notepad++.
Another solution is to write a byte order mark (BOM) at the beginning of the file, so that Excel recognizes it as UTF-8, like this:
fo = open(r"D:\Twitter Data\Data\test\tweets.csv", "w", encoding="utf-8")
fo.write("\ufeff")
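Python can also write the BOM by itself when the file is opened with the "utf-8-sig" encoding. A minimal sketch, assuming a temporary demo path and made-up rows (write_csv_with_bom and demo_path are hypothetical names):

```python
import os
import tempfile

def write_csv_with_bom(path, lines):
    """Write lines to path; the "utf-8-sig" codec emits the UTF-8 BOM first."""
    with open(path, "w", encoding="utf-8-sig", newline="") as fo:
        for line in lines:
            fo.write(line + "\n")

# Hypothetical temporary path and rows, used only for demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
write_csv_with_bom(demo_path, ['"id","location"', '"1","São Paulo"'])

# The first three bytes of the file are the UTF-8 byte order mark.
with open(demo_path, "rb") as f:
    first_bytes = f.read(3)
```

This way there is no need to write "\ufeff" manually, and the file encoding is stated explicitly instead of depending on the system default.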
【Question 03】
Tweet text can contain line breaks, double quotation marks and single quotation marks. These characters cause errors when creating the csv file.
So we should replace them with a space or remove them, like this:
text = str(tweet["text"])
text = text.replace("\n", " ")
text = text.replace("\"", "")
text = text.replace("\'", "")
fo.write("\"" + text + "\"")
Both tweet["user"]["location"] and tweet["text"] are free-text attributes: users can write whatever they want, so these two fields are the most likely to cause errors.
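Since the same cleaning applies to both free-text attributes, it can be collected in one helper. This is only a sketch: clean_field and the sample tweet are hypothetical, and handling "\r" as well as "\n" is an addition that the original code does not have.

```python
def clean_field(value):
    """Return value as a quoted CSV field with line breaks and quotes removed."""
    text = str(value)
    # Replace every kind of line break with a space ("\r" handling is an assumption).
    for line_break in ("\r\n", "\r", "\n"):
        text = text.replace(line_break, " ")
    # Drop double and single quotation marks entirely.
    text = text.replace('"', "").replace("'", "")
    return '"' + text + '"'

# Hypothetical tweet with a line break, quotes, and a comma in the location.
sample = {"text": 'He said "hi"\nit\'s fine', "user": {"location": "Sydney, NSW"}}
row = ",".join([clean_field(sample["user"]["location"]),
                clean_field(sample["text"])])
```

Note that this changes the text (quotes are lost). If the original text must be kept, the csv module can escape quotes instead of deleting them.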
【Question 04】
After converting Tweets to a csv file, I can't open the file with pandas.read_csv(), so there must be some problems in the data. Since the csv file has more than 100000 rows, how can I locate the error line?
The solution is to convert the first 10000 rows; if there are no errors, convert the next 10000 rows. If an error occurs, narrow the range: for example, if the error occurs between 20000 and 30000, change the range to 20000 to 25000. Repeating this several times locates the error line and reveals the real problem. In this specific case, most problems come from content that includes line breaks, double quotation marks, etc.
The code looks like this:
...
count = 0
for line in tweets_file:
    try:
        count += 1
        # skip the rows before the range being tested
        if (count < 10000):
            continue
        ...
        # stop at the end of the range being tested
        if (count > 20000):
            break
    except:
        continue
...
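The narrowing can also be automated as a binary search over the lines of the csv file that was already written. The sketch below is an assumption-laden illustration, not the original script: find_first_bad_line and rows_parse_ok are hypothetical helpers, and the check used here (each line must parse into the expected number of fields with the standard csv module) stands in for whatever actually fails in pandas.read_csv().

```python
import csv

def rows_parse_ok(lines, expected_fields):
    """Return True if every line parses on its own into expected_fields columns."""
    try:
        for line in lines:
            row = next(csv.reader([line], strict=True))
            if len(row) != expected_fields:
                return False
    except csv.Error:
        return False
    return True

def find_first_bad_line(lines, is_ok):
    """Return the 0-based index of the first bad line, or None if all lines pass."""
    if is_ok(lines):
        return None
    lo, hi = 0, len(lines)  # invariant: lines[:lo] passes, lines[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_ok(lines[:mid]):
            lo = mid   # the first mid lines are fine, the error is later
        else:
            hi = mid   # the error is within the first mid lines
    return hi - 1

# Hypothetical data: the third line has an unquoted comma, so it has 3 fields.
lines = ['"1","Sydney"', '"2","Perth"', '3,Sydney, NSW', '"4","Hobart"']
is_ok = lambda chunk: rows_parse_ok(chunk, 2)
bad_index = find_first_bad_line(lines, is_ok)
```

Each step halves the range, so even with more than 100000 rows only a small number of checks is needed. The is_ok function can be swapped for any other check on a range of lines.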