【443】Tweets Analysis Q&A
【Question 01】
When converting Tweets info to a csv file, commas in the middle of a field (e.g. location: Sydney, NSW) corrupt the csv file by creating extra columns.
The solution is to wrap the content in double quotation marks, like this:
fo.write("\"" + str(tweet["user"]["location"]) + "\"")
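As an alternative to adding the quotation marks by hand, Python's standard csv module can do the quoting. The following is a minimal sketch, not the method used in these notes: the sample record and the tweet_to_row helper are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

def tweet_to_row(tweet):
    """Pick the fields to export; this field choice is only illustrative."""
    return [tweet["user"]["location"], tweet["text"]]

# Hypothetical sample record shaped like the tweet dict used above.
sample = {"user": {"location": "Sydney, NSW"}, "text": "hello"}

buffer = io.StringIO()        # stands in for the real output file
writer = csv.writer(buffer)   # default dialect quotes fields that contain commas
writer.writerow(tweet_to_row(sample))
print(buffer.getvalue())
```

Only the field containing a comma is quoted, so the location stays in a single column.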
【Question 02】
When opening the csv file with Excel, it sometimes shows garbled characters, although the same file displays correctly in Notepad.
One solution is to open the file with Notepad++.
Another solution is to write a byte order mark (BOM) at the beginning of the file, like this:
fo = open(r"D:\Twitter Data\Data\test\tweets.csv", "w", encoding="utf-8")
fo.write("\ufeff")
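The same effect can probably be achieved without writing the BOM by hand: Python's "utf-8-sig" codec emits it automatically on the first write. The sketch below is written under that assumption. The helper name write_csv_with_bom and the temporary file path are hypothetical and used only for the demonstration.

```python
import csv
import os
import tempfile

def write_csv_with_bom(path, rows):
    """Write rows as UTF-8 with a leading BOM so Excel can detect the encoding."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as fo:
        csv.writer(fo).writerows(rows)

# Demonstration on a temporary file instead of the real data directory.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
write_csv_with_bom(demo_path, [["location", "text"], ["Sydney, NSW", "café"]])

with open(demo_path, "rb") as f:
    first_bytes = f.read(3)   # the UTF-8 BOM is the three bytes EF BB BF
print(first_bytes)
```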
【Question 03】
Text content can contain carriage returns, double quotation marks, and single quotation marks. These characters cause errors when creating the csv file.
So we should replace those characters with a space or with nothing, like this:
text = str(tweet["text"])
text = text.replace("\r", " ")
text = text.replace("\n", " ")
text = text.replace("\"", "")
text = text.replace("\'", "")
fo.write("\"" + text + "\"")
This applies to both tweet["user"]["location"] and tweet["text"]: users can write whatever they want in these two attributes, so they are the most likely to cause errors.
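Since both attributes need the same cleanup, the replacements above can be collected into one helper. This is a minimal sketch: the name clean_field and the sample record are hypothetical, and the helper also replaces "\r" on the assumption that Windows-style line breaks can appear in the data.

```python
def clean_field(value):
    """Flatten line breaks, drop quote characters, and wrap in double quotes."""
    text = str(value)
    for ch in ("\r", "\n"):
        text = text.replace(ch, " ")   # line breaks become spaces
    for ch in ('"', "'"):
        text = text.replace(ch, "")    # quote characters are removed
    return '"' + text + '"'

# Hypothetical sample record with the problem characters in both fields.
tweet = {"user": {"location": 'Sydney,\n"NSW"'}, "text": "it's\r\nfine"}
line = ",".join([clean_field(tweet["user"]["location"]),
                 clean_field(tweet["text"])])
print(line)
```

Note that this approach discards the quotation marks from the original text. If they must be preserved, a csv writer that escapes them is the safer choice.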
【Question 04】
After converting the Tweets to a csv file, I can't open the file with pandas.read_csv(). There must be a problem somewhere in the data. Since the csv file has more than 100000 rows, how can I locate the error line?
The solution is to convert only the first 10000 rows and, if there are no errors, convert the next 10000 rows. When an error occurs, narrow the range: for example, if the error occurs between rows 20000 and 30000, change the range to 20000 to 25000. Repeating this several times locates the error line and reveals the real problem. In this specific case, most problems come from content that includes carriage returns, double quotation marks, etc.
The code looks like this:
...
count = 0
for line in tweets_file:
    try:
        count += 1
        # skip the rows before the range being tested
        if (count < 10000):
            continue
        ...
        # stop after the end of the range being tested
        if (count > 20000):
            break
    except:
        continue
...
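A different way to find the broken rows, sketched here as an alternative to the manual narrowing above, is to parse the finished csv file with the standard csv module and report every row whose number of fields differs from the expected column count. The function name find_bad_rows and the in-memory sample data are hypothetical. In real use the first argument would be the opened csv file.

```python
import csv
import io

def find_bad_rows(lines, expected_columns):
    """Return (line_number, field_count) for rows with an unexpected width."""
    bad = []
    reader = csv.reader(lines)
    for row in reader:
        if len(row) != expected_columns:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row)))
    return bad

# Hypothetical data: the third line has an unquoted comma in the location.
sample = io.StringIO('location,text\n"Sydney, NSW",ok\nSydney, NSW,broken\n')
result = find_bad_rows(sample, expected_columns=2)
print(result)
```

This only catches rows whose column count is wrong. A stray line break inside an unquoted field shows up as two short rows rather than one long one.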