【Question 01】

  When converting Tweets info to csv file, commas in the middle of data (i.e. location: Sydney, NSW) can make a mistake of the csv file (creaing more columns).

  The solution is to add double quotation marks on both sides of the content, like this:

fo.write("\"" + str(tweet["user"]["location"]) + "\"")

【Question 02】

  When open csv file with Excel, sometimes it will show messy code, but it can show well with Notepad.

  ref:csv 文件打开乱码,有哪些方法可以解决?

  One solution is opening this file with notepad++.

  Another solution is adding codes at the beginning of the writing file, like this:

fo = open(r"D:\Twitter Data\Data\test\tweets.csv", "w")
fo.write("\ufeff")

【Question 03】

  Text contents contain carriage return, double quotation marks, single quotation marks. Those info will make mistakes when creating csv file.

  So we should replace those characters with space or nothing, like this:

text = str(tweet["text"])
text = text.replace("\n", " ")
text = text.replace("\"", "")
text = text.replace("\'", "")
fo.write("\"" + text + "\"")

  Including tweet["user"]["location"] and tweet["text"], for these two attributes, user can write whatever they want, so it's easy to make mistakes.

【Question 04】

  After converting Tweets to csv file, but I can't open this file by pandas.read_csv(). The reason is there must be some problems in those data. Since there are about more than 100000+ rows of this csv file, how can I locate the error line?

  Solution is coverting the first 10000 rows, if there are not errors, and then converting the next 10000 rows. If error occurs, trying to narrow the range of numbers, like error occurs between 20000 to 30000, we can change the range of numbers with 20000 to 25000. Using this method several times, we can locate the error line and find the real problems. For this spicific case, most problems are about contents include carriage return, double quotation marks, etc.

  Codes like this:

...

count = 0
or line in tweets_file:
try:
count += 1
if (count < 10000):
continue
... if (count > 20000):
break
except:
continue
...

【443】Tweets Analysis Q&A的更多相关文章

  1. 【BZOJ4815】[CQOI2017]小Q的表格(莫比乌斯反演,分块)

    [BZOJ4815][CQOI2017]小Q的表格(莫比乌斯反演,分块) 题面 BZOJ 洛谷 题解 神仙题啊. 首先\(f(a,b)=f(b,a)\)告诉我们矩阵只要算一半就好了. 接下来是\(b* ...

  2. 【二分图】ZJOI2007小Q的游戏

    660. [ZJOI2007] 小Q的矩阵游戏 ★☆   输入文件:qmatrix.in   输出文件:qmatrix.out   简单对比 时间限制:1 s   内存限制:128 MB [问题描述] ...

  3. 【BZOJ4813】[CQOI2017]小Q的棋盘(贪心)

    [BZOJ4813][CQOI2017]小Q的棋盘(贪心) 题面 BZOJ 洛谷 题解 果然是老年选手了,这种题都不会做了.... 先想想一个点如果被访问过只有两种情况,第一种是进入了这个点所在的子树 ...

  4. 【bzoj4813】[Cqoi2017]小Q的棋盘 树上dfs+贪心

    题目描述 小Q正在设计一种棋类游戏.在小Q设计的游戏中,棋子可以放在棋盘上的格点中.某些格点之间有连线,棋子只能在有连线的格点之间移动.整个棋盘上共有V个格点,编号为0,1,2…,V-1,它们是连通的 ...

  5. 【439】Tweets processing by Python

        参数说明: coordinates:Represents the geographic location of this Tweet as reported by the user or cl ...

  6. 【HDOJ】4515 小Q系列故事——世界上最遥远的距离

    简单题目,先把时间都归到整年,然后再计算.同时为了防止减法出现xx月00日的情况,需要将d先多增加1,再恢复回来. #include <cstdio> #include <cstri ...

  7. 【444】Data Analysis (shp, arcpy)

      ABS suburbs data of AUS 1. Dissolve Merge polygons with the same attribute of "SA2_NAME16&quo ...

  8. 【LeetCode】字符串 string(共112题)

    [3]Longest Substring Without Repeating Characters (2019年1月22日,复习) [5]Longest Palindromic Substring ( ...

  9. P5346 【XR-1】柯南家族

    题目地址:P5346 [XR-1]柯南家族 Q:官方题解会咕么? A:不会!(大雾 题解环节 首先,我们假设已经求出了 \(n\) 个人聪明程度的排名. \(op = 1\) 是可以 \(O(1)\) ...

随机推荐

  1. (java)Jsoup爬虫学习--获取网页所有的图片,链接和其他信息,并检查url和文本信息

    Jsoup爬虫学习--获取网页所有的图片,链接和其他信息,并检查url和文本信息 此例将页面图片和url全部输出,重点不太明确,可根据自己的需要输出和截取: import org.jsoup.Jsou ...

  2. 【Python学习】Python3 基础语法

    ==================================================================================================== ...

  3. 三.protobuf3标量值类型

    Protobuf3 标量值类型 标量消息字段可以具有以下类型之一——该表显示了.proto文件中指定的类型,以及自动生成的类中的相应类型: .proto类型 说明 C++ 类型 Java 类型 Pyt ...

  4. Zookeeper数据类型、节点类型、角色、watcher监听机制

    1.Zookeeper数据类型:层次化目录结构+少量数据 Zookeeper包含层次化的目录结构,每个Znode都有唯一的路径标识,Znode可以包含数据和子节点. 其中Znode数据可以有多个版本, ...

  5. hibernate之关联关系一对多

    什么是关联(association) 关联指的是类之间的引用关系.如果类A与类B关联,那么被引用的类B将被定义为类A的属性.例如:  public class B{        private St ...

  6. MySQL读写分离项目配置

    一.项目信息 1.拓扑 二.环境规划 1.主机信息 2.软件信息 3.MySQL中间件 三.配置

  7. NIO原理详解

    版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明.本文链接:https://blog.csdn.net/CharJay_Lin/article/d ...

  8. C# 打开 EXE 文件

    命名空间是using System.Diagnostics; 在编写程序时经常会使用到调用可执行程序的情况,本文将简单介绍C#调用exe的方法.在C#中,通过Process类来进行进程操作. Proc ...

  9. SSM 整合 ehcache spring 配置文件报错

    添加 <!-- end MyBatis使用ehcache缓存 --> <cache:annotation-driven cache-manager="cacheManage ...

  10. yyy

    def delete(ps): import os filename = ps[-] delelemetns = ps[] with open(filename, encoding='utf-8') ...