Text Classification

For purpose of word embedding extrinsic evaluation, especially downstream task.

Some concepts are informed from 复旦大学NLP组

Statistical-Based Method

Logistic Regression

Statistics perspective based text classification described as follow[Li Y 2015].

We use Tencent news titles as our text classification dataset. A total of 8,826 titles of four categories (society, entertainment, healthcare, and military) are extracted. The lengths of titles range from 10 to 20 words. We train ℓ2-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al, 2008) with the learned embeddings.

Also described as follow[kiros 2015].

On all datasets, we simply extract skip-thought vectors and train a logistic regression classifier on top.

[Yan Song 2018] also applied this kind of method.

This document classification experiment is performed in a conventional way as that in previous studies [Kiela et al., 2015; Kiros et al., 2015]. For all the documents in training and test datasets, we first construct document level representations by averaging the embeddings from all words in a given document. A logistic regression classifier is then trained on top of the resulted document level representations on the training set and evaluated on the test set.

Linear SVM

It described as follow[Kiela 2015]

we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.

Bibliography

复旦大学NLP组. NLP-Beginner. https://github.com/FudanNLP/nlp-beginner

[Li Y. 2015] Li Y, Li W, Sun F, et al. Component-Enhanced Chinese Character Embeddings[J]. empirical methods in natural language processing, 2015: 829-834.

[Kiros 2015] Kiros, Ryan, et al. "Skip-Thought Vectors." Advances in Neural Information Processing Systems 28(2015).

[Yan Song 2018] Song, Yan et al. “Joint Learning Embeddings for Chinese Words and their Components via Ladder Structured Networks.” IJCAI (2018).

[Kiela 2015] Kiela, Douwe et al. “Specializing Word Embeddings for Similarity or Relatedness.” EMNLP (2015).

Text Classification的更多相关文章

  1. [转] Implementing a CNN for Text Classification in TensorFlow

    Github上的一个开源项目,文档讲得极清晰 Github - https://github.com/dennybritz/cnn-text-classification-tf 原文- http:// ...

  2. [Tensorflow] RNN - 04. Work with CNN for Text Classification

    Ref: Combining CNN and RNN for spoken language identification Ref: Convolutional Methods for Text [1 ...

  3. Implementing a CNN for Text Classification in TensorFlow

    参考: 1.Understanding Convolutional Neural Networks for NLP 2.Implementing a CNN for Text Classificati ...

  4. 论文列表——text classification

    https://blog.csdn.net/BitCs_zt/article/details/82938086 列出自己阅读的text classification论文的列表,以后有时间再整理相应的笔 ...

  5. CNN tensorflow text classification CNN文本分类的例子

    from:http://deeplearning.lipingyang.org/tensorflow-examples-text/ TensorFlow examples (text-based) T ...

  6. 将迁移学习用于文本分类 《 Universal Language Model Fine-tuning for Text Classification》

    将迁移学习用于文本分类 < Universal Language Model Fine-tuning for Text Classification> 2018-07-27 20:07:4 ...

  7. [Bayes] Maximum Likelihood estimates for text classification

    Naïve Bayes Classifier. We will use, specifically, the Bernoulli-Dirichlet model for text classifica ...

  8. #论文阅读# Universial language model fine-tuing for text classification

    论文链接:https://aclweb.org/anthology/P18-1031 对文章内容的总结 文章研究了一些在general corous上pretrain LM,然后把得到的model t ...

  9. 论文阅读:《Bag of Tricks for Efficient Text Classification》

    论文阅读:<Bag of Tricks for Efficient Text Classification> 2018-04-25 11:22:29 卓寿杰_SoulJoy 阅读数 954 ...

随机推荐

  1. luogu P3210 [HNOI2010]取石头游戏

    传送门 不会结论做个鬼系列 题意其实是在头尾(最多)两个栈以及中间一些双端队列依次取数,然后每个人都要最大化自己的价值 有一个结论,如果一段序列中,出现了三个相邻位置\(A,B,C\),满足\(A\l ...

  2. Scala学习笔记(3)

    数组 ----------------------------------- 0.若长度固定则用Array,若长度可能变化则使用ArrayBuffer 1.提供初始值的时候不要使用new. 2.用() ...

  3. SpringBoot在macOS下启动慢的原因

    title: SpringBoot在macOS下启动慢的原因 comments: false date: 2019-07-16 09:48:24 description: 在 macOS + JDk1 ...

  4. thinkphp5+layui多图片上传

    准备资料 下载layui <!DOCTYPE html> <html> <head> <meta charset="utf-8"> ...

  5. js数组与对象的区别

    数组和对象两者都可以用来表示数据的集合,曾一度搞不清楚”数组”(array)和”对象”(object)的根本区别在哪里. 有一个数组a=[1,2,3,4],还有一个对象a={0:1,1:2,2:3,3 ...

  6. spring boot引入thymeleaf导致中文乱码

    加上下面这句代码 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" / ...

  7. vue 列表渲染 v-for

    1.数组列表       v-for 块中,我们拥有对父作用域属性的完全访问权限.v-for 还支持一个可选的第二个参数为当前项的索引 1.1 普通渲染       v-for="item ...

  8. Juery入门2

    1.Jquery操作文档 <!DOCTYPE html> <html lang="en"> <head> <meta charset=&q ...

  9. 接口测试参数化详解(Jmeter)

    简介 接口测试是目前最主流的自动化测试手段,它组合不同的参数向服务器发送请求,接受和解析响应结果,通过测试数据的交换逻辑来验证服务端程序工作的正确性.我们在测试过程中需要考虑不同的输入组合,来覆盖不同 ...

  10. CenterNet

    Objects as Points anchor-free系列的目标检测算法,只检测目标中心位置,无需采用NMS 1.主干网络 采用Hourglass Networks [1](还有resnet18 ...