Text Classification

For purpose of word embedding extrinsic evaluation, especially downstream task.

Some concepts are informed from 复旦大学NLP组

Statistical-Based Method

Logistic Regression

Statistics perspective based text classification described as follow[Li Y 2015].

We use Tencent news titles as our text classification dataset. A total of 8,826 titles of four categories (society, entertainment, healthcare, and military) are extracted. The lengths of titles range from 10 to 20 words. We train ℓ2-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al, 2008) with the learned embeddings.

Also described as follow[kiros 2015].

On all datasets, we simply extract skip-thought vectors and train a logistic regression classifier on top.

[Yan Song 2018] also applied this kind of method.

This document classification experiment is performed in a conventional way as that in previous studies [Kiela et al., 2015; Kiros et al., 2015]. For all the documents in training and test datasets, we first construct document level representations by averaging the embeddings from all words in a given document. A logistic regression classifier is then trained on top of the resulted document level representations on the training set and evaluated on the test set.

Linear SVM

It described as follow[Kiela 2015]

we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.

Bibliography

复旦大学NLP组. NLP-Beginner. https://github.com/FudanNLP/nlp-beginner

[Li Y. 2015] Li Y, Li W, Sun F, et al. Component-Enhanced Chinese Character Embeddings[J]. empirical methods in natural language processing, 2015: 829-834.

[Kiros 2015] Kiros, Ryan, et al. "Skip-Thought Vectors." Advances in Neural Information Processing Systems 28(2015).

[Yan Song 2018] Song, Yan et al. “Joint Learning Embeddings for Chinese Words and their Components via Ladder Structured Networks.” IJCAI (2018).

[Kiela 2015] Kiela, Douwe et al. “Specializing Word Embeddings for Similarity or Relatedness.” EMNLP (2015).

Text Classification的更多相关文章

  1. [转] Implementing a CNN for Text Classification in TensorFlow

    Github上的一个开源项目,文档讲得极清晰 Github - https://github.com/dennybritz/cnn-text-classification-tf 原文- http:// ...

  2. [Tensorflow] RNN - 04. Work with CNN for Text Classification

    Ref: Combining CNN and RNN for spoken language identification Ref: Convolutional Methods for Text [1 ...

  3. Implementing a CNN for Text Classification in TensorFlow

    参考: 1.Understanding Convolutional Neural Networks for NLP 2.Implementing a CNN for Text Classificati ...

  4. 论文列表——text classification

    https://blog.csdn.net/BitCs_zt/article/details/82938086 列出自己阅读的text classification论文的列表,以后有时间再整理相应的笔 ...

  5. CNN tensorflow text classification CNN文本分类的例子

    from:http://deeplearning.lipingyang.org/tensorflow-examples-text/ TensorFlow examples (text-based) T ...

  6. 将迁移学习用于文本分类 《 Universal Language Model Fine-tuning for Text Classification》

    将迁移学习用于文本分类 < Universal Language Model Fine-tuning for Text Classification> 2018-07-27 20:07:4 ...

  7. [Bayes] Maximum Likelihood estimates for text classification

    Naïve Bayes Classifier. We will use, specifically, the Bernoulli-Dirichlet model for text classifica ...

  8. #论文阅读# Universial language model fine-tuing for text classification

    论文链接:https://aclweb.org/anthology/P18-1031 对文章内容的总结 文章研究了一些在general corous上pretrain LM,然后把得到的model t ...

  9. 论文阅读:《Bag of Tricks for Efficient Text Classification》

    论文阅读:<Bag of Tricks for Efficient Text Classification> 2018-04-25 11:22:29 卓寿杰_SoulJoy 阅读数 954 ...

随机推荐

  1. Atcoder Regular 098 区间Pre=Xor Q询问区间连续K去最小值最小极差

    C 用scanf("%s")就会WA..不知道为什么 /*Huyyt*/ #include<bits/stdc++.h> #define mem(a,b) memset ...

  2. java高并发核心要点|系列2|锁的底层实现原理

    上篇文章,我们主要讲了解决多线程之间共享数据的核心问题和解决方案,也讲了锁的简单分类. 那么,这把锁,我们应该怎么去实现呢?如果你是java语言设计者,你又会怎么去设计这个线程锁呢? 直觉告诉我们,我 ...

  3. GitHub上一些有趣的开源项目[持续更新]

    TheAlgorithms/C-Plus-Plus 用C++实现了常见的算法,如排序算法,查找算法,以及一些常见的数据数据结构,如链表,二叉树. 链接:https://github.com/TheAl ...

  4. ubuntu 设置apt-get 代理

    1 添加apt-get 代理配置文件 sudo vi /etc/apt/apt.conf.d/proxy.conf 2 添加内容 Acquire::http::Proxy "http://w ...

  5. vue笔记(更新中)

    1.禁止div点击 2.禁止选择 -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; - ...

  6. 深入理解JAVA虚拟机 程序编译和代码优化

    泛型类型擦除 C#中的泛型,不论是代码中,还是编译后,还是运行期,都是切实存在的.List<String>和List<Int>是两个截然不同的类型,有自己的虚方法表和类型数据, ...

  7. 【串线篇】SpringBoot数据访问【数据源/mybatis/指定映射文件位置】

    一.配置数据源 1.1.jdbc版本 JDBC(.tomcat.jdbc.pool.DataSource作为数据源) <?xml version="1.0" encoding ...

  8. python 脚本制作

    U盘拷贝 当U盘插入主机时 被系统识别挂载后 通过python代码自动的去读取U盘中的资料并且进行拷贝 寄存在U盘上的 把硬盘上的资料进行读取并移动到U盘里 有点像 繁殖性 传输性 破坏性 破坏系统或 ...

  9. 【JZOJ1282】打工

    题目 分析 显然,有一个结论, 在有效的方案中,第i位的数一定小于等于i. 所以,设\(f_{i,j,k}\)表示,做到第i位,前i位的最大值为j,前i位是否与输入的序列的前i位相等. 转移方程随便搞 ...

  10. 从fileGDB中获取List

    /// <summary> /// 从FGDB中获取 /// </summary> /// <param name="fileGDBPath"> ...