Text Classification

These notes cover text classification as an extrinsic evaluation of word embeddings, that is, as a downstream task.

Some concepts are drawn from the Fudan University NLP Group (复旦大学NLP组).

Statistics-Based Methods

Logistic Regression

From a statistical perspective, text classification is described as follows [Li Y. 2015]:

We use Tencent news titles as our text classification dataset. A total of 8,826 titles of four categories (society, entertainment, healthcare, and military) are extracted. The lengths of titles range from 10 to 20 words. We train ℓ2-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al, 2008) with the learned embeddings.
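For reference, the ℓ2-regularized logistic regression objective that LIBLINEAR minimizes in its primal form can be written as below. The notation is an assumption taken from the LIBLINEAR documentation rather than from [Li Y. 2015]: x_i stands for the embedding-based feature vector of title i, y_i ∈ {−1, +1} is a binary label, and C > 0 is the cost parameter that trades off the penalty against the loss. With four categories, LIBLINEAR would typically fit one such binary problem per class (one-vs-rest).

```latex
\min_{w} \; \frac{1}{2} w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \, w^{\top} x_i\right)\right)
```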

[Kiros 2015] describe a similar setup:

On all datasets, we simply extract skip-thought vectors and train a logistic regression classifier on top.

[Yan Song 2018] also apply this method:

This document classification experiment is performed in a conventional way as that in previous studies [Kiela et al., 2015; Kiros et al., 2015]. For all the documents in training and test datasets, we first construct document level representations by averaging the embeddings from all words in a given document. A logistic regression classifier is then trained on top of the resulted document level representations on the training set and evaluated on the test set.
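The averaging-then-classify pipeline quoted above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the tiny two-dimensional `EMBEDDINGS` table, the toy documents, the binary labels, and the helper names `doc_vector`, `train_logreg`, and `predict`. The classifier is fitted with plain batch gradient descent, which is a stand-in for whatever solver the cited papers actually used.

```python
import math

# Toy 2-dimensional word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "goal": [1.0, 0.1], "match": [0.9, 0.2], "team": [0.8, 0.0],
    "film": [0.1, 1.0], "actor": [0.0, 0.9], "song": [0.2, 0.8],
}

def doc_vector(tokens, embeddings, dim=2):
    """Average the embeddings of all in-vocabulary tokens in a document."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500, l2=0.01):
    """Binary L2-regularized logistic regression via batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            # Prediction error for this example: p(y=1|x) - label.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Mean gradient plus the L2 penalty on the weights (bias is not penalized).
        w = [wi - lr * (gw / n + l2 * wi) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Label 0 = sports, label 1 = entertainment (toy data).
train_docs = [(["goal", "match"], 0), (["team", "goal"], 0),
              (["film", "actor"], 1), (["song", "film"], 1)]
test_docs = [(["match", "team"], 0), (["actor", "song"], 1)]

X_train = [doc_vector(tokens, EMBEDDINGS) for tokens, _ in train_docs]
y_train = [label for _, label in train_docs]
w, b = train_logreg(X_train, y_train)

predictions = [predict(doc_vector(tokens, EMBEDDINGS), w, b) for tokens, _ in test_docs]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test_docs)) / len(test_docs)
print(predictions, accuracy)
```

In a real evaluation the embedding table would be the learned embeddings under test, and the classifier would be kept fixed across embedding variants so that accuracy differences reflect the embeddings rather than the classifier.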

Linear SVM

[Kiela 2015] describe the approach as follows:

we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.
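The summing-then-SVM setup quoted above can be sketched the same way. This is a minimal illustration under stated assumptions, not the procedure from [Kiela 2015]: the `EMBEDDINGS` table, the ten toy documents, and the helpers `doc_vector`, `train_linear_svm`, and `cross_validate` are all hypothetical, the SVM is fitted by simple subgradient descent on the hinge loss rather than by a dedicated solver, and the development set used for hyperparameter tuning is omitted.

```python
# Toy 2-dimensional word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "goal": [1.0, 0.1], "match": [0.9, 0.2], "team": [0.8, 0.0],
    "film": [0.1, 1.0], "actor": [0.0, 0.9], "song": [0.2, 0.8],
}

def doc_vector(tokens, embeddings, dim=2):
    """Sum the embeddings of all in-vocabulary tokens in a document."""
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(embeddings.get(token, [0.0] * dim)):
            total[i] += value
    return total

def score(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=100):
    """Linear SVM (hinge loss + L2 penalty) via subgradient descent.
    Labels must be +1 or -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            violated = label * score(x, w, b) < 1
            # Shrink the weights (L2 penalty), then step on the hinge subgradient.
            w = [wi * (1 - lr * lam) for wi in w]
            if violated:
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def predict(x, w, b):
    return 1 if score(x, w, b) >= 0 else -1

def cross_validate(X, y, k=10):
    """k-fold cross-validation; example i is assigned to fold i % k."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        w, b = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i], w, b) == y[i] for i in test)
        scores.append(correct / len(test))
    return scores

# Label +1 = sports, label -1 = entertainment (toy data, interleaved by class).
docs = [
    (["goal", "match"], 1), (["film", "actor"], -1),
    (["team", "goal"], 1), (["song", "film"], -1),
    (["match", "team"], 1), (["actor", "song"], -1),
    (["goal", "match", "team"], 1), (["film", "actor", "song"], -1),
    (["team", "team", "goal"], 1), (["song", "song", "film"], -1),
]
X = [doc_vector(tokens, EMBEDDINGS) for tokens, _ in docs]
y = [label for _, label in docs]

fold_scores = cross_validate(X, y, k=10)
accuracy = sum(fold_scores) / len(fold_scores)
print(accuracy)
```

With only ten documents, ten-fold cross-validation reduces to leave-one-out; on a realistic corpus each fold would hold out roughly a tenth of the documents.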

Bibliography

Fudan University NLP Group (复旦大学NLP组). NLP-Beginner. https://github.com/FudanNLP/nlp-beginner

[Li Y. 2015] Li, Y., et al. “Component-Enhanced Chinese Character Embeddings.” EMNLP (2015): 829-834.

[Kiros 2015] Kiros, Ryan, et al. "Skip-Thought Vectors." Advances in Neural Information Processing Systems 28(2015).

[Yan Song 2018] Song, Yan et al. “Joint Learning Embeddings for Chinese Words and their Components via Ladder Structured Networks.” IJCAI (2018).

[Kiela 2015] Kiela, Douwe et al. “Specializing Word Embeddings for Similarity or Relatedness.” EMNLP (2015).
