Text Classification

For purpose of word embedding extrinsic evaluation, especially downstream task.

Some concepts are informed from 复旦大学NLP组

Statistical-Based Method

Logistic Regression

Statistics perspective based text classification described as follow[Li Y 2015].

We use Tencent news titles as our text classification dataset. A total of 8,826 titles of four categories (society, entertainment, healthcare, and military) are extracted. The lengths of titles range from 10 to 20 words. We train ℓ2-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al, 2008) with the learned embeddings.

Also described as follow[kiros 2015].

On all datasets, we simply extract skip-thought vectors and train a logistic regression classifier on top.

[Yan Song 2018] also applied this kind of method.

This document classification experiment is performed in a conventional way as that in previous studies [Kiela et al., 2015; Kiros et al., 2015]. For all the documents in training and test datasets, we first construct document level representations by averaging the embeddings from all words in a given document. A logistic regression classifier is then trained on top of the resulted document level representations on the training set and evaluated on the test set.

Linear SVM

It described as follow[Kiela 2015]

we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.

Bibliography

复旦大学NLP组. NLP-Beginner. https://github.com/FudanNLP/nlp-beginner

[Li Y. 2015] Li Y, Li W, Sun F, et al. Component-Enhanced Chinese Character Embeddings[J]. empirical methods in natural language processing, 2015: 829-834.

[Kiros 2015] Kiros, Ryan, et al. "Skip-Thought Vectors." Advances in Neural Information Processing Systems 28(2015).

[Yan Song 2018] Song, Yan et al. “Joint Learning Embeddings for Chinese Words and their Components via Ladder Structured Networks.” IJCAI (2018).

[Kiela 2015] Kiela, Douwe et al. “Specializing Word Embeddings for Similarity or Relatedness.” EMNLP (2015).

Text Classification的更多相关文章

  1. [转] Implementing a CNN for Text Classification in TensorFlow

    Github上的一个开源项目,文档讲得极清晰 Github - https://github.com/dennybritz/cnn-text-classification-tf 原文- http:// ...

  2. [Tensorflow] RNN - 04. Work with CNN for Text Classification

    Ref: Combining CNN and RNN for spoken language identification Ref: Convolutional Methods for Text [1 ...

  3. Implementing a CNN for Text Classification in TensorFlow

    参考: 1.Understanding Convolutional Neural Networks for NLP 2.Implementing a CNN for Text Classificati ...

  4. 论文列表——text classification

    https://blog.csdn.net/BitCs_zt/article/details/82938086 列出自己阅读的text classification论文的列表,以后有时间再整理相应的笔 ...

  5. CNN tensorflow text classification CNN文本分类的例子

    from:http://deeplearning.lipingyang.org/tensorflow-examples-text/ TensorFlow examples (text-based) T ...

  6. 将迁移学习用于文本分类 《 Universal Language Model Fine-tuning for Text Classification》

    将迁移学习用于文本分类 < Universal Language Model Fine-tuning for Text Classification> 2018-07-27 20:07:4 ...

  7. [Bayes] Maximum Likelihood estimates for text classification

    Naïve Bayes Classifier. We will use, specifically, the Bernoulli-Dirichlet model for text classifica ...

  8. #论文阅读# Universial language model fine-tuing for text classification

    论文链接:https://aclweb.org/anthology/P18-1031 对文章内容的总结 文章研究了一些在general corous上pretrain LM,然后把得到的model t ...

  9. 论文阅读:《Bag of Tricks for Efficient Text Classification》

    论文阅读:<Bag of Tricks for Efficient Text Classification> 2018-04-25 11:22:29 卓寿杰_SoulJoy 阅读数 954 ...

随机推荐

  1. Servlet监听器——实现在线登录人数统计小例子

    一.概念 servlet监听器的主要目的是给web应用增加事件处理机制,以便更好的监视和控制web应用的状态变化,从而在后台调用相应处理程序. 二.监听器的类型 1.根据监听对象的类型和范围,分为3类 ...

  2. vue点击出现蒙版

      需求: 1.点击一个事件时弹出一个蒙版: 2.蒙版上有取消,删除事件:(点击取消时候蒙版消失,点击删除时,删除蒙版并消失): 3.点击空白地方,蒙版也消失:   <template> ...

  3. NoSQL与其常见的产品

    一. 什么是NoSQL NoSQL(NoSQL = Not Only SQL ),意即"不仅仅是SQL",它是一种非关系型数据库. 二. 为什么要有NoSQL 在现代的计算系统上每 ...

  4. CodeReview的一些原则

    架构/设计/常规角度: 单一职责原则 一个类只能干一个事情,一个方法最好也只干一件事情.一个类既干UI的事情,又干逻辑的事情,这个在低质量的代码里是很常见. 行为是否统一 缓存是否统一 错误处理是否统 ...

  5. Delphi 赋值语句和程序的顺序结构

  6. 配置 http 反向代理

    [root@nginx ~]# cd /etc/nginx/ 1 [root@nginx nginx]# cp nginx.conf nginx.conf.bak #备份一个原配置文件 2 [root ...

  7. icmp, IPPROTO_ICMP - Linux IPv4 ICMP 核心模块.

    DESCRIPTION 描述 本网络核心协议模块实现了基于 RFC792 协议中定义的<互联网控制报文协议>.它针对网络主机间通讯出错的情况作出回应并给出诊断信息.用户不能直接使用本模块. ...

  8. linux如何配置使用sendEmail发送邮件

    sendEmail是一个轻量级.命令行的SMTP邮件客户端.如果你需要使用命令行发送邮件,那么sendEmail是非常完美的选择.使用简单并且功能强大.这个被设计用在php.bash.perl和web ...

  9. QWidget 设置背景图片

    QWidget 设置背景图片办法: 利用 QPaltette QPixmap pixmap("back.png"); QPalette palette; palette.setBr ...

  10. mongodb,robomongo 数据查询

    可视化管理工具:Robomongo 是开源,免费的MongoDB管理工具,下载地址:Robomongo下载 1.  基本查询:    构造查询数据.    > db.test.findOne() ...