Text Classification

2024-10-07 06:32:25

These notes cover text classification as an extrinsic evaluation of word embeddings, that is, as a downstream task on which learned embeddings are tested.

Some concepts are drawn from the NLP-Beginner materials of the Fudan University NLP Group (复旦大学NLP组).

Statistics-Based Methods

Logistic Regression

Text classification from a statistical perspective is described as follows [Li Y. 2015]:

We use Tencent news titles as our text classification dataset. A total of 8,826 titles of four categories (society, entertainment, healthcare, and military) are extracted. The lengths of titles range from 10 to 20 words. We train ℓ2-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al, 2008) with the learned embeddings.
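For reference, the objective behind an ℓ2-regularized logistic regression classifier can be written out as below. This is the standard primal form that LIBLINEAR solves for a binary problem; the symbols are not taken from the quoted paper: x_i is the embedding-based feature vector of title i, y_i ∈ {−1, +1} is its label, and C > 0 is the penalty parameter, whose value the excerpt does not report. With four categories, LIBLINEAR trains one such binary classifier per class (one-vs-rest).

```latex
\min_{\mathbf{w}} \;\; \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
  \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,\mathbf{w}^{\top}\mathbf{x}_i}\right)
```

The first term is the ℓ2 regularizer and the second is the logistic loss; a larger C puts more weight on fitting the training titles.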

The same approach is described in [Kiros 2015]:

On all datasets, we simply extract skip-thought vectors and train a logistic regression classifier on top.

[Yan Song 2018] also applied this kind of method:

This document classification experiment is performed in a conventional way as that in previous studies [Kiela et al., 2015; Kiros et al., 2015]. For all the documents in training and test datasets, we first construct document level representations by averaging the embeddings from all words in a given document. A logistic regression classifier is then trained on top of the resulted document level representations on the training set and evaluated on the test set.
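The procedure in the excerpt above (average the word embeddings of each document, then train logistic regression on the resulting vectors) can be sketched as follows. This is a minimal illustration rather than the authors' code: the 2-d embeddings, the toy documents, the labels, and the choice to map all-unknown documents to the zero vector are all assumptions made up for the example, and scikit-learn's LogisticRegression with the liblinear solver stands in for whatever toolkit the paper used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 2-d word embeddings, invented for illustration only.
EMBEDDINGS = {
    "goal": np.array([0.9, 0.1]),
    "match": np.array([0.8, 0.2]),
    "team": np.array([1.0, 0.0]),
    "vote": np.array([0.1, 0.9]),
    "senate": np.array([0.0, 1.0]),
    "law": np.array([0.2, 0.8]),
}
DIM = 2


def doc_vector(tokens, embeddings=EMBEDDINGS, dim=DIM):
    """Average the embeddings of all in-vocabulary tokens in a document."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: a document with no known words maps to the zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy training and test sets (0 = sports, 1 = politics); not real data.
train_docs = [["goal", "match"], ["team", "goal"], ["vote", "law"], ["senate", "vote"]]
train_labels = [0, 0, 1, 1]
test_docs = [["team", "match"], ["law", "senate"]]
test_labels = [0, 1]

X_train = np.stack([doc_vector(d) for d in train_docs])
X_test = np.stack([doc_vector(d) for d in test_docs])

# The default penalty is l2; C=1.0 is a placeholder, not a tuned value.
clf = LogisticRegression(C=1.0, solver="liblinear")
clf.fit(X_train, train_labels)

accuracy = clf.score(X_test, test_labels)
print(f"test accuracy: {accuracy:.2f}")
```

Because the classifier is deliberately simple and linear, differences in test accuracy mostly reflect the quality of the embeddings, which is what makes this setup useful as an extrinsic evaluation.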

Linear SVM

It is described as follows [Kiela 2015]:

we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.
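A rough sketch of this setup, assuming scikit-learn is available: document vectors are built by summing word embeddings, and a linear-kernel SVM is scored with ten-fold cross-validation. The embeddings, the two topic vocabularies, and the randomly generated documents are hypothetical, and the hyperparameter tuning on a held-out development set mentioned in the excerpt is omitted (C=1.0 is just a placeholder).

```python
import random

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical 2-d embeddings and topic vocabularies, invented for illustration.
EMBEDDINGS = {
    "goal": np.array([0.9, 0.1]),
    "match": np.array([0.8, 0.2]),
    "team": np.array([1.0, 0.0]),
    "vote": np.array([0.1, 0.9]),
    "senate": np.array([0.0, 1.0]),
    "law": np.array([0.2, 0.8]),
}
TOPICS = {0: ["goal", "match", "team"], 1: ["vote", "senate", "law"]}


def doc_vector(tokens, dim=2):
    """Sum the embeddings of all in-vocabulary tokens in a document."""
    total = np.zeros(dim)
    for token in tokens:
        if token in EMBEDDINGS:
            total += EMBEDDINGS[token]
    return total


# Build a small synthetic corpus: 20 documents per topic, 3 to 5 tokens each.
rnd = random.Random(0)
docs, labels = [], []
for label, vocab in TOPICS.items():
    for _ in range(20):
        docs.append(rnd.choices(vocab, k=rnd.randint(3, 5)))
        labels.append(label)

X = np.stack([doc_vector(d) for d in docs])
y = np.array(labels)

# Linear-kernel SVM evaluated with (stratified) ten-fold cross-validation.
clf = LinearSVC(C=1.0)
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Note that summing, unlike averaging, makes the vector norm grow with document length; whether that matters depends on the classifier and on any feature scaling applied beforehand.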

Bibliography

Fudan University NLP Group (复旦大学NLP组). NLP-Beginner. https://github.com/FudanNLP/nlp-beginner

[Li Y. 2015] Li, Y., Li, W., Sun, F., et al. "Component-Enhanced Chinese Character Embeddings." EMNLP (2015): 829-834.

[Kiros 2015] Kiros, Ryan, et al. "Skip-Thought Vectors." Advances in Neural Information Processing Systems 28 (2015).

[Yan Song 2018] Song, Yan, et al. "Joint Learning Embeddings for Chinese Words and their Components via Ladder Structured Networks." IJCAI (2018).

[Kiela 2015] Kiela, Douwe, et al. "Specializing Word Embeddings for Similarity or Relatedness." EMNLP (2015).
