VIPS: a VIsion based Page Segmentation Algorithm
Introduction
The VIsion-based Page Segmentation (VIPS) algorithm aims to extract the semantic structure of a web page based on its visual presentation. This semantic structure is a tree in which each node corresponds to a block. Each node is assigned a value, the Degree of Coherence (DoC), that indicates how coherent the content of the block is based on visual perception: the larger the DoC value, the more coherent the block. The VIPS algorithm makes full use of the page layout structure. It first extracts all the suitable blocks from the HTML DOM tree, and then it finds the separators between these blocks. Here, separators denote the horizontal or vertical lines in a web page that do not visually cross any block. Based on these separators, the semantic tree of the web page is constructed. Thus, a web page can be represented as a set of blocks (the leaf nodes of the semantic tree). Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements, and decoration, can be easily removed because it is often placed in certain positions of a page. Contents with different topics are distinguished as separate blocks.
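The separator idea described above can be illustrated with a small sketch. It is not the VIPS implementation: it assumes that blocks have already been extracted as axis-aligned rectangles in page pixel coordinates, and the names `Block` and `horizontal_separators` are hypothetical. It handles only horizontal separators, starting from one strip covering the whole page and cutting away the vertical extent of every block.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Axis-aligned rectangle in page pixels (hypothetical representation)."""
    left: int
    top: int
    right: int
    bottom: int


def horizontal_separators(blocks: List[Block], page_height: int) -> List[Tuple[int, int]]:
    """Return (start_y, end_y) strips that span the page and cross no block."""
    # Start with a single strip covering the whole page.
    separators = [(0, page_height)]
    for b in blocks:
        updated = []
        for start, end in separators:
            if b.bottom <= start or b.top >= end:
                # The block lies entirely above or below this strip.
                updated.append((start, end))
                continue
            # The block overlaps the strip: keep only the parts outside the block.
            if b.top > start:
                updated.append((start, b.top))
            if b.bottom < end:
                updated.append((b.bottom, end))
        separators = updated
    # Strips touching the page border do not separate two blocks, so drop them.
    return [(s, e) for s, e in separators if s > 0 and e < page_height]


# Example: a header block and two side-by-side blocks below it.
example_blocks = [
    Block(left=0, top=10, right=100, bottom=50),
    Block(left=0, top=70, right=50, bottom=120),
    Block(left=60, top=80, right=100, bottom=130),
]
print(horizontal_separators(example_blocks, page_height=150))  # -> [(50, 70)]
```

Vertical separators could be found the same way by swapping the roles of the x and y coordinates. How separators are weighted and how the semantic tree is built from them is described in the papers listed below and is not covered by this sketch.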
Paper List
Original Paper
- Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma. "Extracting Content Structure for Web Pages based on Visual Representation", in the Fifth Asia Pacific Web Conference (APWeb2003), 2003.
- Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma. "VIPS: a Vision-based Page Segmentation Algorithm", Microsoft Technical Report (MSR-TR-2003-79), 2003. (An updated version of the technical report is available as a PDF.)
Applications using VIPS
- Shipeng Yu, Deng Cai, Ji-Rong Wen and Wei-Ying Ma. "Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation", in the Twelfth International World Wide Web Conference (WWW2003), May 2003.
- Ruihua Song, Haifeng Liu, Ji-Rong Wen and Wei-Ying Ma. "Learning Block Importance Models for Web Pages", in the Proceedings of the Thirteenth World Wide Web Conference (WWW 2004), 203-211, New York, May 2004.
- Deng Cai, Xiaofei He, Wei-Ying Ma, Ji-Rong Wen and Hong-Jiang Zhang. "Organizing WWW Images Based on The Analysis of Page Layout and Web Link Structure", in the 2004 IEEE International Conference on Multimedia and EXPO (ICME'2004), June 2004.
- Xiaofei He, Deng Cai, Ji-Rong Wen, Wei-Ying Ma and Hong-Jiang Zhang. "ImageSeer: Clustering and Searching WWW Images Using Link and Page Layout Analysis", Microsoft Technical Report (MSR-TR-2004-38), 2004.
- Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma. "Block-based Web Search", in the 27th Annual International ACM SIGIR Conference (SIGIR'2004), July 2004.
- Deng Cai, Xiaofei He, Ji-Rong Wen and Wei-Ying Ma. "Block-level Link Analysis", in the 27th Annual International ACM SIGIR Conference (SIGIR'2004), July 2004.
- Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen. "Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis", in 12th ACM International Conference on Multimedia, New York City, USA, Oct. 2004.
- Xin-Jing Wang, Wei-Ying Ma, Gui-Rong Xue and Xing Li. "Multi-Model Similarity Propagation and its Application for Web Image Retrieval", in 12th ACM International Conference on Multimedia, New York City, USA, Oct. 2004.
If you find the VIPS algorithm useful, we would appreciate it very much if you cited the following works:

@Inproceedings{CHWM04,
  author = "Deng Cai and Xiaofei He and Ji-Rong Wen and Wei-Ying Ma",
  title = "Block-level link analysis",
  booktitle = "Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'04)",
  pages = {440--447},
  year = "2004"
}

@Inproceedings{CYWM04,
  author = "Deng Cai and Shipeng Yu and Ji-Rong Wen and Wei-Ying Ma",
  title = "Block-based web search",
  booktitle = "Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'04)",
  pages = {456--463},
  year = "2004"
}

@Inproceedings{YCWM03,
  author = "Shipeng Yu and Deng Cai and Ji-Rong Wen and Wei-Ying Ma",
  title = "Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation",
  booktitle = "Twelfth International World Wide Web Conference (WWW2003)",
  year = "2003"
}

@Inproceedings{CYWM03,
  author = "Deng Cai and Shipeng Yu and Ji-Rong Wen and Wei-Ying Ma",
  title = "Extracting Content Structure for Web Pages based on Visual Representation",
  booktitle = "Fifth Asia Pacific Web Conference (APWeb2003)",
  year = "2003"
}
Demo
Copyright Notice: All these programs can only be used for research.
VIPS dll (The VIPS DLL is under continuous development. All versions are downloadable here.)
- VIPS dll (pageanalyzer.dll) (release date: 03/26/2008. One bug fixed. Thanks to Ankur Gupta for pointing out the bug.)
- VIPS dll (pageanalyzer.dll) (release date: 01/16/2006. Some people requested HTML source code output, so I added it. I also changed some interfaces, so you need to rebuild your program if you want to use this new dll. Please also download the newest demo.)
- VIPS Demo (release date: 01/16/2006) (You should download the VIPS dll and register it first! This demo works only with the new VIPS dll.)
- VIPS dll (pageanalyzer.dll) (release date: 03/20/2005, some bugs fixed)
- VIPS dll (pageanalyzer.dll) (release date: 08/20/2004)
- VIPS Demo (release date: 08/20/2004) (You should download the VIPS dll and register it first!)
How to use VIPS dll
- You should be familiar with how to host a web browser (Internet Explorer) in your program. Some articles in MSDN are very useful.
- A more powerful example of using VIPS dll in VS2003 (release date: 01/25/2006) (This example provides source code showing how to process batch jobs using the VIPS dll. The framework of this example is based on MFCbrowser, which is a demo project in MSDN. You only need to focus on MFCbrowserView.cpp and MFCbrowserView.h. I added some comments, and hopefully these two files are self-explanatory. Email me if you still have any questions.)
- An example of using VIPS dll in VC6.0 (release date: 08/20/2004) (You should download the VIPS dll and register it first!)
Notice: we are currently working to enhance the VIPS algorithm. Any suggestions or problems can be sent to dengcai2 AT cs DOT uiuc DOT edu.

