VIPS: a VIsion based Page Segmentation Algorithm


Introduction

The VIsion-based Page Segmentation (VIPS) algorithm aims to extract the semantic structure of a web page based on its visual presentation. This semantic structure is a tree in which each node corresponds to a block. Each node is assigned a value, the Degree of Coherence (DoC), that indicates how coherent the content of the block is based on visual perception: the larger the DoC value, the more coherent the block. The VIPS algorithm makes full use of the page layout structure. It first extracts all the suitable blocks from the HTML DOM tree, and then it finds the separators between these blocks. Here, separators are the horizontal or vertical lines in a web page that do not visually cross any block. Based on these separators, the semantic tree of the web page is constructed. Thus, a web page can be represented as a set of blocks (the leaf nodes of the semantic tree). Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements, and decoration, can be easily removed because it is often placed in certain positions of a page. Contents with different topics are distinguished as separate blocks.
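The separator idea can be illustrated with a minimal sketch. It assumes each extracted block is reduced to an axis-aligned rectangle and looks only for horizontal separators, that is, vertical ranges that no block overlaps. The `Block` type and the `horizontal_separators` function are hypothetical names introduced for illustration; they are not part of the VIPS dll, and this simplified sweep is not the separator detection procedure of the original paper.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Hypothetical visual block: a rectangle in page pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int


def horizontal_separators(blocks: List[Block], page_height: int) -> List[Tuple[int, int]]:
    """Return (start_y, end_y) bands of the page that no block overlaps."""
    separators = []
    covered_until = 0  # lowest bottom edge seen so far in the sweep
    # Sweep from the top of the page downward, in order of top edge.
    for block in sorted(blocks, key=lambda b: b.top):
        if block.top > covered_until:
            # Gap between everything seen so far and this block.
            separators.append((covered_until, block.top))
        covered_until = max(covered_until, block.bottom)
    if covered_until < page_height:
        # Empty band between the last block and the page bottom.
        separators.append((covered_until, page_height))
    return separators


# Illustrative layout: a header, two side-by-side columns, and a footer.
page = [
    Block(0, 0, 100, 20),
    Block(0, 30, 50, 80),
    Block(50, 40, 100, 90),
    Block(0, 100, 100, 120),
]
print(horizontal_separators(page, page_height=120))
```

In this layout the two columns overlap vertically, so no horizontal separator is reported between them; splitting them would need the analogous sweep along the x axis. Note that this sketch also reports empty bands at the top or bottom of the page as separators.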

 



Paper List

Original Paper

Applications using VIPS


If you find the VIPS algorithm useful, we would appreciate it very much if you could cite the following works:

@Inproceedings{CHWM04,
author = "Deng Cai and Xiaofei He and Ji-Rong Wen and Wei-Ying Ma",
title = "Block-level link analysis",
booktitle = "Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'04)",
pages = {440--447},
year = "2004"}

@Inproceedings{CYWM04,
author = "Deng Cai and Shipeng Yu and Ji-Rong Wen and Wei-Ying Ma",
title = "Block-based web search",
booktitle = "Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'04)",
pages = {456--463},
year = "2004"}

@Inproceedings{YCWM03,
author = "Shipeng Yu and Deng Cai and Ji-Rong Wen and Wei-Ying Ma",
title = "Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation",
booktitle = "Twelfth International World Wide Web Conference (WWW2003)",
year = "2003"}

@Inproceedings{CYWM03,
author = "Deng Cai and Shipeng Yu and Ji-Rong Wen and Wei-Ying Ma",
title = "Extracting Content Structure for Web Pages based on Visual Representation",
booktitle = "Fifth Asia Pacific Web Conference (APWeb2003)",
year = "2003"}

 


Demo

Copyright Notice: All these programs can only be used for research.

VIPS dll (The VIPS DLL is always under development. All versions are downloadable here.)

  • VIPS dll (pageanalyzer.dll) (release date: 03/26/2008. One bug fixed. Thanks Ankur Gupta for pointing out the bug.)

     

  • VIPS dll (pageanalyzer.dll) (release date: 01/16/2006. Some people requested HTML source code output, so I added it. I also changed some interfaces, so you need to rebuild your program if you want to use this new dll. Please also download the newest demo.)

     

  • VIPS Demo (release date: 01/16/2006) (You should download VIPS dll and register it first! This demo works only with the new VIPS dll.)

     

  • VIPS dll (pageanalyzer.dll) (release date: 03/20/2005, some bugs fixed)

     

  • VIPS dll (pageanalyzer.dll) (release date: 08/20/2004)

     

  • VIPS Demo (release date: 08/20/2004) (You should download VIPS dll and register it first!)

How to use VIPS dll.

  • You should be familiar with how to host a web browser (Internet Explorer) in your program. Some articles in MSDN are very useful.

     

  • A more powerful example of using VIPS dll in VS2003 (release date: 01/25/2006)
    (This example provides source code showing how to process batch jobs using VIPS dll. The framework of this example is based on MFCbrowser, which is a demo project in MSDN. You only need to focus on MFCbrowserView.cpp and MFCbrowserView.h. I added some comments, and hopefully these two files are self-explanatory. Email me if you still have any questions.)

     

  • An example of using VIPS dll in VC6.0 (release date: 08/20/2004) (You should download VIPS dll and register it first!)

Notice: we are currently working to enhance the VIPS algorithm. Any suggestions or problem reports can be sent to dengcai2 AT cs DOT uiuc DOT edu.
