ZH奶酪: Scraping a Web Page with PHP's DOMDocument

Original article: http://blog.csdn.net/xyzhaopeng/article/details/6626340

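The goal is to extract the data from one table in an HTML page, organize it, and insert it into a MySQL database.

Assume the table of interest in the target HTML has three columns: ID, Name, and Content.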

index.php

<pre class="php" name="code"><?php
    $urlTarget = "http://www.xxxx.com/targethtmlpage.html";
    require_once('ContentManager.php');

    // Create the DOM object and parse the HTML file.
    $htmDoc = new DOMDocument;
    $htmDoc->loadHTMLFile($urlTarget);
    $htmDoc->normalizeDocument();

    // Get every table element in the document.
    $tables_list = $htmDoc->getElementsByTagName('table');

    // Number of tables found (handy while testing).
    $tables_count = $tables_list->length;

    foreach ($tables_list as $table)
    {
        // Read the table's class attribute.
        $tableProp = $table->getAttribute('class');
        if ($tableProp == 'target_table_class')
        {
            $contentMgr = new ContentManager();
            $contentMgr->ParseFromDOMElement($table);
            // $contentMgr has now finished parsing, so the data can be used as needed,
            // for example written to MySQL.
            $contentMgr->SerializeToDB();
        }
    }
?>
</pre>
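The flow in index.php can be tried without a network connection by parsing an inline HTML string with `loadHTML()` instead of fetching a URL. The following is a minimal sketch, not the original author's code: the sample markup is hypothetical, and the call to `libxml_use_internal_errors()` is an added assumption to keep warnings from malformed HTML out of the output.

```php
<?php
// Hypothetical sample markup; the real target page is not shown in the article.
$html = <<<MARKUP
<html><body>
<table class="other"><tr><td>ignored</td></tr></table>
<table class="target_table_class">
  <tr><td>1</td><td>Alice</td><td>First row</td></tr>
  <tr><td>2</td><td>Bob</td><td>Second row</td></tr>
</table>
</body></html>
MARKUP;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('table') as $table) {
    // Same exact-match test on the class attribute as in index.php.
    if ($table->getAttribute('class') !== 'target_table_class') {
        continue;
    }
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->nodeValue);
        }
        $rows[] = $cells;
    }
}

print_r($rows);
```

Only the second table is collected here, because the first one carries a different class.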

ContentManager.php

    <?php
    require_once('ContentInfo.php');

    /**
     * Parses the rows of a table element into ContentInfo objects.
     *
     * @author xxxxx
     */
    class ContentManager {
        // Parsed rows, one ContentInfo per <tr>.
        public $ContentList;

        public function __construct() {
            $this->ContentList = new ArrayObject();
        }

        public function ParseFromDOMElement(DOMElement $table)
        {
            $rows_list = $table->getElementsByTagName('tr');
            foreach ($rows_list as $row)
            {
                $contentInfo = new ContentInfo();
                $contentInfo->ParseFromDOMElement($row);
                $this->ContentList->append($contentInfo);
            }

            // Test output: how many rows were parsed.
            $count = $this->ContentList->count();
            echo $count;
        }

        public function SerializeToDB()
        {
            // Write to the database; code omitted.
        }
    }
    ?>
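The article leaves out the body of `SerializeToDB()`. One possible way to fill it in is a PDO prepared statement run inside a transaction, sketched below as a standalone function. Everything specific here is an assumption for illustration: the function name `saveContentList`, the table name `content`, and its columns `id`, `name` and `content` do not come from the original.

```php
<?php
/**
 * Insert parsed rows with one prepared statement inside a transaction.
 * The table `content` and its columns are hypothetical.
 *
 * @param PDO      $pdo   An open PDO connection.
 * @param iterable $items Objects with public ID, Name and Content properties.
 * @return int Number of rows inserted.
 */
function saveContentList(PDO $pdo, iterable $items): int
{
    // Make PDO throw exceptions so a failed insert triggers the rollback below.
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $stmt = $pdo->prepare(
        'INSERT INTO content (id, name, content) VALUES (:id, :name, :content)'
    );

    $inserted = 0;
    $pdo->beginTransaction();
    try {
        foreach ($items as $item) {
            // Rows without <td> cells (for example header rows) leave ID unset; skip them.
            if ($item->ID === null) {
                continue;
            }
            $stmt->execute([
                ':id'      => (int) $item->ID,
                ':name'    => $item->Name,
                ':content' => $item->Content,
            ]);
            $inserted++;
        }
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
    return $inserted;
}
```

Inside `ContentManager`, `SerializeToDB()` could then call `saveContentList($pdo, $this->ContentList)` with a connection such as `new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'password')`, where the DSN and credentials are placeholders.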

ContentInfo.php

    <?php
    /**
     * Holds the data of one table row: ID, Name and Content.
     *
     * @author xxxxx
     */
    class ContentInfo {
        public $ID;
        public $Name;
        public $Content;

        public function ParseFromDOMElement(DOMElement $row)
        {
            $cells_list = $row->getElementsByTagName('td');
            $cells_length = $cells_list->length;

            // Map the first three cells to ID, Name and Content by position.
            $curCellIdx = 0;
            foreach ($cells_list as $cell)
            {
                switch ($curCellIdx++)
                {
                    case 0:
                        $this->ID = $cell->nodeValue;
                        break;
                    case 1:
                        $this->Name = $cell->nodeValue;
                        break;
                    case 2:
                        $this->Content = $cell->nodeValue;
                        break;
                }
            }
        }
    }
    ?>
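Two details of the approach above are worth illustrating: comparing the whole `class` attribute with `==` misses tables that carry several classes, and a header row made of `<th>` cells yields a record with no values. As an alternative technique (not the author's method), `DOMXPath` can select only the data rows in one query. The sample markup below is hypothetical.

```php
<?php
// Hypothetical sample: a header row plus a table that carries two classes.
$html = <<<MARKUP
<table class="data target_table_class">
  <tr><th>ID</th><th>Name</th><th>Content</th></tr>
  <tr><td>1</td><td>Alice</td><td>First row</td></tr>
  <tr><td>2</td><td>Bob</td><td>Second row</td></tr>
</table>
MARKUP;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match the class as a whole word within the class attribute,
// then keep only rows that have at least one <td> child.
$query = "//table[contains(concat(' ', normalize-space(@class), ' '), "
       . "' target_table_class ')]//tr[td]";

$records = [];
foreach ($xpath->query($query) as $tr) {
    $cells = $xpath->query('td', $tr);
    if ($cells->length < 3) {
        continue; // Not a full ID/Name/Content row.
    }
    $records[] = [
        'ID'      => trim($cells->item(0)->nodeValue),
        'Name'    => trim($cells->item(1)->nodeValue),
        'Content' => trim($cells->item(2)->nodeValue),
    ];
}

print_r($records);
```

The header row is skipped because it has no `<td>` children, so `$records` holds only the two data rows.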
