Parsing Document Metadata with Bing Scaping

Set up the environment - install goquery package.

https://github.com/PuerkitoBio/goquery

go get github.com/PuerkitoBio/goquery

Modify the Proxy setting if in China. Refer to: https://sum.golang.org/

Unzip an Office file and analyze the Open XML file struct. "creator", "lastModifiedBy" in core.xml and "Application", "Company", "AppVersion" in app.xml are of primary interest.

Defining the metadata Package and mapping the data to structs in GO to open, parse, and extract Office Open XML documents.

package metadata

import (
"archive/zip"
"encoding/xml"
"strings"
) // Open XML type definition and version mapping
type OfficeCoreProperty struct {
XMLName xml.Name `xml:"coreProperties"`
Creator string `xml:"creator"`
LastModifiedBy string `xml:"lastModifiedBy"`
} type OfficeAppProperty struct {
XMLName xml.Name `xml:"Properties"`
Application string `xml:"Application"`
Company string `xml:"Company"`
Version string `xml:"AppVersion"`
} var OfficeVersion = map[string]string{
"16": "2016",
"15": "2013",
"14": "2010",
"12": "2007",
"11": "2003",
} func (a *OfficeAppProperty) GetMajorVersion() string {
tokens := strings.Split(a.Version, ".") if len(tokens) < 2 {
return "Unknown"
}
v, ok := OfficeVersion[tokens[0]]
if !ok {
return "Unknown"
}
return v
} // Processing Open XML archives and embedded XML documents
func NewProperties(r *zip.Reader) (*OfficeCoreProperty, *OfficeAppProperty, error) {
var coreProps OfficeCoreProperty
var appProps OfficeAppProperty for _, f := range r.File {
switch f.Name {
case "docProps/core.xml":
if err := process(f, &coreProps); err != nil {
return nil, nil, err
}
case "docProps/app.xml":
if err := process(f, &appProps); err != nil {
return nil, nil, err
}
default:
continue
}
}
return &coreProps, &appProps, nil
} func process(f *zip.File, prop interface{}) error {
rc, err := f.Open()
if err != nil {
return err
}
defer rc.Close() if err := xml.NewDecoder(rc).Decode(&prop); err != nil {
return err
} return nil
}

Figure out how to search for and retrieve files by using Bing.

1. Submit a search request to Bing with proper filters to retrieve targeted results.

2. Scrape the HTML response, extracting the HRER(link) data to obtain direct URLs for documents.

3. Submit an HTTP request for each direct document URL.

4. Parse the response body to create a zip.Reader.

5. Pass the zip.Reader into the code you already developed to extract metadata.

Analyze the search result elements in Bing.

Now scrap Bing results and parse the document metadata.

package metadata

import (
"archive/zip"
"encoding/xml"
"strings"
) // Open XML type definition and version mapping
type OfficeCoreProperty struct {
XMLName xml.Name `xml:"coreProperties"`
Creator string `xml:"creator"`
LastModifiedBy string `xml:"lastModifiedBy"`
} type OfficeAppProperty struct {
XMLName xml.Name `xml:"Properties"`
Application string `xml:"Application"`
Company string `xml:"Company"`
Version string `xml:"AppVersion"`
} var OfficeVersion = map[string]string{
"16": "2016",
"15": "2013",
"14": "2010",
"12": "2007",
"11": "2003",
} func (a *OfficeAppProperty) GetMajorVersion() string {
tokens := strings.Split(a.Version, ".") if len(tokens) < 2 {
return "Unknown"
}
v, ok := OfficeVersion[tokens[0]]
if !ok {
return "Unknown"
}
return v
} // Processing Open XML archives and embedded XML documents
func NewProperties(r *zip.Reader) (*OfficeCoreProperty, *OfficeAppProperty, error) {
var coreProps OfficeCoreProperty
var appProps OfficeAppProperty for _, f := range r.File {
switch f.Name {
case "docProps/core.xml":
if err := process(f, &coreProps); err != nil {
return nil, nil, err
}
case "docProps/app.xml":
if err := process(f, &appProps); err != nil {
return nil, nil, err
}
default:
continue
}
}
return &coreProps, &appProps, nil
} func process(f *zip.File, prop interface{}) error {
rc, err := f.Open()
if err != nil {
return err
}
defer rc.Close() if err := xml.NewDecoder(rc).Decode(&prop); err != nil {
return err
} return nil
}

Go Pentester - HTTP CLIENTS(5)的更多相关文章

  1. Go Pentester - HTTP CLIENTS(1)

    Building HTTP Clients that interact with a variety of security tools and resources. Basic Preparatio ...

  2. Go Pentester - HTTP CLIENTS(4)

    Interacting with Metasploit msf.go package rpc import ( "bytes" "fmt" "gopk ...

  3. Go Pentester - HTTP CLIENTS(3)

    Interacting with Metasploit Early-stage Preparation: Setting up your environment - start the Metaspl ...

  4. Go Pentester - HTTP CLIENTS(2)

    Building an HTTP Client That Interacts with Shodan Shadon(URL:https://www.shodan.io/)  is the world' ...

  5. Creating a radius based VPN with support for Windows clients

    This article discusses setting up up an integrated IPSec/L2TP VPN using Radius and integrating it wi ...

  6. Deploying JRE (Native Plug-in) for Windows Clients in Oracle E-Business Suite Release 12 (文档 ID 393931.1)

    In This Document Section 1: Overview Section 2: Pre-Upgrade Steps Section 3: Upgrade and Configurati ...

  7. ZK 使用Clients.response

    参考: http://stackoverflow.com/questions/11416386/how-to-access-au-response-sent-from-server-side-at-c ...

  8. MySQL之aborted connections和aborted clients

    影响Aborted_clients 值的可能是客户端连接异常关闭,或wait_timeout值过小. 最近线上遇到一个问题,接口日志发现有很多超时报错,根据日志定位到数据库实例之后发现一切正常,一般来 ...

  9. 【渗透测试学习平台】 web for pentester -2.SQL注入

    Example 1 字符类型的注入,无过滤 http://192.168.91.139/sqli/example1.php?name=root http://192.168.91.139/sqli/e ...

随机推荐

  1. yum 安装包的时候提示“没有可用软件包”

    今天在使用 yum 命令进行包的下载时候,Linux 提示 没有可用的软件包~ 如下: [root@localhost share]# yum -y install wordpress 已加载插件:f ...

  2. Java Service Wrapper 浅谈

    在实际开发过程中很多模块需要独立运行,他们并不会以web形式发布,传统的做法是将其压缩为jar包独立运行,这种形式简单易行也比较利于维护,但是 一旦服务器重启或出现异常时,程序往往无法自行修复或重启. ...

  3. XmlHttpRequest使用及“跨域”问题解决

    一. IE7以后对xmlHttpRequest 对象的创建在不同浏览器上是兼容的. 下面的方法是考虑兼容性的,实际项目中一般使用Jquery的ajax请求,可以不考虑兼容性问题. function g ...

  4. nmap二层发现

    使用nmap进行arp扫描要使用一个参数:-sn,该参数表明屏蔽端口扫描而只进行arp扫描. nmap支持ip段扫描,命令:nmap -sn 192.168.1.0/24 nmap速度比arping快 ...

  5. Java并发包JUC核心原理解析

    CS-LogN思维导图:记录CS基础 面试题 开源地址:https://github.com/FISHers6/CS-LogN JUC 分类 线程管理 线程池相关类 Executor.Executor ...

  6. CLR垃圾收集器

    CLR GC是一种引用跟踪算法,大致步骤如下: 1.暂停进程中所有的线程: 2.标记阶段,遍历堆中的所有对象,标记为删除,然后检查所有活动根,如果有引用对象,就标记那个对象可达,否则不可达: 3.GC ...

  7. SpringBoot--使用Mybatis分页插件

    1.导入分页插件包和jpa包 <dependency> <groupId>org.springframework.boot</groupId> <artifa ...

  8. yqq命令

    用 apt-get install 安装时 ,会有一个提问,问是否继续,需要输入 yes.使用 yqq 的话,就没有这个提问了,自动 yes . 例如下面的更新和安装nginx: apt-get -y ...

  9. 收藏 | 14张思维导图-构建Python核心体系!Python语法总结!

    今天在看Python时,ZOE的Python思维导图总结的很好,分享一下 链接: https://pan.baidu.com/s/1s6Gtptp-pJS0UliNeRIvjg 提取码: mrfz

  10. 常见的H5移动端Web页面Bug问题解决方案总汇

    解决jquery ajax调用远程接口的跨域问题 首先,接口必须允许远程调用.这是后端或者运维的事情.你必须保证你得到的一个接口是允许远程调用的.否则,就没啥了. $.ajax({ type:'get ...