Coursera Course Notes----P4E.Capstone----Week 2&3
Building a Search Engine (Week 2&3)
Search Engine Architecture
- Web Crawling
- Index Building
- Searching
Web Crawler
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.
Steps
- Retrieve a page
- Look through the page for links
- Add the links to a list of "to be retrieved" sites
- Repeat... (a minimal sketch of this loop appears below)
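A minimal sketch of this retrieve-parse-enqueue loop is shown below. The `crawl` function, its `seed_url` argument, and the `max_pages` limit are illustrative names made up for this example; they are not part of the course's spider.py.

```python
# Illustrative breadth-first crawl loop (not the course's spider.py)
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    to_visit = [seed_url]              # the "to be retrieved" list
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)          # retrieve a page
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read()
        except Exception:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup('a'):          # look through the page for links
            href = tag.get('href')
            if href:
                to_visit.append(urljoin(url, href))   # add the link, then repeat
    return visited
```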
Policies
- selection policy that states which page to download
- re-visit policy that states when to check for changes to the pages
- politeness policy that states how to avoid overloading Web sites
- parallelization policy that states how to coordinate distributed Web crawlers
robots.txt
- A way for a web site to communicate with web crawlers
- An informal and voluntary standard
- It tells the crawler where to look and where not to look (a small robots.txt check is sketched below)
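For example, Python's standard library can parse robots.txt and answer whether a crawler is allowed to fetch a given URL. A small sketch (the site URL below is only a placeholder; the course's spider.py does not perform this check):

```python
# Check robots.txt before fetching a page (illustrative sketch)
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')   # placeholder site
rp.read()
# True if the rules allow any user agent ('*') to fetch this page
print(rp.can_fetch('*', 'http://www.example.com/some/page.html'))
```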
Search Indexing
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power.
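The core data structure behind this is an inverted index: a map from each word to the documents that contain it, so a query only touches the posting lists for its terms instead of scanning every document. A toy sketch (the sample documents are made up; this is not the course code):

```python
# Toy inverted index: word -> set of document ids
from collections import defaultdict

docs = {1: "fast search engine", 2: "web search crawler", 3: "fast web crawler"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Documents containing both "fast" and "web" -> {3}
print(index["fast"] & index["web"])
```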
Code Segments
spider.py
import sqlite3
import urllib.error
import ssl
from urllib.parse import urljoin
from urllib.parse import urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Connect to the SQLite database
conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

# Create new tables
cur.execute('''CREATE TABLE IF NOT EXISTS Pages
    (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
     error INTEGER, old_rank REAL, new_rank REAL)''')

cur.execute('''CREATE TABLE IF NOT EXISTS Links
    (from_id INTEGER, to_id INTEGER)''')

# The Webs table stores the base URL(s) of the site(s) being crawled
cur.execute('''CREATE TABLE IF NOT EXISTS Webs (url TEXT UNIQUE)''')

# Check to see if we are already in progress...
cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1')
row = cur.fetchone()
if row is not None:
    print("Restarting existing crawl. Remove spider.sqlite to start a fresh crawl.")
else:
    starturl = input('Enter web url or enter: ')
    if ( len(starturl) < 1 ) : starturl = 'http://www.dr-chuck.com/'
    # Remove the trailing "/"
    if ( starturl.endswith('/') ) : starturl = starturl[:-1]
    web = starturl
    if ( starturl.endswith('.htm') or starturl.endswith('.html') ) :
        pos = starturl.rfind('/')
        web = starturl[:pos]

    if ( len(web) > 1 ) :
        cur.execute('INSERT OR IGNORE INTO Webs (url) VALUES ( ? )', ( web, ) )
        cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( starturl, ) )
        conn.commit()

# Get the current webs
cur.execute('''SELECT url FROM Webs''')
webs = list()
for row in cur:
    webs.append(str(row[0]))

print(webs)

many = 0
while True:
    if ( many < 1 ) :
        sval = input('How many pages:')
        if ( len(sval) < 1 ) : break
        many = int(sval)
    many = many - 1

    cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1')
    try:
        row = cur.fetchone()
        # print row
        fromid = row[0]
        url = row[1]
    except:
        print('No unretrieved HTML pages found')
        many = 0
        break

    print(fromid, url, end=' ')

    # If we are retrieving this page, there should be no links from it
    cur.execute('DELETE from Links WHERE from_id=?', (fromid, ) )
    try:
        document = urlopen(url, context=ctx)

        html = document.read()
        if document.getcode() != 200 :
            print("Error on page: ", document.getcode())
            cur.execute('UPDATE Pages SET error=? WHERE url=?', (document.getcode(), url) )

        if 'text/html' != document.info().get_content_type() :
            print("Ignore non text/html page")
            cur.execute('DELETE FROM Pages WHERE url=?', ( url, ) )
            conn.commit()
            continue

        print('('+str(len(html))+')', end=' ')

        soup = BeautifulSoup(html, "html.parser")
    except KeyboardInterrupt:
        print('')
        print('Program interrupted by user...')
        break
    except:
        print("Unable to retrieve or parse page")
        cur.execute('UPDATE Pages SET error=-1 WHERE url=?', (url, ) )
        conn.commit()
        continue

    cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( url, ) )
    cur.execute('UPDATE Pages SET html=? WHERE url=?', (memoryview(html), url ) )
    conn.commit()

    # Retrieve all of the anchor tags
    tags = soup('a')
    count = 0
    for tag in tags:
        href = tag.get('href', None)
        if ( href is None ) : continue
        # Resolve relative references like href="/contact"
        up = urlparse(href)
        if ( len(up.scheme) < 1 ) :
            href = urljoin(url, href)
        ipos = href.find('#')
        if ( ipos > 1 ) : href = href[:ipos]
        if ( href.endswith('.png') or href.endswith('.jpg') or href.endswith('.gif') ) : continue
        if ( href.endswith('/') ) : href = href[:-1]
        # print href
        if ( len(href) < 1 ) : continue

        # Check if the URL is in any of the webs
        found = False
        for web in webs:
            if ( href.startswith(web) ) :
                found = True
                break
        if not found : continue

        cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( href, ) )
        count = count + 1
        conn.commit()

        cur.execute('SELECT id FROM Pages WHERE url=? LIMIT 1', ( href, ))
        try:
            row = cur.fetchone()
            toid = row[0]
        except:
            print('Could not retrieve id')
            continue
        # print fromid, toid
        cur.execute('INSERT OR IGNORE INTO Links (from_id, to_id) VALUES ( ?, ? )', ( fromid, toid ) )

    print(count)

cur.close()
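After a crawl, Pages holds one row per URL (with its retrieved HTML and rank) and Links holds the directed edges between page ids. A quick way to see how far a crawl has progressed is to count retrieved versus still-queued pages; the queries below are a small sketch against the same spider.sqlite schema, not one of the course files:

```python
# Progress check against the crawler's database (illustrative sketch)
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM Pages WHERE html IS NOT NULL')
print('retrieved pages:', cur.fetchone()[0])
cur.execute('SELECT COUNT(*) FROM Pages WHERE html IS NULL AND error IS NULL')
print('still queued:', cur.fetchone()[0])
cur.execute('SELECT COUNT(*) FROM Links')
print('links:', cur.fetchone()[0])
cur.close()
```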
sprank.py
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

# Find the ids that send out page rank - we are only interested
# in pages in the SCC that have in and out links
cur.execute('''SELECT DISTINCT from_id FROM Links''')
from_ids = list()
for row in cur:
    from_ids.append(row[0])

# Find the ids that receive page rank
to_ids = list()
links = list()
cur.execute('''SELECT DISTINCT from_id, to_id FROM Links''')
for row in cur:
    from_id = row[0]
    to_id = row[1]
    if from_id == to_id : continue
    if from_id not in from_ids : continue
    if to_id not in from_ids : continue
    links.append(row)
    if to_id not in to_ids : to_ids.append(to_id)

# Get latest page ranks for strongly connected component
prev_ranks = dict()
for node in from_ids:
    cur.execute('''SELECT new_rank FROM Pages WHERE id = ?''', (node, ))
    row = cur.fetchone()
    prev_ranks[node] = row[0]

sval = input('How many iterations:')
many = 1
if ( len(sval) > 0 ) : many = int(sval)

# Sanity check
if len(prev_ranks) < 1 :
    print("Nothing to page rank. Check data.")
    quit()

# Let's do Page Rank in memory so it is really fast
for i in range(many):
    # print prev_ranks.items()[:5]
    next_ranks = dict()
    total = 0.0
    for (node, old_rank) in list(prev_ranks.items()):
        total = total + old_rank
        next_ranks[node] = 0.0
    # print total

    # Find the number of outbound links and send the page rank down each
    for (node, old_rank) in list(prev_ranks.items()):
        # print node, old_rank
        give_ids = list()
        for (from_id, to_id) in links:
            if from_id != node : continue
            # print '   ', from_id, to_id
            if to_id not in to_ids : continue
            give_ids.append(to_id)
        if ( len(give_ids) < 1 ) : continue
        amount = old_rank / len(give_ids)
        # print node, old_rank, amount, give_ids

        for id in give_ids:
            next_ranks[id] = next_ranks[id] + amount

    newtot = 0
    for (node, next_rank) in list(next_ranks.items()):
        newtot = newtot + next_rank
    evap = (total - newtot) / len(next_ranks)
    # print newtot, evap

    for node in next_ranks:
        next_ranks[node] = next_ranks[node] + evap

    newtot = 0
    for (node, next_rank) in list(next_ranks.items()):
        newtot = newtot + next_rank

    # Compute the per-page average change from old rank to new rank
    # as an indication of convergence of the algorithm
    totdiff = 0
    for (node, old_rank) in list(prev_ranks.items()):
        new_rank = next_ranks[node]
        diff = abs(old_rank - new_rank)
        totdiff = totdiff + diff

    avediff = totdiff / len(prev_ranks)
    print(i+1, avediff)

    # rotate
    prev_ranks = next_ranks

# Put the final ranks back into the database
print(list(next_ranks.items())[:5])
cur.execute('''UPDATE Pages SET old_rank=new_rank''')
for (id, new_rank) in list(next_ranks.items()) :
    cur.execute('''UPDATE Pages SET new_rank=? WHERE id=?''', (new_rank, id))
conn.commit()
cur.close()
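Each iteration of sprank.py splits a page's current rank evenly across its outbound links, then spreads any rank that had nowhere to go (the "evaporation" term) evenly over all pages so the total rank stays constant. A one-iteration sketch on a made-up three-node graph (illustrative only, not part of the course files):

```python
# One PageRank iteration on a toy graph, mirroring sprank.py's update rule
links = [(1, 2), (1, 3), (2, 3), (3, 1)]     # directed edges from_id -> to_id
prev_ranks = {1: 1.0, 2: 1.0, 3: 1.0}

next_ranks = {node: 0.0 for node in prev_ranks}
for node, old_rank in prev_ranks.items():
    outbound = [to_id for (from_id, to_id) in links if from_id == node]
    if not outbound:
        continue
    for to_id in outbound:
        next_ranks[to_id] += old_rank / len(outbound)

# Redistribute the "evaporated" rank so the total stays constant
evap = (sum(prev_ranks.values()) - sum(next_ranks.values())) / len(next_ranks)
next_ranks = {node: rank + evap for node, rank in next_ranks.items()}
print(next_ranks)   # {1: 1.0, 2: 0.5, 3: 1.5}
```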
spdump.py
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

cur.execute('''SELECT COUNT(from_id) AS inbound, old_rank, new_rank, id, url
    FROM Pages JOIN Links ON Pages.id = Links.to_id
    WHERE html IS NOT NULL
    GROUP BY id ORDER BY inbound DESC''')

count = 0
for row in cur :
    if count < 50 : print(row)
    count = count + 1
print(count, 'rows.')
cur.close()
spjson.py
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

print("Creating JSON output on spider.js...")
howmany = int(input("How many nodes? "))

cur.execute('''SELECT COUNT(from_id) AS inbound, old_rank, new_rank, id, url
    FROM Pages JOIN Links ON Pages.id = Links.to_id
    WHERE html IS NOT NULL AND ERROR IS NULL
    GROUP BY id ORDER BY id,inbound''')

fhand = open('spider.js', 'w')
nodes = list()
maxrank = None
minrank = None
for row in cur :
    nodes.append(row)
    rank = row[2]
    if maxrank is None or maxrank < rank : maxrank = rank
    if minrank is None or minrank > rank : minrank = rank
    if len(nodes) > howmany : break

if maxrank == minrank or maxrank is None or minrank is None:
    print("Error - please run sprank.py to compute page rank")
    quit()

fhand.write('spiderJson = {"nodes":[\n')
count = 0
map = dict()
ranks = dict()
for row in nodes :
    if count > 0 : fhand.write(',\n')
    # print row
    rank = row[2]
    rank = 19 * ( (rank - minrank) / (maxrank - minrank) )
    fhand.write('{'+'"weight":'+str(row[0])+',"rank":'+str(rank)+',')
    fhand.write(' "id":'+str(row[3])+', "url":"'+row[4]+'"}')
    map[row[3]] = count
    ranks[row[3]] = rank
    count = count + 1
fhand.write('],\n')

cur.execute('''SELECT DISTINCT from_id, to_id FROM Links''')
fhand.write('"links":[\n')

count = 0
for row in cur :
    # print row
    if row[0] not in map or row[1] not in map : continue
    if count > 0 : fhand.write(',\n')
    rank = ranks[row[0]]
    srank = 19 * ( (rank - minrank) / (maxrank - minrank) )
    fhand.write('{"source":'+str(map[row[0]])+',"target":'+str(map[row[1]])+',"value":3}')
    count = count + 1
fhand.write(']};')
fhand.close()
cur.close()
print("Open force.html in a browser to view the visualization")