A Simple Crawler Architecture

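Crawler scheduler: starts, stops, and monitors the crawler
Crawler program:
- URL manager: tracks the URLs that have already been crawled and the URLs still waiting to be crawled, and hands one pending URL at a time to the page downloader
- Page downloader: downloads the page at the given URL, stores it as a string, and passes that string to the page parser
- Page parser: extracts the valuable information and data from the string, collects the URLs of other pages found in it, and sends those URLs to the URL manager
- These three modules form a loop
Output: the valuable data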
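
The loop formed by the three modules can be sketched roughly as follows. This is a minimal Python 3 illustration, not a finished crawler: `crawl`, `FAKE_SITE`, `fake_download`, and `fake_parse` are hypothetical names, and the "site" is an in-memory dict so the loop runs without network access.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=10):
    """Run the manager -> downloader -> parser loop and collect parsed data."""
    pending = deque([root_url])   # URLs waiting to be crawled
    seen = {root_url}             # every URL ever queued, to avoid repeats
    results = []

    while pending and len(results) < max_pages:
        url = pending.popleft()          # URL manager hands out one URL
        html = download(url)             # downloader returns the page as a string
        if html is None:                 # skip pages that could not be fetched
            continue
        data, links = parse(url, html)   # parser returns data plus new URLs
        results.append(data)
        for link in links:               # new URLs go back to the manager
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return results


# Hypothetical in-memory "site": url -> (valuable data, outgoing links).
FAKE_SITE = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a"]),
    "c": ("page C", []),
}


def fake_download(url):
    # Stand-in downloader: the "page content" is just the URL itself.
    return url if url in FAKE_SITE else None


def fake_parse(url, html):
    # Stand-in parser: look up the prepared data and links.
    return FAKE_SITE[url]


collected = crawl("a", fake_download, fake_parse)
print(collected)
```

Passing `download` and `parse` in as functions keeps the loop itself independent of how pages are fetched or parsed, so a real downloader and parser can be swapped in later.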

URL Manager
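
One simple way to keep track of crawled and pending URLs is with two in-memory sets. The sketch below assumes that approach; the class and method names (`UrlManager`, `add_new_url`, `has_new_url`, `get_new_url`) are illustrative choices, and a larger crawler might keep the same state in a database or cache instead.

```python
class UrlManager:
    """Tracks pending and already-crawled URLs using two sets."""

    def __init__(self):
        self.new_urls = set()   # waiting to be crawled
        self.old_urls = set()   # already handed out for crawling

    def add_new_url(self, url):
        # Ignore empty values and anything already known.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://www.baidu.com", "http://www.baidu.com"])
print(manager.has_new_url())
```

Because sets are unordered, `get_new_url` returns pending URLs in no particular order; a queue would be needed for strict breadth-first crawling.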

Page Downloader

Method 1: request the URL directly

import urllib2
# Request the page directly
response = urllib2.urlopen('http://www.baidu.com')
# Get the HTTP status code
print response.getcode()
# Read the page content
cont = response.read()

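Method 2: add data and HTTP headers with a Request object

import urllib
import urllib2

# Example target; the original snippet left url undefined
url = 'http://www.baidu.com'
# Create a Request object
request = urllib2.Request(url)
# Attach form data: add_data takes a single encoded string and returns None
request.add_data(urllib.urlencode({'a': '1'}))
# Add an HTTP header: add_header is a method of the request and returns None
request.add_header('User-Agent', 'Mozilla/5.0')
# Send the request and get the response
response = urllib2.urlopen(request)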
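
The snippet above targets Python 2. In Python 3, `urllib2` was split into `urllib.request` and `urllib.error`, so a rough equivalent might look like the sketch below. The helper name `build_request` is hypothetical, and the request is only built here, not sent, so the example runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params, user_agent):
    """Build a request carrying form data and a custom User-Agent header."""
    # The body must be bytes in Python 3; attaching one makes this a POST.
    data = urlencode(params).encode("utf-8")
    return Request(url, data=data, headers={"User-Agent": user_agent})


request = build_request("http://www.baidu.com", {"a": "1"}, "Mozilla/5.0")
print(request.get_method())
# To actually send it: urllib.request.urlopen(request)
```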

Method 3: add a handler for a special scenario (cookies)

import urllib2,cookielib
# Create a cookie container
cj = cookielib.CookieJar()
# Create an opener that handles cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Install the opener into urllib2
urllib2.install_opener(opener)
# Access the page with the cookie-aware urllib2
response = urllib2.urlopen("http://www.baidu.com")
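
For Python 3, `cookielib` became `http.cookiejar`. A minimal sketch of the same idea follows; it assumes you call the opener directly rather than installing it globally, and it stops short of sending a request so it can run without a network connection.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Container that will hold cookies received from the server.
cookie_jar = CookieJar()
# Opener whose handler chain stores and resends those cookies.
opener = build_opener(HTTPCookieProcessor(cookie_jar))

# To fetch a page with cookie support:
#     response = opener.open("http://www.baidu.com")
# After a real request, iterating over cookie_jar lists the stored cookies.
print(len(cookie_jar))
```

Calling `opener.open(...)` keeps the cookie handling local to this opener, whereas `install_opener` changes the behavior of every later `urlopen` call in the process.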

Page Parser
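
To make the parser's job concrete (pull out data and collect the URLs of other pages), here is a dependency-free Python 3 sketch using the standard library's `html.parser`. It is a stand-in for illustration only, not the BeautifulSoup API; `LinkExtractor`, `extract_links`, and the `example.com` URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = '<p><a href="/item/1">one</a> <a href="http://example.com/x">two</a></p>'
links = extract_links("http://example.com/index.html", sample)
print(links)
```

The resulting list is what would be handed back to the URL manager.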

Install beautifulsoup4: pip install beautifulsoup4
Create a BeautifulSoup object

from

逆风起笔
