
requests-html: kennethreitz's web page parsing library, billed as designed for humans

Time: 2023-09-04 22:18:01 | Source: Website Operations


kennethreitz, the author of the requests library, has designed a new library: requests-html. At the time of writing it has 9,195 stars.

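The requests library was billed as an HTTP request library for humans, and requests-html is billed as an HTML parsing library for humans. I trust kennethreitz's skill; he does not make empty boasts. After reading through the new library's documentation, I found it genuinely impressive. Beginners learning web scraping may no longer need requests + bs4 as the standard starter kit, because requests-html alone can handle the whole job.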

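I looked up requests-html on Google Trends and found that the earliest searches date to January 2018. That puts me (大邓) more than a year behind the latest developments in the scraping community; I found out far too late. If you come across good new tools, please leave a comment or send me a message.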

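What makes requests-html powerful:

Strong page parsing capabilities on top of what requests offers
Full JavaScript support
Element lookup with CSS selectors (jQuery style, similar to how the pyquery library is used) and XPath selectors
Requests mimic a browser (mocked User-Agent)
For static pages, built-in automatic pagination, which spares you the chore of constructing URLs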

Installation

The documentation says only Python 3.6 is currently supported, but in my own testing it also installs and works on Python 3.7.

pip install requests-html

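Smart pagination (needs improvement)

This is the most eye-catching feature I found. It still has problems in practice, but I want to cover it first anyway. Before writing a scraper for static pages, we usually have to work out the URL pattern, for example:

Page 1   https://book.douban.com/tag/小说
Page 2   https://book.douban.com/tag/小说?start=20&type=T
Page 3   https://book.douban.com/tag/小说?start=40&type=T
Page 4   https://book.douban.com/tag/小说?start=60&type=T

To send these requests in bulk, the code has to look like this: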

from bs4 import BeautifulSoup
import requests

# Each page holds 20 items, so page i starts at offset i * 20
base = 'https://book.douban.com/tag/小说?start={page}&type=T'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

for i in range(100):
    url = base.format(page=i*20)
    resp = requests.get(url, headers=headers)
    bsObj = BeautifulSoup(resp.text, 'html.parser')

With requests-html, all that is needed is:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://book.douban.com/tag/小说')
for html in r.html:
    print(html)

<HTML url='https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'>

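In practice, however, this approach did not work for me, and kennethreitz also notes in the documentation:
There's also intelligent pagination support (always improving)

"Always improving" suggests that the library's smart pagination is not yet satisfactory and still needs work. The idea behind the feature is excellent, though, and I look forward to an update that makes it usable.
Please do not let hope turn into disappointment: there is plenty more good material below.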
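
Until the built-in pagination is reliable, the URL pattern listed earlier can still be generated by hand. The following is a minimal sketch and not part of requests-html: `page_urls` is a hypothetical helper name, the page size of 20 and the `start`/`type` parameters are taken from the douban URLs above, and it assumes the first page also answers to `start=0`.

```python
from urllib.parse import quote, urlencode


def page_urls(tag, pages, page_size=20):
    """Build the listing URLs for the first `pages` pages of a douban book tag.

    Hypothetical helper: the start/type parameters mirror the URL pattern
    observed above; start=0 is assumed to be equivalent to the bare URL.
    """
    # Percent-encode the tag so non-ASCII names such as 小说 are URL-safe.
    base = 'https://book.douban.com/tag/' + quote(tag)
    return [
        base + '?' + urlencode({'start': i * page_size, 'type': 'T'})
        for i in range(pages)
    ]


for url in page_urls('小说', 3):
    print(url)
```

Each generated URL can then be passed to `session.get` in the same way as the single-page example.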

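A plain GET request

We send a GET request to the official site of the Python programming language, https://python.org/, and get back a Response object.

This Response object offers much the same methods as the one in the requests library. Let's look at the common ones.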

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org/')
r
Run and output!

<Response [200]>

Get the response's HTML as text

r.text[:50]
Run and output!

'<!doctype html>\n<!--[if lt IE 7]> <html class="n'

Get the response's HTML as raw bytes

r.content[:50]

Run and output!

b'<!doctype html>\n<!--[if lt IE 7]> <html class="n'
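
The relationship between the two attributes can be sketched without a network call: `content` holds the raw bytes, and `text` is those bytes decoded to `str`. The snippet below uses a hard-coded byte string as a stand-in for `r.content` and assumes UTF-8. On a real response, the encoding that requests settled on is available as `r.encoding`.

```python
# Stand-in for r.content: the first bytes of an HTML page (made-up sample).
raw = b'<!doctype html>\n<html lang="en">'

# Decoding the bytes gives the equivalent of r.text (UTF-8 assumed here).
text = raw.decode('utf-8')

# One is bytes, the other is str; the characters themselves are the same.
print(type(raw).__name__, type(text).__name__)
print(text.splitlines()[0])
```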

Convert the response into an HTML object, which makes parsing and locating elements easier.

r.html

Run and output!

<HTML url='https://www.python.org/'>

Methods of the HTML object

# A mix of absolute and relative URLs
print(len(r.html.links))
list(r.html.links)[:5]

Run and output!

119
['/success-stories/category/arts/', 'https://kivy.org/', 'https://www.python.org/psf/codeofconduct/', 'http://www.scipy.org', 'https://docs.python.org/3/license.html']

htmlObj.absolute_links also converts relative paths into absolute ones

print(len(r.html.absolute_links))
list(r.html.absolute_links)[:5]

Run and output!

119
['https://kivy.org/', 'https://www.python.org/psf/codeofconduct/', 'http://www.scipy.org', 'https://jobs.python.org', 'https://docs.python.org/3/license.html']

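Notes
Relative URL: //docs.python.org/3/tutorial/
Absolute URL: http://docs.python.org/3/tutorial/

HTML.links returns the URLs as they appear in the page
HTML.absolute_links returns them as absolute URLs

Both attributes return 119 URLs, so absolute_links simply fills in the relative paths found in links to turn them into absolute URLs.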
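
The effect of that conversion can be reproduced with the standard library alone. This is a minimal sketch of the idea, not the library's actual implementation: `to_absolute` is a hypothetical helper, and the three sample links are chosen to cover a relative path, a scheme-less URL, and an already absolute URL.

```python
from urllib.parse import urljoin


def to_absolute(page_url, links):
    """Resolve every link against the URL of the page it was found on."""
    return {urljoin(page_url, link) for link in links}


links = {
    '/success-stories/category/arts/',  # relative path
    '//docs.python.org/3/tutorial/',    # no scheme: inherits the page's scheme
    'https://kivy.org/',                # already absolute: left unchanged
}

for link in sorted(to_absolute('https://www.python.org/', links)):
    print(link)
```

Because each input maps to exactly one output, the number of URLs stays the same, which matches the identical counts observed above.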

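JavaScript support

requests-html supports JavaScript. Take the site https://pythonclock.org/, which shows a countdown clock. The clock is generated by JavaScript embedded in the page, so an ordinary HTML parsing library cannot see this data.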

from requests_html import HTMLSession

session = HTMLSession()
r2 = session.get('https://pythonclock.org/')
r2.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]

Run and output!

'</h1>\n        </div>\n        <div class="python-27-clock"></div>\n        <div class="center">\n            <div class="guido-button-block">\n                <button class="js-guido-mode guido-button">'

requests-html has a render method that uses Chromium to render the JavaScript. The first time you call it, it downloads Chromium, which takes roughly a few minutes.

r2.html.render()
# Repeat the earlier search so the output can be compared before and after rendering
r2.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]

Run and output!

'</h1>\n        </div>\n        <div class="python-27-clock is-countdown"><span class="countdown-row countdown-show6"><span class="countdown-section"><span class="countdown-amount">1</span><span class="countdown-period">Year</span></span><span class="countdown-section"><span class="countdown-amount">2</span><span class="countdown-period">Months</span></span><span class="countdown-section"><span class="countdown-amount">28</span><span class="countdown-period">Days</span></span><span class="countdown-section"><span class="countdown-amount">16</span><span class="countdown-period">Hours</span></span><span class="countdown-section"><span class="countdown-amount">52</span><span class="countdown-period">Minutes</span></span><span class="countdown-section"><span class="countdown-amount">46</span><span class="countdown-period">Seconds</span></span></span></div>\n        <div class="center">\n            <div class="guido-button-block">\n                <button class="js-guido-mode guido-button">'

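The result above now contains the countdown data, which can be extracted like this:

# Use r2 (the rendered pythonclock.org response), not r
periods = [element.text for element in r2.html.find('.countdown-period')]
amounts = [element.text for element in r2.html.find('.countdown-amount')]
countdown_data = dict(zip(periods, amounts))
countdown_data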

Run and output!

{'Year': '1', 'Months': '2', 'Days': '5', 'Hours': '23', 'Minutes': '34', 'Seconds': '37'}
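
The render-then-extract steps can be bundled into one helper. This is a minimal sketch under stated assumptions: `pair_countdown` and `fetch_countdown` are hypothetical names, the `sleep` and `timeout` values passed to `render` are illustrative guesses rather than recommendations, and the amounts are converted to `int` for convenience. Only the pairing step runs here, since the fetch needs the package, Chromium, and network access.

```python
def pair_countdown(periods, amounts):
    """Pair each period label with its amount, converted to int."""
    return {period: int(amount) for period, amount in zip(periods, amounts)}


def fetch_countdown(url='https://pythonclock.org/', sleep=1, timeout=20):
    """Render the page with Chromium, then read the countdown spans.

    Assumes the page still exposes the .countdown-period and
    .countdown-amount classes seen in the output above.
    """
    # Imported lazily: requires requests-html, Chromium, and network access.
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)
    # Give the page's JavaScript a moment to fill in the countdown.
    r.html.render(sleep=sleep, timeout=timeout)
    periods = [el.text for el in r.html.find('.countdown-period')]
    amounts = [el.text for el in r.html.find('.countdown-amount')]
    return pair_countdown(periods, amounts)


# Offline check of the pairing step with sample values; no network needed.
print(pair_countdown(['Year', 'Months', 'Days'], ['1', '2', '5']))
```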

CSS selectors

Extract elements at a given position from the HTML object.
htmlObj.find('selector', first=False) returns every Element that matches, as a list of Element objects.

r.html.find('#about')

Run and output!

[<Element 'li' aria-haspopup='true' class=('tier-1', 'element-1') id='about'>]

With first set to True, only the first matching Element is returned. The return value is then a single Element rather than a list.

about = r.html.find('#about', first=True)
about

Run and output!

<Element 'li' aria-haspopup='true' class=('tier-1', 'element-1') id='about'>
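
The two return types behave differently when nothing matches. The following is a minimal sketch, assuming requests-html is installed. It parses a hard-coded snippet with the library's `HTML` class instead of fetching a page, and `describe_first`, the sample markup, and the `#missing` id are made up for illustration.

```python
def describe_first(doc, selector):
    """Return the text and attributes of the first match, or None if nothing matches."""
    # Imported lazily: requires requests-html to be installed.
    from requests_html import HTML

    element = HTML(html=doc).find(selector, first=True)
    if element is None:
        # With first=True a failed lookup gives None, not an empty list.
        return None
    return {'text': element.text, 'attrs': dict(element.attrs)}


# Made-up markup loosely modelled on the python.org navigation.
SAMPLE = '<ul><li id="about" class="tier-1 element-1"><a href="/about/">About</a></li></ul>'

# Example calls (run with requests-html installed):
# describe_first(SAMPLE, '#about')
# describe_first(SAMPLE, '#missing')
```

Checking for `None` before touching `.text` or `.attrs` avoids an AttributeError when a selector matches nothing.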

Element object methods

r = session.get('https://github.com/')
htmlObj = r.html
htmlObj.xpath('a', first=True)

Run and output!

<Element 'a' class=('btn', 'ml-2') href='https://help.github.com/articles/supported-browsers'>

For more, see the requests-html documentation: http://html.python-requests.org/
