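Below is a summary of what I learned while scraping second-hand housing listings for the Baoshan district from the Lianjia Shanghai site.
Preparation
Python modules used: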
- requests
- bs4
- pymongo
- datetime
- time
- random
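Analyzing the page
Open http://sh.lianjia.com/ershoufang/baoshan in Chrome and bring up the developer tools.
Each listing sits inside an li element. Next, let's look at the pagination links.
Click through to the next page and you will see that the URL in the address bar follows a regular pattern: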
http://sh.lianjia.com/ershoufang/baoshan/d1
http://sh.lianjia.com/ershoufang/baoshan/d2
http://sh.lianjia.com/ershoufang/baoshan/d3
………
http://sh.lianjia.com/ershoufang/baoshan/d100
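The pattern above is easy to capture in a small helper. This is only a minimal sketch: the names `BASE_URL` and `build_page_url` are my own, and it assumes page N always lives at `.../dN`, as the URLs listed above suggest.

```python
BASE_URL = "http://sh.lianjia.com/ershoufang/baoshan"  # listing root used in this post


def build_page_url(page):
    """Return the listing URL for a 1-based page number, e.g. .../d3 for page 3."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{BASE_URL}/d{page}"


# Print the URLs for the first three pages.
for page in range(1, 4):
    print(build_page_url(page))
```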
Now let's try crawling the links for the first 10 pages.
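A rough sketch of this step with requests might look like the following. The helper names (`first_page_urls`, `fetch_pages`, `main`) and the User-Agent value are assumptions of mine, and the site's layout or anti-crawling rules may differ from what this assumes.

```python
def first_page_urls(count=10):
    """Build the URLs for pages 1..count using the /dN pattern."""
    base = "http://sh.lianjia.com/ershoufang/baoshan"
    return [f"{base}/d{page}" for page in range(1, count + 1)]


def fetch_pages(session, urls, timeout=10):
    """Download each URL with session.get and return a list of (url, html) pairs."""
    pages = []
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # stop early on 4xx/5xx responses
        pages.append((url, response.text))
    return pages


def main():
    import requests  # third-party; imported here so the helpers above stay dependency-free

    with requests.Session() as session:
        # Assumed header value; some sites reject the default requests User-Agent.
        session.headers["User-Agent"] = "Mozilla/5.0"
        for url, html in fetch_pages(session, first_page_urls(10)):
            print(url, len(html))
```

Calling `main()` performs the real requests. `fetch_pages` accepts any object with a compatible `get` method, so it can be tried out without touching the network.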
Crawl results
Parsing the page
The fields to extract are:
- Title: room_title = room.find('div', attrs={'class': 'prop-title'})
- House details: room_info = room.find('span', attrs={'class': 'info-col row1-text'})
- Location: room_location = room.find('span', attrs={'class': 'info-col row2-text'})
- Extra info: extra_info = room.find('div', attrs={'class': 'property-tag-container'})
- Total price: room_price = room.find('span', attrs={'class': 'total-price strong-num'})
- Unit price: room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'})
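Putting those lookups together, a parsing function could be sketched roughly as below. The HTML fragment is made up to mimic the class names in the list above (the real page markup is certainly richer), and `text_of`, `parse_rooms`, and the dictionary keys are names I chose for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment that only mimics the class names used in the lookups above.
SAMPLE_HTML = """
<ul>
  <li>
    <div class="prop-title"><a href="#">Sample listing title</a></div>
    <span class="info-col row1-text">2 rooms | 80 sqm</span>
    <span class="info-col row2-text">Baoshan</span>
    <div class="property-tag-container"><span>Near subway</span></div>
    <span class="total-price strong-num">300</span>
    <span class="info-col price-item minor">37500</span>
  </li>
</ul>
"""


def text_of(tag):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_rooms(html):
    """Extract one dict per listing <li> that contains a prop-title element."""
    soup = BeautifulSoup(html, "html.parser")
    rooms = []
    for room in soup.find_all("li"):
        title = room.find("div", attrs={"class": "prop-title"})
        if title is None:
            continue  # not a listing item (e.g. a navigation <li>)
        rooms.append({
            "title": text_of(title),
            "info": text_of(room.find("span", attrs={"class": "info-col row1-text"})),
            "location": text_of(room.find("span", attrs={"class": "info-col row2-text"})),
            "extra": text_of(room.find("div", attrs={"class": "property-tag-container"})),
            "total_price": text_of(room.find("span", attrs={"class": "total-price strong-num"})),
            "unit_price": text_of(room.find("span", attrs={"class": "info-col price-item minor"})),
        })
    return rooms


for room in parse_rooms(SAMPLE_HTML):
    print(room)
```

Returning `None` for a missing tag keeps one incomplete listing from raising an `AttributeError` and aborting the whole page.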
Saving to MongoDB
MongoDB stores data as documents made up of {key: value} pairs, which is somewhat similar to JSON.
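To make the comparison concrete, here is what one listing might look like as a Python dict, which is the form pymongo accepts as a document. The field names and values are invented for illustration. Note that MongoDB actually stores BSON, which supports some types (dates, for example) that plain JSON lacks.

```python
import json

# A hypothetical listing document; keys and values are illustrative only.
room = {
    "title": "Sample listing title",
    "location": "Baoshan",
    "total_price": 300,
    "tags": ["Near subway"],    # a value can be a list ...
    "price": {"unit": 37500},   # ... or a nested document
}

# The same structure round-trips through JSON unchanged.
as_json = json.dumps(room, ensure_ascii=False, indent=2)
print(as_json)
```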
Run the code and we can see the data being written to MongoDB:
<pymongo.results.InsertManyResult object at 0x00000260C536AB8>
<pymongo.results.InsertManyResult object at 0x00000260C536AAC>
<pymongo.results.InsertManyResult object at 0x00000260C536AA0>
<pymongo.results.InsertManyResult object at 0x00000260C536AB4>
<pymongo.results.InsertManyResult object at 0x00000260C536AB0>
<pymongo.results.InsertManyResult object at 0x00000260C536A28>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536888>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A48>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
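Each line above is the return value of one `insert_many` call, printed as-is, so it only shows the object's address. A sketch of the storing step that reports something more useful could look like this. The connection string, the database and collection names, and the helper `save_rooms` are all assumptions of mine.

```python
def save_rooms(collection, rooms):
    """Insert a batch of listing dicts and return how many documents were written."""
    if not rooms:
        return 0  # insert_many raises an error on an empty list, so skip the call
    result = collection.insert_many(rooms)
    return len(result.inserted_ids)  # one generated _id per inserted document


def main():
    from pymongo import MongoClient  # third-party driver; needs a running MongoDB server

    client = MongoClient("mongodb://localhost:27017")  # assumed local server
    collection = client["lianjia"]["baoshan"]  # hypothetical database and collection names
    count = save_rooms(collection, [{"title": "Sample listing title", "total_price": "300"}])
    print(f"inserted {count} documents")
```

Calling `main()` requires a reachable MongoDB instance. `save_rooms` itself only relies on the `insert_many` method and the `inserted_ids` attribute of its result.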
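You can download a MongoDB GUI tool to browse the data; I use Robo3T, and it shows that the records were stored as expected.
There are 100 pages of data in total. I use time.sleep() to throttle the requests so that I don't get banned, but this makes the crawl very slow. I plan to start learning pandas over the next couple of days.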
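For reference, the throttling idea can be sketched with time and random as below. The delay bounds are arbitrary values I picked, not numbers from the site, and `polite_sleep` and `estimate_wait` are hypothetical helper names.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random interval between the two bounds and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def estimate_wait(pages, min_seconds=1.0, max_seconds=3.0):
    """Expected total waiting time in seconds: pages times the average delay."""
    return pages * (min_seconds + max_seconds) / 2


# With the assumed 1-3 second delay, 100 pages spend roughly this long just sleeping.
print(estimate_wait(100))
```

Call `polite_sleep()` once after each page request. The estimate makes the trade-off visible: the pauses account for most of the running time, which is why the crawl feels slow.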