{
"title": "selenium爬虫",
"tags": [
"post",
"spider",
"python"
],
"sources": [
"xlog"
],
"external_urls": [
"https://ming5ming.xlog.app/selenium-pa-chong"
],
"date_published": "2023-01-05T17:24:31.580Z",
"content": "### selenium是什么?\n>Selenium 是支持 web 浏览器自动化的一系列工具和库的综合项目。\n<br>它提供了扩展来模拟用户与浏览器的交互,用于扩展浏览器分配的分发服务器, 以及用于实现 [W3C WebDriver](https://www.w3.org/TR/webdriver/) 规范 的基础结构, 该 规范 允许您为所有主要 Web 浏览器编写可互换的代码。\n\npython中的selenium库是selenium的接口, 它可以模拟浏览器像人一样操作页面, 获取网页信息.\n\n<br>基于这个特点, selenium在爬取某些网站信息时代码逻辑更简单, **且不用逆向js加密代码**.\n<br>然而, 因为是模拟操作, 所以爬取效率比不上其他爬虫.\n\n为了展现selenium的强大, 我们举个例子:\n<br>爬取bilibili个人页面的 粉丝名字 及 粉丝的粉丝数\n\n<br>**注意**:爬取数据时要注意网站的 robots.txt 内的规定, 同时不要有太高的爬取频率, 以免对网站产生负担. 本文爬取的 粉丝名字 与 粉丝的粉丝数 属于公开内容.\n\n### 安装\n```shell\n$ pip install selenium\n```\n### 分析网站\n在个人空间的粉丝页面下, 粉丝信息位于`<ul>`下的`<li>`元素内.\n\n![粉丝页面](ipfs://bafybeifapwtslfayl6fwtrh3wwkrufbihdhtfulk4ddnqiotbsjd2lijwq)\n然而在`<li>`内没有粉丝的粉丝数这一信息, 而是在服务器上, 没有储存在本地. 它是通过鼠标移动到粉丝头像或者名字后, 触发js传输到本地的. 这个操作是由AJAX技术实现的. \n>[AJAX](https://developer.mozilla.org/zh-CN/docs/Web/Guide/AJAX)(Asynchronous JavaScript and XML) 是一种在无需重新加载整个网页的情况下,能够更新部分网页的技术.\n\n在鼠标移动到头像上时, 会在`<body>`末端生成一个`<div id=\"id-card\">`:\n\n![id-card](ipfs://bafkreiepcc7pmbixp3yy6wzmqyf2v3rmswecqu6o6iapfytb4qz6m6fkne)\n<br>粉丝的粉丝数位于`<div id=\"id-card\">`下的`<span class=\"idc-meta-item\">`下:\n\n![fansNum](ipfs://bafkreiaqsjkzzh6bajzgssuxsddntfgkd4igejazi2ke2tsgjqkvuazpaq)\n\n### 匹配方法\nselenium内匹配元素有很多方法:\n- xpath(最常用)\n- by id\n- by name/tag name/class name\n- by link\n- by css selector\n\nxpath之所以好用是因为xpath可以使用相对路径匹配, 并且语法简单.\n比如匹配粉丝头像可以这样写:\n```xpath\n//div[@id=\"id-card\"]\n```\n而在XML下该元素的位置:\n```XML\n<html>\n ...\n <body>\n ...\n <div id = \"id-card\">\n ...\n </div>\n </body>\n\n</html>\n```\n\n当然css selector有时也很好用:<br>\nXML:\n```XML\n<html>\n <body>\n <p class=\"content\">Site content goes here.</p>\n</body>\n<html>\n```\ncss selector:\n```css selector\np.content\n```\n### 写爬虫\n![mermaid-diagram-2023-01-06-041353.png](ipfs://bafkreidsamwfjlvrx7vi65vda77lmtgjsdsq54ora6ypoxjolgt6gng3ku)\n<br>\n初始化:\n```python\ndef initDriver(url):\n#设置headless浏览器\n options = webdriver.ChromeOptions()\n options.add_argument('headless')\n 
options.add_experimental_option('excludeSwitches', ['enable-logging'])\n\n    # initialize the driver and the action chains\n    driver = webdriver.Chrome(options=options)\n    actions = ActionChains(driver)\n\n    # open the page\n    driver.get(url)\n    driver.implicitly_wait(10)\n\n    return driver, actions\n```\nGetting the page count:\n```python\ndef getPageNum(driver):\n    # locate the pager at the bottom of the page via XPath and read the total page count\n    text = (driver.find_element(\"xpath\", '//ul[@class=\"be-pager\"]/span[@class=\"be-pager-total\"]')\n            .get_attribute(\"textContent\")\n            .split(' '))\n    return text[1]\n```\nIterating over every page:\n```python\ndef spawnCards(page, driver, actions):\n    # visit every page of followers\n    for i in range(1, int(page) + 1):\n        print(f\"get data in page {i}\\n\")\n        # trigger the AJAX that spawns the cards\n        spawn(driver, actions)\n        if (i != int(page)):\n            # turn to the next page\n            goNextPage(driver, actions)\n            time.sleep(6)\n```\nSpawning the cards:\n```python\ndef spawn(driver, actions):\n    # get the list of follower <li> elements\n    ulList = driver.find_elements(\"xpath\", '//ul[@class=\"relation-list\"]/li')\n    # hover over each entry to spawn its card\n    for li in ulList:\n        getCard(li, actions)\n        time.sleep(2)\n```\n```python\ndef getCard(li, actions):\n    # move the mouse onto the avatar to trigger the id-card popup\n    cover = li.find_element(\"xpath\", './/a[@class=\"cover\"]')\n    actions.move_to_element(cover)\n    actions.perform()\n    actions.reset_actions()\n```\n\nExtracting and saving the data:\n```python\ndef writeData(driver):\n    # get the list of generated cards\n    cardList = driver.find_elements(\"xpath\", '//div[@id=\"id-card\"]')\n    for card in cardList:\n        up_name = card.find_element(\"xpath\", './/img[@class=\"idc-avatar\"]').get_attribute(\"alt\")\n        up_fansNum = card.find_elements('css selector', 'span.idc-meta-item')[1].get_attribute(\"textContent\")\n        print(f'name:{up_name}, {up_fansNum}')\n        # append the record to a csv file\n        with open('.\\\\date.csv', mode='a', newline='', encoding='utf-8') as f:\n            writer = csv.writer(f)\n            writer.writerow([up_name, up_fansNum])\n```\n\nThe full code:\n```python\nfrom selenium import webdriver\nfrom selenium.webdriver.common.action_chains import ActionChains\nimport time\nimport csv\n\ndef initDriver(url):\n    options = webdriver.ChromeOptions()\n    options.add_argument('--headless')\n    
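# Optional (assumption, not in the original): a fixed window size can make\n    # hover positions more predictable in headless mode:\n    # options.add_argument('--window-size=1920,1080')\n    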
options.add_experimental_option('excludeSwitches', ['enable-logging'])\n    driver = webdriver.Chrome(options=options)\n    actions = ActionChains(driver)\n    driver.get(url)\n    driver.implicitly_wait(10)\n    return driver, actions\n\ndef getPageNum(driver):\n    text = driver.find_element(\"xpath\", '//ul[@class=\"be-pager\"]/span[@class=\"be-pager-total\"]').get_attribute(\"textContent\").split(' ')\n    return text[1]\n\ndef goNextPage(driver, actions):\n    button = driver.find_element(\"xpath\", '//li[@class=\"be-pager-next\"]/a')\n    actions.click(button)\n    actions.perform()\n    actions.reset_actions()\n\ndef getCard(li, actions):\n    cover = li.find_element(\"xpath\", './/a[@class=\"cover\"]')\n    actions.move_to_element(cover)\n    actions.perform()\n    actions.reset_actions()\n\ndef writeData(driver):\n    # get the card list\n    cardList = driver.find_elements(\"xpath\", '//div[@id=\"id-card\"]')\n    for card in cardList:\n        up_name = card.find_element(\"xpath\", './/img[@class=\"idc-avatar\"]').get_attribute(\"alt\")\n        up_fansNum = card.find_elements('css selector', 'span.idc-meta-item')[1].get_attribute(\"textContent\")\n        print(f'name:{up_name}, {up_fansNum}')\n        # write the record into the csv file\n        with open('.\\\\date.csv', mode='a', newline='', encoding='utf-8') as f:\n            writer = csv.writer(f)\n            writer.writerow([up_name, up_fansNum])\n\ndef spawn(driver, actions):\n    # get the follower list\n    ulList = driver.find_elements(\"xpath\", '//ul[@class=\"relation-list\"]/li')\n    # spawn each card\n    for li in ulList:\n        getCard(li, actions)\n        time.sleep(2)\n\ndef spawnCards(page, driver, actions):\n    for i in range(1, int(page) + 1):\n        print(f\"get data in page {i}\\n\")\n        spawn(driver, actions)\n        if (i != int(page)):\n            goNextPage(driver, actions)\n            time.sleep(6)\n\ndef main():\n    # init the driver\n    uid = input(\"bilibili uid:\")\n    url = \"https://space.bilibili.com/\" + uid + \"/fans/fans\"\n    driver, actions = initDriver(url)\n    page = getPageNum(driver)\n\n    # spawn the card info (AJAX)\n    spawnCards(page, driver, actions)\n    
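# Sketch (not in the original): an explicit wait could replace the fixed\n    # sleeps above, polling until at least one card exists or a timeout expires:\n    # from selenium.webdriver.support.ui import WebDriverWait\n    # from selenium.webdriver.support import expected_conditions as EC\n    # WebDriverWait(driver, 10).until(\n    #     EC.presence_of_element_located((\"xpath\", '//div[@id=\"id-card\"]')))\n    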
writeData(driver)\n\n    driver.quit()\n\nif __name__ == \"__main__\":\n    main()\n```\n### Results\n\n![result](ipfs://bafkreidu3mwjxbt5dp4dgwslyg6leeylta3kzbfz2nkznjkvs7phy3i5za)\n\n### Reflections\nThings that could be improved:\n- Because AJAX loads content asynchronously, elements can only be located after the page has finished loading. Waiting with `time.sleep()` is neither efficient nor elegant; `WebDriverWait()` solves this by polling the page until an expected condition is met (or a timeout expires).\n- The XPath expressions repeat the same path prefixes many times, which makes matching more expensive than it needs to be.\n- The data extraction could perfectly well be done concurrently to get results faster, but out of consideration for the server load only a single-threaded version was written.\n\n### References\n[selenium doc](https://selenium-python-zh.readthedocs.io/en/latest/installation.html)\n<br>Recommended reading:<br>\n[Ajax](https://developer.mozilla.org/zh-CN/docs/Web/Guide/AJAX)<br>\n[Xpath](https://developer.mozilla.org/zh-CN/docs/Web/XPath)",
"attributes": [
{
"value": "selenium-pa-chong",
"trait_type": "xlog_slug"
}
]
}