ming5ming-Note-40675-1

Token ID: 1

Standard: ERC-721

Transfers: 1

Metadata

{
  "title": "selenium爬虫",
  "tags": [
    "post",
    "spider",
    "python"
  ],
  "sources": [
    "xlog"
  ],
  "external_urls": [
    "https://ming5ming.xlog.app/selenium-pa-chong"
  ],
  "date_published": "2023-01-05T17:24:31.580Z",
  "content": "### selenium是什么?\n>Selenium 是支持 web 浏览器自动化的一系列工具和库的综合项目。\n<br>它提供了扩展来模拟用户与浏览器的交互,用于扩展浏览器分配的分发服务器, 以及用于实现 [W3C WebDriver](https://www.w3.org/TR/webdriver/) 规范 的基础结构, 该 规范 允许您为所有主要 Web 浏览器编写可互换的代码。\n\npython中的selenium库是selenium的接口, 它可以模拟浏览器像人一样操作页面, 获取网页信息.\n\n<br>基于这个特点, selenium在爬取某些网站信息时代码逻辑更简单, **且不用逆向js加密代码**.\n<br>然而, 因为是模拟操作, 所以爬取效率比不上其他爬虫.\n\n为了展现selenium的强大, 我们举个例子:\n<br>爬取bilibili个人页面的 粉丝名字 及 粉丝的粉丝数\n\n<br>**注意**:爬取数据时要注意网站的 robots.txt 内的规定, 同时不要有太高的爬取频率, 以免对网站产生负担. 本文爬取的 粉丝名字 与 粉丝的粉丝数 属于公开内容.\n\n### 安装\n```shell\n$ pip install selenium\n```\n### 分析网站\n在个人空间的粉丝页面下, 粉丝信息位于`<ul>`下的`<li>`元素内.\n\n![粉丝页面](ipfs://bafybeifapwtslfayl6fwtrh3wwkrufbihdhtfulk4ddnqiotbsjd2lijwq)\n然而在`<li>`内没有粉丝的粉丝数这一信息, 而是在服务器上, 没有储存在本地. 它是通过鼠标移动到粉丝头像或者名字后, 触发js传输到本地的. 这个操作是由AJAX技术实现的. \n>[AJAX](https://developer.mozilla.org/zh-CN/docs/Web/Guide/AJAX)(Asynchronous JavaScript and XML) 是一种在无需重新加载整个网页的情况下,能够更新部分网页的技术.\n\n在鼠标移动到头像上时, 会在`<body>`末端生成一个`<div id=\"id-card\">`:\n\n![id-card](ipfs://bafkreiepcc7pmbixp3yy6wzmqyf2v3rmswecqu6o6iapfytb4qz6m6fkne)\n<br>粉丝的粉丝数位于`<div id=\"id-card\">`下的`<span class=\"idc-meta-item\">`下:\n\n![fansNum](ipfs://bafkreiaqsjkzzh6bajzgssuxsddntfgkd4igejazi2ke2tsgjqkvuazpaq)\n\n### 匹配方法\nselenium内匹配元素有很多方法:\n- xpath(最常用)\n- by id\n- by name/tag name/class name\n- by link\n- by css selector\n\nxpath之所以好用是因为xpath可以使用相对路径匹配, 并且语法简单.\n比如匹配粉丝头像可以这样写:\n```xpath\n//div[@id=\"id-card\"]\n```\n而在XML下该元素的位置:\n```XML\n<html>\n  ...\n  <body>\n    ...\n    <div id = \"id-card\">\n      ...\n    </div>\n  </body>\n\n</html>\n```\n\n当然css selector有时也很好用:<br>\nXML:\n```XML\n<html>\n <body>\n  <p class=\"content\">Site content goes here.</p>\n</body>\n<html>\n```\ncss selector:\n```css selector\np.content\n```\n### 写爬虫\n![mermaid-diagram-2023-01-06-041353.png](ipfs://bafkreidsamwfjlvrx7vi65vda77lmtgjsdsq54ora6ypoxjolgt6gng3ku)\n<br>\n初始化:\n```python\ndef initDriver(url):\n#设置headless浏览器\n    options = webdriver.ChromeOptions()\n    
options.add_argument('headless')\n    options.add_experimental_option('excludeSwitches', ['enable-logging'])\n\n#初始化\n    driver = webdriver.Chrome(options=options)\n    actions = ActionChains(driver)\n\n#打开链接\n    driver.get(url)\n    driver.implicitly_wait(10)\n\n    return driver, actions\n```\n获取页码:\n```python\ndef getPageNum(driver):\n#通过xpath匹配底部翻页元素位置, 获取页码数\n    text = driver.find_element(\"xpath\", '//ul[@class=\"be-pager\"]/span[@class=\"be-pager-total\"]')\n                 .get_attribute(\"textContent\")\n                 .split(' ')\n    return text[1]\n```\n遍历所有页:\n```python\ndef spawnCards(page, driver, actions):\n    #遍历所有页\n    for i in range(1,int(page) + 1):\n        print(f\"get data in page {i}\\n\")\n        #触发ajax生成card\n        spawn(driver, actions)\n        if (i != int(page)):\n            #翻页\n            goNextPage(driver, actions)\n            time.sleep(6) \n```\n生成card:\n```python\ndef spawn(driver, actions):\n    #得到card list\n    ulList = driver.find_elements(\"xpath\", '//ul[@class=\"relation-list\"]/li')\n    #生成 card\n    for li in ulList:\n        getCard(li, actions)\n        time.sleep(2)\n```\n```python\ndef getCard(li, actions):\n    cover = li.find_element(\"xpath\", './/a[@class=\"cover\"]')\n    actions.move_to_element(cover)\n    actions.perform()\n    actions.reset_actions()\n```\n\n获取并储存数据:\n```python\ndef writeData(driver):\n    #获取 card list\n    cardList = driver.find_elements(\"xpath\", '//div[@id=\"id-card\"]')\n    for card in cardList:\n        up_name = card.find_element(\"xpath\", './/img[@class=\"idc-avatar\"]').get_attribute(\"alt\")\n        up_fansNum = card.find_elements('css selector','span.idc-meta-item')[1].get_attribute(\"textContent\")\n        print(f'name:{up_name}, {up_fansNum}')\n        #写入csv文件\n        with open('.\\\\date.csv', mode='a', newline='', encoding='utf-8') as f:\n            writer = csv.writer(f)\n            writer.writerow([up_name, up_fansNum])\n```\n\n完整代码:\n```python\nfrom 
selenium import webdriver\nfrom selenium.webdriver.common.action_chains import ActionChains\nimport time\nimport csv\n\ndef initDriver(url):\n    options = webdriver.ChromeOptions()\n    options.add_argument('headless')\n    options.add_experimental_option('excludeSwitches', ['enable-logging'])\n    driver = webdriver.Chrome(options=options)\n    actions = ActionChains(driver)\n    driver.get(url)\n    driver.get(url)\n    driver.implicitly_wait(10)\n    return driver, actions\n\ndef getPageNum(driver):\n    text = driver.find_element(\"xpath\", '//ul[@class=\"be-pager\"]/span[@class=\"be-pager-total\"]').get_attribute(\"textContent\").split(' ')\n    return text[1]\n\ndef goNextPage(driver, actions):\n    bottom = driver.find_element(\"xpath\", '//li[@class=\"be-pager-next\"]/a')\n    actions.click(bottom)\n    actions.perform()\n    actions.reset_actions()\n\ndef getCard(li, actions):\n    cover = li.find_element(\"xpath\", './/a[@class=\"cover\"]')\n    actions.move_to_element(cover)\n    actions.perform()\n    actions.reset_actions()\n\ndef writeData(driver):\n    #get card list\n    cardList = driver.find_elements(\"xpath\", '//div[@id=\"id-card\"]')\n    for card in cardList:\n        up_name = card.find_element(\"xpath\", './/img[@class=\"idc-avatar\"]').get_attribute(\"alt\")\n        up_fansNum = card.find_elements('css selector','span.idc-meta-item')[1].get_attribute(\"textContent\")\n        print(f'name:{up_name}, {up_fansNum}')\n        #write info into csv file\n        with open('.\\\\date.csv', mode='a', newline='', encoding='utf-8') as f:\n            writer = csv.writer(f)\n            writer.writerow([up_name, up_fansNum])\n\ndef spawn(driver, actions):\n    #get card list\n    ulList = driver.find_elements(\"xpath\", '//ul[@class=\"relation-list\"]/li')\n    #spawn card\n    for li in ulList:\n        getCard(li, actions)\n        time.sleep(2)\n    \ndef spawnCards(page, driver, actions):\n    for i in range(1,int(page) + 1):\n        
print(f\"get data in page {i}\\n\")\n        spawn(driver, actions)\n        if (i != int(page)):\n            goNextPage(driver, actions)\n            time.sleep(6) \n\ndef main():\n    #init driver\n    uid = input(\"bilibili uid:\")\n    url = \"https://space.bilibili.com/\" + uid + \"/fans/fans\"\n    driver, actions = initDriver(url)\n    page = getPageNum(driver)\n\n    #spawn card info(ajax)\n    spawnCards(page, driver, actions)\n    writeData(driver)\n\n    driver.quit()\n\nif __name__ == \"__main__\":\n    main()\n```\n### 结果\n\n![图片](ipfs://bafkreidu3mwjxbt5dp4dgwslyg6leeylta3kzbfz2nkznjkvs7phy3i5za)\n\n### 反思\n可以改进的地方:\n- 由于ajax异步加载, 必须等待页面加载完毕后才能进行元素定位. 而使用`time.sleep()`方法不够效率和优雅, `WebDriverWait()`方法可以解决. 它可以轮询页面状态, 当页面加载完毕返回`true`.\n- 使用了多次含重复路径的xpath表达式, 匹配占用了太多内存.\n- 完全可以并发的提取数据, 更快的获取结果. 但出于对服务器负荷的考虑, 只写了单线程版本.\n\n### References\n[selenium doc](https://selenium-python-zh.readthedocs.io/en/latest/installation.html)\n<br>推荐阅读:<br>\n[Ajax](https://developer.mozilla.org/zh-CN/docs/Web/Guide/AJAX)<br>\n[Xpath](https://developer.mozilla.org/zh-CN/docs/Web/XPath)",
  "attributes": [
    {
      "value": "selenium-pa-chong",
      "trait_type": "xlog_slug"
    }
  ]
}