#StackBounty: #python #selenium #selenium-webdriver #web-scraping #xpath Without using WebDriverWait return: element click intercepted,…

Bounty: 50

Python:


Regarding the bounty: please, if possible, in addition to helping me
solve the current problem, point me to an improved and faster approach
than the method I currently use. (I'm still learning, so my methods are
pretty archaic.)

Code Proposal Summary:

Collect the links to all of the day's games listed on the page (https://int.soccerway.com/matches/2021/07/31/), with the freedom to change the date to whatever I want, such as 2021/08/01 and so on, so that in the future I can loop over several different days and collect their lists in a single run of the code.


Although it is very slow, this approach (without using headless mode) clicks all the buttons, expands the data and collects all 465 listed match links:

for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()

Full Code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)

url = "https://int.soccerway.com/matches/2021/07/28/"

driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()

But when I add options.add_argument("headless") so that the browser window is not opened on my screen, the script returns the following error:

Message: element click intercepted




To get around this problem, I looked for alternatives and found this WebDriverWait approach (https://stackoverflow.com/a/62904494/11462274), which I tried to use like this:

for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()

Full Code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

from selenium.webdriver.support.ui import WebDriverWait       
from selenium.webdriver.common.by import By       
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("headless")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)

url = "https://int.soccerway.com/matches/2021/07/28/"

driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()

But because the return value is not iterable, it fails with the error:

‘NoneType’ object is not iterable
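
For reference, EC.element_to_be_clickable resolves to a single element rather than a list, which is why the loop has nothing to iterate over, and in headless mode the group-header rows can be covered by other elements, which is what produces the "element click intercepted" message. A minimal sketch (my own, untested against this site) that waits for all header rows and clicks them through JavaScript, which is not blocked by overlapping elements:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait until every collapsible group header is present, then click each one via
# JavaScript so that an element drawn on top of it cannot intercept the click.
rows = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//tr[contains(@class,'group-head') and contains(@class,'clickable')]")
    )
)
for row in rows:
    driver.execute_script("arguments[0].click();", row)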




Why do I need this option?

1 – I'm going to automate this in an online terminal, so there won't be any browser opening on screen, and I need to make it fast so that I don't use up too much of my time limit on the terminal.

2 – I need an option that lets me use any date instead of 2021/07/28 in:

url = "https://int.soccerway.com/matches/2021/07/28/"

Where in the future I’ll add the parameter:

today = date.today().strftime("%Y/%m/%d")
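
A small sketch of that idea (assuming the URL pattern stays /matches/YYYY/MM/DD/), with the date formatted straight into the link:

from datetime import date, timedelta

def matches_url(day: date) -> str:
    # Hypothetical helper: build the matches URL for an arbitrary day.
    return f"https://int.soccerway.com/matches/{day:%Y/%m/%d}/"

print(matches_url(date.today()))
print(matches_url(date.today() - timedelta(days=1)))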



In this answer (https://stackoverflow.com/a/68535595/11462274), someone pointed out a very fast and interesting option (named "Quicker Version" at the end of the answer) that doesn't need a WebDriver, but I was only able to make it work for the first page of the site; when I try other dates of the year, it keeps returning only the links to the current day's games.
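
For completeness, a requests-only sketch of that idea (mine, not the linked answer's code; whether the server renders every match for an arbitrary date without JavaScript is an assumption that needs checking):

import requests
from bs4 import BeautifulSoup

def match_links(day: str) -> list:
    # day is a "YYYY/MM/DD" string; only links present in the static HTML are collected.
    url = f"https://int.soccerway.com/matches/{day}/"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a.get("href") for a in soup.select("td[class*='score-time'] a[href]")]

print(len(match_links("2021/07/28")))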

Expected Result (there are 465 links but I didn’t put the entire result because there is a character limit):

https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-sheriff-tiraspol/alashkert-fc/3517568/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-neftchi/olympiakos-cfp/3517569/        
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/scs-cfr-1907-cluj-sa/newcastle-fc/3517571/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-midtjylland/celtic-fc/3517576/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-razgrad-2000/mura/3517574/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/galatasaray-sk/psv-nv/3517577/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/bsc-young-boys-bern/k-slovan-bratislava/3517566/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-crvena-zvezda-beograd/fc-kairat-almaty/3517570/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/ac-sparta-praha/sk-rapid-wien/3517575/
https://int.soccerway.com/matches/2021/07/28/world/olympics/saudi-arabia-u23/brazil--under-23/3497390/
https://int.soccerway.com/matches/2021/07/28/world/olympics/germany-u23/cote-divoire-u23/3497391/
https://int.soccerway.com/matches/2021/07/28/world/olympics/romania-u23/new-zealand-under-23/3497361/
https://int.soccerway.com/matches/2021/07/28/world/olympics/korea-republic-u23/honduras-u23/3497362/
https://int.soccerway.com/matches/2021/07/28/world/olympics/australia-under-23/egypt-under-23/3497383/
https://int.soccerway.com/matches/2021/07/28/world/olympics/spain-under-23/argentina-under-23/3497384/
https://int.soccerway.com/matches/2021/07/28/world/olympics/france-u23/japan-u23/3497331/
https://int.soccerway.com/matches/2021/07/28/world/olympics/south-africa-u23/mexico-u23/3497332/
https://int.soccerway.com/matches/2021/07/28/africa/cecafa-senior-challenge-cup/uganda-under-23/eritrea-under-23/3567664/
https://int.soccerway.com/matches/2021/07/28/africa/cecafa-senior-challenge-cup/ethiopia-u23/congo-dr-under-23/3567663/
https://int.soccerway.com/matches/2021/07/28/argentina/primera-division/boca-juniors/club-atletico-san-lorenzo-de-almagro/3528753/
https://int.soccerway.com/matches/2021/07/28/argentina/primera-division/ca-union-de-santa-fe/ca-banfield/3528752/
https://int.soccerway.com/matches/2021/07/28/argentina/primera-division/godoy-cruz-antonio-tomba/club-atletico-tucuman/3528762/
https://int.soccerway.com/matches/2021/07/28/argentina/primera-division/club-atletico-sarmiento/ca-platense/3528758/
https://int.soccerway.com/matches/2021/07/28/argentina/primera-division/club-atletico-velez-sarsfield/csd-defensa-y-justicia/3528761/
https://int.soccerway.com/matches/2021/07/28/argentina/primera-division/ca-independiente/club-atletico-patronato/3528757/
https://int.soccerway.com/matches/2021/07/28/australia/queensland/queensland-lions-sc/capalaba/3460465/
https://int.soccerway.com/matches/2021/07/28/australia/queensland/sunshine-coast-wanderers/brisbane-roar-ii/3460466/
https://int.soccerway.com/matches/2021/07/28/australia/queensland-pl2-youth/north-star-u23/samford-rangers-u23/3498140/
https://int.soccerway.com/matches/2021/07/28/australia/western-australia-npl-women/australia-murdoch-university-melville-fc/perth/3469912/
https://int.soccerway.com/matches/2021/07/28/australia/western-australia-npl-women/freemantle-city/northern-redbacks/3469950/
https://int.soccerway.com/matches/2021/07/28/australia/capital-territory-npl-youth-league/monaro-panthers-u23/cooma-tigers-u23/3433866/
https://int.soccerway.com/matches/2021/07/28/australia/capital-territory-npl-2-youth/canberra-we-u23/weston-molonglo-u23/3433696/
https://int.soccerway.com/matches/2021/07/28/australia/capital-territory-npl-women/gungahlin-united/australia-canberra-fc/3433608/
https://int.soccerway.com/matches/2021/07/28/australia/northern-nsw/weston-workers/broadmeadow-magic/3432958/
https://int.soccerway.com/matches/2021/07/28/australia/northern-nsw/valentine-phoenix/adamstown-rosebuds/3432957/
https://int.soccerway.com/matches/2021/07/28/australia/northern-nsw/edgeworth-eagles/lake-macquarie/3432954/
https://int.soccerway.com/matches/2021/07/28/australia/northern-nsw-reserve-league/edgeworth-eagles-res/lake-macquarie-res/3434270/
https://int.soccerway.com/matches/2021/07/28/australia/brisbane-womens-cup/australia-grange-thistle-sc/australia-albany-creek-excelsior-fc/3514706/
https://int.soccerway.com/matches/2021/07/28/austria/regionalliga/fc-red-bull-salzburg-amateure/sv-wals-grunau/3559518/
https://int.soccerway.com/matches/2021/07/28/austria/regionalliga/sk-bischofshofen/sportverein-austria-salzburg/3559520/
https://int.soccerway.com/matches/2021/07/28/austria/regionalliga/admira-dornbirn/sc-austria-lustenau-ii/3560565/
https://int.soccerway.com/matches/2021/07/28/austria/regionalliga/fc-rot-weiss-rankweil/sc-bregenz/3560566/
https://int.soccerway.com/matches/2021/07/28/austria/regionalliga/dornbirner-sv/fc-lauterach/3560567/
https://int.soccerway.com/matches/2021/07/28/austria/landesliga/fc-hochst/bezau/3577691/
https://int.soccerway.com/matches/2021/07/28/austria/landesliga/fc-nenzing/fc-andelsbuch/3577692/
https://int.soccerway.com/matches/2021/07/28/austria/landesliga/fc-lustenau-07/ludesch/3577693/
https://int.soccerway.com/matches/2021/07/28/austria/landesliga/gofis/sc-rheindorf-altach-ii/3577694/
https://int.soccerway.com/matches/2021/07/28/austria/landesliga/fc-alberschwende/schruns/3577695/
https://int.soccerway.com/matches/2021/07/28/austria/landesliga/fc-hard/sc-fussach/3577696/
https://int.soccerway.com/matches/2021/07/28/austria/landesliga/horbranz/sk-cht-austria-meiningen/3577697/
https://int.soccerway.com/matches/2021/07/28/austria/ofb-stiegl-cup/sv-kuchl/fc-blau-weiss-linz/3527604/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/ostrovets-fc/slonim-city/3501140/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/belaya-rus/hcs-olympia/3501141/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/svislach/schuchin/3501137/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/neman-mosty/fk-tsementnik-krasnoselsky/3501138/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/smena-vawkavysk/chayka-zelva/3501139/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/uzda/kolos-cherven/3502257/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/fc-osipovichi/spartak-shklov/3502182/
https://int.soccerway.com/matches/2021/07/28/belarus/2-division/krasnopole/fk-dnepr-mogilev-ii/3502176/
https://int.soccerway.com/matches/2021/07/28/belarus/premier-league-women/gomel/dinamo-bgu/3477008/
https://int.soccerway.com/matches/2021/07/28/brazil/serie-b/botafogo-de-futebol-e-regatas/centro-sportivo-alagoano/3482911/
https://int.soccerway.com/matches/2021/07/28/brazil/copa-do-brasil/criciuma-esporte-clube/fluminense-football-club/3521228/
https://int.soccerway.com/matches/2021/07/28/brazil/copa-do-brasil/esporte-clube-vitoria/gremio-foot-ball-porto-alegrense/3521231/
https://int.soccerway.com/matches/2021/07/28/brazil/copa-do-brasil/clube-atletico-paranaense/atletico-clube-goianiense/3521234/
https://int.soccerway.com/matches/2021/07/28/brazil/carioca-a2/audax-rio-de-janeiro-ec/associacao-desportiva-cabofriense/3508338/
https://int.soccerway.com/matches/2021/07/28/brazil/carioca-a2/cfrj--marica/duque-de-caxias-futebol-clube/3508339/
https://int.soccerway.com/matches/2021/07/28/brazil/carioca-a2/angra-dos-reis-esporte-clube/goncalense/3508340/
https://int.soccerway.com/matches/2021/07/28/brazil/carioca-a2/america-football-club-rio-de-janeiro/artsul-futebol-clube/3508341/
https://int.soccerway.com/matches/2021/07/28/brazil/carioca-a2/macae-esporte-futebol-clube/americano-futebol-clube/3508342/
https://int.soccerway.com/matches/2021/07/28/brazil/carioca-a2/sampaio-correa-fe/friburguense-atletico-clube/3508343/
https://int.soccerway.com/matches/2021/07/28/brazil/catarinense-2/sociedade-esportiva-recreativa-clube-guarani/atletico-catarinense/3527050/
https://int.soccerway.com/matches/2021/07/28/brazil/catarinense-2/nacao/acre-cidade-azul/3527051/
https://int.soccerway.com/matches/2021/07/28/brazil/cearense-2-div/floresta/cariri/3582360/
https://int.soccerway.com/matches/2021/07/28/brazil/cearense-2-div/uniao-ce/itapipoca-esporte-clube/3582361/
https://int.soccerway.com/matches/2021/07/28/brazil/cbf-brasileiro-u20/fortaleza-u19/atletico-go-u19/3520266/
https://int.soccerway.com/matches/2021/07/28/brazil/cbf-brasileiro-u20/cr-flamengo-u20/cruzeiro-ac-u20/3520267/
https://int.soccerway.com/matches/2021/07/28/brazil/cbf-brasileiro-u20/botafogo-fc-u20/sport-club-do-recife-u20/3520269/
https://int.soccerway.com/matches/2021/07/28/brazil/cbf-brasileiro-u20/ca-mineiro-u20/sociedade-esportiva-palmeiras-u20/3520270/
https://int.soccerway.com/matches/2021/07/28/brazil/cbf-brasileiro-u20/santos-futebol-clube-sao-paulo-u20/sao-paulo-futebol-clube-u20/3520273/     
https://int.soccerway.com/matches/2021/07/28/brazil/cbf-brasileiro-u20/ec-bahia-u20/cr-vasco-da-gama-u20/3520271/
https://int.soccerway.com/matches/2021/07/28/brunei-darussalam/premier-league/jerudong-fc/kuala-belait/3521922/
https://int.soccerway.com/matches/2021/07/28/chile/primera-division/audax-club-sportivo-italiano/corporacion-deportiva-everton/3478522/
https://int.soccerway.com/matches/2021/07/28/chile/primera-division/club-de-desportes-cobresal/deportes-melipilla/3478523/
https://int.soccerway.com/matches/2021/07/28/chile/primera-division/union-espanola/club-deportivo-huachipato/3478526/
https://int.soccerway.com/matches/2021/07/28/china-pr/csl/hebei-zhongji/dalian-aerbin-fc/3492367/
https://int.soccerway.com/matches/2021/07/28/china-pr/csl/beijing-guoan-football-club/shanghai-east-asia/3492368/
https://int.soccerway.com/matches/2021/07/28/china-pr/csl/tianjin-teda/changchun-yatai/3492366/
https://int.soccerway.com/matches/2021/07/28/china-pr/csl/shanghai-shenhua/hubei-luyin-fc/3492365/
https://int.soccerway.com/matches/2021/07/28/china-pr/china-league-one/nanjing-city/guizhou-zhicheng-toro-fc/3545195/
https://int.soccerway.com/matches/2021/07/28/china-pr/china-league-one/shenyang-city/hubei-huakaier/3545193/
https://int.soccerway.com/matches/2021/07/28/china-pr/china-league-one/wuhan-three-towns/sichuan-fc/3545196/
https://int.soccerway.com/matches/2021/07/28/colombia/primera-a/la-equidad/deportivo-pasto/3554519/
https://int.soccerway.com/matches/2021/07/28/colombia/primera-b/real-cartagena/valledupar-fc/3553296/
https://int.soccerway.com/matches/2021/07/28/costa-rica/primera-division/municipal-perez-zeledon/adr-jicaral/3529055/
https://int.soccerway.com/matches/2021/07/28/costa-rica/primera-division/sporting-san-jose/ad-guanacasteca/3529056/
https://int.soccerway.com/matches/2021/07/28/costa-rica/primera-division/deportivo-saprissa/santos-de-guapiles-fc/3529053/
https://int.soccerway.com/matches/2021/07/28/costa-rica/primera-division/grecia-fc/club-sport-cartagines/3529054/
https://int.soccerway.com/matches/2021/07/28/ecuador/primera-b/alianza-cotopaxi/cumbaya/3468963/
https://int.soccerway.com/matches/2021/07/28/ecuador/primera-b/gualaceo/club-deportivo-america/3468962/
https://int.soccerway.com/matches/2021/07/28/ecuador/primera-b/puerto-quito/ldu-de-portoviejo/3468965/
https://int.soccerway.com/matches/2021/07/28/estonia/esiliiga/fc-elva/fc-flora-tallinn-ii/3499410/

Note 1: There are multiple variants of the score-time class, such as score-time status and score-time score; that's why I used contains() in "//td[contains(@class,'score-time')]//a".



#StackBounty: #python #web-scraping #modules Organizing things together to form a minimum viable Scraper App

Bounty: 50

This is a follow-up of my group of scraper questions starting from here.

I have thus far, with the help of @Reinderien, written 4 separate "modules" that expose a search function to scrape bibliographic information from separate online databases. Half of them use Selenium; the other half use Requests.

I would like to know the best way to put them together, possibly organizing them into a single module that can be imported as a whole, and/or creating a base class so that common code can be shared between them.

I would like the final app to be able to execute the search function for each database when given a list of search keywords, together with a choice of databases to search, as arguments.
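
On the base-class idea, one way to share the common pieces, sketched under the assumption that each scraper keeps a keyword-based entry point (the name BaseSearcher and its methods are hypothetical, not taken from the existing modules):

from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

class BaseSearcher(ABC):
    """Hypothetical shared base class: subclasses implement the site-specific fetch."""
    name: str = "base"

    @abstractmethod
    def fetch(self, keyword: str) -> Iterable[Dict[str, str]]:
        """Yield one result dict per hit for a single keyword."""

    def search(self, keyword: str) -> List[Dict[str, str]]:
        # Shared behaviour (saving, logging, de-duplication) would live here.
        return list(self.fetch(keyword))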


Update:

Since there is still no answer to this question, I have drafted working code that takes a list of keywords together with the database to be searched. If the database is unspecified, the same set of keywords is looped through all of the databases.

I would like to seek improvements to the code below, especially with respect to:

  1. Consolidating the search results into a single .json or .bib file when all databases are involved.
  2. Reusing common code so that the whole code-base is less bulky and more extensible.
  3. More flexible search options, such as choosing 2 or 3 out of 4 databases to search in. (Possibly with the use of *args or **kwargs in the search function; see the sketch after the main.py listing below.)

main.py

import cnki, fudan, wuhan, qinghua

def db_search(keyword, db=None):

    db_dict = {
        "cnki": cnki.search,
        "fudan": fudan.search,
        "wuhan": wuhan.search,
        "qinghua": qinghua.search,
        }

    if db == None:
        for key in db_dict.keys():
            yield db_dict[key](keyword)
    elif db == "cnki":
        yield db_dict["cnki"](keyword)
    elif db == "fudan":
        yield db_dict["fudan"](keyword)
    elif db == "wuhan":
        yield db_dict["wuhan"](keyword)
    elif db == "qinghua":
        yield db_dict["qinghua"](keyword)


def search(keywords, db=None):
    for kw in keywords:
        yield from db_search(kw, db)



if __name__ == '__main__':
    rslt = search(['尹誥','尹至'])
    for item in rslt:
        print(item)
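
On point 3, a minimal sketch of how db_search above could dispatch straight from the dict and accept any subset of databases (same module names and search signatures assumed):

def db_search(keyword, dbs=None):
    db_dict = {
        "cnki": cnki.search,
        "fudan": fudan.search,
        "wuhan": wuhan.search,
        "qinghua": qinghua.search,
    }
    # None means "search everything"; otherwise dbs is any iterable of database names.
    names = db_dict if dbs is None else dbs
    for name in names:
        yield db_dict[name](keyword)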

The Code:

cnki.py

from contextlib import contextmanager
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Generator, Iterable, Optional, List, ContextManager, Dict
from urllib.parse import unquote
from itertools import chain, count
import re
import json
from math import ceil

# pip install proxy.py
import proxy
from proxy.http.exception import HttpRequestRejected
from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import ProxyType
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# from urllib3.packages.six import X


@dataclass
class Result:
    title: str        # Mozi's Theory of Human Nature and Politics
    title_link: str   # http://big5.oversea.cnki.net/kns55/detail/detail.aspx?recid=&FileName=ZDXB202006009&DbName=CJFDLAST2021&DbCode=CJFD
    html_link: Optional[str]  # http%3a%2f%2fkns.cnki.net%2fKXReader%2fDetail%3fdbcode%3dCJFD%26filename%3dZDXB202006009
    author: str       # Xie Qiyang
    source: str       # Vocational University News
    source_link: str  # http://big5.oversea.cnki.net/kns55/Navi/ScdbBridge.aspx?DBCode=CJFD&BaseID=ZDXB&UnitCode=&NaviLink=%e8%81%8c%e5%a4%a7%e5%ad%a6%e6%8a%a5
    date: date   # 2020-12-28
    download: str        #
    database: str     # Periodical

    @classmethod
    def from_row(cls, row: WebElement) -> 'Result':
        number, title, author, source, published, database = row.find_elements_by_xpath('td')

        title_links = title.find_elements_by_tag_name('a')

        if len(title_links) > 1:
            # 'http://big5.oversea.cnki.net/kns55/ReadRedirectPage.aspx?flag=html&domain=http%3a%2f%2fkns.cnki.net%2fKXReader%2fDetail%3fdbcode%3dCJFD%26filename%3dZDXB202006009'
            html_link = unquote(
                title_links[1]
                .get_attribute('href')
                .split('domain=', 1)[1])
        else:
            html_link = None

        dl_links, sno = number.find_elements_by_tag_name('a')

        published_date = date.fromisoformat(
            published.text.split(maxsplit=1)[0]
        )

        return cls(
            title=title_links[0].text,
            title_link=title_links[0].get_attribute('href'),
            html_link=html_link,
            author=author.text,
            source=source.text,
            source_link=source.get_attribute('href'),
            date=published_date,
            download=dl_links.get_attribute('href'),
            database=database.text,
        )

    def __str__(self):
        return (
            f'題名      {self.title}'
            f'\n作者     {self.author}'
            f'\n來源     {self.source}'
            f'\n發表時間  {self.date}'
            f'\n下載連結 {self.download}'
            f'\n來源數據庫 {self.database}'
        )

    def as_dict(self) -> Dict[str, str]:
        return {
        'author': self.author,
        'title': self.title,
        'date': self.date.isoformat(),
        'download': self.download,
        'url': self.html_link,
        'database': self.database,
    }


class MainPage:
    def __init__(self, driver: WebDriver):
        self.driver = driver

    def submit_search(self, keyword: str) -> None:
        wait = WebDriverWait(self.driver, 50)
        search = wait.until(
            EC.presence_of_element_located((By.NAME, 'txt_1_value1'))
        )
        search.send_keys(keyword)
        search.submit()

    def switch_to_frame(self) -> None:
        wait = WebDriverWait(self.driver, 100)
        wait.until(
            EC.presence_of_element_located((By.XPATH, '//iframe[@name="iframeResult"]'))
        )
        self.driver.switch_to.default_content()
        self.driver.switch_to.frame('iframeResult')

        wait.until(
            EC.presence_of_element_located((By.XPATH, '//table[@class="GridTableContent"]'))
        )

    def max_content(self) -> None:
        """Maximize the number of items on display in the search results."""
        max_content = self.driver.find_element(
            By.CSS_SELECTOR, '#id_grid_display_num > a:nth-child(3)',
        )
        max_content.click()

    # def get_element_and_stop_page(self, *locator) -> WebElement:
    #     ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
    #     wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
    #     elm = wait.until(EC.presence_of_element_located(locator))
    #     self.driver.execute_script("window.stop();")
    #     return elm



class SearchResults:
    def __init__(self, driver: WebDriver):
        self.driver = driver


    def number_of_articles_and_pages(self) -> int:
        elem = self.driver.find_element_by_xpath(
            '//table//tr[3]//table//table//td[1]/table//td[1]'
        )
        n_articles = re.search("共有記錄(.+)條", elem.text).group(1)
        n_pages = ceil(int(n_articles)/50)

        return n_articles, n_pages


    def get_structured_elements(self) -> Iterable[Result]:
        rows = self.driver.find_elements_by_xpath(
            '//table[@class="GridTableContent"]//tr[position() > 1]'
        )

        for row in rows:
            yield Result.from_row(row)


    def get_element_and_stop_page(self, *locator) -> WebElement:
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
        elm = wait.until(EC.presence_of_element_located(locator))
        self.driver.execute_script("window.stop();")
        return elm

    def next_page(self) -> None:
        link = self.get_element_and_stop_page(By.LINK_TEXT, "下頁")

        try:
            link.click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")



class ContentFilterPlugin(HttpProxyBasePlugin):
    HOST_WHITELIST = {
        b'ocsp.digicert.com',
        b'ocsp.sca1b.amazontrust.com',
        b'big5.oversea.cnki.net',
    }

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        host = request.host or request.header(b'Host')
        if host not in self.HOST_WHITELIST:
            raise HttpRequestRejected(403)

        if any(
            suffix in request.path
            for suffix in (
                b'png', b'ico', b'jpg', b'gif', b'css',
            )
        ):
            raise HttpRequestRejected(403)

        return request

    def before_upstream_connection(self, request):
        return super().before_upstream_connection(request)
    def handle_upstream_chunk(self, chunk):
        return super().handle_upstream_chunk(chunk)
    def on_upstream_connection_close(self):
        pass


@contextmanager
def run_driver() -> ContextManager[WebDriver]:
    prox_type = ProxyType.MANUAL['ff_value']
    prox_host = '127.0.0.1'
    prox_port = 8889

    profile = FirefoxProfile()
    profile.set_preference('network.proxy.type', prox_type)
    profile.set_preference('network.proxy.http', prox_host)
    profile.set_preference('network.proxy.ssl', prox_host)
    profile.set_preference('network.proxy.http_port', prox_port)
    profile.set_preference('network.proxy.ssl_port', prox_port)
    profile.update_preferences()

    plugin = f'{Path(__file__).stem}.{ContentFilterPlugin.__name__}'

    with proxy.start((
        '--hostname', prox_host,
        '--port', str(prox_port),
        '--plugins', plugin,
    )), Firefox(profile) as driver:
        yield driver


def loop_through_results(driver):
    result_page = SearchResults(driver)
    n_articles, n_pages = result_page.number_of_articles_and_pages()
    
    print(f"{n_articles} found. A maximum of 500 will be retrieved.")

    for page in count(1):

        print(f"Scraping page {page}/{n_pages}")
        print()

        result = result_page.get_structured_elements()
        yield from result

        if page >= n_pages or page >= 10:
            break

        result_page.next_page()
        result_page = SearchResults(driver)


def save_articles(articles: Iterable, file_prefix: str) -> None:
    file_path = Path(file_prefix).with_suffix('.json')

    with file_path.open('w') as file:
        file.write('[\n')
        first = True

        for article in articles:
            if first:
                first = False
            else:
                file.write(',\n')
            json.dump(article.as_dict(), file, ensure_ascii=False, indent=4)

        file.write('\n]\n')


def query(keyword, driver) -> None:

    page = MainPage(driver)
    page.submit_search(keyword)
    page.switch_to_frame()
    page.max_content()


def search(keyword):
    with Firefox() as driver:
        driver.get('http://big5.oversea.cnki.net/kns55/')
        query(keyword, driver)
        result = loop_through_results(driver)
        save_articles(result, 'cnki_search_result.json')


if __name__ == '__main__':
    search('尹至')

qinghua.py

Search functionality is down at the moment. I plan to try it out with Requests as soon as it is back up and running.

from contextlib import contextmanager
from dataclasses import dataclass, asdict, replace
from datetime import datetime, date
from pathlib import Path
from typing import Iterable, Optional, ContextManager
import re
import os
import time
import json

# pip install proxy.py
import proxy
from proxy.http.exception import HttpRequestRejected
from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import ProxyType
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


@dataclass
class PrimaryResult:
    captions: str
    date: date
    link: str

    @classmethod
    def from_row(cls, row: WebElement) -> 'PrimaryResult': 

        caption_elems = row.find_element_by_tag_name('a')
        date_elems = row.find_element_by_class_name('time')

        published_date = date.isoformat(datetime.strptime(date_elems.text, '%Y-%m-%d'))

        return cls(
            captions = caption_elems.text,
            date = published_date,
            link = caption_elems.get_attribute('href')
        )

    def __str__(self):
        return (
            f'\n標題     {self.captions}'
            f'\n發表時間  {self.date}'
            f'\n文章連結 {self.link}'
        )


class MainPage:
    def __init__(self, driver: WebDriver):
        self.driver = driver
 
    def submit_search(self, keyword: str) -> None:
        driver = self.driver
        wait = WebDriverWait(self.driver, 100)

        xpath = "//form/button/input"
        element_to_hover_over = driver.find_element_by_xpath(xpath)
        hover = ActionChains(driver).move_to_element(element_to_hover_over)
        hover.perform()

        search = wait.until(
            EC.presence_of_element_located((By.ID, 'showkeycode1015273'))
        )
        search.send_keys(keyword)
        search.submit()


    def get_element_and_stop_page(self, *locator) -> WebElement:
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
        elm = wait.until(EC.presence_of_element_located(locator))
        self.driver.execute_script("window.stop();")
        return elm

    def next_page(self) -> None:
        try: 
            link = self.get_element_and_stop_page(By.LINK_TEXT, "下一页")
            link.click()
            print("Navigating to Next Page")

        except (TimeoutException, WebDriverException):
            print("No button with 「下一页」 found.")
            return 0


    # @contextmanager
    # def wait_for_new_window(self):
    #     driver = self.driver
    #     handles_before = driver.window_handles
    #     yield
    #     WebDriverWait(driver, 10).until(
    #         lambda driver: len(handles_before) != len(driver.window_handles))

    def switch_tabs(self):
        driver = self.driver
        print("Current Window:")
        print(driver.title)
        print()

        p = driver.current_window_handle
        
        chwd = driver.window_handles
        time.sleep(3)
        driver.switch_to.window(chwd[1])

        print("New Window:")
        print(driver.title)
        print()


class SearchResults:
    def __init__(self, driver: WebDriver):
        self.driver = driver

    def get_primary_search_result(self):
        
        filePath = os.path.join(os.getcwd(), "qinghua_primary_search_result.json")

        if os.path.exists(filePath):
            os.remove(filePath)    

        rows = self.driver.find_elements_by_xpath('//ul[@class="search_list"]/li')

        for row in rows:
            rslt = PrimaryResult.from_row(row)
            with open('qinghua_primary_search_result.json', 'a') as file:
                json.dump(asdict(rslt), file, ensure_ascii=False, indent=4)
            yield rslt


# class ContentFilterPlugin(HttpProxyBasePlugin):
#     HOST_WHITELIST = {
#         b'ocsp.digicert.com',
#         b'ocsp.sca1b.amazontrust.com',
#         b'big5.oversea.cnki.net',
#         b'gwz.fudan.edu.cn',
#         b'bsm.org.cn/index.php'
#         b'ctwx.tsinghua.edu.cn',
#     }

#     def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
#         host = request.host or request.header(b'Host')
#         if host not in self.HOST_WHITELIST:
#             raise HttpRequestRejected(403)

#         if any(
#             suffix in request.path
#             for suffix in (
#                 b'png', b'ico', b'jpg', b'gif', b'css',
#             )
#         ):
#             raise HttpRequestRejected(403)

#         return request

#     def before_upstream_connection(self, request):
#         return super().before_upstream_connection(request)
#     def handle_upstream_chunk(self, chunk):
#         return super().handle_upstream_chunk(chunk)
#     def on_upstream_connection_close(self):
#         pass


# @contextmanager
# def run_driver() -> ContextManager[WebDriver]:
#     prox_type = ProxyType.MANUAL['ff_value']
#     prox_host = '127.0.0.1'
#     prox_port = 8889

#     profile = FirefoxProfile()
#     profile.set_preference('network.proxy.type', prox_type)
#     profile.set_preference('network.proxy.http', prox_host)
#     profile.set_preference('network.proxy.ssl', prox_host)
#     profile.set_preference('network.proxy.http_port', prox_port)
#     profile.set_preference('network.proxy.ssl_port', prox_port)
#     profile.update_preferences()

#     plugin = f'{Path(__file__).stem}.{ContentFilterPlugin.__name__}'

#     with proxy.start((
#         '--hostname', prox_host,
#         '--port', str(prox_port),
#         '--plugins', plugin,
#     )), Firefox(profile) as driver:
#         yield driver


def search(keyword) -> None:
    with Firefox() as driver:
        driver.get('http://www.ctwx.tsinghua.edu.cn/index.htm')

        page = MainPage(driver)
        # page.select_dropdown_item()
        page.submit_search(keyword)

        time.sleep(5)
        # page.switch_tabs()

        while True:
            primary_result_page = SearchResults(driver)
            primary_results = primary_result_page.get_primary_search_result()
            for result in primary_results:
                print(result)
                print()
            if page.next_page() == 0:
                break
            else:
                pass


if __name__ == '__main__':
    search('尹至')

fudan.py

# fudan.py

from dataclasses import dataclass
from itertools import count
from pathlib import Path
from typing import Dict, Iterable, Tuple, List, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from requests import Session
from datetime import date, datetime

import json
import re

BASE_URL = 'http://www.gwz.fudan.edu.cn'


@dataclass
class Link:
    caption: str
    url: str
    clicks: int
    replies: int
    added: date

    @classmethod
    def from_row(cls, props: Dict[str, str], path: str) -> 'Link':
        clicks, replies = props['点击/回复'].split('/')
        # Skip number=int(props['编号']) - this only has meaning within one page

        return cls(
            caption=props['资源标题'],
            url=urljoin(BASE_URL, path),
            clicks=int(clicks),
            replies=int(replies),
            added=datetime.strptime(props['添加时间'], '%Y/%m/%d').date(),
        )
        
    def __str__(self):
        return f'{self.added} {self.url} {self.caption}'

    def author_title(self) -> Tuple[Optional[str], str]:
        sep = '：'  # full-width colon, U+FF1A

        if sep not in self.caption:
            return None, self.caption

        author, title = self.caption.split(sep, 1)
        author, title = author.strip(), title.strip()

        net_digest = '網摘'
        if author == net_digest:
            return None, title

        return author, title


@dataclass
class Article:
    author: Optional[str]
    title: str
    date: date
    download: Optional[str]
    url: str

    @classmethod
    def from_link(cls, link: Link, download: str) -> 'Article':

        author, title = link.author_title()

        download = download.replace("\r", "").replace("\n", "").strip()
        if download == '#_edn1':
            download = None
        elif download[0] != '/':
            download = '/' + download

        return cls(
            author=author,
            title=title,
            date=link.added,
            download=download,
            url=link.url,
        )

    def __str__(self) -> str:
        return(
            f"\n作者   {self.author}"
            f"\n標題   {self.title}"
            f"\n發佈日期 {self.date}"
            f"\n下載連結 {self.download}"
            f"\n訪問網頁 {self.url}"
        )

    def as_dict(self) -> Dict[str, str]:
        return {
            'author': self.author,
            'title': self.title,
            'date': self.date.isoformat(),
            'download': self.download,
            'url': self.url,
        }


def compile_search_results(session: Session, links: Iterable[Link], category_filter: str) -> Iterable[Article]:

    for link in links:
        with session.get(link.url) as resp:
            resp.raise_for_status()
            doc = BeautifulSoup(resp.text, 'html.parser')

        category = doc.select_one('#_top td a[href="#"]').text
        if category != category_filter:
            continue

        content = doc.select_one('span.ny_font_content')
        dl_tag = content.find(
            'a', {
                'href': re.compile("/?(lunwen/|articles/up/).+")
            }
        )

        yield Article.from_link(link, download=dl_tag['href'])


def get_page(session: Session, query: str, page: int) -> Tuple[List[Link], int]:
    with session.get(
        urljoin(BASE_URL, '/Web/Search'),
        params={
            's': query,
            'page': page,
        },
    ) as resp:
        resp.raise_for_status()
        doc = BeautifulSoup(resp.text, 'html.parser')

    table = doc.select_one('#tab table')
    heads = [h.text for h in table.select('tr.cap td')]
    links = []

    for row in table.find_all('tr', class_=''):
        cells = [td.text for td in row.find_all('td')]
        links.append(Link.from_row(
            props=dict(zip(heads, cells)),
            path=row.find('a')['href'],
        ))

    page_td = doc.select_one('#tab table:nth-child(2) td') # 共 87 条记录, 页 1/3
    n_pages = int(page_td.text.rsplit('/', 1)[1])

    return links, n_pages


def get_all_links(session: Session, query: str) -> Iterable[Link]:
    for page in count(1):
        links, n_pages = get_page(session, query, page)
        print(f'{page}/{n_pages}')
        yield from links

        if page >= n_pages:
            break


def save_articles(articles: Iterable[Article], file_prefix: str) -> None:
    file_path = Path(file_prefix).with_suffix('.json')

    with file_path.open('w') as file:
        file.write('[\n')
        first = True

        for article in articles:
            if first:
                first = False
            else:
                file.write(',\n')
            json.dump(article.as_dict(), file, ensure_ascii=False, indent=4)

        file.write('\n]\n')


def search(keyword):
    with Session() as session:
        links = get_all_links(session, query=keyword)
        academic_library = '学者文库'
        articles = compile_search_results(session, links, category_filter=academic_library)
        save_articles(articles, 'fudan_search_result')


if __name__ == '__main__':
    search('尹至')

wuhan.py

from dataclasses import dataclass, asdict
from itertools import count
from typing import Dict, Iterable, Tuple, List

from bs4 import BeautifulSoup
from requests import post
from datetime import date, datetime

import json
import os
import re

@dataclass
class Result:
    author: str
    title: str
    date: date
    url: str
    publication: str = "武漢大學簡帛網"

    @classmethod
    def from_metadata(cls, metadata: Dict) -> 'Result': 
        author, title = metadata['caption'].split(':')
        published_date = date.isoformat(datetime.strptime(metadata['date'], '%y/%m/%d'))
        url = 'http://www.bsm.org.cn/' + metadata['url']

        return cls(
            author = author,
            title = title,
            date = published_date,
            url = url
        )


    def __str__(self):
        return (
            f'作者    {self.author}'
            f'\n標題     {self.title}'
            f'\n發表時間  {self.date}'
            f'\n文章連結 {self.url}'
            f'\n發表平台  {self.publication}'
        )


def submit_query(keyword: str):
    query = {"searchword": keyword}
    with post('http://www.bsm.org.cn/pages.php?pagename=search', query) as resp:
        resp.raise_for_status()
        doc = BeautifulSoup(resp.text, 'html.parser')
        content = doc.find('div', class_='record_list_main')
        rows = content.select('ul')


    for row in rows:
        if len(row.findAll('li')) != 2:
            print()
            print(row.text)
            print()
        else:
            captions_tag, date_tag = row.findAll('li')
            caption_anchors = captions_tag.findAll('a')
            category, caption = [item.text for item in caption_anchors]
            url = caption_anchors[1]['href']
            date = re.sub("[()]", "", date_tag.text)

            yield {
                "category": category, 
                "caption": caption, 
                "date": date,
                "url": url}


def remove_json_if_exists(filename):
    json_file = filename + ".json"
    filePath = os.path.join(os.getcwd(), json_file)

    if os.path.exists(filePath):
        os.remove(filePath)


def search(query: str):
    remove_json_if_exists('wuhan_search_result')
    rslt = submit_query(query)

    for metadata in rslt:
        article = Result.from_metadata(metadata)
        print(article)
        print()

        with open('wuhan_search_result.json', 'a') as file:
            json.dump(asdict(article), file, ensure_ascii=False, indent=4)



if __name__ == '__main__':
    search('尹至')



#StackBounty: #selenium #selenium-webdriver #web-scraping #google-sheets-formula Scraping angellist start-up data

Bounty: 50

I want to scrape data into a spreadsheet from this site (the Angel.co startup list). I have tried many ways, but it shows an error. I used IMPORTXML and IMPORTHTML in the spreadsheet and it's not working.

format : startup name, location, category

Thanks in advance for help.

I tried to use the requests approach below to scrape the data; however, it shows no output.

import requests

URL = 'https://angel.co/social-network-2'


headers = {
   "Host": "www.angel.co",
   "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux armv8l; rv:88.0) Gecko/20100101 Firefox/88.0",
   "Accept": "application/json, text/javascript, */*; q=0.01",
   "Accept-Language": "en-US,en;q=0.5",
   "Accept-Encoding": "gzip, deflate",
   "Referer": "https://angel.co/social-network-2",
   "X-Requested-With": "XMLHttpRequest",
   "via": "1.1 google"
}

datas = requests.get(URL, headers=headers).json()
import re

for i in datas['data']:
    for j in re.findall('class="uni-link">(.*)</a>',i['title']):
        print(j)
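
A small diagnostic sketch (my assumption: the server answers with HTML or a block page rather than JSON, which would explain the missing output) to check what actually comes back before calling .json():

resp = requests.get(URL, headers=headers)
# If this prints text/html instead of application/json, .json() has nothing to parse.
print(resp.status_code, resp.headers.get("Content-Type"))
print(resp.text[:300])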



#StackBounty: #python #python-3.x #web-scraping #concurrent.futures Script doesn't work when I go for multiple search keywords in t…

Bounty: 50

I've created a script to fetch the different newspaper names returned by a search engine when I initiate a search using different keywords, such as CMG제약, DB하이텍 etc., in that page's top-right search box.

I also used some customized dates within params to get results from those dates. The script is doing fine as long as I use a single keyword in the search list.

However, when I use multiple keywords in the search list, the script only keeps up with the last keyword. This is the list of keywords I would like to use:

keywords = ['CMG제약','DB하이텍','ES큐브','EV첨단소재']

The script is short, but because of the size of the params dict it looks bigger.

This is what I've tried so far (it works as intended when I use a single search keyword in the list):

import requests
import concurrent.futures
from bs4 import BeautifulSoup
from urllib.parse import urljoin

year_list_start = ['2013.01.01','2014.01.02']
year_list_upto = ['2014.01.01','2015.01.01']

base = 'https://search.naver.com/search.naver'
link = 'https://search.naver.com/search.naver'
params = {
    'where': 'news',
    'sm': 'tab_pge',
    'query': '',
    'sort': '1',
    'photo': '0',
    'field': '0',
    'pd': '',
    'ds': '',
    'de': '',
    'cluster_rank': '',
    'mynews': '0',
    'office_type': '0',
    'office_section_code': '0',
    'news_office_checked': '',
    'nso': '',
    'start': '',
}

def fetch_content(s,keyword,link,params):
    for start_date,date_upto in zip(year_list_start,year_list_upto):
        ds = start_date.replace(".","")
        de = date_upto.replace(".","")
        params['query'] = keyword
        params['ds'] = ds
        params['de'] = de
        params['nso'] = f'so:r,p:from{ds}to{de},a:all'
        params['start'] = 1

        while True:
            res = s.get(link,params=params)
            print(res.status_code)
            print(res.url)
            soup = BeautifulSoup(res.text,"lxml")
            if not soup.select_one("ul.list_news .news_area .info_group > a.press"): break
            for item in soup.select("ul.list_news .news_area"):
                newspaper_name = item.select_one(".info_group > a.press").get_text(strip=True).lstrip("=")
                print(newspaper_name)

            if soup.select_one("a.btn_next[aria-disabled='true']"): break
            next_page = soup.select_one("a.btn_next").get("href")
            link = urljoin(base,next_page)
            params = None


if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        
        keywords = ['CMG제약']

        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            future_to_url = {executor.submit(fetch_content, s, keyword, link, params): keyword for keyword in keywords}
            concurrent.futures.as_completed(future_to_url)

How can I make the script work when there is more than one keyword in the search list?
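
One plausible cause, based only on the code shown: every worker receives the same params dict, so each thread's params['query'] = keyword overwrites the others and only the last keyword survives. A sketch of the change would be to hand each submission its own copy of the dict, leaving the rest of the script as it is:

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_url = {
        # dict(params) gives every worker a private copy of the shared template
        executor.submit(fetch_content, s, keyword, link, dict(params)): keyword
        for keyword in keywords
    }
    concurrent.futures.as_completed(future_to_url)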



#StackBounty: #python #web-scraping #beautifulsoup Extracting specific elements from websites

Bounty: 50

I have some pieces of information regarding some text. I would like to extract the whole (complete) sentence from the websites. I have two columns, one for the heading/main title (X1), and another for some text included on the webpage (X2). For example:

import pandas as pd

d = {
        'X': [
            'https://en.wikipedia.org/wiki/Manchester_United_F.C.',
            'https://docs.python.org/3/library/email.message.html'
            ], 
        'X1': [
            'Manchester United F.C.',
            'email.message: Representing an email message — Python'
            ] , 
        'X2': [
            'Manchester United Football Club is a professional football club based in Old Trafford, Greater Manchester, England, that competes in the Premier League, the',
            'The payload is either a string or bytes object, in the case of simple message objects, or a list of EmailMessage objects, for MIME container documents such as'
            ]
    }

df = pd.DataFrame(data=d)

I need to extract the whole (complete) sentence for each element in X1 and X2. In case it is not possible to extract the information, I would like to just add an empty element.

My final dataset would have 5 columns, X, X1, X2, X3, and X4, where

X3=['Manchester United F.C.','email.message: Representing an email message']
X4=['Manchester United Football Club is a professional football club based in Old Trafford, Greater Manchester, England, that competes in the Premier League, the top flight of English football.','The payload is either a string or bytes object, in the case of simple message objects, or a list of EmailMessage objects, for MIME container documents such as multipart/* and message/rfc822 message objects.'] 

I did as follows:

from bs4 import BeautifulSoup
import requests
    
for pg in df['X'].tolist():
    page = requests.get(pg)
    soup = BeautifulSoup(page.content, "html.parser")

The first heading (for Manchester United) is an id: <h1 id="firstHeading" class="firstHeading">Manchester United F.C.</h1>, so probably I should do results = soup.find(id='firstHeading'), while the element in X2 that I would like to extract is within BodyContent (if I am right).
In the second case, the title is within a span, <span id="email-message-representing-an-email-message"></span>, and the text is within the class body.

I have approximately 50 sentences like these and I can't do it manually. I know that the format might change, and, to avoid error messages in case X1 and X2 do not match anything on the webpage, I would use a try/except and append an empty element to the list (to convert into a column later).

How do I get the desired output?
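
A rough sketch along those lines (the selectors are assumptions taken from the two example pages and will not hold for every site, hence the fallback to empty strings):

import requests
from bs4 import BeautifulSoup

x3, x4 = [], []
for url in df['X']:
    try:
        page = requests.get(url, timeout=10)
        page.raise_for_status()
        soup = BeautifulSoup(page.content, "html.parser")
        heading = soup.select_one("#firstHeading") or soup.select_one("h1")
        paragraph = soup.select_one("#bodyContent p") or soup.select_one("p")
        x3.append(heading.get_text(strip=True) if heading else "")
        x4.append(paragraph.get_text(" ", strip=True) if paragraph else "")
    except requests.RequestException:
        x3.append("")
        x4.append("")

df['X3'] = x3
df['X4'] = x4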



#StackBounty: #python #python-3.x #web-scraping Take information from a webpage and compare to previous request

Bounty: 50

I have made some improvements since my previous code review. I have taken that knowledge to upgrade and become a better coder, but now I'm here again asking for a code review of what I think could be better.

The purpose of this code is to monitor a specific site at a random interval of 30 to 120 seconds. If there has been a change, it goes through some if statements, as you can see, and then posts to my Discord that a change has been made.

This is what I have created:

monitoring.py

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import concurrent.futures
import random
import time
from datetime import datetime, timedelta
from typing import Any, Dict, List

import pendulum
from loguru import logger

from scrape_values import Product

store: str = "shelta"
link: str = "https://shelta.se/sneakers/nike-air-zoom-type-whiteblack-cj2033-103"

# -------------------------------------------------------------------------
# Utils
# -------------------------------------------------------------------------
_size_filter: Dict[str, datetime] = {}


def monitor_stock():
    """
    Function that checks if there has happen a restock or countdown change on the website
    """
    payload = Product.from_page(url=link).payload

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        while True:

            # Request for new product information
            new_payload = Product.from_page(url=link).payload

            # Release sleep
            release_date_sleeper(new_payload)

            # Check countdown timer comparision
            if countdown_timer_comparision(payload, new_payload):
                # Send notification to discord
                executor.submit(send_notification, new_payload, "Timer change!")
                # Replace list
                payload["displayCountDownTimer"] = new_payload["displayCountDownTimer"]

            # Check sizes comparision
            if sizes_comparision(payload, new_payload):
                # Send notification to discord
                executor.submit(send_notification, new_payload, "Restock!")
                # Replace list
                payload["sizes"] = new_payload["sizes"]

            else:
                # No changes happen
                logger.info("No changes made")

                payload["sizes"] = new_payload["sizes"]
                time.sleep(random.randint(30, 120))


def release_date_sleeper(payload) -> None:
    """
    Check if there is a release date on the website. We should sleep if there is to save resources
    :param payload:
    """
    if payload.get('releaseDate'):
        delta_seconds = (payload["releaseDate"].subtract(seconds=10)) - pendulum.now()
        if not delta_seconds.seconds:
            logger.info(f'Release date enabled | Will sleep to -> {(payload["releaseDate"].subtract(seconds=10)).to_datetime_string()}')
            time.sleep(delta_seconds.seconds)


def countdown_timer_comparision(payload, new_payload) -> bool:
    """
    Compare the first requests with the latest request and see if the countdown timer has been changed on the website
    :param payload: First request made
    :param new_payload: Latest request made
    :return: bool
    """
    if new_payload.get("displayCountDownTimer") and payload["displayCountDownTimer"] != new_payload[
        "displayCountDownTimer"]:
        logger.info(f'Detected new timer change -> Name: {new_payload["name"]} | Display Time: {new_payload["displayCountDownTimer"]}')
        return True


def sizes_comparision(payload, new_payload) -> bool:
    """
    Compare the first requests with the latest request and see if the sizes has been changed on the website
    :param payload: First request made
    :param new_payload: Latest request made
    :return: bool
    """
    if payload["sizes"] != new_payload["sizes"]:
        if spam_filter(new_payload["delay"], new_payload["sizes"]):
            logger.info(f'Detected restock -> Name: {new_payload["name"]} | Sizes: {new_payload["sizes"]}')
            return True


def send_notification(payload, status) -> Any:
    """
    Send to discord
    :param payload: Payload of the product
    :param status: Type of status that being sent to discord
    """
    payload["status"] = status
    payload["keyword"] = True
    # FIXME: call create_embed(payload) for post to discord
    # See more here https://codereview.stackexchange.com/questions/260043/creating-embed-for-discord-reading-from-dictionary


def spam_filter(delay: int, requests: List[str]) -> List[str]:
    """
    Filter requests to only those that haven't been made previously within our defined cooldown period.

    :param delay: Delta seconds
    :param requests:
    :return:
    """
    # Get filtered set of requests.
    filtered = [
        r for r in list(set(requests))
        if (
              r not in _size_filter
                or datetime.now() - _size_filter[r] >= timedelta(seconds=delay)
        )
    ]
    # Refresh timestamps for requests we're actually making.
    for r in filtered:
        _size_filter[r] = datetime.now()

    return filtered


if __name__ == "__main__":
    monitor_stock()

scrape_values.py

import json
import re
from dataclasses import dataclass
from typing import List, Optional

import requests
from bs4 import BeautifulSoup


@dataclass
class Product:
    name: Optional[str] = None
    price: Optional[str] = None
    image: Optional[str] = None
    sizes: List[str] = None

    @staticmethod
    def get_sizes(doc: BeautifulSoup) -> List[str]:
        pat = re.compile(
            r'^<script>var JetshopData='
            r'({.*})'
            r';</script>$',
        )
        for script in doc.find_all('script'):
            match = pat.match(str(script))
            if match is not None:
                break
        else:
            return []

        data = json.loads(match[1])
        return [
            variation
            for get_value in data['ProductInfo']['Attributes']['Variations']
            if get_value.get('IsBuyable')
            for variation in get_value['Variation']
        ]

    @classmethod
    def from_page(cls, url: str) -> Optional['Product']:
        with requests.get(url) as response:
            if not response.ok:
                return None
            doc = BeautifulSoup(response.text, 'html.parser')

        name = doc.select_one('h1.product-page-header')
        price = doc.select_one('span.price')
        image = doc.select_one('meta[property="og:image"]')

        return cls(
            name=name and name.text.strip(),
            price=price and price.text.strip(),
            image=image and image['content'],
            sizes=cls.get_sizes(doc),
        )

    @property
    def payload(self) -> dict:
        return {
            "name": self.name or "Not found",
            "price": self.price or "Not found",
            "image": self.image or "Not found",
            "sizes": self.sizes,
        }

My concern is that I might have done this incorrectly by splitting it into multiple functions that may not be necessary. I'm not sure, and I do hope I will get some good feedback! Looking forward to it.



#StackBounty: #python #python-3.x #web-scraping #python-requests Unable to fetch the rest of the names leading to the next pages from a…

Bounty: 50

I’ve created a script to get different names from this website, filtering State Province to Alabama and Country to United States in the search box. The script can parse the names from the first page. However, I can’t figure out how to get the results from the next pages as well using requests.

There are two options on the page to get all the names. Option one: using the show all 410 link; option two: making use of the next button.

Here is what I’ve tried (it is capable of grabbing names from the first page):

import re
import requests
from bs4 import BeautifulSoup

URL = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"
params = {
    'errorpath': '/CCI/Verify/CCI/Credential_Verification.aspx'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    r = s.get(URL)
    
    params['WebsiteKey'] = re.search(r"gWebsiteKey[^']+'(.*?)'",r.text).group(1)
    params['hkey'] = re.search(r"gHKey[^']+'(.*?)'",r.text).group(1)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input4$DropDown1'] = 'AL'
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input5$DropDown1'] = 'United States'
    
    r = s.post(URL,params=params,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)
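
From what I understand of ASP.NET paging, the general pattern is to pull the __doPostBack target out of the "next" / "show all" link and re-post the form with __EVENTTARGET set to it. A rough, untested sketch continuing inside the session above — the selector and the assumption that the link uses __doPostBack are guesses, not verified against this site:

    # Untested sketch: locate a paging link and replay the postback it would trigger.
    next_link = soup.select_one("a[href*='__doPostBack']")  # hypothetical selector
    if next_link:
        target = re.search(r"__doPostBack\('([^']+)'", next_link['href']).group(1)
        payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
        payload['__EVENTTARGET'] = target
        payload['__EVENTARGUMENT'] = ''
        r = s.post(URL, params=params, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
            print(item.text)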

In case someone suggests a solution based on selenium: I’ve already found success with that approach, but I’m not willing to go that route:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"

with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,15)

    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input4_DropDown1']")))).select_by_value("AL")
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input5_DropDown1']")))).select_by_value("United States")
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='SubmitButton']"))).click()
    wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(.,'show all')]"))).click()
    wait.until(EC.invisibility_of_element_located((By.XPATH, "//span[@id='ctl01_LoadingLabel' and .='Loading']")))
    soup = BeautifulSoup(driver.page_source,"lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

How can I get the rest of the names from the next pages of that webpage using the requests module?


Get this bounty!!!

#StackBounty: #python #python-3.x #selenium #web-scraping #python-requests Can't find the right way to grab part numbers from a web…

Bounty: 50

I’m trying to create a script to parse different part numbers from a webpage using requests. If you open this link and click on the Product list tab, you will see the part numbers. This image shows where the part numbers are.

I’ve tried with:

import requests

link = 'https://www.festo.com/cat/en-id_id/products_ADNH'
post_url = 'https://www.festo.com/cfp/camosHTML5Client/cH5C/HRQ'

payload = {"q":4,"ReqID":21,"focus":"f24~v472_0","scroll":[],"events":["e468~12~0~472~0~4","e468_0~6~472"],"ito":22,"kms":4}

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['referer'] = 'https://www.festo.com/cfp/camosHTML5Client/cH5C/go?q=2'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    r = s.post(post_url,data=payload)
    print(r.json())

When I execute the above script, I get the following result:

{'isRedirect': True, 'url': '../../camosStatic/Exception.html'}

How can I fetch the part numbers from that site using requests?
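
One detail worth flagging in the attempt above (untested against this endpoint): with data=payload the dict is form-encoded, even though the content-type header declares JSON, so the server may never receive a JSON body. Sending it as JSON inside the same session would look like this:

    # Untested variation: let requests serialise the dict to a JSON body.
    r = s.post(post_url, json=payload)
    print(r.json())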

With selenium, I tried the approach below to fetch the part numbers, but the script can’t click on the Product list tab if I remove the hardcoded delay. I don’t want to rely on any hardcoded delay within the script.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
link = 'https://www.festo.com/cat/en-id_id/products_ADNH'
 
with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,15)
    wait.until(EC.frame_to_be_available_and_switch_to_it(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "object")))))
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#btn-group-cookie > input[value='Accept all cookies']"))).click()
    driver.switch_to.default_content()
    wait.until(EC.frame_to_be_available_and_switch_to_it(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "iframe#CamosIFId")))))
    
    time.sleep(10)   #I would like to get rid of this hardcoded delay
    
    item = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
    driver.execute_script("arguments[0].click();",item)
    for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-ctcwgtname='tabTable'] [id^='v471_']")))[1:]:
        print(elem.text)
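
One untested idea for dropping the fixed sleep is to wait until the tab element itself is clickable instead of sleeping; whether that is enough for this particular page (the embedded configurator may still be initialising after the element appears) is not something I have verified:

    # Untested replacement for time.sleep(10): wait for the tab to become clickable.
    item = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
    driver.execute_script("arguments[0].click();", item)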


Get this bounty!!!

#StackBounty: #python #selenium #web-scraping #scrapy #web-crawler how to run spider multiple times with different input

Bounty: 50

I’m trying to scrape information from different sites about some products. Here is the structure of my program:

product_list = ['iPad', 'iPhone', 'AirPods', ...]

def spider_tmall(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])
    # ...


def spider_jd(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])
    # ...

if __name__ == '__main__':

    for a in range(len(product_list)):
        process = CrawlerProcess(settings={
            "FEEDS": {
                "itemtmall.csv": {"format": "csv",
                                  'fields': ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall'], },
                "itemjd.csv": {"format": "csv",
                               'fields': ['product_name_jd', 'product_price_jd', 'product_discount_jd'], },
            },
        })

        process.crawl(tmallSpider)
        process.crawl(jdSpider)
        process.start()

Basically, I want to run all spiders for all inputs in product_list. Right now, my program only runs through all spiders once (in this case, it does the job for iPad), then it raises a ReactorNotRestartable error and terminates. Does anybody know how to fix it?
Also, my overall goal is to run the spiders multiple times; the input doesn’t necessarily have to be a list. It can be a CSV file or something else. Any suggestion would be appreciated!
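
From what I’ve read, the usual workaround seems to be to schedule every crawl before a single start() call, since the Twisted reactor cannot be restarted. A rough sketch of that pattern, assuming the tmallSpider / jdSpider classes above and a hypothetical product keyword argument that the spiders would have to handle in __init__ (Scrapy forwards crawl() keyword arguments to the spider constructor):

from scrapy.crawler import CrawlerProcess

product_list = ['iPad', 'iPhone', 'AirPods']

process = CrawlerProcess(settings={
    "FEEDS": {
        "itemtmall.csv": {"format": "csv"},
        "itemjd.csv": {"format": "csv"},
    },
})
for product in product_list:
    # `product` is a hypothetical spider argument; each spider would read it in __init__.
    process.crawl(tmallSpider, product=product)
    process.crawl(jdSpider, product=product)
process.start()  # called exactly once; blocks until all scheduled crawls finish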


Get this bounty!!!