#StackBounty: #python #python-3.x #web-scraping #python-requests Unable to fetch the rest of the names leading to the next pages from a…

Bounty: 50

I’ve created a script to get different names from this website, filtering State Province to Alabama and Country to United States in the search box. The script can parse the names from the first page, but I can’t figure out how to get the results from the next pages as well using requests.

There are two options on the page to get all the names: option one, using the show all 410 link, and option two, making use of the next button.

I’ve tried with the following (capable of grabbing names from the first page):

import re
import requests
from bs4 import BeautifulSoup

URL = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"
params = {
    'errorpath': '/CCI/Verify/CCI/Credential_Verification.aspx'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    r = s.get(URL)
    
    # pull the keys the page embeds in inline JavaScript
    params['WebsiteKey'] = re.search(r"gWebsiteKey[^']+'(.*?)'",r.text).group(1)
    params['hkey'] = re.search(r"gHKey[^']+'(.*?)'",r.text).group(1)
    soup = BeautifulSoup(r.text,"lxml")
    # rebuild the form from every named input, then set the two search filters
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input4$DropDown1'] = 'AL'
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input5$DropDown1'] = 'United States'
    
    r = s.post(URL,params=params,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

In case someone suggests a selenium-based solution: I’ve already succeeded with that approach, but I’m not willing to go that route:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"

with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,15)

    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input4_DropDown1']")))).select_by_value("AL")
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input5_DropDown1']")))).select_by_value("United States")
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='SubmitButton']"))).click()
    wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(.,'show all')]"))).click()
    wait.until(EC.invisibility_of_element_located((By.XPATH, "//span[@id='ctl01_LoadingLabel' and .='Loading']")))
    soup = BeautifulSoup(driver.page_source,"lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

How can I get the rest of the names from that webpage leading to the next pages using requests module?
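For reference, the generic WebForms/Telerik pattern for paging with requests looks like the sketch below, continuing the session from the script above: re-read every hidden input from each response (the server rotates __VIEWSTATE and __EVENTVALIDATION on every round trip) and post back with __EVENTTARGET set to the pager control. The pager control name here is a placeholder, not the real one; it would have to be read out of the next-page link or the show all link in the first response. Untested sketch under those assumptions:

# Continuation of the session above (untested sketch).
PAGER_TARGET = 'ctl01$TemplateBody$...$ResultsGrid'   # hypothetical; read the real
                                                       # __doPostBack target from the
                                                       # "next"/"show all" link in the HTML

def form_fields(soup):
    # WebForms rotates __VIEWSTATE/__EVENTVALIDATION on every response,
    # so rebuild the whole payload from the latest page.
    return {i['name']: i.get('value', '') for i in soup.select('input[name]')}

for _ in range(10):                      # the stop condition is also site-specific
    payload = form_fields(soup)
    payload['__EVENTTARGET'] = PAGER_TARGET
    payload['__EVENTARGUMENT'] = ''
    r = s.post(URL, params=params, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    names = soup.select("table.rgMasterTable > tbody > tr a[title]")
    if not names:
        break
    for item in names:
        print(item.text)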


Get this bounty!!!

#StackBounty: #python #python-3.x #selenium #web-scraping #python-requests Can't find the right way to grab part numbers from a web…

Bounty: 50

I’m trying to create a script to parse different part numbers from a webpage using requests. If you open this link and click on the Product list tab, you will see the part numbers. This image shows where the part numbers are.

I’ve tried with:

import requests

link = 'https://www.festo.com/cat/en-id_id/products_ADNH'
post_url = 'https://www.festo.com/cfp/camosHTML5Client/cH5C/HRQ'

payload = {"q":4,"ReqID":21,"focus":"f24~v472_0","scroll":[],"events":["e468~12~0~472~0~4","e468_0~6~472"],"ito":22,"kms":4}

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['referer'] = 'https://www.festo.com/cfp/camosHTML5Client/cH5C/go?q=2'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    r = s.post(post_url,data=payload)
    print(r.json())

When I execute the above script, I get the following result:

{'isRedirect': True, 'url': '../../camosStatic/Exception.html'}

How can I fetch the part numbers from that site using requests?

With selenium, I tried the following to fetch the part numbers, but the script can’t click on the Product list tab if I remove the hardcoded delay. I don’t wish to keep any hardcoded delay in the script.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
link = 'https://www.festo.com/cat/en-id_id/products_ADNH'
 
with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,15)
    wait.until(EC.frame_to_be_available_and_switch_to_it(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "object")))))
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#btn-group-cookie > input[value='Accept all cookies']"))).click()
    driver.switch_to.default_content()
    wait.until(EC.frame_to_be_available_and_switch_to_it(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "iframe#CamosIFId")))))
    
    time.sleep(10)   #I would like to get rid of this hardcoded delay
    
    item = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
    driver.execute_script("arguments[0].click();",item)
    for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-ctcwgtname='tabTable'] [id^='v471_']")))[1:]:
        print(elem.text)
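One way to drop the fixed time.sleep(10) is to wait on a concrete condition instead, for example that the tab element is clickable and that the part-number cells actually carry text. A minimal sketch reusing the selectors from the script above (whether these conditions are sufficient for this particular page is an assumption):

# Instead of time.sleep(10): block until the tab element is actually clickable.
# EC.element_to_be_clickable retries until the element is visible and enabled.
item = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
driver.execute_script("arguments[0].click();", item)

# If the click still lands too early, a second guard is to wait until the
# part-number cells themselves contain text before reading them.
wait.until(lambda d: any(
    e.text.strip()
    for e in d.find_elements(By.CSS_SELECTOR, "[data-ctcwgtname='tabTable'] [id^='v471_']")
))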


Get this bounty!!!
#StackBounty: #python #selenium #web-scraping #scrapy #web-crawler how to run spider multiple times with different input

Bounty: 50

I’m trying to scrape information from different sites about some products. Here is the structure of my program:

product_list = ['iPad', 'iPhone', 'AirPods', ...]   # inputs to search for

# inside the tmall spider (tmallSpider)
def spider_tmall(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])
    # ...

# inside the JD spider (jdSpider)
def spider_jd(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])
    # ...

if __name__ == '__main__':

    for a in range(len(product_list)):
        process = CrawlerProcess(settings={
            "FEEDS": {
                "itemtmall.csv": {"format": "csv",
                                  'fields': ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall']},
                "itemjd.csv": {"format": "csv",
                               'fields': ['product_name_jd', 'product_price_jd', 'product_discount_jd']},
            },
        })

        process.crawl(tmallSpider)
        process.crawl(jdSpider)
        process.start()

Basically, I want to run all the spiders for all the inputs in product_list. Right now, my program only runs through all spiders once (in this case, it does the job for iPad); then a ReactorNotRestartable error is raised and the program terminates. Does anybody know how to fix it?
Also, my overall goal is to run the spiders multiple times; the input doesn’t necessarily have to be a list. It could be a CSV file or something else. Any suggestion would be appreciated!
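One commonly suggested pattern for avoiding ReactorNotRestartable is to queue every crawl first and start the Twisted reactor exactly once, passing the search term in as a spider argument. A minimal sketch, assuming tmallSpider and jdSpider are importable and accept a keyword argument (e.g. they read self.keyword instead of indexing a shared list):

from scrapy.crawler import CrawlerProcess

product_list = ['iPad', 'iPhone', 'AirPods']

process = CrawlerProcess(settings={
    "FEEDS": {
        "itemtmall.csv": {"format": "csv"},
        "itemjd.csv": {"format": "csv"},
    },
})

# Schedule every (spider, keyword) combination first; crawl() only queues work.
for keyword in product_list:
    process.crawl(tmallSpider, keyword=keyword)   # spiders would read self.keyword
    process.crawl(jdSpider, keyword=keyword)

# Then start the reactor exactly once.
process.start()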


Get this bounty!!!

#StackBounty: #r #reactjs #web-scraping #leaflet #rselenium Extracting underlying data via RSelenium with embedded leaflet svg, and more

Bounty: 200

I would like to extract information about each ad at this link. I’ve got to the stage where I can automatically click See Ad Details, but there is a lot of underlying data that is not straightforward to wrangle into a neat dataframe.

library(RSelenium)
rs <- rsDriver()
remote <- rs$client
remote$navigate(
  paste0(
    "https://www.facebook.com/ads/library/?", 
    "active_status=all&ad_type=political_and_issue_ads&country=US&", 
    "impression_search_field=has_impressions_lifetime&", 
    "q=actblue&view_all_page_id=38471053686"
  )
)

test <- remote$findElement(using = "xpath", "//*[@class='_7kfh']")
test$clickElement()
## Manually figured out element
test <- remote$findElement(using = "xpath", "//*[@class='_7lq0']")
test$getElementText()

The output text itself is messy, but I believe that with some time and effort it can be wrangled into something useful. The problem is wrangling the underlying data in

  1. the graph, which seems to be just an image, and
  2. leaflet svg, which displays data when a cursor hovers over it.

I am at a loss as to how to systematically extract this image and especially the leaflet svg. How would I take each ad and then extract the full data available in its details in this case?


Get this bounty!!!

#StackBounty: #node.js #web-scraping #puppeteer net::ERR_TOO_MANY_REDIRECTS if I want to open website in puppeteer

Bounty: 50

When I try to open a Washington Post article in puppeteer, I get this error:

Error: net::ERR_TOO_MANY_REDIRECTS

internal/process/warning.js:27 (node:70458) UnhandledPromiseRejectionWarning: Error: net::ERR_TOO_MANY_REDIRECTS at https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/
    at navigate (.../node_modules/puppeteer/lib/FrameManager.js:121:37)
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
  -- ASYNC --
    at Frame.<anonymous> (.../node_modules/puppeteer/lib/helper.js:111:15)
    at Page.goto (.../node_modules/puppeteer/lib/Page.js:674:49)
    at Page.<anonymous> (.../node_modules/puppeteer/lib/helper.js:112:23)
    at .../app.js:255:14
    at processTicksAndRejections (internal/process/task_queues.js:89:5)

app.js

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/');

  const html = await page.content()
  console.log(html)
  await browser.close()
})()

I tried to abort navigation redirects based on this, but then I get Error: net::ERR_FAILED

internal/process/warning.js:27 (node:70488) UnhandledPromiseRejectionWarning: Error: net::ERR_FAILED at https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/
    at navigate (.../node_modules/puppeteer/lib/FrameManager.js:121:37)
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
  -- ASYNC --
    at Frame.<anonymous> (.../node_modules/puppeteer/lib/helper.js:111:15)
    at Page.goto (.../node_modules/puppeteer/lib/Page.js:674:49)
    at Page.<anonymous> (.../node_modules/puppeteer/lib/helper.js:112:23)
    at .../app.js:247:14
    at processTicksAndRejections (internal/process/task_queues.js:89:5)


Get this bounty!!!

#StackBounty: #python #python-3.x #web-scraping Trouble getting desired response issuing a post requests

Bounty: 50

I’ve created a script in Python to issue a POST request and get a 200 status code, but when I run my script I get 403 instead. As far as I can tell, I followed the way the request is sent in Chrome dev tools.

To do it manually, go to that page, select 6 as the size and then hit the add to cart button.

How can I do the same using the script below?

Webpage address

I’ve tried with:

import requests
from bs4 import BeautifulSoup

main_url = 'https://www.footlocker.co.uk/en/homepage'
post_url = 'https://www.footlocker.co.uk/en/addtocart?'

params = {
    'SynchronizerToken': '',
    'Ajax': True,
    'Relay42_Category': 'Product Pages',
    'acctab-tabgroup-314207586604090': None,
    'Quantity_314207586604070': '1',
    'SKU': '314207586604070'
}

with requests.Session() as s:
    r = s.get(main_url)
    soup = BeautifulSoup(r.text,"lxml")

    #parsing token to reuse within data
    token = soup.select_one("[name='SynchronizerToken']")['value']

    params['SynchronizerToken'] = token

    res = s.post(post_url,params=params,data=params,headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.footlocker.co.uk/en/p/nike-signal-dmsx-men-shoes-73190?v=314207586604',
        'accept': 'application/json, text/javascript, */*; q=0.01'
        })
    print(res.status_code)

Current status:

403

Expected status:

200
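One thing that may be worth checking, purely as a hypothesis: the SynchronizerToken and session cookies could be tied to the product page rather than the homepage, so fetching the page named in the referer first and reusing its token might behave differently. A hedged, untested sketch (params and post_url as defined in the script above):

# Hypothesis (untested): fetch the product page itself so the session picks up
# page-scoped cookies, then reuse its SynchronizerToken for the add-to-cart call.
product_page = 'https://www.footlocker.co.uk/en/p/nike-signal-dmsx-men-shoes-73190?v=314207586604'

with requests.Session() as s:
    s.headers.update({
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    })
    r = s.get(product_page)
    soup = BeautifulSoup(r.text, "lxml")
    params['SynchronizerToken'] = soup.select_one("[name='SynchronizerToken']")['value']

    res = s.post(post_url, params=params, data=params, headers={
        'x-requested-with': 'XMLHttpRequest',
        'referer': product_page,
        'accept': 'application/json, text/javascript, */*; q=0.01',
    })
    print(res.status_code)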


Get this bounty!!!

#StackBounty: #python #python-3.x #web-scraping #python-requests Unable to scrape a piece of static information from a webpage

Bounty: 50

I’ve created a script in Python to log in to a webpage using credentials and then parse a piece of information, SIGN OUT, from another link (the script is supposed to get redirected to that link) to make sure I did log in.

Website address

I’ve tried with:

import requests
from bs4 import BeautifulSoup

url = "https://member.angieslist.com/gateway/platform/v1/session/login"
link = "https://member.angieslist.com/"

payload = {"identifier":"usename","token":"password"}

with requests.Session() as s:
    s.post(url,json=payload,headers={
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
        "Referer":"https://member.angieslist.com/member/login",
        "content-type":"application/json"
        })
    r = s.get(link,headers={"User-Agent":"Mozilla/5.0"},allow_redirects=True)
    soup = BeautifulSoup(r.text,"lxml")
    login_stat = soup.select_one("button[class*='menu-item--account']").text
    print(login_stat)

When I run the above script, I get this error: AttributeError: 'NoneType' object has no attribute 'text', which means I went wrong somewhere in my login process, as the information I wish to parse, SIGN OUT, is static content.

How can I parse this SIGN OUT information from that webpage?
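Before parsing, it may help to confirm that the login call itself succeeded and that the button exists in the returned HTML at all, since a None from select_one usually means either the login failed or the element is rendered client-side. A small debugging sketch along those lines (same endpoints and selector as above, with placeholder credentials):

import requests
from bs4 import BeautifulSoup

url = "https://member.angieslist.com/gateway/platform/v1/session/login"
link = "https://member.angieslist.com/"
payload = {"identifier": "username", "token": "password"}

with requests.Session() as s:
    login_resp = s.post(url, json=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
        "Referer": "https://member.angieslist.com/member/login",
        "content-type": "application/json",
    })
    # Check the login call first: a non-2xx status or an error body means the
    # later page will never show the signed-in state.
    print(login_resp.status_code, login_resp.text[:200])

    r = s.get(link, headers={"User-Agent": "Mozilla/5.0"}, allow_redirects=True)
    soup = BeautifulSoup(r.text, "lxml")
    button = soup.select_one("button[class*='menu-item--account']")
    if button is None:
        # Either the login failed or the button is injected client-side by JavaScript,
        # in which case it will never be present in the raw HTML that requests sees.
        print("sign-out button not found; final URL:", r.url)
    else:
        print(button.text)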


Get this bounty!!!

#StackBounty: #python #asp.net #web-scraping #python-requests Python Scraping .aspx multiple pages, multiple __VIEWSTATES

Bounty: 50

I’m trying to scrape this site:

http://www.occeweb.com/MOEAsearch/index.aspx

If I search for “A”, I get multiple pages.

I can get the results of the 1st page fine, using:

import requests
from bs4 import BeautifulSoup

url = 'http://www.occeweb.com/MOEAsearch/index.aspx'

r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
vs = soup.find('input',{'id':'__VIEWSTATE'}).attrs['value']
ev = soup.find('input',{'id':'__EVENTVALIDATION'}).attrs['value']

cookies = {
    'ASP.NET_SessionId': 'f1vztt45bdcvzr45jkrbcoru',
}

headers = {
    'Proxy-Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Origin': 'http://www.occeweb.com',
    'Upgrade-Insecure-Requests': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.143 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer': url,
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
}

data = {
  '__EVENTTARGET': 'gvResults',
  '__EVENTARGUMENT': '',
  '__VIEWSTATE': vs,
  '__VIEWSTATEGENERATOR': '2E193097',
  '__EVENTVALIDATION': ev,
  'txtSearch': 'A',
  'StartsEnds': 'rbBeginswith',
  'TxtSearchFirst': '',
  'btnSearch':'Search'
}

r = requests.post(url, headers=headers, cookies=cookies, data=data)
soup = BeautifulSoup(r.text,'html.parser')

However, when I try to use the same __VIEWSTATE and __EVENTVALIDATION for the 2nd page, it doesn’t work.

I have also tried pulling the __VIEWSTATE from the response of the POST request and using that in the subsequent call, no luck.

Note that I am able to get this to work for the first 11 pages of results by simply copying the __VIEWSTATE and __EVENTVALIDATION from Chrome dev tools on page 1 and holding them static (I have to remove 'btnSearch':'Search' for pages after 1 for some reason).

However this static __VIEWSTATE and __EVENTVALIDATION fail on page 12. When I copy the page 12 curl, it works until page 22, then page 32, 42 and so on. So it seems the __VIEWSTATE needs to be updated once every 10 pages or so.

Problem is, the __VIEWSTATE I pull from the result of the POST request does not work, and I can’t GET the updated __VIEWSTATE I need.
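For reference, the generic GridView paging pattern with requests looks like the sketch below: re-read the hidden __ fields from every response and post back with __EVENTTARGET set to the grid and __EVENTARGUMENT set to Page$N. This is untested against this site; if the server still rejects it, the search fields or the session cookie from the initial GET may also need to be carried along.

import requests
from bs4 import BeautifulSoup

url = 'http://www.occeweb.com/MOEAsearch/index.aspx'

def hidden_fields(soup):
    # Re-read the ASP.NET hidden inputs that WebForms rotates on every response.
    return {i['name']: i.get('value', '')
            for i in soup.select('input[type="hidden"][name^="__"]')}

with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')

    # Initial search for "A".
    data = hidden_fields(soup)
    data.update({'txtSearch': 'A', 'StartsEnds': 'rbBeginswith',
                 'TxtSearchFirst': '', 'btnSearch': 'Search'})
    soup = BeautifulSoup(s.post(url, data=data).text, 'html.parser')

    # GridView paging: __EVENTTARGET is the grid, __EVENTARGUMENT is "Page$N".
    for page in range(2, 6):
        data = hidden_fields(soup)
        data.update({'__EVENTTARGET': 'gvResults',
                     '__EVENTARGUMENT': f'Page${page}',
                     'txtSearch': 'A', 'StartsEnds': 'rbBeginswith',
                     'TxtSearchFirst': ''})
        soup = BeautifulSoup(s.post(url, data=data).text, 'html.parser')
        # ... parse the result rows from soup here ...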

Thanks for your help!


Get this bounty!!!

#StackBounty: #web-scraping #extract scraping data from truecaller

Bounty: 50

I have 200K phone numbers and I want to get their city using Truecaller. How do I do that? As you know, Truecaller has a restriction per request.

Somebody did this here:
https://www.phphive.info/324/truecaller-api/

This is my code:

$cookieFile = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'cookies';
$no = $users[0];
$url = "https://www.truecaller.com/api/search?type=4&countryCode=sd&q=" . $no;

$ch = curl_init();
$header = array();
$header[] = 'Content-length: 0';
$header[] = 'Content-type: application/json';
$header[] = 'Authorization: Bearer i03~CNORR-VIOJ2k~Hua_GBt73sKJJmO';

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

$data = curl_exec($ch);
$error = curl_error($ch);
curl_close($ch);

$data = json_decode($data, true);
$name = $data['data'][0]['name'];
$altname = $data['data'][0]['altName'];
$gender = $data['data'][0]['gender'];
$about = $data['data'][0]['about'];


Get this bounty!!!