#StackBounty: #python #selenium #web-scraping #scrapy #web-crawler how to run spider multiple times with different input

Bounty: 50

I’m trying to scrape information from different sites about some products. Here is the structure of my program:

product_list = ['iPad', 'iPhone', 'AirPods', ...]

def spider_tmall(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])

# ...


def spider_jd(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])

# ...

if __name__ == '__main__':

    for a in range(len(product_list)):
        process = CrawlerProcess(settings={
            "FEEDS": {
                "itemtmall.csv": {"format": "csv",
                                  "fields": ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall']},
                "itemjd.csv": {"format": "csv",
                               "fields": ['product_name_jd', 'product_price_jd', 'product_discount_jd']},
            },
        })

        process.crawl(tmallSpider)
        process.crawl(jdSpider)
        process.start()

Basically, I want to run all the spiders for every input in product_list. Right now, my program only runs through all the spiders once (in this case, it does the job for iPad), then a ReactorNotRestartable error is raised and the program terminates. Does anybody know how to fix it?
Also, my overall goal is to run the spiders multiple times, and the input doesn’t necessarily have to be a list. It could be a CSV file or something else. Any suggestion would be appreciated!
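
One common workaround is to schedule every crawl on a single CrawlerProcess and call start() exactly once, because the Twisted reactor cannot be restarted. Below is a minimal sketch of that idea; the keyword argument passed to crawl() is an assumption for illustration (each spider would read it as self.keyword instead of indexing a shared list), not part of the original code.

from scrapy.crawler import CrawlerProcess

product_list = ['iPad', 'iPhone', 'AirPods']

process = CrawlerProcess(settings={
    "FEEDS": {
        "itemtmall.csv": {"format": "csv"},
        "itemjd.csv": {"format": "csv"},
    },
})

for product in product_list:
    # crawl() only schedules the spider; nothing runs until process.start().
    # Scrapy forwards the keyword argument to the spider constructor, so it
    # becomes self.keyword inside tmallSpider and jdSpider.
    process.crawl(tmallSpider, keyword=product)
    process.crawl(jdSpider, keyword=product)

process.start()  # blocks here and runs every scheduled crawl in one reactor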


Get this bounty!!!

#StackBounty: #python-3.x #scrapy #format #placeholder #pymysql Why can't replace placeholder with format function in pymysql?

Bounty: 50

This is how I created the table mingyan:

CREATE TABLE `mingyan` (
  `tag` varchar(10) DEFAULT NULL,
  `cont` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

It’s said that the string format() function with {} is a more Pythonic way than the % placeholder.
In my Scrapy project I write some fields into the table mingyan:

self.cursor.execute("insert into mingyan(tag, cont) values (%s, %s)",(item['tag'],item['cont']))

It works fine in my Scrapy project. Now I replace the placeholder approach with the string format() function:

self.cursor.execute("insert into mingyan(tag, cont) values ({},{})".format(item['tag'],item['cont']))

Scrapy got this error:

pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; 

Why can’t I replace the placeholder with the format() function in pymysql?

The item is described in the Scrapy documentation:
item meaning in scrapy
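
For comparison, here is a minimal sketch (with made-up values, not the asker's data) of why the two calls behave differently: .format() pastes the raw values into the SQL text without quoting or escaping them, while the DB-API %s placeholder lets pymysql quote and escape the values itself.

item = {'tag': 'life', 'cont': "it's short"}

# .format() builds the literal statement
#   insert into mingyan(tag, cont) values (life,it's short)
# -> the values are unquoted and contain a stray quote, so MySQL
#    reports syntax error 1064.
broken_sql = "insert into mingyan(tag, cont) values ({},{})".format(
    item['tag'], item['cont'])

# With %s, pymysql substitutes the values itself, adding quotes and
# escaping special characters (which also prevents SQL injection):
# cursor.execute("insert into mingyan(tag, cont) values (%s, %s)",
#                (item['tag'], item['cont']))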


Get this bounty!!!

#StackBounty: #python #scrapy deny certain links in scrapy linkextractor

Bounty: 50

with open('/home/timmy/myamazon/bannedasins.txt') as f:
    banned_asins = f.read().split('\n')

class AmazonSpider(CrawlSpider):

    name = 'amazon'
    allowed_domains = ['amazon.com',]

    rules = (
            Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
            Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
            process_value= lambda i:f"https://www.amazon.com/dp/{re.search('dp/(.*)/',i).groups()[0]}"),
            callback="parse_item"),
            )

I have the following two rules to extract Amazon product links, and they work correctly. Now I want to remove some ASINs from the search: re.search('dp/(.*)/',i).groups()[0] retrieves the ASIN and places it in the format https://www.amazon.com/dp/{ASIN}. What I want to do is: if the ASIN is in banned_asins, do not extract the link.

After reading the Link Extractors section of the Scrapy docs, I believe it’s done with deny_extensions, but I’m not sure how to use it.

banned_asins= ['B07RTX74L7','B07D9JCH5X',......]
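
One way this could be done (a sketch of an assumption, not a confirmed answer: deny_extensions filters file extensions, whereas the filtering below relies on process_value returning None, which makes the LinkExtractor drop the link) is to move the ASIN check into the process_value callable:

import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

banned_asins = ['B07RTX74L7', 'B07D9JCH5X']

def build_product_url(value):
    match = re.search(r'dp/(.*)/', value)
    if not match:
        return None                     # not a product link, ignore it
    asin = match.groups()[0]
    if asin in banned_asins:
        return None                     # banned ASIN: the link is dropped
    return f"https://www.amazon.com/dp/{asin}"

class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                           process_value=build_product_url),
             callback="parse_item"),
    )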


Get this bounty!!!

#StackBounty: #python #python-3.x #selenium #web-scraping #scrapy Can't make my script keep trying with different proxies until it …

Bounty: 50

I’ve written a script in Scrapy in combination with Selenium to make proxied requests using freshly generated proxies from the get_proxies() method. I used the requests module to fetch the proxies so the script can reuse them. What I’m trying to do is parse all the post links from the landing page and then fetch each title from its target page.

My script works inconsistently: when the get_random_proxy function produces a usable proxy, the script works; otherwise it fails miserably.

How can I make my script keep trying with different proxies until it runs successfully?

Here is what I’ve written so far:

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def get_proxies():   
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text,"lxml")
    proxies = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies

def get_random_proxy(proxy_vault):
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url

def start_script():
    proxy = get_proxies()
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={get_random_proxy(proxy)}')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

class StackBotSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = [
        'https://stackoverflow.com/questions/tagged/web-scraping'
    ]

    def __init__(self):
        self.driver = start_script()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self,response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink"))):
            yield scrapy.Request(elem.get_attribute("href"),callback=self.parse_details)

    def parse_details(self,response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a"))):
            yield {"post_title":elem.text}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',   
})
c.crawl(StackBotSpider)
c.start()
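
One possible direction (a minimal sketch under the assumption that the retry happens around the Selenium page load rather than inside Scrapy’s downloader; the helper name fetch_with_retry is made up) is to keep the fetched proxy list and rebuild the driver with the next proxy whenever a load fails or times out:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

def fetch_with_retry(url, proxies, timeout=10):
    for proxy in proxies:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument(f'--proxy-server={proxy}')
        driver = webdriver.Chrome(options=chrome_options)
        driver.set_page_load_timeout(timeout)
        try:
            driver.get(url)
            return driver              # this proxy works; reuse the driver
        except (TimeoutException, WebDriverException):
            driver.quit()              # dead or blocked proxy: try the next
    raise RuntimeError("no working proxy found")

The spider’s parse() and parse_details() could then call fetch_with_retry(response.url, get_proxies()) instead of self.driver.get(response.url).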


Get this bounty!!!

#StackBounty: #python #web-scraping #scrapy Scrapy: follow external links only

Bounty: 50

With OffsiteMiddleware you can control how to follow external links in Scrapy.

I want the spider to ignore all internal links on a site and follow external links only.

Dynamic rules to add the response URL domain to deny_domains didn’t work.

Can you override get_host_regex in OffsiteMiddleware to filter out all onsite links? Any other way?

Clarification: I want the spider to ignore the domains defined in allowed_domains and all internal links on each domain crawled. So the domain of every URL followed by the spider must be ignored when the spider is on that URL. In other words: When the crawler reaches a site like example.com, I want it to ignore any links on example.com and only follow external links to sites that are not on example.com.
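
One possible approach (a minimal sketch, not a tested answer; the class name OnsiteFilterMiddleware is made up) is a spider middleware that drops every request whose domain matches the domain of the response it was extracted from, so only offsite links survive:

from urllib.parse import urlparse

from scrapy import Request

class OnsiteFilterMiddleware:
    """Drop requests that stay on the same domain as their source response."""

    def process_spider_output(self, response, result, spider):
        current_domain = urlparse(response.url).netloc
        for item in result:
            if isinstance(item, Request) and urlparse(item.url).netloc == current_domain:
                continue               # internal link: filter it out
            yield item

It would still have to be enabled in SPIDER_MIDDLEWARES, and allowed_domains left unset (or the stock OffsiteMiddleware disabled), otherwise the default middleware would filter the external requests before they are ever made.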


Get this bounty!!!