How to make a Selenium script faster?


I use Python with Selenium and Scrapy to crawl a website, but my script is very slow:

Crawled 1 pages (at 1 pages/min)

I use CSS selectors instead of XPath to optimise the time, and I changed the middlewares:

'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,

Is Selenium just too slow, or should I change something in the settings?

My code:

from pyvirtualdisplay import Display
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, callback=self.parse)

def parse(self, response):
    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Firefox()
    driver.get("http://www.example.com")
    inputElement = driver.find_element_by_name("OneLineCustomerAddress")
    inputElement.send_keys("75018")
    inputElement.submit()
    catNums = driver.find_elements_by_css_selector("html body div#page div#main.content div#sContener div#menuV div#mvNav nav div.mvNav.bcU div.mvNavLk form.jsExpSCCategories ul.mvSrcLk li")
    # INIT
    driver.find_element_by_css_selector(".mvSrcLk>li:nth-child(1)>label.mvNavSel.mvNavLvl1").click()
    for catNumber in xrange(1, len(catNums) + 1):
        print "\n IN catnumber \n"
        driver.find_element_by_css_selector("ul#catMenu.mvSrcLk> li:nth-child(%s)> label.mvNavLvl1" % catNumber).click()
        time.sleep(5)
        self.parse_articles(driver)
        pages = driver.find_elements_by_xpath('//*[@class="pg"]/ul/li[last()]/a')
        if pages:
            page = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')
            checkText = page.text.strip()
            if len(checkText) > 0:
                pageNums = int(page.text) - 1
                for pageNumbers in range(pageNums):
                    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "waitingOverlay")))
                    driver.find_element_by_css_selector('.jsNxtPage.pgNext').click()
                    self.parse_articles(driver)
                    time.sleep(5)

def parse_articles(self, driver):
    test = driver.find_elements_by_css_selector('html body div#page div#main.content div#sContener div#sContent div#lpContent.jsTab ul#lpBloc li div.prdtBloc p.prdtBDesc strong.prdtBCat')

def between(self, value, a, b):
    pos_a = value.find(a)
    if pos_a == -1:
        return ""
    pos_b = value.rfind(b)
    if pos_b == -1:
        return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b:
        return ""
    return value[adjusted_pos_a:pos_b]

Your code has a few flaws here.

  1. You use Selenium to parse the page contents, when Scrapy Selectors are faster and more efficient.
  2. You start a webdriver for every response.

This can be resolved very elegantly by using Scrapy's downloader middlewares!
You want to create a custom downloader middleware that downloads requests using Selenium rather than the Scrapy downloader.

For example, I use this:

# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumDownloader(object):

    def create_driver(self):
        """only start the driver if the middleware is ever called"""
        if not getattr(self, 'driver', None):
            self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # this is called for every request, but we don't want to render
        # every request in selenium, so use a meta key for those we do want
        if not request.meta.get('selenium', False):
            return None  # fall through to scrapy's default downloader
        self.create_driver()
        self.driver.get(request.url)
        return HtmlResponse(request.url, body=self.driver.page_source, encoding='utf-8')

Activate your middleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumDownloader': 13,
}

Then in your spider you can specify which URLs to download via the Selenium driver by adding a meta argument.

# you can start with selenium
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={'selenium': True})

def parse(self, response):
    # this response was rendered by selenium!
    # you can also skip selenium for other requests if you wish
    url = response.xpath("//a/@href").extract_first()
    yield scrapy.Request(url)

The advantage of this approach is that your driver is started only once and used only to download the page source; the rest is left to proper asynchronous Scrapy tools.

The disadvantage is that you cannot click buttons and such, since you are not exposed to the driver. Most of the time you can reverse engineer what the buttons do via the network inspector, and you should never need to do any clicking with the driver itself.
