Prevent CSS/other resource download in PhantomJS/Selenium driven by Python


I’m trying to speed up a Selenium/PhantomJS web scraper written in Python by preventing the download of CSS and other resources. All I need are the src and alt attributes of the img tags. I’ve found this code:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

via: How can I control PhantomJS to skip download some kind of resource?
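For reference, the test that this handler applies can be sketched in Python to see which URLs it would abort. The function name `is_css_request` is hypothetical. Note that the quoted regex only matches `http://` URLs and is unanchored, so it would miss `https://` stylesheets and would also match any URL that merely contains `.css`.

```python
import re

# Same pattern as the JavaScript handler: unanchored and http:// only.
CSS_URL = re.compile(r"http://.+?\.css", re.IGNORECASE)


def is_css_request(url, content_type=None):
    """Return True when the quoted handler's condition would abort this request."""
    return bool(CSS_URL.search(url)) or content_type == "text/css"


if __name__ == "__main__":
    for candidate in (
        "http://example.com/site.css",
        "https://example.com/site.css",
        "http://example.com/site.css.map",
    ):
        print(candidate, is_css_request(candidate))
```

If `https://` pages matter, a pattern anchored on the end of the path (for example one ending in `\.css(\?.*)?$`) may be a safer starting point.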

How and where can I implement this code in Selenium driven by Python? Or is there a better way to stop CSS and other resources from downloading?

Note: I’ve already found out how to prevent image downloads by editing the service_args variable, via:

How do I set a proxy for phantomjs/ghostdriver in python webdriver?

and

PhantomJS 1.8 with Selenium on python. How to block images?

But service_args can’t help me with resources like CSS. Thanks!


A bold young soul by the name of “watsonmw” recently added functionality to Ghostdriver (which PhantomJS uses to interface with Selenium) that allows access to PhantomJS API calls that require a page object, such as the onResourceRequested callback you cited.

If you need a solution at all costs, consider building PhantomJS from source (which the developers note “takes roughly 30 minutes … with 4 parallel compile jobs on a modern machine”) and integrating his patch, linked above.

Then this (untested) Python code should work as a proof of concept:

from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
page.onResourceRequested = function(requestData, request) {
    // ...
}
''', 'args': []})

Until then, you’ll just get a “Can't find variable: page” exception.
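As a rough sketch of how the `// ...` placeholder might be filled in, the helper below builds a handler that aborts requests by file extension and sends it through the same `executePhantomScript` command. It has not been run against a real Ghostdriver build. The names `build_block_script` and `block_resources` are hypothetical, and the sketch assumes that Ghostdriver runs the script with `this` bound to the current page (hence `var page = this;`), that the extensions are plain alphanumeric strings, and that the installed Selenium still ships `webdriver.PhantomJS`.

```python
import json


def build_block_script(extensions=("css",)):
    """Return PhantomJS script text that aborts requests for the given extensions."""
    # Builds a pattern such as \.(css|woff)(\?.*)?$ ; extensions are assumed
    # to be plain alphanumeric strings, so they are not escaped here.
    pattern = r"\.(" + "|".join(extensions) + r")(\?.*)?$"
    return (
        "var page = this;  // assumption: Ghostdriver binds `this` to the page\n"
        "var blocked = new RegExp(" + json.dumps(pattern) + ", 'i');\n"
        "page.onResourceRequested = function(requestData, request) {\n"
        "    if (blocked.test(requestData['url'])) {\n"
        "        request.abort();\n"
        "    }\n"
        "};\n"
    )


def block_resources(driver, extensions=("css",)):
    """Install the handler on an already-created PhantomJS driver (hypothetical helper)."""
    # Same command-registration hack as in the snippet above.
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    driver.execute('executePhantomScript',
                   {'script': build_block_script(extensions), 'args': []})


if __name__ == "__main__":
    # Print the generated script so it can be inspected before use.
    print(build_block_script(("css", "woff")))
```

With a patched build, usage would presumably look like `block_resources(driver, ("css", "woff"))` right after creating the driver and before the first `driver.get(...)`.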

Good luck! There are a lot of great alternatives, like working in a JavaScript environment, driving Gecko, using proxies, etc.
