
How To Set Time Interval Between Page Scraping With Phantomjs

I wrote a script with PhantomJS that scrapes multiple pages. The script works, but I can't figure out how to set a time interval between scrapes. I tried using s

Solution 1:

The problem is that PhantomJS calls are asynchronous, but the loop iteration is not. All iterations (in the first snippet) run before the first page has even loaded, so you are essentially starting several PhantomJS processes that run at the same time.
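
To make that ordering concrete, here is a small dependency-free sketch. It does not use PhantomJS at all: fakeOpen is a hypothetical stand-in for page.open that only records the call and keeps its callback for later, which is enough to show that the loop finishes before any "page" has loaded.

```javascript
// Hypothetical stand-in for page.open: it records that a request was started
// and stores the completion step instead of running it right away.
var log = [];
var pending = [];

function fakeOpen(url, callback) {
    log.push('start ' + url);
    pending.push(function () {
        log.push('loaded ' + url);
        callback('success');
    });
}

// The synchronous loop: every request is started immediately.
['string1', 'string2', 'string3'].forEach(function (item) {
    fakeOpen('http://www.example.com/' + item, function (status) {
        // data gathering would happen here
    });
});

// Snapshot taken right after the loop: three starts, zero loads.
var startedBeforeAnyLoad = log.slice();

// Only now do the simulated page loads complete.
pending.forEach(function (finish) {
    finish();
});
```

In the snapshot all three requests have been started and none has completed, which is the situation the loop creates with real page loads.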

You can use a library such as async to run the requests sequentially:

phantom.create(function(ph) {
    ph.createPage(function(page) {
        var arrayList = ['string1', 'string2', 'string3' /* , ... */];

        var tasks = arrayList.map(function(eachItem) {
            return function(callback) {
                var webAddress = "http://www.example.com/" + eachItem;
                page.open(webAddress, function(status) {
                    console.log("opened site? ", status);

                    // includeJs loads a script from a URL; injectJs expects a local file path
                    page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {

                        setTimeout(function() {
                            return page.evaluate(function() {
                                //code here for gathering data
                            }, function(result) {
                                callback(null, result);
                            });
                        }, 5000);
                    });
                });
            };
        });

        async.series(tasks, function(err, results){
            console.log("Finished");
            ph.exit();
        });
    });
});

Of course, you can also move phantom.create() inside each task, which creates a separate process for each request, but the code above will be faster.
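
If you also want a fixed pause between one scrape finishing and the next one starting, the same idea can be written without any library. The following is only a sketch under assumptions: runSeries and its optional schedule parameter are hypothetical names (not part of async or PhantomJS), and each task is assumed to follow the same callback(err, result) convention as the tasks above.

```javascript
// Runs tasks one after another and waits delayMs between them.
// `schedule` defaults to setTimeout; it is a parameter only so the pause
// can be replaced in tests (hypothetical, for illustration).
function runSeries(tasks, delayMs, done, schedule) {
    schedule = schedule || setTimeout;
    var results = [];

    function next(index) {
        if (index >= tasks.length) {
            return done(null, results);
        }
        tasks[index](function (err, result) {
            if (err) {
                // Stop at the first error and report what was collected so far.
                return done(err, results);
            }
            results.push(result);
            if (index + 1 < tasks.length) {
                // Pause before starting the next task.
                schedule(function () {
                    next(index + 1);
                }, delayMs);
            } else {
                // No pause needed after the last task.
                next(index + 1);
            }
        });
    }

    next(0);
}
```

With the tasks array from the snippet above, a call such as runSeries(tasks, 5000, function (err, results) { ph.exit(); }) would open the pages one at a time with five seconds between them.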

Solution 2:

You have some typos in the second snippet where you added the setInterval approach:

var arrayList = ['string1', 'string2', 'string3'];
var i = 0;
var scrapeInterval = setInterval(function () {
    var webAddress = "http://www.example.com/" + arrayList[i];
    phantom.create(function (ph) {
        return ph.createPage(function (page) {

            return page.open(webAddress, function (status) {
                console.log("opened site? ", status);


                // includeJs loads a script from a URL; injectJs expects a local file path
                page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function () {

                    setTimeout(function () {
                        return page.evaluate(function () {
                            //code here for gathering data
                        }, function (result) {
                            return result
                            ph.exit();
                        });

                    }, 5000);

                });
            });
        });

        i++;
        if (i >= arrayList.length) { // stop once the last item has been requested
            clearInterval(scrapeInterval);
        } //This was missing;
    }); //This was missing;
}, 5000);
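
The interval pattern itself can be tried without PhantomJS. This is a minimal sketch, assuming you only need one visit per tick: visitEach and its timers parameter are hypothetical names introduced for illustration. The index is advanced synchronously on every tick, and the interval is cleared as soon as the last item has been handed out, so no tick ever sees an index past the end of the array.

```javascript
// Calls visit(item) once per tick for each entry in items, then stops.
// `timers` can be replaced with fakes for testing; by default it wraps
// the standard setInterval / clearInterval functions.
function visitEach(items, intervalMs, visit, timers) {
    timers = timers || {
        set: function (fn, ms) { return setInterval(fn, ms); },
        clear: function (handle) { clearInterval(handle); }
    };
    if (items.length === 0) {
        return null; // nothing to schedule
    }
    var i = 0;
    var handle = timers.set(function () {
        visit(items[i]);
        i++;
        if (i >= items.length) {
            timers.clear(handle);
        }
    }, intervalMs);
    return handle;
}
```

Note that an interval fires on a fixed schedule whether or not the previous page has finished, so a slow page can still overlap with the next one.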

Something else I noticed is the return statement in the following timeout:

setTimeout(function () {
    return page.evaluate(function () {
        //code here for gathering data
    }, function (result) {
        return result
        ph.exit();
    });
}, 5000);

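ph.exit(); will never be reached, because the return statement before it ends the callback first. I don't know whether this causes a problem for you, but it means the PhantomJS process is never shut down, so you might want to take a look at it.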
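
The effect of the statement order can be seen in isolation. The sketch below does not touch PhantomJS: fakePh is a hypothetical stand-in that only counts exit() calls, and the two function names are made up for the comparison.

```javascript
// Stand-in for the PhantomJS handle; it only counts how often exit() runs.
var fakePh = {
    exitCalls: 0,
    exit: function () { this.exitCalls++; }
};

// Same order as in the snippet above: return first, exit second.
function withExitAfterReturn(result) {
    return result;
    fakePh.exit(); // unreachable: the function has already returned
}

// Swapped order: exit runs, then the result is returned.
function withExitBeforeReturn(result) {
    fakePh.exit();
    return result;
}

var first = withExitAfterReturn('data');
var callsAfterFirst = fakePh.exitCalls;   // exit() has not run

var second = withExitBeforeReturn('data');
var callsAfterSecond = fakePh.exitCalls;  // exit() has run once
```

Both functions return the same value; only the second one actually calls exit().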
