This post is about using Apify's generic Web Scraper actor on the free tier.
I needed a quick tool to automate one of my tasks, so I quickly whipped up several web scraper tasks and tested them individually on the Apify platform. But when it was time to integrate them into my backend, I stumbled on the following problem:
Cannot run actor (By launching this job you will exceed the memory limit of 4096MB for all your actor runs and builds (currently used: 4096MB, requested: 4096MB). Please upgrade to a paid plan to increase your actor memory limit.)
I realised that running my actors (saved tasks) concurrently exceeds the 4 GB memory limit of my free tier. This problem did not occur while testing, since I ran the tasks one at a time.
Queue problem
I wanted to start all the jobs at once and not manage the queue myself in my backend. I could schedule the jobs with cron to run at different times, but I would hit the same memory problem whenever a previous run had not yet finished. Theoretically, it's possible to utilise webhook responses from Apify for that: chain the runs so that each one starts only after the previous one terminates.
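For the curious, here is a minimal sketch of that webhook-chaining idea. It relies on Apify's ad-hoc webhooks, which are passed as a base64-encoded JSON array when starting a task run. The task IDs, backend URL and /apify-finished endpoint below are hypothetical placeholders, not my actual setup.

    // Chain task runs so only one consumes memory at a time.
    // Requires Node 18+ for the global fetch. Task IDs are hypothetical.
    const TASK_IDS = ['my-task-1', 'my-task-2', 'my-task-3'];

    async function startTaskRun(index) {
        if (index >= TASK_IDS.length) return;
        // Ad-hoc webhook: Apify calls our backend when this run terminates,
        // and only then do we start the next task.
        const webhooks = Buffer.from(JSON.stringify([{
            eventTypes: ['ACTOR.RUN.SUCCEEDED', 'ACTOR.RUN.FAILED', 'ACTOR.RUN.ABORTED'],
            requestUrl: `https://my-backend.example.com/apify-finished?next=${index + 1}`,
        }])).toString('base64');

        await fetch(
            `https://api.apify.com/v2/actor-tasks/${TASK_IDS[index]}/runs` +
            `?token=${process.env.APIFY_TOKEN}&webhooks=${webhooks}`,
            { method: 'POST' },
        );
    }

    // In the backend, the webhook handler kicks off the next run, e.g. with Express:
    // app.post('/apify-finished', (req, res) => {
    //     startTaskRun(Number(req.query.next));
    //     res.sendStatus(200);
    // });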
But it was easier to put all the URLs into the startUrls parameter instead. This way the memory limit is not exceeded, since a single instance goes through all the supplied URLs one by one, managing the queue itself.
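If you trigger the task from your backend, you can also pass the combined startUrls in the run's input, which, as I understand it, overrides the corresponding field of the saved task input. A minimal sketch, assuming a hypothetical task ID and URLs:

    // One run walks through every URL sequentially, so the 4 GB limit holds.
    const startUrls = [
        { url: 'https://example.com/page-1', method: 'GET' },
        { url: 'https://example.com/page-2', method: 'GET' },
    ];

    await fetch(
        `https://api.apify.com/v2/actor-tasks/my-task/runs?token=${process.env.APIFY_TOKEN}`,
        {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ startUrls }),
        },
    );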
One problem down.
Scraping function rewrite
Previously, I wrote an actor for each website I needed to scrape. Now I had to accumulate many scraping strategies into one pageFunction. Not an ideal solution, but it serves our objective here.
But how do we know which strategy to use for each URL?
One way is to detect it from the URL itself, using the context given as an input parameter to the pageFunction: the URL of the loaded page is stored in context.request.loadedUrl. I went with the userData option instead, which can be found in context.request.userData, because I needed to pass in more data than would be possible otherwise.
Here is a snippet of my input JSON for two different strategies. You can also set userData in the Apify UI when modifying the actor/saved task input.
"startUrls": [ { "url": "[my secret url to scrape]", "method": "GET", "userData": { "useStrategy": "strategy1", "myOtherData": "" } }, { "url": "[my otgher url to scrape]", "method": "GET", "userData": { "useStrategy": "strategy2", "myOtherData": "" } } ]
And here is a snippet of my page function to illustrate the strategies in use.

    async function pageFunction(context) {
        // One extraction strategy per site layout.
        async function processStrategy1(loadedUrl) {
            // Do your stuff 1
        }
        async function processStrategy2(loadedUrl) {
            // Do your stuff 2
        }

        let listings;
        switch (context.request.userData.useStrategy) {
            case 'strategy1':
                listings = await processStrategy1(context.request.loadedUrl);
                break;
            case 'strategy2':
                listings = await processStrategy2(context.request.loadedUrl);
                break;
            default:
                context.log.info('undefined strategy');
        }
        return listings;
    }
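To make the placeholders concrete, a strategy body might look like this. This is a minimal sketch, assuming the Web Scraper's "Inject jQuery" option is enabled; the .listing, .title and .price selectors are hypothetical.

    async function processStrategy1(loadedUrl) {
        const $ = context.jQuery; // available when "Inject jQuery" is enabled
        // Hypothetical selectors; adapt them to the site being scraped.
        return $('.listing').map((i, el) => ({
            url: loadedUrl,
            title: $(el).find('.title').text().trim(),
            price: $(el).find('.price').text().trim(),
        })).get();
    }

Whatever the pageFunction returns is stored in the run's default dataset, so each strategy just needs to return an array of plain objects.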
Final words
I hope this helps someone out with a similar problem. Let me know if something is unclear or if you have a better idea.