Scrape Data from Websites into a Headless CMS
What data are we scraping?
We are going to extract the logo of each website. The problem is that a logo can appear in many places on a site, and we can’t be 100% sure that we’ll find the one we want. Potential locations include the manifest file, the favicon, the schema data, and the HTML code. We’ll try each of them in sequence until we find an image.
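The idea of trying each source in turn can be sketched as below. This is only an illustration: `findLogo` and the finder functions passed to it (one per source) are hypothetical names, not part of the original script.

```javascript
// Try each logo source in order and return the first image URL found.
// "finders" is an array of async functions (hypothetical, one per source:
// manifest, favicon, schema data, HTML) that resolve to a URL or null.
async function findLogo(siteUrl, finders) {
  for (const finder of finders) {
    try {
      const logo = await finder(siteUrl);
      if (logo) return logo;
    } catch (err) {
      // A source that fails should not stop the search; move to the next one.
    }
  }
  return null; // no source produced an image
}
```

Ordering the array from the most to the least reliable source means the best available logo wins.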
How to retrieve data from a webpage in NodeJS
A simple way to read the data in a webpage with NodeJS is to download the document and parse its content with jsdom, a library that lets you explore the DOM with the same Web API you would use client-side. We could also work with regular expressions, but since we have to look for several items on the page, I’d recommend jsdom as it’ll make your life much simpler.
This is an example of how we can retrieve the manifest file.
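A minimal sketch of that step, assuming the page declares its manifest with a `<link rel="manifest">` tag. The function names are hypothetical; `JSDOM.fromURL` is jsdom's helper for downloading and parsing a page in one call.

```javascript
// Pure helper: works on any DOM document (jsdom or a browser).
function findManifestUrl(document) {
  const link = document.querySelector('link[rel="manifest"]');
  // The DOM resolves link.href against the page URL, so it is absolute.
  return link && link.href ? link.href : null;
}

// Downloads the page with jsdom and returns the manifest URL, or null.
async function getManifestUrl(siteUrl) {
  // Required here so the helper above stays usable without jsdom installed.
  const { JSDOM } = require('jsdom');
  const dom = await JSDOM.fromURL(siteUrl);
  return findManifestUrl(dom.window.document);
}
```

The same `dom.window.document` can be reused to look for the favicon or other candidates, so the page only has to be downloaded once.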
After fetching its content we can get the list of images out of it.
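As a rough sketch, the `icons` array of a web app manifest can be scanned for the largest image. Picking the largest icon is an assumption made here for illustration, as are the function names; the global `fetch` assumes Node 18 or later.

```javascript
// Returns the absolute URL of the largest icon in a parsed manifest, or null.
function pickLargestIcon(manifest, manifestUrl) {
  const icons = Array.isArray(manifest.icons) ? manifest.icons : [];
  let best = null;
  let bestSize = -1;
  for (const icon of icons) {
    if (!icon.src) continue;
    // "sizes" may list several values, e.g. "48x48 96x96"; keep the widest.
    const widths = String(icon.sizes || '')
      .split(/\s+/)
      .map((s) => parseInt(s, 10))
      .filter((n) => !Number.isNaN(n));
    const size = widths.length ? Math.max(...widths) : 0;
    if (size > bestSize) {
      best = icon;
      bestSize = size;
    }
  }
  // Icon paths are relative to the manifest file, not to the page.
  return best ? new URL(best.src, manifestUrl).href : null;
}

// Downloads the manifest and returns its best icon URL, or null.
async function getManifestLogo(manifestUrl) {
  const res = await fetch(manifestUrl);
  if (!res.ok) return null;
  return pickLargestIcon(await res.json(), manifestUrl);
}
```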
Import the data into Storyblok
We have an array with the data scraped from the list of sites (title, meta description, etc.) along with the logos. Now we need to store the content in the CMS.
In order to upload the logos and create the entries, we are going to use the Storyblok JS Client, as it simplifies the code for the API requests.
The above client takes care of throttling the API requests, but since we are uploading a lot of assets, we should also limit the number of simultaneous uploads, as too many at once could exhaust memory or bandwidth.
We can use Async to limit the number of concurrent operations. The same approach works any time you are uploading many assets via the API.
We’ll loop through the array of websites and use the eachLimit method, setting a suitable number of simultaneous uploads.
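To show what the limit means in practice, here is a dependency-free sketch of the same behavior. It is not Async's implementation, and `eachLimitSketch` is a hypothetical name. With Async v3 installed, the equivalent call would be `await async.eachLimit(sites, 5, importSite)`, where `importSite` is an async function you provide.

```javascript
// Runs worker(item) for every item, with at most "limit" calls in flight.
async function eachLimitSketch(items, limit, worker) {
  let next = 0; // index of the next item to hand out

  // Each runner takes the next unprocessed item until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      await worker(items[index], index);
    }
  }

  // Start up to "limit" runners; together they drain the list.
  const count = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: count }, () => runner()));
}
```

A limit of around 5 is a reasonable starting point for asset uploads; the right value depends on file sizes and the available bandwidth.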
Inside the loop, check whether the entry already exists by its path; if so, just update it, otherwise create a new one.
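The check-then-update-or-create step might look like the sketch below. It assumes `client` is a Storyblok JS Client instance authenticated for the Management API, and that the stories endpoint accepts a `with_slug` filter for looking up an entry by path. `upsertStory` and `spaceId` are hypothetical names.

```javascript
// Updates the story whose path matches story.slug, or creates it if missing.
async function upsertStory(client, spaceId, story) {
  // Look for an existing entry with the same path (assumed with_slug filter).
  const res = await client.get(`spaces/${spaceId}/stories`, {
    with_slug: story.slug,
  });
  const existing = res.data.stories[0];

  if (existing) {
    // Entry found: update it in place.
    return client.put(`spaces/${spaceId}/stories/${existing.id}`, { story });
  }
  // No entry with this path yet: create a new one.
  return client.post(`spaces/${spaceId}/stories`, { story });
}
```

Because the lookup is by path, running the import twice updates the existing entries instead of creating duplicates.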
The uploadLogo function is called before creating or updating an entry. It uploads the website's logo to Storyblok and returns the signed request data, which contains the id and the fancy URL of the asset along with other data.
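A possible shape for such a function is sketched below. It is an assumption-heavy outline rather than the original code: it assumes the three-step asset flow of the Storyblok Management API (request a signed upload, post the file to the returned `post_url` with the returned `fields`, then call `finish_upload`), a Storyblok JS Client instance passed in as `client`, and Node 18 or later for the global `fetch`, `FormData` and `Blob`. The parameter names are hypothetical.

```javascript
// Uploads a logo (as a Buffer) and returns the signed request data.
async function uploadLogo(client, spaceId, filename, buffer) {
  // 1. Ask Storyblok for a signed request for the new asset.
  const res = await client.post(`spaces/${spaceId}/assets`, { filename });
  const signed = res.data;

  // 2. Send the signed fields first, then the file, to the storage endpoint.
  const form = new FormData();
  for (const [key, value] of Object.entries(signed.fields)) {
    form.append(key, value);
  }
  form.append('file', new Blob([buffer]), filename);
  const upload = await fetch(signed.post_url, { method: 'POST', body: form });
  if (!upload.ok) {
    throw new Error(`Upload of ${filename} failed with status ${upload.status}`);
  }

  // 3. Tell Storyblok the upload is complete (assumed finalize step).
  await client.get(`spaces/${spaceId}/assets/${signed.id}/finish_upload`);

  return signed;
}
```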
In Storyblok, we’ll use an asset field for the logo so we need to use a specific schema to store it.
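As an illustration, the signed request data could be mapped to an asset field value as follows. The object shape (`fieldtype`, `id`, `filename`) and the `pretty_url` property are assumptions about Storyblok's asset format, and `toAssetField` is a hypothetical helper, so check them against your own space before relying on this.

```javascript
// Converts signed request data into the value of an asset field.
function toAssetField(signed) {
  // The URL may be protocol-relative ("//..."); make it absolute if so.
  const url = signed.pretty_url.startsWith('//')
    ? `https:${signed.pretty_url}`
    : signed.pretty_url;
  return {
    fieldtype: 'asset',
    id: signed.id,
    filename: url,
  };
}
```

The returned object would then be assigned to the logo field in the entry's content, for example `content.logo = toAssetField(signed)`.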
This page demonstrates a very specific example, but you can use the same structure to scrape a different set of data from a list of sites, or from all the pages of a website, and then upload it to Storyblok.