Scrape Data from Websites into a Headless CMS

    In this article you will learn how to extract data from a list of websites and import them into Storyblok using the Management API.

    The repo of the script in this tutorial can be found here.

    What data are we scraping?

    We are going to extract the logo from each website. A logo can appear in many places on a site, so we can’t be 100% sure that we’ll find the one we want. Potential locations include the manifest file, the favicon, the schema data, and the HTML code. We’ll try each of them in sequence until we find an image.

    How to retrieve data from a webpage in NodeJS

    An easy way to read the data in a webpage with NodeJS is to download the document and parse its content with jsdom, a library that lets you explore the DOM with the same Web APIs you would use client-side. We could also work with regular expressions, but since we have to look for several items on the page, I’d recommend jsdom, as it makes the job much simpler.

    This is an example of how we can retrieve the manifest file. 

     async getManifest() {
       // this.dom is the virtual DOM parsed by JSDOM
       // We can navigate in the DOM with the Web API to get the manifest link tag
       let manifestLink = this.dom.window.document.querySelector('[rel="manifest"]')
       let manifest = {}
       if (manifestLink) {
         try {
           // Request the manifest file
           // this.fullUrl will convert a path without domain to a full path
           // with domain
           let response = await axios.get(this.fullUrl(manifestLink.href))
           if (response.status === 200) {
             manifest = response.data
           }
         } catch (err) {
           // Ignore request errors and fall back to an empty manifest
         }
       }
       return manifest
     }
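
    The `fullUrl` helper itself is not shown in this article. As a rough sketch of how such a helper could work, the snippet below resolves a path against the site's base URL with Node's built-in `URL` class. The standalone signature with a `baseUrl` argument is an assumption made for illustration; the method in the script may be written differently.

```javascript
// Hypothetical standalone version of a `fullUrl`-style helper.
// Resolves a relative or absolute path against the site's base URL.
function fullUrl(path, baseUrl) {
  try {
    // The URL constructor leaves absolute URLs untouched
    // and resolves relative ones against the base
    return new URL(path, baseUrl).href
  } catch (err) {
    // Invalid input: return an empty string so callers can skip it
    return ''
  }
}

console.log(fullUrl('/manifest.json', 'https://example.com/about/'))
// prints "https://example.com/manifest.json"
console.log(fullUrl('icons/logo.png', 'https://example.com/about/'))
// prints "https://example.com/about/icons/logo.png"
console.log(fullUrl('https://cdn.example.com/logo.png', 'https://example.com'))
// prints "https://cdn.example.com/logo.png"
```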

    After fetching the manifest, we can get the list of icons out of it and pick the largest one.

     async getLogoFromManifest() {
       let manifest = await this.getManifest()
       if (manifest?.icons?.length) {
         // Sort the icons array based on the first value of the sizes property 
         // of each image (the width).
         manifest.icons.sort((a, b) => {
           if (!a.sizes || !b.sizes) {
             return 0
           }
           let aSize = parseInt(a.sizes.split('x')[0], 10)
           let bSize = parseInt(b.sizes.split('x')[0], 10)
           if (aSize > bSize) {
             return -1
           }
           if (aSize < bSize) {
             return 1
           }
           return 0
         })
         // Return the biggest image from the list
         return this.absoluteUrl(manifest.icons[0].src)
       } else {
         return ''
       }
     }
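
    In a web app manifest, the `sizes` value of an icon can list several sizes separated by spaces, or be the keyword `any`. The sketch below is one possible way to account for that when choosing the largest icon, without mutating the original array. The helper names `largestWidth` and `pickLargestIcon` are hypothetical and not part of the script above.

```javascript
// Hypothetical helper: width of the largest size listed in a `sizes` string.
// Returns 0 when the value is missing or not numeric (for example "any").
function largestWidth(sizes) {
  if (!sizes) return 0
  const widths = sizes
    .split(/\s+/)
    .map((size) => parseInt(size.split('x')[0], 10))
    .filter((width) => !Number.isNaN(width))
  return widths.length ? Math.max(...widths) : 0
}

// Return the icon with the largest width, or null for an empty list
function pickLargestIcon(icons) {
  if (!Array.isArray(icons) || icons.length === 0) return null
  return icons.reduce((best, icon) =>
    largestWidth(icon.sizes) > largestWidth(best.sizes) ? icon : best
  )
}

// Sample data made up for this example
const icons = [
  { src: '/icon-192.png', sizes: '192x192' },
  { src: '/icon-multi.png', sizes: '48x48 512x512' },
  { src: '/icon.svg', sizes: 'any' },
  { src: '/icon-unknown.png' }
]

console.log(pickLargestIcon(icons).src)
// prints "/icon-multi.png"
```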

    Import the data into Storyblok

    We have an array with the data scraped from the list of sites (title, meta description, etc.) along with the logos. Now we need to store the content in the CMS.

    In order to upload the logos and create the entries, we are going to use the Storyblok JS Client, as it simplifies the code for the API requests.
    The client takes care of throttling the API requests, but since we are uploading a lot of assets, it is better to limit the number of simultaneous uploads, as too many at once could strain memory or bandwidth.
    We can use the Async library to limit the number of concurrent operations. The same approach works any time you upload many assets via the API.
    We’ll loop through the array of websites with the eachLimit method, setting a suitable number of simultaneous uploads.

    Inside the loop, check whether the entry already exists by its path. If it does, update it; otherwise, create a new one.

     async import() {
       process.stdout.write(`Writing data into Storyblok \n`);
       let total = 0
       // this.websites contains the array of all the websites
       // Perform max 15 simultaneous operations
       async.eachLimit(this.websites, 15, async (website) => {
         try {
           // If the story exists, just update that one
           const story = (await this.storyblok.get(`cdn/stories/pl/${this.storySlug(website)}`, { version: 'draft' })).data.story
           await this.updateStory(story, website)
         } catch (err) {
           // If the story doesn’t exist, create it
           await this.createStory(website)
         }
         // Output a progressive counter in the console
         // to give info about the status of the import
         process.stdout.clearLine()
         process.stdout.cursorTo(0)
         process.stdout.write(`${++total} of ${this.websites.length} Stories saved.`)
       }, (err) => {
         if(err) {
           console.log(err)
         }
         // Save a log of the import
         fs.writeFileSync('./data/log.json', JSON.stringify(this.log, null, 2))
       })
     }
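
    To make the effect of the limit concrete, here is a dependency-free sketch of what a concurrency cap does. It is not how the Async library implements eachLimit; `eachWithLimit` is a hypothetical helper written only to illustrate the idea, and the timers stand in for real uploads.

```javascript
// Illustrative only: run `worker` over `items` with at most `limit` calls in flight.
async function eachWithLimit(items, limit, worker) {
  let next = 0
  // Each runner pulls the next unprocessed item until none are left
  async function runner() {
    while (next < items.length) {
      const item = items[next++]
      await worker(item)
    }
  }
  const runners = []
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner())
  }
  await Promise.all(runners)
}

// Simulated uploads that record how many run at the same time
let active = 0
let maxActive = 0
const fakeUpload = async () => {
  active++
  maxActive = Math.max(maxActive, active)
  await new Promise((resolve) => setTimeout(resolve, 10))
  active--
}

const websites = Array.from({ length: 10 }, (_, i) => `site-${i}`)
const done = eachWithLimit(websites, 3, fakeUpload).then(() => {
  console.log(`Peak concurrency: ${maxActive}`)
  return maxActive
})
// prints "Peak concurrency: 3"
```

    With ten items and a limit of three, no more than three simulated uploads are ever running at once, which is the behavior the import relies on to keep memory and bandwidth usage bounded.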

    The uploadLogo method is called before creating or updating an entry. It uploads the website's logo to Storyblok and returns the signed request data, which contains the id and the pretty URL of the asset along with other data.

     /**
      * Upload an asset
      * @param {string} logo The image URL or the inline SVG markup
      * @return {Promise} Resolves with the signed request data, or with nothing on failure
      */
     async uploadLogo(logo) {
       return new Promise(async (resolve) => {
         try {
           let logo_data = ''
           let filename = ''
           if (logo.includes('<svg')) {
             // If the logo is an HTML element use the HTML content
             logo_data = logo
             filename = 'logo.svg'
           } else {
             // If the logo is a URL, get the data with an http request
             logo_data = (await axios.get(logo, {responseType: 'arraybuffer'})).data
             filename = logo.split('?')[0].split('/').pop()
           }
           if (!logo_data) {
             // Nothing to upload: stop here instead of continuing with empty data
             return resolve()
           }
           // Submit the API request to store a new asset
           const new_asset_request = await this.storyblok.post(`spaces/${this.space_id}/assets`, { filename: filename })
           const signed_request = new_asset_request.data
           // Use FormData to store the assets to the post_url provided by the
           // previous request
           let form = new FormData()
           for (let key in signed_request.fields) {
             form.append(key, signed_request.fields[key])
           }
           form.append('file', logo_data)
           form.submit(signed_request.post_url, (err) => {
             if (err) {
               resolve()
             } else {
               resolve(signed_request)
             }
           })
         } catch (err) {
           resolve()
         }
       })
     }
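
    The method above derives the file name by splitting the URL string, which yields an empty name when the URL ends with a slash. A slightly more defensive variant could parse the URL instead. This is only a sketch: `logoFilename` and its `fallback` default are hypothetical and not part of the original script.

```javascript
// Hypothetical helper: derive a file name for the asset from the logo value.
function logoFilename(logo, fallback = 'logo') {
  if (logo.includes('<svg')) {
    // Inline SVG markup has no URL, so use a fixed name
    return 'logo.svg'
  }
  try {
    // `pathname` excludes the query string and the hash
    const name = new URL(logo).pathname.split('/').pop()
    return name ? decodeURIComponent(name) : fallback
  } catch (err) {
    // Not a valid absolute URL (or a malformed escape sequence)
    return fallback
  }
}

console.log(logoFilename('https://example.com/img/logo.png?v=2#top'))
// prints "logo.png"
console.log(logoFilename('https://example.com/'))
// prints "logo"
console.log(logoFilename('<svg viewBox="0 0 10 10"></svg>'))
// prints "logo.svg"
```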

    In Storyblok, we’ll use an asset field for the logo, so the value has to follow the schema of that field type.

    story_data.story.content.logo = {
      "id": logo.id,
      "alt": `${website.name} Logo`,
      "filename": logo.pretty_url,
      "fieldtype": "asset",
    }
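
    Because uploadLogo resolves with nothing when the upload fails, it can help to build the asset field in one place and skip it when no asset came back. The sketch below assumes the signed request data exposes `id` and `pretty_url` as used above; `buildLogoField` is a hypothetical helper name and the sample values are placeholders.

```javascript
// Hypothetical helper: build the value for an asset field,
// or return null when the upload did not return usable data.
function buildLogoField(asset, websiteName) {
  if (!asset || !asset.id || !asset.pretty_url) {
    return null
  }
  return {
    id: asset.id,
    alt: `${websiteName} Logo`,
    filename: asset.pretty_url,
    fieldtype: 'asset'
  }
}

// Placeholder data standing in for the result of uploadLogo
const uploaded = { id: 123, pretty_url: '//example.com/f/000/logo.png' }

console.log(buildLogoField(uploaded, 'Example'))
console.log(buildLogoField(undefined, 'Example'))
// the second call prints null
```

    The caller can then assign the result to story_data.story.content.logo only when it is not null, so a failed upload does not overwrite an existing logo with an incomplete value.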

    This page demonstrates a very specific example, but you can use the same structure to scrape a different set of data from a list of sites, or from all the pages of a website, and then upload it to Storyblok.