Search App API usage

What the Search App will index for you

The Search App API is not connected to your content in Storyblok itself, so it does not know about the overall structure or the content types you have defined. Instead, the crawler takes your website, navigates through its page tree, and reads the following tags:

{
  properties: {
    url: { type: 'keyword' }, // url itself
    title: { type: 'text' }, // title tag
    description: { type: 'text' }, // meta tag description
    keywords: { type: 'text' }, // meta tag keywords
    h1: { type: 'text' }, // all h1
    h2: { type: 'text' }, // all h2
    h3: { type: 'text' }, // all h3
    h4: { type: 'text' }, // all h4
    h5: { type: 'text' }, // all h5
    content: { type: 'text' }, // everything in body (see options to scope it)
    last_updated: { type: 'date' } // date of crawling
  }
}

As you can see, the crawler expects plain HTML as its source. If you need additional information such as content types, we suggest using the keywords meta tag and a filter query on that field.

<head>
  <!-- your head content -->
  <meta name="keywords" content="article, ...">
</head>
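
As a rough illustration of the keywords approach, the following Python sketch builds a query body that searches for a term and keeps only pages whose keywords meta tag contains a given value. The function name and the choice of a match clause in filter context are assumptions, not part of the Search App documentation; keywords is indexed as a text field, so a match clause is one plausible option. The request format itself is described in the filtering section further down.

```python
import json


def build_keyword_query(term: str, keyword: str) -> dict:
    """Build a search body restricted to pages tagged with `keyword`.

    Hypothetical helper: only the field names come from the index mapping.
    """
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": term,
                        "fields": ["title", "h1", "content", "description"],
                    }
                },
                # Assumption: match against the analyzed `keywords` text field.
                "filter": {"match": {"keywords": keyword}},
            }
        }
    }


body = build_keyword_query("test", "article")
print(json.dumps(body, indent=2))
```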

What the Search App won’t index for you

The Search App won’t index URLs with the endings below.

[
  '.jpg', 
  '.exe', 
  '.pdf', 
  '.png', 
  '.xml', 
  '.json'
]
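
To check in advance whether a URL would be skipped, a small helper along these lines may be useful. It is only a sketch: the function name is hypothetical, and the details of how the crawler compares endings (case sensitivity, handling of query strings) are not documented here, so the lowercasing and the path-only comparison are assumptions.

```python
from urllib.parse import urlparse

# Endings taken from the list above.
SKIPPED_ENDINGS = (".jpg", ".exe", ".pdf", ".png", ".xml", ".json")


def is_skipped(url: str) -> bool:
    """Return True if the URL path ends with one of the skipped endings."""
    # Assumption: only the path counts, compared case-insensitively.
    path = urlparse(url).path.lower()
    return path.endswith(SKIPPED_ENDINGS)


for candidate in ("https://www.example.com/files/report.pdf",
                  "https://www.example.com/en/about"):
    print(candidate, is_skipped(candidate))
```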

Filtering on properties the Search App indexed

The Search App API lets you access the fields that were crawled and indexed from the content of your website. Since most websites use different paths for their languages, the example below shows how to use the Elasticsearch filter option to limit results to the current locale (“en” in our example). Feel free to try other Elasticsearch filters.

// Send a POST request with the following query in its body to our `_search` endpoint. You can find the endpoint in the "Search" area of your Storyblok space by clicking the "x Documents" link.
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          // your search query string, in our example we searched for "test"
          "query": "test",
          // the fields to search; ^N boosts a field's relevance (e.g. title^4)
          "fields": ["title^4", "h1^3", "h2^2", "h3^1", "h4", "h5", "content", "description", "keywords"],
          "type": "phrase_prefix"
        }
      },
      // filter results by locale depending on url.
      "filter": {
        "prefix": {
          "url": "https://www.example.com/en/"
        }
      }
    }
  },
  "size": 10,
  "from": 0,
  "highlight": {
    // define the highlight tags you want to use
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fields": {
      "*": {}
    },
    "require_field_match": false
  }
}
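
For reference, here is a minimal Python sketch of sending such a query using only the standard library. The endpoint URL is a placeholder that you need to replace with the one shown in your space, the helper names are hypothetical, and the result parsing assumes the standard Elasticsearch response shape (hits.hits[]._source), which this page does not confirm.

```python
import json
import urllib.request

# Placeholder: copy the real `_search` URL from the "Search" area of your space.
SEARCH_ENDPOINT = "https://search.example.invalid/_search"


def build_request(endpoint: str, term: str, url_prefix: str) -> urllib.request.Request:
    """Prepare a POST request that searches `term` below `url_prefix`."""
    body = {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": term,
                        "fields": ["title^4", "h1^3", "h2^2", "content"],
                        "type": "phrase_prefix",
                    }
                },
                # Limit results to one locale by URL prefix.
                "filter": {"prefix": {"url": url_prefix}},
            }
        },
        "size": 10,
        "from": 0,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_hits(response: dict) -> list:
    """Assumes the standard Elasticsearch response shape."""
    return [
        (hit["_source"].get("url"), hit["_source"].get("title"))
        for hit in response.get("hits", {}).get("hits", [])
    ]


request = build_request(SEARCH_ENDPOINT, "test", "https://www.example.com/en/")
# Sending is left commented out because the endpoint above is a placeholder:
# with urllib.request.urlopen(request, timeout=10) as resp:
#     print(extract_hits(json.loads(resp.read())))
```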

How to scope the “content” property?

Default Behavior

The default behavior of the crawler is to index every p, li and dd tag available inside the body.

Defining the content body

Sometimes you want to keep duplicated content (e.g. header and footer) from being indexed. To do so, add the CSS class .sb-indexed__body to the HTML element that wraps your content area. The content body starts where that element opens and ends where it closes. You do not have to create a new wrapper element if you already have one. The crawler collects every element with the .sb-indexed__body class, so you can mark several areas on a page and leave out elements in between, such as newsletter CTAs. You are not required to use a div as in our examples; use whichever HTML tag fits the semantics of the area.

Usage #1

<div class="header">
  <!-- Your header content which is the same on every page -->
</div>
<div class="any_class_you_want sb-indexed__body">
  <!-- Your actual content which is unique -->
</div>
<div class="footer">
  <!-- Your footer content which is the same on every page -->
</div>

Usage #2

<div class="header">
  <!-- Your header content which is the same on every page -->
</div>
<div class="any_class_you_want">
  <div class="sb-indexed__body">
  <!-- Your actual content which is unique -->
  </div>
  <div class="my-newsletter-content">
  </div>
  <div class="sb-indexed__body">
  <!-- Your actual content which is unique -->
  </div>
</div>
<div class="footer">
  <!-- Your footer content which is the same on every page -->
</div>
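
If you want a rough preview of what might end up in the “content” property, the following Python sketch mimics the rules described in this section. It is not the crawler's implementation: the class and function names are hypothetical, and it assumes that p, li and dd text is collected inside the marked areas when .sb-indexed__body is present, and from the whole page otherwise.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
TEXT_TAGS = {"p", "li", "dd"}
BODY_CLASS = "sb-indexed__body"


class ContentPreview(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_elements = []   # (tag, carries_body_class) per open element
        self.saw_body_class = False
        self.all_text = []        # text of every p/li/dd
        self.scoped_text = []     # text of p/li/dd inside a marked area

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        classes = (dict(attrs).get("class") or "").split()
        marked = BODY_CLASS in classes
        self.saw_body_class = self.saw_body_class or marked
        self.open_elements.append((tag, marked))

    def handle_endtag(self, tag):
        # Close the most recent matching element, tolerating unclosed children.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                del self.open_elements[index:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not any(t in TEXT_TAGS for t, _ in self.open_elements):
            return
        self.all_text.append(text)
        if any(marked for _, marked in self.open_elements):
            self.scoped_text.append(text)


def preview_content(html: str) -> str:
    """Return the text this sketch would treat as the page content."""
    parser = ContentPreview()
    parser.feed(html)
    parser.close()
    chunks = parser.scoped_text if parser.saw_body_class else parser.all_text
    return " ".join(chunks)
```

Calling preview_content on a page with a header, a marked content area and a footer returns only the text from the marked area, which makes it easy to spot content you forgot to mark.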

Defining multiple content sections in one page

If one big “content” is not enough for you, you can use the CSS class .sb-indexed__section together with an id attribute in your HTML to let us crawl individual jumplinks. This is most useful for a larger FAQ post or tutorial. Keep in mind that this splits your results into separate items instead of returning the page as one; if you want a single result per page, use .sb-indexed__body instead. The class and the id can be added to any HTML element, but keep in mind that the value of the id will be added to the current URL.

Usage

<div class="header">
  <!-- Your header content which is the same on every page -->
</div>
<div class="any_class_you_want">
  <div class="sb-indexed__section" id="getting-started-with-search">
  <!-- Your actual content which is unique -->
  </div>
  <div class="my-newsletter-content">
  </div>
  <div class="sb-indexed__section" id="setting-up-a-search">
  <!-- Your actual content which is unique -->
  </div>
</div>
<div class="footer">
  <!-- Your footer content which is the same on every page -->
</div>
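
To see which result URLs a page like this could produce, the following Python sketch lists one URL per marked section. The names are hypothetical, and it assumes the id is appended to the page URL as a fragment (#id), which is what the term jumplink suggests but is not spelled out here.

```python
from html.parser import HTMLParser


class SectionFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.section_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # Only elements that carry both the class and an id count as sections.
        if "sb-indexed__section" in classes and attributes.get("id"):
            self.section_ids.append(attributes["id"])


def section_urls(page_url: str, html: str) -> list:
    """List the per-section URLs this sketch expects for a page."""
    finder = SectionFinder()
    finder.feed(html)
    finder.close()
    # Assumption: the id is appended as a URL fragment ("#id").
    return [f"{page_url}#{section_id}" for section_id in finder.section_ids]
```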

How to stop the Search App Crawler from indexing a site?

The Search App Crawler supports the HTML meta tag meta[name="searchblok"]. If it is set to noindex, the crawler won’t crawl your site.
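
A quick way to verify that a page carries this tag is a check like the Python sketch below. The names are hypothetical, and it assumes that noindex is placed in the tag's content attribute, as is usual for meta tags; the exact attribute is not stated here.

```python
from html.parser import HTMLParser


class NoindexCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "searchblok":
            # Assumption: the value lives in the `content` attribute.
            content = (attributes.get("content") or "").lower()
            if "noindex" in content:
                self.noindex = True


def is_excluded(html: str) -> bool:
    """Return True if the page asks the Search App Crawler not to index it."""
    check = NoindexCheck()
    check.feed(html)
    check.close()
    return check.noindex
```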

What is the maximum time before a timeout happens?

The Search App Crawler waits a maximum of 7000 ms (7 seconds) before it times out.