How to call Content Grabber API ~ RefinePro Knowledge Base for OpenRefine

Content Grabber is a powerful, easy-to-use web scraping tool developed by Sequentum. Its point-and-click interface lets you build a scraper and retrieve data from a website quickly.

In this tutorial, we describe how to call the Content Grabber API to trigger an agent and pass input parameters. With the Content Grabber API, you can embed a scraper in a more complex workflow and configure it on demand. We first discuss the Content Grabber API, then walk through a simple example step by step.

Don't hesitate to contact us if you have questions or need help implementing web scraping projects.

Content Grabber API

The Content Grabber proxy API enables other applications to access Content Grabber at runtime.

If you want to run agents from a web application, you must use the Content Grabber proxy API. The proxy API interfaces with the Content Grabber runtime using a Windows service. The Windows service is installed as part of the Content Grabber application, so Content Grabber must be installed on the web server.

Configure Content Grabber API

Create a project using Content Grabber Premium Edition and go to Application Settings → Configure Service.

Then, in the following window, enable Remote procedure calls (SOAP) on port 8003, Web requests (REST) on port 8004, and the Content Grabber Scheduler. Finally, check Start service automatically. You will see the message: "The Content Grabber service is running."

In this example, the API endpoint runs only on our local machine (i.e., it is not publicly available on the Internet). Therefore, in Application Settings → API Access, you can uncheck "API key required" so that you do not need to send an API key with every request.

Disable API key required.

Format of a simple web request

The default Content Grabber endpoint URL is: http://localhost:portnumber/ContentGrabber

If the Content Grabber service is located on a remote machine, replace localhost with the IP address of the server (make sure you properly secure access to that machine, either with a firewall or with an API key). The port number for web requests is 8004 by default.

Now let's look at the web request used to execute a Content Grabber agent. The following request runs the agent synchronously and returns the data in JSON format.

http://localhost:8004/ContentGrabber/RunAgentReturnJson?agent=(agentname or path)&timeout=(timeout value)&pars=(inputParameters)

The parameters in the above request are described below.

agent=(agent name or path)
You can use either the name of the agent or its full path. If you give only the name, Content Grabber looks for the agent in the default location on your local system.

The default location for the agent is: C:\Users\Public\Documents\Content Grabber\Agents

If you have saved your agent in a different place, then you need to provide the exact path.

timeout=(timeout value)
The timeout value is the maximum number of seconds you want your agent to run. The default is 30 seconds: if you don't set this value, your agent automatically stops and closes its session after 30 seconds.

Putting it together: if the agent's name is "XYZ" and you don't want it to stop after a certain period, the request is: http://localhost:8004/ContentGrabber/RunAgentReturnJson?agent=XYZ&timeout=0
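The request above can also be assembled in code rather than typed by hand. The following is a minimal Python sketch, assuming the default local endpoint described in this tutorial; the helper name build_run_url is hypothetical and not part of Content Grabber.

```python
from urllib.parse import urlencode

# Default local REST endpoint used in this tutorial (port 8004).
BASE_URL = "http://localhost:8004/ContentGrabber"


def build_run_url(agent, timeout=30, base_url=BASE_URL):
    """Return the RunAgentReturnJson URL for an agent name or full path.

    build_run_url is a hypothetical helper written for this example.
    urlencode escapes characters such as backslashes and spaces, so a
    full Windows path can be passed as the agent value.
    """
    query = urlencode({"agent": agent, "timeout": timeout})
    return f"{base_url}/RunAgentReturnJson?{query}"


print(build_run_url("XYZ", timeout=0))
```

Running the sketch prints http://localhost:8004/ContentGrabber/RunAgentReturnJson?agent=XYZ&timeout=0, the same request shown above.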

pars=(inputParameters)
If your agent requires input parameters, send them as a URL-encoded, JSON-formatted list of input parameters. Input parameters let you configure the agent at runtime. For example, you can search for a specific item in an online catalog or phone registry.

In our case, the XYZ project accepts the input parameter last_name in the following format: {"last_name":"Smith"}

You need to URL-encode the JSON to pass it in the URL: http://localhost:8004/ContentGrabber/RunAgentReturnJson?agent=XYZ&timeout=0&pars=%7B%22last_name%22%3A%22Smith%22%7D
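If you would rather not encode the JSON by hand, the Python standard library can do it. The sketch below is one possible approach, not the only one: it serializes the parameters as compact JSON and then URL-encodes the string. The function name encode_pars is hypothetical.

```python
import json
from urllib.parse import quote_plus


def encode_pars(params):
    """Serialize input parameters as compact JSON, then URL-encode them.

    encode_pars is a hypothetical helper for this example. The compact
    separators avoid extra spaces, which would otherwise appear as '+'
    characters in the encoded value.
    """
    compact_json = json.dumps(params, separators=(",", ":"))
    return quote_plus(compact_json)


print(encode_pars({"last_name": "Smith"}))
```

The printed value, %7B%22last_name%22%3A%22Smith%22%7D, is what goes after pars= in the request.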

Postman:

We will use Postman to test web requests. Postman is a Google Chrome application for communicating with HTTP APIs. It has a user-friendly GUI for making requests and reading responses.

Example Bestbuy.ca

We already built an agent that scrapes the website https://bestbuy.ca to collect the product name, price, and model number.

The agent is built to accept input parameters to search for a specific product. I will send the name of that product as an input parameter. Content Grabber first searches for the product by name, then extracts the desired data.

How to create a basic request and enter your web request in Postman
Open Postman and select the GET HTTP method. Enter your request, then click Save and give the request a name. Next, select the collection folder where you would like to save it. We saved the two examples below in the Postman documenter.

Create a basic request and enter your web request.

Agent name : BESTBUY
Timeout : 0
Input parameter : {"search":"Fitbit Alta HR Fitness Tracker with Heart Rate Monitor - Small - Blue/Grey"}
Encoded input parameter : %7B%22search%22%3A%22Fitbit+Alta+HR+Fitness+Tracker+with+Heart+Rate+Monitor+-+Small+-+Blue%2FGrey%22%7D
Web request : http://localhost:8004/ContentGrabber/RunAgentReturnJSON?agent=BESTBUY&timeout=0&pars=%7B%22search%22%3A%22Fitbit+Alta+HR+Fitness+Tracker+with+Heart+Rate+Monitor+-+Small+-+Blue%2FGrey%22%7D

In my request, I used the IP address of the machine where Content Grabber is installed instead of localhost. Below is the response I received. When an agent is called through the API, there is no need to open Content Grabber to run the project and collect the information.

Now, let's change the input parameter to scrape the data for another product.

Agent name : BESTBUY
Timeout : 0
Input parameter : {"search":"SAMSUNG 960 EVO M.2 250GB NVMe PCI-Express 3.0"}
Encoded input parameter : %7B%22search%22%3A%22SAMSUNG+960+EVO+M.2+250GB+NVMe+PCI-Express+3.0%22%7D
Web request : http://localhost:8004/ContentGrabber/RunAgentReturnJSON?agent=BESTBUY&timeout=0&pars=%7B%22search%22%3A%22SAMSUNG+960+EVO+M.2+250GB+NVMe+PCI-Express+3.0%22%7D
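When a request does not behave as expected, it can help to decode the URL and check which values the agent will actually receive. The following Python sketch parses the web request above back into its parts; decode_request is a hypothetical helper, and it assumes the pars value is valid JSON.

```python
import json
from urllib.parse import parse_qs, urlsplit


def decode_request(url):
    """Split a Content Grabber web request into agent, timeout and pars.

    decode_request is a hypothetical helper for this example. parse_qs
    reverses the URL encoding ('+' becomes a space, %XX is unescaped).
    """
    query = parse_qs(urlsplit(url).query)
    return {
        "agent": query["agent"][0],
        "timeout": int(query["timeout"][0]),
        "pars": json.loads(query["pars"][0]) if "pars" in query else {},
    }


url = (
    "http://localhost:8004/ContentGrabber/RunAgentReturnJSON"
    "?agent=BESTBUY&timeout=0"
    "&pars=%7B%22search%22%3A%22SAMSUNG+960+EVO+M.2+250GB+NVMe+PCI-Express+3.0%22%7D"
)
print(decode_request(url))
```

The decoded search value should match the input parameter listed above exactly; a stray space or missing quote usually shows up here first.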

Again, you can view and run both queries from the Postman documenter.
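The same requests can also be sent from a script instead of Postman. The sketch below uses only the Python standard library and assumes the default local endpoint with "API key required" unchecked, as configured earlier. The run_agent function, its client_timeout value, and the opener argument (added so the function can be exercised without a running service) are all assumptions made for this example. The structure of the returned JSON depends on your agent, so the function simply returns whatever the service sends back.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed: default local endpoint from this tutorial, no API key required.
BASE_URL = "http://localhost:8004/ContentGrabber"


def run_agent(agent, params=None, timeout=0, base_url=BASE_URL,
              client_timeout=300, opener=urllib.request.urlopen):
    """Run an agent synchronously and return the parsed JSON response.

    run_agent is a hypothetical helper. `timeout` is the Content Grabber
    timeout parameter; `client_timeout` is how long this script waits for
    the HTTP response (300 seconds is an arbitrary choice).
    """
    query = {"agent": agent, "timeout": timeout}
    if params:
        # urlencode takes care of URL-encoding the JSON string.
        query["pars"] = json.dumps(params, separators=(",", ":"))
    url = f"{base_url}/RunAgentReturnJson?{urlencode(query)}"
    with opener(url, timeout=client_timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the Content Grabber service running, a call such as run_agent("BESTBUY", {"search": "SAMSUNG 960 EVO M.2 250GB NVMe PCI-Express 3.0"}) should send the second request shown above and return the scraped data as Python objects.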

In this tutorial, we saw that Content Grabber lets you quickly and easily build web scrapers, then start and configure them via a web API. Now you can build your own web application on top of web data extraction and scale it to handle multiple requests at the same time.

Congratulations, you made it to the end of this tutorial. Contact us if you have questions or need help implementing web scraping projects.

Tutorial written by Madhat Abrar

6.6.18