Parsing and extracting HTML tag and links in Refine ~ RefinePro Knowledge Base for OpenRefine

17.2.15

Parsing and extracting HTML tag and links in Refine

I recently helped someone on Stack Overflow parse and extract information from an HTML page. Refine with GREL offers multiple ways to select specific elements and content. This article reviews the main functions and specific use cases to illustrate when to use them.

Let's use the following code snippet throughout this article to illustrate the different GREL expressions:

<html>

Here is the content of my page with two images:

<img alt="" style="width: 62px;" src="image1.png"> and <img alt="" style="width: 62px;" src="image2.png">

and a link <a href="http://www.example.com" rel="Example"><img alt="" src="example.gif"/></a>

</html>

Extract content from a particular tag

If you want to extract the content of a specific tag into a new column, the parseHtml() and select() functions are here to help. For example, the following expression selects every image tag in the cell: value.parseHtml().select("img")

The expression returns an array. To select a specific element in the array, use [0] for the first element, [1] for the second, and so on. The selection follows the same logic as picking an element from the array returned by a split() function.

For example, the expression value.parseHtml().select("img")[0] returns <img alt="" style="width: 62px;" src="image1.png">

Alternatively, you can return all the results in the array as a single string. Use toString() at the end of the expression, like this: value.parseHtml().select("img").toString() will return a string such as <img alt="" style="width: 62px;" src="image1.png"> <img alt="" style="width: 62px;" src="image2.png">
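To check the select-then-index logic outside Refine, here is a minimal Python sketch using only the standard library. It is not how GREL is implemented: the TagCollector class, the select() helper, and the PAGE string (the sample snippet with empty alt attributes assumed) are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Sample page from this article (empty alt attributes are an assumption).
PAGE = (
    '<html>Here is the content of my page with two images: '
    '<img alt="" style="width: 62px;" src="image1.png"> and '
    '<img alt="" style="width: 62px;" src="image2.png"> '
    'and a link <a href="http://www.example.com" rel="Example">'
    '<img alt="" src="example.gif"/></a></html>'
)


class TagCollector(HTMLParser):
    """Collect the raw source text of every start tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> are routed here as well.
        if tag == self.tag_name:
            self.found.append(self.get_starttag_text())


def select(html, tag_name):
    """Hypothetical helper: return all matching tags as a list of strings."""
    collector = TagCollector(tag_name)
    collector.feed(html)
    collector.close()
    return collector.found


images = select(PAGE, "img")      # a list, like the array returned in Refine
first_image = images[0]           # [0] picks the first match
all_images = " ".join(images)     # every match merged into one string
```

Note that the image nested inside the link is matched too, so the list holds three tags for this sample.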

Remove Tag

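If you just want to remove a tag, the parseHtml() function might be overkill. Simply use value.replace('<img','') to strip the opening <img text from every image tag (the attributes and the closing > stay in place), or value.replace('<div>','').replace('</div>','') to remove all the <div> tags while keeping their content.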
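The effect of a literal replacement can be sketched in plain Python, whose str.replace() also substitutes every occurrence of a fixed string. The snippet string below is a made-up example, not taken from the article.

```python
# Hypothetical input used only for this illustration.
snippet = '<div>Intro <img src="image1.png"> text</div>'

# Chained literal replacements remove the opening and closing <div> tags
# and keep whatever was between them.
no_div = snippet.replace('<div>', '').replace('</div>', '')

# Replacing only '<img' leaves the attributes and the closing '>' behind.
no_img_prefix = snippet.replace('<img', '')

print(no_div)
print(no_img_prefix)
```

The second result shows why a literal replacement suits fixed tags like <div> better than tags that carry attributes.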

Remove Tag and its content

Here we need a regular expression to select the tag and everything within it. In this case we will remove everything between the opening and the closing of an image tag.

Select everything (.*) between <img and /> (add a \ to escape the / so Refine doesn't take it as the end of the regex). We need to add a ? to make the expression "non-greedy" so that it stops at the first />

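The final expression is: value.replace(/<img (.*?) \/>/,'')

Note that this pattern only matches tags written with a space before the closing />. Tags closed with a bare > or with /> and no space, as in the snippet above, are left untouched; a more tolerant pattern for those is value.replace(/<img[^>]*>/,'')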
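The difference between the greedy and non-greedy forms can be checked with Python's re module, which uses the same ? modifier. This is a sketch on made-up input strings; Python needs no backslash before the /, since it does not delimit patterns with slashes.

```python
import re

# Hypothetical input with two image tags closed by ' />'.
text = 'one <img alt="" src="a.png" /> two <img alt="" src="b.png" /> three'

# Greedy: (.*) runs to the LAST ' />', swallowing the text between the tags.
greedy = re.sub(r'<img (.*) />', '', text)

# Non-greedy: (.*?) stops at the FIRST ' />', so each tag is removed separately.
lazy = re.sub(r'<img (.*?) />', '', text)

# A more tolerant pattern: anything up to the next '>', whatever the tag ending.
mixed = 'a <img src="x.png"> b <img src="y.gif"/> c'
tolerant = re.sub(r'<img[^>]*>', '', mixed)

print(repr(greedy))
print(repr(lazy))
print(repr(tolerant))
```

With the greedy form the word "two" disappears along with both tags, which is the behavior the ? is there to prevent.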

Extract links only

If you use the expression value.parseHtml().select("a")[0] as described previously, Refine will also include all the link's attributes (such as nofollow), nested images, and so on. In this case it will return:
<a href="http://www.example.com" rel="Example"><img alt="" src="example.gif"/></a>

The split() function allows us to be more precise and select only the content between href= and the closing double quote: value.split('href=')[1].split('"')[0] will return only http://www.example.com
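The same two-step split can be sketched in Python. One difference to be aware of: Python's str.split() keeps a leading empty string, so this sketch splits on href=" with the opening quote included. The link variable simply holds the sample anchor tag, with an empty alt attribute assumed.

```python
# Sample link from this article (empty alt attribute is an assumption).
link = ('<a href="http://www.example.com" rel="Example">'
        '<img alt="" src="example.gif"/></a>')

# Step 1: keep everything after the opening quote of the href attribute.
after_href = link.split('href="')[1]

# Step 2: cut at the next double quote, which closes the attribute value.
url = after_href.split('"')[0]

print(url)
```

This approach assumes the cell contains at least one href attribute written with double quotes; otherwise the [1] index fails.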
