RefinePro Knowledge Base for OpenRefine: web scraping

This is the RefinePro knowledge base about OpenRefine. We have built it over the years and keep adding to it. From tutorials and how-tos to handy GREL expressions and links to external resources, you will find here one of the most comprehensive lists of resources for learning OpenRefine.

For comprehensive documentation, refer to the official OpenRefine wiki.

Don't know where to get started? Search for a specific function below, or read our most popular articles from the right side menu.

2.2.22

Visual Web Ripper (VWR) End of Life

9:54 AM web scraping

Visual Web Ripper (VWR), developed by Sequentum, was one of the first point-and-click web scraping tools, released over ten years ago. As of June 30, 2022, Sequentum has deprecated the license server for VWR. As a result, all Visual Web Ripper licenses are inactive, and users can no longer run their projects.

In this post, we highlight several key dates in the VWR end-of-life process. Sequentum provides a migration path from VWR to its latest technology.

Solving Google’s reCAPTCHA v2 with ParseHub Agent

1:03 PM captcha, parsehub, web scraping

ParseHub is a great point-and-click web scraping tool. While projects run on ParseHub servers, you can connect to third-party proxies like BrightData or a captcha resolution service like 2Captcha.

In this tutorial, we will show you how to bypass the Google reCAPTCHA v2 test page with a ParseHub agent and the 2Captcha service. You will need to create an account with 2Captcha and have an API key to complete this tutorial.

Don't hesitate to contact us if you want to access the ParseHub project, have questions or need help implementing web scraping projects.

How to call Content Grabber API

11:22 AM API, Content Grabber, postman, web scraping

Content Grabber is a powerful, easy-to-use web scraping software developed by Sequentum. Its point-and-click interface allows you to develop a scraper and retrieve data from any website quickly.

In this tutorial, we will describe how to call the Content Grabber API to trigger an agent and pass input parameters. Thanks to the Content Grabber API, you can embed the scraper in a more complex workflow and configure it on demand. We will first discuss the Content Grabber API, then build a simple example to show step by step how it works.

Don't hesitate to contact us if you have questions or need help implementing web scraping projects.

Fetch City and Province / State based on the postal code

2:10 PM API, geocoding, good practices, reconciliation, tutorial, web scraping

In the US, Canada and the UK, postal codes are a reliable key for retrieving information about a location. In this tutorial, we will use the Yahoo PlaceFinder API to add geographical content to a data set based on the postal code. The same approach can easily be reversed to run a query based on a latitude and longitude (see the end of this post).
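As a rough illustration of the lookup step, here is a minimal Python sketch that normalizes a postal code and builds a query URL. The endpoint `https://geocoder.example/lookup` and the parameter names `postal`, `country` and `format` are placeholders invented for this example; they are not the Yahoo PlaceFinder API, so substitute the values documented by whichever geocoding service you use.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption for illustration, not a real service).
BASE_URL = "https://geocoder.example/lookup"


def build_lookup_url(postal_code, country):
    """Return a lookup URL for one postal code.

    The parameter names below are hypothetical.
    """
    # Strip spaces and upper-case so "h2x 1y4" and "H2X1Y4" give the same query.
    normalized = postal_code.replace(" ", "").upper()
    query = urlencode({"postal": normalized, "country": country, "format": "json"})
    return f"{BASE_URL}?{query}"


url = build_lookup_url("h2x 1y4", "CA")
print(url)
```

Normalizing the postal code before building the URL keeps identical locations from producing different requests, which matters when the same call is repeated for every row of a data set.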

Parse mark up language (JSON, html, xml ...)

11:56 PM extract, html, JSON, split, web scraping, xml

In this tutorial, we will see how to parse a markup language like JSON, HTML or XML. These languages are great to parse because there is often an easily identifiable marker right before or after the content you want to extract. In this tutorial, we will use JSON and extract relevant information by following a six-step process.
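To make the idea concrete outside of OpenRefine, here is a small Python sketch of pulling values out of a JSON document by key path. It is not the six-step process from the tutorial. The sample record and the helper `extract_fields` are hypothetical and exist only for this illustration.

```python
import json

# Hypothetical sample record; the field names are illustrative only.
raw = '{"place": {"city": "Montreal", "province": "Quebec", "postal": "H2X 1Y4"}}'


def extract_fields(text, paths):
    """Parse a JSON string and return the value found at each dotted key path."""
    data = json.loads(text)
    results = {}
    for path in paths:
        node = data
        # Walk down one key at a time, e.g. "place.city" -> data["place"]["city"].
        for key in path.split("."):
            node = node[key]
        results[path] = node
    return results


fields = extract_fields(raw, ["place.city", "place.province"])
print(fields)
```

Using a real JSON parser rather than splitting on surrounding characters avoids breakage when the order of keys or the whitespace in the response changes.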

Full Tutorial by David Huynh

2:32 PM around the web, duplicate, editing, facet, geocoding, good practices, introduction, reconciliation, transpose, tutorial, web scraping

Download a 17-page tutorial on Google Refine
by David Huynh

RefinePro Knowledge Base for OpenRefine

2.2.22

Visual Web Ripper (VWR) End of Life

27.3.20

Solving Google’s reCAPTCHA v2 with ParseHub Agent

6.6.18

How to call Content Grabber API

23.10.11

Fetch City and Province / State based on the postal code

18.10.11

Parse mark up language (JSON, html, xml ...)

18.7.11

Full Tutorial by David Huynh
