This is the RefinePro knowledge base about OpenRefine. We have built it over the years and keep adding to it. From great tutorials and how-tos to handy GREL expressions and links to external resources, you will find here one of the most comprehensive lists of resources for learning OpenRefine.

For comprehensive documentation, refer to the official OpenRefine wiki.

Don't know where to get started? Search for a specific function below, or read our most popular articles from the right-side menu.

Showing posts with label split. Show all posts

22.11.15

Limitation when splitting and joining multi-valued cells

The Split multi-valued cells function helps to transpose data stored in one cell into multiple rows, while keeping the relationship with the other columns in the data set. In this article we will see some of the limitations of the function when splitting a data set and joining it back, and how you can work around them.

10.10.14

Parsing Apache log using OpenRefine

Recently I was looking for a quick way to explore some Apache log files. I didn't want to set up any software, and I wanted to analyze a very precise path for a specific user, or what happens after a specific error. So I thought about OpenRefine and its parsing capabilities.

The recipe doesn't replace an analytical tool for understanding your traffic, but it helps you go behind the curtain and drill down to analyze a specific IP address or user, types of error codes, and patterns.
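To give a feel for the kind of parsing involved, here is a minimal Python sketch that breaks one log line into fields. It assumes the Apache Common Log Format; the pattern, the field names and the sample line are illustrative assumptions, not the expressions used in the article.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line, for illustration only.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
fields = parse_log_line(sample)
```

Once each line is split into fields like these, filtering on one IP address, user or status code becomes a simple comparison.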

31.10.12

Cleaning Date with Google Refine (from around the web)

A basic tutorial on cleaning up dates using OpenRefine. A great example of well-structured GREL syntax used to build complex transformations.

Read the full article on Hermanes Barbara's blog

10.9.12

Error: smartSplit error: Un-terminated quoted field at end of CSV line

I am a big fan of the smartSplit function. It is really easy to understand and helps to quickly extract part of a string based on any character. However, if a cell contains a double quote sign (") while you are using the smartSplit function, Google Refine will return the following error message:
Error: smartSplit error: Un-terminated quoted field at end of CSV line

Here is my workaround.

5.9.12

Google Refine Workshop (from around the web)

This tutorial / exercise will walk you through all of Google Refine's main functionality. It is built around exercises, so you can get hands-on quickly!

24.2.12

Selecting a string within a cell using smartSplit

The smartSplit function is a variation on the split function that allows you to split the cell content based on any string of characters and then select the part you want to work on. This function is very useful for extracting or removing a string within cells without creating multiple columns and then merging them back.
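The idea of splitting and then keeping a single piece can be sketched in Python. This is only an analogy using a plain split: it does not reproduce any quote handling that smartSplit performs, and the helper name `pick_part` and the sample values are hypothetical.

```python
def pick_part(value, separator, index):
    """Split value on separator and return the piece at index.

    Negative indexes count from the end, so -1 is the last piece.
    """
    return value.split(separator)[index]


# Hypothetical cell content, for illustration only.
month = pick_part("2012-02-24", "-", 1)
last_folder = pick_part("reports/2012/february", "/", -1)
```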

16.2.12

Count how often a character occurs in a cell

Did you know that Refine can count how often a string or character appears in a cell?

To achieve this, I first recommend that you store the count result in a separate column (so you do not overwrite your initial content). Select your reference column (where you want to do the count per cell) and create a new column based on this column. Another option is to store the result in a custom text facet.

We will use the GREL expression value.split(" ").length().

However, if the cell does not contain the value, Refine will still return '1'. I found two ways to work around this issue.
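The reason for the '1' is that splitting always yields one more piece than there are separators. The Python sketch below shows the same arithmetic and one possible correction (subtracting one); it is an illustration, not necessarily either of the two workarounds described in the post, and the function name is hypothetical.

```python
def count_occurrences(value, needle):
    """Count non-overlapping occurrences of needle in value.

    Splitting on the needle gives (occurrences + 1) pieces, so a cell
    without the needle still produces one piece. Subtracting one gives
    the real count.
    """
    pieces = value.split(needle)
    return len(pieces) - 1


with_spaces = count_occurrences("one two three", " ")
without_spaces = count_occurrences("onetwothree", " ")
```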

18.10.11

Parse mark up language (JSON, html, xml ...)


In this tutorial we will see how to parse markup languages like JSON, HTML or XML. These languages are great to parse because there is often an easily identifiable marker right before or after the content you want to extract. In this tutorial we will use JSON and extract relevant information by following a six-step process.

13.10.11

Update phone number format

This post is a quick adaptation, for phone numbers, of the method presented in Add a space to postal code (splitByLength and Merge function).

5.10.11

Extract from twitter hashtag and reference


This case was brought to me by cosmin, who wanted to extract hashtags from tweets for some analysis and data visualization. The data were gathered using ScraperWiki and its ability to scrape Twitter data into one single document (see the video tutorial).
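As a rough illustration of the extraction itself, here is a Python sketch using regular expressions. It assumes that "reference" in the title means @mentions and that a tag is a run of letters, digits or underscores; the tweet text is made up, and this is not the GREL approach used in the post.

```python
import re

HASHTAG = re.compile(r"#\w+")   # assumed definition of a hashtag
MENTION = re.compile(r"@\w+")   # assumed definition of a reference


def extract_tags(tweet):
    """Return (hashtags, mentions) found in the tweet, in order."""
    return HASHTAG.findall(tweet), MENTION.findall(tweet)


# Hypothetical tweet, for illustration only.
hashtags, mentions = extract_tags(
    "Cleaning data with #OpenRefine, thanks @example #opendata"
)
```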

4.10.11

Video tutorial to clean up your dataset (by free your metadata)

A great video tutorial from Free Your Metadata which shows you how to:

18.9.11

Google Refine 2.0 Training video

In this video you will learn to:

19.7.11

Add a space to postal code (splitByLength and Merge function)

This short tip explains how to convert postal codes stored on 6 characters to 7 by adding a space after the first 3 characters. We will use splitByLength (see related video) and the merge multiple columns into one function.
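The transformation amounts to cutting the code into two 3-character pieces and joining them with a space. A minimal Python sketch of that logic follows; the function name and the sample code are hypothetical, and values that are not exactly 6 characters long are left untouched by assumption.

```python
def add_space_to_postal_code(code):
    """Insert a space after the third character of a 6-character code."""
    if len(code) != 6:
        # Assumption: anything that is not 6 characters is returned as-is.
        return code
    first, second = code[:3], code[3:]   # the split step
    return first + " " + second          # the merge step


formatted = add_space_to_postal_code("H2X1Y4")
```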

29.6.11

Split cell content into multiple column, non fixed field length

I recently received a file to work on that was generated by Crystal Reports, and I had to deal with this format as no other was available. In my case, the data were supposed to be split into 11 columns; in the original file they were all in 1, separated by a variable number of spaces. This post presents a process to split cell content when you have no markup. JSON code is provided for reference below.
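The core difficulty, a separator made of a variable number of spaces, can be sketched in Python as follows. This is only an illustration of the idea, not the process from the post; the sample line is made up, and the approach assumes that no field contains a space of its own.

```python
def split_on_space_runs(line):
    """Split a line into fields, treating any run of whitespace as one separator."""
    # str.split() with no argument collapses consecutive whitespace
    # and ignores leading and trailing whitespace.
    return line.split()


# Hypothetical report line with uneven spacing, for illustration only.
fields = split_on_space_runs("1001   Smith      Montreal  42.50")
```

If a field can itself contain spaces, this simple split breaks it apart, so a fixed-width or pattern-based approach would be needed instead.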

25.6.11

Using "splitByLengths" in Google Refine

Learn how to use the "splitByLengths" function in Google Refine to split a single column into multiple columns based.
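For readers who think in code, splitting by lengths can be sketched in Python as consecutive slices of the given sizes. The helper below is a hypothetical illustration of the idea rather than Refine's implementation; in this sketch, any characters beyond the listed lengths are dropped.

```python
def split_by_lengths(value, lengths):
    """Cut value into consecutive pieces with the given lengths."""
    parts = []
    start = 0
    for length in lengths:
        parts.append(value[start:start + length])
        start += length
    return parts


# Hypothetical values, for illustration only.
postal_parts = split_by_lengths("H2X1Y4", [3, 3])
word_parts = split_by_lengths("internationalization", [5, 6, 3])
```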