This is the RefinePro knowledge base about OpenRefine. We have built it over the years and keep adding to it. From great tutorials and how-tos to handy GREL expressions and links to external resources, you will find here one of the most comprehensive lists of resources for learning OpenRefine.

For comprehensive documentation, refer to the official OpenRefine wiki.

Don't know where to get started? Search for a specific function below, or read our most popular articles from the right-side menu.

28.4.12

Field format changed accidentally to Number and how to add leading 0s

It is easy to inadvertently convert a field that stores numbers as text into number format. This mainly happens during project creation (import) or when creating a new column. The conversion to number can lead to a loss of data, such as leading 0s. Here is how to get them back and prevent this from happening again.
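
The idea can be sketched outside Refine. The following is a minimal Python illustration, not the GREL expression from the article: it assumes the codes have a known fixed width (5 digits here, a hypothetical value), converts each number back to text, and left-pads it with zeros.

```python
def restore_leading_zeros(value, width=5):
    """Convert a number back to text and left-pad it with zeros.

    `width` is an assumed fixed length (for example a 5-digit postal code);
    it is not taken from the original article.
    """
    text = str(int(value))  # int() also drops a trailing ".0" from floats
    return text.zfill(width)


codes = [2134, 501.0, 90210]
padded = [restore_leading_zeros(c) for c in codes]
print(padded)  # ['02134', '00501', '90210']
```

This only works when the original width is known; if codes had varying lengths, the lost zeros cannot be recovered from the number alone.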


26.4.12

Data exploration tutorial with Google Refine

Recently, Hugh Stimson published a great article: Data Mining My Old Radio Playlists. His post mixes tutorials on PHP scripting, data cleaning with Google Refine, and data analysis with PostgreSQL.

This answer post demonstrates that the data analysis is fully doable in Google Refine using really basic functions (I'll use a GREL function only once, for the long tail analysis). I think this post is also a good illustration of my previous post on data exploration using Google Refine.

10.4.12

Fusion Tables: map multiple items with the same location


When you want to map multiple items with the same location in Fusion Tables, only one item is displayed and all the others are ignored. There are several workarounds for this major limitation. The most common is to change your coordinates (longitude / latitude) slightly so that the points appear close to each other on the map (a tip from the Google Fusion Tables team itself).

When working with a large data set, identifying and manually correcting all records that share the same location can become time-consuming. So I looked at how to deal with this in Google Refine and ended up with this straightforward process.
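
To make the coordinate-shifting idea concrete, here is a minimal Python sketch. It is not the Refine process from the article; the function name and the offset `step` (0.0001 degrees, roughly 11 m of latitude) are assumptions chosen for illustration. Each repeated location is shifted a little further than the previous one, and the first occurrence is left untouched.

```python
from collections import defaultdict


def spread_duplicates(points, step=0.0001):
    """Shift repeated (lat, lon) pairs slightly so every item gets its own spot.

    `step` is an assumed offset in degrees; the first occurrence of a
    location keeps its original coordinates.
    """
    seen = defaultdict(int)
    result = []
    for lat, lon in points:
        n = seen[(lat, lon)]      # how many times this location was already seen
        seen[(lat, lon)] += 1
        result.append((round(lat + n * step, 6), round(lon + n * step, 6)))
    return result


points = [(45.5, -73.6), (45.5, -73.6), (40.0, -74.0), (45.5, -73.6)]
shifted = spread_duplicates(points)
for original, moved in zip(points, shifted):
    print(original, "->", moved)
```

A fixed diagonal offset keeps the result reproducible. A random offset would also work, but it would move the points differently on every run.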

9.3.12

Difference between a record and a row

Google Refine makes a clear distinction between a row and a record. We will see the difference between the two and the advantages of working in records mode.

Fill down the right and secure way

The fill down function takes the content of a cell and copies it down into the blank cells that follow. This is done based on row order. When you perform this action using the fill down function, Google Refine does not take into account whether rows belong to different records: if the following rows are blank, it fills them with the content of the previous row.

If you do not use this function with extra care, you can easily corrupt the integrity of your data set. In a nutshell, use row.record.cells[columnName].value[0] to fill down data within the same record.

Here is why, and how to avoid that.
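
The risk can be modelled in a few lines of Python. This is a sketch of record-aware filling, not of Refine's internals: it assumes a record starts on every row whose key column is non-blank, and the column names `id` and `city` are made up for the example. A plain fill down would copy "Paris" into record B; the version below resets at each record boundary.

```python
def fill_down_within_records(rows, key, column):
    """Fill blank cells in `column` without crossing a record boundary.

    Assumption: a new record begins on each row whose `key` cell is non-blank.
    """
    filled = []
    current = None
    for row in rows:
        if row.get(key):          # new record: forget the previous record's value
            current = None
        if row.get(column):       # remember the latest value seen in this record
            current = row[column]
        filled.append({**row, column: current})
    return filled


rows = [
    {"id": "A", "city": "Paris"},
    {"id": None, "city": None},   # belongs to record A
    {"id": "B", "city": None},    # record B has no city at all
    {"id": None, "city": None},   # belongs to record B
]
result = fill_down_within_records(rows, "id", "city")
print([r["city"] for r in result])  # ['Paris', 'Paris', None, None]
```

Note that this sketch propagates the most recent value within a record, whereas the GREL expression above picks the first value of the column in the record. The two agree when each record has a single value in that column.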

24.2.12

Selecting a string within a cell using smartSplit

The smartSplit function is a variation on the split function that allows you to split the cell content on any string of characters and then select the part you want to work on. This function is very useful for extracting or removing a string within a cell without creating multiple columns and then merging them back.
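
The split-then-select pattern looks like this in Python. This is only an analogy built on plain `str.split`; it does not reproduce any smartSplit-specific behaviour, and the helper name `pick_part` is hypothetical.

```python
def pick_part(value, separator, index):
    """Split `value` on `separator` and return one piece, or None if it is missing."""
    parts = value.split(separator)
    try:
        return parts[index]
    except IndexError:
        return None


cell = "Montreal - Quebec - Canada"
print(pick_part(cell, " - ", 1))   # Quebec
print(pick_part(cell, " - ", -1))  # Canada
```

Selecting the piece directly avoids the detour of splitting into several columns and merging them back afterwards.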

16.2.12

Count how often a character occurs in a cell

Did you know that Refine can count how often a string or character appears in a cell?

To achieve this, I first recommend that you store the count result in a separate column (so you do not overwrite your initial content). Select your reference column (the one whose cells you want to count in) and create a new column based on this column. Another option is to store the result in a custom text facet.

We will use the GREL expression value.split(" ").length().

However, if the cell does not contain the value, Refine will still return 1. I found two ways to work around this issue.
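
The reason for the stray 1 is that splitting always yields one more piece than there are separators, so a cell with no match still produces a single piece. The Python sketch below shows the same arithmetic and one possible correction (subtracting 1). It is an illustration only and not necessarily either of the two workarounds described in the article.

```python
def count_occurrences(value, needle):
    """Count non-overlapping occurrences of `needle` using split-and-count."""
    pieces = value.split(needle)
    return len(pieces) - 1   # n separators always produce n + 1 pieces


print(len("a b c".split(" ")))        # 3 pieces for 2 spaces
print(len("abc".split(" ")))          # 1 piece even though there is no space
print(count_occurrences("a b c", " "))  # 2
print(count_occurrences("abc", " "))    # 0
```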