This is the RefinePro knowledge base about OpenRefine. We build it over the years, and keep adding to it. From great tutorials and how-to, to handy GREL expressions and links to external resources, you will find here one of the most comprehensive list of resources to learn OpenRefine.

For a comprehensive documentation you should refer to the official OpenRefine wiki.

Don't where to get started? Search for a specific function below, or read our most popular article from the right side menu.

21.12.12

The Named Entity Extractor extension by Free You Metadata (from around the web)

The Free Your Metadata Named Entity Extractor extension helps you to enrich your data in OpenRefine using AlchemyAPI, DBpedia Lookup and Zemanta. The extension works on plain text field and any unstructured (meta)data

15.11.12

Mining and OpenRefine(ing) JISCMail: (from around the web)

A look at OER-DISCUSS [Listserv] JISC CETIS MASHe: a complete tutorial to scrap data from a mailing list and analyse participant and contribution.

Read the full article.

Finding (Nearly) Duplicate Items in a Data Column (from around the web)

An other great article by Tony Hirst. This tutorial will show you how to use clustering function (ngram and fingerprint) directly in your facet. Really handy.

Read the full article.

7.11.12

From Excel file to RDF with links to DBpedia and Europeana (from around the web)

DERI Galway the author of the RDF extension (download and documentation here) show steps by steps how to use the RDF extension to reconcile your data against DBpedia and Europeana. This tutorial also go through the step to create an RDF schema

6.11.12

Chit Chat with New Datasets – Facets in OpenRefine (Was /Google Refine/) (from around the web)

A good review of faceting capability including text, numeric, timeline customized and scatterplot facet.

Read the full article 

31.10.12

Cleaning Date with Google Refine (from around the web)

Basic tutorial to clean up some date using OpenRefine. Great example of well structure GREL syntax to build complex transformation.

Read the full article on Hermanes Barbara's blog

26.10.12

Refine your EventBrite guestlist (from around the web)

This recipe shows you how to use Google Refine to fetch details from your EventBrite account, and to explore your guest lists in detail. This tutorial show how to - use the Create Project via Web Addresses (URLs)' and - retrieve guest information using the EventBrite API.

The full article: http://www.opendatacookbook.net/wiki/recipe/a_refined_guestlist

22.10.12

A framework for the OpenRefine community

Following the results from the Google Refine Usage Survey, I would like to share a more personal vision of the birth of the OpenRefine community. The code and all issues have been recently moved to Github, the wiki will close soon and the project would have left the Google code environment.

However while a clear consensus have been found to go for GitHub (GitHub got voted 35 out of 43 responses, see results here) to host the code and issue tracker, I am not sure that GitHub is the right place to for the documentation. In this post I'll try to explain the reason why. Please note that I am open to comments and suggestions regarding analysis and proposition I'll do in this post. OpenRefine is now within the community hands and everyone voice count.

19.10.12

Google Refine Administrativa Survey Results

Following the Usage Survey (see first results), we open a survey to understand the community preference regarding tool to administrate OpenRefine. Thanks to the 43 participants. 

I blow provide a first flat analysis of the results with decision that have already been made based on this survey. You can access details answers here and the survey form here.

18.10.12

Google Refine Recipe (from around the web)

Keith Maguire provide a list of short and sweet recipe to Capitalise the first letter, isolating value, faceted browsing or Comparing two columns with Refine. Enjoy!

via Delicious http://www.keithmaguire.net/blog/categories/refine/

Google Refine Usage Survey Results

Following the survey on Google Refine usage we distribute last week I would like first thanks the 99 persons who participated. Thanks to your answer, we now have a better understanding of who use Google Refine, how and what's the community expectations are.

Thanks again for spreading the word and providing detailed and insightful answers. Here is a first flat analysis of answer collected. You can access details answers here (email addresses have been removed) and the survey questions here.

10.10.12

Google Refine project administrivia survey

The survey is now closed. For archive purpose please find a copy of the questions below. Thanks to all the participants.

5.10.12

Open Refine Survey

The survey is now closed. For archive purpose please find a copy of the questions below. Thanks to all the participants.

3.10.12

From Freebase Gridworks to Google Refine and now OpenRefine

Yesterday David Huynh announced that Google will soon stop its active support of Google Refine and count of community to get more involved to growth Refine.

Refine is already a mature data cleaning tool, this change in leadership will be a major challenge for the tool continuity. But first I'd like to clarify what I have read on twitter yesterday night. Google Refine has always been an open source tool and anyone can commit changes, develop an extension or update the wiki.

Through this post I'd like to give my insight on the reason of this decision and what will be the short terms consequences of it.


Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi (from around the web)

Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi 

This neat tutorial explained how to import data directly from the twitter API at the project creation stage using JSON language. The second part of the tutorial explains how to prepare the data to import the in Gephi for data visualization purpose.

via Delicious 

29.9.12

Use Google Refine to clean your data for Fulcrum (from around the web)

Use Google Refine to clean your data for Fulcrum

Fulcrum allow to create location-based data collection apps and deploy them to your mobile device. This tutorial show how to use google refine to take advantages of the data you have collected using fulcrum

via Delicious http://docs.fulcrumapp.com/guides/cleaning-up-data-with-google-refine/

10.9.12

Error: smartSplit error: Un-terminated quoted field at end of CSV line

I am a big fan of the smartSplit function. It is really easy to understand and help to extract quickly part of a string based on any character. However if while using the smartSplit function a cells contains a double quote - " - sign, google refine will return the following error message
Error: smartSplit error: Un-terminated quoted field at end of CSV line

Here is my work around.

5.9.12

Google Refine Workshop (from around the web)

This tutorial / exercise will walk you through all google refine main functionality. Through it's exercise so you can get your hand on quickly!

Data Journalism Workshop - New York (from around the web)

Google Hangout of HHNew York presenting Google Refine

11.8.12

Google Refine Uploader and Stats Extension



The Google Refine Uploader Extension allows you to export datasets from Google Refine and post them as JSON to web servers! Intended for use with CouchDB. Please note that this extension is a work in progress. Feel free to join and help 


This extension is based on the Chicago Tribune Stats extension. A tutorial is available on their blog. Please note that the extension does not work with Google Refine 2.5. It should be tested with the 2.0 version available here.

If you have installed and tried any of those two extensions, I`ll be pleased to hear from you!

Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column


Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column


Great tutorial to reshape data set using transpose and fill down function. This article also introduce the split multi-valued cells function to split and transpose in one shot.

27.6.12

Google Refine Reconciliation Service support for Apache Standbol (from around the web)

Add support for the Reconciliation Service API to the Apache Stanbol

Entityhub RESTful API (see documentation). The Google Refine ReconciliationServiceApi allows to reconcile String values with Entities.  The Entityhub is very well suited for implementing this service as it can execute those queries very efficiently based on the SolrYard implementation.

Capturing Interactive Data Transformation Operations using Provenance Workflows (from around the web)

Capturing Interactive Data Transformation Operations using Provenance Workflows


Abstract:


The ready availability of data is leading to the increased opportunity of their re-use for new applications and for analyses. Most of these data are not necessarily in the format users want, are usually heterogeneous, and highly dynamic, and this necessitates data transformation efforts to repurpose them. Interactive data transformation (IDT) tools are becoming easily available to lower these barriers to data transformation efforts. This paper describes a principled way to capture data lineage of interactive data transformation processes. We provide a formal model of IDT, it's mapping to a provenance representation, and its implementation and validation on Google Refine. Provision of the data transformation process sequences allows assessment of data quality and ensures portability between IDT and other data transformation platforms. The proposed model showed a high level of coverage against a set of requirements used for evaluating systems that provide provenance management solutions.


25.6.12

Google refine ; JSON and my notepad or how to write script in google refine

One of the nice thing about google refine is that every action you do generate a JSON code. If we want to do a comparison with Excel, the JSON code generated can be compared to record a macro. The sweet spot of Google Refine is that you don't need to click on the record button, it keep track of all your actions automatically and that can be easily exported for back up or editing purpose.

5.6.12

Creating row and record index

Google Refine provide the row index as information in the third column. Unfortunately GREL expression cannot call value in this column, you need to use one of the following expression to generate the value.


4.6.12

Sort by multiple criteria

Google Refine sort function allow a combination of several columns to sort by field A and field B. 


In my case, I used this method as a work extract the most recent title posted from a records in a list of radio show (using a timestamp field). As I am not aware for a way to select a specific row within a record, I used the sort function to present the record I wanted to extract at the top my the record group.

3.6.12

Google Refine + Perl (from around the web)




Make Google Refine and Perl transforms one-liners work together using the fetch by url (RESTful API)

2.6.12

Create records in Google Refine

This short tutorial describe how to create records in Google Refine. For the difference between a row can present a data set in row or record mode (see the difference between the two).


28.4.12

Field format change accidentally to Number and how to add leading 0

By inadvertence one can transform quickly a field containing number in a text format to number format. This mainly happen during the project creation (import) or when creating new column. This conversion to number can lead to a loss of data like leading 0. Here is how to get them back and avoid this to happen again.


26.4.12

Data exploration tutorial with google refine

Recently, Hugh Stimson published a great article: Data Mining My Old Radio Playlists. His post mix tutorials on php scripting, data cleaning with google refine and data analysis with PostgreSQL.

This answer post demonstrate that data analysis is fully doable in google refine using really basic function (I'll be using GREL function only once for the long tail analysis). I guess also this post is a good illustration of my previous post on data exploration using google refine.

10.4.12

Fusion Table, map multiple items with the same location


When you want to map multiple items with the same location in Fusion Table, only one item is displayed and all the others are ignored. There is several workaround to this  major limitation, and the most common is to change slightly your coordinate (longitude / latitude) so your point will appear close to each other on the map (tip from the google fusion team itself). 

When working with large data set, identifying and manually correcting all records sharing the same location can become time consuming. So I've been looking how to deal with this in Google refine and ends up with this straight forward process.

9.3.12

Difference between a record and a row

Google refine make a clear distinction between a row and a record. We will see what's the difference between the two and advantages to works in records mode.

Fill down the right and secure way

The fill down function consists of taking the content of cells and copying down following blank cells. This is done based on the rows number. When you perform this action using the fill down function, Google refine does not take into account if rows belong to different records or not, if the following rows are blank, it will fill it down with the content of the previous row.

If you do not use this function with extra care you can easily corrupt the integrity of your data set. In a nutshell use  row.record.cells[columnName].value[0]   to fill down data within the same record. 

Here is why, and how to avoid that.

24.2.12

Selecting a string within a cell using smartSplit

The function smartSplit is a variation on split function that allow you to split the cell content based on any string of character and then select the leg you want to work on. This function is very useful to extract or remove string within cells without creating multiple columns and then merging them back.

16.2.12

Count how often a character occurs in a cell

Did you know that Refine can count how often an string or character appears in a cell?

To achieve this, I first recommend that you store the count result in a separate column (so you do not write over your initial content). Select your reference column (where you want to do the count per cells) and create a new column based on this column. An other option is to store the result in a custom text facet.

We will use the Grel expression value.split(" ").length().

However if the cells does not contains the value Refine will still return '1'. I found two ways to work around this issue.

13.2.12

How to: convert easting/northing into lat/long for an interactive map

How to: convert easting/northing into lat/long for an interactive map 
Google Fusion Tables is great for creating interactive maps from a spreadsheet – but it isn't too keen on easting and northing. That can be a problem as many government and local authority datasets use easting and northing to describe the geographical position of things – for example, speed ...

28.1.12

Merging Datasets with Common Columns in Google Refine


Merging Datasets with Common Columns in Google Refine

It's an often encountered situation, but one that can be a pain to address – merging data from two sources around a common column. Here's a way of doing it in Google Refine…Here are a couple of example datasets to import into separate Google Refine projects if you want to play along, both courtesy ...

23.1.12

Remove all number from a string

To remove all digits value from a string the following regex should do the work (from the menu edit cell > transform):
replace(value, /\d/, '')

This expression replace all numbers (identified by the regular expression /\d/) by blank. 
All regular expression (regex) in google refine should starts and ends by the character: /

17.1.12

Delete multiple projects at once (using the explorer)

At the end of a large project, you may have created multiple of google refine projects and want to clear them from the Open Project view. Screenshot (click on picture to enlarge it) below are from google refine 2.5. The same process is applicable to previous version, only the button browse workspace directory changed is place.