25.6.12

Google refine ; JSON and my notepad or how to write script in google refine

One of the nice thing about google refine is that every action you do generate a JSON code. If we want to do a comparison with Excel, the JSON code generated can be compared to record a macro. The sweet spot of Google Refine is that you don't need to click on the record button, it keep track of all your actions automatically and that can be easily exported for back up or editing purpose.


Please note that the JSON code generated is intimately linked with the history feature of Google Refine. Every changes in one of them will impact the other. The history feature actually use this code to undo / redo your actions in refine.

Extract JSON code

To access the JSON code, open the undo / redo tab and click on the "Extract ..." button


This open Extract Operation History window which have two sections:
  1. The list of action performed in the project. This is tied to your project history.
  2. The corresponding JSON code for every steps. Select the code from this panel and copy past it to any text editor software (like notepad)

You can extract part of the JSON code by unticking / ticking only the action you are interested in. Please note that cells edited individually do not generate JSON and the action appear in grey in the Extract Operation History window.


Edit the code

This tutorial will not get into the details of the JSON code (this will be for an other time), but rather explain how you can edit quickly your code using a standard text editor (like Notepad++) to apply the same operation to an other column (field) or project.

  1. Select the JSON code and copy past it in your text editor
  2. Use the replace function (often ctrl+H in windows based application) to update the column name with the one you want to apply the actions on.
  3. Select all the code, copy and past it in google refine using the apply function. Please note that:
    1. If the JSON code contains errors, google refine will not apply the code ;
    2. Each section of your code generate back a step in the history. You can go back in your project and check that everything went well. Be careful the description might not match your new actions if you didn't update it previously in the text editor and may display the initial description in the history. 
  4. Check your data integrity by faceting your new column on blanks (did I lost some data) or looking for any specific changes in the data you've done that could have been wrong (like changing toNumber or adding leading zeros ...)
During the edit mode you can also remove some useless action by deleting the matching code. 

Examples

The two following articles presents some ready to use JSON code to edit and adapt to your project. Feel free to play with them to get familiar with this notion: 

Limitation

The following limitations apply when you want to reuse code from a column or project to an other.
  • Cells edited individually can not be exported; so think global when you use Google Refine, this not Excel where you edit cells one by one, instead
  • Think broad and generic when using facet so your code will be able to adapt to different situation ;
  • Do check the output! Your new data-set might be slightly different from the one you used to generate the code and you code might missed some specific cases.






This post was

0 comments:

Post a Comment