22.11.15

Limitation when splitting and joining multi-valued cells

Split multi-valued cells function helps to transpose data stored in one cells into multiple rows, while keeping the relationship with the other columns in the data set. In this article we will see some of the limitation of the function when splitting and joining back a data set and how you can work around it.


The Limitation

Let's take this example with and dummy dataset generated with mockaroo. You can notice the following:
  1. the ip_address is the first column and some rows have no value
  2. the column tag contains multiple value separated by a comma.

Using the function Split multi-valued cells OpenRefine will create a new line for every tag. The column ip_address , first_name and last_name.



We can not start to do some cleaning on the tag field (using clustering or other GREL function). Once the operation done, we want to join back the tag value to have our project in its initial format using the function  Join multi-valued cells  and separating the value by a comma.


In the screenshot below you can see that rows with a blank ip_address have no tag as they have been merged wit the previous rows with no blank ip_address. 



Why 

When we split the data set, OpenRefine use the record function to keep track what tag value belong to which initial rows. Since we have blank value in the first column of our set, the ip_address one, OpenRefine doesn't detect properly where each record start and stop and bundle into one record what should actually be different one when joining multi-valued cells.

Work Around

The work around is to first make sure that there is no blank value in the first column of your data set before initially splitting the data.

You can do that by either moving an other column at the first index, like the first_name in our example. The safest solution is to create an index column and move it to the column 0.

The screen-shot below show the operation from 2 to 5 with tag properly joined


Why Moving the column tag in index 0 doesn't work

The fact that OpenRefine rely on the record model to split and join multi-valued cells have an other effect. The column you split cannot be the first of your project (at the index 0). Once your data split, OpenRefine will not be able to recognized each record since all the rows have content. 

The screenshot below show the results with the column tag at index 0 after splitting and joining multi-valued cells. 


This post was

0 comments:

Post a Comment