27.10.15

Using facets to cluster a subset of your data when you have a large number of rows in OpenRefine


When working in OpenRefine, you sometimes have to deal with a large data set in which many values refer to the same entity (a person, city, book or any other entry) but use different spellings. With a large data set, the clustering function can become unresponsive because of the amount of computation needed to run the different algorithms.


In this tutorial we will see how you can use facets to create subsets of records to cluster, so that you can better manage the compute load on your machine. If you want to cluster your full data set, you just need to run the cluster function on each of the subsets created with your facet.




Clustering is a powerful function in OpenRefine that works by grouping similar entries, so that values that look alike end up together. For example, it will detect the same person when the first name, middle name or last name is spelled slightly differently.
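To make the idea of grouping look-alike values more concrete, here is a minimal Python sketch of key-based clustering in the spirit of the fingerprint method. It is an approximation written for illustration, not OpenRefine's actual code: the function names `fingerprint` and `cluster` and the sample names in the usage note are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-identical strings collide."""
    # Trim surrounding whitespace and lowercase everything.
    text = value.strip().lower()
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Split on whitespace, remove duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a key; keep groups with at least two distinct spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]
```

For example, `cluster(["Tom Cruise", "Cruise, Tom", "tom  cruise", "Paul Newman"])` groups the first three values together, because they all reduce to the key `cruise tom`, and leaves "Paul Newman" out since nothing else shares its key.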


In our example we will cluster the text field to group together similar tweets that differ only in spelling (due to a retweet - RT - or someone editing the tweet before re-posting it). Our example is limited to a few thousand rows, so you won't really need to combine clustering with a facet. This is just to show you how it works.





1) From the column menu, select Edit Cells -> Cluster ... OpenRefine will then scan and cluster all the cells in that column using the method and parameters you set. Using the fingerprint function, we get six clusters.




2) Now, create a timeline facet and select the period of time you want to focus your analysis on. In our case we are interested in the tweets between 2015/08/31 and 2015/09/12.



3) Invoke the Cluster function again. We now have only three proposed clusters.
The timeline facet narrowed the selection of rows passed to the clustering function.



Similar tweets have now been merged.

Again, this is just a simple example with a small data set. With a larger amount of data, the same concept still applies: the facet reduces the number of rows the clustering function has to process. The same can be done with any type of facet or text filter.
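The facet-then-cluster workflow above can also be sketched outside OpenRefine. The Python below is only an illustration under stated assumptions: the rows are invented sample data, `simple_key` is a simplified stand-in for the fingerprint method, and `cluster_subset` is a hypothetical helper that plays the role of the timeline facet followed by the Cluster function.

```python
from collections import defaultdict
from datetime import date


def simple_key(text):
    # Simplified stand-in for a fingerprint key (assumption, not OpenRefine code):
    # lowercase, turn non-alphanumerics into spaces, sort the unique tokens.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster_subset(rows, start, end):
    # "Facet": keep only the rows whose date falls in the selected period.
    subset = [text for day, text in rows if start <= day <= end]
    # "Cluster": group the remaining texts by their normalized key.
    groups = defaultdict(list)
    for text in subset:
        groups[simple_key(text)].append(text)
    # Only groups with at least two distinct spellings are worth merging.
    return [group for group in groups.values() if len(set(group)) > 1]


# Invented sample rows: (date, tweet text).
rows = [
    (date(2015, 8, 20), "Open data is great"),
    (date(2015, 8, 21), "open data is great!"),
    (date(2015, 9, 1), "Clean your data first"),
    (date(2015, 9, 3), "clean your data first."),
    (date(2015, 9, 20), "Unrelated tweet"),
]

all_clusters = cluster_subset(rows, date.min, date.max)
faceted = cluster_subset(rows, date(2015, 8, 31), date(2015, 9, 12))
```

Here `all_clusters` holds two groups, while `faceted` holds only the group from the selected period, which mirrors how the timeline facet reduced the number of proposed clusters in step 3. Running `cluster_subset` once per period would cover the full data set one subset at a time.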