This is the RefinePro knowledge base about OpenRefine. We have built it over the years and keep adding to it. From great tutorials and how-tos to handy GREL expressions and links to external resources, you will find here one of the most comprehensive lists of resources to learn OpenRefine.

For comprehensive documentation, refer to the official OpenRefine wiki.

Don't know where to get started? Search for a specific function below, or read our most popular articles from the right side menu.

17.4.22

Hosting OpenRefine in 2022

Hosting OpenRefine has long been requested by the community. Natively, OpenRefine is designed to run on the user's local machine. Therefore, the software does not include user management, permissions, or a way to share compute resources (CPU, RAM) between users.

Hosting OpenRefine gives multiple users access to the tool. Use cases include:

  • Easier access to the tool for users with limited permissions on their computer, or during events (training sessions or hackathons, for example).
  • Working on larger datasets using a more powerful online server, although OpenRefine 4.x with Spark is addressing this issue.
  • Collaboration, with multiple users working on the same projects.
  • Hosting in a secured environment to process sensitive data (i.e., the data never reaches the user's machine).

Hosted Instance with htpasswd 

This approach relies on htpasswd or any other type of access control to the machine, such as RDP or a shared user account. In that case, OpenRefine is installed as regular software on a machine that is accessed remotely.

Back in 2014, at RefinePro, we built a service to manage users and instances of OpenRefine hosted on AWS EC2. Our platform basically took care of:

  • logging users in
  • starting their EC2 instance
  • loading their project workspace onto the EC2 instance
  • shutting down the EC2 instance once the user finished their work. The goal was to avoid incurring unnecessary hosting costs.
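The start, load, and shut-down steps above can be sketched in Python. This is a minimal illustration, not the code RefinePro ran. It assumes `ec2` behaves like a boto3 EC2 client (`boto3.client("ec2")`), and `load_workspace` is a hypothetical hook standing in for whatever copies the user's projects onto the instance. User login is left out.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def openrefine_session(
    ec2,
    instance_id: str,
    load_workspace: Callable[[str], None],
) -> Iterator[str]:
    """Start a user's instance, load their workspace, and stop it afterwards.

    `ec2` is assumed to expose the boto3 EC2 client methods used below.
    `load_workspace` is a hypothetical callback, not part of any library.
    """
    # Start the instance and block until AWS reports it as running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    try:
        load_workspace(instance_id)
        yield instance_id
    finally:
        # Always stop the instance, even on errors, to avoid idle hosting cost.
        ec2.stop_instances(InstanceIds=[instance_id])


# Usage sketch (needs boto3 and AWS credentials; the instance ID and
# copy_projects function are placeholders):
#   import boto3
#   with openrefine_session(boto3.client("ec2"), "i-0123456789abcdef0", copy_projects):
#       pass  # the user works in OpenRefine here
```

Wrapping the lifecycle in a context manager means the instance is stopped even if loading the workspace fails, which matches the cost-saving goal of the last step.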

We used htpasswd to keep the instance from being publicly accessible. At the time, we found a way to hide the extra login from our users, but the setup remained hacky and unsuitable for the long run. Therefore, we decided to stop the service in 2017.

Hosting with JupyterHub

In the last two years, we have seen an increasing number of hosted OpenRefine deployments based on Jupyter and JupyterHub (with Kubernetes). In those deployments, OpenRefine is one of the applications hosted via JupyterHub as part of a larger data science workbench. OpenRefine benefits from JupyterHub's user and environment management.
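One way to expose a web application such as OpenRefine through a Jupyter server is the jupyter-server-proxy extension. The article does not say which mechanism these deployments use, so treat the sketch below as an assumption. The helper name `openrefine_proxy_entry` and the paths in the usage comment are hypothetical. The `-i`, `-p`, and `-d` flags are the interface, port, and data directory options of OpenRefine's `refine` launcher script.

```python
def openrefine_proxy_entry(refine_path: str, data_dir: str) -> dict:
    """Build a jupyter-server-proxy server entry that launches OpenRefine."""
    return {
        # jupyter-server-proxy replaces {port} with a free port at launch.
        # 0.0.0.0 listens on all interfaces inside the user's container;
        # tighten this if your environment requires it.
        "command": [refine_path, "-i", "0.0.0.0", "-p", "{port}", "-d", data_dir],
        # Seconds to wait for OpenRefine to start answering requests.
        "timeout": 120,
        # Adds an "OpenRefine" tile to the JupyterLab launcher.
        "launcher_entry": {"title": "OpenRefine"},
    }


# In jupyter_server_config.py, where `c` is the config object Jupyter provides
# (paths are placeholders):
#   c.ServerProxy.servers = {
#       "openrefine": openrefine_proxy_entry(
#           "/opt/openrefine/refine", "/home/jovyan/openrefine"
#       ),
#   }
```

With an entry like this, each user's OpenRefine process runs inside their own Jupyter environment, so login and resource limits stay with JupyterHub.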

As of April 2022, there are several publicly advertised JupyterHub deployments, including:

RefinePro also released, in collaboration with FAIRplus, one extension and a customizable Docker configuration to help with hosted instances.

Looking ahead 

Following on the idea from Felix Lohmeier and Tony Hirst on the OpenRefine wiki, I think it would be interesting to develop an official OpenRefine package for JupyterHub. Such a package will include Docker configuration specific to JupyterHub and dedicated OpenRefine extensions (like the local file extension), along with best practices.

The creation and maintenance of the package will provide the community with an official way to host OpenRefine. It will also offer a point of contact for potential contributors to improve the package or OpenRefine itself. 

I am interested in your thoughts and potential interest in building this package. 

Introducing OpenRefine Authenticator and File Extensions

The RefinePro team is thrilled to release two new extensions for the OpenRefine ecosystem under the Apache License 2.0. The extensions were funded and developed in partnership with Novartis, with technical help from Aridhia Informatics. They are released under the FAIRplus program.


Thank you to Jiangbo Dang, Andrea Splendiani, and Rodrigo Barnes for your help. 


Feel free to reach out if you have questions. You can also open an issue in the respective repository.