17.4.22

Hosting OpenRefine in 2022

Hosting OpenRefine is a long-requested feature from the community. Natively, OpenRefine is designed to run on the user's local machine. Therefore, the software does not include user management, permissions, or a way to share compute resources (CPU, RAM) among users.

Hosting OpenRefine makes it accessible to multiple users. Use cases include:

  • Easier access to the tool for users with limited permissions on their computer, or during events (training sessions and hackathons, for example).
  • Working on larger datasets using a more powerful online server, although OpenRefine 4.x with Spark is addressing this issue.
  • Collaboration, with multiple users working on the same projects.
  • Hosting in a secured environment to process sensitive data (i.e., the data never reaches the user's machine).

Hosted Instance with htpasswd 

This approach relies on htpasswd or any other type of access control to the machine, such as RDP or a shared user account. In that case, OpenRefine is installed as regular software on a machine that is accessed remotely.

Back in 2014, at RefinePro, we built a service to manage users and OpenRefine instances hosted on AWS EC2. Our platform handled the following:

  • logging users in
  • starting their EC2 instance
  • loading their project workspace on the EC2 instance
  • shutting down the EC2 instance once the user finished their work. The goal was to avoid incurring unnecessary hosting costs.
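The shutdown step can be illustrated with a minimal sketch. This is not the original RefinePro code: the `Instance` record, the `instances_to_stop` helper, and the 30-minute idle limit are all hypothetical names and values chosen for illustration. It only shows the decision of which instances are idle long enough to be stopped; the actual call to the cloud provider is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Instance:
    """Hypothetical record tying a user to their hosted OpenRefine instance."""
    user: str
    instance_id: str
    last_activity: datetime


def instances_to_stop(
    instances: List[Instance],
    now: datetime,
    idle_limit: timedelta = timedelta(minutes=30),  # assumed threshold
) -> List[str]:
    """Return the IDs of instances idle for at least `idle_limit`."""
    return [
        inst.instance_id
        for inst in instances
        if now - inst.last_activity >= idle_limit
    ]


if __name__ == "__main__":
    now = datetime(2022, 4, 17, 12, 0)
    fleet = [
        Instance("alice", "i-aaa", now - timedelta(minutes=45)),
        Instance("bob", "i-bbb", now - timedelta(minutes=5)),
    ]
    # Only the instance idle past the limit is selected for shutdown.
    print(instances_to_stop(fleet, now))
```

A scheduler would run a check like this periodically and pass the returned IDs to whatever API stops the machines.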

We used htpasswd to keep the instances from being publicly accessible. At the time, we found a way to hide the extra login from our users, but the setup remained hacky and was not suitable for the long run. Therefore, we decided to stop the service in 2017.

Hosting with JupyterHub

In the last two years, we have seen an increasing number of hosted OpenRefine deployments based on Jupyter and JupyterHub (with Kubernetes). In those deployments, OpenRefine is one of the applications hosted via JupyterHub as part of a larger data science workbench. OpenRefine benefits from JupyterHub's user and environment management.
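As a rough illustration of how such a deployment can expose OpenRefine, the sketch below builds a server entry for jupyter-server-proxy, a common way to run web applications alongside Jupyter. This is an assumption about the setup, not a description of any specific deployment mentioned here. The helper name `openrefine_proxy_entry`, the data directory, and the timeout are hypothetical, and the `refine` command-line flags (`-i`, `-p`, `-d`) should be checked against the OpenRefine version in use.

```python
from typing import Any, Dict


def openrefine_proxy_entry(
    refine_cmd: str = "refine",                  # assumed to be on PATH
    data_dir: str = "/home/jovyan/openrefine",   # hypothetical workspace path
) -> Dict[str, Any]:
    """Build a jupyter-server-proxy server entry that launches OpenRefine."""
    return {
        # "{port}" is filled in by jupyter-server-proxy at launch time.
        "command": [refine_cmd, "-i", "127.0.0.1", "-p", "{port}", "-d", data_dir],
        # Seconds to wait for OpenRefine to start answering (assumed value).
        "timeout": 120,
        "launcher_entry": {"title": "OpenRefine"},
    }


# In jupyter_server_config.py, where Jupyter provides the `c` object:
# c.ServerProxy.servers = {"openrefine": openrefine_proxy_entry()}

if __name__ == "__main__":
    print(openrefine_proxy_entry())
```

With an entry like this, each user's OpenRefine process runs inside their own Jupyter server, so authentication and resource limits come from JupyterHub rather than from OpenRefine.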

As of April 2022, there are several publicly advertised JupyterHub deployments, including:

In collaboration with FAIRPlus, RefinePro also released an extension and a customizable Docker image to help with hosted instances.

Looking ahead 

Following on the idea from Felix Lohmeier and Tony Hirst on the OpenRefine wiki, I think it would be interesting to develop an official OpenRefine package for JupyterHub. Such a package would include Docker configuration specific to JupyterHub and dedicated OpenRefine extensions (like the local file extension), along with best practices.

Creating and maintaining the package would give the community an official way to host OpenRefine. It would also offer a point of contact for potential contributors who want to improve the package or OpenRefine itself.

I am interested in your thoughts and potential interest in building this package.