Following the survey on Google Refine usage that we distributed last week, I would first like to thank the 99 people who participated. Thanks to your answers, we now have a better understanding of who uses Google Refine, how they use it, and what the community's expectations are.
Thanks again for spreading the word and providing detailed and insightful answers. Here is a first flat analysis of the answers collected. You can access the detailed answers here (email addresses have been removed) and the survey questions here.
Q1 Communities you identify with?
We can observe that OpenRefine is used by a variety of users for a variety of purposes. Even though open data and research drew the most responses, no category collected less than 15% of the respondents, and the "other" category is strong, with nearly 20% of the people who answered.
Q6 What domain of data do you deal with?
The type of data dealt with is also heterogeneous, again with no category ranking below 20%. Once again the "other" category is strong and covers data types such as government (10 answers), heritage data (4 answers), linked data, data extraction (scraping, social network or text extraction) and metadata processing.
Q2 How are your programming (coding) skills?
In terms of coding skills, the same variety of user types appears, with people with no or limited coding skills sharing the same platform as regular developers. With an average of 3.73, the audience can be described as technology-friendly users with a basic knowledge of coding. It would be interesting to dig deeper on this point to draw a better picture of the Refine community.
Q3 How often do you use Refine?
Most users (52%) use Refine a few times a month, while 37% are more engaged and use it multiple times per week.
Q7 & Q8 Users' skill level with Refine and their perception of the skill level needed.
The largest share (43%) of respondents think that Refine is relatively easy software to use: harder than a traditional spreadsheet, but not out of reach. Overall the community is relatively confident in how to use it, as 60% of respondents give themselves a score of 3 or more for their skills with Refine.
The average score for user skills with Refine is 2.82, which is just over the midpoint score (2.5/5). Overall, the community knows how to use Refine, with a lot of room for improvement (edited at 09:10 AM on 18/10/2012).
Q9 High-level tasks you do with Refine.
This question shows that OpenRefine is used massively as a transformation (including normalization) and loading tool. We could liken those actions to the T and L of ETL (Extract, Transform and Load). However, taxonomy work and reconciliation have a non-negligible usage share, with respectively 28% and 45% of users taking advantage of the tool for these tasks. Finally, scraping (by fetching pages) and geocoding (through API calls) represent marginal usage (these uses were collected through the "other" field).
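To make the "T and L" idea concrete, here is a minimal sketch of a transform-then-load step written by hand in Python. It is not how Refine works internally; the sample rows, the `people` table and the `normalize` rule are all assumptions made up for illustration, not survey data.

```python
import csv
import io
import sqlite3

# Hypothetical messy input, invented for this example.
RAW_CSV = """name,city
  Alice ,PARIS
Bob,  paris
"""


def normalize(value: str) -> str:
    """Trim, collapse runs of whitespace, and title-case a cell."""
    return " ".join(value.split()).title()


def transform(rows):
    """The T step: normalize every cell of every row."""
    return [{key: normalize(val) for key, val in row.items()} for row in rows]


def load(rows, conn):
    """The L step: write the cleaned rows into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO people (name, city) VALUES (:name, :city)", rows)
    conn.commit()


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
conn = sqlite3.connect(":memory:")
load(transform(rows), conn)
result = conn.execute("SELECT name, city FROM people ORDER BY name").fetchall()
print(result)  # [('Alice', 'Paris'), ('Bob', 'Paris')]
```

The two inconsistent spellings of the city end up as a single value before loading, which is the kind of normalization respondents describe doing interactively in Refine.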
Q11 & Q12 Format used for import and export in Refine
Nearly all Refine users (97% for importing and 92% for exporting) use the software to work with tabular formats. This is not a real surprise, as it remains today the most widespread way to store data. However, it is interesting that close to 50% of them also use it for hierarchical formats. My guess is that Refine offers an easy-to-use interface to read and manipulate these increasingly common formats. It is worth noting that 20% of the respondents mentioned RDF as an export format (data collected from the "other" field).
Q13 & Q14 Tools used before and after Refine
The percentage of people using scripting tools goes from 54% upstream of Refine to 45% after the Refine process. The same decrease appears for spreadsheet software (from 76% to 54% after the Refine step). This observation suggests that Refine is mostly used as a one-way tool between data collection (through scraping or spreadsheets) and database loading or data visualization, and it confirms the Transform and Load usage observed in question 9. Co-usage with other ETL tools is very limited, at less than 11% of users.
Perception of Refine
This section covers the answers collected in the "Your perception of Google Refine" part of the survey and highlights the main feedback received through its set of four open questions.
Refine is mainly used instead of Excel or other spreadsheet software by 26 users, and instead of scripts (using Python or Ruby) by 17 respondents. This split between spreadsheet-oriented and script-oriented people also shows in their perception of Refine. The first group describes Refine as a "powerful data manipulation tool with a fairly short learning curve for basic functionality", as "like Excel's filter but with more depth" or as a "Swiss knife of data wrangling", while the second group compares it to SQL or other ETL tools: "Power of SQL for computation with a nice graphical interface"; "ETL for small data sets proof of concept".
In any case, Refine is identified as a tool for data:
- curation,
- cleaning,
- management when the data is messy and inconsistent,
- investigation,
- transformation,
- normalization,
- reconciliation,
- extension.
As an ETL tool for the web, expectations from the community are oriented toward data collection, with:
- authentication support for online platforms, to access data from more sources,
- direct connection to databases for import and export (MySQL, SQLite), and
- the ability to run automatically on a dataset by generating executable code, so Refine can be part of a workflow.
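The last expectation above, replaying a cleaning recipe without the interface, could look something like the sketch below. This is only an illustration of the idea: the `OPERATIONS` registry, the `RECIPE` step format and the `run` function are hypothetical names invented here, and they are not Refine's own operation format or API.

```python
import csv
import io

# Hypothetical registry of named operations; these names are not Refine's.
OPERATIONS = {
    "trim": str.strip,
    "lowercase": str.lower,
    "collapse_spaces": lambda text: " ".join(text.split()),
}

# A recorded recipe: each step names a column and the operation to apply to it.
RECIPE = [
    {"column": "email", "op": "trim"},
    {"column": "email", "op": "lowercase"},
    {"column": "name", "op": "collapse_spaces"},
]


def apply_recipe(rows, recipe):
    """Apply every step of the recipe, in order, to copies of the rows."""
    cleaned = [dict(row) for row in rows]
    for step in recipe:
        operation = OPERATIONS[step["op"]]
        for row in cleaned:
            row[step["column"]] = operation(row[step["column"]])
    return cleaned


def run(csv_text, recipe):
    """Read CSV text, replay the recipe, and return the cleaned CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = apply_recipe(reader, recipe)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(cleaned)
    return out.getvalue()


output = run("name,email\nAda   Lovelace, ADA@Example.org \n", RECIPE)
print(output)
```

Because the recipe is plain data, the same steps could be stored next to a dataset and rerun by a scheduler or a larger pipeline, which is the workflow use case respondents asked for.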
The functionality enhancements expected by the community concern Refine's core and unique capabilities, clustering and reconciliation: the capacity to cluster on more than one column at a time, and the ability to reconcile Refine projects against each other. Finally, R, D3 and other visualization interfaces were mentioned, with requests to add visualization capability to Refine either by using their libraries or by smoothing the export process.
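As an illustration of what clustering on more than one column at a time could mean, here is a rough sketch using a simplified key-collision "fingerprint" (lowercase, strip accents and punctuation, sort the unique words). It is written in the spirit of Refine's fingerprint keying but is not its actual implementation; the `cluster` function and the sample rows are assumptions invented for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key: ignores case, accents, punctuation and word order."""
    value = unicodedata.normalize("NFKD", value.strip().lower())
    value = value.encode("ascii", "ignore").decode("ascii")  # drop accents
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))


def cluster(rows, columns):
    """Group row indexes whose fingerprints match on ALL the given columns."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(fingerprint(row[col]) for col in columns)
        groups[key].append(index)
    # Only groups with more than one row are candidate clusters.
    return [indexes for indexes in groups.values() if len(indexes) > 1]


# Hypothetical rows, invented for illustration.
rows = [
    {"name": "Café Müller", "city": "Paris"},
    {"name": "muller cafe", "city": "PARIS "},
    {"name": "Cafe Muller", "city": "Lyon"},
    {"name": "Le Zinc", "city": "Lyon"},
]
clusters = cluster(rows, ["name", "city"])
print(clusters)  # [[0, 1]]
```

Clustering on the name alone would also pull in the Lyon row; requiring the city to match as well keeps it separate, which is the benefit of keying on several columns together.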