Parsing markup languages (JSON, HTML, XML, ...)

In this tutorial we will see how to parse markup languages such as JSON, HTML or XML. These languages are convenient to parse because there is often an easily identifiable markup token right before or after the content you want to extract. In this tutorial we will take a JSON response and extract the relevant information by following a six-step process.


1. Understand the code to parse.
Every document is different and has its own specific markup system. HTML and XML often embed markup between < and > characters, but variations exist. It is best to first take some time to analyze the code you want to parse. The process presented here will apply to any markup language, as long as you are able to identify the markup specific to the data you want to extract.

As an example, let's take a result from the Google Places API, which is in JSON. In this case we query for all locations within 50 meters of a specific coordinate. The Google Places API can return more than one location per coordinate. Here is the full answer from the Google Places API:
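The full listing is not reproduced here, but a minimal sketch of its shape (with made-up values, trimmed to a single result) can be parsed with Python's json module; the line numbers discussed below refer to the original, full response:

```python
import json

# Hypothetical, heavily trimmed sketch of a Places-style JSON response;
# the field names mirror those discussed in this tutorial, the values are made up.
response = """
{
  "results": [
    {
      "icon": "http://example.com/icon.png",
      "id": "abc123",
      "name": "Some Cafe",
      "reference": "ref-xyz",
      "types": ["cafe", "food"],
      "vicinity": "1 Example Street, London",
      "geometry": {"location": {"lat": 51.4634670, "lng": 0.0062710}}
    }
  ],
  "status": "OK"
}
"""

data = json.loads(response)
first = data["results"][0]
print(first["name"], first["vicinity"])
```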

Let's have a closer look at this. Lines 1 to 10 contain the initial query information (we queried for latitude 51.4634670 and longitude 0.0062710).

Then lines 11 to 24 contain the first result sent back by the Google Places API. JSON is fairly easy to read, and we can guess that:

  • line 11 contains a link to an icon
  • line 12 contains the location id
  • line 13 contains the location name
  • line 14 contains a reference
  • line 15 contains the location types defined by Google
  • line 16 contains the vicinity, i.e. the address of the location
  • lines 17 to 24 contain the exact coordinates of this location (which can differ from the coordinates we queried for)
Lines 25 to 39 repeat the same pattern, and this keeps going until line 185. The final two lines indicate the end of the message.

2. Define a parsing strategy.
So for every coordinate, the Google Places API returns code embedding several locations. We want to extract the location details, with one column each for the location name, the location types and the address.
My strategy for parsing this code is to:

  • First, isolate every record in a different column. To do so, we need to find a specific markup identifying the beginning of a new result. In this case, each record starts with the markup "icon" :
  • Second, regroup all the newly created columns into one column, so we can clean them in one shot.
  • Third, clean up and extract the information.
3. Isolate each record.
To do this, use the split function based on the markup you identified previously. Call the function via Edit column > Split into several columns... and the following window will show up.

Key in the separator (in this case "icon" :) and indicate how many columns you want to split your content into. As we do not know how many records each cell contains, we will leave this option blank (leave blank for no limit). Google Refine will then create a new column every time the "icon" : markup appears.

Our example splits into 14 different columns, each containing a piece of the JSON code. Thirteen columns contain location information, and one (the first) contains the initial query information (lines 1 to 10); we can remove this first column as it contains no valuable information.
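Outside of Google Refine, the same split can be sketched in a few lines of Python (the raw text below is a made-up, heavily trimmed stand-in for the real response):

```python
# Hypothetical raw text containing a query header and two records,
# each record introduced by the "icon" : markup.
raw = '{ "query" : "..." } "icon" : "a.png", "name" : "Cafe A" "icon" : "b.png", "name" : "Cafe B"'

# Splitting on the markup yields one chunk per record,
# plus a leading chunk holding the query information (to be discarded).
chunks = raw.split('"icon" :')
query_info, records = chunks[0], chunks[1:]
print(len(records))  # one entry per location
```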

4. Transpose your results.
Now every record is in a different column, which is not very convenient for mass editing or even for reading. We will use the transpose function of Google Refine. The transpose function acts like Paste Special > Transpose in Excel, i.e. it takes the last column and moves it below the previous one. We need to do this for all 13 columns to get them into the same column.

Column by column, transpose the content by calling the transpose window and indicating which column you want to transpose under which other column. In this case we take the last column and move it under the previous one. We do not want to prepend the column name, and we will ignore blank cells (since some Google Places API results contain more locations than others, we do not want to transpose empty cells).

You can automate this process by adapting and applying the JSON code presented here.
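For illustration, here is roughly what the repeated transpose achieves, sketched in Python with hypothetical column contents:

```python
# Hypothetical columns after the split in step 3: one list per record column.
col1 = ['"name" : "Cafe A"', '"name" : "Cafe C"']
col2 = ['"name" : "Cafe B"', None]  # None marks a blank cell (this row had fewer locations)

# Move the last column below the previous one, skipping blank cells,
# as Refine's transpose does when blank cells are ignored.
stacked = col1 + [cell for cell in col2 if cell is not None]
print(stacked)
```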

5. Clean up your records.
Now that we have all our records under one another, cleaning them will be a piece of cake.
Repeat the process described in step 3, this time to isolate every piece of information. To do so, we need to find a specific markup identifying the beginning of each new column. In this case, we will use the different markups we identified previously: "name" :, "types" : and "vicinity" :

Then you can either use the replace function to remove specific characters, or keep going with the split function to isolate the useless content (put all the useless content into a specific column and remove that column).
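The same clean-up can be sketched with a regular expression that captures the three fields from a record chunk (the record text below is hypothetical):

```python
import re

# Hypothetical record chunk left over after the transpose in step 4.
record = '"name" : "Some Cafe", "types" : [ "cafe", "food" ], "vicinity" : "1 Example Street, London",'

# One capture group per field we want to keep as a column;
# everything between the markups is discarded, as in the split-and-remove approach.
pattern = r'"name" : "(.*?)".*?"types" : \[ (.*?) \].*?"vicinity" : "(.*?)"'
m = re.search(pattern, record)
name, types, vicinity = m.groups()
print(name, types, vicinity)
```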

6. Final wrap-up
Update your column names and fill down the other information to obtain a consistent data set.
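The fill-down step can be sketched in plain Python as well (hypothetical column values):

```python
# Hypothetical column where only the first row of each query carries a value.
column = ["query-1", None, None, "query-2", None]

# Fill down: copy the last non-blank value into every blank cell below it,
# as Refine's "Fill down" does on a column.
filled, last = [], None
for cell in column:
    last = cell if cell is not None else last
    filled.append(last)
print(filled)  # ['query-1', 'query-1', 'query-1', 'query-2', 'query-2']
```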