This case was brought to me by cosmin, who wanted to extract hashtags from tweets for some analysis and data visualization. The data was gathered using ScraperWiki, which can scrape Twitter data into one single document (see the video tutorial).
I faced two main issues doing this cleaning:
- you can have more than one hashtag in a single tweet
- a hashtag can be anywhere in the tweet (not only at the beginning or the end), which limits the value of the split function.
Step by step
- Remove all extra spaces (leading, trailing and consecutive)
- Split the text into several columns, using a space as the separator and no maximum number of columns, so that every word of the tweet ends up in its own column
- Transpose cells across columns into rows, so that all the words end up in a single column
- If you are using Google Refine 2.5, please refer to this tutorial for more details
- If you are using a previous version, you can use this JSON code to transpose a large number of columns
- Clean your data (cluster, facet to remove non-significant words ...). See the tutorial section for help on this part. This step is not included in the JSON code below.
- Apply a text filter on # and create a new column called hashtag
- Repeat the action with a text filter on @ and create a new column called mention
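For readers who want to check the logic outside Google Refine, the steps above can be approximated in a few lines of Python. This is only a sketch of the same idea, not the Refine workflow itself: the function names `words_of` and `extract` and the sample tweets are made up for illustration, and the manual cleaning step (clustering, faceting) is left out.

```python
def words_of(tweet):
    # str.split() with no argument drops leading and trailing whitespace
    # and treats consecutive whitespace as one separator, which covers
    # both the "remove extra spaces" and the "split on space" steps.
    return tweet.split()


def extract(tweets):
    # Walk through every word of every tweet (the equivalent of having
    # one word per row after the transpose) and keep the words that
    # contain "#" or "@", like the two text filters do.
    hashtags, mentions = [], []
    for tweet in tweets:
        for word in words_of(tweet):
            if "#" in word:
                hashtags.append(word)
            if "@" in word:
                mentions.append(word)
    return hashtags, mentions


# Hypothetical sample data, not taken from the ScraperWiki export.
tweets = [
    "  Loving   #opendata with @ScraperWiki ",
    "#dataviz tips from @cosmin #refine",
]

hashtags, mentions = extract(tweets)
print(hashtags)  # ['#opendata', '#dataviz', '#refine']
print(mentions)  # ['@ScraperWiki', '@cosmin']
```

Because the check is a plain "contains" test, as with a text filter, punctuation attached to a word (for example a trailing comma) stays in the result, so some cleaning is still needed afterwards.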
JSON code for this case:
This JSON was designed for the ScraperWiki format. If you want to reuse this code, keep in mind that:
- tweets must be in a field / column named text, and
- this text column must be the last column of your project.
- everything that is not a mention or a hashtag is removed by this script. To retrieve all the data, go back to step 35 in the history.