Skip to Main Content

Research Data: Data Cleaning with OpenRefine

This page is used in conjunction with training sessions on "Data Cleaning with OpenRefine"

OpenRefine Exerises

Exercise 1: Import Data

  • Download and save the file SAFI_openrefine.csv
  • In OpenRefine click Create Project and select Get data from This Computer.
  • Click Browse and select the file SAFI_openrefine.csv
  • Click Next>>
  • OpenRefine gives you a preview and some options before you hit Create Project
  • Have a look at the data set
 

Exercise 2 : Creating Facets

  • Scroll over to the village column.
  • Click the down arrow and choose Facet > Text facet.
  • Try sorting this facet by name and by count in the left panel. Do you notice any problems with the data?
  • Hover the mouse over one of the names in the Facet list. Find the edit functionality available.
  • You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. We will look at other more powerful correction options later, so leave the errors for now.

 

Exercise 3: Using Facets

  • Using faceting, find out how many different interview_date values there are in the survey results.
  • Is the column formatted as Text or Date?
  • Use faceting to produce a timeline display for interview_date. You will need to use Edit cells > Common transforms > To date to convert this column to dates. Then use Facet >Timeline Facet.
  • During what period were most of the interviews collected?

 

Exercise 4: Clustering

  • In the village Text Facet we created in the step above, click the Cluster button.
  • Try to change the Method and the Keying Function to see what different mergers of values are suggested.
  • Select the key collision method and metaphone3 keying function. It should identify two clusters.
  • Click the Merge? box beside each cluster, then click Merge Selected and Recluster to apply the corrections to the dataset.
  • No more clusters are found, for example to merge Ruaca-Nhamuenda with Ruaca or Chirdozo with Chirodzo.
  • To merge these values we will hover over them in the village text facet, select edit, and manually change the names. Change Chirdozo to Chirodzo and Ruaca-Nhamuenda to Ruaca. You should now have four clusters: Chirodzo, God, Ruaca and 49

 

Exercise 5: Transforming data

  • Cleaning data in the items_owned column
  1. Click the down arrow at the top of the items_owned column. Choose Edit Cells > Transform...
  2. This will open up a window into which you can type a GREL expression. GREL stands for General Refine Expression Language.

  3. Remove all of the left square brackets ([). In the Expression box type value.replace("[", "") and click OK. (What the expression means is this: Take the value in each cell in the selected column and replace all of the “[” with “” (i.e. nothing - delete).

  4. Click OK. You should see in the items_owned column that there are no longer any left square brackets.
  • Use this same strategy to remove the single quote marks ('), the right square brackets (]), and spaces from the items_owned column.
  • Create a text facet to see which items were commonly owned or rarely owned by the interview respondents.

    1. Click the down arrow at the top of the items_owned column. Choose Facet > Custom text facet...
    2. In the Expression box, type value.split(";").
    3. Click OK.
  • Which two items are the most commonly owned? Which are the two least commonly owned?
  • Perform the same cleanup steps and customized text faceting for the months_lack_food column. Which month(s) were farmers more likely to lack food?

Hint: To reuse a GREL command, click the History tab and then click Reuse next to the command you would like to apply to that column.

Optional: If you are on a roll, clean up  months_no_water, liv_owned and no_food_mitigation columns. Post your solution in the chat, we will see who figures out first which months is dryest, which livestock is most commonly owned and what most people do if they run out of food. 

Full documentation for GREL is available here.

 

Exercise 6: Undo, redo and trimming whitespace

  • Trying undo and redo
    1. Click where it says Undo / Redo on the left side of the screen. All the changes you have made so far are listed here.
    2. Click on the step that you want to go back to, in this case go back several steps to before you had done any text transformation.
    3. Visually confirm that those columns now contain the special characters that we had removed previously.
    4. Notice that you can still click on the later steps to Redo the actions. Before moving on to the next lesson, redo all the steps in your analysis so that all of the column you modified are lacking in square brackets, spaces, and single quotes.
  • Trimming whitespace
    1. Edit the village on the first row to introduce a space at the end, set to God .
    2. Create a new text facet for the village column . You should now see two different entries for God, one of those has a trailing whitespace.
    3. To remove the whitespace, choose Edit cells > Common transforms > Trim leading and trailing whitespace.
    4. You should now see only four choices in your text facet again.

 

Exercise 7: Filtering and Sorting

  • Filter to work on a subset of the data
    1. Click the down arrow next to respondent_roof_type > Text filter. A respondent_roof_type facet will appear on the left margin.
    2. Type in mabat and press return. There are 58 matching rows of the original 131 rows (and these rows are selected for the subsequent steps).
    3. At the top, change the view to Show 50 rows. This way you will see most of the matching rows.
  • What roof types are selected by this procedure? How would you restrict this to only one of the roof types?
  • Sort data by a column by using the drop-down menu in that column. There you can sort by text, numbers, dates or booleans (TRUE or FALSE values). You can also specify what order to put Blanks and Errors in the sorted results.
  • Sort the data by gps_Altitude. Do you think the first few entries may have incorrect altitudes?
  • Earlier we saw the value for one of the village entries was given as 49. This is clearly wrong. By looking at the GPS coordinates for the entries of the other villages can we decide what village the data in that column was collected from?
  1. Sort on gps_Latitude as a number with the smallest first.
  2. Add a sort on gps_Longitude as a number with the smallest first.
  3. Using the drop down arrow on the village column, select Edit column > Move column to end. This will allow you to compare village names with GPS coordinates.
  4. Scroll through the entries until you find village 49. Can you tell from it’s GPS coordinates which village it belong to?
  5. Now sort only by interview_date as date. Move the village column to the start of the table. Does the row where village is 49 group with one particular village? Is it the same village as when comparing GPS coordinates?
  • Perform a text facet on the village column and change 49 to the correct village name.

 

Exercise 8: Examining Numbers

Transform cells to numbers:

  • Remove any Text filter facets you have enabled. You can remove an existing facet by clicking the x in the upper left of that facet window
  • Transform cells in the years_farm column to numbers by clicking the down arrow for that column, then Edit cells > Common transforms… > To number. You will notice the years_farm values change from left-justified to right-justified, and black to green in color.
  • Transform three more columns, no_members, yrs_liv, and buildings_in_compound, from text to numbers. Can all columns be transformed to numbers? - Try it with village for example.

Numeric Facets:

  • Numeric Facets can help to find errors
  • For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as abc) or blank (no number or text). You will need to change the Data type to text using the drop-down menu.
  • Use the column pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
  • Notice that there are several checkboxes in this facet: Numeric, Non-numeric, Blank, and Error. Below these are counts of the number of cells in each category. You should see checks for Non-numeric and Blank if you changed some values.
  • Experiment with checking or unchecking these boxes to select subsets of your data.

 

Exercise 9: Using Scripts and Exporting Data

  • Saving your work as script:
  1. In the Undo / Redo section, click Extract..., and select the steps that you want to apply to other datasets by clicking the check boxes.
  2. Copy the code from the right-hand panel and paste it into a text editor (like NotePad on Windows or TextEdit on Mac). Make sure it saves as a plain text file. In TextEdit, do this by selecting Format > Make plain text and save the file as a .txt file.
  • Let’s practice running these steps on a new dataset. We’ll test this on an uncleaned version of the dataset we’ve been working with.
  1. Start a new project in OpenRefine using the messy dataset you downloaded before. Give the project a new name.
  2. Click the Undo / Redo tab > Apply and paste in the contents of .txt file with the JSON code.
  3. Click Perform operations. The dataset should now be the same as your other cleaned dataset.
  • Exporting the whole OpenRefine Project:
  1. Click the Export button in the top right and select OpenRefine project archive to file.
  2. A tar.gz file will download to your default Download directory. Depending on your browser you may have to confirm that you want to save the file. The tar.gz extension tells you that this is a compressed file. The downloaded tar.gz file is actually a folder of files that have been compressed. Linux and Mac machines will have the software installed to automatically expand this type of file when you double-click on it. Windows-based machines might need a utility like ‘7-zip’ in order to expand the file and see the files in the folder ( but it is not necessary to do this, OpenRefine will be able to handle this data regardless).
  3. Looking at the files that appear in this folder (if you can): What files are here? What information do you think these files contain?
  • Exporting Cleaned Data
  1. Click Export in the top right and select the file type you want to export the data in. Tab-separated values (tsv) or Comma-separated values (csv) would be good choices.
  2. That file will be exported to your default Download directory. That file can then be opened in a spreadsheet program or imported into programs like R or Python.

 

These exercises are based on Data Carpentry lessons.