This page is used in conjunction with training sessions on "Data Cleaning with OpenRefine"
items_owned
columnitems_owned
column. Choose Edit Cells
> Transform...
This will open up a window into which you can type a GREL expression. GREL stands for General Refine Expression Language.
Remove all of the left square brackets ([
). In the Expression box type value.replace("[", "")
and click OK
. (What the expression means is this: Take the value
in each cell in the selected column and replace all of the “[” with “” (i.e. nothing - delete).
OK
. You should see in the items_owned
column that there are no longer any left square brackets.'
), the right square brackets (]
), and spaces from the items_owned
column.Create a text facet to see which items were commonly owned or rarely owned by the interview respondents.
items_owned
column. Choose Facet
> Custom text facet...
Expression
box, type value.split(";")
.OK
.months_lack_food
column. Which month(s) were farmers more likely to lack food?Hint: To reuse a GREL command, click the History
tab and then click Reuse
next to the command you would like to apply to that column.
Optional: If you are on a roll, clean up months_no_water
, liv_owned
and no_food_mitigation
columns. Post your solution in the chat, we will see who figures out first which months is dryest, which livestock is most commonly owned and what most people do if they run out of food.
Full documentation for GREL is available here.
Undo / Redo
on the left side of the screen. All the changes you have made so far are listed here.Redo
the actions. Before moving on to the next lesson, redo all the steps in your analysis so that all of the column you modified are lacking in square brackets, spaces, and single quotes.village
on the first row to introduce a space at the end, set to God
.village
column . You should now see two different entries for God
, one of those has a trailing whitespace.Edit cells
> Common transforms
> Trim leading and trailing whitespace
.
respondent_roof_type
> Text filter
. A respondent_roof_type
facet will appear on the left margin.mabat
and press return. There are 58 matching rows of the original 131 rows (and these rows are selected for the subsequent steps).Show
50 rows
. This way you will see most of the matching rows.text
, numbers
, dates
or booleans
(TRUE
or FALSE
values). You can also specify what order to put Blanks
and Errors
in the sorted results.gps_Altitude
. Do you think the first few entries may have incorrect altitudes?village
entries was given as 49. This is clearly wrong. By looking at the GPS coordinates for the entries of the other villages can we decide what village the data in that column was collected from?gps_Latitude
as a number with the smallest first.gps_Longitude
as a number with the smallest first.village
column, select Edit column
> Move column to end
. This will allow you to compare village names with GPS coordinates.49
. Can you tell from it’s GPS coordinates which village it belong to?interview_date
as date. Move the village
column to the start of the table. Does the row where village is 49
group with one particular village? Is it the same village as when comparing GPS coordinates?village
column and change 49
to the correct village name.
Transform cells to numbers:
Text filter
facets you have enabled. You can remove an existing facet by clicking the x
in the upper left of that facet windowyears_farm
column to numbers by clicking the down arrow for that column, then Edit cells
> Common transforms…
> To number
. You will notice the years_farm
values change from left-justified to right-justified, and black to green in color.no_members
, yrs_liv
, and buildings_in_compound
, from text to numbers. Can all columns be transformed to numbers? - Try it with village
for example.Numeric Facets:
abc
) or blank (no number or text). You will need to change the Data type
to text
using the drop-down menu.Numeric
, Non-numeric
, Blank
, and Error
. Below these are counts of the number of cells in each category. You should see checks for Non-numeric
and Blank
if you changed some values.
Undo / Redo
section, click Extract...
, and select the steps that you want to apply to other datasets by clicking the check boxes.Format
> Make plain text
and save the file as a .txt
file.Undo / Redo
tab > Apply
and paste in the contents of .txt
file with the JSON code.Perform operations
. The dataset should now be the same as your other cleaned dataset.Export
button in the top right and select OpenRefine project archive to file
.tar.gz
file will download to your default Download
directory. Depending on your browser you may have to confirm that you want to save the file. The tar.gz
extension tells you that this is a compressed file. The downloaded tar.gz
file is actually a folder of files that have been compressed. Linux and Mac machines will have the software installed to automatically expand this type of file when you double-click on it. Windows-based machines might need a utility like ‘7-zip’ in order to expand the file and see the files in the folder ( but it is not necessary to do this, OpenRefine will be able to handle this data regardless).Export
in the top right and select the file type you want to export the data in. Tab-separated values
(tsv
) or Comma-separated values
(csv
) would be good choices.Download
directory. That file can then be opened in a spreadsheet program or imported into programs like R or Python.
These exercises are based on Data Carpentry lessons.