Research Data: Regular Expressions

Introduction

This page accompanies training sessions on regular expressions. The material is based on a Library Carpentry lesson.

Exercise 1

Build teams of four.

When you think you have the right answer, check it against the solution.

When you finish, split your team into two groups and write each other some tests.

These should include a) strings for which the other group has to write a regex and b) regular expressions for which the other group has to work out what they would match.

Then test each other on the answers.

To visualise your regex you can use: https://regexper.com/

To test regex on data use: https://regex101.com/, http://myregexp.com/ or https://www.regexpal.com/
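
The tests the two groups write for each other can also be checked with a short script. The following is a minimal sketch, assuming Python and its standard re module; the helper name check_regex and the sample pattern gr[ae]y are made up for illustration and are not part of the lesson.

```python
import re


def check_regex(pattern, should_match, should_not_match, flags=0):
    """Return the test strings on which the pattern behaves unexpectedly."""
    regex = re.compile(pattern, flags)
    failures = []
    for text in should_match:
        # search() looks for the pattern anywhere in the string
        if regex.search(text) is None:
            failures.append(("expected a match", text))
    for text in should_not_match:
        if regex.search(text) is not None:
            failures.append(("expected no match", text))
    return failures


# Hypothetical test written by one group for the other.
failures = check_regex(
    r"gr[ae]y",
    should_match=["grey", "gray", "greyhound"],
    should_not_match=["groy", "gr y"],
)
print(failures or "all tests passed")
```

An empty list means the pattern behaved as expected on every test string. Passing flags=re.IGNORECASE makes the check case insensitive.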

1 What will the regular expression Fr[ea]nc[eh] match?

2 What will the regular expression Fr[ea]nc[eh]$ match?

3 What regular expression would match the strings French and France only when they appear at the beginning of a line?

4 How do you match the whole words colour and color (case insensitive)?

5 How would you find the whole word headrest or head rest but not head  rest (that is, with two spaces between head and rest)?

6 How would you find a string that ends with four letters preceded by at least one zero?

7 How do you match any four-digit string anywhere?

8 How would you match the date format dd-MM-yyyy?

9 How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a line only?

10 How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?

 

Exercise 2: Using Regex101.com

Go to https://regex101.com .

Open the swcCoC.txt file (with Notepad), copy its contents, and paste them into the test string box.

For a quick test to see if it’s working, type the string community into the regular expression box. The expression matches seven instances of the string ‘community’ (the instances are also highlighted within the text).
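
The same count can be reproduced outside the browser. This is a minimal sketch, assuming Python's standard re module; the sample text is invented for illustration and is not the content of swcCoC.txt, so its count differs from the one reported by regex101.

```python
import re

# Invented sample text; paste in the contents of swcCoC.txt to compare
# against the count shown by regex101.
sample = "The community meets weekly. Each community sets its own rules."

# finditer yields one match object per non-overlapping match, with its
# position in the text (comparable to the highlighting in regex101).
for match in re.finditer(r"community", sample):
    print(match.start(), match.end(), match.group())

# findall returns the matched strings, so its length is the match count.
count = len(re.findall(r"community", sample))
print(count)
```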

1 Type community followed by a space. You get three matches. Why not seven?

2 If you want to match ‘community-led’ by adding another regex character to the expression community, what would it be?

3 Change the expression to communi and you get more matches, spread across several different words. Why?

4 Type the expression [Cc]ommuni. You get more matches. Why?

5 Type the expression ^[Cc]ommuni. You get no matches. Why?

6 Find all of the words starting with Comm or comm that are plural.

Finding email addresses:

7 What character do all email addresses have in common?

8 The string before the “@” could contain any kind of word character, special character or digit in any combination and length. How would you express this in regex? Hint: often addresses will have a dash (-) or dot (.) in them, and neither of these is included in the word character expression (\w). How do you capture this in the expression?

9 The string after the “@” could contain any kind of word character, special character or digit in any combination and length, as well as the dash. In addition, we know that it will end with two or three characters after a period (.). What expression would capture this? Hint: the . is also a regex metacharacter, so you’ll have to escape it with \ to express a literal period.
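
The two hints can be checked directly. The following is a minimal sketch, assuming Python's standard re module; the helper fully_matches and the sample strings are invented for illustration. It shows only the building blocks mentioned in the hints, not a complete email pattern.

```python
import re


def fully_matches(pattern, text):
    """Return True if the whole of text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# \w covers letters, digits and the underscore, but not the dash or the dot.
print(fully_matches(r"\w+", "jane_doe99"))
print(fully_matches(r"\w+", "jane.doe"))
print(fully_matches(r"\w+", "jane-doe"))

# A character class can list the dot and the dash alongside \w.
# Inside [...] the dot is literal, and a dash placed last is literal too.
print(fully_matches(r"[\w.-]+", "jane.doe-smith"))

# An unescaped dot matches any character (except a newline by default);
# an escaped dot matches only a literal period.
print(fully_matches(r"example.org", "exampleXorg"))
print(fully_matches(r"example\.org", "exampleXorg"))
print(fully_matches(r"example\.org", "example.org"))
```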

Exercise 3: Regex with Google Sheets

1 Export and unzip the 2017 Public Library Survey (originally from the IMLS data site) as a CSV file.

2 Upload the CSV file to Google Sheets and open as a Google Sheet if it does not do this by default.

3 Look in the ADDRESS column and notice that the values contain the latitude and longitude in parentheses after the library address.

4 Construct a regular expression to match and extract the latitude and longitude into a new column named ‘latlong’. HINT: Look up the function REGEXEXTRACT in Google Sheets. That function expects its first argument to be a string (a cell in the ADDRESS column) and its second to be a quoted regular expression.
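
A candidate pattern can be tried out in a script before it goes into the spreadsheet. The following is a minimal sketch, assuming Python's standard re module rather than Google Sheets. The sample address, the pattern and the helper extract_latlong are illustrative assumptions: the real ADDRESS values may be formatted differently, and this is one possible pattern, not the lesson's solution.

```python
import re

# Hypothetical ADDRESS value; the real survey data may be formatted differently.
address = "123 Main St, Anytown, AK 99501 (61.2181, -149.9003)"

# One possible pattern: an optional minus sign, digits, a decimal point and
# more digits, twice, separated by a comma, all inside literal parentheses.
LATLONG = re.compile(r"\((-?\d+\.\d+),\s*(-?\d+\.\d+)\)")


def extract_latlong(text):
    """Return (latitude, longitude) as floats, or None if nothing matches."""
    match = LATLONG.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(extract_latlong(address))
```

To get a single combined value instead, as the ‘latlong’ column asks, match.group(0) returns the whole matched text including the parentheses.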