Regular expressions instructions

Regular expressions (regex) are a very useful means of working with serial data that repeats similar information in similar formats.

Regexr is a great place to learn about and try out regular expressions.

Using regex in Oxygen

You can use regex in the Find/Replace in Files tool once you enable its Regular Expression option. Some examples of commonly used regex patterns:

  • Search \d+ \w+ \w+ for patterns like “100 tons cotton”
  • Search \W\w+ \W\w+ for personal names
  • Search at [A-Z]\w+ for locations (Remember to enable Case Sensitive)
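To see what these patterns pick out before running them on real files, here is a minimal Python sketch. The sample sentence is invented for illustration, and Python's re module stands in for Oxygen's search box, so small differences in behavior are possible.

```python
import re

# Hypothetical sample line, invented for illustration only.
sample = "The steamer arrived at Alexandria with 100 tons cotton"

# \d+ \w+ \w+  : a number followed by two words
quantity = re.search(r"\d+ \w+ \w+", sample)

# at [A-Z]\w+  : the word "at" followed by a capitalized word.
# Python regex is case sensitive by default, which corresponds to
# enabling the Case Sensitive option in Oxygen.
place = re.search(r"at [A-Z]\w+", sample)

print(quantity.group(0))  # 100 tons cotton
print(place.group(0))     # at Alexandria
```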

Using regex in Atom

To put <persName> around passenger names in a list: find Mr. [A-Z][a-z, 0-9]+, replace with <persName>$&</persName>.
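The same find-and-replace can be sketched in Python. This is only an illustration on an invented two-line passenger list. In Python the whole match is written \g<0> rather than $&. Note that the character class in this pattern also accepts commas and spaces, so on a list with several names per line a match can run past the surname.

```python
import re

# Hypothetical passenger list, one name per line.
names = "Mr. Smith\nMr. Jones\n"

# Same pattern as above; \g<0> is Python's equivalent of $& (the whole match).
tagged = re.sub(r"Mr. [A-Z][a-z, 0-9]+", r"<persName>\g<0></persName>", names)

print(tagged)
```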

Cleaning OCR

The spotty quality of the microfilm original produces OCR errors of various kinds. One of these is easy to fix using regex: non-Latin characters (such as Arabic and Cyrillic letters) appearing where the original has none. This is especially annoying when the OCR conjures bidirectional (right-to-left) characters, which cause Oxygen to give you an error message.

This regex seems to find most such OCR errors in Unicode: [\u0370-\u1fff]+|[\u2070-\u214F]+|[\u215F-\u25ff]+|[\u2700-\uffff]+.
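As a rough sketch of how this pattern behaves, the Python below strips stray characters from an invented sample string. The stray letters are written as escapes (\u0634 is an Arabic letter, \u0416 a Cyrillic one) so that the example itself contains no right-to-left text. Be aware that the ranges also cover Greek and many other scripts, so this assumes the text really should contain none of them.

```python
import re

# The pattern from the text, compiled once for reuse.
STRAY = re.compile(
    r"[\u0370-\u1fff]+|[\u2070-\u214F]+|[\u215F-\u25ff]+|[\u2700-\uffff]+"
)

# Hypothetical OCR output with one Arabic and one Cyrillic letter in it.
line = "Alexandria \u0634\u0416 cotton"

found = STRAY.findall(line)     # the stray runs, for review before deleting
cleaned = STRAY.sub("", line)   # the line with the stray runs removed

print(len(found))
print(cleaned)
```

Listing the matches first, as findall does here, is a cautious way to check that nothing legitimate is being deleted.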

Cleaning XPath results

  1. Select all, copy, and paste results into your plain text editor.

  2. First, let’s remove the lines that start with “XPath location,” “Start location,” and “End location,” because we won’t need these results. Open find and replace. Click the Regex option, then use this regex to find the first of these results: XPath location: .+\n. Note: if you’re using Windows, you may have to replace the \n (new line indicator) at the end of this string with \r\n, which is how Windows sometimes indicates new lines.

Once you’ve selected all of these “XPath location” lines, replace them with nothing (i.e., leave the replace box empty). Click Replace All.

Now you can do the same for Start location: .+\n and End location: .+\n

  3. Remove the file location that precedes the issue date. Find System ID: /Users/whanley/GitHub/DEG-content/ (this will be different on your computer; just select everything that comes before the date filename). You may need to turn off Regex in order to find this literal string. Leave the replace box empty, then click Replace All.

  4. Now replace what comes between the date and the results. Turn Regex back on, then find .xml\nDescription: (or .xml\r\nDescription: if you are using Windows) and replace with \t. Click Replace All.

  5. If you have empty lines in your file, you can remove them by finding \n\n (or \r\n\r\n) and replacing it with \n.

  6. You might have a bit of garbage left over at the beginning and end of the file. Delete this. You will now have a tab-separated, two-column table that you can paste into a spreadsheet.
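The steps above can also be scripted. The following Python is a minimal sketch, not a tested tool: the sample text, the date filenames, and the order of the lines in each record are assumptions made for illustration, and clean_xpath_results is a hypothetical helper name. It escapes the dot in .xml and collapses runs of blank lines in one pass, which differs slightly from the manual steps.

```python
import re

def clean_xpath_results(text, prefix):
    """Turn pasted XPath results into tab-separated 'date<TAB>result' lines."""
    # Normalize Windows line endings so one set of patterns is enough.
    text = text.replace("\r\n", "\n")
    # Drop the three kinds of location lines.
    for label in ("XPath location", "Start location", "End location"):
        text = re.sub(label + r": .+\n", "", text)
    # Remove the file location that precedes the date (plain string, no regex).
    text = text.replace("System ID: " + prefix, "")
    # Replace what sits between the date and the result with a tab.
    text = re.sub(r"\.xml\nDescription: ", "\t", text)
    # Collapse empty lines.
    return re.sub(r"\n{2,}", "\n", text)

# Hypothetical results; the filenames and values are invented.
prefix = "/Users/whanley/GitHub/DEG-content/"
raw = (
    "System ID: /Users/whanley/GitHub/DEG-content/1905-01-02.xml\n"
    "Description: Alexandria\n"
    "XPath location: /TEI/text[1]/body[1]/div[1]/p[1]/placeName[1]\n"
    "Start location: 10:5\n"
    "End location: 10:30\n"
    "\n"
    "System ID: /Users/whanley/GitHub/DEG-content/1905-01-03.xml\n"
    "Description: Cairo\n"
    "XPath location: /TEI/text[1]/body[1]/div[1]/p[2]/placeName[1]\n"
    "Start location: 14:2\n"
    "End location: 14:20\n"
)

result = clean_xpath_results(raw, prefix)
print(result)
```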

Using regex in Microsoft Word

Say you are trying to make a table of the results that you exported from Oxygen.

Import these results into a Word document. Then use Edit > Find > Advanced Find and Replace.

  1. Find two paragraph marks (^p^p), and replace with @@.
  2. Find one paragraph mark (^p), and replace with a comma (,).
  3. Find @@, and replace with paragraph mark (^p).
  4. Select all text, then use Table > Convert > Convert Text to Table.
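The trick in these steps is the temporary @@ placeholder, which protects the record breaks while the single paragraph marks become commas. A minimal Python sketch of the same idea follows. It assumes that records are separated by one blank line and that @@ never occurs in the text; the sample data and the name records_to_rows are invented.

```python
def records_to_rows(text):
    """Join the lines of each record with commas, one record per line."""
    text = text.replace("\n\n", "@@")   # step 1: protect the record breaks
    text = text.replace("\n", ",")      # step 2: line breaks become commas
    return text.replace("@@", "\n")     # step 3: restore the record breaks

# Hypothetical export: two records of two lines each.
sample = "1905-01-02\nAlexandria\n\n1905-01-03\nCairo"
rows = records_to_rows(sample)
print(rows)
```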