This will open a dialogue box. Under “Scope,” choose “All opened files,” then proceed with your search.
To search everything in the Digital Egyptian Gazette content repository, under Find/Replace in Files > Scope, choose “Specified path,” then navigate to the location where you’ve cloned your fork of the content repository. For more information about how to find this location, consult the GitHub tutorial.
To search directly with XPath, locate the XPath query box near the top left. The drop-down icon on the left of this box allows you to choose the scope: current file, all opened files, and so on. Now, try these commands:
//div
//div[@type="item"]
//div[@type="item"][contains(., 'cotton')]
//persName
or //placeName
XPath asks you to do two things: specify where you want to search, and specify what you want to search for. An XPath query is a series of terms (words) separated by slashes (/) and other punctuation. For example:
//div[@type="section"]/div/p[contains(., 'cotton')]
We use XML to structure our issues of the Egyptian Gazette by page, section, item, and so on. For example, we use nested pairs of tags to put <div type="item"> </div> inside <div type="section"> </div>, and <div type="section"> </div> inside <div type="page"> </div>.
This structure is commonly described as a “tree.” The root is the issue, which branches into six or eight pages, and each page branches into sections and items and paragraphs.
An XPath query lists the parts of the tree separated by slashes, starting from the root and heading towards the branches, like so:
//div[@type="page"]/div[@type="section"]/div[@type="item"]/p
A double slash (//) tells the computer to look anywhere in the document for the item that comes next. A single slash (/) tells the computer to look only one level down the tree, among the direct children of the item that comes before it. What difference does this make?
//div//head
would return any headline in any div in the whole document. This would not be a very good search, because it would return a huge number of results.
//div/div/div/div/head
would return only headlines whose parent div is itself nested inside three other divs. This would return fewer results.
The best way to say exactly where you want to look is by using attributes. These are contained in square brackets. For example, if you want to search for the headlines within page 1 only, you would say //div[@type="page"][@n="1"]//head. What goes in the square brackets are the attributes you put in the <div> tag when you’re encoding the issues.
With a little practice, you’ll learn to search for results only in the relevant parts of the newspaper.
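The difference between the double slash, the single slash, and attribute filters can also be tried outside the editor. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a small subset of XPath. The sample XML is invented for illustration, omits the TEI namespace, and is not taken from a real issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature issue: page > section > item, as described above.
SAMPLE = """
<body>
  <div type="page" n="1">
    <head>Page one</head>
    <div type="section">
      <head>Local and General</head>
      <div type="item">
        <head>Cotton Market</head>
        <p>Prices of cotton rose today.</p>
      </div>
    </div>
  </div>
  <div type="page" n="2">
    <div type="section">
      <div type="item">
        <head>The Plague</head>
        <p>No new cases were reported.</p>
      </div>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Double slash: every head at any depth below the root.
all_heads = [h.text for h in root.findall(".//head")]

# Single slashes: only heads that sit exactly three divs deep.
item_heads = [h.text for h in root.findall("./div/div/div/head")]

# Attribute filters: heads anywhere inside page 1, and nowhere else.
page_one_heads = [
    h.text for h in root.findall("./div[@type='page'][@n='1']//head")
]

print(len(all_heads), len(item_heads), len(page_one_heads))
```

The broad search finds the most headlines, the fixed-depth search the fewest, and the attribute filter narrows the search to one page without requiring you to count levels.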
After you tell XPath where you want it to search, you can tell it what you want it to return. (This is optional.) Probably the most common thing to search for is a word or words. To do this, add [contains(., "searchtext")] to the end of your search, putting your search word(s) between the quotation marks. For example, //div[contains(., "plague")] will return all divs that contain the word “plague.” Note that this search is case sensitive, and will not return “PLAGUE” or “Plague.” To remedy this, use the matches function (available in XPath 2.0 and later) with the 'i' flag, which makes the search case insensitive: //div[matches(.,'the plague', 'i')].
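To see why case matters, here is a rough Python sketch of the same two searches. ElementTree has no contains() or matches() function, so this example swaps in Python's re module to imitate them; the function name divs_containing and the sample XML are hypothetical. Unlike matches(), which treats its argument as a regular expression, this sketch escapes the search text so that it is matched literally.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: three items, two of which mention the plague.
SAMPLE = """<body>
<div type="item"><head>THE PLAGUE</head><p>Two cases at Alexandria.</p></div>
<div type="item"><p>Reports of plague in Port Said.</p></div>
<div type="item"><p>Cotton prices were steady.</p></div>
</body>"""


def divs_containing(root, text, ignore_case=False):
    """Return every div whose full text contains `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(text), flags)
    return [
        div for div in root.iter("div")
        if pattern.search("".join(div.itertext()))
    ]


root = ET.fromstring(SAMPLE)
case_sensitive = divs_containing(root, "plague")
case_insensitive = divs_containing(root, "plague", ignore_case=True)
print(len(case_sensitive), len(case_insensitive))
```

The case-sensitive search misses the item whose only mention is the capitalized headline, which is the situation the 'i' flag is meant to handle.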
You can also search for particular kinds of information. Add a slash to the end of the location, then tell it you want
count(): how many of these things are there? (count() is usually wrapped around the whole query instead, as in count(//div).)
string(): what is the string of text this thing contains?
number(): what number is found on this branch?
You can navigate around the tree by using commands like parent:: or following-sibling:: after the slash. These will move you up, down, and sideways in the tree.
For example:
//div[@type="wireReport"]/parent::div//dateline
//div[@type="wireReport"]/parent::div//dateline//placeName
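A step up to the parent can be sketched in Python as well. ElementTree does not support the parent:: axis by name, but it accepts ".." for the same move, with the difference that ".." selects the parent whatever its tag, not only a parent div. The sample structure below, with the dateline as a sibling of the wire report, is an assumption made for illustration; real issues may place the dateline elsewhere.

```python
import xml.etree.ElementTree as ET

# Invented sample: only the first item contains a wire report.
SAMPLE = """
<body>
  <div type="item">
    <dateline><placeName>London</placeName>, January 3.</dateline>
    <div type="wireReport"><p>Consols are firm.</p></div>
  </div>
  <div type="item">
    <dateline><placeName>Cairo</placeName>, January 4.</dateline>
    <p>A local notice with no wire report.</p>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Find each wire report, step up to its parent, then look down for datelines.
datelines = root.findall(".//div[@type='wireReport']/..//dateline")
places = root.findall(".//div[@type='wireReport']/..//dateline//placeName")

# Rough counterparts of count() and string().
dateline_count = len(datelines)
wire_datelines = ["".join(d.itertext()) for d in datelines]
wire_places = [p.text for p in places]
print(dateline_count, wire_datelines, wire_places)
```

Only the dateline that shares a parent with a wire report is returned; the Cairo dateline is ignored because its item has no wire report.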
Many ads, templates, and sections have xml:id or feature attributes embedded in them. These are meant to simplify your search. For features, use //div[@feature="shippingMovements"]. For tables, use //table[@xml:id="deg-ta-cppa01"].
Tables are a powerful part of the XML-encoded Egyptian Gazette. XPath gives us tools to return precise parts of the information these tables contain. For example:
//table//cell[contains(.,'P.T.')]/following-sibling::cell[1]/text()
Does not work for numbers, it seems.
//table//cell[contains(.,'Augment.')]/following-sibling::cell[1]/number()
Does not work if the number appears as a string (for example, if it contains a comma as a thousands separator).
//table//cell[contains(.,'Augment.')]/following-sibling::cell[1]/string()
//div//head[contains(.,'MARCHE DE MINET')]/following-sibling::p[1]/string()
//div[@xml:id="deg-ad-orm01"]//head/following-sibling::p[3]/date/@when
An example of working your way through table cells, looking at prices for cotton:
//table//cell[contains(.,'Cotton')]/following-sibling::cell[4]
//table[@xml:id="deg-ta-cotn01"]//cell[contains(.,'Russie')]/following-sibling::cell[1]
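The cell-by-cell navigation above, and the trouble with numbers that contain a comma, can be sketched in Python. ElementTree has no following-sibling:: axis, so this example imitates it by position within each row. The helper names following_cell and to_number, along with the table contents, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented table: a label, a unit, and a figure in each row.
SAMPLE = """<table>
<row><cell>Cotton</cell><cell>P.T.</cell><cell>1,250</cell></row>
<row><cell>Sugar</cell><cell>P.T.</cell><cell>98</cell></row>
</table>"""


def following_cell(table, label, offset=1):
    """Text of the cell `offset` places after each cell containing `label`."""
    results = []
    for row in table.iter("row"):
        cells = row.findall("cell")
        for i, cell in enumerate(cells):
            text = "".join(cell.itertext())
            if label in text and i + offset < len(cells):
                results.append("".join(cells[i + offset].itertext()).strip())
    return results


def to_number(text):
    """Convert a figure such as '1,250' by dropping the thousands comma."""
    return float(text.replace(",", ""))


table = ET.fromstring(SAMPLE)
raw = following_cell(table, "P.T.")
values = [to_number(v) for v in raw]
print(raw, values)
```

Stripping the comma before converting is one way around the failure noted above for figures stored as strings; it assumes the comma is always a thousands separator and never a decimal mark.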
The possibilities are endless. Here are some samples.
//placeName[contains(.,'Aden')]
count(//div[contains(., 'native')])
count(//div[@n="1"]//div)
count(//div/head[contains(.,'MARCHE DE MINET')]/following-sibling::p)
//p[string-length() > 5000]
//div[not(descendant::div[@n="1"]) and not(ancestor-or-self::div[@n="1"])]
//p[contains(.,'theatre')]/text() | //p[contains(.,'theatre')]/*[not(contains(.,'performance'))]/text()
It is possible to combine regular expression and XPath searches by using the find/replace menu. Enter the regular expression you wish to search for in the Find box, and the XPath location in which you wish to search in the XPath box.
Right-click on the results, then export them to a file. You can then clean up these results with regular expressions to remove the parts you don’t want. After this, you can work with the results in a spreadsheet.
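If you prefer to script this step, the clean-up and export can be approximated in Python. This is only a sketch of the idea, not a description of how the editor exports results: the function name results_to_csv and the sample XML are hypothetical, and the only clean-up applied is collapsing runs of whitespace.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Invented sample with a line break inside one place name.
SAMPLE = """<body>
<p>Arrived from <placeName>Aden</placeName> and
<placeName>Port
   Said</placeName>.</p>
<p>Left for <placeName>Aden</placeName>.</p>
</body>"""


def results_to_csv(root, tag):
    """Return one-column CSV text listing the cleaned text of every `tag`."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([tag])  # header row
    for elem in root.iter(tag):
        # Collapse line breaks and repeated spaces left over from the markup.
        text = re.sub(r"\s+", " ", "".join(elem.itertext())).strip()
        writer.writerow([text])
    return buffer.getvalue()


root = ET.fromstring(SAMPLE)
csv_text = results_to_csv(root, "placeName")
print(csv_text)
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet program.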
More complex querying can be accomplished using XQuery.
Example, returning the date of every issue with a page 7:
for $a in collection("file:/Users/whanley/GitHub/DEG-content/?select=1905-*.xml;recurse=yes")
where $a//*:div[@n="7"]
return $a//*:bibl/*:date
Identify six versus eight page issues:
for $a in collection("file:/Users/whanley/GitHub/DEG-content/?select=1905-*.xml;recurse=yes")
return if ($a//*:div[@n="7"])
then <eight-page>{data($a//*:bibl/*:date)}</eight-page>
else <six-page>{data($a//*:bibl/*:date)}</six-page>
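The six-versus-eight-page test can be approximated in Python for a single issue. This sketch assumes Python 3.8 or later, where ElementTree accepts {*} as a namespace wildcard in the same spirit as *: in the queries above. The names classify_issue and make_issue are hypothetical, and make_issue builds a bare-bones stand-in for an issue file, not a real one.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def make_issue(date, pages):
    """Build a skeletal, invented issue with the given number of page divs."""
    divs = "".join(
        f'<div type="page" n="{n}"/>' for n in range(1, pages + 1)
    )
    return (
        f'<TEI xmlns="{TEI_NS}"><teiHeader><bibl><date>{date}</date></bibl>'
        f"</teiHeader><text><body>{divs}</body></text></TEI>"
    )


def classify_issue(xml_text):
    """Label an issue by whether it has a div numbered 7, and return its date."""
    root = ET.fromstring(xml_text)
    has_page_seven = root.find(".//{*}div[@n='7']") is not None
    date = root.findtext(".//{*}bibl/{*}date")
    return ("eight-page" if has_page_seven else "six-page", date)


six = classify_issue(make_issue("1905-01-02", 6))
eight = classify_issue(make_issue("1905-01-03", 8))
print(six, eight)
```

To cover a whole year, the same function could be applied to the text of each file in your clone of the content repository.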