The Source, the Publisher and Data Extraction

What are the "Asia Directories"?

The Asia Directories and Chronicles is a reference work for foreign traders in the Asian region, published annually by the Hong Kong Daily Press between 1863 and 1941. It contains conversion tables for weights and currencies, information on import and export duties, legal texts, treaties and agreements, calendars, ship lists, advertisements, and information on local traditions and festivities.

Volumes over the years.

The core of the Directories, however, is the actual company directory. Arranged by country and then by locality, it lists (mostly) Western merchants, consulates, public institutions, insurance companies, transportation firms, and more. The Directories provide information on almost all aspects of a foreigner's daily life in the Asian regions covered.

The Institute for European Global Studies of the University of Basel has digital copies of all issues. The corresponding project to digitize these holdings has been running at the institute since 2015.

Typical Contents of a Volume

The individual Directory volumes are structured similarly year after year. In the early years the order differs, but the parts remain the same. The first part contains tables of contents and a yearly calendar that refers to various events and anniversaries. Its most comprehensive component is the Hong Kong Postal Guide, which contains detailed information on mail and merchandise shipments.

The second part, Treaties, Codes, &c., makes up about a quarter of a Directory volume and contains contracts and tariff regulations printed in full. Here, for example, one finds the Agreement of 1842, which the British imposed on the Chinese. In addition to the British documents, French, German, Russian, U.S., Portuguese, and Japanese treaties with China are printed. Agreements very similar in wording were ratified by the British, Americans, and Japanese with Korea and Siam. Japan has a special role in this section as the only non-Western power that had signed such treaties and declarations with the Asian countries mentioned. This is reflected, for example, in the fact that the usual formulations on jurisdiction appear only in a weakened form in the European-Japanese documents or are omitted altogether. Japan is also the only Asian nation for which restrictions on foreign traders are printed, in the Regulations for foreign companies in Japan. A good half of the treaty section consists of the H.B.M. Rules of Court, the Standing Rules and Orders of the Legislative Council of Hong Kong, and similar documents concerning the (mainly British and partly U.S.) court system.

The third part, the most extensive in all volumes, is the core of the source: the actual directories. This section is arranged first by country or region and, within these, by individual locality. For most countries and localities, there is an introductory text that provides information about the geographical location and the nature of the port, but also about scenic features and local customs.
In addition to the sober lists of goods to be taxed, there are poetic descriptions that explain, for example, the beauty of the harbor or the quality of an annual spring concert. The order of the countries runs roughly geographically from northeast to southwest. Within countries, however, the order seems random. For example, the cities in Japan are printed in the 1900 volume as follows: Tokyo, Yokohama, Hakodate, Osaka, Kobe, Nagasaki. This does not correspond to an order by size, alphabet, or geography. Nor does a classification by "importance" seem plausible, since in the introductory text the fourth city listed, Osaka, is described as "second city in Japan in point of size and commercial importance". Between 1860 and 1940, the number of towns listed changes only slightly. Between 1900 and 1920, only two new localities are added, both of which were previously covered by larger entities: Zamboanga (Philippines) was re-listed, and Labuan was given its own entry instead of being listed under British North Borneo.

The introductory texts on the localities are followed by the actual directory: the list of companies, consulates, public institutions, missions and government offices. Here, the individual volumes differ markedly from each other in the type of presentation, the information content and the sorting. Although only about four different "presentation styles" can be identified, even these differ in details. This is especially problematic for automated data acquisition procedures. In the early years (1863 to about 1878), attempts were made to group entries by category. For 1877, the (larger) localities were divided into Government Offices, Consulates, Educational, Clubs, &c., Masonic Lodges, Ecclesiastical, Public Companies, Insurances, Banks, Profession & Trades, and Hotels, Taverns, &c. This subdivision probably seemed either too burdensome or no longer useful to the editors in subsequent years. Although consulates, churches, missions, and government institutions remained grouped later, the remaining entries were arranged alphabetically. It can be surmised that the increasingly varied self-descriptions of the companies contributed to this. Whereas in the 1860s most of the firms listed referred to themselves simply as merchants, in the 1930s they called themselves general merchants, exporters, importers and commission agents, &c.

The individual entries also differ greatly. The following figures show a minimal and an average example. In addition, there are companies that fill several columns with their employee and agency lists. Basically, "non-commercial" entries can be distinguished from those of commercial firms. The non-commercial entries often contain long lists of employees including job titles and various other information (see 3rd figure below), while the commercial companies in most cases listed only a few employees and (if available) a list of agencies.

Minimal example.
(1900, p. 788)

Average example.
(1930, p. 1028)

(1900, p. 803)

The Foreign Residents Lists

The fourth part of the Directories consists of the Foreign Resident lists and, in later years, additional lists of steamships or local factories, etc. The Foreign Resident lists are an alphabetical enumeration of all foreign persons listed in the Directories. Each line usually consists of the name, a job title with company affiliation, and a place.

Start of the foreign residents list of 1925
(1900, p. 803)

However, random checks show that not all persons listed in the Directories are also listed in the Foreign Resident lists. For example, Dodwell & Co. lists over 30 employees in Hong Kong for the year 1925, but only 18 Dodwell employees in Hong Kong are found in the Foreign Resident lists. These checks are difficult, however, because various employees are listed in the directory section in several cities, but in the foreign residents section only once with a single city designation. In addition, the individual entries repeatedly list persons who are not even present on site (marked "absent"). But even taking these circumstances into account, the Foreign Resident lists do not seem to be "complete". It is noticeable that although individual women are listed, they seem to appear more frequently in the directory section than in the Foreign Resident lists. In addition to the note "absent", it often happens that cities are listed in parentheses after the names. It can be assumed that these persons are also not present on site. The following example shows that a considerable part of the recorded workforce need not be present at all. The entry is recorded in the Iloilo part, and even though the other employees are also in the Philippines, there are apparently only two people present in Iloilo itself.

Not all of the listed people are actually present.
(1927, p. 1398)

It is noticeable that the namesakes (and thus mostly the owners) of many companies are listed, but with a European city added. Thus Ed. A. Keller of the Ed. A. Keller Company is listed as the first employee for many years, but in each case with the addition (Zurich), where he also resided (at least for the majority of the time) (Wolle 2009).

The last part of the volumes consists of a collection of mostly half- or full-page advertisements. The companies advertising can also be found in the directories themselves and are mainly service providers (postal and ferry services, banks, insurance companies) and suppliers of luxury goods (cigars, alcoholic beverages). Advertisements are also increasingly scattered through the volumes, especially from the beginning of the 20th century. In addition, individual entries are highlighted in bold or companies refer to their advertising pages ("see advertisements"). The advertisements can provide valuable information in their details. On the one hand, they can be used to identify individual companies, because the home office in Europe or the nationality is often listed. On the other hand, price developments or changing service offerings can be observed over the years. For example, postal ships presented their routes and the frequency of their sailings, and suppliers of cigars listed the prices of their products.

The Murrows - A Publisher Family Making Money out of Information

Throughout their entire publication run, the Chronicles & Directories remained in the hands of the same Hong Kong-based publisher family. The founding editor, Yorick Jones Murrow (1815-1884), owned the Hong Kong Daily Press, the first English newspaper published in Hong Kong, and added the Asia Directories to his press empire in 1863. After his death, his widow Mary and later one of his sons, Henry Lloyd Murrow, kept the Asia Directories in print until the surrender of Hong Kong during World War II in 1941.

Although the history of the owner family is not well known, an extended obituary published in the London and China Express in 1884 gives us a glimpse of a rather rough but very successful businessman. Murrow regularly crossed the line with the British authorities and spent time in prison, but he discovered how to make money from the long-distance trade in information. From the very beginning in the mid-century, Murrow developed an elaborate system of issuing different publications at different frequencies, so that the same information could be sold more than once in various places under different banners. In the 1860s, for example, the English edition of the Daily Press was published every morning in Hong Kong, while the Chinese version was available three times a week.

The paper's long-distance delivery, the so-called Overland Trade Report, left the printing press twice a month in coordination with the Far Eastern press delivery to England, adding additional foils to the Straits Times Extra and having connections to the agents of the London & China Herald. Within this business of fast delivery, the Chronicles & Directories offered an annually updated source of information, a manual for daily business and an opportunity for self-representation for a worldwide trading community. The multilayered delivery of information was a carefully cultivated and addressed topic. The Asia Directories are therefore a source that helps us to understand how the world market developed in the second half of the 19th century, and to what extent new communication technologies were implemented, from the opening of the Suez Canal to steamships, trains, telegraphy and submarine cables.

Although Murrow's information empire was based in Hong Kong, the family ran an office in London's Fleet Street, right in the heart of the British Empire's media business. Murrow was also listed as a foreign resident, though often with an indication of his absence from the Far East, as was commonly done by residents while temporarily on home leave. In Jersey and London he was a member of the well-connected Masonic community of the United Grand Lodge of England. His biography is shaped by the typical patterns of global connectivity and, simultaneously, territorial absence. From the late 1860s onwards, it seems Murrow no longer ran his media empire from the island of Hong Kong, but from the distant Channel Island of Jersey. It is here, between Britain and continental Europe, that his grave lies in the Cemetery of St. Saviour.

From The Hongkong Daily Press, 1873-04-01.

Digital Resource Acquisition – Collecting Historical Archival Sources Scattered Around the World

In 1917 Wolf von Dewall wrote in a review in Weltwirtschaftliches Archiv Vol. 9 that the Chronicle & Directory was available on every desk in the "Far East". More than 100 years later it is challenging to collect one complete set of the books. Because the Chronicle & Directory was published in Hong Kong every year from 1863 to 1941, we expect there to be 79 books. Of these 79 books we found 78 listed in online catalogs of memory institutions around the globe. The EIB project has managed to digitize or acquire digital copies of 74 books from 13 different holding institutions. No single institution has a complete set of these volumes: their collections vary between 1 and 37 books. Some institutions had already digitized copies available, many did not. Besides regular digitization orders financed by the EIB project, digitization was made possible through the generous support of holding institutions and personal contacts. The gaps in the digital collection are due to prohibitively high digitization costs at some institutions, a lack of response from the holding institution or, in one case, incorrect catalogue information, where the catalogued microfilm did not contain a copy of the Chronicle & Directory but some other business directory from Hong Kong.

We started building the Asia Directory digital collection in December 2015 with a copy of the 1863 Chronicle & Directory held by the Staats- und Universitätsbibliothek Hamburg. Soon after, the largest available batch of 37 books was ordered from the Staatsbibliothek zu Berlin. Knowing only the obvious and simple structure of the 1863 Chronicle & Directory and assuming that the size and structure would be relatively stable over time, we asked the digitization contractor to deliver batches based on the book sections defined in the printed table of contents. This turned out to be ineffective and error-prone, since the principles on which the tables of contents were created, as well as the size and contents of the books, changed over time. Subsequent digitization was delivered as whole-book batches, and a separate segmentation workflow was implemented to create reliable tables of contents, enabling page ranges to be generated for each section of the volumes, such as the foreign resident listings. In addition, an image pre-processing step for OCR optimization was implemented because of the poor paper and print quality observed in the late 1930s volumes.
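How page ranges might follow from a reliable table of contents can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the section names and the page numbers are invented here and do not come from an actual volume or from the project's segmentation workflow.

```python
def section_page_ranges(section_starts, last_page):
    """Map each section name to an inclusive (first_page, last_page) range.

    section_starts: dict of section name -> first scanned page of that section.
    last_page: number of the final scanned page in the volume.
    """
    ordered = sorted(section_starts.items(), key=lambda item: item[1])
    ranges = {}
    for index, (name, start) in enumerate(ordered):
        if index + 1 < len(ordered):
            # A section ends on the page before the next section starts.
            end = ordered[index + 1][1] - 1
        else:
            end = last_page
        ranges[name] = (start, end)
    return ranges


# Invented example values, for illustration only.
example_starts = {
    "Calendar and Postal Guide": 1,
    "Treaties, Codes, &c.": 120,
    "Directory": 400,
    "Foreign Residents": 1300,
}
example_ranges = section_page_ranges(example_starts, last_page=1500)
```

With ranges of this kind, only the pages of one section, for instance the foreign resident listings, need to be passed on to OCR and parsing.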

Extract, Parse and Publish Data for Global History Research

The Foreign Residents Listings are particularly suitable for digital recording and quantitative studies because of their structure as a list. However, the effort required to collect the data across all available years should not be underestimated because of the inconsistencies. For example, volumes from earlier publication years differ from later years in the order of the individual parts of the list (occupation and institution are interchanged and the institution is indicated in parentheses). This leads to difficulties especially in the automatic categorization of list entries, since the recognition pattern is different. However, as these patterns are often consistent over several years, the effort is worthwhile as the Directories provide unique access to the dynamic developments of trade in the Asian region. Relationships between traders, agencies, firms, and partners, as well as localities, can be captured in networks and observed over a period of approximately 80 years - which is a distinctive feature of this source and can provide unique inputs for historical research.

Tokenization and Building Dictionaries

The strict one-line format of the Foreign Resident Listings (with few exceptions) makes automated tokenization possible. In our case this resulted in ALTO-XML files which contain all the data on a page separated by lines. Within the lines, words are tokenized by spaces. In a first step, some of the data is automatically corrected. The OCR has limitations for certain characters (such as '&'), and some words are frequently recognized incorrectly. With a curated list of common errors we can correct these errors before we start the information extraction workflow.
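The two steps described above, reading tokens line by line from ALTO-XML and applying a curated correction list, can be sketched as follows. This is a minimal sketch and not the project's actual code: it assumes the usual ALTO layout in which each TextLine element contains String elements with a CONTENT attribute, and the correction entries and the sample line are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical curated corrections: misread token -> intended token.
OCR_CORRECTIONS = {
    "8c": "&",
    "Hongkcng": "Hongkong",
}


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def lines_from_alto(xml_text):
    """Return one token list per TextLine, read from the String CONTENT attributes."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        tokens = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append([token for token in tokens if token])
    return lines


def correct_tokens(tokens):
    """Replace tokens that appear in the curated error list."""
    return [OCR_CORRECTIONS.get(token, token) for token in tokens]


# Invented sample page with a single line, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Abbott,"/><SP/><String CONTENT="J.,"/><SP/>
      <String CONTENT="clerk,"/><SP/><String CONTENT="Dodwell"/><SP/>
      <String CONTENT="8c"/><SP/><String CONTENT="Co.,"/><SP/>
      <String CONTENT="Hongkcng"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

sample_lines = [correct_tokens(line) for line in lines_from_alto(SAMPLE_ALTO)]
```

Matching on the local element name rather than on a fixed namespace is a design choice made here so that the sketch does not depend on a particular ALTO version.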

Example ALTO File

In order to assign categories to different text parts, rolling dictionaries were used. In a first step, a team of researchers classified text parts by combining single words to name parts, company names, professions and other categories. These values were used to build classified dictionaries which helped identify subsequent entries and thus the workflow recognized incrementally more text parts on its own. After the initial manual categorization the research team mainly had to confirm pre-categorized data and only correct false positives. To make this a collaborative workflow, a web service was created to dynamically enrich the dictionaries and let multiple researchers work on the same datasets.
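The idea of a dictionary that grows with every confirmed decision can be illustrated with a small sketch. The class and function names below are hypothetical, the web service and multi-user aspects are left out, and the example entries are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RollingDictionary:
    """Category dictionary that grows as researchers confirm entries."""

    entries: dict = field(default_factory=dict)

    def suggest(self, text: str) -> Optional[str]:
        """Return the known category for a text part, or None if unseen."""
        return self.entries.get(text.lower())

    def confirm(self, text: str, category: str) -> None:
        """Record a researcher's decision so later lines are pre-categorized."""
        self.entries[text.lower()] = category


def pre_categorize(parts, dictionary):
    """Split text parts into pre-categorized ones and ones needing manual review."""
    known, unknown = [], []
    for part in parts:
        category = dictionary.suggest(part)
        if category is None:
            unknown.append(part)
        else:
            known.append((part, category))
    return known, unknown


# Invented example: the second pass needs no manual work any more.
dictionary = RollingDictionary()
dictionary.confirm("clerk", "profession")
first_known, first_unknown = pre_categorize(["clerk", "Dodwell & Co."], dictionary)
dictionary.confirm("Dodwell & Co.", "company")
second_known, second_unknown = pre_categorize(["clerk", "Dodwell & Co."], dictionary)
```

Each confirmation shrinks the share of text parts that still require a manual decision, which is what makes the workflow increasingly automatic.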

Workflow to semi-automatically categorize text parts

After the manual correction/confirmation of about 50,000 entries, the dictionaries contained all the classified entries with a category. Simply speaking, we built large lists of company names, profession names, location names, etc. This was the basis for a workflow which enabled automated recognition of large parts of the Foreign Resident Listings.

Creating a Parser

The parser was written in such a way that it was able to analyze a single line, categorize its contents and create a confidence score to indicate how well the line was parsed. All the lines of a year were then fed into the parser and an average score was calculated. When we reached a threshold of 90% successful recognition we treated the year as successfully parsed.
The parser takes advantage of the structure of each line. It starts from the beginning, identifying all the name parts and from the back, identifying location parts. The remaining information is searched for in the dictionaries and matches lead to a successful identification. When no exact matches were found, similar entries were searched. If the Levenshtein distance (or other measures) to a dictionary entry was small enough, it was assumed that the parsed text is of that category.
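The approach described above can be sketched in Python as follows. This is a simplified illustration, not the project's parser: it assumes comma-separated lines, uses tiny invented dictionaries, treats the first part as a surname, recognizes initials with a simple pattern, and uses a fixed maximum Levenshtein distance of 2 (the text does not state the actual threshold or the other similarity measures used).

```python
import re

# Hypothetical miniature dictionaries; the project's dictionaries are far larger.
DICTIONARIES = {
    "profession": ("clerk", "merchant", "engineer"),
    "company": ("Dodwell & Co.",),
    "location": ("Hongkong", "Shanghai"),
}
INITIALS = re.compile(r"^(?:[A-Z]\.\s?)+$")


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def lookup(text, categories, max_distance=2):
    """Return (category, entry) for the closest dictionary entry, else None."""
    best = None
    for category in categories:
        for entry in DICTIONARIES[category]:
            distance = levenshtein(text.lower(), entry.lower())
            if distance <= max_distance and (best is None or distance < best[0]):
                best = (distance, category, entry)
    return (best[1], best[2]) if best else None


def parse_line(line):
    """Categorize the comma-separated parts of one line and score the result."""
    parts = [part.strip() for part in line.split(",") if part.strip()]
    found = [None] * len(parts)
    # From the front: a surname, then any parts that look like initials.
    cursor = 0
    if parts:
        found[0] = ("surname", parts[0])
        cursor = 1
    while cursor < len(parts) and INITIALS.match(parts[cursor]):
        found[cursor] = ("initials", parts[cursor])
        cursor += 1
    # From the back: the last part is expected to be a location.
    end = len(parts)
    if end > cursor:
        found[end - 1] = lookup(parts[end - 1], ["location"])
        if found[end - 1] is not None:
            end -= 1
    # Whatever remains in the middle is looked up in the other dictionaries.
    for index in range(cursor, end):
        found[index] = lookup(parts[index], ["profession", "company"])
    score = sum(item is not None for item in found) / len(parts) if parts else 0.0
    return {"parts": parts, "found": found, "score": score}


def year_is_parsed(lines, threshold=0.9):
    """Treat a year as parsed when the average line score reaches the threshold."""
    scores = [parse_line(line)["score"] for line in lines]
    return bool(scores) and sum(scores) / len(scores) >= threshold


# Invented example lines; "Go." imitates an OCR misreading of "Co.".
good = parse_line("Abbott, J., clerk, Dodwell & Go., Hongkong")
partial = parse_line("Zzz, Q., xylophonist, Hongkong")
```

In this sketch the score is simply the share of recognized parts per line. A fixed distance limit is crude for short words, so a limit relative to the length of the dictionary entry may be a better choice in practice.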

Example of how the parser works

In order to review the parser results, the workflow produced not only structured data in JSON format but also an easily readable report which allowed us to confirm the validity of the extracted data. Missing entries could be added manually to the dictionaries so that parts which the parser had not recognized would be identified in later runs.

Parser report snippet

Creating Structured Data Entries

We created a historic persons schema and used it to generate JSON data for each line in the Foreign Residents listings. These data points include not only the original and corrected transcription but also all recognized tokens with their respective category. Furthermore, we enriched the data with additional information where possible. For example, geolocations were added using an identifier, and implicit data was made analyzable (e.g. the notation "Mrs." was translated into structured gender and marital status information).
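How such a record might be assembled can be sketched as follows. The field names, the function name and the example line are assumptions made for illustration and do not reproduce the project's actual schema. Geolocation is left out because the identifier scheme is not described here.

```python
import json

# Honorific -> implied attributes. Only "Mrs." is mentioned in the text above;
# further honorifics would need their own, carefully reviewed entries.
HONORIFIC_ATTRIBUTES = {
    "Mrs.": {"gender": "female", "marital_status": "married"},
}


def build_record(year, original, corrected, tokens):
    """Assemble one JSON-serializable record for a single listing line.

    tokens: list of {"text": ..., "category": ...} dicts from the parser.
    """
    person = {}
    for token in tokens:
        # Make implicit information explicit where an honorific is recognized.
        person.update(HONORIFIC_ATTRIBUTES.get(token["text"], {}))
    return {
        "year": year,
        "transcription": {"original": original, "corrected": corrected},
        "tokens": tokens,
        "person": person,
    }


# Invented example line, for illustration only.
example_record = build_record(
    year=1925,
    original="Abbott, Mrs. J., typist, Dodwell 8c Co., Hongkong",
    corrected="Abbott, Mrs. J., typist, Dodwell & Co., Hongkong",
    tokens=[
        {"text": "Abbott", "category": "surname"},
        {"text": "Mrs.", "category": "honorific"},
        {"text": "J.", "category": "initials"},
        {"text": "typist", "category": "profession"},
        {"text": "Dodwell & Co.", "category": "company"},
        {"text": "Hongkong", "category": "location"},
    ],
)
example_json = json.dumps(example_record, indent=2)
```

Keeping both the original and the corrected transcription in the record means that every derived value can be traced back to what the OCR actually read.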

Resulting JSON data

Presenting the Data

The page you are looking at now is the result of our efforts to present this research data to a wider audience. The data itself is stored in a stable repository and made available through an API. This makes it possible to build presentation websites without storing the data outside the repository, and it ensures that the investment in the data remains safe for long-term use even if a particular presentation does not endure.