Scientific, Technical, and Medical (STM) journal articles are an explosively growing data source with constantly increasing requirements for complex metadata. Newly important has been the identification of funding and grant data. It has become critical that publishers know the sources of funding for journal articles so that they can meet the associated open source distribution obligations and so that funding organizations can track publications associated with their funding. Managing this information manually is costly and often inaccurate.
Over the past few years, we have been working on enhancing techniques to extract various kinds of metadata automatically from scientific and legal documents. We report here on recent work to extract funding source and grant information from within STM journal articles.
Identifying funding sources has taken on particular importance as funders want to better track what their funding is accomplishing, and scholarly journals want to better track their open source distribution obligations. While extracting funding information is a fairly new requirement, the need for extracting metadata of all kinds from within documents is not new, and applies to a number of areas which we have been working with. Examples include:
Cookbook recipes: Inside recipes are hidden away information on ingredients, procedures, timings. All of which would help you plan meals, figure out what you need in your fridge, maybe even automatically refer other ingredients to help you make that perfect meal.
Leases: Terms, expiration dates, default clauses, renewal clauses, rights of first refusal. Just think of the information you need when you manage a large portfolio of units that you're leasing out; there must be a better way to track information than in a thousand separate documents.
Loan & mortgage documents: There is a lot of information regarding valuations of property and the nature of the borrower hidden inside forms and documents. It would be helpful when valuing a mortgage portfolio to have this information pulled out and ready to use (dare I say, something like this might have even prevented the last housing melt-down).
STM journals: Buried within, these journals can contain information on authors, affiliations, citations, etc.
While the methods we use apply to many other areas, for the purposes of this article, we focus on one application - to identify and extract funder and grant metadata from free-form article text, specifically the granting organizations (sponsors), grant numbers and grant recipients. This article describes how we've used rule-based processes plus Natural Language Machine Learning (ML/NLP) to automate the process of getting at this data work.
At this point, you may be asking yourself, "Why not just ask the authors for this information?" Maybe someday. Some publishers are starting to collect this information but it will take time until it's implemented. And even when collected, not all authors are providing this information accurately, especially when there are a number of ways to refer to the funding source (refer to https://scholarlykitchen.sspnet.org/2015/01/28/what-has-fundref-done-for-me-lately/).
Specific problems can include: names not normalized, the taxonomy is unclear (Federal Highway Authority was not noted under U.S. Department of Transportation), information is ambiguous (University of California - which one?), and mnemonics are ambiguous where a single mnemonic can refer to many different organizations (e.g. a search for NSF in the Fundref database shows 8 entries - look it up on http://search.crossref.org/funding).
While some journals are starting to require authors to explicitly supply metadata, it's still only for a minority of journals and will likely be a long time coming; publishers are reluctant to rock the boat and trouble authors too much, as they might risk losing authors to competing journals. And automated extracted would be quite important for historical analysis, there's something like 50 million articles out there according to at least one source (https://duncan.hull.name/2010/07/15/fifty-million).
Funding Data in STM Articles
Over the past few years we've done extensive work to develop techniques to identify and extract metadata from inside various textual collections, including grant information in STM (Scientific, Technical and Medical) journal articles.
While grant information might be found almost anywhere in an article, it is often contained in either a section of several paragraphs identified as an acknowledgement section or in footnotes near the beginning of articles. The work described in this article assumes that such a section exists and has been identified. Following is an example of a typical acknowledgement section:
<ce:section-title id="st040">Acknowledgments</ce:section-title><ce:para id="p0255">This work was supported by grants of Saint Petersburg State University (11.38.271.2014 and 22.214.171.1245), Russian Foundation for Basic Research (RFBR) projects (Project No. 13-02-91327) and was performed in the framework of a collaboration between the Deutsche Forschungsgemeinschaft and RFBR (RA 1041/3-1). The authors acknowledge support from Russian-German laboratory at BESSY II, the program "German-Russian Interdisciplinary Science Center" (G-RISC) and the Resource Center of Saint-Petersburg State University "Physical Methods of Surface Investigation".</ce:para>
It's typically in prose, may identify one or more funding, and may or may not identify specific grants or other identifying characteristics. It may also acknowledge other sources of assistance who were not necessarily funders of the project. While simpler constructions refer to only one or two funding sources and follow regular grammar, most such sections are more complex. Also, since many authors are not totally fluent in English, the grammar is sometimes convoluted.
Humans are quite good at reading these sections and identifying the relevant information. But the process is labor-intensive, expensive, and because it relies on judgement calls, doesn't always provide consistent results. The increasing speeds required in the publishing industry, and the constant pressure on costs, cries out for automation to speed up the process, reduce manual costs, and provide consistent results. Automating the process also holds the promise for continual improvement as the feedback is used to improve the process.
Automating language analysis of course isn't easy to do reliably because of ambiguities in the use of languages, lack of grammatical knowledge, and the absence of standardization in how people use language ("if everyone followed the rules, this would be easy, alas..."). Also, while this article focuses on English language sources, the techniques we describe can be adapted to other languages, though the analysis of grammatical constructs would need to be adapted to the specifics of the language.
Worth noting is that it could contain multiple institutions, several grants within an institution, and may also include non-funding support from various sources. Let's unpack this even more with the question, Why is extracting funder and grant information important? Here are some answers:
There are emerging rules on open access that require publishers to monitoring the varying open-access posting and payment requirements of the wide variety of licenses. According to the Registry of Open Access Repository Mandates and Policies (ROARMAP) there are currently at least 764 access policies to track (http://roarmap.eprints.org/view/policymaker_type/. Compliance requires knowing which institution funded the research.
Funders want a better way to track the yield of their funding. Automatic tracking will help address this need.
Identification of conflict of interest situations. Example: Where funders might be beneficiaries of the research.
Let's define some terms in our process:
Tokenization: This is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called 'tokens'. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. The tokens become the input for another process like parsing and text mining.
Extended Tokenization: As we use it, and as described by Hassler and Fliedl (http://www.witpress.com/elibrary/wit-transactions-on-information-and-communication-technologies/37/16699) this is tokenization that takes advantage of domain specific knowledge (also known as context-dependent tokenization). In our case we would use it to identify keywords or phrases that have been determined to consistently indicate the start of a text sequence, such as tokens that identify grant information, granting organization, a grant ID, or an author. Taking advantage of context allows us to more easily optimize the analysis to provide meaningful results.
Classification: Classification has its English meaning - to identify a thing as belonging to one distinct type as opposed to another. In Machine-Learning (ML) it goes beyond that definition to suggest a methodology - using features of the text to separate classes. In this paper we are using classification to identify a sequence of words as either being a candidate funder, or as not a funder (binary or logistic decision).
The following provides an overview of our methodology; it's not intended as a machine learning tutorial, but rather to provide the context for what we were able to accomplish, and where we think there is room for further development.
Included is discussion of the following methods we applied in order to improve the initial results:
Lexical Analysis: There is an expected sequence for the association between a grant-id and a grant-sponsor. This suggests a 'grammar' can be imposed on the output of the lexical analyzer through a parser.
Machine Learning/Natural Language Processing (ML/NLP): In order to reach higher we are working ML/NLP techniques to identify funding sources which are not necessarily in the Funding Data database, or are in formats which don't cleanly match. (A good overview is found in https://en.wikipedia.org/wiki/Natural_language_processing).
External verification: Use a table with known granting organizations to verify the offered candidates. To date we have been using the Funding Data Registry at CrossRef (http://search.crossref.org/funding formerly known as FundRef).
The document set we used for this analysis consisted of a corpus of 1000 xml journal articles in which we identified, and segregated, the sections which would be most likely to contain funding data, which was identified as the Acknowledgement section or something similar. While journals from various publishers have some variance in how they handle this, most do have the information in some identified section.
Our approach, using context-dependent tokenization of the text, was to parse the token stream to find sentences relating to grant funding. For each document the section containing the words: grant, trust, fund, and support were extracted, stripped of any XML tags and used to create a free-text data stream.  
Let's break down how we parse the token string. In the example shown, figure 1, having identified the appropriate sentence because it contains the word "Grant", we would then be looking for a start word (a token), like "supported", which possibly identifies that a supporting organization follows. In the example we're looking at, the organization providing the support follows somewhere down the line, and the funding instrument and recipient follows in due course. Typical starting words include "grant", "support project", "award", "contract", "agreement", and "scholarship". Once you've identified the start word, it will define the context of the rest of the sentence, and lets you parse the remainder of the sentence based on the following rules:
Find a start token based on the list of common start tokens.
Once you find a start token, search for the start of funding organizations by skipping words until you find the first proper noun, which we assumed would be the first capitalized word, see figure 2.
Next search for the Grant ID token.
Must follow a grant start word, i.e. token.
All letters in a Grant ID must be capitalized.
Minimum length for a Grant ID is four characters.
Can include but not end with these characters: space, #, 0-9, /, -, _
Our primary goal was to identify all grant funders and the grant numbers within the text, with a secondary goal to locate grant recipients, which we searched for with logic similar to the above. Internal divisions such as punctuation and parenthesis aided in the transition between "grant text" and "non-grant text". In addition to looking for the first capitalized word, we did some more analysis using pattern matching and textual context. We also used the start words as stop words for the prior segment. This was to let us know that we have probably identified all the information for that funding source and that we were probably now looking at another funding source. At the end of this analysis we would expect to have found a set of candidate phrases where each phrase had a candidate grant funder, zero or more grant numbers, and zero or more recipients.
Classification is a key methodology we used next. The above process alone correctly identified over 50% of the granting institutions, along with grant ID's, with a lower yield for recipients. These results, while impressive, were not sufficient for a production setting. Our goal was to get the automated process into the 80-90% area for the process to be considered useful. Getting into the 90-95% was the ultimate goal.
In order to improve precision we added a machine learning classifier trained on proper names taken from our corpus of documents. The internal functioning of a classifier is outside the scope of this article, but a good overview is available in this article: https://en.wikipedia.org/wiki/Support_vector_machine. The basic idea was to identify which combination of words was likely to indicate it being a funding sourcing. This was done by working with training sets, which we could then manually analyze and determine whether the selected items fit into the category (valid funding source) or not. If the training set was sufficiently representative, than the computed ML algorithms should correctly predict the correct answers for the other sets that would be analyzed, which would then be an automated process.
The training of the classifier was based on a set of natural language processing tools from R (a set of open access software tools from the The R Project for Statistical Computing, www.r-project.org. The first step was to convert the phrase into a "bag of words" by dividing the textual stream into a set of words. "Bag of Words" is a term used in ML when you want a list of all words appearing in a set, but don't care what the order might be. The words are stemmed to normalize and simplify the analysis (e.g. walk, walker, walking, walked all have the same stem) and common stop-words (a, the, and, etc.) are removed. A term frequency vector was calculated, and words with low frequency, in our case those occurring in less than 10 documents, were removed from the analysis. Surprisingly, after this analysis, on 1000 documents, only about 50 words remained.
The subsequent words were used to build a document term matrix where each column was either 0 or 1 to indicate the presence or absence of a word and each row contained a place holder for one of the 50 words used in the analysis. From the above data, a Support Vector Machine (SVM) was trained. Background on SVM is available at https://en.wikipedia.org/wiki/Support_vector_machine.
The purpose of the classifier, once trained, is to return an evaluation for each instance, of whether the instance is a funding institution or not. The model based on initial testing showed an F1-measure of .86; the F1-measure is considered a measure of "goodness of fit", and without getting too deeply into the statistics, tells that the chances that we guessed correctly is about 86%.
To further improve the accuracy of our results, the next step would be to compare them to a known list of granting organizations. If there was a complete, authoritative database, we probably wouldn't need some of the sophisticated techniques we've discussed. But that doesn't exist today. The Funder Registry maintained by Crossref (formerly known as FundRef) is considered the most authoritative, and while it is being regularly improved, it is not yet complete and there are some ambiguities.
By using the result of our analysis, along with the Funder Registry, we have gotten results in the 90%'s, and anticipate being able to improve on that.
In addition to improving automated results, there is further work we are doing to disambiguate mnemonics, in which a designated funder could be one of many, and to better identify grant recipients, who are not often not fully identified in funding acknowledgment.
The goal going forward will be to increase yield further, and with better precision, while reducing false positives. Our plans are to achieve better calibration with larger test sets, improvements in machine learning and in our parsing algorithms, improvements in the Funder Registry and authority files.
 This type of extended-tokenization based on a domain specific language is described by M. Hassler & G. Flied (Text preparation through extended tokenization in Data Mining VII: Data, Text and Web Mining and their Business Applications 13). Because the lexical tokenization takes advantage of the surrounding text there is a greater confidence that the extracted text relates to the domain of interest.
 The tokenizing software, or lexer, was based on lex which was originally written by Mike Lesk and Eric Schmidt (Lesk, M.E.; Schmidt, E. (July 21, 1975). "Lex - A Lexical Analyzer Generator" (PDF). UNIX TIME-SHARING SYSTEM:UNIX PROGRAMMER'S MANUAL, Seventh Edition, Volume 2B. bell-labs.com. Retrieved Dec 20, 2011.)