Research Portal Denmark has created a data pipeline and a system architecture that can collect, enrich and link data from various sources. The portal also offers a range of services for our end users.
The High Level Architecture diagram illustrates the four main processes involved:
- Harvesting and pre-processing: The data from different commercial and local data providers is gathered and stored in the databases.
- Data Enhancement: The data is augmented with additional or improved data elements, in collaboration between the main pipelines and the NORA data analysts.
- Data Consolidation: The data is clustered to identify and link identical publication records from different providers.
- Dissemination: The data is made available through a web/search interface and analytics tools.
In more detail:
There are two main pipelines, which share several processing steps: the Global pipeline and the Local pipeline.
Global data is stored in raw MongoDB collections that standardize the heterogeneous inputs into more comparable JSON structures. These raw collections hold the results in JSON format while preserving all the original fields and nested structures.
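The standardization step might be sketched as below. This is a minimal illustration, not the portal's actual code: the field names, provider keys, and the `FIELD_MAPS` table are all invented for the example; only the idea of mapping heterogeneous inputs onto comparable keys while keeping the original payload comes from the text above.

```python
# Illustrative sketch: mapping provider-specific record layouts onto a
# comparable JSON structure before storing them in a raw MongoDB collection.
# All field names here are hypothetical; the original payload is kept intact
# under "original", mirroring the raw collections described in the text.

def normalize_record(provider: str, record: dict, field_maps: dict) -> dict:
    """Map one provider's field names onto shared keys, preserving the raw input."""
    normalized = {"provider": provider, "original": record}
    for shared_key, provider_key in field_maps[provider].items():
        if provider_key in record:
            normalized[shared_key] = record[provider_key]
    return normalized

# Hypothetical field maps for two providers with different layouts.
FIELD_MAPS = {
    "provider_a": {"title": "ArticleTitle", "doi": "DOI"},
    "provider_b": {"title": "title", "doi": "doi"},
}

rec = normalize_record(
    "provider_a", {"ArticleTitle": "On Pipelines", "DOI": "10.1/x"}, FIELD_MAPS
)
```

In the real pipeline, documents like `rec` would be inserted into the raw MongoDB collection, so downstream steps can query shared keys without losing the provider-specific original.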
Local data is harvested by querying all the local repositories for a full dataset in XML format and is stored raw in an SQLite database. Then the data is transformed into JSON format and also stored in MongoDB.
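The local flow (raw XML into SQLite, then transformation to JSON) can be sketched with the standard library alone. The XML layout, table name, and repository id below are invented for illustration.

```python
# Minimal sketch of the local harvest: XML records are stored raw in SQLite,
# then transformed into JSON documents. The record layout and table name are
# hypothetical; the SQLite-then-JSON flow follows the description in the text.
import json
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_records (repo TEXT, xml TEXT)")

xml_record = "<record><title>Local Study</title><doi>10.1/abc</doi></record>"
conn.execute("INSERT INTO raw_records VALUES (?, ?)", ("repo_1", xml_record))

def xml_to_json(xml_text: str) -> str:
    """Flatten a simple XML record into a JSON object (one level deep)."""
    root = ET.fromstring(xml_text)
    return json.dumps({child.tag: child.text for child in root})

docs = [json.loads(xml_to_json(row[1]))
        for row in conn.execute("SELECT repo, xml FROM raw_records")]
# In the real pipeline, docs would then be inserted into MongoDB.
```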
As a rule, all data lives on a single MongoDB server, in one or more collections. The objective is to have a single source of truth that creates a common reference point and facilitates the technical management of the data flows.
The next common process for both pipelines is the name variants extraction, which involves:
- Extracting and counting the list of unique affiliation ids to feed into the affiliation mappings
- Extracting and matching DOIs within and between data providers and creating a new collection with the results
- Creating more uniform data structures (parsed) that allow feeding processes further down the line.
This extraction is crucial for the main service of data enhancement, also common to both pipelines. The tools used here are Neo4j and Google Sheets, the latter storing the master mapping tables of organizations, countries and subject classifications. When a name variant is extracted, it is compared against the mappings to see if there is a match, meaning that this variant has been identified and handled before. If it is already mapped, the number of its occurrences in all the records gets updated in the mapping sheet. If the variant is not already mapped, a new code is proposed in the Google Sheet for subsequent manual validation. Details about the exact process and the rules can be found here.
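The lookup-or-propose logic can be sketched as follows, with a plain dict standing in for the Google Sheets master mapping table. The codes, variants, and counts are invented examples.

```python
# Sketch of the mapping check described above. A dict stands in for the
# Google Sheets master mapping table; codes and variants are hypothetical.

mapping = {"Aarhus Universitet": {"code": "ORG001", "count": 41}}
proposed = {}  # new variants awaiting manual validation

def handle_variant(variant: str, occurrences: int) -> str:
    if variant in mapping:
        # Variant has been identified before: refresh its occurrence count.
        mapping[variant]["count"] += occurrences
        return mapping[variant]["code"]
    # Unknown variant: propose a new code for subsequent manual validation.
    code = f"NEW{len(proposed) + 1:03d}"
    proposed[variant] = {"code": code, "count": occurrences}
    return code

handle_variant("Aarhus Universitet", 3)       # known variant: count updated
new_code = handle_variant("Aarhus Univ.", 1)  # unknown variant: proposed
```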
The next main service that is provided is the data consolidation among the different data providers, both global and local. The clustering of the global data – GOI clusters of same records across the three global providers – is created within Neo4j. The clustering of local data – LOI clusters of same records across all local data providers – is done in a different way due to the type of data and its quality. As the diagram depicts, the LOI clusters are also imported into Neo4j to be used for the final step of the universal clustering – NOI clusters of same records across GOI and LOI clusters. Detailed information about the algorithms and the rules used for clustering can be found here.
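As a heavily simplified stand-in for this consolidation, the snippet below groups records from different providers purely by shared DOI. The actual GOI/LOI/NOI algorithms use richer rules (documented elsewhere, per the text); this only illustrates the idea of linking identical records across providers into one cluster.

```python
# Very simplified illustration of record consolidation: records from different
# providers are grouped by shared DOI. The real clustering rules are richer;
# provider names and ids here are invented.
from collections import defaultdict

records = [
    {"provider": "A", "id": "a1", "doi": "10.1/x"},
    {"provider": "B", "id": "b7", "doi": "10.1/x"},
    {"provider": "C", "id": "c3", "doi": "10.1/z"},
]

clusters = defaultdict(list)
for rec in records:
    clusters[rec["doi"]].append(f'{rec["provider"]}:{rec["id"]}')

# Each cluster links the provider-local ids of one publication.
```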
The information generated about the enhanced name variants and the clusters is stored back in MongoDB and, via the FastAPI, becomes available to our websites.
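A sketch of how a cluster document fetched from MongoDB might be shaped for an API response is shown below. The document fields, cluster id, and route path mentioned in the comment are all hypothetical, not the portal's actual API.

```python
# Sketch of shaping a cluster document from MongoDB into an API payload.
# In the real service a function like this would sit behind a FastAPI route
# (e.g. GET /clusters/{cluster_id}); all field names here are hypothetical.

def cluster_response(doc: dict) -> dict:
    return {
        "cluster_id": doc["_id"],
        "members": doc.get("members", []),
        "name_variants": doc.get("name_variants", []),
    }

doc = {"_id": "NOI-42", "members": ["A:a1", "B:b7"], "name_variants": ["DTU"]}
payload = cluster_response(doc)
```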
This data is presented through search interfaces, where users can perform simple and advanced text searches and use advanced filters that are possible thanks to the enhanced and consolidated data. The well-structured nature of the Global data allows the use of VIVO for this purpose, an open-source semantic web tool for research discovery. There is one VIVO instance for every data provider. The clustered data (GOI clusters) is derived from these instances and presented in another VIVO instance, “Across All Data”, where the user can search across data from all three global providers. The search interface for the local data is based on an Elasticsearch index, and special efforts are made to create a consolidated metadata presentation of the clustered records. More information about the merging display rules can be found here.
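One way such a consolidated presentation could work is sketched below. The display rules here (longest abstract wins, keywords are unioned, first title taken) are invented examples, not the portal's actual merging rules, which are documented separately.

```python
# Illustrative consolidated metadata display for a cluster of local records.
# These merge rules are invented for the example; the portal's actual display
# rules are documented elsewhere.

def merge_display(records: list[dict]) -> dict:
    abstracts = [r.get("abstract", "") for r in records]
    keywords = sorted({kw for r in records for kw in r.get("keywords", [])})
    return {
        "title": records[0]["title"],          # take the first title seen
        "abstract": max(abstracts, key=len),   # prefer the fullest abstract
        "keywords": keywords,                  # union of all keywords
    }

merged = merge_display([
    {"title": "Study", "abstract": "Short.", "keywords": ["open science"]},
    {"title": "Study", "abstract": "A longer abstract text.", "keywords": ["denmark"]},
])
```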