Seyed Amir Hosseini Beghaeiraveri

19 Nov 2021

Biohackathon Europe 21 - Project 21 Report

BioHackathon Europe is an annual gathering of bioinformaticians from all around the world, held by ELIXIR since 2018. Each biohackathon consists of projects with different focuses, all related to bioinformatics. This year, the biohackathon was held in Barcelona and brought together in-person and remote participants in 38 projects. I took part in Project 21: Handling Knowledge Graphs Subsets.

Project 21: Handling Knowledge Graphs Subsets

As far as I know, the subsetting project started at the SWAT4LS hackathon 2019 with Andra Waagmeester, Dan Brickley, and others. Initially, the project's goal was a very fast and effective tool that could extract a subset of Wikidata with just a few clicks. That goal soon expanded to cover all kinds of RDF KGs. The project continued at Biohackathon 20 last year, where I was introduced to the group by my supervisor, Dr. Gray (the project could help me investigate Wikidata and escape from its huge size).

The first SWAT4LS hackathon (2019) was more about defining the problem and identifying the benefits and applications of subsetting. In Biohackathon 20 and the SWAT4LS hackathon that followed it (2020), we were able to suggest several subsetting pipelines. The focus last year was more on the use of ShEx and ShEx validators. However, the team became familiar with WDumper, and very soon we found that ShEx slurpers are not suitable for large-scale subsetting (slurping is the ability of some ShEx validators to return the RDF triples they visit during validation).
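To make the idea of slurping concrete, here is a toy sketch in Python. It is not ShEx and not how any real validator works internally: the "shape" is just a hypothetical dictionary of predicates marked required or optional, and the graph is a list of tuples. It only shows the principle that validation and triple collection happen in the same pass.

```python
def validate_and_slurp(triples, focus, shape):
    """Check a focus node against a toy shape and collect the visited triples.

    triples: list of (subject, predicate, object) tuples
    shape:   dict mapping predicate -> True if required, False if optional
             (a hypothetical stand-in for a real ShEx shape)
    Returns (conforms, visited), where visited is the "slurped" subset.
    """
    visited = []
    conforms = True
    for predicate, required in shape.items():
        matches = [t for t in triples if t[0] == focus and t[1] == predicate]
        visited.extend(matches)  # every triple the validator touches is kept
        if required and not matches:
            conforms = False
    return conforms, visited


# Tiny hand-written graph for illustration only.
graph = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P569", "1952-03-11"),
    ("wd:Q42", "rdfs:label", "Douglas Adams"),
    ("wd:Q5", "rdfs:label", "human"),
]
person_shape = {"wdt:P31": True, "rdfs:label": False}

ok, subset = validate_and_slurp(graph, "wd:Q42", person_shape)
print(ok)
for triple in subset:
    print(triple)
```

Only the triples that the shape mentions end up in the subset, so the birth date triple and the triples about wd:Q5 are left out. Real ShEx shapes are much richer (cardinalities, value constraints, references to other shapes), which this sketch ignores.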

Last year we made a lot of progress on use cases too. The notion of topical subsets, which I am using in my current research, actually came out of last year's project. We were also able to extract some initial subsets. We had a complete Docker image of the Wikidata Gene Wiki project combined with information from other datasets by Dan Brickley.

Over the past year, the use of subsets has increased and more papers have been published about subsetting. I am using subsets as the basis of my recent paper and will use them in the RQSS. We also found a bunch of new subsetting tools. We found KGTK, which is a work of the University of Southern California Information Sciences Institute. Jose Labra has been working on two new tools. We have had good experiences with WDumper over the past year. In short, subsetting has found its place in Linked Data research. So this year's project was titled "Handling Knowledge Graphs Subsets". The aim was to integrate the different subsetting approaches, enrich the existing subsets, and improve the existing tools as much as possible.

Figure: Project 21 outcomes, participants, and future work. Taken from Biohackathon 21 final report slides.

Hacking days

On the first day, we talked about our achievements in the past year and the uses we have made of subsets. I talked about my experiences with WDumper and how I used it in my last paper (Reference Statistics) to create 6 subsets based on 6 WikiProjects. We discussed where to keep the outputs (Zenodo, private servers, creating HDT versions, etc.). We also talked about creating a Wikibase directly from the N-Triples files of subsets, mainly via Andra's Wikidata Integrator tool.

Making the most of Dan Brickley's presence!

On the only afternoon that we had DanBri in the group, we had a very long discussion about encoding problems in Wikidata RDF dumps and WDumper outputs, and about ways to fix them. We believed the problems are due to the conversion from JSON to RDF and that the errors originate in the Wikidata Toolkit, so the problem should be fixed at the root (to be done!). We also discussed the best triplestore for handling the subsets.

Do some programming

In the first two days of improving methods and subsets, I wrote a script that adds the labels and descriptions of missing items to WDumper-based subsets. The original idea came from Dr. Gray, who used one of my subsets for one of his courses last year. He noticed that there are lots of Q and P IDs in the second level of subsets that have no other information in the subset. We thought it would be good to add at least the labels and descriptions of those Q/P IDs from Wikidata to the subset to increase readability. Ammar obtained a subset of lipids extracted by PyShEx slurping, which the group had not been able to do before.
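The label-adding idea can be sketched roughly as follows. This is not the script I wrote at the hackathon, only a minimal illustration under some assumptions: the subset is in N-Triples, entities use the usual http://www.wikidata.org/entity/ and prop/direct/ URIs, and the function names are made up for this post. The step that actually fetches labels and descriptions from Wikidata (for example through the wbgetentities API) is left out, so the values below are typed in by hand.

```python
import re

# Assumption: only these two URI patterns are considered.
WD_URI = re.compile(r"<http://www\.wikidata\.org/(?:entity|prop/direct)/([QP]\d+)>")
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
SCHEMA_DESC = "<http://schema.org/description>"


def find_unlabelled_ids(ntriples_lines):
    """Return the Q/P IDs mentioned in the subset that have no rdfs:label triple."""
    mentioned, labelled = set(), set()
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        mentioned.update(WD_URI.findall(line))
        parts = line.split(None, 2)  # subject, predicate, rest of the line
        if len(parts) == 3 and parts[1] == RDFS_LABEL:
            match = WD_URI.fullmatch(parts[0])
            if match:
                labelled.add(match.group(1))
    return sorted(mentioned - labelled)


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def label_triples(entity_info, lang="en"):
    """Build N-Triples lines from {id: (label, description)}; None values are skipped."""
    lines = []
    for entity_id in sorted(entity_info):
        label, description = entity_info[entity_id]
        subject = f"<http://www.wikidata.org/entity/{entity_id}>"
        if label:
            lines.append(f'{subject} {RDFS_LABEL} "{escape_literal(label)}"@{lang} .')
        if description:
            lines.append(f'{subject} {SCHEMA_DESC} "{escape_literal(description)}"@{lang} .')
    return lines


# Hand-written two-line subset for illustration.
subset = [
    '<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .',
    '<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .',
]
missing = find_unlabelled_ids(subset)
print(missing)

# In a real run these values would come from Wikidata; here they are sample input.
fetched = {"Q5": ("human", None), "P31": ("instance of", None)}
for line in label_triples(fetched):
    print(line)
```

The new lines can then simply be appended to the subset file, since N-Triples has no ordering or header to take care of.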

Ontology and OWL Subsetting

On the very first day, we had guests from Project 32: WikiProjects. They were looking to get a subset of ontologies and, as far as I know, they were going to use ShEx constraints instead of OWL in the first steps, to reduce the amount of OWL checking, narrow down the required computations, and speed up reasoning over ontologies. It was a new and very interesting use case. Eric from our group went deeply into the problem with them.

KGTK is now our teammate!

In the middle of the project, I managed to invite Filip, the head of the KGTK team, to the group. Sabah joined our meetings too, and after that we had the most experienced people in each subsetting approach in the discussions.

Outcomes and Future Work

The outcomes of the project this year, one by one:

Jose Labra's two subsetting tools. Jose has been developing two subsetting tools (WDSub and SparkWDSub) that use ShEx for filtering and the Wikidata Java Toolkit to extract information directly from the Wikidata JSON dumps. They do ShEx-like filtering over the JSON dump, which I believe is the best option. SparkWDSub is very interesting, with a mathematical foundation (the Pregel algorithm) and distributed computation!
The main achievement, I believe, is that we aggregated the features and pros/cons of each tool in a detailed table. That table will be completed.
We also collected all available subsetting approaches and tools.
We defined a quite new use case of subsetting, which is ontology subsetting, and we had lots of discussions about improving the subsets and fixing errors.

We are going to submit a report preprint to BioHackrXiv very soon. Then we plan to complete some experiments on large-scale data with different tools and report all achievements in a journal paper, probably in Nature's Scientific Data journal. I also have the ambitious idea of making SparkWDSub the most flexible, best-performing, and most accurate subsetting tool, which seemed cool to my teammates, but surely needs time!

In the end, Biohackathon 21 was a great pleasure. I hope to work with all my teammates again soon.

Please share your comments with me via email (sh200 [at] hw.ac.uk) or Twitter.
