Seyed's Blog

16 Sep 2021

Wikidata Quality Days 2021

From 8 to 15 September 2021, I participated in Wikidata Quality Days 2021. The event included various editing sessions for Wikidata editors to work on quality-related parts of Wikidata, such as properties, ranks, and constraints, but I only attended the presentation sessions.

Day 1

The first day was an introduction to what data quality is, why it is important, why it is tricky, the notion of subjective quality dimensions, and so on. Then there was a presentation about using scholarly article metadata to enrich Wikidata's biomedical statements, in particular a Wikidata bot called RefB.

The internal operation of such bots is pretty simple: first, a query fetches unreferenced statements; then the bot searches third-party databases (such as PubMed Central in this example) and adds the returned IDs to Wikidata as provenance. But as I always say, the quality of these references should be assessed before concluding that such a bot increases the data quality of Wikidata or of any knowledge graph.
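To make that loop concrete, here is a minimal sketch of how such a bot could work, assuming a toy SPARQL query and the public PubMed E-utilities API; the property IDs, search term, and overall structure are my own illustration, not RefB's actual code.

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# Toy query: "medical condition treated" (P2175) statements that carry no
# reference at all. Purely illustrative, not the query RefB uses.
QUERY = """
SELECT ?item ?statement WHERE {
  ?item p:P2175 ?statement .
  FILTER NOT EXISTS { ?statement prov:wasDerivedFrom ?ref . }
}
LIMIT 10
"""

def unreferenced_statements():
    """Fetch a batch of unreferenced statements from the Wikidata Query Service."""
    r = requests.get(WDQS, params={"query": QUERY, "format": "json"})
    r.raise_for_status()
    return r.json()["results"]["bindings"]

def pubmed_ids(term):
    """Search PubMed via the E-utilities API and return candidate article IDs."""
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "pubmed", "term": term, "retmode": "json"},
    )
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

for row in unreferenced_statements():
    statement = row["statement"]["value"]
    # A real bot would derive a meaningful search term from the statement's
    # subject and object labels; this placeholder is just for illustration.
    pmids = pubmed_ids("placeholder search term")
    if pmids:
        # Here the bot would write a reference (e.g. PubMed ID, P698) back to
        # the statement through the Wikidata API; that step is omitted.
        print(statement, "->", pmids[:3])
```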

Anyway, the next presentation was Alessandro Piscopo's talk about the research he did during his Ph.D. on the quality of Wikidata, especially on references and the impact of bots. His work was my early motivation to work on Wikidata reference quality.

Day 2

Day 2 started with a presentation about ORES, a machine learning service that detects vandalism in Wikidata by learning from previously deleted statements and from the tags and labels that human editors placed on incorrect or deleted data.
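As a rough illustration of how a client can ask ORES for a score (my own sketch, written from memory of the API, so the revision ID and the exact response layout are assumptions rather than something shown in the talk):

```python
import requests

# Score an arbitrary Wikidata revision with ORES's "damaging" model, as the
# service was deployed in 2021. The revision ID is made up and the response
# layout is from memory, so treat both as assumptions.
rev_id = 1234567890
resp = requests.get(
    f"https://ores.wikimedia.org/v3/scores/wikidatawiki/{rev_id}/damaging"
)
resp.raise_for_status()

score = resp.json()["wikidatawiki"]["scores"][str(rev_id)]["damaging"]["score"]
print("predicted damaging:", score["prediction"])
print("probabilities:", score["probability"])
```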

The next part was the ShEx community's presentation about entity schemas, which drew the largest audience: about 26 people attended. Andra gave a quick introduction to ShEx and its tools. Then Eric made good points about the benefits of using ShEx to visualize schema structures and about attracting UML/XML people to ShEx schemas to grow the community. Then Andra came back and gave a live demo of creating a sample Entity Schema.
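For a flavour of what an entity schema looks like, here is a toy ShEx schema of my own (not the one from the demo), kept in a Python string only so it can sit next to the other snippets; on Wikidata it would live as plain ShEx text in an EntitySchema page.

```python
# A minimal ShEx sketch describing a "human" item: it must be an instance of
# human (Q5), may have a date of birth, and may have any number of
# "sex or gender" values. My own toy example, not the schema from the demo.
HUMAN_SCHEMA = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

start = @<human>

<human> {
  wdt:P31  [ wd:Q5 ] ;        # instance of: human
  wdt:P569 xsd:dateTime ? ;   # date of birth (optional)
  wdt:P21  IRI * ;            # sex or gender (zero or more items)
}
"""
```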

I would say that it was the best and the most useful session of the week.

Day 3

Day 3 had a presentation about Mismatch Finder, a proposal to check for mismatches between Wikidata statements and other trusted data sources, such as the authority files of large national libraries. Users can report items they are suspicious of to this system, and it will check and report the mismatches, not immediately but after some time. I couldn't find out whether it is a heuristic system, but they were talking about using machine learning in it.

There was also a presentation about linking libraries in the Czech Republic and their authority files to Wikidata, and about the challenges and experiences involved.

Day 6

In the morning, there was a presentation about periodic Wikidata edit-a-thons in Italian libraries and what these editing sessions have contributed to populating Wikidata. In the evening, there was a session on the ontology issues of Wikidata; problems such as having an upper ontology for Wikidata and the general messiness of the Wikidata ontology were discussed.

In the next session, which dealt with the most-requested tickets on Wikidata's Phabricator, I had the chance to speak about my PhD plans on the quality of references, and I got some guidance on tracing bot activities and reference-specific properties.

Summary

The event was a good opportunity to see what the current concerns about Wikidata quality are. However, its biggest shortcoming, in my opinion, was the lack of a critical look at the quality of Wikidata. I didn't see a critical discussion from the chairs, the speakers, or the audience. It was mostly demos and tool presentations, with an atmosphere of "everything is good, we have everything, let's just populate Wikidata". But overall, it was good; learning about tools is always good.


Please share your comments with me via email (sh200 [at] hw.ac.uk) or Twitter.