Pittsburgh, USA — On 19 March, Simon and Sonja presented recent work at the University of Pittsburgh’s Hub for AI and Data Science Leadership (HAIL) seminar series. Their talk, titled “From Scarcity to Signal – Combining LLMs, Synthetic Data and an Active Learning Framework for Rare Event Detection,” explored new approaches to building structured event data from text when the events of interest are rare and difficult to identify.
The talk presented a semi-automated pipeline for constructing an event dataset from newspaper articles, using attacks on education as a rare and underreported event type. The approach combines LLM-based event extraction with an active learning framework that helps prioritize the most informative samples for human annotation. To address extreme class imbalance, the method also incorporates synthetic data to augment the rare event classes, which makes more efficient use of limited labeling resources and improves model performance.
Together, these components support a reliable and scalable way of producing structured event data, while keeping human expertise central to the process.
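For readers unfamiliar with active learning, the prioritization step can be illustrated with a minimal sketch of uncertainty sampling, one common selection strategy. This is not the presenters' implementation: the function names, article ids, and probabilities below are hypothetical, and the talk summary does not say which selection criterion the pipeline actually uses.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` articles the model is least sure about.

    `predictions` maps a hypothetical article id to predicted class
    probabilities, here [P(no event), P(attack on education)].
    """
    ranked = sorted(
        predictions,
        key=lambda article_id: entropy(predictions[article_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Illustrative values only, not results from the project.
predictions = {
    "article_a": [0.98, 0.02],
    "article_b": [0.55, 0.45],
    "article_c": [0.10, 0.90],
    "article_d": [0.70, 0.30],
}

queue = select_for_annotation(predictions, budget=2)
print(queue)  # ['article_b', 'article_d']
```

Articles whose predicted probabilities are close to even are sent to human annotators first, since labeling them is expected to tell the model more than labeling cases it already classifies confidently.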
Their work is part of the EdAttack project led by Gudrun Østby at PRIO.
