Background
LTG staff actively participate in the informal, multi-national Deep Linguistic Processing with HPSG Initiative (DELPH-IN). At the annual DELPH-IN Summit (i.e. a gathering of the clique), partners often give overviews of (more or less) relevant developments at individual sites. This page is intended to develop into a stream of LTG updates related to DELPH-IN.
2017 Site Update
Continuous, if sometimes low-energy, activities: ERG Semantic Documentation; MRS-derived dependencies (in 2017 so far, SDP is about as popular as the PCEDT; four papers and one keynote at ACL use ERG-based dependencies); 1214 WikiWoods release imminent, with a new, ‘atomic’ export format.
Ongoing doctoral project (Murhaf Fares): Joint learning for the identification, bracketing, and thematic interpretation of nominal compounds. Comparison of bracketing in DeepBank, PCEDT, and PTB.
Completed MSc project (Kjetil Bugge Kristoffersen): extraction of a ‘high-quality’ corpus from Common Crawl, yielding 130 billion tokens of English.
Sabbatical Syndrome Recovery: the SynSem umbrella will bring together DELPH-IN and ParGram folks with other NLP researchers in the academic year 2017–18 (as well as sponsor the 2017 DELPH-IN Summit).
New initiative: Nordic Language Processing Laboratory (NLPL). Among other things: repository of very large corpora and re-usable word embeddings; on-line explorer for semantic similarity and analogy; generic infrastructure for Extrinsic Parser Evaluation (EPE).
2016 Site Update
Currently there is not much funded project activity with a near-exclusive focus on DELPH-IN technologies; the Linguistic Analysis Portal (LAP) is now (almost) in production; the ERG parsing stack is to be integrated in the next twelve months.
Angelina Ivanova has defended her doctoral thesis, which contains some encouraging results for grammar-based parsing; an in-depth summary of the parser comparison was recently published in the Journal of Language Modelling.
The Semantic Dependency Parsing tasks at SemEval 2014 and 2015 attracted some 15 teams, giving emerging visibility to ERG-derived (bi-lexical) semantic dependencies. All target representations are now available via the LDC; an open-source subset is available for public download.
The WeSearch services were re-worked recently; they now feature client-side visualization (built on generic JavaScript packages) and semantic dependencies. The code is available under an open-source license.
New, work-in-progress RESTful interface to the ERG on-line parser; see the ErgApi page.
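As a minimal sketch of how this interface can be exercised from Python (using the requests library; the endpoint URL, parameter names, and response fields follow the ErgApi page and may change while the interface is work in progress):

    import requests

    # Query the work-in-progress RESTful interface to the ERG on-line
    # parser; endpoint and parameters as documented on the ErgApi page.
    response = requests.get(
        'http://erg.delph-in.net/rest/0.9/parse',
        params={'input': 'The cat sat on the mat.', 'results': 1},
        headers={'Accept': 'application/json'})
    for result in response.json().get('results', []):
        # field names, e.g. 'derivation', per the ErgApi documentation
        print(result.get('derivation'))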
ERG 1214 release, finally official (as of June 15, 2016).
Ongoing work on generation from Elementary Dependency Structures (later on the programme).
2014 Site Update
The WeSearch project (on methods for parser adaptation to user-generated content) is wrapping up at the end of this year: Angelina is writing up her thesis; Bec will be moving on (and will present on domain variation later in the week).
LAP is about six months behind schedule; there is a working prototype, but it is not yet publicly accessible, and DELPH-IN technologies remain to be integrated. Preparing to contribute to the ToE initiative this fall: European Parliament proceedings with meta-information and (linguistic) annotation, all coded in RDF.
[to be completed]
Angelina
Milen Kouylekov
Arne Skjærholt
Norveig Eskelund
NeIC
2013 Site Update
Two funded projects currently use and extend DELPH-IN technologies: WeSearch (on methods for parser adaptation to user-generated content) and LAP (the Language Analysis Portal, part of the Norwegian CLARIN(O) initiative).
Work in WeSearch by AngelinaIvanova (on relating bi-lexical dependency representations to DELPH-IN HPSG analyses), by RebeccaDridan (on, among other things, ubertagging for faster and more accurate parsing), and by StephanOepen and off-site collaborators (on working towards a documentation of ERG Semantic Analyses) is presented individually at the 2013 Summit.
Another WeSearch activity has been collaborative work with DanFlickinger on enabling the ERG to analyse inputs optionally annotated with two types of constraints: candidate phrase boundaries, or candidate target bi-lexical dependencies. The following example inputs (using GML mark-up; see below) exemplify this functionality; the first two mark a candidate phrase boundary, the last two a candidate bi-lexical dependency (the final one with an explicit character span):
She met the ⌊(⌋cat in the hotel.⌊)⌋
She met the ⌊(⌋cat in the hotel⌊)⌋.
the cat saw⌊←¦sb-hd⌋ runs.
the cat saw⌊←¦sb-hd¦<29:34>⌋ runs.
This functionality is not included in the 1212 release of the ERG but is currently coming together in the ERG trunk; in a first instance, it will be validated in in-house projects at LTG.
In the LAP context, there is now a live pilot portal providing Web access to pre-configured tokenization, PoS tagging, and syntactic dependency parsing tools for English and Norwegian (running on a Norwegian national supercomputer, i.e. potentially making high-performance computing capabilities available to non-technical users). The LAP architecture is based on the LAF (Linguistic Annotation Framework) data model, but uses a distributed NoSQL database as the annotation store, where components record and retrieve annotations from earlier components in complex workflows (see the sketch below). In the year to come, it is expected that the core DELPH-IN toolchain will be made available through the LAP.
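To make the annotation store design concrete, the following is a hypothetical sketch (not actual LAP code; the database, collection, and field names are all invented) of how one component might record annotations and a later component retrieve them, assuming a MongoDB-style document store:

    from pymongo import MongoClient

    # Hypothetical sketch of an LAF-style annotation store: each component
    # writes its annotations as stand-off records over the primary text, so
    # that later components in a workflow can retrieve them. All names here
    # (database, collection, fields) are invented for illustration.
    client = MongoClient('localhost', 27017)
    annotations = client['lap']['annotations']

    # A tokenizer records one token annotation for document 42 ...
    annotations.insert_one({
        'document': 42, 'producer': 'tokenizer', 'type': 'token',
        'start': 0, 'end': 3, 'form': 'The',
    })

    # ... and a downstream PoS tagger retrieves all tokens as its input.
    for token in annotations.find({'document': 42, 'type': 'token'}):
        print(token['form'], token['start'], token['end'])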
Finally, two recent MSc projects at LTG have contributed to DELPH-IN advancement. Solberg (2012) develops a generic infrastructure for extracting ‘relevant linguistic content’ from Wikipedia dumps, preparing a forthcoming revision 2.0 of the WikiWoods Corpus and seeking to address some of the shortcomings of the (somewhat ad-hoc and overly surface-oriented) original procedure used in the construction of WikiWoods 1.0. A side-effect of this project is an updated definition of the Grammar Markup Language (GML), an attempt at allowing non-intrusive in-line mark-up of layout information that may be relevant to parsing.
In another recently completed MSc thesis, Fares (2013) applies machine learning (binary, CRF-based classification) to the tasks of tokenization (i.e. deciding on token boundaries, in either the PTB-like initial tokenization scheme or the ERG-defined lexical scheme) and lexical categorization (i.e. assigning morpho-syntactic categories to lexical tokens). In this work, he advances the state of the art in PTB-style tokenization, achieves nearly 95% sentence accuracy for lexical tokenization, and about 93% token-level accuracy in assigning ERG lexical types. Putting these together, i.e. parsing inputs with disambiguated ERG tokenization and annotated with lexical types (selectively, i.e. only constraining the parser when lexical categorization was above an experimentally set confidence threshold), yields improvements in parsing efficiency by factors of two to three, with mildly increased coverage (due to fewer time-outs) and moderately better parse selection accuracy (due to the reduced search space). End-to-end parsing results for these experiments are presented in the slides from his MSc presentation.
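To make the classification set-up concrete, here is a toy sketch of CRF-based token boundary detection, where each character is labelled as either starting a new token or continuing the current one; the sklearn_crfsuite package and the features shown are used purely for illustration and are not the toolkit or feature model of the thesis:

    import sklearn_crfsuite

    # Toy illustration (not the set-up of the thesis): label each character
    # as 'B' (begins a new token) or 'I' (continues the current one).
    def features(text, i):
        return {
            'char': text[i],
            'prev': text[i - 1] if i > 0 else '<s>',
            'next': text[i + 1] if i < len(text) - 1 else '</s>',
            'is_punct': not text[i].isalnum() and not text[i].isspace(),
        }

    # One training sentence; PTB-style gold tokens are Do|n't|stop|.
    # (for simplicity, the space character is labelled 'I' here).
    text = "Don't stop."
    labels = ['B', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'B']
    X = [[features(text, i) for i in range(len(text))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))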
Last update: 2017-08-09 by StephanOepen