Background

WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full English Wikipedia. A high-level discussion of the WikiWoods motivation and general methodology is provided by Flickinger et al. (2010). The corpus itself and the parsing results are available for download from this site (see below); the first public release is offered in two different formats, as [incr tsdb()] profiles and as plain-text export files. Please consult the related WeScience page for (emerging) instructions on how to utilize this data.

Corpus Organization

The WikiWoods Corpus is extracted from a Wikipedia snapshot of July 2008 (the version originally used for the manually treebanked WeScience sub-corpus). As of mid-2010, the corpus comprises close to 1.3 million content articles, for a total of around 55 million ‘sentences’ (or other types of root-level utterances). The corpus is available in three forms: (a) as a collection of raw articles (4.4 gigabytes compressed), prior to preprocessing; (b) as a set of preprocessed and sentence-segmented text files, including normalized wiki mark-up (2.2 gigabytes compressed); and (c) as a more recently and more thoroughly preprocessed plain-text version, using more normalized GML mark-up.

The files are organized into segments, each comprising 100 articles. Please see Flickinger et al. (2010) and Solberg (2012) for details.

First Release (1004)

As of May 2010, parsing the WikiWoods corpus is complete, and [incr tsdb()] profiles are available for download (typically, one would extract the HPSG derivation from the result relation, i.e. field #11 of the underlying tsdb(1) data files). Each archive contains [incr tsdb()] data files for about 1300 WikiWoods segments, and the files are designed to ‘plug into’ the directory structure of the so-called LOGON distribution.
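As a rough illustration of the extraction step mentioned above, the following Python sketch reads a result relation file and yields field #11 of each row. It assumes that fields are separated by '@' (the usual tsdb(1) convention) and that the file may be gzip-compressed; the function names and the example path are hypothetical, and no escape sequences inside field values are undone.

```python
import gzip
from pathlib import Path

FIELD_SEPARATOR = "@"  # assumption: tsdb(1) field delimiter
DERIVATION_FIELD = 11  # 1-based position of the derivation, per the text above


def open_relation(path):
    """Open a relation file as text, transparently handling gzip compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def derivations(path):
    """Yield the derivation field from each row of a result relation file."""
    with open_relation(path) as stream:
        for line in stream:
            fields = line.rstrip("\n").split(FIELD_SEPARATOR)
            # Skip rows that are too short to contain a derivation field.
            if len(fields) >= DERIVATION_FIELD:
                yield fields[DERIVATION_FIELD - 1]
```

With a profile unpacked locally, something like `for tree in derivations("some-profile/result.gz"): print(tree)` would list one derivation per stored analysis; the path here is a placeholder, not an actual WikiWoods file name.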

To simplify access to the derivation trees, and to readily make available other views on the HPSG analyses (as described by Flickinger et al., 2010), we also provide a set of plain text files, exported from [incr tsdb()]. As of early June 2010, export files are available for download as ten archives, each containing compressed export files for about 1300 segments. Due to technical issues in a few corner cases, some 30 segments are currently still missing from these exports.
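The figures quoted on this page are mutually consistent, which the following back-of-the-envelope Python sketch makes explicit. All inputs are the approximate numbers given in the text (about 1.3 million articles, 100 articles per segment, ten archives); the variable and function names are illustrative only.

```python
import math

ARTICLES = 1_300_000        # "close to 1.3 million" content articles
ARTICLES_PER_SEGMENT = 100  # segment size stated under Corpus Organization
ARCHIVES = 10               # number of export archives


def expected_segments(articles, per_segment):
    """Number of segments needed to hold all articles (last one may be partial)."""
    return math.ceil(articles / per_segment)


segments = expected_segments(ARTICLES, ARTICLES_PER_SEGMENT)
segments_per_archive = segments // ARCHIVES

print(segments)              # roughly 13,000 segments overall
print(segments_per_archive)  # roughly 1300 segments per archive
```

Against a total of roughly 13,000 segments, the 30 or so segments missing from the exports amount to well under one percent of the corpus.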

Subsequent Versions

For each official ERG release since April 2010, the full WikiWoods Corpus has been re-parsed, re-exported, and packaged in the same set of formats as provided for the initial release. More in-depth instructions on how to utilize this data are, sadly, still pending. The 1111 release of WikiWoods has been stable for a while. As of mid-2013, a 1212 release is available; it moves to a newer version of the underlying corpus and takes advantage of the advances in content extraction from Wikipedia and mark-up processing developed by Solberg (2012).

License Information

The original Wikipedia content, i.e. the WikiWoods Corpus (as of 2008), is licensed under the GNU Free Documentation License (Version 1.2). HPSG annotations of the raw text, i.e. the WikiWoods Treecache, are made available under the terms of the GNU Lesser General Public License (Version 3).

Acknowledgements

This work is funded in part by the University of Oslo, through its research partnership with the Center for the Study of Language and Information at Stanford University. Experimentation and engineering on the scale of Wikipedia are made possible through access to the high-performance computing facilities at the University of Oslo, and we are grateful to the Scientific Computing staff at UiO, as well as to the Norwegian Metacenter for Computational Science. Distribution of the WikiWoods data is supported by the national NorStore Storage Infrastructure and the UiO on-line Language Technology Resources collection.

Related Projects

Following is an attempt at listing related initiatives. In case you know of additional pointers that should be included, please email Stephan Oepen.

Last update: 2021-07-20 by Alexandre Rademaker