The compilation of a sample PFR Chinese corpus of Skeleton-parsed sentences



Published 2005-04-10
May Lai-Yin Wong


This paper describes the construction of a treebank using an approach inspired by skeleton parsing. A sample of some 100,000 word tokens was selected from the PFR Chinese Corpus for the treebank. The parsing scheme gives a clear account of the 17 non-terminal constituents that are defined and instantiated in the corpus texts. A set of parsing guidelines addresses the practical issues that arise in mapping parses onto sentences when the scheme is applied. The paper also discusses the major difficulties encountered in the course of skeleton parsing, as these illuminate some of the peculiarities of the Chinese language. It concludes with an evaluation of the success of the treebank compilation.

