Penn Treebank III 3 LDC99T42
Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz and Ann Taylor

LDC99T42_Penn_Treebank_3.tar.zst 29.83MB
Type: Dataset
Tags: Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB

Bibtex:
@article{,
title= {Penn Treebank III 3 LDC99T42},
journal= {},
author= {Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz and Ann Taylor},
year= {1999},
isbn= {1-58563-163-9},
islrn= {141-282-691-413-2},
dcmi= {text},
language= {english},
doi= {10.35111/gq1x-j780},
url= {https://doi.org/10.35111/gq1x-j780},
abstract= {# Penn Treebank III

## Metadata

- _Item Name:_ Treebank-3
- _Author(s):_ Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
- _LDC Catalog No.:_ LDC99T42
- _ISBN:_ 1-58563-163-9
- _ISLRN:_ 141-282-691-413-2
- _DOI:_ [https://doi.org/10.35111/gq1x-j780](https://doi.org/10.35111/gq1x-j780)
- _Member Year(s):_ 1999
- _DCMI Type(s):_ Text
- _Data Source(s):_ telephone speech, newswire, microphone speech, transcribed speech, varied
- _Project(s):_ TIDES, GALE
- _Application(s):_ parsing, natural language processing, tagging
- _Language(s):_ English
- _Language ID(s):_ eng
- _License(s):_ [LDC User Agreement for Non-Members](https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf)
- _Online Documentation:_ [LDC99T42 Documents](https://catalog.ldc.upenn.edu/docs/LDC99T42/)
- _Citation:_ Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.

## Introduction

This release contains the following [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) material:

-   One million words of 1989 Wall Street Journal material annotated in Treebank II style.
-   A small sample of ATIS-3 material annotated in Treebank II style.
-   A fully tagged version of the Brown Corpus.

and the following new material:

-   Switchboard tagged, dysfluency-annotated, and parsed text
-   Brown parsed text

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
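
As a rough illustration of how the bracketed annotation described above can be read programmatically, the following is a minimal sketch and not part of the LDC release. It assumes trees are stored as parenthesized labels and words, with an unlabeled outer bracket around each sentence, and that literal parentheses in the text are escaped (for example as -LRB- and -RRB-). The function names `parse_tree` and `leaves` and the sample sentence are hypothetical, made up for this example rather than taken from the corpus.

```python
import re

def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Tokens are "(", ")", or any run of characters without spaces/parens.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        # The outer wrapper bracket carries no label (assumption of this sketch).
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()

def leaves(tree):
    """Yield (tag, word) pairs, skipping empty elements tagged -NONE-."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            if label != "-NONE-":
                yield (label, child)
        else:
            yield from leaves(child)

# Hypothetical sentence in Treebank II style, not copied from the corpus.
sample = "( (S (NP-SBJ (DT The) (NN market)) (VP (VBD fell)) (. .)) )"
tree = parse_tree(sample)
tagged = list(leaves(tree))
print(tagged)
```

Function tags such as `-SBJ` stay attached to the constituent label here; a predicate/argument extractor would typically split them off as a separate step.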

## Data

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)) and Treebank-3 ([LDC99T42](https://catalog.ldc.upenn.edu/LDC99T42)) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB\_tipster\_wsj\_map.tar.gz) as an additional download for users who have licensed Treebank-2.

## Samples

Please view the following samples:

-   [Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.pos.txt)
-   [Dysfluency Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dff.txt)
-   [Dysfluency Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mgd.txt)
-   [Dysfluency Annotation, Part-of-Speech Tags & Turns Joined](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dps.txt)
-   [Syntactic Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.prd.txt)
-   [Syntactic Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mrg.txt)

## Updates

After publication, it was discovered that not all of the postscript (\*.ps) files had been converted to pdfs and that some of the converted pdfs contained errors. For pdf copies of the documentation files, please go to [addenda](https://catalog.ldc.upenn.edu/desc/addenda/LDC1999T42) for a list of the files available.

As of October 5, 2016, 252 wsj files from [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) that were previously missing were added.

As of February, 2017, 2,499 "raw" wsj files were added from Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)).

Corpus downloads after these dates will include these missing files.},
keywords= {Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB},
terms= {},
license= {},
superseded= {}
}
