04.22.08

The Brown Corpus Tag-set

Posted in Language Resources at 10:17 am by Ana Carrera

This page lists the tags used in the Brown Corpus in alphabetical order, with a description and examples for each tag. If a list of examples ends with an ellipsis, the tag category can be assumed to be an open class. The tag set uses ‘combined tags’ for words such as won’t and I’d. Combined tags come in only two forms: either a negated word has an asterisk appended to its tag, or a plus symbol separates the tags of the tokens that make up the combined word. This makes splitting combined tags a trivial task. AMALGAM’s version of the Brown tagger annotates with combined tags only if the tokeniser is switched off. If the tokeniser is used, combined words are split into their constituent parts and a tag is applied to each part. So won’t (MD*) becomes will (MD) plus n’t (*), and I’d (PPSS+HVD) becomes I (PPSS) plus ’d (HVD).
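The splitting of combined tags described above can be sketched in a few lines of Python. This is only an illustration of the two conventions (trailing asterisk, plus separator), not AMALGAM's own code; the function name `split_combined_tag` is made up for this example.

```python
def split_combined_tag(tag):
    """Split a Brown-style combined tag into the tags of its parts.

    Illustrative sketch only: it assumes the two conventions described
    in the text (a trailing "*" for negation, "+" between token tags).
    """
    parts = []
    for piece in tag.split("+"):
        if piece.endswith("*") and len(piece) > 1:
            # Negated form: the base tag, then "*" for the n't token.
            parts.extend([piece[:-1], "*"])
        else:
            # A plain tag, or the bare "*" tag on its own.
            parts.append(piece)
    return parts


print(split_combined_tag("MD*"))       # ['MD', '*']
print(split_combined_tag("PPSS+HVD"))  # ['PPSS', 'HVD']
print(split_combined_tag("NN"))        # ['NN']
```

Splitting the tag is the easy half; splitting the word itself (won’t into will and n’t) is the tokeniser's job and is not attempted here.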

The present set of programmes is intended to give the interested linguist the tools needed to process linguistically relevant data, most probably from an available corpus, with a high degree of automation on a personal computer. The package is divided into several groups of programmes, each performing a typical function such as lexical analysis. The main programme, Lexa, allows the user to tag and lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify which word forms are to be assigned to which lemmas. The rest is taken care of by the programme.
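To make the idea of assigning word forms to lemmas concrete, here is a minimal Python sketch. It does not reproduce Lexa's file formats or its actual procedure; the table `LEMMA_FORMS` and the functions `build_lookup` and `lemmatise` are hypothetical names chosen for this illustration.

```python
# Hypothetical user-supplied table: each lemma with the word forms
# assigned to it. This is not Lexa's real input format.
LEMMA_FORMS = {
    "be": ["am", "is", "are", "was", "were", "been", "being"],
    "have": ["has", "had", "having"],
}


def build_lookup(lemma_forms):
    """Invert the lemma -> forms table into a form -> lemma lookup."""
    lookup = {}
    for lemma, forms in lemma_forms.items():
        lookup[lemma] = lemma
        for form in forms:
            lookup[form] = lemma
    return lookup


def lemmatise(tokens, lookup):
    """Pair each token with its lemma; unlisted tokens map to themselves."""
    return [(tok, lookup.get(tok.lower(), tok.lower())) for tok in tokens]


lookup = build_lookup(LEMMA_FORMS)
print(lemmatise(["She", "was", "having", "lunch"], lookup))
# [('She', 'she'), ('was', 'be'), ('having', 'have'), ('lunch', 'lunch')]
```

Once the table is written, the lookup itself is mechanical, which is the sense in which the rest can be left to the programme.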

It is assumed that the user is acquainted with the basics of computer hardware and software and has at least some experience with word processing, if not with database management. Users without this background are strongly advised to acquire it before embarking on linguistic data processing.

Lexa is the main programme of the current set. Its main built-in procedures allow one to lemmatise any input ASCII text automatically, to create frequency lists of the types and tokens occurring in any loaded text, to generate lexical density tables, and to transfer textual data in a user-defined manner to a database environment. The results of all operations are stored as files and can be examined later.
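A frequency list of types and tokens, as mentioned above, can be approximated with Python's standard library. This is a rough sketch and not Lexa's method: the tokenisation rule (lower-cased runs of letters and apostrophes) is an assumption made for this example, and `frequency_list` is a hypothetical name.

```python
import re
from collections import Counter


def frequency_list(text):
    """Count how often each type (distinct word form) occurs in text.

    Assumption: a token is a lower-cased run of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


text = "The cat saw the dog and the dog saw the cat"
freq = frequency_list(text)

n_tokens = sum(freq.values())  # running words
n_types = len(freq)            # distinct word forms

for word, count in freq.most_common():
    print(f"{word}\t{count}")
print(f"tokens: {n_tokens}, types: {n_types}")  # tokens: 11, types: 5
```

A different tokenisation rule would give different counts, so the figures from any such list should be read together with the rule that produced them.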

 

Article based on the page AMALGAM.

Further bibliography: ICAME-LEXA: Corpus Processing Software
