04.29.08
Time Corpus (Mark Davies)
This website, created by Mark Davies, is very useful for finding words in American English from 1923 to the present, simply by choosing the date and the word to look up. The corpus contains more than 100 million words taken from TIME magazine.
One of its advantages is that we can see how words and phrases have increased and decreased in usage, and how their meaning has changed over time, by looking at changes in collocates (co-occurring words). We can also have the corpus generate a list of words that were used more in one period than in another, even when we don’t know in advance what those words might be.
To look up a word, we can choose the type of display:
- CHART: This option presents “bar charts” that indicate the overall frequency of all matching words or phrases in each section of the corpus. This is probably the best option for comparing different genres or time blocks.
- LIST: With this option, we see a listing of each individual word or string that matches the query.
- COMPARE WORDS: This allows us to compare the collocates (nearby words) of two different words. When selected, we will see the frequency of each matching string for the following nine groupings: [genres] spoken, fiction, magazines, newspapers, academic; [time blocks] 1990-1994, 1995-1999, 2000-2004, 2005-2007. When it is not selected, we will see the overall frequency in the entire corpus.
In the search string:
- We enter the basic search string (words). We can also enter “context” words and indicate how many words away from the search string they may be. We can use parts of speech as part of our query. For example, [j*] eyes in [1] would find a two-word string composed of a form of eyes immediately preceded by an adjective.
- We can also create “User lists” or “customized lists”: lists of words relating to a certain topic, words that are grammatically related, or any other listing.
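The part-of-speech query described above ([j*] eyes) can be imitated offline on a small tagged sample. The following is only a minimal sketch, not the way the TIME Corpus itself works: the sample sentence, its tags and the function name `adjective_before` are all invented for illustration, and the sketch matches the exact word rather than every form of it.

```python
from collections import Counter

# Hypothetical hand-tagged sample of (word, tag) pairs. Tags starting with "j"
# stand for adjectives, mirroring the [j*] wildcard in the query.
tagged = [
    ("her", "appge"), ("blue", "jj"), ("eyes", "nn2"), ("met", "vvd"),
    ("his", "appge"), ("tired", "jj"), ("eyes", "nn2"), ("and", "cc"),
    ("the", "at"), ("eyes", "nn2"), ("of", "io"), ("blue", "jj"), ("eyes", "nn2"),
]

def adjective_before(tagged_words, target):
    """Count two-word strings in which `target` is immediately preceded by an adjective."""
    hits = Counter()
    # Pair every word with the word that follows it.
    for (prev_word, prev_tag), (word, _tag) in zip(tagged_words, tagged_words[1:]):
        if word.lower() == target and prev_tag.startswith("j"):
            hits[f"{prev_word.lower()} {word.lower()}"] += 1
    return hits

matches = adjective_before(tagged, "eyes")
for phrase, freq in matches.most_common():
    print(phrase, freq)
```

On this invented sample the sketch finds "blue eyes" twice and "tired eyes" once, and ignores "the eyes" because "the" is not tagged as an adjective.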
In the section:
- We can choose the date, or the dates to compare, and the minimum frequency.
Now that we have seen all the search options and given a short introduction to how to use them, we can try them out and see all the possibilities this program offers, which are very useful for any kind of research or statistics.
04.22.08
CLUVI
CLUVI is the Corpus Lingüístico da Universidade de Vigo.
It is a collection of parallel text corpora covering several registers, specialized in the contemporary Galician language. It was created by SLI (Seminario de Lingüística Informática) in 2003. It contains 22 million words, and its main components are the TECTRA corpus of English-Galician literary texts, the FEGA corpus of French-Galician literary texts, the LEGA corpus of Galician-Spanish legal and administrative texts, the English-Galician-French-Spanish UNESCO corpus of popular science, the LOGALIZA corpus of English-Galician software localization and the Spanish-Galician-Catalan-Basque CONSUMER corpus of consumer information.
The texts can be consulted publicly through a Spanish interface available at http://sli.uvigo.es/CLUVI/. It allows simple or complex searches for words or expressions, and shows the multilingual equivalents of the searched words in their contexts of use in real, documented translations.
The number of works available on the page and the number of languages in which they are available grow regularly, as the CLUVI research project is still under way.
Moreover, the Corpus Paralelo CLUVI allows us to search corpora other than TECTRA, FEGA, LEGA, LOGALIZA and UNESCO. It should also be said that through the CLUVI interface we can access the TURIGAL corpus of Portuguese-English tourism texts and the LEGEBIDUN corpus of Basque-Spanish legal and administrative texts, developed by the DELi group of Deusto University.
Bibliography: CLUVI
The Brown Corpus Tag-set
This page gives an alphabetical list of the tags used in the Brown corpus, with a description and examples for each tag. If the list of examples ends with an ellipsis marker, the tag category can be assumed to be an open class. The tag-set uses ‘combined tags’ for words such as won’t and I’d. These come in only two forms: either negated words have an asterisk appended after their tag, or a plus symbol separates the tags for the different tokens that make up the complete combined word. This makes it a trivial task to split combined tags. AMALGAM’s version of the Brown tagger annotates with combined tags only if the tokeniser is switched off. If the tokeniser is used, the combined words are split into their constituent parts and a tag is applied to each part. So won’t (MD*) becomes will (MD) plus n’t (*), and I’d (PPSS+HVD) becomes I (PPSS) plus ’d (HVD).
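To show why splitting combined tags is a simple task, here is a minimal sketch in Python. It is not AMALGAM’s own code: the function name `split_combined_tag` is invented, and the sketch assumes only the two conventions described above (a trailing asterisk for negation and a plus symbol between token tags).

```python
def split_combined_tag(tag):
    """Split a Brown-style combined tag into the tags of its constituent tokens."""
    parts = []
    # A plus symbol separates the tags of the tokens in a combined word.
    for piece in tag.split("+"):
        # A trailing asterisk marks a negated word: tag of the word, then "*".
        # A bare "*" is already a complete tag and is kept as it is.
        if piece.endswith("*") and len(piece) > 1:
            parts.extend([piece[:-1], "*"])
        else:
            parts.append(piece)
    return parts

print(split_combined_tag("MD*"))       # ['MD', '*']
print(split_combined_tag("PPSS+HVD"))  # ['PPSS', 'HVD']
print(split_combined_tag("NN"))        # ['NN']
```

The first two calls reproduce the won’t and I’d examples; a simple tag such as NN comes back unchanged.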
The intention behind the Lexa set of programmes is to put at the disposal of the interested linguist the tools he or she would require in order to process linguistically relevant data, most probably from an available corpus, with a high degree of automation on a personal computer. The package is divided into several groups which perform typical functions, such as lexical analysis. The main programme, Lexa, allows the user to tag and lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify which possible words are to be assigned to which lemmas. The rest is taken care of by the programme.
It is assumed that the user is acquainted with the basics of computer hardware and software and has at least some experience with word processing, if not with database management. Users who lack these basic notions are strongly advised to acquire the necessary background knowledge in these areas before embarking on linguistic data processing.
Lexa is the main programme of the current set. It allows one to automatically lemmatise any input ASCII texts, to create frequency lists of the types and tokens occurring in any loaded text, to generate lexical density tables, to transfer textual data in a user-defined manner to a database environment, to mention just the main procedures which are built into Lexa. The results of all operations are stored as files and can be examined later.
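To make the idea of a frequency list of types and tokens concrete, here is a minimal Python sketch. It does not reproduce Lexa’s own procedure: the function name `frequency_list`, the very simple tokenisation rule and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter

def frequency_list(text):
    """Count how often each word form (type) occurs among the tokens of a text."""
    # Assumption: a token is any run of letters or apostrophes, lower-cased.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = "The cat sat on the mat. The mat was flat."
freq = frequency_list(sample)

n_tokens = sum(freq.values())  # running words in the text
n_types = len(freq)            # distinct word forms

for word, count in freq.most_common():
    print(word, count)
print("tokens:", n_tokens, "types:", n_types)
```

For this sample the sketch reports 10 tokens and 7 types, with "the" as the most frequent form (3 occurrences).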
Article based on the page AMALGAM.
Further bibliography: ICAME-LEXA: Corpus Processing Software
04.01.08
2nd short LANGUAGE RESOURCE REVIEW
CREA (RAE)
REAL ACADEMIA ESPAÑOLA: Banco de datos (CREA) [en línea]. Corpus de referencia del español actual. <http://www.rae.es> [1 de abril del 2008]
CREA is a reference corpus of present-day Spanish. It is used to show how words behave and how they appear in texts. It is a sophisticated tool that professionals use to show, for example, what precedes or follows a common noun or an article. Nevertheless, it can be accessed by anyone who has a computer. It also provides contexts for each word, which can give the reader an idea of the meaning of that word, its concordance and its position within the text.
The query system has three main windows.
The first and most important one lets users set their search preferences. It is called “El perfil de consulta” (the query profile). It has several fields in which you enter your query and other selection criteria such as author, work, chronological and geographical range, subject or medium.
The second window offers statistical results for the search: how many cases there are and in how many documents they appear. It also has some filters, and you can see what they consist of. To obtain examples, there is also a panel in which you can narrow down and refine the list of examples; that is, it has filters such as classification, grouping and tagging, and lets you choose how to obtain the examples: as documents, concordances, and so on.
In the third window we see the same panel for refining the list of examples (in case it is very long) and the list of examples, classified in the way that was chosen. In the examples, the searched word always appears in the middle and in a different colour.
From my point of view, this list and the search options of the page are very useful to the user. Moreover, the results show not only the sentences containing the searched word, but also the kind of medium in which it appears, the name of the author, the year, the country, the title of the document and where it was published (in case anyone is interested in looking it up).
I tried searching for the noun corpus, and here are some of the examples from the list that appeared:
Nº CONCORDANCIA- AÑO- AUTOR- TÍTULO- PAÍS- TEMA- PUBLICACIÓN
1.- …gados de la CEAR presentaron un recurso de habeas corpus, que no fue aceptado. ** 2001- PRENSA- El Diario Vasco, 11/01/2001 : Liberados en Irlanda los poliz ESPAÑA – 03.Protección civil- Sociedad Vascongada de Publicaciones (San Sebastián), 2001
2.- ...ienso en que el editor de Fonollosa escogió de un corpus poético inmenso, escrito a lo largo de toda su ** 1994- PRENSA- La Vanguardia, 30/09/1994 : Para salir de la clandestinidad ESPAÑA- 02.Literatura -T.I.S.A (Barcelona), 1994
3.- …todas las vueltas y vaivenes de recursos y habeas corpus, muertes reiteradas. Personaje sin desperdicio ** 1994- PRENSA- La Vanguardia, 30/09/1994 : EL TÚNEL DEL INFIERNO ESPAÑA- 02.Literatura- T.I.S.A (Barcelona), 1994
4.- ...bra tiene una voluntad de convertirse en obligado corpus informativo del especialista en el tema, más q ** 1994- PRENSA- La Vanguardia, 14/01/1994 : Historias enciclópedicas para ti ESPAÑA- 02.Historia- T.I.S.A (Barcelona), 1994