01.16.10
The Truth
“Truth is not an anaesthetic, it is an analgesic. It is the lie that is anaesthetic; it leaves you in a daze. Truth goes to the heart of the problem.” (Risto Mejide)
01.12.10
Freedom
“One seeks freedom only when one feels like a prisoner” (Book: Ho voglia di te)
02.22.09
Vermeer in Stamps
In the following links we can find a little more about Vermeer’s life, as well as some of his most famous paintings reproduced on stamps from different parts of the world:
06.02.08
Michigan Corpus of Academic Spoken English
The Michigan Corpus of Academic Spoken English is a research project of the English Language Institute (ELI) at the University of Michigan. Its main aim was to answer these and other questions:
· What are the characteristics of contemporary academic speech—its grammar, its vocabulary, its functions and purposes, its fluencies and dysfluencies?
· Are these characteristics different for different academic disciplines and for different classes of speakers?
The goal of the first phase of the project was to record and transcribe close to 200 hours (approximately 1.8 million words) of academic speech from across the university. Currently, 152 transcripts (totaling 1,848,364 words) are available at the site. The digital sound recordings were transcribed with the help of a computer program called SoundScriber.
The entire corpus is available at micase.umdl.umich.edu, since it was planned as an easily available “open” project. Its search engine is notable for the large number of speaker and speech-event categories that can be selected. The ELI committed resources to MICASE for several reasons:
-There was originally no database of this kind available.
-MICASE provides authentic material in sufficient quantity to redefine our concepts of academic speech, because it reveals many divergences from what current grammar and vocabulary books describe.
-There is the hope that people will be able to track changes in speech patterns as speakers gain experience of university culture.
-With all this new information, people will be in a better position to develop more appropriate ESL and English for Academic Purposes teaching and testing materials, and to evaluate how best to incorporate them.
How to use it:
We can choose between two options:
-Browse MICASE: This option is used to browse the corpus according to specified speaker and speech attributes, returning quick file references.
We choose the criteria using the menus and then click the button to see the transcripts that fit them. The available speaker attributes are academic position/role, native speaker status and first language. The transcript attributes are speech event type, academic division, academic discipline, participant level and interactivity rating.
-Search MICASE: This option is used to search the corpus for words or phrases in specified contexts, returning concordance results with references to files, full utterances, and speakers.
We enter the exact word or phrase we wish to find in the box. The wildcard character * may be used at the end (but not the beginning) of a search word or phrase to represent zero or more characters (e.g. typing walk* will return walk, walks, walked, and walking). To search the entire corpus, we keep the default settings for the speaker and transcript attributes. For a more specific search, we choose the speaker-level and transcript-level criteria using the menus on the right. When we click the button, the search finds utterances by speakers that fit the speaker-level criteria within transcripts that fit the transcript-level criteria. The speaker attributes are those found in “Browse MICASE” plus age and gender; the transcript attributes are the same as in “Browse MICASE”.
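The trailing-wildcard rule described above can be illustrated with a small Python sketch. This is not MICASE's own implementation; the function names and the sample utterances are hypothetical, and the sketch only assumes that * stands for zero or more word characters at the end of the query.

```python
import re

def wildcard_to_regex(query):
    """Compile a query with an optional trailing * into a whole-word regex."""
    # Assumption: as in the description, * is only allowed at the very end.
    if "*" in query[:-1]:
        raise ValueError("the wildcard is only allowed at the end of the query")
    if query.endswith("*"):
        body = re.escape(query[:-1]) + r"\w*"   # zero or more extra characters
    else:
        body = re.escape(query)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def search(utterances, query):
    """Return the (speaker, text) pairs whose text matches the query."""
    pattern = wildcard_to_regex(query)
    return [(speaker, text) for speaker, text in utterances if pattern.search(text)]

# Hypothetical utterances, invented for illustration only.
sample = [
    ("S1", "we walked to the lab"),
    ("S2", "the sidewalk was icy"),
    ("S3", "Walking helps"),
]
print(search(sample, "walk*"))
```

Because the pattern is anchored at a word boundary, walk* finds walked and Walking but not sidewalk, which mirrors the rule that the wildcard cannot appear at the beginning.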
Information taken from the main page of the “Michigan Corpus of Academic Spoken English”.
04.29.08
Time Corpus (Mark Davies)
This website, created by Mark Davies, is very useful for finding words of American English from 1923 to the present, just by choosing the date and the word to look up. The corpus has more than 100 million words, all taken from TIME magazine.
One of its advantages is that we can see how words and phrases have increased and decreased in usage, and how they have changed their meaning over time, by looking at changes in collocates (co-occurring words). We can also have the corpus generate a list of words that were used more in one period than in another, even when we don’t know what the specific words might be.
To look up a word, we can choose the type of display:
- CHART: This option presents “bar charts” that indicate the overall frequency for all matching words or phrases in each section of the corpus. This is probably the best option for comparing between different genres, or to compare time blocks.
- LIST: With this option, we see a listing of each individual word or string that matches the query.
- COMPARE WORDS: This allows us to compare the collocates (nearby words) for two different words. When selected, we will see the frequency of each matching string for the following nine groupings: [genres] spoken, fiction, magazines, newspapers, academic; [time blocks] 1990-1994, 1995-1999, 2000-2004, 2005-2007. When it is not selected, we will see the overall frequency in the entire corpus.
In the search string:
- We enter the basic search string (words). We can also enter “context” words and indicate how many words away they may be. We can use parts of speech as part of our query. For example, [j*] eyes in [1] would find a two-word string composed of a form of eyes immediately preceded by an adjective.
- We can also create “User lists” or “customized lists”, relating to a certain topic, words that are grammatically related, or any other listing.
In the section:
- We can choose the date, or the dates to compare, and the minimum frequency.
Now that we have seen all the search options and given a short introduction to how to use them, we can try out all the possibilities this program offers, which is very useful for any kind of research or statistics.
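To make the part-of-speech query mentioned above ([j*] eyes) more concrete, here is a minimal Python sketch of the same idea applied to a list of already tagged tokens. It is not the corpus engine itself: the function name, the tag labels and the sample tokens are hypothetical, and the sketch matches only the exact word rather than every form of it.

```python
def find_tag_word_pairs(tagged_tokens, tag_prefix, word):
    """Find two-word strings where `word` directly follows a token
    whose part-of-speech tag starts with `tag_prefix`."""
    hits = []
    # Walk over each pair of adjacent tokens.
    for (prev_word, prev_tag), (cur_word, _) in zip(tagged_tokens, tagged_tokens[1:]):
        if prev_tag.startswith(tag_prefix) and cur_word.lower() == word:
            hits.append(f"{prev_word} {cur_word}")
    return hits

# Hypothetical tagged text: (word, tag) pairs with invented tag labels,
# where tags beginning with "j" stand for adjectives.
tokens = [
    ("her", "appge"), ("blue", "jj"), ("eyes", "nn2"), ("and", "cc"),
    ("his", "appge"), ("eyes", "nn2"), ("tired", "jj"), ("eyes", "nn2"),
]
print(find_tag_word_pairs(tokens, "j", "eyes"))
```

Only the pairs in which an adjective comes immediately before eyes are returned; his eyes is skipped because the preceding tag does not start with j.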
04.22.08
CLUVI
CLUVI is the Corpus Lingüístico da Universidade de Vigo.
It is a collection of parallel text corpora covering specialized registers of the contemporary Galician language. It was created by the SLI (Seminario de Lingüística Informática) in 2003. It contains 22 million words, and its main components are the TECTRA corpus of English-Galician literary texts, the FEGA corpus of French-Galician literary texts, the LEGA corpus of Galician-Spanish legal and administrative texts, the English-Galician-French-Spanish UNESCO corpus of scientific popularization, the LOGALIZA corpus of English-Galician software localization and the Spanish-Galician-Catalan-Basque CONSUMER corpus of consumer information.
The texts can be consulted publicly through a Spanish-language interface available at http://sli.uvigo.es/CLUVI/. It allows simple or complex searches for words or expressions, and shows the multilingual equivalents of the searched words in their contexts of use in real or documented translations.
The number of works available on the page and the number of languages in which they are available grow regularly, as the CLUVI research project is still ongoing.
Moreover, the Corpus Paralelo CLUVI allows us to search corpora other than TECTRA, FEGA, LEGA, LOGALIZA and UNESCO. Through the CLUVI interface we can also access the TURIGAL corpus of Portuguese-English tourism texts and the LEGEBIDUN corpus of Basque-Spanish legal and administrative texts, developed by the DELi group of Deusto University.
Bibliography: CLUVI
The Brown Corpus Tag-set
On this page we can see a list of the tags used in the Brown corpus, ordered alphabetically. It also includes a description and examples of each tag. If the list of examples ends with an ellipsis marker, the tag category can be assumed to be an open class. The tag-set uses ‘combined tags’ for words such as won’t and I’d. Combined tags come in only two forms: either negated words have an asterisk appended after their tag, or a plus symbol separates the tags for the different tokens that make up the complete combined word. This makes it a trivial task to split combined tags. AMALGAM’s version of the Brown tagger annotates with combined tags only if the tokeniser is switched off. If the tokeniser is used, the combined words are split into their constituent parts and a tag is applied to each part. So, won’t (MD*) becomes will (MD) plus n’t (*) and I’d (PPSS+HVD) becomes I (PPSS) plus ‘d (HVD).
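The two forms of combined tag described above can be split with very little code. The following Python sketch is only an illustration, not AMALGAM's tokeniser: the function name and the small table of irregular negatives are hypothetical, and it assumes plain ASCII apostrophes in the input words.

```python
# Hypothetical lookup for negatives whose base form is not just "word minus n't".
IRREGULAR_NEGATIVES = {"won't": "will", "can't": "can", "shan't": "shall"}

def split_combined(word, tag):
    """Split a word with a combined Brown tag into (token, tag) pairs."""
    # Form 1: a negated word, marked by an asterisk after the tag (e.g. MD*).
    if tag.endswith("*") and len(tag) > 1 and word.lower().endswith("n't"):
        base = IRREGULAR_NEGATIVES.get(word.lower(), word[:-3])
        return [(base, tag[:-1]), ("n't", "*")]
    # Form 2: a plus sign separates the tags of the two tokens (e.g. PPSS+HVD).
    if "+" in tag and "'" in word:
        head, _, tail = word.partition("'")
        first_tag, second_tag = tag.split("+", 1)
        return [(head, first_tag), ("'" + tail, second_tag)]
    # Anything else is an ordinary single token.
    return [(word, tag)]

print(split_combined("won't", "MD*"))
print(split_combined("I'd", "PPSS+HVD"))
```

The two calls reproduce the examples from the text: won't yields will (MD) plus n't (*), and I'd yields I (PPSS) plus 'd (HVD).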
The intention behind the present set of programmes is to put at the disposal of the interested linguist the tools he or she would require in order to process linguistically relevant data, most probably from an available corpus, with a high degree of automation on a personal computer. The package is divided into several groups which perform typical functions, like lexical analysis. The main programme, Lexa, allows one to tag and lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify which words are to be assigned to which lemmas. The rest is taken care of by the programme.
It is assumed that the user is acquainted with the basics of computer hardware and software and has at least some experience with word processing, if not with database management. Users who lack these basic notions are strongly advised to acquire the necessary background knowledge in these areas before embarking on linguistic data processing.
Lexa is the main programme of the current set. It allows one to automatically lemmatise any input ASCII texts, to create frequency lists of the types and tokens occurring in any loaded text, to generate lexical density tables, to transfer textual data in a user-defined manner to a database environment, to mention just the main procedures which are built into Lexa. The results of all operations are stored as files and can be examined later.
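As a rough illustration of two of the procedures mentioned above, lemmatising with a user-supplied list and building a frequency list, here is a short Python sketch. It does not reproduce Lexa itself: the function names, the simple tokeniser and the sample lemma list are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Very simple tokeniser: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def lemmatise(tokens, lemma_map):
    """Replace each word form by its lemma; unknown forms are kept as they are."""
    return [lemma_map.get(tok, tok) for tok in tokens]

def frequency_list(tokens):
    """Return (item, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

# Hypothetical user-specified assignment of word forms to lemmas.
lemma_map = {"walks": "walk", "walked": "walk"}

tokens = tokenize("She walks. He walked. They walk and talk.")
lemmas = lemmatise(tokens, lemma_map)

print(frequency_list(lemmas))
print("tokens:", len(lemmas), "types:", len(set(lemmas)))
```

The user only supplies the mapping from word forms to lemmas; counting the tokens and the distinct types then follows mechanically from the lemmatised list.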
Article based on the page AMALGAM.
Further bibliography: ICAME-LEXA: Corpus Processing Software
04.01.08
2nd short LANGUAGE RESOURCE REVIEW
CREA (RAE)
REAL ACADEMIA ESPAÑOLA: Banco de datos (CREA) [online]. Corpus de referencia del español actual. <http://www.rae.es> [1 April 2008]
CREA is a reference corpus of present-day Spanish. It is used to show how words work and how they appear in texts. It is a sophisticated tool that professionals use to show, for example, what precedes or follows a common noun or an article. Nevertheless, it can be accessed by anyone with a computer. It also provides contexts for each word that give the reader an idea of its meaning, its concordance and its position within the text.
The query system has three main windows.
The first and most important one lets users choose their search preferences. It is called “El perfil de consulta” (the query profile). It has several fields in which you enter your query and other selection criteria such as author, work, chronological and geographical scope, subject or medium.
The second window offers statistical results for the search: how many cases there are and in how many documents they appear. It also has some filters, and you can see what they consist of. To retrieve examples, there is a panel in which you can narrow down and refine the list of examples, with filters such as classification, grouping and tagging, and a choice of how to obtain the examples: as documents, as concordances, and so on.
In the third window we see the same panel for refining the list of examples (in case it is very long) and the list of examples itself, arranged in the way that has been chosen. In the examples, the searched word always appears in the middle and in a different colour.
From my point of view, this list and the search options of the page are very useful to the user. Moreover, the results show not only the sentences containing the searched word but also the kind of medium in which it appears, the name of the author, the year, the country, the title of the document and where it was published (in case anyone is interested in looking it up).
I tried the noun corpus, and here are some of the examples that appeared:
No. CONCORDANCE- YEAR- AUTHOR- TITLE- COUNTRY- TOPIC- PUBLICATION
1.- …gados de la CEAR presentaron un recurso de habeas corpus, que no fue aceptado. ** 2001- PRENSA- El Diario Vasco, 11/01/2001 : Liberados en Irlanda los poliz ESPAÑA – 03.Protección civil- Sociedad Vascongada de Publicaciones (San Sebastián), 2001
2.- ...ienso en que el editor de Fonollosa escogió de un corpus poético inmenso, escrito a lo largo de toda su ** 1994- PRENSA- La Vanguardia, 30/09/1994 : Para salir de la clandestinidad ESPAÑA- 02.Literatura -T.I.S.A (Barcelona), 1994
3.- …todas las vueltas y vaivenes de recursos y habeas corpus, muertes reiteradas. Personaje sin desperdicio ** 1994- PRENSA- La Vanguardia, 30/09/1994 : EL TÚNEL DEL INFIERNO ESPAÑA- 02.Literatura- T.I.S.A (Barcelona), 1994
4.- ...bra tiene una voluntad de convertirse en obligado corpus informativo del especialista en el tema, más q ** 1994- PRENSA- La Vanguardia, 14/01/1994 : Historias enciclópedicas para ti ESPAÑA- 02.Historia- T.I.S.A (Barcelona), 1994
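The layout of the concordance lines above, with the searched word always in the middle, is usually called keyword in context (KWIC). The following Python sketch shows one simple way to produce such a display. It is not how CREA builds its results: the function name and the fixed context width are assumptions, and the sample sentence is taken from the first example above.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, centred with
    `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Pad the contexts so that the keyword always starts in the same column.
        left = text[max(0, match.start() - width):match.start()].rjust(width)
        right = text[match.end():match.end() + width].ljust(width)
        lines.append(f"{left}[{match.group()}]{right}")
    return lines

sample = "un recurso de habeas corpus, que no fue aceptado"
for line in kwic(sample, "corpus", width=10):
    print(line)
```

The square brackets stand in for the different colour used on the web page; padding both contexts to a fixed width is what keeps the keyword aligned from one line to the next.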
02.25.08
ELRA
The term language resources refers to a set of speech or language data and descriptions in machine readable form, used as core resources for the software localisation and language services industries, for language studies, electronic publishing, international transactions, subject-area specialists and end users.
Examples of language resources are written and spoken corpora, computational lexicons, terminology databases, speech collection and processing, etc. Basic software tools are also important for the acquisition, preparation, collection, management, customisation and use of these language resources and other resources.
To learn more about language resources and their applications, visit the ELRA page “LRs Applications”.
The European Language Resources Association (ELRA) was established in 1995. It is the driving force to make available the language resources for language engineering and to evaluate language engineering technologies. In order to achieve this goal, ELRA is active in identification, distribution, collection, validation, standardisation, improvement, in promoting the production of language resources, in supporting the infrastructure to perform evaluation campaigns and in developing a scientific field of language resources and evaluation, etc. These activities are achieved through ELRA’s operational body ELDA (Evaluation & Language resources Distribution Agency).
ELRA’s missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, it offers a range of services described in the “Services around Language Resources” section:
- Identification, production, validation and distribution of language resources
- Promotion of the production of language resources
- Evaluation of systems, products, tools, etc., related to language resources
- Standardisation
ELRA also regularly conducts market studies and surveys in the field of HLT, and publishes a quarterly newsletter, distributed not only to its members but also to a large number of people in the HLT community. In doing so, ELRA participates in the development of HLT and promotes HLT among the players on national, European and international levels.
Services: main missions and tasks
● The catalogue of Resources : The databases are divided into 3 groups : Speech, Written Lexica/Corpora & Terminology.
● Identification and distribution of LRs : Apart from the wide range of resources in the catalogue, members can at any time turn to ELRA for information on other databases available or databases being developed.
● Legal assistance : ELRA enjoys close cooperation with legal experts in the field. Assistance can be offered when acquiring or distributing databases, or in negotiations with parties when LRs are involved.
● Production of new LRs on demand : ELRA can produce and/or package LRs at favorable pricing conditions.
● Evaluation : ELRA participates in evaluation campaigns both by supplying the language resources appropriate for evaluation and testing, and by getting involved in the evaluation process itself (evaluation of tools, systems, applications, etc.).
● Information : The ELRA Newsletter is published on a quarterly basis, with updates on the association activities and articles reflecting subjects from the different HLT areas. Other ELRA publications are also disseminated, including publications on legal and commercial aspects of the distribution of Language Resources and market surveys. ELRA collects information, facts and figures on the HLT/LR markets throughout the world, capitalizing on the experience of the ELRA members, along with that of other sources. Regular market surveys provide useful information on specific topics for members of the organization.
● LREC : Registration to the Language Resources and Evaluation Conference is offered at a discounted price for ELRA members.
Information taken from ELRA home page
02.11.08
More on Web 2.0
Web 2.0. El Negocio de las Redes Sociales (Web 2.0: The Business of Social Networks) is the title of the latest publication of the Future Trends Forum (FTF), the forum of international experts of the Fundación de la Innovación Bankinter, whose conclusions were presented on 7 February at La Comercial.
Web 2.0 will generate multiple business opportunities and an unlimited flow of knowledge
Web 2.0 is not a new version of the web, nor a communications protocol, nor a new programming language; it is a participatory and efficient web that will save us time and provide an unlimited flow of knowledge and numerous business opportunities, both for creating new companies and for generating greater profits in existing ones. Web 2.0 has emerged because the evolution of technology (bandwidth and modular architecture) has made it possible for users not only to access information but also to create content and add value. The main idea is that ‘what is not shared is lost’, and in this sense, the more users contribute content, the greater the perceived value of the service.
These are some of the conclusions drawn from the latest publication of the Fundación de la Innovación Bankinter, “Web 2.0. El Negocio de las Redes Sociales”, which were presented on 7 February at the Faculty of Economics and Business Studies, La Comercial, of the University of Deusto, selected for this purpose by the forum as a leading institution in the Basque Country. The presentation was given by Professor Michael Schrage, Center for Digital Business, MIT Sloan School.
The study has several aims: to analyse the main implications of this new technology for both society and education; to identify the industries that will be most affected; to analyse existing business models and new opportunities; and finally, to consider how the legal framework can be adapted to the new environment so that the flow of knowledge is not held back and a participatory atmosphere is encouraged.
Definition and areas of impact of Web 2.0
Three principles could be said to define Web 2.0. Community: the user contributes content, interacts with other users and creates knowledge networks. Technology: greater bandwidth makes it possible to transfer information at previously unimaginable speeds; instead of software packages it is possible to have web services, and each terminal can be client and server at the same time. Modular architecture: it favours the creation of complex applications more quickly and at lower cost.
News item published on the University of Deusto website, dated 07/02/2008