02.22.09
Vermeer in Stamps
The following links tell us a little more about Vermeer’s life and show some of his most famous paintings reproduced on stamps from different parts of the world:
06.02.08
Michigan Corpus of Academic Spoken English
The Michigan Corpus of Academic Spoken English (MICASE) is a research project of the English Language Institute (ELI) at the University of Michigan. Its main aim was to answer these and other questions:
· What are the characteristics of contemporary academic speech—its grammar, its vocabulary, its functions and purposes, its fluencies and dysfluencies?
· Are these characteristics different for different academic disciplines and for different classes of speakers?
The goal of the first phase of the project was to record and transcribe close to 200 hours (approximately 1.8 million words) of academic speech from across the university. Nowadays, there are 152 transcripts (totaling 1,848,364 words) available at this site. The digital sound recordings were transcribed with the help of a computer program called SoundScriber.
The entire corpus is available at micase.umdl.umich.edu, as it was planned as an easily accessible “open” project. Its search engine is notable for the large number of speaker and speech-event categories that can be selected. The ELI committed resources to MICASE for several reasons:
-No database of this kind was originally available.
-MICASE provides authentic material in sufficient quantity to redefine our concepts of academic speech, since it reveals many divergences from what current grammar and vocabulary books describe.
-It is hoped that the corpus will make it possible to track changes in people’s speech patterns as they gain experience of university culture.
-With all this new information, people will be in a better position to develop more appropriate ESL and English for Academic Purposes teaching and testing materials, and to evaluate how best to incorporate them.
How to use it:
We can choose between two options:
-Browse MICASE: Choose this option to browse the corpus according to specified speaker and speech attributes; it returns quick file references.
We select the criteria using the menus and then click the button to see the transcripts that fit them. The speaker attributes are academic position/role, native speaker status and first language. The transcript attributes are speech event type, academic division, academic discipline, participant level and interactivity rating.
-Search MICASE: Choose this option to search the corpus for words or phrases in specified contexts; it returns concordance results with references to files, full utterances and speakers.
We enter the exact word or phrase we wish to find in the box. The wildcard character * may be used at the end (but not the beginning) of a search word or phrase to represent zero or more characters (e.g. typing walk* will return walk, walks, walked and walking). To search the entire corpus, we keep the default settings for the speaker and transcript attributes. For a more specific search, we choose the speaker-level and transcript-level criteria using the menus on the right. When we click the button, the search finds utterances by speakers that fit the speaker-level criteria within transcripts that fit the transcript-level criteria. The speaker attributes are those found in “Browse MICASE” plus age and gender; the transcript attributes are the same as in “Browse MICASE”.
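The trailing-wildcard rule can be illustrated with a small Python sketch. This is not MICASE’s own search code: the function name wildcard_search and the sample utterance are invented for illustration, and the sketch handles single words only, assuming simply that a * at the end of the term stands for zero or more further characters.

```python
import re

def wildcard_search(term, utterance):
    """Return the words of `utterance` that match `term`.

    A trailing * stands for zero or more further characters;
    a * anywhere else is rejected, following the rule described above.
    """
    term = term.lower()
    if "*" in term[:-1]:
        raise ValueError("the wildcard is only allowed at the end of the term")
    if term.endswith("*"):
        pattern = re.escape(term[:-1]) + r"\w*"
    else:
        pattern = re.escape(term)
    # Split the utterance into lower-cased words before matching.
    words = re.findall(r"[\w']+", utterance.lower())
    return [w for w in words if re.fullmatch(pattern, w)]

# Invented sample utterance (not taken from the corpus).
sample = "She walks to class, walked home, and is walking back; a walk helps."
matches = wildcard_search("walk*", sample)
print(matches)
```

Without the asterisk, the same function returns only exact matches of the word.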
Information taken from the main page of the “Michigan Corpus of Academic Spoken English”.
04.29.08
Time Corpus (Mark Davies)
This website, created by Mark Davies, is very useful for looking up words in American English from 1923 to the present, just by choosing the date and the word. The corpus contains more than 100 million words, all taken from TIME magazine.
One of its advantages is that we can see how words and phrases have increased and decreased in usage, and how their meaning has changed over time, by looking at changes in collocates (co-occurring words). We can also have the corpus generate a list of words that were used more in one period than in another, even when we don’t know what those words might be.
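To make the idea of comparing collocates across periods concrete, here is a minimal Python sketch. It does not use the TIME corpus or its interface: the two short texts are invented stand-ins for two periods, and the function name collocates and the window size are assumptions made for illustration.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` positions of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented toy texts standing in for two periods of a corpus.
texts_by_period = {
    "1920s": "a spider spun its web in the barn".split(),
    "2000s": "the web site has a new web page".split(),
}

for period, tokens in texts_by_period.items():
    print(period, tokens.count("web"), sorted(collocates(tokens, "web")))
```

In the toy data the word web occurs near spun in the earlier text and near site and page in the later one, which is the kind of shift in collocates that hints at a change in meaning.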
To look up a word, we can choose the type of display:
- CHART: This option presents “bar charts” that indicate the overall frequency of all matching words or phrases in each section of the corpus. This is probably the best option for comparing different genres or time blocks.
- LIST: With this option, we see a listing of each individual word or string that matches the query.
- COMPARE WORDS: This allows us to compare the collocates (nearby words) for two different words. When selected, we will see the frequency of each matching string for the following nine groupings: [genres] spoken, fiction, magazines, newspapers, academic; [time blocks] 1990-1994, 1995-1999, 2000-2004, 2005-2007. When it is not selected, we will see the overall frequency in the entire corpus.
In the search string:
- We enter the basic search string (one or more words). We can also enter “context” words and indicate how many words away they may be. Parts of speech can be used as part of the query. For example, [j*] eyes in [1] would find a two-word string composed of a form of eyes immediately preceded by an adjective.
- We can also create “user lists” or “customized lists” of words relating to a certain topic, words that are grammatically related, or any other grouping.
In the section:
- We can choose the date, or the dates to compare, and the minimum frequency.
Now that we have seen all the search options and had a brief introduction to how to use them, we can try out everything this program offers, which is very useful for any kind of research or statistics.
04.22.08
CLUVI
CLUVI is the Corpus Lingüístico da Universidade de Vigo.
It is a collection of parallel text corpora covering specialised registers of the contemporary Galician language. It was created by the SLI (Seminario de Lingüística Informática) in 2003. It contains 22 million words, and its main components are the TECTRA corpus of English-Galician literary texts, the FEGA corpus of French-Galician literary texts, the LEGA corpus of Galician-Spanish legal and administrative texts, the English-Galician-French-Spanish UNESCO corpus of scientific popularisation texts, the LOGALIZA corpus of English-Galician software localisation and the Spanish-Galician-Catalan-Basque CONSUMER corpus of consumer information.
The texts can be consulted publicly through a Spanish interface available at http://sli.uvigo.es/CLUVI/. It allows simple or complex searches for words or expressions and shows the multilingual equivalents of the search terms in their contexts of use in real or documented translations.
The number of works available on the site and the number of languages in which they are available grow regularly, as the CLUVI research project is still under way.
Moreover, the Corpus Paralelo CLUVI allows us to search corpora other than TECTRA, FEGA, LEGA, LOGALIZA and UNESCO. Through the CLUVI interface we can also access the TURIGAL corpus of Portuguese-English tourism texts and the LEGEBIDUN corpus of Basque-Spanish legal and administrative texts, developed by the DELi group at Deusto University.
Bibliography: CLUVI
The Brown Corpus Tag-set
On this page we can see an alphabetical list of the tags used in the Brown corpus, with a description and examples of each tag. If the list of examples ends with an ellipsis marker, the tag category can be assumed to be an open class. The tag-set uses ‘combined tags’ for words such as won’t and I’d. These come in two forms: either negated words have an asterisk appended to their tag, or a plus symbol separates the tags for the different tokens that make up the combined word. This makes it a trivial task to split combined tags. AMALGAM’s version of the Brown tagger annotates with combined tags only if the tokeniser is switched off. If the tokeniser is used, the combined words are split into their constituent parts and a tag is applied to each part. So won’t (MD*) becomes will (MD) plus n’t (*), and I’d (PPSS+HVD) becomes I (PPSS) plus ’d (HVD).
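The splitting of combined tags can be sketched in a few lines of Python. This is an illustration only, not AMALGAM’s tokeniser: the function names are invented, and the word splits are simply copied from the two examples in the text, since a real tokeniser would need its own rules for dividing the word forms.

```python
def split_combined_tag(tag):
    """Split a Brown-style combined tag into its component tags.

    A plus sign separates tags; a trailing asterisk marks negation
    and becomes a tag of its own.
    """
    parts = []
    for piece in tag.split("+"):
        if len(piece) > 1 and piece.endswith("*"):
            parts.extend([piece[:-1], "*"])
        else:
            parts.append(piece)
    return parts

# Word splits taken from the two examples in the text (illustrative only).
WORD_PARTS = {"won't": ["will", "n't"], "I'd": ["I", "'d"]}

def split_token(word, tag):
    """Pair each part of a combined word with its own tag."""
    return list(zip(WORD_PARTS[word], split_combined_tag(tag)))

print(split_token("won't", "MD*"))
print(split_token("I'd", "PPSS+HVD"))
```

The tag side needs nothing more than string splitting, which is why the task is described as trivial; the word side is where a tokeniser has to do real work.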
The intention behind the present set of programmes is to put at the disposal of the interested linguist the tools he or she would require in order to process linguistically relevant data, most probably from an available corpus, with a high degree of automation on a personal computer. The package is divided into several groups which perform typical functions, like lexical analysis. The main programme, Lexa, allows the user to tag and lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify which words are to be assigned to which lemmas. The rest is taken care of by the programme.
It is assumed that the user is acquainted with the basics of computer hardware and software and has at least some experience with word processing, if not with database management. Users who lack these basic notions are strongly advised to acquire the necessary background knowledge before embarking on linguistic data processing.
Lexa is the main programme of the current set. It allows one to automatically lemmatise any input ASCII texts, to create frequency lists of the types and tokens occurring in any loaded text, to generate lexical density tables, to transfer textual data in a user-defined manner to a database environment, to mention just the main procedures which are built into Lexa. The results of all operations are stored as files and can be examined later.
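As a rough illustration of two of these procedures, the following Python sketch lemmatises a text from a user-defined mapping and builds a frequency list. It is not Lexa’s code or file format: the lemma table, the function names and the sample sentence are invented, and the type/token ratio at the end is only one simple measure, which may not match the way Lexa computes its lexical density tables.

```python
from collections import Counter

# Hypothetical user-supplied mapping from word forms to lemmas.
LEMMAS = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "was": "be", "is": "be",
}

def lemmatise(tokens, lemma_map):
    """Replace each token by its lemma; unknown forms are kept unchanged."""
    return [lemma_map.get(t, t) for t in tokens]

def frequency_list(tokens):
    """Return (type, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

text = "she walks and he walked while it was walking".split()
lemmas = lemmatise(text, LEMMAS)
freqs = frequency_list(lemmas)

# Number of distinct types divided by number of tokens.
type_token_ratio = len(set(lemmas)) / len(lemmas)
print(freqs[0], round(type_token_ratio, 2))
```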
Article based on the page AMALGAM.
Further bibliography: ICAME-LEXA: Corpus Processing Software
04.01.08
2nd short LANGUAGE RESOURCE REVIEW
CREA (RAE)
REAL ACADEMIA ESPAÑOLA: Banco de datos (CREA) [online]. Corpus de referencia del español actual. <http://www.rae.es> [1 April 2008]
CREA is a reference corpus of present-day Spanish. It is used to show how words work and how they appear in texts. It is a sophisticated tool that professionals use to show, for example, what follows or precedes a common noun or an article. Nevertheless, it can be accessed by anyone with a computer. It also provides contexts for each word, which give the reader an idea of the word’s meaning, its concordances and its position within the text.
The query system has three main windows.
The first and most important one allows users to set their search preferences. It is called “El perfil de consulta” (the query profile). It has several fields in which we enter the query and other selection criteria such as author, work, chronological and geographical range, subject or medium.
The second window offers statistical results for the search: how many cases were found and in how many documents they appear. To obtain examples, there is also a box in which we can reduce and narrow down the list of examples; that is, it has filters such as classification, grouping and tag, and lets us choose how to obtain the examples: as documents, concordances, and so on.
In the third window we see the same box for narrowing down the list of examples (in case it is very long) and the list of examples in the chosen classification. In the examples, the search word always appears in the middle and in a different colour.
From my point of view, this list and the search options of the page are very useful to the user. Moreover, the results show not only the sentences where the search word occurs, but also the kind of medium it appears in, the name of the author, the year, the country, the title of the document and where it was published (in case anyone is interested in looking it up).
I have tried it with the word corpus, and here are some of the examples that appeared:
Nº CONCORDANCIA- AÑO- AUTOR- TÍTULO- PAÍS- TEMA- PUBLICACIÓN
1.- …gados de la CEAR presentaron un recurso de habeas corpus, que no fue aceptado. ** 2001- PRENSA- El Diario Vasco, 11/01/2001 : Liberados en Irlanda los poliz ESPAÑA – 03.Protección civil- Sociedad Vascongada de Publicaciones (San Sebastián), 2001
2.- ...ienso en que el editor de Fonollosa escogió de un corpus poético inmenso, escrito a lo largo de toda su ** 1994- PRENSA- La Vanguardia, 30/09/1994 : Para salir de la clandestinidad ESPAÑA- 02.Literatura -T.I.S.A (Barcelona), 1994
3.- …todas las vueltas y vaivenes de recursos y habeas corpus, muertes reiteradas. Personaje sin desperdicio ** 1994- PRENSA- La Vanguardia, 30/09/1994 : EL TÚNEL DEL INFIERNO ESPAÑA- 02.Literatura- T.I.S.A (Barcelona), 1994
4.- ...bra tiene una voluntad de convertirse en obligado corpus informativo del especialista en el tema, más q ** 1994- PRENSA- La Vanguardia, 14/01/1994 : Historias enciclópedicas para ti ESPAÑA- 02.Historia- T.I.S.A (Barcelona), 1994
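The layout of these results, with the search word in the middle of a fixed stretch of context, is usually called a keyword-in-context (KWIC) concordance. The Python sketch below shows one simple way to produce such lines. It is not how CREA works internally: the function name kwic, the square brackets marking the keyword and the English sample text are all invented for illustration, and the matching is a plain case-insensitive substring search rather than a proper word search.

```python
def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword centred."""
    lines = []
    lower = text.lower()
    key = keyword.lower()
    start = lower.find(key)
    while start != -1:
        end = start + len(key)
        # Pad both sides so the keyword lines up in the same column.
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width].ljust(width)
        lines.append(f"{left}[{text[start:end]}]{right}")
        start = lower.find(key, end)
    return lines

# Invented sample text (not taken from CREA).
sample_text = ("The lawyers filed a habeas corpus petition. "
               "The poet chose from a vast corpus of verse.")
for line in kwic(sample_text, "corpus", width=20):
    print(line)
```

Because every line is padded to the same width, the keyword appears in a single column, which makes it easy to scan what precedes and follows it.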
02.25.08
ELRA
The term language resources refers to a set of speech or language data and descriptions in machine readable form, used as core resources for the software localisation and language services industries, for language studies, electronic publishing, international transactions, subject-area specialists and end users.
Examples of language resources are written and spoken corpora, computational lexicons, terminology databases, speech collection and processing, etc. Basic software tools are also important for the acquisition, preparation, collection, management, customisation and use of these language resources and other resources.
To learn more about language resources and their applications, visit ELRA’s “LRs Applications” page.
The European Language Resources Association (ELRA) was established in 1995. It is the driving force to make available the language resources for language engineering and to evaluate language engineering technologies. In order to achieve this goal, ELRA is active in identification, distribution, collection, validation, standardisation, improvement, in promoting the production of language resources, in supporting the infrastructure to perform evaluation campaigns and in developing a scientific field of language resources and evaluation, etc. These activities are achieved through ELRA’s operational body ELDA (Evaluation & Language resources Distribution Agency).
ELRA’s missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, it offers a range of services described in the “Services around Language Resources” section:
- Identification, production, validation and distribution of language resources
- Promotion of the production of language resources
- Evaluation of systems, products, tools, etc., related to language resources
- Standardisation
ELRA also regularly conducts market studies and surveys in the field of HLT, and publishes a quarterly newsletter, distributed not only to its members but also to a large number of people in the HLT community. In doing so, ELRA participates in the development of HLT and promotes HLT among the players on national, European and international levels.
Services: main missions and tasks
● The catalogue of Resources : The databases are divided into 3 groups : Speech, Written Lexica/Corpora & Terminology.
● Identification and distribution of LRs : Apart from the wide range of resources in the catalogue, members can at any time turn to ELRA for information on other databases available or databases being developed.
● Legal assistance : ELRA enjoys close cooperation with legal experts in the field. Assistance can be offered when acquiring or distributing databases, or in negotiations with parties when LRs are involved.
● Production of new LRs on demand : ELRA can produce and/or package LRs at favorable pricing conditions.
● Evaluation : ELRA participates in evaluation campaigns both by supplying the language resources appropriate for evaluation and testing, and by getting involved in the evaluation process itself (evaluation of tools, systems, applications, etc.).
● Information : The ELRA Newsletter is published on a quarterly basis, with updates on the association activities and articles reflecting subjects from the different HLT areas. Other ELRA publications are also disseminated, including publications on legal and commercial aspects of the distribution of Language Resources and market surveys. ELRA collects information, facts and figures on the HLT/LR markets throughout the world, capitalizing on the experience of the ELRA members, along with that of other sources. Regular market surveys provide useful information on specific topics for members of the organization.
● LREC : Registration to the Language Resources and Evaluation Conference is offered at a discounted price for ELRA members.
Information taken from ELRA home page
02.11.08
More on Web 2.0
Web 2.0. El Negocio de las Redes Sociales (Web 2.0: The Business of Social Networks) is the title of the latest publication of the Future Trends Forum (FTF), a forum of international experts run by the Fundación de la Innovación Bankinter, whose conclusions were presented on 7 February at La Comercial.
Web 2.0 will generate multiple business opportunities and an unlimited flow of knowledge
Web 2.0 is not a new version of the web, nor a communications protocol, nor a new programming language; it is a participatory and efficient web that will save us time and provide an unlimited flow of knowledge and numerous business opportunities, both for creating new companies and for generating greater profits in existing ones. Web 2.0 has emerged because advances in technology (bandwidth and modular architecture) have made it possible for users not only to access information but also to create content and add value. The main idea is that ‘what is not shared is lost’ and, in this sense, the more users contribute content, the greater the perceived value of the service.
These are some of the conclusions drawn from the latest publication of the Fundación de la Innovación Bankinter, “Web 2.0. El Negocio de las Redes Sociales”, which were presented on 7 February at the Facultad de CC.EE. y Empresariales, La Comercial, of the University of Deusto, chosen for this purpose by the forum as a leading institution in the Basque Country. The presentation was given by Professor Michael Schrage of the Center for Digital Business, MIT Sloan School.
The study sets out to analyse the main implications of this new technology for both society and education; identify the industries that will be most affected; analyse existing business models and new opportunities; and, finally, consider how the legal framework can adapt to the new environment, so that the flow of knowledge is not held back and a participatory atmosphere is encouraged.
Definition and areas of impact of Web 2.0
Three principles could define Web 2.0. Community: the user contributes content, interacts with other users and creates knowledge networks. Technology: greater bandwidth allows information to be transferred at previously unimaginable speeds; instead of software packages it is possible to have web services, and each terminal can be client and server at the same time. Modular architecture: this favours the creation of complex applications more quickly and at lower cost.
News item published on the University of Deusto website on 07/02/2008
01.23.08
Wikilengua
Wikilengua is an open, participatory site about practical questions of Spanish usage and a means of reflecting the language’s diversity. It is a resource on the use of Spanish where frequent doubts can be looked up, with an essentially practical focus, and which can be extended and corrected with the collaboration of the community itself.
Because it is open and accessible to people all over the world, Wikilengua can also be a means of reflecting the diversity and richness of Spanish in its many varieties, spoken in more than twenty countries.
The site is free to consult. No registration is needed either to consult it or to make observations or proposals on the comment pages. Articles can also be proposed. Wikilengua is a living site with constant changes and contributions, so at times some information may be inaccurate, incomplete or poorly organised, especially in these early days.
Wikilengua is organised into several categories. Each category has a list of subcategories and articles, which can be consulted simply by clicking on the desired item.
Writing
Many media outlets are reporting that contributions to Wikilengua pass through a filter before publication. This information is incorrect. The model is closer to that of Wikipedia, where a group of supervisors will review the changes made to articles. The supervisors will, naturally, come from the Wikilengua community itself.
Wikilengua grows and is built by a community open to everyone interested in the language, individually or as part of an organisation, who wants to share their knowledge with hundreds of millions of Spanish speakers. Even simply correcting typos or slips can be a valuable contribution.
Go to the registration page to collaborate
Aims
Wikilengua is not a normative source, nor does it seek to create norms. Its aim is to set out the norm, whatever its origin, to reflect usage and, where relevant, to explain how far usage departs from the norm; it also seeks to present the objections raised against the norms. All of this while trying to maintain a neutral point of view.
Nor is it a wiki about linguistics; it is aimed at speakers who want practical information. Articles should therefore not contain overly complex explanations with specialised terminology, go into details that do not lead to a solution, or focus on linguistic theory.
Wikilengua is therefore:
- a meeting point for proofreaders, journalists, linguists, etc., as well as users in general, to share the solutions they have found to specific problems;
- a style guide that helps with the details of applying the norms;
- a means of flagging the errors that members themselves frequently come across in proofreading or translation, in the media, in written texts in general…
But it is not:
- a reference work, such as a dictionary (like Wikcionario), a verb conjugator, etc.;
- a site where questions can be asked and are answered promptly;
- a normative work;
- an opinion blog; on the contrary, its purpose is objectivity, achieved by pooling different points of view;
- a discussion forum, where messages follow one another without ever presenting information in a coherent and organised way;
- a reference on theoretical linguistics or philology.
License
The content of Wikilengua is under the Creative Commons BY-SA license. This is the formula recommended by the Creative Commons organization itself for wikis, and its aim is to protect the intellectual property rights that the Wikilengua community holds over its content, so that nobody can abuse the voluntary, collective effort of the people who make it up.
This does not mean that individual authors do not retain the rights to their contributions to Wikilengua.
Sources:
01.15.08
Creative commons
Creative Commons (CC) is a non-profit organization devoted to expanding the range of creative work available for others legally to build upon and share. The organization has released several copyright licenses known as Creative Commons licenses. These licenses, depending on the one chosen, restrict only certain rights (or none) of the work.
Aim
The Creative Commons licenses enable copyright holders to grant some or all of their rights to the public while retaining others through a variety of licensing and contract schemes including dedication to the public domain or open content licensing terms. The intention is to avoid the problems copyright laws create for the sharing of information.
The project provides several free licenses that copyright owners can use when releasing their works on the Web. It also provides metadata that describes the license and the work, making it easier to process and locate licensed works.
All these efforts are made to counter the effects of what Creative Commons considers to be a dominant and increasingly restrictive permission culture. In the words of Lawrence Lessig, founder of Creative Commons and former Chairman of the Board, it is “a culture in which creators get to create only with the permission of the powerful, or of creators from the past”. Lessig maintains that modern culture is dominated by traditional content distributors in order to maintain and strengthen their monopolies on cultural products such as popular music and popular cinema, and that Creative Commons can provide alternatives to these restrictions.
History
The Creative Commons licenses were pre-dated by the Open Publication License and the GNU Free Documentation License (GFDL). The GFDL was intended mainly as a license for software documentation, but is also in active use by non-software projects such as Wikipedia. Both licenses contained optional parts that, in the opinions of critics, made them less “free”. The GFDL differs from the CC licenses in its requirement that the licensed work be distributed in a form which is “transparent”, i.e., not in a proprietary and/or confidential format.
Creative Commons was officially launched in 2001 by Lawrence Lessig. The initial set of Creative Commons licenses was published on December 16, 2002. The licenses were written with the U.S. legal system in mind, so the wording could be incompatible with different local legislations and render the licenses unenforceable in various jurisdictions. To address this issue, Creative Commons International has started to port the various licenses to accommodate local copyright and private law.
The Creative Commons licenses were first tested in court in early 2006, when podcaster Adam Curry sued a Dutch tabloid that had published photos from his Flickr page without permission. The photos were licensed under the Creative Commons NonCommercial license. While the verdict was in favour of Curry, the tabloid avoided having to pay restitution to him as long as they did not repeat the offense. An analysis of the decision states, “The decision is especially noteworthy because it confirms that the conditions of a Creative Commons license automatically apply to the content licensed under it, and bind users of such content even without expressly agreeing to, or having knowledge of, the conditions of the license.”
Common Content was set up by Jeff Kramer with cooperation from Creative Commons, and is currently maintained by volunteers.
Tools for discovering CC-licensed content
- Creative Commons’ Search Page
- Creative Commons’ Content Directories
- Yahoo’s Creative Commons Search
- Google Advanced Search – select an option under Usage Rights, to search for CC content.
- Common Content – now offline (accessed 16 November 2007).
- Mozilla Firefox web browser with default Creative Commons search functionality
- The Internet Archive – Project dedicated to maintaining an archive of multimedia resources, among which Creative Commons-licensed content
- Ourmedia – Media archive supported by the Internet Archive
- ccHost – Server web software used by ccmixter and Open Clip Art Library
- MusiCC – “Your Free Social Booking”
Photos and images
- Everystockphoto.com – Search engine and member bookmarking for Creative Commons Photos
- Open Clip Art Library
Criticism
During its first year as an organization, Creative Commons experienced a “honeymoon” period with very little criticism. Recently, however, critical attention has focused on the Creative Commons movement and how well it is living up to its perceived values and goals. The critical positions taken can be roughly divided up into complaints of a lack of:
- A political position – Where the object is to critically analyze the foundations of the Creative Commons movement and offer an immanent critique. One of the more notable concerns to be found in this vein of criticism is on the role the Creative Commons plays as an unconcerned corporate filter. As mentioned in Martin Hardie and “Creative License Fetishism”, “When one examines closely just exactly what sort of ‘freedom’ is ultimately to be had within these licenses, one is quick to discover that they are primarily set up as tools meant to feed directly into corporate co-option.”
- A common sense position – These usually fall into the category of “it is not needed” or “it takes away user rights”
- A pro-copyright position – These are usually marshalled by the content industry and argue either that Creative Commons is not useful, or that it undermines copyright (Nimmer 2005).
Another criticism is that it worsens license proliferation, by providing multiple licenses that are incompatible. Most notably, ‘attribution-sharealike’ and ‘attribution-noncommercial-sharealike’ are incompatible, meaning works cannot be created that combine material from both.