Corpus of Time Expressions in Spanish
The HourGlass corpus is a collection of 348 documents (short texts) in Spanish tagged with temporal expressions following the TimeML standard. Since it was conceived as a test bed for temporal taggers, each document is classified with a tag and a register.
The corpus is divided into two parts, depending on the source of the texts.
- The SYNTHETIC part is a collection of 285 documents specifically designed to test some functionalities a temporal tagger should cover, such as detecting basic expressions like "It is five o'clock". Tags such as "Hour", "Dates" or "False" (that is, sentences containing a confusing expression that should not be tagged) were added to each document to facilitate their use (e.g., when only the coverage of one type of expression needs to be tested).
- The PEOPLE part is a collection of 67 documents (although only 63 were added to the final HourGlass corpus; the rest were left out because of their ambiguity) proposed by people unfamiliar with the temporal annotation task. They were asked to write sentences with what they considered to be temporal expressions, and these sentences were afterwards analyzed and annotated. Besides the tags, the register of each sentence (e.g. "normal", "Latin American" or "colloquial") was also added.
In the corpus zip you will find:
- annotated_TML: a folder with 348 .tml files annotated with TIMEX3 tags following the TimeML standard. Docs starting with 0 are from the SYNTHETIC part of the corpus, while the ones starting with 9 are from the PEOPLE part.
- plain_TXT: a folder with 348 .txt files with the raw text. Docs starting with 0 are from the SYNTHETIC part of the corpus, while the ones starting with 9 are from the PEOPLE part.
- plain_TML: a folder with 348 .tml files in the TimeML standard format, but without annotations. Docs starting with 0 are from the SYNTHETIC part of the corpus, while the ones starting with 9 are from the PEOPLE part.
- HOURGLASS-PEOPLE.xlsx: Excel file with all the info (Id, Text, Tag, Register) about each document in the PEOPLE part of the corpus + the ones not added to the corpus (these have the tag "Ambiguous"). The id is also the name of the file in the corpus.
- HOURGLASS-SYNTHETIC.xlsx: Excel file with all the info (Id, Text, Tag, Register) about each document in the SYNTHETIC part of the corpus. The id is also the name of the file in the corpus.
- HOURGLASS-TOTAL.xlsx: Excel file with all the info (ID, TEXT, PROVENANCE, TAG, REGISTER, TEST, AMBIGUOUS VALUE) about the 348 files in the final corpus. The ID is also the name of the file in the corpus. PROVENANCE indicates whether the file comes from the SYNTHETIC part of the corpus or from the PEOPLE part. TEST is the annotated text.
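As an illustration of how the annotated files might be read, here is a minimal Python sketch. It assumes that the .tml files are well-formed XML with TIMEX3 elements carrying the usual TimeML attributes (tid, type, value). The helper names (provenance, extract_timex3), the file names used with them and the sample document are hypothetical and written for this example only; they are not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, List


def provenance(file_name: str) -> str:
    """Map a corpus file name to its part, based on its first character."""
    stem = Path(file_name).stem
    if stem.startswith("0"):
        return "SYNTHETIC"
    if stem.startswith("9"):
        return "PEOPLE"
    return "UNKNOWN"


def extract_timex3(tml_text: str) -> List[Dict[str, str]]:
    """Return the attributes and covered text of every TIMEX3 element."""
    root = ET.fromstring(tml_text)
    results = []
    for node in root.iter("TIMEX3"):
        entry = dict(node.attrib)
        # itertext() joins the text covered by the tag, including nested text.
        entry["text"] = "".join(node.itertext())
        results.append(entry)
    return results


# Hypothetical document for illustration only (not a file from the corpus).
sample = (
    '<TimeML><TEXT>Son las <TIMEX3 tid="t1" type="TIME" value="T05:00">'
    "cinco en punto</TIMEX3>.</TEXT></TimeML>"
)

for timex in extract_timex3(sample):
    print(timex["type"], timex["value"], timex["text"])
```

To process a real file, the same function could be applied to the contents of any file in the annotated_TML folder, for instance via Path(...).read_text(encoding="utf-8"), assuming the files are UTF-8 encoded.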
Additionally, we make available the result of three different temporal taggers on this corpus. The metrics obtained by these taggers (calculated using the software GATE) against the key set of annotations for each file and feature (extent, type and value of each tag) can be found next to each tagger below:
- Annotador: the output of Annotador - [ extent | type | value ]
- HeidelTime: the output of HeidelTime - [ extent | type | value ]
- SUTime: the output of SUTime - [ extent | type | value ]
- Result: the key annotations.
- Annotador: the annotations by Annotador.
- HeidelTime: the annotations by HeidelTime.
- SUTime: the annotations by SUTime.
- Original markups: the original information of the document.
These files can be loaded into GATE as a corpus to facilitate visualization and comparison. This was the software that generated the previous statistics.
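To give an idea of what a per-feature comparison against the key annotations involves, the following Python sketch computes precision, recall and F1 for extent, type and value. It is not GATE's implementation and is not guaranteed to reproduce the published figures: it assumes strict matching only (identical start and end offsets), and the annotation tuples, the function name strict_scores and the sample data are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumed representation: (start offset, end offset, type, value).
Annotation = Tuple[int, int, str, str]


def strict_scores(
    key: Set[Annotation], response: Set[Annotation], feature: str
) -> Dict[str, float]:
    """Precision, recall and F1 under strict span matching for one feature."""

    def project(ann: Annotation) -> tuple:
        start, end, timex_type, value = ann
        if feature == "extent":
            return (start, end)
        if feature == "type":
            return (start, end, timex_type)
        if feature == "value":
            return (start, end, value)
        raise ValueError(f"unknown feature: {feature}")

    key_items = {project(a) for a in key}
    response_items = {project(a) for a in response}
    correct = len(key_items & response_items)

    precision = correct / len(response_items) if response_items else 0.0
    recall = correct / len(key_items) if key_items else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: the tagger finds the right span and type, wrong value.
key_set = {(8, 22, "TIME", "T05:00")}
tagger_set = {(8, 22, "TIME", "T17:00")}

for feature in ("extent", "type", "value"):
    print(feature, strict_scores(key_set, tagger_set, feature))
```

In this toy case the extent and type scores are perfect while the value score is zero, which is the kind of distinction the three separate metrics are meant to expose.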
The corpus is freely downloadable under the GNU General Public License v3.0.
If you plan to publish a work using this resource, please refer to this webpage and use its DOI: https://zenodo.org/deposit?page=1&size=20. This work has been accepted at the LKE conference, so a paper to cite will be available soon!
We would also like to thank the contributors of the PEOPLE part of the corpus.