1st Semester 2018/19: Semantic drift in multilingual representations
If you are interested in this project, please contact the instructor(s) by email.
In natural language processing, words are now commonly represented as vectors in a high-dimensional semantic space. These vectors are learned from co-occurrence patterns in corpora, with the objective that similar words are represented by neighbouring vectors.
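As a minimal illustration of "neighbouring vectors", the sketch below compares toy word vectors by cosine similarity. The three-dimensional vectors and the choice of cosine as the closeness measure are assumptions made for the example; learned embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# Similar words should score higher than dissimilar ones.
print(f"cat-dog: {sim_cat_dog:.3f}, cat-car: {sim_cat_car:.3f}")
```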
Recent work on multilingual embeddings aims at projecting monolingual word representations into a joint multilingual semantic space (Ruder et al. 2017). Words that are translations of each other should thus be represented by similar vectors.
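One family of approaches learns a linear map from a seed dictionary of translation pairs. The sketch below uses an orthogonal Procrustes solution as an example; it is an illustration under stated assumptions, not necessarily the method used in this project. The data are synthetic, and the function name `learn_orthogonal_map` is hypothetical.

```python
import numpy as np


def learn_orthogonal_map(source, target):
    """Return the orthogonal W minimising ||source @ W - target||_F.

    Row i of `source` and row i of `target` are assumed to be the vectors
    of a translation pair from a (hypothetical) seed dictionary.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic data: the "source language" space is a rotated copy of the
# "target language" space, so an exact orthogonal map exists.
rng = np.random.default_rng(0)
dim = 5
target = rng.normal(size=(20, dim))
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
source = target @ q.T                             # so that source @ q == target

w = learn_orthogonal_map(source, target)
projected = source @ w

# After projection, each source vector should coincide with its translation.
print(np.allclose(projected, target))
```

Constraining the map to be orthogonal preserves distances and angles within the source language. With real embeddings the fit is only approximate, and other projection methods can change the monolingual geometry.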
As an effect of the multilingual projection, the semantic relations within one language are also affected. Faruqui and Dyer (2014) find that multilingual projection can contribute to word sense disambiguation (e.g., the English word “table” is translated to “tafel” or “tabel” in Dutch depending on the context) and helps to better separate synonyms and antonyms. Dinu et al. (2015), on the other hand, observe that fine-grained semantic properties tend to be washed out in multilingual semantic space.
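A simple way to make such within-language changes measurable is to compare a word's nearest neighbours before and after projection. The sketch below uses the Jaccard overlap of the k nearest neighbours. This measure is an assumption chosen for illustration, not a prescribed part of the project, and the two-dimensional vectors are invented so that "hot" moves away from its antonym "cold".

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k):
    """Set of the k words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    others = [w for w in vectors if w != word]
    ranked = sorted(
        others,
        key=lambda w: cosine_similarity(query, vectors[w]),
        reverse=True,
    )
    return set(ranked[:k])


def neighbour_overlap(word, space_before, space_after, k=2):
    """Jaccard overlap of a word's k nearest neighbours in two spaces.

    1.0 means the neighbourhood is unchanged; 0.0 means it was replaced.
    """
    before = nearest_neighbours(word, space_before, k)
    after = nearest_neighbours(word, space_after, k)
    return len(before & after) / len(before | after)


# Hypothetical toy spaces over the same English vocabulary.
monolingual = {
    "hot": [1.0, 0.1],
    "cold": [0.9, 0.3],
    "warm": [0.8, 0.5],
    "cool": [0.5, 0.8],
}
multilingual = {
    "hot": [1.0, 0.1],
    "warm": [0.9, 0.3],
    "cold": [0.1, 1.0],
    "cool": [0.3, 0.9],
}

for k in (1, 2):
    overlap = neighbour_overlap("hot", monolingual, multilingual, k=k)
    print(f"k={k}: overlap for 'hot' = {overlap:.2f}")
```

A low overlap indicates that the word's neighbourhood has changed between the two spaces. Whether that change is an improvement, such as separating antonyms, or a loss of fine-grained distinctions has to be judged separately.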
In this project, we will analyse semantic drift from monolingual to multilingual representations. In the first week, there will be two introductory talks, followed by a theoretical and a practical assignment. These assignments guide students in acquiring the fundamentals of the topic and developing their own research question. From the beginning of the second week, students will choose a research project. Ideally, students will work in groups (depending on the number of participants and their shared interests). At the end, the project should be documented in a report.
Potential projects could involve:
- building and analysing multilingual representations for low-resource languages
- evaluating multilingual representations using cognitive data (e.g., fMRI, EEG, eye tracking)
- comparing semantic drifts for etymologically close vs distant languages
- developing multilingual sentence representations
- quantitative analysis of semantic drift phenomena (e.g., semantic relations, selectional preferences, compositionality)
Prerequisites:
- ideally, participants have taken NLP1
- ability to analyse high-dimensional data (e.g., using python)
Assessment: one theoretical and one practical assignment in the first week, and a report on the project.
Dinu, G., Lazaridou, A., & Baroni, M. (2015). Improving zero-shot learning by mitigating the hubness problem. In Proceedings of ICLR (Workshop Track).
Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. In Proceedings of EACL, pp. 462–471.
Ruder, S., Vulić, I., & Søgaard, A. (2017). A survey of cross-lingual word embedding models. arXiv: https://arxiv.org/abs/1706.04902