Chats, Tweets and SMS in the SoNaR Corpus: Social Media Collection

DOI: 10.5176/2251-3566_L312149

Authors: Maaske Treurniet and Eric Sanders

Abstract: In this paper we discuss the compilation of a social media corpus of chats, tweets and SMS text messages as part of the SoNaR corpus, a 500-million-word reference corpus of written Dutch comprising many different text categories. Social media are increasingly part of everyday life, which makes social media corpora an urgent need for research. Special attention was paid to the collection of metadata and to intellectual property rights (IPR). IPR clearance was obtained both through licenses with platform owners and through the consent of individual contributors. Participants were recruited by means of free publicity. The data will be available for research and commercial use.

Keywords: SoNaR, social media, chats, tweets, SMS, corpus building
