Linguistic Information in Word Embeddings (book chapter, December 2019)

Ali Basirat, Marc Tang

Ali Basirat, Marc Tang, "Linguistic Information in Word Embeddings", in Agents and Artificial Intelligence, 2019, pp. 492-513


We study the presence of linguistically motivated information in word embeddings generated with statistical methods. The nominal distinctions of uter/neuter, common/proper, and count/mass in Swedish are selected to represent, respectively, grammatical, semantic, and mixed types of nominal categories within languages. Our results indicate that typical grammatical and semantic features are easily captured by word embeddings. In our experiments, which are based on a single-layer feed-forward neural network, classifying semantic features required significantly fewer neurons than classifying grammatical features. However, semantic features also produced higher entropy in the classification output despite their high accuracy. Furthermore, the count/mass distinction remained difficult for the model, even when the number of neurons was tuned almost to its maximum.
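
The observation that a classifier can be accurate and still produce high-entropy output may be easier to see with a small numeric sketch. The example below is not the authors' code: it assumes that "entropy" refers to the Shannon entropy of the predicted class distribution, and the logit values and the helper names `softmax` and `entropy` are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw output scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical two-class outputs (illustrative numbers, not from the chapter).
confident = softmax([4.0, 0.0])  # strongly favours class 0
hesitant = softmax([0.5, 0.0])   # also favours class 0, but only slightly

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    predicted = probs.index(max(probs))
    print(f"{name}: predicted class {predicted}, entropy {entropy(probs):.3f} bits")
```

Both outputs select the same class, so they count equally toward accuracy, yet the second distribution is closer to uniform and therefore has higher entropy. Under the stated assumption, this is the sense in which a feature can be classified accurately but with less certainty.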
