Please use this identifier to cite or link to this item:
http://repository.hneu.edu.ua/handle/123456789/26122
Title: | Using Word2vec technique to determine semantic and morphologic similarity in embedded words of the Ukrainian language |
Authors: | Savytska L. V. Vnukova N. M. Bezugla I. V. Pyvovarov V. Sübay M. T. |
Keywords: | word2vec; NLP; cosine similarity; semantic relations; morphological (linguistics) relations; word vectors; word embedding; Ukrainian language |
Date of publication: | 2021 |
Bibliographic description: | Savytska L. V. Using Word2vec technique to determine semantic and morphologic similarity in embedded words of the Ukrainian language / L. V. Savytska, N. M. Vnukova, I. V. Bezugla et al. ‒ CEUR Workshop Proceedings, 2021. ‒ P. 235–248. |
Abstract: | The study addresses the translation of words into vectors of real numbers (word embeddings), one of the most important topics in natural language processing. Word2vec is one of the latest techniques, developed by Tomas Mikolov, for learning high-quality word vectors. Most studies on clustering word vectors have been carried out for English. Dmitry Chaplinsky has already computed and published vectors for the Ukrainian language using the LexVec, Word2vec and GloVe techniques, obtained from fiction, newswire and ubercorpus texts, for the VESUM dictionary and other related NLP tools for the Ukrainian language. No research had been done on vectors produced with the Word2vec technique from a Ukrainian corpus built with a Wikipedia dump as the main source. The collection contains more than 261 million words. The vocabulary (unique words) obtained from the corpus contains more than 709 thousand words. Research using the Word2vec machine-learning technique is of great practical importance for computerising many areas of linguistic analysis. The open-source Python programming language was used to obtain word vectors with the Word2vec technique and to calculate the cosine similarity of the vectors. To perform machine learning with Word2vec in Python, the open-source library "Gensim" was used. The cosine similarities of the obtained vectors were also calculated with "Gensim". The clustering of the word vectors obtained from the Ukrainian corpus was examined with respect to two sub-branches of linguistics: semantics and morphology. First, it was investigated how accurately the vectors were obtained from the Ukrainian corpus and how well the words represent the cluster they belong to.
Second, it was investigated how word vectors are clustered and associated with respect to the morphological features of the suffixes of the Ukrainian language. |
URI (Uniform Resource Identifier): | http://repository.hneu.edu.ua/handle/123456789/26122 |
Appears in collections: | Статті (МТМСФТ) |
Files in this item:
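
Note on the abstract: it refers to calculating the cosine similarity of word vectors and to how words cluster with their nearest neighbours. As a minimal sketch of that measure only, the snippet below uses plain Python rather than Gensim, and the three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are hypothetical illustrations, not data or code from the paper. Gensim offers comparable operations on trained vectors; the exact calls the authors used are not stated in this record.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors (assumption: real embeddings have hundreds of
# dimensions and are learned from a corpus, not written by hand).
toy_vectors = {
    "київ": [0.9, 0.1, 0.3],
    "львів": [0.8, 0.2, 0.4],
    "яблуко": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    for neighbour, score in most_similar("київ", toy_vectors):
        print(f"{neighbour}\t{score:.3f}")
```

With these toy values the two city names rank closer to each other than either does to "яблуко", which is the kind of neighbourhood structure the abstract describes examining on real vectors.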
File | Description | Size | Format | |
---|---|---|---|---|
paper21.pdf | | 1.34 MB | Adobe PDF | View/Open |
All items in the electronic resource archive are protected by copyright, with all rights reserved.