diff --git a/README.md b/README.md
index 950d60d7824529540703a6dc32da3f82beecf6d1..6c621792f8778b784ebcdd88d48656bfcc3865c7 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,44 @@
-# Embedding WordNet Graph
+# Learning Word Representations by Embedding the WordNet Graph
+---
+
+*Topic: Machine Learning, Natural Language Processing, Word Embeddings, Graph Embeddings*
+
+*Category: Review, implementation*
+
+*Authors:*
+
+- Thibault Cordier (thibault.cordier@ens-paris-saclay.fr)
+- Antoine Tadros (antoine.tadros@ens-paris-saclay.fr)
+
+*Supervisors:*
+
+- Pascal Denis (pascal.denis@inria.fr)
+- Rémi Gilleron (remi.gilleron@univ-lille.fr)
+- Nathalie Vauquier (nathalie.vauquier@inria.fr)
+
+---
+
+## Description
+
+### Context
+
+How to adequately represent words as vectors is a long-standing and crucial problem in the fields of Text Mining and Natural Language Processing (NLP). The question has recently re-surfaced with the surge of research on “deep” neural networks and the development of algorithms for learning distributed word representations, or “word embeddings” (the best known of which is probably word2vec [6, 7]). Typically, these approaches construct word representations directly from large amounts of unannotated text and do not make use of any linguistic resource.
+
+Due to the ubiquity of networks in many real-world applications, and the need for better graph analytics tools, another recent line of research has been the development of graph embedding techniques [3, 5]. One specific problem is node embedding, where the goal is to encode the nodes of a graph as low-dimensional vectors that faithfully summarize their position in the graph and the topology of their local neighborhood. Several new deep learning algorithms have been proposed for node embedding (e.g., node2vec [4], DeepWalk [8]), in addition to well-established matrix factorization approaches like Locally Linear Embedding (LLE) [9] or Laplacian Eigenmaps [1]. These approaches have been applied to different types of graphs, such as knowledge graphs and semantic graphs.
+
+### Objectives
+
+The overall objective is to improve word representations with the help of existing lexical databases like WordNet (https://wordnet.princeton.edu/). For this, we aim to combine word embedding techniques learned from texts with node embedding techniques applied to WordNet. This internship is a preliminary step in this direction. The goal is to explore recent node embedding algorithms, in particular node2vec and DeepWalk, to learn synset embeddings from the WordNet lexical database.
+
+The tentative work plan is as follows:
+
+1. Review the relevant literature on word and graph embedding methods.
+
+2. Construct similarity graphs over the 80k WordNet noun synsets, using various synset similarity algorithms [2].
+
+3. Apply node2vec and DeepWalk on these similarity graphs to derive noun synset representations.
+
+4. Map these synset representations into word representations, and evaluate them against standard word similarity datasets.
+
+5. If time permits, investigate a new similarity algorithm that would incorporate WordNet edge labels (e.g., hypernym vs. antonym relations).
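To make step 2 of the work plan above concrete, here is a minimal sketch (not part of this commit) of how a synset similarity graph can be built with NLTK and NetworkX. The 1,000-synset cap and the 0.1 edge threshold are illustrative assumptions to keep the quadratic pair loop tractable, not the project's settings:

```python
# Sketch of step 2 (illustrative; assumes `pip install nltk networkx`):
# build a similarity graph over a small subset of WordNet noun synsets.
import itertools

import networkx as nx
import nltk

nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Cap at 1,000 synsets: the full ~80k-noun graph has billions of pairs.
synsets = list(wn.all_synsets('n'))[:1000]

G = nx.Graph()
G.add_nodes_from(range(len(synsets)))
for (i, s1), (j, s2) in itertools.combinations(enumerate(synsets), 2):
    w = s1.path_similarity(s2)     # in [0, 1]; other measures would work too
    if w is not None and w > 0.1:  # sparsify: keep only fairly similar pairs
        G.add_edge(i, j, weight=w)

print(G.number_of_nodes(), 'nodes,', G.number_of_edges(), 'edges')
```

The sparsification threshold is one design choice among several; keeping all weighted edges, as the repository's code.py below does, gives a complete graph that random-walk methods can still consume.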
diff --git a/code.py b/code.py
new file mode 100644
index 0000000000000000000000000000000000000000..84b69ebf664d66ff2563a4efb34826a764fe9b0c
--- /dev/null
+++ b/code.py
@@ -0,0 +1,118 @@
+'''
+###################################################################################################
+
+# Project: Learning Word Representations by Embedding the WordNet Graph
+
+# Topic: Machine Learning, Natural Language Processing, Word Embeddings, Graph Embeddings
+
+# Authors: Thibault Cordier & Antoine Tadros
+
+###################################################################################################
+'''
+
+'''
+###################################################################################################
+# Step 1: Review the relevant literature on word and graph embedding methods.
+###################################################################################################
+'''
+
+'''
+###################################################################################################
+# Step 2: Construct similarity graphs over the 80k WordNet noun synsets,
+#         using various synset similarity algorithms.
+###################################################################################################
+'''
+
+'''
+## Importing NLTK and WordNet
+'''
+
+import nltk
+
+# Select which information-content (IC) corpus to use for IC-based similarities.
+download_brown_ic = True
+download_semcor_ic = False
+download_genesis_ic = False
+
+# WordNet: import the NLTK corpus reader.
+nltk.download('wordnet')
+from nltk.corpus import wordnet as wn
+
+# Information content: load an IC file from the wordnet_ic corpus.
+# Brown IC
+if download_brown_ic:
+    nltk.download('wordnet_ic')
+    from nltk.corpus import wordnet_ic
+    ic = wordnet_ic.ic('ic-brown.dat')
+
+# Semcor IC
+if download_semcor_ic:
+    nltk.download('wordnet_ic')
+    from nltk.corpus import wordnet_ic
+    ic = wordnet_ic.ic('ic-semcor.dat')
+
+# Alternatively, create an information content dictionary from a corpus
+# (or anything that has a words() method).
+# Genesis IC
+if download_genesis_ic:
+    nltk.download('genesis')
+    from nltk.corpus import genesis
+    ic = wn.ic(genesis, False, 0.0)
+
+'''
+## Importing NetworkX and constructing the similarity graph
+'''
+
+import os
+from time import time
+
+import networkx as nx
+
+noun_list = list(wn.all_synsets('n'))
+N_noun = len(noun_list)
+
+t0 = time()
+t1 = time()
+
+nx_G = nx.Graph()
+
+print("Add Nodes:")
+for i, synset in enumerate(noun_list):
+    nx_G.add_node(i)
+
+print("Add Edges:")
+# Naive double loop kept for reference (commented out): it re-tests the
+# pair ordering on every step.
+'''
+for i1, synset1 in enumerate(noun_list):
+    for i2, synset2 in enumerate(noun_list):
+        if i2 >= i1:
+            nx_G.add_edge(i1, i2, weight=synset1.path_similarity(synset2))
+'''
+# Incremental version: only pairs with i2 < i1 are visited, via cur_noun_list.
+cur_noun_list = []
+for i1, synset1 in enumerate(noun_list):
+    # Progress report every ~5% of the synsets.
+    if i1 % round(5 * N_noun / 100) == 0:
+        print(100. * i1 / N_noun, "%", "Time:", time() - t1)
+        t1 = time()
+    for i2, synset2 in enumerate(cur_noun_list):
+        nx_G.add_edge(i1, i2, weight=synset1.res_similarity(synset2, ic))
+    cur_noun_list.append(synset1)
+
+# Other available synset similarity measures:
+# synset1.path_similarity(synset2)      # Path (shortest-path) Similarity
+# synset1.lch_similarity(synset2)       # Leacock-Chodorow Similarity
+# synset1.wup_similarity(synset2)       # Wu-Palmer Similarity
+# synset1.res_similarity(synset2, ic)   # Resnik Similarity (needs an IC)
+# synset1.jcn_similarity(synset2, ic)   # Jiang-Conrath Similarity (needs an IC)
+# synset1.lin_similarity(synset2, ic)   # Lin Similarity (needs an IC)
+
+print("Total time:", time() - t0)
+
+os.makedirs('graph', exist_ok=True)  # make sure the output directory exists
+nx.write_weighted_edgelist(nx_G, 'graph/wordnet.graph')
+
+'''
+###################################################################################################
+# Step 2 and 3 bis: Direct implementation
+###################################################################################################
+'''
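The "direct implementation" banner above is still an empty placeholder in this commit. As a hedged sketch of step 3, the snippet below learns synset vectors from the saved edge list using DeepWalk-style uniform random walks fed to gensim's skip-gram Word2Vec. The walk count, walk length, dimensionality, and the output path graph/synset_vectors.txt are illustrative assumptions; node2vec would additionally bias the walks with its return and in-out parameters (p, q), which this sketch omits:

```python
# Sketch of step 3 (illustrative; assumes `pip install networkx gensim`):
# DeepWalk-style node embeddings from the edge list written by code.py.
import random

import networkx as nx
from gensim.models import Word2Vec

G = nx.read_weighted_edgelist('graph/wordnet.graph')

def random_walk(graph, start, length=40):
    # Uniform walk, as in DeepWalk; edge weights are deliberately ignored.
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]

# 10 walks from every node, treated as "sentences" over node-id "words".
walks = [random_walk(G, node) for _ in range(10) for node in G.nodes()]

# sg=1 selects skip-gram; `vector_size` is the gensim >= 4 name (`size` before).
model = Word2Vec(walks, vector_size=128, window=5, min_count=1, sg=1, workers=4)
model.wv.save_word2vec_format('graph/synset_vectors.txt')
```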
diff --git a/papers/(1) Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.pdf b/papers/(1) Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..b9a31a6c51ce6c0619cd616e4c3c90519536c772
Binary files /dev/null and b/papers/(1) Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.pdf differ
diff --git a/papers/(2) Evaluating WordNet-based Measures of Lexical Semantic Relatedness.pdf b/papers/(2) Evaluating WordNet-based Measures of Lexical Semantic Relatedness.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..2fc8f8d66b5c47fdb243aa7c94714adf8f3b749c
Binary files /dev/null and b/papers/(2) Evaluating WordNet-based Measures of Lexical Semantic Relatedness.pdf differ
diff --git a/papers/(3) A Comprehensive Survey of Graph Embedding Problems, Techniques and Applications.pdf b/papers/(3) A Comprehensive Survey of Graph Embedding Problems, Techniques and Applications.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..c52b0e08706703a15029254130a3cdcc155b250a
Binary files /dev/null and b/papers/(3) A Comprehensive Survey of Graph Embedding Problems, Techniques and Applications.pdf differ
diff --git a/papers/(4) node2vec Scalable Feature Learning for Networks.pdf b/papers/(4) node2vec Scalable Feature Learning for Networks.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..56bedddb7723a6ae5410d491906b6d3757ef8762
Binary files /dev/null and b/papers/(4) node2vec Scalable Feature Learning for Networks.pdf differ
diff --git a/papers/(5) Representation Learning on Graphs Methods and Applications.pdf b/papers/(5) Representation Learning on Graphs Methods and Applications.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..30fadf05ba0735523bbd3526a3bd88213fd6040b
Binary files /dev/null and b/papers/(5) Representation Learning on Graphs Methods and Applications.pdf differ
diff --git a/papers/(6) Efficient Estimation of Word Representations in Vector Space.pdf b/papers/(6) Efficient Estimation of Word Representations in Vector Space.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..aa17ab06a044d4af0121b97911f0b98b37f801f7
Binary files /dev/null and b/papers/(6) Efficient Estimation of Word Representations in Vector Space.pdf differ
diff --git a/papers/(7) Distributed Representations of Words and Phrases and their Compositionality.pdf b/papers/(7) Distributed Representations of Words and Phrases and their Compositionality.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..8094d15c5a05a11b6111db2dac35003dcfa0f9c4
Binary files /dev/null and b/papers/(7) Distributed Representations of Words and Phrases and their Compositionality.pdf differ
diff --git a/papers/(8) DeepWalk Online Learning of Social Representations.pdf b/papers/(8) DeepWalk Online Learning of Social Representations.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..7554ffaeff16bac17c0c480667c2bc982800db45
Binary files /dev/null and b/papers/(8) DeepWalk Online Learning of Social Representations.pdf differ
diff --git a/papers/(9) Nonlinear Dimensionality Reduction by Locally Linear Embedding.pdf b/papers/(9) Nonlinear Dimensionality Reduction by Locally Linear Embedding.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..61891b78445f40365f4145f325223d7e58346197
Binary files /dev/null and b/papers/(9) Nonlinear Dimensionality Reduction by Locally Linear Embedding.pdf differ
diff --git a/subject.md b/subject.md
new file mode 100644
index 0000000000000000000000000000000000000000..76a82e7908ca7494aff1a5a68bb891e815567cdb
--- /dev/null
+++ b/subject.md
@@ -0,0 +1,47 @@
+# Learning Word Representations by Embedding the WordNet Graph
+
+---
+
+*Topic: Machine Learning, Natural Language Processing, Word Embeddings, Graph Embeddings*
+
+*Category: Review, implementation*
+
+*Contact:*
+
+- Pascal Denis (pascal.denis@inria.fr)
+- Rémi Gilleron (remi.gilleron@univ-lille.fr)
+- Nathalie Vauquier (nathalie.vauquier@inria.fr)
+
+---
+
+## Description
+
+### Context
+
+How to adequately represent words as vectors is a long-standing and crucial problem in the fields of Text Mining and Natural Language Processing (NLP). The question has recently re-surfaced with the surge of research on “deep” neural networks and the development of algorithms for learning distributed word representations, or “word embeddings” (the best known of which is probably word2vec [6, 7]). Typically, these approaches construct word representations directly from large amounts of unannotated text and do not make use of any linguistic resource.
+
+Due to the ubiquity of networks in many real-world applications, and the need for better graph analytics tools, another recent line of research has been the development of graph embedding techniques [3, 5]. One specific problem is node embedding, where the goal is to encode the nodes of a graph as low-dimensional vectors that faithfully summarize their position in the graph and the topology of their local neighborhood. Several new deep learning algorithms have been proposed for node embedding (e.g., node2vec [4], DeepWalk [8]), in addition to well-established matrix factorization approaches like Locally Linear Embedding (LLE) [9] or Laplacian Eigenmaps [1]. These approaches have been applied to different types of graphs, such as knowledge graphs and semantic graphs.
+
+### Objectives
+
+The overall objective is to improve word representations with the help of existing lexical databases like WordNet (https://wordnet.princeton.edu/). For this, we aim to combine word embedding techniques learned from texts with node embedding techniques applied to WordNet. This internship is a preliminary step in this direction. The goal is to explore recent node embedding algorithms, in particular node2vec and DeepWalk, to learn synset embeddings from the WordNet lexical database.
+
+The tentative work plan is as follows:
+
+1. Review the relevant literature on word and graph embedding methods.
+
+2. Construct similarity graphs over the 80k WordNet noun synsets, using various synset similarity algorithms [2].
+
+3. Apply node2vec and DeepWalk on these similarity graphs to derive noun synset representations.
+
+4. Map these synset representations into word representations, and evaluate them against standard word similarity datasets.
+
+5. If time permits, investigate a new similarity algorithm that would incorporate WordNet edge labels (e.g., hypernym vs. antonym relations).
+
+Experiments will be done with the help of the Mangoes toolbox (https://gitlab.inria.fr/magnet/mangoes). Node embedding algorithms could be integrated into the toolbox.
+
+### Skills
+
+Basics in machine learning, graph algorithms and complexity, and linear algebra. Familiarity with NLP is a plus.
+
+See http://www.nltk.org/howto/wordnet.html for the NLTK WordNet interface.
diff --git a/subject.pdf b/subject.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..e28d3664745e6422e3334f1fafde9e946264fe11
Binary files /dev/null and b/subject.pdf differ
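Finally, a sketch of step 4 of the work plan, under stated assumptions: synset vectors are averaged into word vectors over each word's noun synsets, and cosine similarities are compared against human judgements with Spearman correlation. The input files (graph/synset_vectors.txt from the sketch above, and a tab-separated wordsim353.tsv with word1, word2, score per line) are hypothetical:

```python
# Sketch of step 4 (illustrative): synset vectors -> word vectors -> evaluation.
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn
from scipy.stats import spearmanr

synset_vecs = KeyedVectors.load_word2vec_format('graph/synset_vectors.txt')

# Rebuild the same synset indexing used when the graph was constructed.
noun_list = list(wn.all_synsets('n'))
index = {s: i for i, s in enumerate(noun_list)}

def word_vector(word):
    # Average the vectors of all embedded noun synsets containing the word.
    ids = [str(index[s]) for s in wn.synsets(word, pos='n')]
    vecs = [synset_vecs[i] for i in ids if i in synset_vecs]
    return np.mean(vecs, axis=0) if vecs else None

gold, pred = [], []
with open('wordsim353.tsv') as f:   # assumed format: word1<TAB>word2<TAB>score
    for line in f:
        w1, w2, score = line.strip().split('\t')
        v1, v2 = word_vector(w1), word_vector(w2)
        if v1 is None or v2 is None:
            continue                # skip pairs with no embedded noun synset
        gold.append(float(score))
        pred.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print('Spearman rho:', spearmanr(gold, pred).correlation)
```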