-
-
Save aparrish/2f562e3737544cf29aaf1af30362f469 to your computer and use it in GitHub Desktop.
Fantastic article thank you. How could I scale this to compare a single sentence with around a million other sentences to find the most similar ones though? I’m thinking that iteration wouldn’t be an option? Thanks!
I have the same problem and I was thinking about the non performance of iterate over so many sentences. If you get something interesting related to this it will be great to share it, I will do the same.
Fantastic article thank you. How could I scale this to compare a single sentence with around a million other sentences to find the most similar ones though? I’m thinking that iteration wouldn’t be an option? Thanks!
I have the same problem and I was thinking about the non performance of iterate over so many sentences. If you get something interesting related to this it will be great to share it, I will do the same.
Will do. I've looked at a number of text similarity approaches and they all seem to either rely on iteration or semantic word graphs with a pre-calculated one to one similarity relationship between all the nodes which means 1M nodes = 1M x 1M relationships which is also clearly untennable and very slow to process. I'm sure I must be missing something obvious, but I'm not sure what!
The only thing I can think of at the moment is pre-processing the similarity by saving the vectors for each sentence against their database record, then iterating for each sentence against each other sentence in the way described in this article (or any other similarity distance function), and then saving a one to one graph similarity relationship only for items that are highly similar (to reduce the number of similarity relationships created only to relevent ones). i.e. don't do the iteration at run-time, but rather do the iteration once and save the resulting similarities where of high similarity in node relationships which would be quick to query at runtime.
I'd welcome any guidance from others on this though! Has anyone tried this approach?
Fantastic article thank you. How could I scale this to compare a single sentence with around a million other sentences to find the most similar ones though? I’m thinking that iteration wouldn’t be an option? Thanks!
I have the same problem and I was thinking about the non performance of iterate over so many sentences. If you get something interesting related to this it will be great to share it, I will do the same.
It looks like a pre-trained approximate nearest neighbour approach may be a good option where you have large numbers of vectors. I've not yet tried this, but here is the logic https://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html and here is an implementation https://medium.com/@kevin_yang/simple-approximate-nearest-neighbors-in-python-with-annoy-and-lmdb-e8a701baf905
Using the Annoy library, essentially the approach here is to create an lmdb map and an Annoy index with the word embeddings. Then save those to disk. At runtime, load these, vectorise your query text and use Annoy to look up n nearest neighbours and return their IDs.
Anyone have experience of this with sentences rather than just words?
Awesome good!
Vey intuitive tutorial. Thank you!
Not sure why I'm getting the following error, working on macOS with Jupyter Lab, Python 2.7 and Spacy 2.0.9:
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-2-090b6e832a74> in <module>() 3 # It creates a list of unique words in the text 4 tokens = list(set([w.text for w in doc if w.is_alpha])) ----> 5 print nlp.vocab['cheese'].vector lexeme.pyx in spacy.lexeme.Lexeme.vector.__get__() ValueError: Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the documentation: https://spacy.io/usage/models
replace nlp.vocab['cheese'].vector
with nlp('cheese').vector
and
def vec(s):
return nlp.vocab[s].vector
with
def vec(s):
return nlp(s).vector
very good explanation
Enjoyed reading this. Thank you!
One of the best tutorials on word to vec. Nevertheless there is a "quantum-leap" in the explanation when it comes to "Word vectors in spaCy". Suddenly we have vectors associated to any word, of a predetermined dimension. Why? Where are those vectors coming from? how are they calculated? Based on which texts? Since wordtovec takes into account context the vector representations are going to be very different in technical papers, in literature, poetry, facebook posts etc. How do you create your own vectors related to a particular collection of concepts over a particular set of documents? I observed this problematic in many many word2vec tutorials. The explanation starts very smoothly, basic, very well explained up to details; and suddenly there is a big hole in the explanation. In any case this is one of the best explanations I have found on wordtovec theory. thanks
I agree! I thought I had deleted many cells and downloaded it again looking for the gap.
When I ran snippets of code that access a library, it gave me errors like this: "FileNotFoundError: [Errno 2] No such file or directory: 'pg345.txt'". And same thing with the color file: "FileNotFoundError: [Errno 2] No such file or directory: 'xkcd.json'"
I ran those on jupyter notebook. Do you know what's wrong?
Note: I tried doing it in Visual Code but it gave me the same problem, even after saving it in the same directory. Also i've read online to use the absolute path, but it still would not work.
Great, Thank You!
Great, well-explained tutorial, thank you!
Not sure why I'm getting the following error, working on macOS with Jupyter Lab, Python 2.7 and Spacy 2.0.9:
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-2-090b6e832a74> in <module>() 3 # It creates a list of unique words in the text 4 tokens = list(set([w.text for w in doc if w.is_alpha])) ----> 5 print nlp.vocab['cheese'].vector lexeme.pyx in spacy.lexeme.Lexeme.vector.__get__() ValueError: Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the documentation: https://spacy.io/usage/models
You want to download 'en_core_web_lg' model
OMG !! Really had a great time reading this beautiful gist. Very well explained.
Thanks!
I was led here by a tutorial on word vectors from youtube. Thanks for the simplicity!
very good
Thank you for sharing this. Excelent job!
this is amazing, thank you for explanation!!
Thanks!!
Very nice tutorial!
One question:
A word near the origin (0,0,0 ...) in the n-space has less possibility to be the result of an addition among words. As opposite, a word very distant of the origin could be the result of many possible additions among many words. Does this mean that complex concepts are far for the origin and basic concepts are near?
Fantastic article thank you. How could I scale this to compare a single sentence with around a million other sentences to find the most similar ones though? I’m thinking that iteration wouldn’t be an option? Thanks!