Hands-On Graph Analytics with Neo4j

Graph-based search

Graph-based search emerged in 2012, when Google announced its new graph-based search algorithm. It promised more accurate search results, closer to a human response to a human question than before. In this section, we are going to review different search methods to understand why graph-based search is a big improvement for a search engine. We will then discuss different ways to implement graph-based search using Neo4j and machine learning.

Search methods

Several search methods have been used since search engines first appeared in web applications. Think, for instance, of tags assigned to a blog article, which help classify the articles and allow searching for articles with a given tag. The same method is used when you assign keywords to a given document. It is quite simple to implement, but is also very limited: what if you forget an important keyword?

Fortunately, one can also use full-text search, which consists of matching documents whose text contains the pattern entered by the user. In that case, there is no need to manually annotate documents with keywords; the full text of the document is used to index it. Tools such as Elasticsearch are extremely good at indexing text documents and performing full-text searches within them.
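To illustrate the idea behind full-text indexing, here is a minimal sketch in plain Python. The documents and the `search` helper are illustrative, not an Elasticsearch API; real engines add tokenization, stemming, and relevance scoring on top of this idea:

```python
# Minimal full-text search sketch: build an inverted index mapping each
# word to the set of documents containing it, then intersect those sets
# for all the words in the query.
from collections import defaultdict

documents = {
    1: "introduction to machine learning with python",
    2: "graph analytics with neo4j",
    3: "machine learning on graphs",
}

# inverted index: word -> set of document ids containing that word
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    """Return the ids of documents containing every word of the query."""
    word_sets = [index[word] for word in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

print(search("machine learning"))  # {1, 3}
```

Note that this already shows the limitation discussed next: the query has to use exactly the words that appear in the documents.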

But this method is still not perfect. What if the user chooses a different wording from yours, but with a similar meaning? Let's say you write about data science. Wouldn't a user typing machine learning be interested in your text? We all remember the times when we had to rephrase our search keywords on Google until we got the desired result. 

That's where graph-based search enters the game. By adding context to your data, you will be able to identify that data science and machine learning, while not the same thing, are closely related, and that a user looking for one of those terms might be interested in articles using the other expression.
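As a sketch of what adding context can look like in Neo4j, related terms can be stored as nodes linked by a relationship, so a search for one term can expand to its neighbors. The `:Topic` and `:Article` labels and the `:RELATED_TO` and `:TAGGED` relationships below are illustrative choices, not a fixed schema:

```cypher
// Illustrative schema: topics connected by a RELATED_TO relationship
CREATE (:Topic {name: "data science"})-[:RELATED_TO]->(:Topic {name: "machine learning"});

// A search for "machine learning" can then also surface articles
// tagged with a related topic, up to one hop away
MATCH (t:Topic {name: "machine learning"})-[:RELATED_TO*0..1]-(related:Topic)
MATCH (a:Article)-[:TAGGED]->(related)
RETURN DISTINCT a.title
```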

To better understand what graph-based search is, let's take a look at the definition given by Facebook in 2013:

With Graph Search, you simply enter phrases such as "My friends who live in San Francisco", "Photos of my family taken in Copenhagen", or "Dentists my friends like", and Facebook quickly displays a page of the content you've requested.

(Source: https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-graph-search-beta/10151240856103920)

Graph-based search was actually first implemented by Google, back in 2012. Since then, you have been able to ask questions such as the following:

  • How far is New York from Australia?
    And you directly get the answer:

  • Movies with Leonardo DiCaprio.
    And you can see at the top of the result page, a list of popular movies Leonardo DiCaprio acted in:

How can Neo4j help in implementing graph-based search? We will first learn how Cypher enables us to answer complex questions like the preceding ones. 

Manually building Cypher queries

First, in order to understand how this kind of search works, we are going to write some Cypher queries manually.

The next table summarizes several kinds of questions together with a possible Cypher query to get to the answer:
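As one illustrative example of the kind of question-to-Cypher mapping such a table contains, consider the earlier movie question. The node labels, relationship type, and property names below are assumptions about the underlying graph schema:

```cypher
// Question: "Movies with Leonardo DiCaprio"
// (labels, relationship type and property names are assumptions)
MATCH (p:Person {name: "Leonardo DiCaprio"})-[:ACTED_IN]->(m:Movie)
RETURN m.title
```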

You can see that Cypher allows us to answer many different types of questions in just a few characters. The knowledge we have built in the previous section, based on other data sources such as Wikidata, is also important.

However, so far, this process assumes a human being reads the question and translates it into Cypher. As you can imagine, this solution does not scale. That's why we are now going to investigate some techniques to automate this translation, via NLP and state-of-the-art machine learning techniques used in the context of translation.

Automating the English to Cypher translation

In order to automate the English to Cypher translation, we can either use some logic based on language understanding, or go even further and use machine learning techniques borrowed from language translation.

Using NLP

In the previous section, we used some NLP techniques to enhance our knowledge graph. The same techniques can be applied in order to analyze a question written by a user and extract its meaning. Here we are going to use a small Python script to help us convert a user question to a Cypher query.

In terms of NLP, the Python ecosystem contains several packages that can be used. For our needs here, we are going to use spaCy (https://spacy.io/). It is very easy to use, especially if you don't want to bother with technical implementations. It can be easily installed via the Python package manager, pip:

pip install -U spacy

It is also available on conda-forge if you prefer to use conda.

Let's now see how spaCy can help us in building a graph-based search engine. Starting from an English sentence such as Leonardo DiCaprio is born in Los Angeles, spaCy can identify the different parts of the sentence and the relationship between them:

The previous diagram was generated from the following simple code snippet:

import spacy

# load the small English model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "Leonardo DiCaprio is born in Los Angeles."

# analyze the text
document = nlp(text)

# generate an SVG image of the dependency tree
svg = spacy.displacy.render(document, style="dep")
with open("dep.svg", "w") as f:
    f.write(svg)

On top of these relationships, we can also extract named entities, as we did with GraphAware and Stanford NLP tools in the previous section. The result of the preceding text is as follows:

This information can be accessed within spaCy in the following way:

for ent in document.ents:
    print(ent.text, ":", ent.label_)

This piece of code prints the following results:

Leonardo DiCaprio : PERSON
Los Angeles : GPE

Leonardo DiCaprio is well identified as a PERSON. And according to the documentation at https://spacy.io/api/annotation#named-entities, GPE stands for Countries, cities, states, so Los Angeles was also identified correctly.

How does that help? Well, now that we have entities, we have node labels:

MATCH (:PERSON {name: "Leonardo DiCaprio"})
MATCH (:GPE {name: "Los Angeles"})

The two preceding Cypher queries can be generated from Python:

for ent in document.ents:
    q = f"MATCH (n:{ent.label_} {{name: '{ent.text}' }})"
    print(q)

Python f-strings replace {var} with the value of the var variable in the enclosing scope. In order for the curly brackets needed in Cypher not to be interpreted, we have to double them, hence the {{ }} syntax in the code, which is printed as valid Cypher in the end.
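The brace-doubling rule can be checked in isolation:

```python
# Doubled curly brackets are emitted literally by an f-string,
# while single brackets interpolate the enclosed expression
label, name = "PERSON", "Leonardo DiCaprio"
query = f"MATCH (n:{label} {{name: '{name}' }})"
print(query)  # MATCH (n:PERSON {name: 'Leonardo DiCaprio' })
```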

In order to identify which relationship we should use to relate the two entities, we are going to use the verb in the sentence:

for token in document:
    if token.pos_ == "VERB":
        print(token.text)

The only printed result will be born, since this is the only verb in our sentence. We can update the preceding code to print the Cypher relationship:

for token in document:
    if token.pos_ == "VERB":
        print(f"[:{token.text.upper()}]")

Putting all the pieces together, we can write a query to check whether the statement is true or not:

MATCH (n0:PERSON {name: 'Leonardo DiCaprio' })
MATCH (n1:GPE {name: 'Los Angeles' })
RETURN EXISTS((n0)-[:BORN]-(n1))

This query returns true if the pattern we are looking for exists in our working graph, and false otherwise.
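The query assembly itself can be isolated in a small helper. The function below works on already-extracted (text, label) pairs and a verb, so it can be read and tested without spaCy installed; the function name `build_exists_query` is an illustrative choice, and with spaCy the arguments would come from `document.ents` and the verb loop above:

```python
def build_exists_query(entities, verb):
    """Build a Cypher query checking whether the entities are linked
    by a relationship named after the verb.

    entities -- list of (text, label) pairs, e.g. from document.ents
    verb     -- the verb extracted from the sentence
    """
    # one MATCH clause per extracted entity, bound to n0, n1, ...
    matches = [
        f"MATCH (n{i}:{label} {{name: '{text}' }})"
        for i, (text, label) in enumerate(entities)
    ]
    # check for a relationship named after the verb between the two nodes
    exists = f"RETURN EXISTS((n0)-[:{verb.upper()}]-(n1))"
    return "\n".join(matches + [exists])

query = build_exists_query(
    [("Leonardo DiCaprio", "PERSON"), ("Los Angeles", "GPE")], "born"
)
print(query)
```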

As you can see, NLP is very powerful and a lot can be done with it if we want to push the analysis further. But the amount of work required is incredibly high, especially if we want to cover several fields (not only people and locations, but also gardening products or GitHub repositories). That's the reason why, in the next section, we are going to investigate another possibility enabled by NLP and machine learning: an automatic English to Cypher translator.

Using translation-like models

As you can see from the previous paragraph, natural language understanding helps in automating the translation of human language to a Cypher query, but it relies on some rules. These rules have to be carefully defined, and you can imagine how difficult this becomes as the number of rules increases. That's the reason why we can also find help in machine learning techniques, especially those related to translation, another part of NLP.

Translation consists of taking a text in one (human) language and outputting a text in another (human) language, as illustrated in the following diagram, where the translator is a machine learning model, usually relying on artificial neural networks:

The translator's goal is to assign a value (or a vector of values) to each word, this vector carrying the meaning of the word. We will talk about this in more detail in the chapter dedicated to embedding (Chapter 10, Graph Embedding from Graphs to Matrices).
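As a toy illustration of a vector carrying the meaning of a word, here are made-up three-dimensional vectors; real models learn vectors of hundreds of dimensions from large text corpora, but the principle of comparing them is the same:

```python
import math

# Made-up toy vectors: related terms point in similar directions
vectors = {
    "data science":     [0.9, 0.8, 0.1],
    "machine learning": [0.8, 0.9, 0.2],
    "gardening":        [0.1, 0.0, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 for similar words."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# related terms score much higher than unrelated ones
print(cosine_similarity(vectors["data science"], vectors["machine learning"]))
print(cosine_similarity(vectors["data science"], vectors["gardening"]))
```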

But without knowing the details of the process, can we imagine applying the same logic to translate a human language to Cypher? Indeed, using the same techniques as those used for human language translation, we can build models to convert English sentences (questions) to a Cypher query.

The Octavian-AI company worked on an implementation of such a model in their english2cypher package (https://github.com/Octavian-ai/english2cypher). It is a neural network model implemented with TensorFlow in Python. The model learned from a list of questions regarding the London Tube, together with their translations in Cypher. The training set looks like this:

english: How many stations are between King's Cross and Paddington?

cypher: MATCH (var1) MATCH (var2)
MATCH tmp1 = shortestPath((var1)-[*]-(var2))
WHERE var1.id="c2b8c082-7c5b-4f70-9b7e-2c45872a6de8"
AND var2.id="40761fab-abd2-4acf-93ae-e8bd06f1e524"
WITH
nodes(tmp1) AS var3
RETURN length(var3) - 2

Even if we have not yet studied the shortest path methods (see Chapter 4, The Graph Data Science Library and Path Finding), we can understand the preceding query:

  1. It starts from getting the two stations mentioned in the question.
  2. It then finds the shortest path between those two stations.
  3. And counts the number of nodes (stations) in the shortest path.
  4. The answer to the question is the length of the path, minus 2 since we do not want to count the start and end station.
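The steps above can also be written with station names instead of internal ids, which makes the query easier to read (the `:Station` label and `name` property are assumptions about the underlying London Tube graph):

```cypher
// 1. Match the two stations mentioned in the question
MATCH (s1:Station {name: "King's Cross"})
MATCH (s2:Station {name: "Paddington"})
// 2. Find the shortest path between them
MATCH p = shortestPath((s1)-[*]-(s2))
// 3-4. Count the stations on the path, excluding start and end
RETURN size(nodes(p)) - 2
```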

But the power of machine learning models lies in their ability to generalize: from a set of known data (the training dataset), they are able to issue predictions for unknown data. The preceding model would, for instance, be able to answer a question such as "How many stations are there between Liverpool Street Station and Hyde Park Corner?" even if it has never seen it before.

In order to use such a model within your business, you will have to create a training sample made of a list of English questions with the corresponding Cypher queries able to answer them. This part is similar to the one we performed in the Manually building Cypher queries section. Then you will have to train a new model. If you are not familiar with machine learning and model training, this topic will be covered in more detail in Chapter 8, Using Graph-Based Features in Machine Learning.

You now have a better overview of how graph-based search works and why Neo4j is a good structure to hold the data if user search is an important feature for your company. But knowledge graph applications are not limited to search engines. Another interesting application of the knowledge graph is recommendations, as we will discover now.