Cypherization: Optimizing Cypher Queries

Post by Alex Frieden

I have been working with genomics data and neo4j for about a year now. One of the biggest piece of advice is how you architect your queries for any moderately sized set of data. However, we need some background first:

Vcf Files

Variant Call Format files or VCF files are a common format for storing data for clinical sequencing pipelines. Below is what a file looks like:

VcfFile

What a mess! These quickly become unreadable. Luckily parsers exist to make our lives much easier and create a usable API. PyVcf is one common popular one if you are interested. The basic structure of these is

VcfStruc

Which we can easily convert to:

VcfGraph

Note: While long time bioinformaticians will know this architecture differs from the standard spec, it works for us when we assume one sample per vcfFile (which is what we are doing)

Neo4j

We have since taken this schema and loaded it into Neo4j. This story is about finding all such files that had a particular allele. Initially I wanted to find all the vcf records that had that allele. I knew the various properties of the vcf record

So I ran

MATCH (vcf:VcfRecord)--(f:VcfFile) WHERE vcf.position = {_position} AND vcf.alt = {_alt} AND vcf.ref = {_ref} 
RETURN f

This query takes roughly two hours. It also scares terribly.

So what did I do wrong?

This would be an appropriate solution if we were using mongodb or sql. However, Graph Databases aren't multi for query by restriction. Don't get me wrong, they can do it. However, its not where they shine.

How do we Optimize?

So how do we fix this? We use a trick called anchoring. Traversals are very fast, but when we have to scan tons of nodes it isn't going to be fast. So for this query, we know we are looking for a particular allele. The alleles have all been indexed and more than that they have been constrained

CREATE CONSTRAINT ON (allele:Allele) ASSERT allele.name IS UNIQUE

where name is a parsable name that allows to uniquely identify the allele. Since the number of alleles is very small compared to the number of vcf records and they are indexed. We anchor our query and then traverse.

MATCH (a:Allele) WHERE a.name = {alleleName} WITH a MATCH (a)<-[:HAS_ALLELE]-(sd:SampleData)-[:COMES_FROM]->(v:VcfRecord)-[:IN_FILE]->(f)
RETURN f;

It is imporant to note that while node labels are good, node labels get applied after the query. Relationship labels however get applied before the query. Thus, best practice is to always use relationship labels when applying any traversal and node labels.

As stated above, the former query takes approximately two hours. However, the latter query takes 104 miliseconds. That is a HUGE speed up! The speed up remember is a function of having to iterate over a small number of linked lists in Allele compared to VcfRecord. Remember, when possible we want to avoid using properties at all. For reference, we have approximately 40000 Alleles and 135,000,000 VcfRecords that we have stored. The Alleles are indexed as it is our central fact for analysis on a unique name. The VcfRecords have no index.

Lessons Learned

  • Always use indexes
  • Apply uniqueness constraints wherever possible
  • Filter on your smallest indexed set then traverse from there.
  • Try to avoid filtering on properties when you can
  • If you must filter on properties make sure you can filter on a small set of nodes

I hope this helps in your architecting efforts. Please feel free to reach out to me at Alex@Frieden.org with questions and suggestions.




comments powered by Disqus