Insights API

This section explains how to use the insights API. It covers the JSON contract only, not the SQL capabilities. To learn more about the insights DSL, please visit the insights filters page.

The goal of the insights API is to provide advanced searching and analytical capabilities, such as the power to search across patient and genomic data, run a variety of aggregations, and perform statistical operations. Most data that is added to LifeOmic's patient and genomic services will automatically be indexed in analytics, requiring no extra configuration or actions from the user.

Overview

This section explains the contract structure of the insights API. The goal is to provide an overall intuition for how each component of the API works. To access the insights API, you must first be authorized to call this endpoint. Once authorized, you must also have permission to access insights data. The insights API can be accessed through the following endpoint: https://api.us.lifeomic.com/v1/analytics/dsl

While the insights API is called over HTTP with a JSON body, the endpoint itself is not RESTful. Instead, all requests are POSTs whose bodies follow varied contracts. The main components of this JSON contract are dataset_id, query, target, domain, options, and where. The dataset_id key, also known as the datasetId or projectId, is the UUID for your PHC project. It is a top-level key in the contract. Next, query is the object that stores all of the query information: the target, domain, options, and where clauses. The target can be one of three values: variant, gene, or patient. Each of these targets is explained in more detail in a later section. Domains are specific operations that are available within the scope of a target. For example, within the variant target, we may want to look at the samples associated with a given filter object, or we may instead want an OncoPrint representation. Domains take a discrete set of values, and the list of available domains appears in a later section. Options are domain-specific contracts that can be used for custom configurations of outputs, aggregations, filters, and more. Finally, where represents the filter body across genomic and patient data.

Example Skeleton

{
  "dataset_id": "UUID for project",
  "query": {
    "target": "variant|patient|gene",
    "domain": "domain specific to target",
    "options": {
      ...
    },
    "where": {
      ...
    }
  }
}
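
As a minimal sketch, the skeleton above could be assembled and wrapped in a POST request with the Python standard library. The endpoint URL comes from the Overview; the build_insights_request helper, the bearer-token header, and the placeholder domain name are assumptions made for illustration and are not part of the documented contract.

```python
import json
import urllib.request

# Endpoint taken from the Overview section.
INSIGHTS_URL = "https://api.us.lifeomic.com/v1/analytics/dsl"


def build_insights_request(dataset_id, target, domain, where, options=None, token=None):
    """Assemble the JSON contract and wrap it in a POST request (not yet sent)."""
    if target not in ("variant", "gene", "patient"):
        raise ValueError(f"unknown target: {target!r}")
    body = {
        "dataset_id": dataset_id,
        "query": {
            "target": target,
            "domain": domain,
            "options": options or {},
            "where": where,
        },
    }
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: bearer-token authorization; confirm against your account setup.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        INSIGHTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_insights_request(
    dataset_id="00000000-0000-0000-0000-000000000000",  # placeholder project UUID
    target="variant",
    domain="example_domain",  # hypothetical; use a domain valid for the target
    where={"variant": {"gnosis": {"gene": [{"operator": "eq", "value": "KRAS"}]}}},
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Building the request separately from sending it makes the body easy to inspect or log before any network call is made.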

Where Clause

The where clause in the insights contract allows users to search across genomic and patient data, regardless of the specified target or corresponding domain. For example, this separation lets a user look at gene expression statistics for individuals below the age of 40, or at a summary of patients that have a specific genetic mutation. To capture the full breadth of search combinations offered by the insights engine, the where object is recursive, allowing boolean combinations of disparate datasets. The where data structure is therefore a tree whose nodes are either boolean operators (and, or, and xor) or resource targets (variant, gene, and patient). The variant and patient targets also allow recursive searching within their own resource. For now, we will focus on the composition of the top level of this data structure. An example is provided below for context:

{
  "dataset_id": "UUID for project",
  "query": {
    "target": "variant|patient|gene",
    "domain": "domain specific to target",
    "options": {
      ...
    },
    "where": {
      "and": [{
        "or": [{
            "variant": {
              ...
            }
        }, {
            "variant": {
              ...
            }
        }]
      }, {
        "patient": {
          ...
        }
      }]
    }
  }
}

In the where clause of the JSON above, we are looking for individuals that satisfy either variant clause and also satisfy the patient clause. The simplified hierarchy can be visualized as:

where
    and
        or
            variant clause
            another variant clause
        patient clause

With this recursive data structure, one can see the numerous combinations that can be constructed through the DSL. Note that each clause may contain only one item, chosen from and, or, xor, variant, patient, and gene. In other words, implicit booleans are not supported at the top layer of filtering; each clause requires an explicit operator or resource.
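
The one-item-per-clause rule can be checked locally before a query is sent. The following is a minimal sketch of such a check; validate_where is a hypothetical helper, it assumes boolean operators always take a non-empty list of clauses (as in the example above), and it does not inspect the bodies of the resource targets.

```python
BOOLEAN_KEYS = {"and", "or", "xor"}
RESOURCE_KEYS = {"variant", "gene", "patient"}


def validate_where(clause, path="where"):
    """Raise ValueError unless every top-level clause holds exactly one known key."""
    if not isinstance(clause, dict) or len(clause) != 1:
        raise ValueError(f"{path}: expected exactly one key, got {clause!r}")
    key, value = next(iter(clause.items()))
    if key in BOOLEAN_KEYS:
        # Assumption: boolean operators take a non-empty list of child clauses.
        if not isinstance(value, list) or not value:
            raise ValueError(f"{path}.{key}: expected a non-empty list of clauses")
        for index, child in enumerate(value):
            validate_where(child, f"{path}.{key}[{index}]")
    elif key not in RESOURCE_KEYS:
        raise ValueError(f"{path}: unknown key {key!r}")
    # Resource bodies (variant, gene, patient) follow their own rules
    # and are deliberately not checked here.


# Mirrors the hierarchy shown above: and -> (or -> variant, variant), patient.
where = {"and": [{"or": [{"variant": {}}, {"variant": {}}]}, {"patient": {}}]}
validate_where(where)  # passes silently

# Two keys in one clause is an implicit boolean, which the top layer rejects.
try:
    validate_where({"variant": {}, "patient": {}})
except ValueError as error:
    print(error)
```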

Filter Targets

While the above where clause example demonstrates how you could combine multiple targets to get a single filter, it has not provided any details on what belongs in those filters. In the following section, we will clarify the different options that can be added for each one of the three targets. This will allow users to better understand the capabilities of the insights engine.

Filter Variants

One of the filterable resource targets that can be utilized on the insights engine is genetic variants. Currently, the variant target only supports the sub-resources and, or, samples, genes, and gnosis. Gnosis is LifeOmic's genetic knowledge base, which aggregates multiple open- and closed-source resources, including popular knowledge bases such as ClinVar, COSMIC, dbSNP, and more. In this section, we will describe the capabilities of the variant filter target.

To provide some context up front, it is best to demonstrate an example and explain the moving pieces.

{
    "variant": {
        "or": [
            {
                "gnosis": {
                    "gene": [
                        {
                            "operator": "eq",
                            "value": "KRAS"
                        },
                        {
                            "operator": "eq",
                            "value": "PIK3CA"
                        }
                    ],
                    "population_allele_frequency": [
                        {
                            "operator": "lte",
                            "value": 0.1
                        }
                    ]
                }
            },
            {
                "gnosis": {
                    "gene": [
                        {
                            "operator": "eq",
                            "value": "BRCA2"
                        }
                    ],
                    "cosmic_sample_count": [
                        {
                            "operator": "range",
                            "lower": 2,
                            "upper": 6
                        }
                    ]
                }
            }
        ]
    }
}

First, note that apart from the selected genes being associated with cancer research, this query has no particular merit and is mostly arbitrary. In plain English, the query asks for variants and samples that either have a mutation in KRAS or PIK3CA, as labeled by the gnosis annotation, with a population allele frequency of at most 10 percent, OR have a mutation in BRCA2 with a COSMIC sample count between 2 and 6 according to the gnosis annotation. Before we describe how this query is constructed, it may be worth rereading the plain English definition a couple of times and working out how each item maps to the JSON example.

The JSON above also introduces some new capabilities, namely implicit and's and or's. Previous sections demonstrated how to run explicit and's and or's; at any level higher than the target resources, implicit booleans are not supported. The following example demonstrates an implicit or.

{
    "gene": [
        {
            "operator": "eq",
            "value": "KRAS"
        },
        {
            "operator": "eq",
            "value": "PIK3CA"
        }
    ]
}

This query asks for variants and samples with a KRAS or PIK3CA mutation. Notice that implicit or's can only be used within a single resource. For more complex or clauses, use the explicit or sub-resource. Next, let's look at an example of an implicit and.

{
    "gnosis": {
        "gene": [
            {
                "operator": "eq",
                "value": "KRAS"
            },
            {
                "operator": "eq",
                "value": "PIK3CA"
            }
        ],
        "population_allele_frequency": [
            {
                "operator": "lte",
                "value": 0.1
            }
        ]
    }
}

First, notice how both gene and population_allele_frequency are supplied under the gnosis sub-resource. Implicit and's can only be utilized within a sub-resource that is not and or or. For example, this is an INVALID query:

{
  "gnosis": {...},
  "and": [{...}]
}

But this is a valid query:

{
  "gnosis": {
    "gene": [{...}],
    "cosmic_sample_count": [{...}]
  }
}
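
To make the implicit boolean rules concrete, the sketch below is a small local model of how a gnosis block could be read: the conditions listed under one field are or'd together, and the fields themselves are and'ed together. It is not the engine's implementation. The flat annotation record, the helper names, and the inclusive treatment of range bounds are all assumptions made for illustration.

```python
import operator

# Comparison operators that appear in the gnosis contract.
COMPARATORS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "gt": operator.gt,
    "lte": operator.le, "gte": operator.ge,
}


def condition_matches(condition, actual):
    """Evaluate one {"operator": ..., ...} object against a field value."""
    if condition["operator"] == "range":
        # Assumption: both range bounds are inclusive.
        return condition["lower"] <= actual <= condition["upper"]
    return COMPARATORS[condition["operator"]](actual, condition["value"])


def gnosis_matches(gnosis_filter, annotation):
    """Implicit and across fields, implicit or within each field's list."""
    return all(
        field in annotation
        and any(condition_matches(c, annotation[field]) for c in conditions)
        for field, conditions in gnosis_filter.items()
    )


gnosis_filter = {
    "gene": [
        {"operator": "eq", "value": "KRAS"},
        {"operator": "eq", "value": "PIK3CA"},
    ],
    "population_allele_frequency": [{"operator": "lte", "value": 0.1}],
}

# Hypothetical annotation records, one per variant.
rare_kras = {"gene": "KRAS", "population_allele_frequency": 0.05}
common_pik3ca = {"gene": "PIK3CA", "population_allele_frequency": 0.5}

print(gnosis_matches(gnosis_filter, rare_kras))      # True
print(gnosis_matches(gnosis_filter, common_pik3ca))  # False
```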

Also, when supplying genes or samples, the items in the array have an or relationship. For example:

{
    "samples": ["first", "second", "third"]
}

These items are all or'd together.

One other thing to note is that the variant target accepts and's and or's within its JSON block. Queries that use this pattern where applicable can see performance gains. A good rule of thumb: if you are using and or or with clauses from the same target, the and or or should live inside the target block. If there are items from other targets, add the and or or at the parent level of the query.
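
The rule of thumb can be sketched as a small rewrite step. The hypothetical hoist_or_into_variant helper below moves a parent-level or into the variant block when every child clause is a variant clause, and leaves anything else untouched. It handles only the or case, since that is the form shown in the earlier variant example; whether moving an and inside the block preserves the meaning of a query is not stated here, so that case is left to the reader.

```python
def hoist_or_into_variant(clause):
    """Rewrite {"or": [{"variant": A}, {"variant": B}]} as {"variant": {"or": [A, B]}}."""
    if not isinstance(clause, dict) or list(clause) != ["or"]:
        return clause
    children = clause["or"]
    if not isinstance(children, list) or not children:
        return clause
    # Every child must be a single-key variant clause; otherwise the
    # or has to stay at the parent level of the query.
    if any(not isinstance(child, dict) or list(child) != ["variant"] for child in children):
        return clause
    return {"variant": {"or": [child["variant"] for child in children]}}


parent_level = {
    "or": [
        {"variant": {"gnosis": {"gene": [{"operator": "eq", "value": "KRAS"}]}}},
        {"variant": {"gnosis": {"gene": [{"operator": "eq", "value": "BRCA2"}]}}},
    ]
}
inside_target = hoist_or_into_variant(parent_level)

# Mixed targets are returned unchanged.
mixed = {"or": [{"variant": {}}, {"patient": {}}]}
unchanged = hoist_or_into_variant(mixed)
```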

Variant Filter Options

With all of this information, we can now look at all of the options for filtering variants. The variant target differs from other data sources in that only one of its sub-resources, gnosis, takes actionable parameters. All of the options available within gnosis are shown below. In this template, operators separated by | are alternatives; supply exactly one per condition.

{
    "gnosis": {
        "chromosome": [
            {
                "operator": "eq|ne",
                "value": "string_value with chr prefix (chr1, chr2, etc)"
            }
        ],
        "position": [
            {
                "operator": "lt|gt|lte|gte|eq|ne|range",
                "value": 100000
            }
        ],
        "population_allele_frequency": [
            {
                "operator": "lt|gt|lte|gte|eq|ne|range",
                "value": 100000
            }
        ],
        "rs_id": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "id": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "clinvar_allele_id": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "clinical_disease": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "clinical_review": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "clinical_significance": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "cosmic_id": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "mutation_status": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "histology": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "tumor_site": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "ebcanon_class": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "ebcanon_group": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "impact": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "gene": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "transcript_id": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "biotype": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "amino_acid_change": [
            {
                "operator": "eq|ne",
                "value": "string_value"
            }
        ],
        "cosmic_sample_count": [
            {
                "operator": "lt|gt|lte|gte|eq|ne|range",
                "value": 100000
            }
        ]
    }
}

You may have noticed that numerical columns list a range operator but show only a single value. That template is not a valid range condition; the value key only illustrates how the other operators are used. To use the range operator, the following contract is necessary:

{
    "gnosis": {
        "position": [
            {
                "operator": "range",
                "lower": 10000,
                "upper": 20000
            }
        ]
    }
}
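
Because range uses lower and upper while every other operator uses value, it can help to build condition objects through two small helpers so the wrong keys are never mixed. This is a sketch only: condition and range_condition are hypothetical names, and the check that lower does not exceed upper is an added sanity check rather than a documented requirement.

```python
# Operators that take a single "value" key.
VALUE_OPERATORS = {"eq", "ne", "lt", "gt", "lte", "gte"}


def condition(operator_name, value):
    """Build a single-value condition such as {"operator": "lte", "value": 0.1}."""
    if operator_name not in VALUE_OPERATORS:
        raise ValueError(f"{operator_name!r} does not take a single value")
    return {"operator": operator_name, "value": value}


def range_condition(lower, upper):
    """Build a range condition, which uses lower/upper instead of value."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return {"operator": "range", "lower": lower, "upper": upper}


# Reproduces the range example above.
position_filter = {"gnosis": {"position": [range_condition(10000, 20000)]}}

# A single-value condition on another numerical column.
frequency_filter = {"gnosis": {"population_allele_frequency": [condition("lte", 0.1)]}}
```

Note that the helpers do not know which columns are numerical; string columns in the template above accept only eq and ne.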

Last update: February 1, 2020