Using the API

The Data Lake offers a collection of RESTful API endpoints to enable building, executing and auditing queries.

Getting Table Schemas

Before building a query, review the available tables and the columns within them. The tables available for a particular project depend on what data has been ingested into that project. For example, if only FHIR data has been ingested, no genomic tables will be available.

The tables which are potentially available are:

  • Omic data tables
    • copynumber
    • fusion
    • gene
    • variant
  • FHIR data tables
    • condition
    • demographic
    • dosage
    • media
    • medication
    • observation
    • patient
    • procedure
    • sequence
    • specimen

Get All Table Schemas in a Project

Reviewing all available tables and their schemas is recommended to ensure the data available in a project aligns with expectations.

Request to fetch the schemas of each table available in a specific project:

GET /api/v1/analytics/data-lake/schema?datasetId=00000000

Example Response

{
    "datasetId": "00000000",
    "schemas": [
        {
            "table": "variant",
            "created": "2020-05-04T15:31:10.000Z",
            "columns": [
                {
                    "name": "variant_id",
                    "type": "string"
                },
                {
                    "name": "variant_set_id",
                    "type": "string"
                },
                {
                    "name": "sample_id",
                    "type": "string"
                },
                {
                    "name": "chromosome",
                    "type": "string"
                },
                {
                    "name": "position",
                    "type": "int"
                },
                {
                    "name": "reference",
                    "type": "string"
                },
                {
                    "name": "alternate",
                    "type": "string"
                },
                {
                    "name": "minimum_allele_frequency",
                    "type": "float"
                },
                {
                    "name": "maximum_allele_frequency",
                    "type": "float"
                },
                {
                    "name": "population_allele_frequency",
                    "type": "float"
                },
                {
                    "name": "rs_id",
                    "type": "string"
                },
                {
                    "name": "clinvar_allele_id",
                    "type": "string"
                },
                {
                    "name": "clinical_disease",
                    "type": "string"
                },
                {
                    "name": "clinical_review",
                    "type": "string"
                },
                {
                    "name": "clinical_significance",
                    "type": "string"
                },
                {
                    "name": "clinical_submission",
                    "type": "string"
                },
                {
                    "name": "cosmic_id",
                    "type": "string"
                },
                {
                    "name": "mutation_status",
                    "type": "string"
                },
                {
                    "name": "histology",
                    "type": "string"
                },
                {
                    "name": "tumor_site",
                    "type": "string"
                },
                {
                    "name": "cosmic_sample_count",
                    "type": "int"
                },
                {
                    "name": "ebcanon_group",
                    "type": "string"
                },
                {
                    "name": "ebcanon_class",
                    "type": "string"
                },
                {
                    "name": "impact",
                    "type": "string"
                },
                {
                    "name": "gene",
                    "type": "string"
                },
                {
                    "name": "transcript_id",
                    "type": "string"
                },
                {
                    "name": "biotype",
                    "type": "string"
                },
                {
                    "name": "amino_acid_change",
                    "type": "string"
                },
                {
                    "name": "genotype",
                    "type": "string"
                },
                {
                    "name": "is_canonical",
                    "type": "boolean"
                }
            ]
        },
        {
            "table": "sequence",
            "created": "2020-05-01T22:21:03.000Z",
            "columns": [
                {
                    "name": "patient_id",
                    "type": "string"
                },
                {
                    "name": "id",
                    "type": "string"
                },
                {
                    "name": "specimen_id",
                    "type": "string"
                },
                {
                    "name": "test_type",
                    "type": "string"
                },
                {
                    "name": "sequence_type",
                    "type": "string"
                }
            ]
        },
        {
            "table": "patient",
            "created": "2020-05-01T18:23:18.000Z",
            "columns": [
                {
                    "name": "patient_id",
                    "type": "string"
                },
                {
                    "name": "data",
                    "type": "string"
                },
                {
                    "name": "dob",
                    "type": "date"
                }
            ]
        }
    ]
}
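As a minimal sketch, the response above can be reduced to a mapping from table name to column types, which makes it quick to check what a project actually contains. The helper name `columns_by_table` is hypothetical; only the response shape is taken from the example, and the sample data is an abbreviated copy of it.

```python
def columns_by_table(schema_response):
    """Map each table name to a {column name: column type} dict."""
    return {
        schema["table"]: {col["name"]: col["type"] for col in schema["columns"]}
        for schema in schema_response.get("schemas", [])
    }


# Abbreviated copy of the example response (most columns omitted).
example = {
    "datasetId": "00000000",
    "schemas": [
        {
            "table": "sequence",
            "created": "2020-05-01T22:21:03.000Z",
            "columns": [
                {"name": "patient_id", "type": "string"},
                {"name": "test_type", "type": "string"},
            ],
        },
        {
            "table": "patient",
            "created": "2020-05-01T18:23:18.000Z",
            "columns": [
                {"name": "patient_id", "type": "string"},
                {"name": "dob", "type": "date"},
            ],
        },
    ],
}

tables = columns_by_table(example)
print(sorted(tables))            # ['patient', 'sequence']
print(tables["patient"]["dob"])  # date
```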

Get a Single Table's Schema

Request to fetch the schema of a single table, in this case the sequence table.

GET /api/v1/analytics/data-lake/schema/sequence?datasetId=00000000

Response

{
    "datasetId": "00000000",
    "schema": {
        "table": "sequence",
        "created": "2020-05-01T22:21:03.000Z",
        "columns": [
            {
                "name": "patient_id",
                "type": "string"
            },
            {
                "name": "id",
                "type": "string"
            },
            {
                "name": "specimen_id",
                "type": "string"
            },
            {
                "name": "test_type",
                "type": "string"
            },
            {
                "name": "sequence_type",
                "type": "string"
            }
        ]
    }
}
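A single-table schema can also be used to check column names before a query is sent. The sketch below assumes the response shape shown above; `missing_columns` is a hypothetical helper, not part of the API.

```python
def missing_columns(schema_response, wanted):
    """Return the names in `wanted` that the table schema does not list."""
    available = {col["name"] for col in schema_response["schema"]["columns"]}
    return [name for name in wanted if name not in available]


# Copy of the example response for the sequence table.
sequence_schema = {
    "datasetId": "00000000",
    "schema": {
        "table": "sequence",
        "created": "2020-05-01T22:21:03.000Z",
        "columns": [
            {"name": "patient_id", "type": "string"},
            {"name": "id", "type": "string"},
            {"name": "specimen_id", "type": "string"},
            {"name": "test_type", "type": "string"},
            {"name": "sequence_type", "type": "string"},
        ],
    },
}

# "gene" belongs to the variant table, so it is reported as missing here.
print(missing_columns(sequence_schema, ["patient_id", "test_type", "gene"]))  # ['gene']
```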

Executing a Query

Executing a Data Lake query requires three fields in the request body:

  1. query: This is the query string to run.

  2. datasetId: Id of the dataset to run the query against.

  3. outputFileName: Name given to the query results file. This can also be a path in the context of the PHC File Service directory, e.g. my-dir/my-results.

Example query request.

POST /api/v1/analytics/data-lake/query
{
  "query": "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'",
  "datasetId": "00000000",
  "outputFileName": "my-query-results"
}

Example query response:

{
    "message": "Query execution starting",
    "queryId": "11111111"
}
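The request body can be assembled and checked before it is sent. This is a sketch only: `build_query_body` is a hypothetical helper, and the empty-value check is a client-side convenience rather than documented server behaviour.

```python
import json

REQUIRED_FIELDS = ("query", "datasetId", "outputFileName")


def build_query_body(query, dataset_id, output_file_name):
    """Return the JSON body for POST /api/v1/analytics/data-lake/query."""
    body = {
        "query": query,
        "datasetId": dataset_id,
        "outputFileName": output_file_name,
    }
    # All three fields are required, so reject empty values early.
    empty = [name for name in REQUIRED_FIELDS if not body[name]]
    if empty:
        raise ValueError("missing required fields: " + ", ".join(empty))
    return json.dumps(body)


payload = build_query_body(
    "SELECT sample_id, gene FROM variant WHERE tumor_site='breast'",
    "00000000",
    "my-dir/my-results",
)
print(payload)
```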

Monitoring a Query

The Data Lake executes queries in a deferred manner. After it receives the query execution request, it generates a queryId and returns it to the user. This id can be used to poll the status of a running query.

Example query status request:

GET /api/v1/analytics/data-lake/query/11111111

Example status response:

{
    "id": "11111111",
    "accountId": "my-account",
    "datasetId": "00000000",
    "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
    "state": "succeeded",
    "outputFileName": "my-query-results",
    "startTime": 1589390692167,
    "endTime": 1589390696640
}
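The queryString value in the example decodes as Base64 to the submitted SQL text, and startTime and endTime look like Unix timestamps in milliseconds. Neither point is stated explicitly here, so treat both as assumptions. A short sketch for reading a status response under those assumptions (the helper names are hypothetical):

```python
import base64


def decode_query_string(encoded):
    """Decode a queryString value, assuming it is Base64-encoded UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")


def duration_ms(status):
    """Elapsed time of a finished query, assuming millisecond timestamps."""
    return status["endTime"] - status["startTime"]


# Fields copied from the example status response.
status = {
    "id": "11111111",
    "queryString": (
        "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwg"
        "aGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn"
    ),
    "state": "succeeded",
    "startTime": 1589390692167,
    "endTime": 1589390696640,
}

print(decode_query_string(status["queryString"]))
print(duration_ms(status))  # 4473
```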

A query is in one of four states at any given time:

  1. running. The query is currently being executed.

  2. succeeded. The query has completed and the result file is available.

  3. cancelled. The query was cancelled by a user.

  4. failed. The query failed during execution.
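Since running is the only non-terminal state, a client can poll until the state changes. The sketch below takes the status lookup as a function argument so that it stays independent of any HTTP client; `wait_for_query`, the poll interval, and the poll limit are illustrative choices, not values defined by the API.

```python
import time

# Every state other than "running" is final.
TERMINAL_STATES = {"succeeded", "cancelled", "failed"}


def wait_for_query(fetch_status, query_id, poll_interval=2.0, max_polls=150,
                   sleep=time.sleep):
    """Poll fetch_status(query_id) until the query leaves the running state."""
    for attempt in range(max_polls):
        status = fetch_status(query_id)
        if status["state"] in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_interval)
    raise TimeoutError(f"query {query_id} still running after {max_polls} polls")


# Stand-in for GET /api/v1/analytics/data-lake/query/{queryId}.
responses = iter([
    {"id": "11111111", "state": "running"},
    {"id": "11111111", "state": "running"},
    {"id": "11111111", "state": "succeeded"},
])
result = wait_for_query(lambda _id: next(responses), "11111111",
                        sleep=lambda _seconds: None)
print(result["state"])  # succeeded
```

In real use, `fetch_status` would issue the status request shown earlier and return the parsed JSON body.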

Query Results File

Once a query has completed successfully the results are stored in a CSV file. The file is named using the outputFileName given in the query request and is stored in the PHC File Service.

If a file with the same name as outputFileName already exists, (n + 1) is appended to the file name, where n is the number of existing files with that name, with or without the postfix.

Getting Query History

A running history of all queries executed for each project is maintained by the Data Lake.

In addition to the required request parameter datasetId, there are two optional parameters:

  1. pageSize: The maximum number of results to return in each page. Defaults to 25.

  2. nextPageToken: Token returned by a previous request to fetch the next page of results.

Example request to list all queries executed on a project.

GET /api/v1/analytics/data-lake/query?datasetId=00000000&pageSize=25

Example response

{
    "items": [
        {
            "id": "11111111",
            "accountId": "lifeomic",
            "datasetId": "00000000",
            "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
            "state": "succeeded",
            "outputFileName": "my-query-results",
            "startTime": 1574278005727,
            "endTime": 1574278009481
        },
        {
            "id": "22222222",
            "accountId": "lifeomic",
            "datasetId": "00000000",
            "queryString": "U0VMRUNUIHNwZWNpbWVuLnBhdGllbnRfaWQsIHZhcmlhbnQuc2FtcGxlX2lkLCBnZW5lLCBpbXBhY3QsIGFtaW5vX2FjaWRfY2hhbmdlLCBoaXN0b2xvZ3kgRlJPTSB2YXJpYW50IExFRlQgSk9JTiBzcGVjaW1lbiBPTiB2YXJpYW50LnNhbXBsZV9pZD1zcGVjaW1lbi5zYW1wbGVfaWQgV0hFUkUgdHVtb3Jfc2l0ZT0nYnJlYXN0JyBPUiBzcGVjaW1lbi5pZD0nYXNkZic=",
            "state": "succeeded",
            "outputFileName": "my-query-results(1)",
            "startTime": 1568659794979,
            "endTime": 1568659800118
        },
        {
            "id": "33333333",
            "accountId": "lifeomic",
            "datasetId": "00000000",
            "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgY2xpbmljYWxfc2lnbmlmaWNhbmNlLCBjbGluaWNhbF9yZXZpZXcgRlJPTSB2YXJpYW50IFdIRVJFIHNhbXBsZV9pZCA9ICdUQ0dBLUUyLUExSVUnIEFORCAocnNfaWQgPSAncnMxNDUwMDYzNTUnIE9SIHJzX2lkID0gJ3JzMzcwMTEyNDIwJyBPUiByc19pZCA9ICdyczU1NjAzNTAxMicgT1IgcnNfaWQgPSAncnM3NTM2NTE3NzUnIE9SIHJzX2lkID0gJ3JzNzYxNTg3NzY4JyBPUiByc19pZCA9ICdyczc2NDYxMzA0OScp",
            "state": "succeeded",
            "outputFileName": "my-query-results(2)",
            "startTime": 1582036208265,
            "endTime": 1582036231801
        }
    ],
    "links": {
        "self": "/v1/query?datasetId=00000000&pageSize=25",
        "next": "/v1/query?datasetId=00000000&pageSize=25&nextPageToken=eyJpZCI6eyJTIjoiYzVkYTFjYTAtMjY0Mi00NjJkLTkyNTMtYWY0NTY2YjJmMzc3In0sImRhdGFzZXRJZCI6eyJTIjoiMTllMzQ3ODItOTFjNC00MTQzLWFhZWUtMmJhODFlZDBiMjA2In19"
    }
}
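To walk the full history, a client can pull nextPageToken out of links.next and pass it to the following request. This sketch assumes that the last page has no next link, which is not stated above; `next_page_token`, `iter_queries`, and the `fetch_page` callback are hypothetical names.

```python
from urllib.parse import parse_qs, urlparse


def next_page_token(page):
    """Extract nextPageToken from links.next, or None if there is no next page."""
    next_link = page.get("links", {}).get("next")
    if not next_link:
        return None
    tokens = parse_qs(urlparse(next_link).query).get("nextPageToken")
    return tokens[0] if tokens else None


def iter_queries(fetch_page, dataset_id, page_size=25):
    """Yield every query record, following nextPageToken across pages."""
    token = None
    while True:
        page = fetch_page(dataset_id, page_size, token)
        yield from page.get("items", [])
        token = next_page_token(page)
        if token is None:
            return


# Stand-in for GET /api/v1/analytics/data-lake/query, keyed by page token.
pages = {
    None: {
        "items": [{"id": "11111111"}],
        "links": {"next": "/v1/query?datasetId=00000000&pageSize=25&nextPageToken=abc"},
    },
    "abc": {"items": [{"id": "22222222"}], "links": {}},
}
ids = [q["id"] for q in iter_queries(lambda _d, _s, token: pages[token], "00000000")]
print(ids)  # ['11111111', '22222222']
```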

Cancelling a Query

If a running query needs to be cancelled, a request to terminate it can be sent using the queryId.

Example request to cancel a running query.

DELETE /api/v1/analytics/data-lake/query/11111111
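A cancel request can be built with the Python standard library as sketched below. The base URL and the bearer-token Authorization header are placeholders and assumptions, since neither the host name nor the authentication scheme is covered here.

```python
import urllib.request


def build_cancel_request(base_url, query_id, token):
    """Build (but do not send) the DELETE request that cancels a query."""
    url = f"{base_url}/api/v1/analytics/data-lake/query/{query_id}"
    return urllib.request.Request(
        url,
        method="DELETE",
        # Assumed auth scheme; replace with whatever your account requires.
        headers={"Authorization": f"Bearer {token}"},
    )


# "https://api.example.com" and "my-token" are placeholders.
req = build_cancel_request("https://api.example.com", "11111111", "my-token")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the request.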

Last update: 2020-05-15