Using the API¶
The Data Lake offers a collection of RESTful API endpoints for building, executing, and auditing queries.
Getting Table Schemas¶
Before building a query, review the available tables and the columns within those tables. The tables available for a particular project depend on what data has been ingested into that project; for example, if only FHIR data has been ingested, no genomic tables will be available.
The tables which are potentially available are:
- Omic data tables
- copynumber
- fusion
- gene
- variant
- FHIR data tables
- condition
- demographic
- dosage
- media
- medication
- observation
- patient
- procedure
- sequence
- specimen
Get All Table Schemas in a Project¶
Reviewing all available tables and their schemas is recommended to ensure the data available in a project aligns with expectations.
Request to fetch the schemas of each table available in a specific project
GET /api/v1/analytics/data-lake/schema?datasetId=00000000
Example Response
{
  "datasetId": "00000000",
  "schemas": [
    {
      "table": "variant",
      "created": "2020-05-04T15:31:10.000Z",
      "columns": [
        {
          "name": "variant_id",
          "type": "string"
        },
        {
          "name": "variant_set_id",
          "type": "string"
        },
        {
          "name": "sample_id",
          "type": "string"
        },
        {
          "name": "chromosome",
          "type": "string"
        },
        {
          "name": "position",
          "type": "int"
        },
        {
          "name": "reference",
          "type": "string"
        },
        {
          "name": "alternate",
          "type": "string"
        },
        {
          "name": "minimum_allele_frequency",
          "type": "float"
        },
        {
          "name": "maximum_allele_frequency",
          "type": "float"
        },
        {
          "name": "population_allele_frequency",
          "type": "float"
        },
        {
          "name": "rs_id",
          "type": "string"
        },
        {
          "name": "clinvar_allele_id",
          "type": "string"
        },
        {
          "name": "clinical_disease",
          "type": "string"
        },
        {
          "name": "clinical_review",
          "type": "string"
        },
        {
          "name": "clinical_significance",
          "type": "string"
        },
        {
          "name": "clinical_submission",
          "type": "string"
        },
        {
          "name": "cosmic_id",
          "type": "string"
        },
        {
          "name": "mutation_status",
          "type": "string"
        },
        {
          "name": "histology",
          "type": "string"
        },
        {
          "name": "tumor_site",
          "type": "string"
        },
        {
          "name": "cosmic_sample_count",
          "type": "int"
        },
        {
          "name": "ebcanon_group",
          "type": "string"
        },
        {
          "name": "ebcanon_class",
          "type": "string"
        },
        {
          "name": "impact",
          "type": "string"
        },
        {
          "name": "gene",
          "type": "string"
        },
        {
          "name": "transcript_id",
          "type": "string"
        },
        {
          "name": "biotype",
          "type": "string"
        },
        {
          "name": "amino_acid_change",
          "type": "string"
        },
        {
          "name": "genotype",
          "type": "string"
        },
        {
          "name": "is_canonical",
          "type": "boolean"
        }
      ]
    },
    {
      "table": "sequence",
      "created": "2020-05-01T22:21:03.000Z",
      "columns": [
        {
          "name": "patient_id",
          "type": "string"
        },
        {
          "name": "id",
          "type": "string"
        },
        {
          "name": "specimen_id",
          "type": "string"
        },
        {
          "name": "test_type",
          "type": "string"
        },
        {
          "name": "sequence_type",
          "type": "string"
        }
      ]
    },
    {
      "table": "patient",
      "created": "2020-05-01T18:23:18.000Z",
      "columns": [
        {
          "name": "patient_id",
          "type": "string"
        },
        {
          "name": "data",
          "type": "string"
        },
        {
          "name": "dob",
          "type": "date"
        }
      ]
    }
  ]
}
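As an illustration, the schema listing can be fetched with any HTTP client. The sketch below uses Python with the requests library; the base URL and bearer-token header are placeholder assumptions, so substitute your environment's host and authentication.

```python
import requests

BASE_URL = "https://example-api.host"          # placeholder host (assumption)
HEADERS = {"Authorization": "Bearer <token>"}  # assumed bearer-token auth

def get_schemas(dataset_id: str) -> list:
    """Fetch the schema of every table available in a project."""
    resp = requests.get(
        f"{BASE_URL}/api/v1/analytics/data-lake/schema",
        params={"datasetId": dataset_id},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["schemas"]

# Print each table name together with its column names.
for schema in get_schemas("00000000"):
    print(schema["table"], [column["name"] for column in schema["columns"]])
```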
Get a Single Table's Schema¶
Request to fetch the schema of a single table, in this case the sequence table.
GET /api/v1/analytics/data-lake/schema/sequence?datasetId=00000000
Response
{
  "datasetId": "00000000",
  "schema": {
    "table": "sequence",
    "created": "2020-05-01T22:21:03.000Z",
    "columns": [
      {
        "name": "patient_id",
        "type": "string"
      },
      {
        "name": "id",
        "type": "string"
      },
      {
        "name": "specimen_id",
        "type": "string"
      },
      {
        "name": "test_type",
        "type": "string"
      },
      {
        "name": "sequence_type",
        "type": "string"
      }
    ]
  }
}
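A single table's schema can be retrieved the same way. The sketch below (Python with requests again, under the same placeholder host and auth assumptions) prints each column of the sequence table with its type.

```python
import requests

BASE_URL = "https://example-api.host"          # placeholder host (assumption)
HEADERS = {"Authorization": "Bearer <token>"}  # assumed bearer-token auth

resp = requests.get(
    f"{BASE_URL}/api/v1/analytics/data-lake/schema/sequence",
    params={"datasetId": "00000000"},
    headers=HEADERS,
)
resp.raise_for_status()

# List every column of the sequence table with its type.
for column in resp.json()["schema"]["columns"]:
    print(f"{column['name']}: {column['type']}")
```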
Executing a Query¶
The Data Lake query execution endpoint requires three fields to be provided in the request body:

- query: The query string to run.
- datasetId: The ID of the dataset to run the query against.
- outputFileName: The name given to the query results file. This can also be a path within the PHC File Service directory structure, such as my-dir/my-results.
Example query request.
POST /api/v1/analytics/data-lake/query
{
  "query": "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'",
  "datasetId": "00000000",
  "outputFileName": "my-query-results"
}
Example query response:
{
  "message": "Query execution starting",
  "queryId": "11111111"
}
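Putting the three required fields together, a query submission might look like the following sketch (Python with requests; the host and auth header are placeholder assumptions).

```python
import requests

BASE_URL = "https://example-api.host"          # placeholder host (assumption)
HEADERS = {"Authorization": "Bearer <token>"}  # assumed bearer-token auth

body = {
    "query": (
        "SELECT sample_id, gene, impact, amino_acid_change, histology "
        "FROM variant WHERE tumor_site='breast'"
    ),
    "datasetId": "00000000",
    "outputFileName": "my-query-results",
}

resp = requests.post(
    f"{BASE_URL}/api/v1/analytics/data-lake/query",
    json=body,
    headers=HEADERS,
)
resp.raise_for_status()

# The returned queryId is used later to poll for status or cancel the query.
query_id = resp.json()["queryId"]
print("Query started:", query_id)
```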
Monitoring a Query¶
When the Data Lake executes a query, it does so in a deferred manner. After receiving the query execution request, a queryId is generated and returned to the user. This ID can be used to poll the status of a running query.
Example query status request:
GET /api/v1/analytics/data-lake/query/11111111
Example status response:
{
  "id": "11111111",
  "accountId": "my-account",
  "datasetId": "00000000",
  "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
  "state": "succeeded",
  "outputFileName": "my-query-results",
  "startTime": 1589390692167,
  "endTime": 1589390696640
}
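Note that the queryString in the example responses is base64 encoded; decoding it recovers the submitted SQL, as in this small example:

```python
import base64

# queryString value taken from the example status response above.
encoded = "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn"
print(base64.b64decode(encoded).decode("utf-8"))
# SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'
```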
There are four states a query can be in at any given time:

- running: The query is currently being executed.
- succeeded: The query has completed and the result file is available.
- cancelled: The query was cancelled by a user.
- failed: The query failed during execution.
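A typical client polls the status endpoint until the query leaves the running state. The sketch below shows one way to do this in Python with requests; the host, auth header, and polling interval are assumptions.

```python
import time
import requests

BASE_URL = "https://example-api.host"          # placeholder host (assumption)
HEADERS = {"Authorization": "Bearer <token>"}  # assumed bearer-token auth

def wait_for_query(query_id: str, poll_seconds: float = 5.0) -> dict:
    """Poll the query status endpoint until a terminal state is reached."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/v1/analytics/data-lake/query/{query_id}",
            headers=HEADERS,
        )
        resp.raise_for_status()
        status = resp.json()
        # succeeded, cancelled, and failed are the terminal states.
        if status["state"] in ("succeeded", "cancelled", "failed"):
            return status
        time.sleep(poll_seconds)

final = wait_for_query("11111111")
print(final["state"], final["outputFileName"])
```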
Query Results File¶
Once a query has completed successfully, the results are stored in a CSV file. The file is named using the outputFileName given in the query request and is stored in the PHC File Service.
If a file with the same name as outputFileName already exists, (n + 1) is appended to the file name, where n is the number of existing files with that name, with or without the suffix. For example, if my-query-results already exists, the new results file is named my-query-results(1).
Getting Query History¶
A running history of all queries executed for each project is maintained by the Data Lake.
In addition to the required request parameter datasetId, there are two optional parameters:

- pageSize: The maximum number of results to return in each page. Defaults to 25.
- nextPageToken: Token returned by a previous request, used to fetch the next page of results.
Example request to list all queries executed on a project.
GET /api/v1/analytics/data-lake/query?datasetId=00000000&pageSize=25
Example response
{
  "items": [
    {
      "id": "11111111",
      "accountId": "lifeomic",
      "datasetId": "00000000",
      "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
      "state": "succeeded",
      "outputFileName": "my-query-results",
      "startTime": 1574278005727,
      "endTime": 1574278009481
    },
    {
      "id": "22222222",
      "accountId": "lifeomic",
      "datasetId": "00000000",
      "queryString": "U0VMRUNUIHNwZWNpbWVuLnBhdGllbnRfaWQsIHZhcmlhbnQuc2FtcGxlX2lkLCBnZW5lLCBpbXBhY3QsIGFtaW5vX2FjaWRfY2hhbmdlLCBoaXN0b2xvZ3kgRlJPTSB2YXJpYW50IExFRlQgSk9JTiBzcGVjaW1lbiBPTiB2YXJpYW50LnNhbXBsZV9pZD1zcGVjaW1lbi5zYW1wbGVfaWQgV0hFUkUgdHVtb3Jfc2l0ZT0nYnJlYXN0JyBPUiBzcGVjaW1lbi5pZD0nYXNkZic=",
      "state": "succeeded",
      "outputFileName": "my-query-results(1)",
      "startTime": 1568659794979,
      "endTime": 1568659800118
    },
    {
      "id": "33333333",
      "accountId": "lifeomic",
      "datasetId": "00000000",
      "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgY2xpbmljYWxfc2lnbmlmaWNhbmNlLCBjbGluaWNhbF9yZXZpZXcgRlJPTSB2YXJpYW50IFdIRVJFIHNhbXBsZV9pZCA9ICdUQ0dBLUUyLUExSVUnIEFORCAocnNfaWQgPSAncnMxNDUwMDYzNTUnIE9SIHJzX2lkID0gJ3JzMzcwMTEyNDIwJyBPUiByc19pZCA9ICdyczU1NjAzNTAxMicgT1IgcnNfaWQgPSAncnM3NTM2NTE3NzUnIE9SIHJzX2lkID0gJ3JzNzYxNTg3NzY4JyBPUiByc19pZCA9ICdyczc2NDYxMzA0OScp",
      "state": "succeeded",
      "outputFileName": "my-query-results(2)",
      "startTime": 1582036208265,
      "endTime": 1582036231801
    }
  ],
  "links": {
    "self": "/v1/query?datasetId=00000000&pageSize=25",
    "next": "/v1/query?datasetId=00000000&pageSize=25&nextPageToken=eyJpZCI6eyJTIjoiYzVkYTFjYTAtMjY0Mi00NjJkLTkyNTMtYWY0NTY2YjJmMzc3In0sImRhdGFzZXRJZCI6eyJTIjoiMTllMzQ3ODItOTFjNC00MTQzLWFhZWUtMmJhODFlZDBiMjA2In19"
  }
}
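To walk the full history, follow nextPageToken until no further page is available. The sketch below assumes the token can be lifted from the links.next URL of each page and that the absence of a next link marks the last page; host and auth remain placeholders.

```python
import requests
from urllib.parse import parse_qs, urlparse

BASE_URL = "https://example-api.host"          # placeholder host (assumption)
HEADERS = {"Authorization": "Bearer <token>"}  # assumed bearer-token auth

def list_queries(dataset_id: str, page_size: int = 25):
    """Yield every query in the project's history, one page at a time."""
    params = {"datasetId": dataset_id, "pageSize": page_size}
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/v1/analytics/data-lake/query",
            params=params,
            headers=HEADERS,
        )
        resp.raise_for_status()
        page = resp.json()
        yield from page["items"]

        # Assumption: the absence of a "next" link marks the last page.
        next_link = page.get("links", {}).get("next")
        if not next_link:
            break
        token = parse_qs(urlparse(next_link).query).get("nextPageToken")
        if not token:
            break
        params["nextPageToken"] = token[0]

for query in list_queries("00000000"):
    print(query["id"], query["state"], query["outputFileName"])
```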
Cancelling a Query¶
If a running query needs to be cancelled, a request to terminate it can be sent using the queryId.
Example request to cancel a running query.
DELETE /api/v1/analytics/data-lake/query/11111111
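For completeness, a cancellation call is a DELETE against the same query resource; here is a minimal sketch under the same placeholder host and auth assumptions.

```python
import requests

BASE_URL = "https://example-api.host"          # placeholder host (assumption)
HEADERS = {"Authorization": "Bearer <token>"}  # assumed bearer-token auth

# Request termination of the running query with ID 11111111.
resp = requests.delete(
    f"{BASE_URL}/api/v1/analytics/data-lake/query/11111111",
    headers=HEADERS,
)
resp.raise_for_status()
print("Cancellation requested for query 11111111")
```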