Overview

The Data Lake is a managed repository of all clinical and genomic data ingested into the PHC. A SQL-like query language is used to select, filter and join data across multiple domains into a single view.

See What Data is Available

The content of the Data Lake depends on what data has been ingested into the PHC. As such, the domains (think SQL tables) available might vary from project to project.

There are multiple ways to interrogate the Data Lake about what data is available for a specific project and what the structure of that data is.

  1. Data Lake REST API

    GET /api/v1/analytics/data-lake/schema?datasetId={dataset-id}
    
  2. LifeOmic CLI

    lo data-lake list-schemas {dataset-id}
    
  3. Python PHC SDK

    >>> from phc import Session
    >>> from phc.services import Analytics
    >>> from phc.util import DataLakeQuery
    
    >>> session = Session()
    >>> client = Analytics(session)
    
    >>> project_id = '00000000-0000-0000-0000-000000000000'
    >>> client.list_data_lake_schemas(project_id)
    

Execute a Simple Query

The Data Lake uses a SQL-like query language to select, filter and join data across multiple domains. Queries run in a deferred manner, meaning the request to execute a query returns a query ID rather than a result. The query ID can be used to monitor the progress of the query.

There are four states a query can be in at any given time.

  1. running. The query is currently being executed.

  2. succeeded. The query has completed and the result file is available.

  3. cancelled. The query was cancelled by a user.

  4. failed. The query failed during execution.
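A client typically polls the query status until one of the three terminal states (succeeded, cancelled, or failed) is reached. A minimal polling helper is sketched below; `wait_for_query` is an illustrative name, and the `get_status` callable stands in for whichever client (REST, CLI, or SDK) fetches the current state.

```python
import time

TERMINAL_STATES = {"succeeded", "cancelled", "failed"}

def wait_for_query(get_status, poll_interval=2.0, timeout=300.0):
    """Poll get_status() until the query reaches a terminal state.

    get_status must return one of the four documented states:
    running, succeeded, cancelled, or failed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_status()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval)
    raise TimeoutError("query did not reach a terminal state in time")
```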

Once a query has successfully completed, the results will be written to a file at a location specified in the initiating query request. The results file will be written as a CSV.
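Because the results file is plain CSV, it can be read with any CSV tooling once downloaded. A sketch using only Python's standard library (the `load_results` helper is illustrative, not part of the PHC SDK):

```python
import csv

def load_results(path):
    """Load a downloaded Data Lake results CSV into a list of dicts,
    one dict per row, keyed by the CSV header columns."""
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

# Example, assuming the results file was downloaded locally:
# rows = load_results('my-query-results.csv')
# print(rows[0]['gene'])
```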

There are multiple ways to execute a query against the Data Lake and poll for its completion.

  1. Data Lake REST API

    a. Execute the query

    POST /api/v1/analytics/data-lake/query
    {
      "query": "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'",
      "datasetId": "00000000-0000-0000-0000-000000000000",
      "outputFileName": "my-query-results"
    }
    

    Response

    {
        "message": "Query execution starting",
        "queryId": "11111111-2222-3333-4444-000000000000"
    }
    

    b. Get the status of the query

    GET /api/v1/analytics/data-lake/query/11111111-2222-3333-4444-000000000000
    

    Response

    {
        "id": "11111111-2222-3333-4444-000000000000",
        "accountId": "my-account",
        "datasetId": "00000000-0000-0000-0000-000000000000",
        "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
        "state": "succeeded",
        "outputFileName": "my-query-results",
        "startTime": 1589390692167,
        "endTime": 1589390696640
    }
    
  2. LifeOmic CLI

    a. Execute the query

    lo data-lake query 00000000-0000-0000-0000-000000000000 -q "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'" -o my-query-results
    

    Response

    message: Query execution starting
    queryId: 11111111-2222-3333-4444-000000000000
    

    b. Get the status of the query

    lo data-lake get-query 11111111-2222-3333-4444-000000000000
    

    Response

    "id": "11111111-2222-3333-4444-000000000000",
    "accountId": "my-account",
    "datasetId": "00000000-0000-0000-0000-000000000000",
    "queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
    "state": "succeeded",
    "outputFileName": "my-query-results",
    "startTime": 1589390692167,
    "endTime": 1589390696640
    
  3. Python PHC SDK

    a. Execute the query and await the result

    >>> from phc import Session
    >>> from phc.services import Analytics
    >>> from phc.util import DataLakeQuery
    
    >>> session = Session()
    >>> client = Analytics(session)
    
    >>> project_id = '00000000-0000-0000-0000-000000000000'
    >>> query_string = "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'"
    >>> output_file_name = 'my-query-results'
    >>> query = DataLakeQuery(project_id=project_id, query=query_string, output_file_name=output_file_name)
    
    >>> dataframe = client.execute_data_lake_query_to_dataframe(query)
    
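Note that in the status responses above, queryString is returned Base64-encoded. It can be decoded back to the submitted SQL with standard tooling, for example in Python:

```python
import base64

# queryString value from the example status response above
encoded = "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn"

# Decode the Base64 payload back into the original SQL text
sql = base64.b64decode(encoded).decode('utf-8')
print(sql)
# SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'
```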

Query Rate Limits

Each account is limited to 10 concurrent queries across all projects. If a query is submitted while the account is already at capacity, the query will fail.
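A client that issues many queries in parallel can gate submissions locally, so an eleventh query waits for a free slot instead of failing server-side. A minimal client-side sketch (not part of the PHC SDK; `run_query` and `submit_and_wait` are illustrative names):

```python
import threading

MAX_CONCURRENT_QUERIES = 10  # account-wide limit described above

# Client-side gate: each query holds one slot from submission
# until it reaches a terminal state.
query_slots = threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES)

def run_query(submit_and_wait):
    """Run submit_and_wait() while holding one of the query slots,
    blocking if all slots are currently in use."""
    with query_slots:
        return submit_and_wait()
```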


Last update: May 20, 2020