Data Lake Service FAQ
What is LifeOmic Data Lake Service?
- LifeOmic Data Lake Service is a managed repository of all clinical and genomic data ingested into the PHC.
- The data is persisted in a semi-structured form, enabling users to query, shape, and combine data from different modalities into a single model that suits their unique needs. This democratizes data exploration and allows users to perform analytics, data science, and machine learning.
What data is cataloged in the data lake when files or FHIR data are ingested?
- The answer depends on what data has been ingested into the project.
- To see the list of cataloged data for a specific project, run the following LifeOmic CLI command:

```bash
lo data-lake list-schemas <projectId>
```
- Depending on what has been ingested, the following Omic and FHIR data domains (or data pools) may be available:
Omic data
- copy number
- fusion
- gene
- variant
FHIR data
- condition
- demographic
- dosage
- media
- medication
- observation
- patient
- procedure
- sequence
- specimen
What format does the data lake use to store data?
- Data in the data lake is stored in the Apache Parquet format, a columnar file format designed for efficient analytical queries.
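Parquet files can be read and written directly by common Python data tools. The snippet below is only an illustration of the format; the file name and columns are made up for the example, and query results themselves are returned as CSV, as described in a later question. Writing Parquet with pandas requires the pyarrow or fastparquet package.

```python
# Illustration of the Parquet format only; file name and columns are hypothetical.
import pandas as pd

df = pd.DataFrame(
    {
        "subject_id": ["p-001", "p-002", "p-003"],
        "code": ["8867-4", "8867-4", "8480-6"],
        "value": [72.0, 80.0, 120.0],
    }
)

# Parquet is columnar and preserves column types on a round-trip.
# (Requires the pyarrow or fastparquet package.)
df.to_parquet("observations.parquet", index=False)
restored = pd.read_parquet("observations.parquet")
print(restored.dtypes)
```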
How can I read data from the data lake?
- There are currently four tools that can query the data lake and retrieve results, including the LifeOmic CLI, the PHC SDK for Python, and the LifeOmic Notebook Service, which comes with both the PHC SDK for Python and the LifeOmic CLI pre-installed.
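For orientation, here is a rough sketch of what a data lake query through the PHC SDK for Python can look like. The import paths, class and method names (`Session`, `Analytics`, `DataLakeQuery`, `execute_data_lake_query`), project ID, SQL, and output name below are assumptions for illustration and should be verified against the PHC SDK for Python documentation.

```python
# Hedged sketch only: the imports, class names, and method names below are
# assumptions to verify against the PHC SDK for Python docs; the project ID,
# SQL, and output name are placeholders.
from phc import Session
from phc.services import Analytics
from phc.services.analytics import DataLakeQuery

session = Session()            # uses the credentials configured for your PHC account
analytics = Analytics(session)

query = DataLakeQuery(
    project_id="<projectId>",
    query_string="SELECT * FROM observation LIMIT 100",
    output_file_name="observation-sample",
)

# Submit the query; the result is made available as CSV (see the next question).
query_id = analytics.execute_data_lake_query(query)
print(query_id)
```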
What data formats are available for data lake query results?
- Query results are currently returned in CSV format.
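Since results arrive as CSV, they load directly into standard Python tooling. A minimal sketch, assuming a query result has already been downloaded locally (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; substitute the CSV produced by your data lake query.
results = pd.read_csv("observation-sample.csv")

print(results.shape)
print(results.head())
```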
How can I explore the data available in the data lake for my project?
- PHC notebooks are an ideal sandbox for data exploration.
- The notebook environments come pre-installed with the PHC SDK for Python as well as modules useful for data exploration, such as NumPy. See the LifeOmic Notebook Service FAQ for more information.
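As an illustration, once a query result has been loaded into a DataFrame as in the sketch above, a notebook cell can profile it with pandas and NumPy. The file and column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names, for illustration only.
results = pd.read_csv("observation-sample.csv")

# Quick profile of the result set.
print(results.describe(include="all"))
print(results["code"].value_counts().head(10))

# NumPy comes pre-installed in the notebook environments and is handy for
# ad hoc numeric summaries.
values = results["value"].to_numpy(dtype=float)
print(np.nanpercentile(values, [25, 50, 75]))
```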
Last update: 2022-03-24
Created: 2020-05-02