Data Lake Service FAQ¶
What is LifeOmic Data Lake Service?¶
- LifeOmic Data Lake Service is a managed repository of all clinical and genomic data ingested into the PHC.
- The data is persisted in a semi-structured form, so users can query, shape, and combine data from different modalities into a single model that suits their needs. This opens the data up for exploration, analytics, data science, and machine learning.
What data is cataloged in the data lake when files or FHIR data are ingested?¶
- The answer depends on what data has been ingested into the project.
- To see the list of cataloged data for a specific project, run the following LifeOmic CLI command:
    lo data-lake list-schemas <projectId>
- Depending on what has been ingested, the following data domains (or data pools) may be cataloged:
  - Omic data
  - FHIR data
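The `list-schemas` command shown above can also be scripted, for example from a notebook. The following is a minimal sketch, assuming the LifeOmic CLI (`lo`) is installed and already authenticated. The helper names `build_list_schemas_command` and `list_schemas` and the project ID are hypothetical. The output format of the command is not described in this FAQ, so the sketch returns the raw text unchanged.

```python
import shutil
import subprocess


def build_list_schemas_command(project_id):
    """Return the argument list for `lo data-lake list-schemas <projectId>`."""
    if not project_id:
        raise ValueError("project_id must be a non-empty string")
    return ["lo", "data-lake", "list-schemas", project_id]


def list_schemas(project_id):
    """Run the CLI and return its raw stdout, or None if `lo` is not on PATH."""
    if shutil.which("lo") is None:
        return None
    result = subprocess.run(
        build_list_schemas_command(project_id),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# "my-project-id" is a placeholder, not a real project.
command = build_list_schemas_command("my-project-id")
print(" ".join(command))
```

Calling `list_schemas("my-project-id")` would then run the command itself. It is left out of the sketch so the example has no side effects.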
What format does the data lake use to store data?¶
- The data lake stores its data in the Apache Parquet format.
How can I read data from the data lake?¶
- There are currently three tools that can query the data lake and retrieve results:
  - Data Lake REST API
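As a rough illustration of what querying through a REST API can look like, the sketch below builds an authenticated POST request with Python's standard library and stops short of sending it. Everything specific in it is an assumption rather than the documented Data Lake REST API: the endpoint URL, the JSON field names (`query`, `projectId`), the bearer-token header, and the use of SQL as the query language are all placeholders. Consult the Data Lake REST API reference for the actual contract.

```python
import json
import urllib.request

# All values below are placeholders for illustration only.
API_URL = "https://api.example.com/data-lake/query"  # hypothetical endpoint
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"                   # hypothetical credential
PROJECT_ID = "my-project-id"                         # hypothetical project ID


def build_query_request(query, project_id, token, url=API_URL):
    """Build (but do not send) a POST request carrying a query as JSON."""
    body = json.dumps({"query": query, "projectId": project_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )


request = build_query_request(
    "SELECT * FROM some_table LIMIT 10",  # assumed SQL-style query text
    PROJECT_ID,
    ACCESS_TOKEN,
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. That step is omitted here because the endpoint is not real.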
What data formats are available for data lake query results?¶
- Query results are returned in CSV format, which is the supported output format.
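Because results arrive as CSV, they can be read with Python's standard `csv` module. The sketch below is a minimal example. The sample text and its column names (`patient_id`, `gene`, `variant_count`) are made up for illustration and do not reflect an actual data lake schema.

```python
import csv
import io

# Hypothetical CSV text standing in for a downloaded query result file.
sample_result = "patient_id,gene,variant_count\np1,BRCA1,3\np2,TP53,1\n"


def load_query_result(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


rows = load_query_result(sample_result)
for row in rows:
    # Every CSV value is read as a string, so convert numeric columns explicitly.
    print(row["patient_id"], row["gene"], int(row["variant_count"]))
```

For a result saved to disk, the same `csv.DictReader` can wrap a file opened with `open(path, newline="")` instead of the in-memory string.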
How can I explore the data available in the data lake for my project?¶
- Notebooks (in the PHC) are an ideal sandbox for data exploration.
- The notebook environments come pre-installed with the PHC SDK for Python as well as modules useful for data exploration, such as NumPy. See the LifeOmic Notebook Service FAQ for more information.
Last update: 2021-04-22