Data Processing

In addition to storing omics data, the PHC can index it. Once indexed, the data is available for deeper analysis using the Omics Explorer and other parts of the platform.

Definitions

  • Project - A logical grouping of omics and clinical data within the PHC. A project specifies the reference genome of the data within it. Reference genomes GRCh37 and GRCh38 are currently supported in the PHC. It is not possible to mix data from different genomes within the same project.
  • Omics Test - A logical grouping of omics data records that represent a single test or event.
  • Subject - The biological source of omics data.

Overview

The basic process for adding omics data to the PHC involves the following steps:

  1. Upload the omics data file to the PHC.
  2. Initiate the proper indexing action based on the omics data type. This also involves associating the omics data with a Subject record in the PHC.
  3. The PHC processes the file. During this time, the data is transformed and indexed to make it highly available and queryable. For certain data types, external knowledge bases are brought in and the omics data is annotated with additional information. For example, short variant records are annotated with information from sources such as ClinVar and dbSNP.
  4. After the data processing completes, the omics data can be accessed via the Omics Explorer or other parts of the platform like Tasks or the Data Lake.

Processing Time Frames

The time it takes for omics data to become available in the platform depends on several factors:

  • Number of omics files and overall size of the data being ingested at the same time
  • Current overall size of the omics data in a project. Indexing times can increase as the size of the project increases.

Resources

The following resources can be used to perform the actions needed to add omics data to the PHC.

  • Core API - Provides resources for managing projects, files, and other omics features
  • GA4GH API - Provides resources for managing omics resources
  • PHC Python SDK - A Python software development kit for interfacing with the PHC.
  • LifeOmic CLI - A command line interface for interacting with the PHC.
  • Omics Dashboard - Coming Soon

Best Practices

  • When adding files to a PHC Project, virtual folders can be created by using the / delimiter in the name of the file. Example: Adding a file with a name of /path/file.txt will make it appear in the Files Web Console as if the file file.txt exists under the path folder. This can help with organization as the number of files in a project increases.
  • When submitting the requests to process an omics data file, many of the API requests and CLI commands take a common set of fields that you should try to provide values for:
    • Name (name) - Use a descriptive name here as this is the value that will show up in many of the user interfaces like the Omics Explorer.
    • Test Type (testType|test-type) - Specify the type of test that was performed. Example: For Foundation Medicine, this could be Heme. Note: The Name and Test Type fields are used to help identify an omics test, which prevents duplicates should the same omics file be ingested again for a Subject. Use unique values for these fields to identify each test for a given Subject.
    • Indexed Date (indexedDate|indexed-date) - Specify the actual date that the test was performed or the data was created. The PHC will later capture the dates when data was added to the PHC.
    • Performer ID (performerId|performer-id) - Specify the ID of a FHIR Organization resource to represent the entity that performed the test or created the data. You can filter subjects by this value later to see which ones had tests performed by a certain provider.
    • Body Site (bodySite|body-site) - Specify a code from a terminology system to identify the body site of the sample that was used to produce the test results.
    • Body Site System (bodySiteSystem|body-site-system) - Specify the terminology system of the body site code.
    • Body Site Display (bodySiteDisplay|body-site-display) - Specify a friendly display value for the body site code.
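
The two practices above can be sketched in Python. This is a minimal illustration, not the PHC SDK: the helper names virtual_path and common_test_fields are hypothetical, all values are placeholders, and the ISO 8601 date format is an assumption. Only the camelCase field names come from the list above. How the resulting dictionary is sent (which endpoint, and whether the fields sit at the top level of the request body) is not covered here.

```python
def virtual_path(file_name):
    """Split a file name on "/" into (folder parts, base name)."""
    parts = [part for part in file_name.split("/") if part]
    return parts[:-1], parts[-1]


def common_test_fields(name, test_type, indexed_date, performer_id=None,
                       body_site=None, body_site_system=None,
                       body_site_display=None):
    """Collect the common omics request fields, omitting unset optional ones.

    Treating name, testType, and indexedDate as mandatory is a choice made
    for this sketch; the documentation only recommends providing them.
    """
    fields = {"name": name, "testType": test_type, "indexedDate": indexed_date}
    optional = {
        "performerId": performer_id,
        "bodySite": body_site,
        "bodySiteSystem": body_site_system,
        "bodySiteDisplay": body_site_display,
    }
    fields.update({k: v for k, v in optional.items() if v is not None})
    return fields


# A name containing "/" is shown as a file inside a virtual folder.
folders, base = virtual_path("/path/file.txt")
print(folders, base)

payload = common_test_fields(
    name="Example Heme panel for subject 123",   # placeholder value
    test_type="Heme",
    indexed_date="2020-02-06",                   # assumed ISO 8601 format
    performer_id="example-organization-id",      # placeholder FHIR Organization ID
)
print(payload)
```

Keeping the field names in one helper makes it easier to send the same Name and Test Type values every time a given test is submitted, which is what the duplicate check relies on.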

Omics File Management

Omics files can be uploaded and managed using the following methods:

Data Sources

Foundation Medicine

Foundation Medicine XML test files can be processed using the following methods:

  • Foundation Tasks API
  • lo tasks create-foundation-xml-import CLI subcommand
  • Omics Dashboard - Coming Soon

A single omics test will be created for all variant types found in the XML file. Also, note that the reportFileId|report-file-id option allows a PDF file to be linked to the PHC Subject.

NantOmics

NantOmics test files can be processed using the following methods:

  • NantOmics Tasks API
  • lo tasks create-nantomics-vcf-import CLI subcommand
  • Omics Dashboard - Coming Soon

NantOmics tests normally provide separate files for short and structural variants. A separate request has to be made to process each file. The type of data being added is denoted by setting the uploadType|upload-type field of the request to variant or fnv. A single omics test will be created for the subject from both the short and structural variant files.

Ashion

Ashion GEM ExTra test TAR files can be processed using the following methods:

  • Ashion Tasks API
  • lo tasks create-ashion-import CLI subcommand
  • Omics Dashboard - Coming Soon

A single omics test will be created for all the data types found in the GEM ExTra TAR file.

Short Variants

VCF files can be processed by the PHC to add genomic short variants to a project. The PHC will run a normalization process on the VCF to filter out any unsupported regions. A list of VCFs can also be specified, in which case the files are combined and any duplicates are removed. VCFs can be processed using the following methods:

Reads

BAM files can be processed by the PHC to add genomic read data to a project. The PHC will create an index file for the BAM file. This allows the read data to be fetched and viewed in the web IGV. BAMs can be processed using the following methods:

RNA Expression

RNA expression data can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene_id,gene_name,expression,raw_count,attributes,is_normalized,expression_unit
sample1,MT-TP,MT-TP,37.4555,41,"{'effectiveLength':'12','length':'68'}",True,tpm
sample1,MT-CYB,MT-CYB,4862.07,455676,"{'effectiveLength':'1027.42','length':'1141'}",True,tpm
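
As an illustration of producing a file in this schema, the sketch below writes the header and one row with Python's standard csv module. The function names are hypothetical, and rendering the attributes column in the single-quoted style of the sample rows is an assumption based only on those samples.

```python
import csv
import io

# Column order copied from the schema shown above.
EXPRESSION_COLUMNS = [
    "sample_id", "gene_id", "gene_name", "expression",
    "raw_count", "attributes", "is_normalized", "expression_unit",
]


def format_attributes(attrs):
    """Render a dict in the single-quoted style used by the sample rows.

    Assumption: this mirrors the samples; the accepted format is not
    specified in the documentation.
    """
    return "{" + ",".join(f"'{k}':'{v}'" for k, v in attrs.items()) + "}"


def expression_csv(rows):
    """Return CSV text: the header line followed by one line per row dict."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPRESSION_COLUMNS,
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        out = dict(row)
        out["attributes"] = format_attributes(row["attributes"])
        writer.writerow(out)
    return buffer.getvalue()


rows = [
    {
        "sample_id": "sample1", "gene_id": "MT-TP", "gene_name": "MT-TP",
        "expression": 37.4555, "raw_count": 41,
        "attributes": {"effectiveLength": "12", "length": "68"},
        "is_normalized": True, "expression_unit": "tpm",
    },
]
print(expression_csv(rows))
```

The csv module quotes the attributes value automatically because it contains commas, which matches the quoting seen in the sample rows.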

Expression files can be processed using the following methods:

Copy Number Variants

Copy number variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene,copy_number,status,attributes,chromosome,start_position,end_position,interpretation
sample1,HSD3B2,10.21,amplification,"{'SVTYPE':'<DUP>'}",chr1,119957553,119965658,N/A
sample1,HSD3B1,12.14,amplification,"{'SVTYPE':'<DUP>'}",chr1,120049825,120057677,N/A
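
Before uploading, a file can be checked locally against this schema. The following is a minimal sketch with a hypothetical function name. The specific checks (numeric copy number, integer positions, start not after end) are assumptions chosen for illustration, not rules documented by the PHC.

```python
import csv
import io

# Column order copied from the schema shown above.
CNV_COLUMNS = [
    "sample_id", "gene", "copy_number", "status", "attributes",
    "chromosome", "start_position", "end_position", "interpretation",
]


def check_cnv_csv(text):
    """Return a list of problems found in CNV CSV text (empty if none found)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != CNV_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            float(row["copy_number"])
            start = int(row["start_position"])
            end = int(row["end_position"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric copy number or position")
            continue
        if start > end:
            problems.append(f"line {line_no}: start_position is after end_position")
    return problems


sample = (
    "sample_id,gene,copy_number,status,attributes,chromosome,"
    "start_position,end_position,interpretation\n"
    "sample1,HSD3B2,10.21,amplification,\"{'SVTYPE':'<DUP>'}\","
    "chr1,119957553,119965658,N/A\n"
)
print(check_cnv_csv(sample))
```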

Copy number variant files can be processed using the following methods:

Structural Variants

Structural variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene1,gene2,effect,chromosome1,start_position1,end_position1,chromosome2,start_position2,end_position2,interpretation,sequence_type,in-frame,attributes
sample1,MRGPRF,N/A,translocation,chr11,68773114,68773114,chr17,70134939,70134939,N/A,somatic,N/A,"{'sv_type':'TRANSLOCATION'}"
sample1,DHX34,FUT1,duplication,chr19,47861484,47861484,chr19,49255988,49255988,N/A,somatic,N/A,"{'sv_type':'DUPLICATION'}"

Structural variant files can be processed using the following methods:

Re-Ingesting

If any of the omics types have already been ingested for a test, they will not be re-ingested. Only the types that have not yet been ingested will be processed. As a reminder, a unique test is identified by several fields: the file itself, the test Name, and the Test Type. To ingest data that has already been ingested, add the optional field reIngestFile|re-ingest-file to the request. The existing omics test will be used, but the file will be fully re-processed.
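
The behavior described above can be modeled with a small Python sketch. This is an illustrative model only, not the PHC implementation: the names OmicsTestKey and plan_ingest, the in-memory record of ingested types, and the omics type labels are all hypothetical.

```python
from typing import NamedTuple


class OmicsTestKey(NamedTuple):
    """The fields that together identify a unique omics test."""
    file_id: str
    name: str
    test_type: str


def plan_ingest(key, found_types, ingested, re_ingest_file=False):
    """Return the omics types that would be processed for a request.

    ingested maps an OmicsTestKey to the set of types already ingested.
    """
    if re_ingest_file:
        # The whole file is re-processed, so every type is included.
        return sorted(found_types)
    already_done = ingested.get(key, set())
    return sorted(set(found_types) - already_done)


key = OmicsTestKey("example-file-id", "Example test", "Heme")
ingested = {key: {"shortVariant"}}            # hypothetical type label
found = {"shortVariant", "copyNumber"}        # hypothetical type labels

print(plan_ingest(key, found, ingested))
print(plan_ingest(key, found, ingested, re_ingest_file=True))
```

Changing the Name or Test Type produces a different key, so in this model the same file submitted under a new name is treated as a new test.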

FHIR Resources

The following FHIR resources are generated as part of the variant processing workflow:


Last update: February 6, 2020