Data Processing

With the PHC, in addition to being able to store omics data, one can also let the PHC index it. Once indexed, the data is available for deeper analysis using the Omics Explorer and other parts of the platform.

Definitions

  • Project - A logical grouping of omics and clinical data within the PHC. A project specifies the reference genome of the data within it. Reference genomes GRCh37 and GRCh38 are currently supported in the PHC. It is not possible to mix data from different genomes within the same project.
  • Omics Test - A logical grouping of omics data records that represent a single test or event.
  • Subject - The biological source of omics data.

Overview

The basic process for adding omics data to the PHC involves the following steps:

  1. Upload the omics data file to the PHC.
  2. Either initiate the indexing action appropriate for the omics data type, which involves associating the omics data with a Subject record in the PHC, or upload a manifest file that defines the genomic test and associates the files with a subject. See the Omics Test Manifest File section for more information.
  3. The PHC processes the file. During this time, the data is transformed and indexed to make it highly available and queryable. For certain data types, external knowledge bases are brought in and the omics data is annotated with additional information. For example, short variant records are annotated with information from sources such as ClinVar and dbSNP.
  4. After the data processing completes, the omics data can be accessed via the Omics Explorer or other parts of the platform like Tasks or the Data Lake.

Processing Time Frames

The time frame for omics data to be available in the platform is dependent upon several factors.

  • Number of omics files and overall size of the data being ingested at the same time
  • Current overall size of the omics data in a project. Indexing times can increase as the size of the project increases.

Resources

The following resources can be used to perform the actions needed to add omics data to the PHC.

  • Core API - Provides resources for managing projects, files, and other omics features
  • GA4GH API - Provides resources for managing omics resources
  • PHC Python SDK - A Python development kit for interfacing with the PHC
  • LifeOmic CLI - A command line interface for interacting with the PHC
  • Omics Dashboard - Coming Soon

Omics Test Manifest File

Using a manifest file allows one to initiate the indexing process simply by uploading a file to a project, rather than by issuing API commands as described above. The manifest file describes the omics test by listing the files included in the test and referencing a patient. The name of the manifest file must end with the .ga4gh.yml extension, and the file must be located within the .lifeomic folder in the PHC project, either directly or in a subfolder. Examples of valid file names are: .lifeomic/manifests/test1.ga4gh.yml and .lifeomic/test2.ga4gh.yml.
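
The naming rule above can be checked before uploading. The following is a minimal sketch, not part of the PHC tooling: the function name is hypothetical, and it only encodes the two conditions stated in this section (the file sits under .lifeomic and its name ends with .ga4gh.yml).

```python
from pathlib import PurePosixPath

MANIFEST_ROOT = ".lifeomic"
MANIFEST_SUFFIX = ".ga4gh.yml"


def is_manifest_path(file_name: str) -> bool:
    """Return True if file_name looks like a valid manifest location."""
    parts = PurePosixPath(file_name).parts
    # The file must live in .lifeomic, directly or in a subfolder ...
    in_root = len(parts) >= 2 and parts[0] == MANIFEST_ROOT
    # ... and its name must end with the .ga4gh.yml extension.
    return in_root and parts[-1].endswith(MANIFEST_SUFFIX)
```

For example, is_manifest_path(".lifeomic/manifests/test1.ga4gh.yml") is true, while a file named omics/test1.ga4gh.yml is rejected because it is outside the .lifeomic folder.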

Patient Matching

Matching a manifest file to a patient in the PHC can be done in one of two ways:

  • By providing the Patient FHIR resource ID in the patientId property.
  • By providing at least two of the following three properties: patientDOB, patientLastName, and patientIdentifier. If all three are provided, a match is attempted using all three. If a match with all three cannot be found, then a match using a combination of two of the properties is attempted. If only two properties are provided in the manifest file, then a match is attempted using just those.
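
The matching rules above can be sketched as a small function that lists which property sets would be tried for a manifest entry. This is an illustration only: the function name is hypothetical, giving patientId precedence when it is present is an assumption, and the order in which pairs are tried after a failed three-property match is also an assumption, since the text does not specify either.

```python
from itertools import combinations

MATCH_PROPERTIES = ("patientDOB", "patientLastName", "patientIdentifier")


def match_attempts(test: dict) -> list[tuple[str, ...]]:
    """List the property sets a patient match could be attempted with, in order."""
    if test.get("patientId"):
        # Assumption: a FHIR resource ID is used on its own when present.
        return [("patientId",)]
    present = tuple(p for p in MATCH_PROPERTIES if test.get(p))
    if len(present) < 2:
        raise ValueError(
            "need patientId or at least two of: " + ", ".join(MATCH_PROPERTIES)
        )
    attempts = [present]
    if len(present) == 3:
        # Fall back to each pair; the order of the pairs is an assumption.
        attempts.extend(combinations(present, 2))
    return attempts
```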

The schema of the manifest file must match the following:

  • tests - List of genomic tests. Multiple tests can be listed in order to index a batch at one time
    • name - A friendly name for the genomic test (required)
    • testType - The type of test that was performed (required)
    • reference - The reference build. Must be GRCh37 or GRCh38 and must match the reference build configured on the PHC project. You cannot add tests from different reference builds in the same project. (required)
    • patientIdentifier - The patient identifier for the test. This can be a MRN or some other ID associated with the patient record (required with patientDOB or patientLastName)
    • patientLastName - The patient's last name (required with patientDOB or patientIdentifier)
    • patientDOB - The patient's date of birth (required with patientLastName or patientIdentifier)
    • patientId - The patient's FHIR resource ID (required if patientLastName, patientIdentifier, and patientDOB are not present)
    • patientInfo - Additional patient information that can be used to create the Patient FHIR resource if it does not already exist
      • firstName - Patient first name
      • lastName - Patient last name
      • dob - Patient date of birth
      • gender - Patient gender
      • identifiers - List of patient identifiers
        • value - The identifier value
        • system - The namespace for the identifier value
        • codingSystem - Identity of the terminology system
        • codingCode - Code symbol defined by the system
    • indexedDate - The date the test was performed (defaults to current date if not provided)
    • performerId - The ID of the FHIR Organization resource that performed the test
    • bodySite - Coded value for the body site where the test specimen was taken from
    • bodySiteSystem - Identity of the terminology system for the body site value
    • bodySiteDisplay - Friendly display name for body site value
    • sourceFile - The name of the file that was used to generate the genomic test files
    • reportFile - The name of a report file to associate with the genomic test
    • msi - For a somatic test, the Microsatellite Instability result. Must be low, stable, high, or indeterminate
    • tmb - For a somatic test, the Tumor Mutation Burden result. Must be low, high, indeterminate, or unknown
    • tmbScore - For a somatic test, the Tumor Mutation Burden numeric score
    • files - List of genomic files to include with the test (required)
      • type - Valid values are shortVariant, read, expression, copyNumberVariant, or structuralVariant (required)
      • sequenceType - Valid values are germline, somatic, metastatic, ctDNA, or rna (required)
      • fileName - The name of the file in the PHC project (required)
      • name - A display name to give the test file. Defaults to the name of the file
      • normalize - For a shortVariant VCF file, set to true if the data should be normalized by the PHC
      • passOnly - For a shortVariant VCF file, set to true to exclude variants that do not have a filter value of PASS.
      • updateSample - For a shortVariant VCF file, set to true to generate a unique sample name. Defaults to the value in the VCF
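
A manifest entry can be checked against the required properties and allowed values listed above before it is uploaded. The sketch below is a hypothetical helper, not the PHC's own validation. It assumes the manifest has already been parsed into Python dictionaries, and it deliberately leaves out the patient matching properties.

```python
REFERENCES = {"GRCh37", "GRCh38"}
FILE_TYPES = {"shortVariant", "read", "expression", "copyNumberVariant", "structuralVariant"}
SEQUENCE_TYPES = {"germline", "somatic", "metastatic", "ctDNA", "rna"}
MSI_VALUES = {"low", "stable", "high", "indeterminate"}
TMB_VALUES = {"low", "high", "indeterminate", "unknown"}


def _check_enum(errors, where, key, value, allowed):
    # Only complain when a value is present and not in the allowed set.
    if value is not None and value not in allowed:
        errors.append(f"{where}{key} must be one of {sorted(allowed)}, got {value!r}")


def validate_test(test: dict) -> list[str]:
    """Return a list of problems found in one entry of the tests list."""
    errors = []
    for key in ("name", "testType", "reference", "files"):
        if not test.get(key):
            errors.append(f"missing required property: {key}")
    _check_enum(errors, "", "reference", test.get("reference"), REFERENCES)
    _check_enum(errors, "", "msi", test.get("msi"), MSI_VALUES)
    _check_enum(errors, "", "tmb", test.get("tmb"), TMB_VALUES)
    for index, entry in enumerate(test.get("files") or []):
        where = f"files[{index}]."
        for key in ("type", "sequenceType", "fileName"):
            if not entry.get(key):
                errors.append(f"missing required property: {where}{key}")
        _check_enum(errors, where, "type", entry.get("type"), FILE_TYPES)
        _check_enum(errors, where, "sequenceType", entry.get("sequenceType"), SEQUENCE_TYPES)
    return errors
```

An empty list means no problems were found by these checks; it does not guarantee that the PHC will accept the manifest.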

Here is an example manifest file for a test that includes files for all omics types.

---
tests:
    - name: Big Genomic Test
      testType: Germline/Somatic Combo
      patientIdentifier: '10005'
      indexedDate: 2020/01/01
      reference: GRCh37
      bodySite: brain
      bodySiteDisplay: Brain
      bodySiteSystem: http://lifeomic.com/codes
      msi: stable
      tmb: low
      tmbScore: 10
      reportFile: reports/test1.pdf
      files:
          - type: copyNumberVariant
            sequenceType: somatic
            fileName: omics/test1.copynumber.csv
          - type: shortVariant
            sequenceType: somatic
            fileName: omics/test1.somatic.vcf.gz
            normalize: true
            passOnly: true
            updateSample: true
          - type: expression
            sequenceType: somatic
            fileName: omics/test1.expression.rgel
          - type: structuralVariant
            sequenceType: somatic
            fileName: omics/test1.structural.csv
          - type: read
            sequenceType: somatic
            fileName: omics/test1.somatic.bam
          - type: read
            sequenceType: somatic
            fileName: omics/test1.rna.bam
          - type: shortVariant
            sequenceType: germline
            fileName: omics/test1.vcf.gz
          - type: read
            sequenceType: germline
            fileName: omics/test1.germline.bam

Best Practices

  • When adding files to a PHC Project, virtual folders can be created by using the / delimiter in the name of the file. Example: Adding a file with a name of /path/file.txt will make it appear that the file file.txt exists under the path folder from the Files Web Console. This can help with organization as the number of files in a project increases.
  • When submitting the requests to process an omics data file, many of the API requests and CLI commands take a common set of fields that you should try to provide values for:
    • Name (name) - Use a descriptive name here as this is the value that will show up in many of the user interfaces like the Omics Explorer.
    • Test Type (testType|test-type) - Specify the type of test that was performed. Example: For Foundation Medicine, this could be Heme. Note: The Name and Test Type fields are used to help identify an omics test and prevent duplicates should the same omics file be ingested again for a Subject. Use unique values for these fields to identify each test for a given Subject.
    • Indexed Date (indexedDate|indexed-date) - Specify the actual date that the test was performed or the data was created. The PHC will later capture the dates when data was added to the PHC.
    • Performer ID (performerId|performer-id) - Specify the ID of a FHIR Organization resource to represent the entity that performed the test or created the data. You can filter subjects by this value later to see which ones had tests performed by a certain provider.
    • Body Site (bodySite|body-site) - Specify a code from a terminology system to identify the body site of the sample that was used to produce the test results
    • Body Site System (bodySiteSystem|body-site-system) - Specify the terminology system of the body site code
    • Body Site Display (bodySiteDisplay|body-site-display) - Specify a friendly display value for body site code
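
Each field above is listed as a pair: the API field name in camelCase and the CLI option name in kebab-case. The conversion between the two can be sketched as follows. The helper name is hypothetical, the pattern has only been checked against the pairs shown in this document, and the result is the bare option name without any leading dashes.

```python
import re


def to_cli_option(field: str) -> str:
    """Convert an API field name such as bodySiteSystem to body-site-system."""
    # Insert a hyphen before each upper-case letter (except at the start),
    # then lower-case the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "-", field).lower()
```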

Omics File Management

Omics files can be uploaded and managed using the following methods:

Data Sources

Foundation Medicine

Foundation Medicine XML test files can be processed using the following methods:

  • Foundation Tasks API
  • lo tasks create-foundation-xml-import CLI subcommand
  • Omics Dashboard - Coming Soon

A single omics test will be created for all variant types found in the XML file. Also, note that the reportFileId|report-file-id option allows a PDF file to be linked to the PHC Subject.

NantOmics

NantOmics test files can be processed using the following methods:

  • NantOmics Tasks API
  • lo tasks create-nantomics-vcf-import CLI subcommand
  • Omics Dashboard - Coming Soon

NantOmics tests normally provide separate files for short and structural variants. A separate request has to be made to process each file. The type of data being added is denoted by setting the uploadType|upload-type field of the request to variant or fnv. A single omics test will be created for the subject from both the short and structural variant files.

Ashion

Ashion GEM ExTra test TAR files can be processed using the following methods:

  • Ashion Tasks API
  • lo tasks create-ashion-import CLI subcommand
  • Omics Dashboard - Coming Soon

A single omics test will be created for all the data types found in the GEM ExTra TAR file.

Short Variants

VCF files can be processed by the PHC to add genomic short variants to a project. The PHC will run a normalization process on the VCF to filter out any unsupported regions. One can also specify a list of VCFs, which will be combined with any duplicates removed. VCFs can be processed using the following methods:

Reads

BAM files can be processed by the PHC to add genomic read data to a project. The PHC will create an index file for the BAM file. This allows the read data to be fetched and viewed in the web IGV. BAMs can be processed using the following methods:

RNA Expression

RNA expression data can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene_id,gene_name,expression,raw_count,attributes,is_normalized,expression_unit
sample1,MT-TP,MT-TP,37.4555,41,"{'effectiveLength':'12','length':'68'}",True,tpm
sample1,MT-CYB,MT-CYB,4862.07,455676,"{'effectiveLength':'1027.42','length':'1141'}",True,tpm
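
Before uploading, a file in this layout can be read back with Python's standard csv module as a sanity check. The sketch below is illustrative only. The column types (float for expression, integer for raw_count, a True/False flag for is_normalized) are inferred from the sample rows rather than from a published schema, and the attributes column is parsed with ast.literal_eval because the sample uses single-quoted keys and values.

```python
import ast
import csv
import io

SAMPLE = """\
sample_id,gene_id,gene_name,expression,raw_count,attributes,is_normalized,expression_unit
sample1,MT-TP,MT-TP,37.4555,41,"{'effectiveLength':'12','length':'68'}",True,tpm
sample1,MT-CYB,MT-CYB,4862.07,455676,"{'effectiveLength':'1027.42','length':'1141'}",True,tpm
"""


def read_expression_rows(text: str) -> list[dict]:
    """Parse expression CSV text into typed dictionaries (types are assumed)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "sample_id": row["sample_id"],
            "gene_id": row["gene_id"],
            "gene_name": row["gene_name"],
            "expression": float(row["expression"]),
            "raw_count": int(row["raw_count"]),
            # The attributes cell is a quoted, dictionary-like string.
            "attributes": ast.literal_eval(row["attributes"]),
            "is_normalized": row["is_normalized"] == "True",
            "expression_unit": row["expression_unit"],
        })
    return rows


rows = read_expression_rows(SAMPLE)
```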

Expression files can be processed using the following methods:

Copy Number Variants

Copy number variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene,copy_number,status,attributes,chromosome,start_position,end_position,interpretation
sample1,HSD3B2,10.21,amplification,"{'SVTYPE':'<DUP>'}",chr1,119957553,119965658,N/A
sample1,HSD3B1,12.14,amplification,"{'SVTYPE':'<DUP>'}",chr1,120049825,120057677,N/A

File Schema

  • sample_id - a required string value
  • gene - a required string value
  • copy_number - a required double value
  • status - an optional string value
  • attributes - an optional string value representing a JSON object to store meta data
  • chromosome - an optional string value
  • start_position - an optional long value, representing start position of the chromosome
  • end_position - an optional long value, representing end position of the chromosome
  • interpretation - an optional string value

NOTE: For any optional string value, N/A or . may be used to indicate a missing value. However, this is not required.
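
The file schema above can be applied to a single row, for example one produced by csv.DictReader, as in the following sketch. The function name is hypothetical. Treating an empty cell as missing, and accepting N/A or . in the position columns as well as the string columns, are assumptions that go slightly beyond the note above.

```python
MISSING = {"", "N/A", "."}


def parse_cnv_row(row: dict) -> dict:
    """Check required columns and convert one copy number variant row."""

    def optional(key, cast=str):
        value = (row.get(key) or "").strip()
        return None if value in MISSING else cast(value)

    for key in ("sample_id", "gene", "copy_number"):
        if not (row.get(key) or "").strip():
            raise ValueError(f"missing required column: {key}")
    return {
        "sample_id": row["sample_id"],
        "gene": row["gene"],
        "copy_number": float(row["copy_number"]),  # required double
        "status": optional("status"),
        "attributes": optional("attributes"),
        "chromosome": optional("chromosome"),
        "start_position": optional("start_position", int),  # optional long
        "end_position": optional("end_position", int),  # optional long
        "interpretation": optional("interpretation"),
    }
```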

Copy number variant files can be processed using the following methods:

Structural Variants

Structural variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene1,gene2,effect,chromosome1,start_position1,end_position1,chromosome2,start_position2,end_position2,interpretation,sequence_type,in-frame,attributes
sample1,MRGPRF,N/A,translocation,chr11,68773114,68773114,chr17,70134939,70134939,N/A,somatic,N/A,"{'sv_type':'TRANSLOCATION'}"
sample1,DHX34,FUT1,duplication,chr19,47861484,47861484,chr19,49255988,49255988,N/A,somatic,N/A,"{'sv_type':'DUPLICATION'}"

File Schema

  • sample_id - a required string value
  • gene1 - a required string value, N/A a logical substitute for when one does not exist
  • gene2 - a required string value, N/A a logical substitute for when one does not exist
  • effect - an optional string value
  • chromosome1 - an optional string value
  • start_position1 - an optional long value, representing start position of chromosome1
  • end_position1 - an optional long value, representing end position of chromosome1
  • chromosome2 - an optional string value
  • start_position2 - an optional long value, representing start position of chromosome2
  • end_position2 - an optional long value, representing end position of chromosome2
  • interpretation - an optional string value
  • sequence_type - an optional string value
  • in-frame - an optional string value
  • attributes - an optional string value representing a JSON object to store meta data

NOTE: For any optional string value, N/A or . may be used to indicate a missing value. However, this is not required.
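
A structural variant file in this layout can be produced with Python's standard csv module. The sketch below is one possible approach, not a PHC utility. It writes N/A for any string column that is not supplied, which follows the schema above. Leaving a missing position column empty is an assumption, since the schema only describes the N/A convention for string values.

```python
import csv
import io

SV_COLUMNS = [
    "sample_id", "gene1", "gene2", "effect",
    "chromosome1", "start_position1", "end_position1",
    "chromosome2", "start_position2", "end_position2",
    "interpretation", "sequence_type", "in-frame", "attributes",
]
POSITION_COLUMNS = {"start_position1", "end_position1", "start_position2", "end_position2"}


def to_structural_variant_csv(records: list[dict]) -> str:
    """Serialize records to CSV text using the column order shown above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        if not record.get("sample_id"):
            raise ValueError("sample_id is required")
        row = {}
        for column in SV_COLUMNS:
            value = record.get(column)
            if value is None:
                # Assumption: positions stay empty, string columns get N/A.
                value = "" if column in POSITION_COLUMNS else "N/A"
            row[column] = value
        writer.writerow(row)
    return buffer.getvalue()
```

Because the csv module only quotes a cell when it contains a comma, a double quote, or a line break, an attributes value with a single key is written without surrounding quotes, while one with several comma-separated keys is quoted automatically.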

Structural variant files can be processed using the following methods:

Re-Ingesting

If any of the omics types have already been ingested for a test, they will not be re-ingested; only the types that have not yet been ingested will be processed. As a reminder, a unique test is identified by several fields: the file itself and the test Name and Test Type fields. To re-ingest data that has already been ingested, add the optional field (reIngestFile|re-ingest-file) to the request. The existing omics test will be used, but the file will be fully re-processed.
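
The rule can be summarized as a small piece of set logic. This is a conceptual sketch of the behavior described in this section, not the PHC's implementation, and the function and parameter names are hypothetical.

```python
def types_to_process(
    requested: set[str],
    already_ingested: set[str],
    re_ingest_file: bool = False,
) -> set[str]:
    """Return the omics types that would be processed for an existing test."""
    if re_ingest_file:
        # With the re-ingest option set, everything requested is re-processed.
        return set(requested)
    # Otherwise, only the types not yet ingested are processed.
    return requested - already_ingested
```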

FHIR Resources

The following FHIR resources are generated as part of the variant processing workflow:


Last update: 2020-10-02