Analyzed Document Data

After a document has been analyzed and is visible in PrecisionOCR, it can be helpful to pull all of its predictions down, whether to comb through the large number of suggestions or to build an integration.

In this example, the goal is to retrieve an analyzed OCR document for a fictional patient, Dan Dennis, and to view all of the suggested medical codes along with the detailed textual elements of the uploaded PDF.

The first step is to find the matching patients by family name.

# Assumes authentication and the current project are already configured
import phc.easy as phc

phc.Patient.get_data_frame(term={
    "name.family.keyword": "Dennis"
})
id | account | resourceType | name_use | name_given_0 | name_family | birthDate.tz | birthDate.local | gender | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local
0c23c681-eeb7-491d-bb99-5ab77df53941 | sample | Patient | official | Dan | Dennis | 0 | 1983-02-24T00:00:00+00:00 | male | 0 | 2021-03-18T02:43:51.064000+00:00
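
When several patients share a family name, the returned data frame can be narrowed with ordinary pandas filtering before its id is passed on. The sketch below works on a small stand-in data frame that reuses the column names shown above instead of calling the API; the second row is invented purely for illustration.

```python
import pandas as pd

# Stand-in for the result of phc.Patient.get_data_frame(...).
# The second row is hypothetical and only makes the filter meaningful.
patients = pd.DataFrame(
    [
        {
            "id": "0c23c681-eeb7-491d-bb99-5ab77df53941",
            "name_given_0": "Dan",
            "name_family": "Dennis",
        },
        {
            "id": "00000000-0000-0000-0000-000000000000",
            "name_given_0": "Pat",
            "name_family": "Dennis",
        },
    ]
)

# Keep only the rows whose given name matches, then take the single id.
matches = patients[patients["name_given_0"] == "Dan"]
assert len(matches) == 1, "expected exactly one matching patient"
patient_id = matches["id"].iloc[0]
print(patient_id)
```

The resulting patient_id is the value passed to later calls in this walkthrough.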

Fetching Documents from PrecisionOCR

All documents in PrecisionOCR are stored as DocumentReference resources in the FHIR format. Because they carry a unique code that differentiates them from other documents in the account, the phc.Ocr.Document class can return only the PrecisionOCR documents. The documents for Dan Dennis are retrieved using the patient_id parameter.

phc.Ocr.Document.get_data_frame(patient_id="0c23c681-eeb7-491d-bb99-5ab77df53941")
id | account | resourceType | meta.tag_system__lifeomic.com/fhir/dataset__code | meta.tag_system__lifeomic.com/fhir/source__code | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local | type.coding_system__loinc.org__code | type.coding_system__loinc.org__display | indexed | status | meta.tag_system__lifeomic.com/ocr/document/status__code | docStatus | content | description
ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample | DocumentReference | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | PrecisionOCR Service | 0 | 2021-03-18T02:44:57.061000+00:00 | 11488-4 | Consult note | 2021-03-18T02:43:41.846Z | current | SUCCESS | preliminary | [...] | ocr-uploads/D D Notes.pdf

If the document ID is already known, a single record can be retrieved.

phc.Ocr.Document.get("ebb2ae5a-6563-4bfd-bcf8-de095bb203b1")

The content column, abbreviated as [...] above, holds a JSON list that links to the files associated with the document.

[{'attachment': {'contentType': 'application/pdf',
   'url': 'https://api.dev.lifeomic.com/v1/files/d2b334dd-5461-4480-b289-9ddb66717360'},
  'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
   'code': 'ocr-file-id',
   'display': 'OCR File Identifier'}},
 {'attachment': {'contentType': 'application/x-jsonlines',
   'url': 'https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33'},
  'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
   'code': 'ocr-text-file-id',
   'display': 'OCR Text File Identifier'}}]

The PDF file coded as 'ocr-file-id' is the original file in the file service, while the JSON Lines file coded as 'ocr-text-file-id' contains the textual elements extracted from the PDF. This second file is discussed further in the Fetching Extracted Textual Elements section.
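
To pick one of these files out of the content list, the entries can be matched on their format code. The following is a minimal sketch in plain Python; attachment_url is a hypothetical helper written for this example, not part of the SDK, and the list is copied from the output above.

```python
from typing import Optional

# The content value copied from the DocumentReference shown above.
content = [
    {
        "attachment": {
            "contentType": "application/pdf",
            "url": "https://api.dev.lifeomic.com/v1/files/d2b334dd-5461-4480-b289-9ddb66717360",
        },
        "format": {
            "system": "https://lifeomic.com/fhir/identifier-type",
            "code": "ocr-file-id",
            "display": "OCR File Identifier",
        },
    },
    {
        "attachment": {
            "contentType": "application/x-jsonlines",
            "url": "https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33",
        },
        "format": {
            "system": "https://lifeomic.com/fhir/identifier-type",
            "code": "ocr-text-file-id",
            "display": "OCR Text File Identifier",
        },
    },
]


def attachment_url(entries: list, code: str) -> Optional[str]:
    """Return the URL of the first attachment whose format code matches."""
    for entry in entries:
        if entry.get("format", {}).get("code") == code:
            return entry.get("attachment", {}).get("url")
    return None


print(attachment_url(content, "ocr-text-file-id"))
```

Returning None for an unknown code keeps the helper safe to use on documents that lack one of the two files.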

Fetching Page Data for Documents

Each page of an analyzed document is returned as a Composition resource that points back to its DocumentReference. The pages for a document are retrieved using the document_id parameter.

phc.Ocr.DocumentComposition.get_data_frame(
  document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1",
  all_results=True
)
resourceType title id meta.tag_system__lifeomic.com/fhir/source__code meta.tag_system__lifeomic.com/ocr/documents/page-number__code meta.tag_lastUpdated.tz meta.tag_lastUpdated.local date.tz date.local status subject.reference author.reference type.coding_system__loinc.org__code type.coding_system__loinc.org__display extension.url__lifeomic.com/fhir/ocr/page-rotation__valueInteger extension.url__lifeomic.com/fhir/ocr/page-dates__valueString extension.url__lifeomic.com/fhir/ocr/page-aspect-ratio__valueString extension.url__lifeomic.com/fhir/ocr/page-image__valueString extension.url__lifeomic.com/fhir/ocr/masked-word-ids__valueString text.status text.div relatesTo.code relatesTo.targetReference_reference account
Composition ... 83daf7bf-6468-45e1-a021-d634ef116521 PrecisionOCR Service 0 0 2021-03-18 02:44:53.520000+00:00 0 2021-03-18 02:44:50.787000+00:00 final user@example.com 34765-8 General medicine Note 0 nan 595.28 x 841.89 f14e940a-2690-406c-a916-c61ca935a71f nan generated Sinus bradycardia. L... transforms DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 sample
Composition There is no pleural ... 3124cbc7-fc06-47d0-9636-48efc0a59a21 PrecisionOCR Service 1 0 2021-03-18 02:44:53.522000+00:00 0 2021-03-18 02:44:50.817000+00:00 final user@example.com 34765-8 General medicine Note 0 [{"isRelative":false,"wordIds":["1385f0c... 595.28 x 841.89 b9c32f92-b3c0-452d-9cfd-6b2146fff097 29c02c00-f6ef-46bc-9a38-5ebd08d7ff94,f2b2ff2b... generated 2004-12-16 1:01 PM C... transforms DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 sample
Composition AORTIC VALVE: Normal... 1d194ecd-9ec8-4b22-9e48-b9efa42c9660 PrecisionOCR Service 2 0 2021-03-18 02:44:53.526000+00:00 0 2021-03-18 02:44:50.955000+00:00 final user@example.com 34765-8 General medicine Note 0 nan 595.28 x 841.89 a6781a42-6435-4104-b341-2661a600e80e 669ca3b9-d3c1-47b1-b412-a12b542dd3a4... generated PATIENT/TEST INFORMA... transforms DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 sample
Composition The left ventricular... 848c3cd3-38e1-4663-8394-0fd317c8fe6b PrecisionOCR Service 3 0 2021-03-18 02:44:53.523000+00:00 0 2021-03-18 02:44:50.840000+00:00 final user@example.com 34765-8 General medicine Note 0 nan 595.28 x 841.89 9775ee58-221a-450a-9fe6-854d5d508d2b nan generated Conclusions: PRE-BYP... transforms DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 sample
Composition QS deflections in le... 597ba7a0-85b0-4bc1-9055-6c497ea7ad19 PrecisionOCR Service 4 0 2021-03-18 02:44:53.524000+00:00 0 2021-03-18 02:44:50.876000+00:00 final user@example.com 34765-8 General medicine Note 0 [{"isRelative":false,"wordIds":["7e4c0fc... 595.28 x 841.89 901844d1-fc53-4b6f-8a2a-35c299a547a1 7e4c0fc4-1f65-46d5-a3a6-1eaa2ecd33dc... generated Sinus rhythm. Left a... transforms DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 sample
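
The page rows are not guaranteed to arrive in page order, and the page size is stored as a single string. The sketch below, which uses a stand-in data frame with three of the ids above, sorts by the page-number column and splits the size string into numbers. Treating the string as "width x height" is an assumption based on the sample values, as is the handling of page numbers as strings.

```python
import pandas as pd

PAGE_COL = "meta.tag_system__lifeomic.com/ocr/documents/page-number__code"
SIZE_COL = "extension.url__lifeomic.com/fhir/ocr/page-aspect-ratio__valueString"

# Stand-in for phc.Ocr.DocumentComposition.get_data_frame(...),
# deliberately shuffled so that the sort has something to do.
pages = pd.DataFrame(
    {
        "id": [
            "1d194ecd-9ec8-4b22-9e48-b9efa42c9660",
            "83daf7bf-6468-45e1-a021-d634ef116521",
            "3124cbc7-fc06-47d0-9636-48efc0a59a21",
        ],
        PAGE_COL: ["2", "0", "1"],
        SIZE_COL: ["595.28 x 841.89"] * 3,
    }
)

# to_numeric accepts both string and integer page numbers.
pages["page_number"] = pd.to_numeric(pages[PAGE_COL])
pages = pages.sort_values("page_number").reset_index(drop=True)

# Assumption: the string reads "<width> x <height>".
size = pages[SIZE_COL].str.split(" x ", expand=True).astype(float)
pages["page_width"] = size[0]
pages["page_height"] = size[1]

print(pages[["page_number", "id", "page_width", "page_height"]])
```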

Fetching Extracted Textual Elements

Aside from the medical code suggestions, each document carries a wealth of text and layout coordinates, whether it was scanned from a physical piece of paper or generated by a computer with embedded text metadata. As seen in the Fetching Documents from PrecisionOCR section, the linked JSONL file contains the extracted text and layout information, including the pages, lines, words, and even tables in a given document.

[...{'attachment': {'contentType': 'application/x-jsonlines',
   'url': 'https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33'},
  'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
   'code': 'ocr-text-file-id',
   'display': 'OCR Text File Identifier'}}]

The SDK refers to each of these elements as Block resources and automatically pulls the file down and converts it to a pandas DataFrame.

phc.Ocr.Block.get_data_frame(
  document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1"
)
BlockType | Relationships | Page | Confidence | Text | TextType | RowIndex | ColumnIndex | RowSpan | ColumnSpan | Polygon.X | Polygon.Y | Polygon.X_1 | Polygon.Y_1 | Polygon.X_2 | Polygon.Y_2 | Polygon.X_3 | Polygon.Y_3 | BoundingBox.Width | BoundingBox.Height | BoundingBox.Left | BoundingBox.Top
PAGE | [{'Type': 'CHILD', 'Ids': ['2076b805-39d3-4d78-ba9... | 1 | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0.9998 | 0 | 0.9998 | 1 | 0 | 1 | 0.9998 | 1 | 0 | 0
LINE | [{'Type': 'CHILD', 'Ids': ['840f6da1-3e59-4b8f-989... | 1 | 98.999 | Sinus bradycardia. Lateral T w... | nan | nan | nan | nan | nan | 0.0527 | 0.038 | 0.8421 | 0.038 | 0.8421 | 0.0503 | 0.0527 | 0.0503 | 0.7894 | 0.0122 | 0.0527 | 0.038
WORD | nan | 1 | 99.6008 | Sinus... | PRINTED | nan | nan | nan | nan | 0.0527 | 0.0382 | 0.0943 | 0.0382 | 0.0943 | 0.048 | 0.0527 | 0.048 | 0.0416 | 0.0098 | 0.0527 | 0.0382
WORD | nan | 1 | 97.2182 | bradycardia.... | PRINTED | nan | nan | nan | nan | 0.0991 | 0.0382 | 0.1909 | 0.0382 | 0.1909 | 0.0501 | 0.0991 | 0.0501 | 0.0919 | 0.0119 | 0.0991 | 0.0382
WORD | nan | 1 | 99.7431 | Lateral... | PRINTED | nan | nan | nan | nan | 0.1963 | 0.0384 | 0.2469 | 0.0384 | 0.2469 | 0.0479 | 0.1963 | 0.0479 | 0.0507 | 0.0095 | 0.1963 | 0.0384


Some of these elements reference child elements. The line "Sinus bradycardia. Lateral T w...", for example, contains the words "Sinus" and "bradycardia." Each block's position is recorded as a bounding box anchored at its upper-left corner (BoundingBox.Left and BoundingBox.Top), together with the corner points of its polygon.
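
For drawing or cropping, the bounding boxes usually need to be expressed in page units. The sketch below assumes that the coordinates are fractions of the page width and height, which the sample values between 0 and 1 suggest but the output does not state, and it reuses the 595.28 x 841.89 page size from the page data. The function to_page_units is a hypothetical helper, and the data frame is a stand-in built from the rows above.

```python
import pandas as pd

# Stand-in for phc.Ocr.Block.get_data_frame(...), reduced to the
# columns needed here and filled with values from the rows above.
blocks = pd.DataFrame(
    {
        "BlockType": ["LINE", "WORD", "WORD"],
        "Text": ["Sinus bradycardia. Lateral T w...", "Sinus", "bradycardia."],
        "BoundingBox.Left": [0.0527, 0.0527, 0.0991],
        "BoundingBox.Top": [0.038, 0.0382, 0.0382],
        "BoundingBox.Width": [0.7894, 0.0416, 0.0919],
        "BoundingBox.Height": [0.0122, 0.0098, 0.0119],
    }
)


def to_page_units(df: pd.DataFrame, page_width: float, page_height: float) -> pd.DataFrame:
    """Add x0/y0 (upper-left) and x1/y1 (lower-right) columns in page units.

    Assumes the bounding box values are fractions of the page size.
    """
    out = df.copy()
    out["x0"] = out["BoundingBox.Left"] * page_width
    out["y0"] = out["BoundingBox.Top"] * page_height
    out["x1"] = (out["BoundingBox.Left"] + out["BoundingBox.Width"]) * page_width
    out["y1"] = (out["BoundingBox.Top"] + out["BoundingBox.Height"]) * page_height
    return out


words = to_page_units(blocks[blocks["BlockType"] == "WORD"], 595.28, 841.89)
print(words[["Text", "x0", "y0", "x1", "y1"]].round(2))
```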

Fetching FHIR Suggestions

Of course, the most interesting data is in the extracted suggestions. The interface condenses its list of medications, procedures, and other resources to the obviously distinct codes. The underlying data, however, contains numerous slight variations in date ranges, code systems, and other features, which the user sees only after picking one of those suggestions.

In contrast to the PHC interface, the SDK builds a data frame of all permutations of the suggestions so that the data can be filtered easily. In the first two rows below, for example, the observation suggestions look very similar, but their codes differ. In other words, some id values are repeated across rows, with one or more of the other columns differing.

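# Keep the result in suggestion_df for the pandas examples further down
suggestion_df = phc.Ocr.Suggestion.get_data_frame(
  document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1",
  all_results=True
)
suggestion_df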
id type account project documentReference status originalText anchorDate suggestionId documentPage date_value date_confidence date_isPHI date_dataSource_source value_value value_confidence value_dataSource_source code_confidence code_dataSource_source code_value_system code_value_code code_value_display date_sourceText value_sourceText code_sourceText onsetDate_value onsetDate_confidence onsetDate_isPHI onsetDate_dataSource_source bodySite__item bodySite_confidence bodySite_dataSource_source bodySite_value_system bodySite_value_code bodySite_value_display onsetDate_sourceText bodySite_sourceText status_value status_confidence status_dataSource_source status_sourceText
b424ee52-5a75-4df2-b077-1e2096ffc529 observation sandbox 637475e1-3b26-4d78-87eb-5df66ab9ef59 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 OPEN 2004-12-26 08:00AM BLOOD WBC-1... 2021-03-18T12:44:35.969Z 00015-00004-00001 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 2004-12-26T12:00:00.000Z 0.999996 1 comprehend {'value': 2.68} 0.967934 comprehend 0.967934 loinc-codes http://loinc.org 33229-6 RBC casts [#/area] in Urine by Computer assisted method 2004-12-26 2.68 RBC nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
b424ee52-5a75-4df2-b077-1e2096ffc529 observation sandbox 637475e1-3b26-4d78-87eb-5df66ab9ef59 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 OPEN 2004-12-26 08:00AM BLOOD WBC-1... 2021-03-18T12:44:35.969Z 00015-00004-00001 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 2004-12-26T12:00:00.000Z 0.999996 1 comprehend {'value': 2.68} 0.967934 comprehend 0.967934 loinc-codes http://loinc.org 88970-9 RBC casts [#/area] in Urine sediment 2004-12-26 2.68 RBC nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
b30c3405-ed20-4c04-a730-c17e5b7e777b condition sandbox 637475e1-3b26-4d78-87eb-5df66ab9ef59 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 OPEN Admission VS72 12 180/90 64"20... 2021-03-18T12:44:35.577Z 00014-00015-00000 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00014 nan nan nan nan nan nan nan 0.835827 comprehend http://hl7.org/fhir/sid/icd-10 R09.89 Other specified symptoms and signs involving the circulatory and respiratory systems nan nan carotid bruits nan nan nan nan nan nan nan nan nan nan nan nan nan nan
b30c3405-ed20-4c04-a730-c17e5b7e777b condition sandbox 637475e1-3b26-4d78-87eb-5df66ab9ef59 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 OPEN Admission VS72 12 180/90 64"20... 2021-03-18T12:44:35.577Z 00014-00015-00000 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00014 nan nan nan nan nan nan nan 0.835827 comprehend http://hl7.org/fhir/sid/icd-10 I65.29 Occlusion and stenosis of unspecified carotid artery nan nan carotid bruits nan nan nan nan nan nan nan nan nan nan nan nan nan nan
0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9 procedure sandbox 637475e1-3b26-4d78-87eb-5df66ab9ef59 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 OPEN MCV-87 MCH-30.0 MCHC-34.7 RDW-... 2021-03-18T12:44:35.970Z 00015-00005-00000 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 nan nan nan nan nan nan nan 0.976198 loinc-codes http://loinc.org 30428-7 MCV [Entitic volume] nan MCV nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9 procedure sandbox 637475e1-3b26-4d78-87eb-5df66ab9ef59 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 OPEN MCV-87 MCH-30.0 MCHC-34.7 RDW-... 2021-03-18T12:44:35.970Z 00015-00005-00000 ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 nan nan nan nan nan nan nan 0.976198 loinc-codes http://loinc.org 787-2 MCV [Entitic volume] by Automated count nan MCV nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
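
Because an id repeats once per variation, grouping on id recovers one row per suggestion with its candidate codes collected alongside. This is a sketch on a stand-in data frame that keeps only four of the columns and the six rows shown above; the output column names are chosen for the example.

```python
import pandas as pd

# Stand-in for phc.Ocr.Suggestion.get_data_frame(...), using the
# ids, types, codes, and source texts from the rows above.
suggestions = pd.DataFrame(
    {
        "id": [
            "b424ee52-5a75-4df2-b077-1e2096ffc529",
            "b424ee52-5a75-4df2-b077-1e2096ffc529",
            "b30c3405-ed20-4c04-a730-c17e5b7e777b",
            "b30c3405-ed20-4c04-a730-c17e5b7e777b",
            "0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9",
            "0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9",
        ],
        "type": ["observation"] * 2 + ["condition"] * 2 + ["procedure"] * 2,
        "code_value_code": ["33229-6", "88970-9", "R09.89", "I65.29", "30428-7", "787-2"],
        "code_sourceText": ["RBC", "RBC", "carotid bruits", "carotid bruits", "MCV", "MCV"],
    }
)

# One row per suggestion id, with every candidate code kept in a list.
per_suggestion = (
    suggestions.groupby("id", sort=False)
    .agg(
        type=("type", "first"),
        source_text=("code_sourceText", "first"),
        candidate_codes=("code_value_code", list),
    )
    .reset_index()
)

print(per_suggestion)
```

Passing sort=False keeps the suggestions in their original order, which makes the condensed frame easier to compare against the full one.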


From this data frame of suggestions, pandas can be used to summarize and filter the results.

# =========================================================
# Get permutation counts of different types of suggestions
# =========================================================
suggestion_df.type.value_counts()
# =>
# condition                   9283
# medicationAdministration     931
# procedure                    914
# observation                  348
# Name: type, dtype: int64

# =====================================================
# Get medication source texts for the suggested codes
# =====================================================
suggestion_df[suggestion_df.type == "medicationAdministration"].code_sourceText.unique()
# =>
# ['K-3.9 CI-99 HCO3-29 AnGap-13', 'ASA', 'Lopressor XL', 'Lipitor',
#  'Moexpril', 'Glucosamine/chondroitin Gaviscon', 'Docusate Sodium',
#  'Ranitidine HCI', 'Aspirin', 'Oxycodone-Acetaminophen',
#  'Atorvastatin', 'Amiodarone', 'Insulin Lispro', 'Insulin Glargine',
#  'insulin', 'Furosemide', 'Warfarin', 'Insulin lantus', 'humalog',
#  'beta blockade', 'amiodarone', 'coumadin', 'Toprol XL', 'Moexipril',
#  'COumadin', 'Coumadin']
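
The source texts above include several spellings that differ only by letter case, such as 'Coumadin', 'COumadin', and 'coumadin'. One simple way to fold these together is to normalize the strings before counting, as sketched below on a handful of the values. This does not merge brand and generic names (for example, Coumadin and Warfarin), which would need a proper terminology lookup.

```python
import pandas as pd

# A few of the medication source texts from the output above.
source_texts = pd.Series(
    ["Coumadin", "COumadin", "coumadin", "Amiodarone", "amiodarone", "Warfarin"]
)

# Trim whitespace and lower-case so that case-only variants collapse.
normalized = source_texts.str.strip().str.lower()
counts = normalized.value_counts()

print(counts)
```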

Last update: 2021-03-23