{"id":702091,"date":"2020-10-28T14:17:53","date_gmt":"2020-10-28T21:17:53","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-research-item&#038;p=702091"},"modified":"2021-09-24T10:59:47","modified_gmt":"2021-09-24T17:59:47","slug":"an-integrated-approach-of-deep-learning-and-symbolic-analysis-for-digital-pdf-table-extraction","status":"publish","type":"msr-research-item","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/an-integrated-approach-of-deep-learning-and-symbolic-analysis-for-digital-pdf-table-extraction\/","title":{"rendered":"An Integrated Approach of Deep Learning and Symbolic Analysis for Digital PDF Table Extraction"},"content":{"rendered":"<p>Deep learning has shown great success at interpreting unstructured data such as object recognition in images. Symbolic\/logical-reasoning techniques have shown great success in interpreting structured data such as table extraction in webpages, custom text files, spreadsheets. The tables in PDF documents are often generated from such structured sources (text-based Word\/Latex documents, spreadsheets, webpages) but end up being unstructured. We thus explore novel combinations of deep learning and symbolic reasoning techniques to build an effective solution for PDF table extraction. We evaluate effectiveness without granting partial credit for matching part of a table (which may cause silent errors in downstream data processing). Our method achieves a 0.725 F1 score (vs. 0.339 for the state-of-the-art) on detecting correct table bounds\u2014a much stricter metric than the common one of detecting characters within tables\u2014in a well known public benchmark (ICDAR 2013) and a 0.404 F1 score (vs. 0.144 for the state-of-the-art) on our private benchmark with more widely varied table structures.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Deep learning has shown great success at interpreting unstructured data such as object recognition in images. Symbolic\/logical-reasoning techniques have shown great success in interpreting structured data such as table extraction in webpages, custom text files, spreadsheets. The tables in PDF documents are often generated from such structured sources (text-based Word\/Latex documents, spreadsheets, webpages) but end [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"IEEE","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"International Conference on Pattern Recognition","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"3096126694","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":null,"msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2020-10-27","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"http:\/\/www.icpr2020.it\/","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13563,13560],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[246694,251923,246691,246658,248692,256189,246808,260152,259540,251767,250582],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-702091","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-data-platform-analytics","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-field-of-study-artificial-intelligence","msr-field-of-study-benchmark-computing","msr-field-of-study-computer-science","msr-field-of-study-deep-learning","msr-field-of-study-f1-score","msr-field-of-study-matching-statistics","msr-field-of-study-natural-language-processing","msr-field-of-study-symbolic-data-analysis","msr-field-of-study-table-database","msr-field-of-study-unstructured-data","msr-field-of-study-web-page"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-10-27","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"IEEE","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/10\/An-Integrated-Approach-of-Deep-Learning-and-Symbolic-Analysis-for-Digital-PDF-Table-Extraction.pdf","label_id":"243132","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/an-integrated-approach-of-deep-learning-and-symbolic-analysis-for-digital-pdf-table-extraction\/","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":702097,"url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/10\/An-Integrated-Approach-of-Deep-Learning-and-Symbolic-Analysis-for-Digital-PDF-Table-Extraction.pdf"}],"msr-author-ordering":[{"type":"guest","value":"mengshi-zhang","user_id":778402,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=mengshi-zhang"},{"type":"user_nicename","value":"Daniel Perelman","user_id":39453,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Daniel Perelman"},{"type":"user_nicename","value":"Vu Le","user_id":39174,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Vu Le"},{"type":"user_nicename","value":"Sumit Gulwani","user_id":33755,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Sumit Gulwani"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[663303],"msr_project":[817234,672951],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":817234,"post_title":"PROSE-Powered Data Ingestion (a.k.a. Table Extraction)","post_name":"prose-powered-data-ingestion","post_type":"msr-project","post_date":"2022-01-31 17:52:00","post_modified":"2022-04-19 13:39:04","post_status":"publish","permalink":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/prose-powered-data-ingestion\/","post_excerpt":"\u200bWhen the COVID-19 pandemic was in its early stages, several agencies published infection and mortality data for different geographical regions in the public domain.&nbsp;This data appeared in web pages, CSV files, JSON files, and more.&nbsp;There was plenty of useful data out there, but before one could use this data to generate models and visualizations, one had to ingest the data into a tabular data frame and clean it.&nbsp;The task of extracting tables from the varied&hellip;","_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/817234"}]}},{"ID":672951,"post_title":"Predictive Program Synthesis","post_name":"predictive-program-synthesis","post_type":"msr-project","post_date":"2020-07-07 20:17:25","post_modified":"2022-04-19 15:47:55","post_status":"publish","permalink":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/predictive-program-synthesis\/","post_excerpt":"Program synthesis technologies help users to easily automate tasks that would otherwise require significant manual effort or programming skills. For instance, programming-by-example or natural language programming approaches allow the user to express intent by giving examples or natural language descriptions of the task, from which the system can synthesize a program in a formal programming language to complete the task. In this project, we are exploring the novel notion of predictive program synthesis, which is&hellip;","_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/672951"}]}}]},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/702091","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/702091\/revisions"}],"predecessor-version":[{"id":778423,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/702091\/revisions\/778423"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=702091"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=702091"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=702091"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=702091"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=702091"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=702091"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=702091"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=702091"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=702091"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=702091"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=702091"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=702091"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=702091"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}