#StackBounty: #pdf Converting Math Symbols from PDF into LaTeX

Bounty: 50

I am trying to extract math content from LaTeX-generated PDF files. Most symbols are extracted correctly. However, some, such as epsilon, Updownarrow and simeq, use non-Unicode codes, and others, such as neq, use a combination of non-Unicode codes.

  • epsilon is written using the embedded font SCCPFS+CMMI10 and code 017 (octal, i.e. 0x0f)
  • Updownarrow using the embedded font KAXSYH+CMSY10 and code 0x6d (m)
  • simeq using the embedded font KAXSYH+CMSY10 and code 0x27 (')
  • neq using the embedded font KAXSYH+CMSY10 and codes 0x36 (6, drawn as the negation slash) and 0x3d (=)

Before I start writing a table that maps glyph code(s) to the equivalent LaTeX, I wonder whether such a table already exists in the reverse direction, for use within LaTeX. After all, somewhere the original epsilon, neq, etc. must get mapped to one or more glyph codes. The combination cases will also require position information, but that should be available in the reverse direction too.
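To make the idea of such a table concrete, here is a minimal Python sketch of the forward direction (glyph codes to LaTeX). It contains only the four cases listed above; the names `GLYPH_TO_LATEX`, `base_font_name` and `to_latex` are hypothetical, and it assumes the extractor delivers the font name and a list of integer codes per text run. It ignores glyph positions, so a real implementation would still need to check that the two glyphs of a combination actually overlap.

```python
import re

# Hypothetical lookup table: (base font name, tuple of glyph codes) -> LaTeX macro.
# Only the four cases from the question are listed; everything else is unmapped.
GLYPH_TO_LATEX = {
    ("CMMI10", (0x0F,)): r"\epsilon",
    ("CMSY10", (0x6D,)): r"\Updownarrow",
    ("CMSY10", (0x27,)): r"\simeq",
    ("CMSY10", (0x36, 0x3D)): r"\neq",  # two-glyph combination
}

# Longest code sequence in the table, used to bound the greedy search.
MAX_SEQUENCE = max(len(codes) for _, codes in GLYPH_TO_LATEX)


def base_font_name(name):
    """Strip the six-letter subset tag, e.g. 'KAXSYH+CMSY10' -> 'CMSY10'."""
    return re.sub(r"^[A-Z]{6}\+", "", name)


def to_latex(font, codes):
    """Translate a run of glyph codes from one font, longest match first.

    Unmapped codes are reported as None so the caller can fall back to
    whatever the PDF extractor produced for them.
    """
    base = base_font_name(font)
    out = []
    i = 0
    while i < len(codes):
        for n in range(min(MAX_SEQUENCE, len(codes) - i), 0, -1):
            macro = GLYPH_TO_LATEX.get((base, tuple(codes[i:i + n])))
            if macro is not None:
                out.append(macro)
                i += n
                break
        else:
            out.append(None)  # no entry for this code
            i += 1
    return out


print(to_latex("KAXSYH+CMSY10", [0x36, 0x3D, 0x27]))
print(to_latex("SCCPFS+CMMI10", [0o17]))
```

Trying the longest sequence first is what lets the two-code neq entry win over any single-code entries that might later be added for 0x36 or 0x3d on their own.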

EDIT: I tried to look up this information in the font tables, but there are no entries in GSUB and GPOS. Is that where I should be looking? Is the information really inside the font?

EDIT: I tried opening the mmap file in a text editor, but it is mostly hex. Is there a tool for reading it?

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (TeXmath-LMR-0)
%%Title: (TeXmath-LMR-0 TeXmath LMR 0)
%%Version: 1.000
%%EndComments
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TeXmath)
/Ordering (LMR)
/Supplement 0
>> def
/CMapName /TeXmath-LMR-0 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
96 beginbfchar
<00> <005C00620069006700630069007200630020>
<01> <005C006D0064006C00670062006C006B0063006900720063006C00650020>
<02> <005C0073007100750061007200650020>
<03> <005C0062006C00610063006B0073007100750061007200650020>
<04> <005C0076006100720074007200690061006E0067006C00650020>
<05> <005C0062006C00610063006B0074007200690061006E0067006C00650020>
<06> <005C0074007200690061006E0067006C00650064006F0077006E0020>
<07> <005C0062006C00610063006B0074007200690061006E0067006C00650064006F0077006E0020>
<08> <005C006C006F007A0065006E006700650020>
<09> <005C0062006C00610063006B006C006F007A0065006E006700650020>
<0A> <005C006D0064006C00670062006C006B006400690061006D006F006E00640020>
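The hex strings in the excerpt above decode cleanly as UTF-16BE text: each group of four hex digits is one code unit, so `005C` is a backslash and `0020` is a space. The following Python sketch decodes such `<code> <hex>` pairs. The names `BFCHAR` and `decode_bfchar_lines` are hypothetical, and the sketch assumes the file only uses the simple `beginbfchar` form shown above (no `bfrange` blocks).

```python
import re

# Matches one "<srcCode> <dstString>" pair as it appears in a bfchar block.
BFCHAR = re.compile(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>")


def decode_bfchar_lines(text):
    """Return {glyph code: decoded string} for every bfchar pair in text.

    Destination strings are assumed to be UTF-16BE. Pairs whose second
    value is not a whole number of 16-bit units (for example the
    codespacerange line "<00> <FF>") are skipped.
    """
    table = {}
    for src, dst in BFCHAR.findall(text):
        if len(dst) % 4 != 0:
            continue  # not a UTF-16BE string, e.g. a codespace range bound
        table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table


# Two entries copied from the excerpt above.
sample = """
<00> <005C00620069006700630069007200630020>
<02> <005C0073007100750061007200650020>
"""

for code, name in decode_bfchar_lines(sample).items():
    print(f"0x{code:02X} -> {name!r}")
```

Decoded this way, `<00>` reads `\bigcirc ` and `<02>` reads `\square `, each with a trailing space, so the file appears to map glyph codes directly to LaTeX macro names.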

