Wednesday, December 28, 2022

Python GIS data wrangling - Harris County Appraisal District

 The task at hand is to prepare a spreadsheet file with links to "Appraisal District Maps in Harris County, Texas".


There are twenty-two (22) appraisal districts, as seen on the facet-maps page above. So now we need the links to the detailed maps in PDF format.

For example, if you open the Houston facet map, you see several facet numbers that lead to the detailed maps.

https://public.hcad.org/maps/Houston.asp


If you click on any of the numbers (for example, 5156A), it should take you to a page that looks like the one below, where you will see that the map is subdivided into 12 detailed maps in PDF format.

https://public.hcad.org/cgi-bin/IMap.asp?map=5156A


It is these links to the detailed PDF maps that we need to collect into a spreadsheet table.

https://public.hcad.org/iMaps/Tiles/Color/5156A1.pdf
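
The PDF link above appears to be the facet number followed by a tile number and ".pdf". As a minimal sketch, assuming the 12 tiles of a facet are simply numbered 1 through 12 (only tile 1 is confirmed by the example link), the links for one facet could be generated like this. The name tile_pdf_urls is made up for illustration.

```python
# Base of the detailed-map PDF links, taken from the example link above.
BASE_URL = "https://public.hcad.org/iMaps/Tiles/Color/"


def tile_pdf_urls(facet, tiles=12):
    """Build the PDF links for one facet number.

    Assumption: tiles are numbered 1..tiles, e.g. 5156A1.pdf to 5156A12.pdf.
    Only the first of these is confirmed by the example in the post.
    """
    return [f"{BASE_URL}{facet}{n}.pdf" for n in range(1, tiles + 1)]


urls = tile_pdf_urls("5156A")
print(urls[0])    # first tile of facet 5156A
print(len(urls))  # number of links generated
```

If a facet turns out to have fewer tiles or a different numbering, the tiles argument or the range would need to be adjusted.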


Scraping the links was not easy, as the website has anti-scraping mechanisms in place. So we can manually collect the HTML source tags that contain what we want, then use Python scripting to wrangle them into the format we want.
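
As a rough sketch of that wrangling step, assuming the copied HTML is a block of area tags whose href ends in "map=" followed by the facet number, a regular expression can pull out the facet numbers and the csv module can write them out as a table. The two-row sample and the function names extract_facets and facets_to_csv are illustrative only, not the script used for this post.

```python
import csv
import io
import re

# A tiny sample in the same shape as the manually copied HTML source.
sample = '''<map name="ISDMap">
<area shape="rect" coords="409,369 , 448,394" href="/cgi-bin/IMap.asp?map=5156A" alt="">
<area shape="rect" coords="447,369 , 488,394" href="/cgi-bin/IMap.asp?map=5156B" alt="">
</map>'''


def extract_facets(html):
    """Return the facet numbers found in the href attributes, without duplicates."""
    found = re.findall(r"IMap\.asp\?map=(\w+)", html)
    # dict.fromkeys keeps the first occurrence of each facet and preserves order.
    return list(dict.fromkeys(found))


def facets_to_csv(facets):
    """Return CSV text with one row per facet and the link to its facet page."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["facet", "facet_page"])
    for facet in facets:
        writer.writerow([facet, f"https://public.hcad.org/cgi-bin/IMap.asp?map={facet}"])
    return buffer.getvalue()


facets = extract_facets(sample)
print(facets_to_csv(facets))
```

The CSV text can be saved to a file and opened directly in a spreadsheet program.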

For example, for each district, the facet numbers are embedded in a '<map>...</map>' tag, as seen below.

iMap = ''' <map name="ISDMap">
            <area shape="rect" coords="487, 59 , 525, 86" href="/cgi-bin/IMap.asp?map=5262A" alt="">
            <area shape="rect" coords="525, 59 , 564, 86" href="/cgi-bin/IMap.asp?map=5262B" alt="">
            <area shape="rect" coords="564, 59 , 604, 85" href="/cgi-bin/IMap.asp?map=5362A" alt="">
            <area shape="rect" coords="603, 59 , 643, 86" href="/cgi-bin/IMap.asp?map=5362B" alt="">
            <area shape="rect" coords="643, 59 , 682, 85" href="/cgi-bin/IMap.asp?map=5462A" alt="">
            <area shape="rect" coords="682, 59 , 722, 86" href="/cgi-bin/IMap.asp?map=5462B" alt="">
            <area shape="rect" coords="487, 86 , 525,111" href="/cgi-bin/IMap.asp?map=5262C" alt="">
            <area shape="rect" coords="524, 85 , 565,110" href="/cgi-bin/IMap.asp?map=5262D" alt="">
            <area shape="rect" coords="564, 85 , 604,111" href="/cgi-bin/IMap.asp?map=5362C" alt="">
            <area shape="rect" coords="603, 85 , 643,111" href="/cgi-bin/IMap.asp?map=5362D" alt="">
            <area shape="rect" coords="643, 85 , 682,110" href="/cgi-bin/IMap.asp?map=5462C" alt="">
            <area shape="rect" coords="681, 85 , 722,111" href="/cgi-bin/IMap.asp?map=5462D" alt="">
            <area shape="rect" coords="486,110 , 525,137" href="/cgi-bin/IMap.asp?map=5261A" alt="">
            <area shape="rect" coords="525,110 , 565,137" href="/cgi-bin/IMap.asp?map=5261B" alt="">
            <area shape="rect" coords="564,110 , 604,137" href="/cgi-bin/IMap.asp?map=5361A" alt="">
            <area shape="rect" coords="603,110 , 643,137" href="/cgi-bin/IMap.asp?map=5261B" alt="">
            <area shape="rect" coords="643,110 , 682,137" href="/cgi-bin/IMap.asp?map=5461A" alt="">
            <area shape="rect" coords="681,110 , 722,137" href="/cgi-bin/IMap.asp?map=5461B" alt="">
            <area shape="rect" coords="368,137 , 409,163" href="/cgi-bin/IMap.asp?map=5061D" alt="">
            <area shape="rect" coords="408,137 , 447,163" href="/cgi-bin/IMap.asp?map=5161C" alt="">
            <area shape="rect" coords="446,136 , 487,162" href="/cgi-bin/IMap.asp?map=5161D" alt="">
            <area shape="rect" coords="486,137 , 525,163" href="/cgi-bin/IMap.asp?map=5261C" alt="">
            <area shape="rect" coords="524,137 , 565,163" href="/cgi-bin/IMap.asp?map=5261D" alt="">
            <area shape="rect" coords="564,137 , 604,162" href="/cgi-bin/IMap.asp?map=5361C" alt="">
            <area shape="rect" coords="603,137 , 643,163" href="/cgi-bin/IMap.asp?map=5361D" alt="">
            <area shape="rect" coords="642,137 , 682,163" href="/cgi-bin/IMap.asp?map=5461C" alt="">
            <area shape="rect" coords="682,136 , 722,162" href="/cgi-bin/IMap.asp?map=5461D" alt="">
            <area shape="rect" coords="721,137 , 761,163" href="/cgi-bin/IMap.asp?map=5561C" alt="">
            <area shape="rect" coords="760,137 , 800,163" href="/cgi-bin/IMap.asp?map=5561D" alt="">
            <area shape="rect" coords="368,163 , 408,188" href="/cgi-bin/IMap.asp?map=5060B" alt="">
            <area shape="rect" coords="407,163 , 447,188" href="/cgi-bin/IMap.asp?map=5160A" alt="">
            <area shape="rect" coords="446,162 , 487,189" href="/cgi-bin/IMap.asp?map=5160B" alt="">
            <area shape="rect" coords="486,163 , 525,188" href="/cgi-bin/IMap.asp?map=5260A" alt="">
            <area shape="rect" coords="525,163 , 565,189" href="/cgi-bin/IMap.asp?map=5260B" alt="">
            <area shape="rect" coords="564,162 , 604,188" href="/cgi-bin/IMap.asp?map=5360A" alt="">
            <area shape="rect" coords="603,163 , 644,189" href="/cgi-bin/IMap.asp?map=5360B" alt="">
            <area shape="rect" coords="642,163 , 682,189" href="/cgi-bin/IMap.asp?map=5460A" alt="">
            <area shape="rect" coords="682,163 , 722,189" href="/cgi-bin/IMap.asp?map=5460B" alt="">
            <area shape="rect" coords="721,162 , 760,188" href="/cgi-bin/IMap.asp?map=5560A" alt="">
            <area shape="rect" coords="760,163 , 800,189" href="/cgi-bin/IMap.asp?map=5560B" alt="">
            <area shape="rect" coords="368,189 , 408,214" href="/cgi-bin/IMap.asp?map=5060D" alt="">
            <area shape="rect" coords="407,189 , 447,214" href="/cgi-bin/IMap.asp?map=5160C" alt="">
            <area shape="rect" coords="446,188 , 487,214" href="/cgi-bin/IMap.asp?map=5160D" alt="">
            <area shape="rect" coords="486,189 , 525,214" href="/cgi-bin/IMap.asp?map=5260C" alt="">
            <area shape="rect" coords="524,189 , 565,214" href="/cgi-bin/IMap.asp?map=5260D" alt="">
            <area shape="rect" coords="564,189 , 604,214" href="/cgi-bin/IMap.asp?map=5360C" alt="">
            <area shape="rect" coords="603,189 , 643,214" href="/cgi-bin/IMap.asp?map=5360D" alt="">
            <area shape="rect" coords="642,189 , 682,214" href="/cgi-bin/IMap.asp?map=5460C" alt="">
            <area shape="rect" coords="681,189 , 722,214" href="/cgi-bin/IMap.asp?map=5460D" alt="">
            <area shape="rect" coords="721,189 , 761,214" href="/cgi-bin/IMap.asp?map=5560C" alt="">
            <area shape="rect" coords="761,188 , 800,214" href="/cgi-bin/IMap.asp?map=5560D" alt="">
            <area shape="rect" coords="838,189 , 879,214" href="/cgi-bin/IMap.asp?map=5660D" alt="">
            <area shape="rect" coords="877,189 , 917,214" href="/cgi-bin/IMap.asp?map=5760C" alt="">
            <area shape="rect" coords="408,214 , 447,240" href="/cgi-bin/IMap.asp?map=5159A" alt="">
            <area shape="rect" coords="446,214 , 487,240" href="/cgi-bin/IMap.asp?map=5159B" alt="">
            <area shape="rect" coords="486,214 , 525,240" href="/cgi-bin/IMap.asp?map=5259A" alt="">
            <area shape="rect" coords="524,214 , 565,240" href="/cgi-bin/IMap.asp?map=5259B" alt="">
            <area shape="rect" coords="565,214 , 604,240" href="/cgi-bin/IMap.asp?map=5359A" alt="">
            <area shape="rect" coords="604,214 , 643,239" href="/cgi-bin/IMap.asp?map=5359B" alt="">
            <area shape="rect" coords="643,213 , 682,240" href="/cgi-bin/IMap.asp?map=5459A" alt="">
            <area shape="rect" coords="682,214 , 722,240" href="/cgi-bin/IMap.asp?map=5459B" alt="">
            <area shape="rect" coords="722,214 , 761,240" href="/cgi-bin/IMap.asp?map=5559A" alt="">
            <area shape="rect" coords="761,214 , 800,240" href="/cgi-bin/IMap.asp?map=5559B" alt="">
            <area shape="rect" coords="800,214 , 839,240" href="/cgi-bin/IMap.asp?map=5659A" alt="">
            <area shape="rect" coords="839,214 , 878,240" href="/cgi-bin/IMap.asp?map=5659B" alt="">
            <area shape="rect" coords="878,214 , 917,240" href="/cgi-bin/IMap.asp?map=5759A" alt="">
            <area shape="rect" coords="917,213 , 957,240" href="/cgi-bin/IMap.asp?map=5759B" alt="">
            <area shape="rect" coords="447,240 , 487,266" href="/cgi-bin/IMap.asp?map=5159D" alt="">
            <area shape="rect" coords="487,240 , 525,266" href="/cgi-bin/IMap.asp?map=5259C" alt="">
            <area shape="rect" coords="525,239 , 565,266" href="/cgi-bin/IMap.asp?map=5259D" alt="">
            <area shape="rect" coords="565,240 , 604,266" href="/cgi-bin/IMap.asp?map=5359C" alt="">
            <area shape="rect" coords="604,239 , 643,266" href="/cgi-bin/IMap.asp?map=5359D" alt="">
            <area shape="rect" coords="643,240 , 682,266" href="/cgi-bin/IMap.asp?map=5459C" alt="">
            <area shape="rect" coords="682,240 , 722,266" href="/cgi-bin/IMap.asp?map=5459D" alt="">
            <area shape="rect" coords="721,240 , 761,266" href="/cgi-bin/IMap.asp?map=5559C" alt="">
            <area shape="rect" coords="760,240 , 800,265" href="/cgi-bin/IMap.asp?map=5559D" alt="">
            <area shape="rect" coords="800,240 , 839,266" href="/cgi-bin/IMap.asp?map=5659C" alt="">
            <area shape="rect" coords="840,240 , 879,266" href="/cgi-bin/IMap.asp?map=5659D" alt="">
            <area shape="rect" coords="878,240 , 917,266" href="/cgi-bin/IMap.asp?map=5759C" alt="">
            <area shape="rect" coords="917,240 , 957,266" href="/cgi-bin/IMap.asp?map=5759D" alt="">
            <area shape="rect" coords="448,266 , 487,292" href="/cgi-bin/IMap.asp?map=5158B" alt="">
            <area shape="rect" coords="487,266 , 525,292" href="/cgi-bin/IMap.asp?map=5258A" alt="">
            <area shape="rect" coords="525,265 , 565,292" href="/cgi-bin/IMap.asp?map=5258B" alt="">
            <area shape="rect" coords="565,266 , 604,292" href="/cgi-bin/IMap.asp?map=5358A" alt="">
            <area shape="rect" coords="604,266 , 643,292" href="/cgi-bin/IMap.asp?map=5358B" alt="">
            <area shape="rect" coords="643,266 , 682,292" href="/cgi-bin/IMap.asp?map=5458A" alt="">
            <area shape="rect" coords="682,265 , 722,292" href="/cgi-bin/IMap.asp?map=5458B" alt="">
            <area shape="rect" coords="722,266 , 761,292" href="/cgi-bin/IMap.asp?map=5558A" alt="">
            <area shape="rect" coords="761,266 , 800,292" href="/cgi-bin/IMap.asp?map=5558B" alt="">
            <area shape="rect" coords="800,266 , 839,292" href="/cgi-bin/IMap.asp?map=5658A" alt="">
            <area shape="rect" coords="839,266 , 878,292" href="/cgi-bin/IMap.asp?map=5658B" alt="">
            <area shape="rect" coords="877,266 , 917,292" href="/cgi-bin/IMap.asp?map=5758A" alt="">
            <area shape="rect" coords="916,265 , 957,292" href="/cgi-bin/IMap.asp?map=5758B" alt="">
            <area shape="rect" coords="957,266 , 996,292" href="/cgi-bin/IMap.asp?map=5858A" alt="">
            <area shape="rect" coords="408,291 , 447,317" href="/cgi-bin/IMap.asp?map=5158C" alt="">
            <area shape="rect" coords="447,292 , 487,317" href="/cgi-bin/IMap.asp?map=5158D" alt="">
            <area shape="rect" coords="487,292 , 525,317" href="/cgi-bin/IMap.asp?map=5258C" alt="">
            <area shape="rect" coords="525,292 , 565,317" href="/cgi-bin/IMap.asp?map=5258D" alt="">
            <area shape="rect" coords="565,292 , 604,317" href="/cgi-bin/IMap.asp?map=5358C" alt="">
            <area shape="rect" coords="604,292 , 643,317" href="/cgi-bin/IMap.asp?map=5358D" alt="">
            <area shape="rect" coords="643,291 , 682,317" href="/cgi-bin/IMap.asp?map=5458C" alt="">
            <area shape="rect" coords="682,292 , 722,317" href="/cgi-bin/IMap.asp?map=5458D" alt="">
            <area shape="rect" coords="722,292 , 761,317" href="/cgi-bin/IMap.asp?map=5558C" alt="">
            <area shape="rect" coords="761,292 , 800,317" href="/cgi-bin/IMap.asp?map=5558D" alt="">
            <area shape="rect" coords="800,292 , 839,317" href="/cgi-bin/IMap.asp?map=5658C" alt="">
            <area shape="rect" coords="839,292 , 878,317" href="/cgi-bin/IMap.asp?map=5658D" alt="">
            <area shape="rect" coords="878,292 , 917,317" href="/cgi-bin/IMap.asp?map=5758C" alt="">
            <area shape="rect" coords="917,292 , 957,317" href="/cgi-bin/IMap.asp?map=5758D" alt="">
            <area shape="rect" coords="957,291 , 996,317" href="/cgi-bin/IMap.asp?map=5858C" alt="">
            <area shape="rect" coords=" 94,317 , 133,343" href="/cgi-bin/IMap.asp?map=4757A" alt="">
            <area shape="rect" coords="132,317 , 173,343" href="/cgi-bin/IMap.asp?map=4757B" alt="">
            <area shape="rect" coords="211,317 , 252,343" href="/cgi-bin/IMap.asp?map=4857B" alt="">
            <area shape="rect" coords="250,317 , 290,342" href="/cgi-bin/IMap.asp?map=4957A" alt="">
            <area shape="rect" coords="369,316 , 408,343" href="/cgi-bin/IMap.asp?map=5057B" alt="">
            <area shape="rect" coords="408,317 , 447,343" href="/cgi-bin/IMap.asp?map=5157A" alt="">
            <area shape="rect" coords="447,317 , 487,343" href="/cgi-bin/IMap.asp?map=5157B" alt="">
            <area shape="rect" coords="486,316 , 525,343" href="/cgi-bin/IMap.asp?map=5257A" alt="">
            <area shape="rect" coords="525,317 , 566,343" href="/cgi-bin/IMap.asp?map=5257B" alt="">
            <area shape="rect" coords="565,317 , 605,343" href="/cgi-bin/IMap.asp?map=5357A" alt="">
            <area shape="rect" coords="605,317 , 644,343" href="/cgi-bin/IMap.asp?map=5357B" alt="">
            <area shape="rect" coords="643,316 , 683,343" href="/cgi-bin/IMap.asp?map=5457A" alt="">
            <area shape="rect" coords="683,317 , 723,343" href="/cgi-bin/IMap.asp?map=5457B" alt="">
            <area shape="rect" coords="722,317 , 761,343" href="/cgi-bin/IMap.asp?map=5557A" alt="">
            <area shape="rect" coords="761,317 , 801,343" href="/cgi-bin/IMap.asp?map=5557B" alt="">
            <area shape="rect" coords="801,317 , 839,343" href="/cgi-bin/IMap.asp?map=5657A" alt="">
            <area shape="rect" coords="839,316 , 879,343" href="/cgi-bin/IMap.asp?map=5657B" alt="">
            <area shape="rect" coords="370,343 , 409,369" href="/cgi-bin/IMap.asp?map=5057D" alt="">
            <area shape="rect" coords="408,343 , 448,369" href="/cgi-bin/IMap.asp?map=5157C" alt="">
            <area shape="rect" coords="448,343 , 488,369" href="/cgi-bin/IMap.asp?map=5157D" alt="">
            <area shape="rect" coords="488,343 , 526,369" href="/cgi-bin/IMap.asp?map=5257C" alt="">
            <area shape="rect" coords="526,343 , 566,369" href="/cgi-bin/IMap.asp?map=5257D" alt="">
            <area shape="rect" coords="566,343 , 605,369" href="/cgi-bin/IMap.asp?map=5357C" alt="">
            <area shape="rect" coords="605,342 , 644,368" href="/cgi-bin/IMap.asp?map=5357D" alt="">
            <area shape="rect" coords="644,343 , 683,369" href="/cgi-bin/IMap.asp?map=5457C" alt="">
            <area shape="rect" coords="683,343 , 723,369" href="/cgi-bin/IMap.asp?map=5457D" alt="">
            <area shape="rect" coords="723,343 , 761,369" href="/cgi-bin/IMap.asp?map=5557C" alt="">
            <area shape="rect" coords="761,343 , 801,369" href="/cgi-bin/IMap.asp?map=5557D" alt="">
            <area shape="rect" coords="801,343 , 840,369" href="/cgi-bin/IMap.asp?map=5657C" alt="">
            <area shape="rect" coords="840,342 , 879,369" href="/cgi-bin/IMap.asp?map=5657D" alt="">
            <area shape="rect" coords=" 94,343 , 134,369" href="/cgi-bin/IMap.asp?map=4757C" alt="">
            <area shape="rect" coords="134,342 , 174,369" href="/cgi-bin/IMap.asp?map=4757D" alt="">
            <area shape="rect" coords="173,343 , 213,369" href="/cgi-bin/IMap.asp?map=4857C" alt="">
            <area shape="rect" coords="211,343 , 251,368" href="/cgi-bin/IMap.asp?map=4857D" alt="">
            <area shape="rect" coords="251,343 , 291,369" href="/cgi-bin/IMap.asp?map=4957C" alt="">
            <area shape="rect" coords="291,342 , 330,369" href="/cgi-bin/IMap.asp?map=4957D" alt="">
            <area shape="rect" coords=" 94,368 , 134,394" href="/cgi-bin/IMap.asp?map=4756A" alt="">
            <area shape="rect" coords="134,369 , 173,394" href="/cgi-bin/IMap.asp?map=4756B" alt="">
            <area shape="rect" coords="172,368 , 212,394" href="/cgi-bin/IMap.asp?map=4856A" alt="">
            <area shape="rect" coords="212,369 , 252,394" href="/cgi-bin/IMap.asp?map=4856B" alt="">
            <area shape="rect" coords="251,368 , 291,394" href="/cgi-bin/IMap.asp?map=4956A" alt="">
            <area shape="rect" coords="290,368 , 330,394" href="/cgi-bin/IMap.asp?map=4956B" alt="">
            <area shape="rect" coords="329,369 , 369,394" href="/cgi-bin/IMap.asp?map=5056A" alt="">
            <area shape="rect" coords="370,368 , 409,394" href="/cgi-bin/IMap.asp?map=5056B" alt="">
            <area shape="rect" coords="409,369 , 448,394" href="/cgi-bin/IMap.asp?map=5156A" alt="">
            <area shape="rect" coords="447,369 , 488,394" href="/cgi-bin/IMap.asp?map=5156B" alt="">
            <area shape="rect" coords="487,369 , 526,394" href="/cgi-bin/IMap.asp?map=5256A" alt="">
            <area shape="rect" coords="526,369 , 566,394" href="/cgi-bin/IMap.asp?map=5256B" alt="">
            <area shape="rect" coords="566,369 , 605,394" href="/cgi-bin/IMap.asp?map=5356A" alt="">
            <area shape="rect" coords="604,368 , 643,393" href="/cgi-bin/IMap.asp?map=5356B" alt="">
            <area shape="rect" coords="644,369 , 683,394" href="/cgi-bin/IMap.asp?map=5456A" alt="">
            <area shape="rect" coords="683,369 , 723,394" href="/cgi-bin/IMap.asp?map=5456B" alt="">
            <area shape="rect" coords="723,369 , 762,394" href="/cgi-bin/IMap.asp?map=5556A" alt="">
            <area shape="rect" coords="761,369 , 801,394" href="/cgi-bin/IMap.asp?map=5556B" alt="">
            <area shape="rect" coords="801,369 , 840,394" href="/cgi-bin/IMap.asp?map=5656A" alt="">
            <area shape="rect" coords="840,369 , 879,394" href="/cgi-bin/IMap.asp?map=5656B" alt="">
            <area shape="rect" coords=" 95,394 , 134,419" href="/cgi-bin/IMap.asp?map=4756C" alt="">
            <area shape="rect" coords="134,393 , 173,419" href="/cgi-bin/IMap.asp?map=4756D" alt="">
            <area shape="rect" coords="173,393 , 213,419" href="/cgi-bin/IMap.asp?map=4856C" alt="">
            <area shape="rect" coords="212,393 , 252,419" href="/cgi-bin/IMap.asp?map=4856D" alt="">
            <area shape="rect" coords="251,394 , 290,419" href="/cgi-bin/IMap.asp?map=4956C" alt="">
            <area shape="rect" coords="289,393 , 329,419" href="/cgi-bin/IMap.asp?map=4956D" alt="">
            <area shape="rect" coords="330,393 , 370,419" href="/cgi-bin/IMap.asp?map=5056C" alt="">
            <area shape="rect" coords="369,393 , 408,420" href="/cgi-bin/IMap.asp?map=5056D" alt="">
            <area shape="rect" coords="409,394 , 447,420" href="/cgi-bin/IMap.asp?map=5156C" alt="">
            <area shape="rect" coords="448,395 , 487,420" href="/cgi-bin/IMap.asp?map=5156D" alt="">
            <area shape="rect" coords="487,394 , 525,420" href="/cgi-bin/IMap.asp?map=5256C" alt="">
            <area shape="rect" coords="525,394 , 565,420" href="/cgi-bin/IMap.asp?map=5256D" alt="">
            <area shape="rect" coords="565,394 , 604,419" href="/cgi-bin/IMap.asp?map=5356C" alt="">
            <area shape="rect" coords="605,394 , 644,420" href="/cgi-bin/IMap.asp?map=5356D" alt="">
            <area shape="rect" coords="643,394 , 682,420" href="/cgi-bin/IMap.asp?map=5456C" alt="">
            <area shape="rect" coords="682,394 , 723,420" href="/cgi-bin/IMap.asp?map=5456D" alt="">
            <area shape="rect" coords="722,394 , 761,420" href="/cgi-bin/IMap.asp?map=5556C" alt="">
            <area shape="rect" coords="762,394 , 801,420" href="/cgi-bin/IMap.asp?map=5556D" alt="">
            <area shape="rect" coords="801,393 , 839,420" href="/cgi-bin/IMap.asp?map=5656C" alt="">
            <area shape="rect" coords="839,394 , 879,420" href="/cgi-bin/IMap.asp?map=5656D" alt="">
            <area shape="rect" coords="878,393 , 918,419" href="/cgi-bin/IMap.asp?map=5756C" alt="">
            <area shape="rect" coords="291,419 , 330,446" href="/cgi-bin/IMap.asp?map=4955B" alt="">
            <area shape="rect" coords="329,419 , 369,445" href="/cgi-bin/IMap.asp?map=5055A" alt="">
            <area shape="rect" coords="370,420 , 408,445" href="/cgi-bin/IMap.asp?map=5055B" alt="">
            <area shape="rect" coords="408,420 , 447,446" href="/cgi-bin/IMap.asp?map=5155A" alt="">
            <area shape="rect" coords="447,419 , 487,446" href="/cgi-bin/IMap.asp?map=5155B" alt="">
            <area shape="rect" coords="487,420 , 526,446" href="/cgi-bin/IMap.asp?map=5255A" alt="">
            <area shape="rect" coords="526,419 , 566,445" href="/cgi-bin/IMap.asp?map=5255B" alt="">
            <area shape="rect" coords="565,420 , 605,446" href="/cgi-bin/IMap.asp?map=5355A" alt="">
            <area shape="rect" coords="604,419 , 643,446" href="/cgi-bin/IMap.asp?map=5355B" alt="">
            <area shape="rect" coords="644,420 , 683,446" href="/cgi-bin/IMap.asp?map=5455A" alt="">
            <area shape="rect" coords="682,420 , 723,446" href="/cgi-bin/IMap.asp?map=5455B" alt="">
            <area shape="rect" coords="723,419 , 762,445" href="/cgi-bin/IMap.asp?map=5555A" alt="">
            <area shape="rect" coords="761,420 , 801,446" href="/cgi-bin/IMap.asp?map=5555B" alt="">
            <area shape="rect" coords="801,420 , 839,446" href="/cgi-bin/IMap.asp?map=5655A" alt="">
            <area shape="rect" coords="839,420 , 879,446" href="/cgi-bin/IMap.asp?map=5655B" alt="">
            <area shape="rect" coords="879,420 , 918,446" href="/cgi-bin/IMap.asp?map=5755A" alt="">
            <area shape="rect" coords="290,446 , 329,472" href="/cgi-bin/IMap.asp?map=4955D" alt="">
            <area shape="rect" coords="329,445 , 370,472" href="/cgi-bin/IMap.asp?map=5055C" alt="">
            <area shape="rect" coords="369,445 , 409,472" href="/cgi-bin/IMap.asp?map=5055D" alt="">
            <area shape="rect" coords="408,445 , 448,472" href="/cgi-bin/IMap.asp?map=5155C" alt="">
            <area shape="rect" coords="447,446 , 488,472" href="/cgi-bin/IMap.asp?map=5155D" alt="">
            <area shape="rect" coords="487,446 , 526,472" href="/cgi-bin/IMap.asp?map=5255C" alt="">
            <area shape="rect" coords="525,445 , 566,472" href="/cgi-bin/IMap.asp?map=5255D" alt="">
            <area shape="rect" coords="565,446 , 605,472" href="/cgi-bin/IMap.asp?map=5355C" alt="">
            <area shape="rect" coords="605,446 , 644,472" href="/cgi-bin/IMap.asp?map=5355D" alt="">
            <area shape="rect" coords="643,445 , 683,471" href="/cgi-bin/IMap.asp?map=5455C" alt="">
            <area shape="rect" coords="683,446 , 723,472" href="/cgi-bin/IMap.asp?map=5455D" alt="">
            <area shape="rect" coords="723,445 , 762,472" href="/cgi-bin/IMap.asp?map=5555C" alt="">
            <area shape="rect" coords="762,446 , 801,472" href="/cgi-bin/IMap.asp?map=5555D" alt="">
            <area shape="rect" coords="800,446 , 840,472" href="/cgi-bin/IMap.asp?map=5655C" alt="">
            <area shape="rect" coords="840,446 , 878,472" href="/cgi-bin/IMap.asp?map=5655D" alt="">
            <area shape="rect" coords="878,446 , 917,472" href="/cgi-bin/IMap.asp?map=5755C" alt="">
            <area shape="rect" coords="291,472 , 330,497" href="/cgi-bin/IMap.asp?map=4954B" alt="">
            <area shape="rect" coords="329,472 , 370,497" href="/cgi-bin/IMap.asp?map=5054A" alt="">
            <area shape="rect" coords="369,472 , 408,497" href="/cgi-bin/IMap.asp?map=5054B" alt="">
            <area shape="rect" coords="409,472 , 448,497" href="/cgi-bin/IMap.asp?map=5154A" alt="">
            <area shape="rect" coords="448,472 , 488,497" href="/cgi-bin/IMap.asp?map=5154B" alt="">
            <area shape="rect" coords="487,472 , 525,497" href="/cgi-bin/IMap.asp?map=5254A" alt="">
            <area shape="rect" coords="525,472 , 566,497" href="/cgi-bin/IMap.asp?map=5254B" alt="">
            <area shape="rect" coords="566,472 , 605,497" href="/cgi-bin/IMap.asp?map=5354A" alt="">
            <area shape="rect" coords="605,472 , 644,497" href="/cgi-bin/IMap.asp?map=5354B" alt="">
            <area shape="rect" coords="644,472 , 683,497" href="/cgi-bin/IMap.asp?map=5454A" alt="">
            <area shape="rect" coords="682,472 , 723,497" href="/cgi-bin/IMap.asp?map=5454B" alt="">
            <area shape="rect" coords="722,471 , 762,497" href="/cgi-bin/IMap.asp?map=5554A" alt="">
            <area shape="rect" coords="761,472 , 801,497" href="/cgi-bin/IMap.asp?map=5554B" alt="">
            <area shape="rect" coords="800,471 , 840,497" href="/cgi-bin/IMap.asp?map=5654A" alt="">
            <area shape="rect" coords="840,472 , 879,497" href="/cgi-bin/IMap.asp?map=5654B" alt="">
            <area shape="rect" coords="879,472 , 917,497" href="/cgi-bin/IMap.asp?map=5754A" alt="">
            <area shape="rect" coords="290,497 , 330,523" href="/cgi-bin/IMap.asp?map=4954D" alt="">
            <area shape="rect" coords="329,496 , 370,523" href="/cgi-bin/IMap.asp?map=5054C" alt="">
            <area shape="rect" coords="369,497 , 409,523" href="/cgi-bin/IMap.asp?map=5054D" alt="">
            <area shape="rect" coords="409,496 , 448,523" href="/cgi-bin/IMap.asp?map=5154C" alt="">
            <area shape="rect" coords="448,497 , 488,523" href="/cgi-bin/IMap.asp?map=5154D" alt="">
            <area shape="rect" coords="488,498 , 525,523" href="/cgi-bin/IMap.asp?map=5254C" alt="">
            <area shape="rect" coords="525,497 , 565,522" href="/cgi-bin/IMap.asp?map=5254D" alt="">
            <area shape="rect" coords="565,496 , 604,523" href="/cgi-bin/IMap.asp?map=5354C" alt="">
            <area shape="rect" coords="604,497 , 644,523" href="/cgi-bin/IMap.asp?map=5354D" alt="">
            <area shape="rect" coords="644,497 , 682,522" href="/cgi-bin/IMap.asp?map=5454C" alt="">
            <area shape="rect" coords="682,497 , 723,522" href="/cgi-bin/IMap.asp?map=5454D" alt="">
            <area shape="rect" coords="723,497 , 762,523" href="/cgi-bin/IMap.asp?map=5554C" alt="">
            <area shape="rect" coords="762,497 , 800,523" href="/cgi-bin/IMap.asp?map=5554D" alt="">
            <area shape="rect" coords="801,497 , 840,523" href="/cgi-bin/IMap.asp?map=5654C" alt="">
            <area shape="rect" coords="839,497 , 879,523" href="/cgi-bin/IMap.asp?map=5654D" alt="">
            <area shape="rect" coords="878,497 , 918,523" href="/cgi-bin/IMap.asp?map=5754C" alt="">
            <area shape="rect" coords="290,523 , 330,549" href="/cgi-bin/IMap.asp?map=4953B" alt="">
            <area shape="rect" coords="329,523 , 370,549" href="/cgi-bin/IMap.asp?map=5053A" alt="">
            <area shape="rect" coords="370,523 , 409,549" href="/cgi-bin/IMap.asp?map=5053B" alt="">
            <area shape="rect" coords="409,523 , 447,549" href="/cgi-bin/IMap.asp?map=5153A" alt="">
            <area shape="rect" coords="447,522 , 488,549" href="/cgi-bin/IMap.asp?map=5153B" alt="">
            <area shape="rect" coords="488,523 , 526,549" href="/cgi-bin/IMap.asp?map=5253A" alt="">
            <area shape="rect" coords="526,523 , 566,549" href="/cgi-bin/IMap.asp?map=5253B" alt="">
            <area shape="rect" coords="565,522 , 605,549" href="/cgi-bin/IMap.asp?map=5353A" alt="">
            <area shape="rect" coords="605,523 , 644,549" href="/cgi-bin/IMap.asp?map=5353B" alt="">
            <area shape="rect" coords="643,522 , 683,549" href="/cgi-bin/IMap.asp?map=5453A" alt="">
            <area shape="rect" coords="682,522 , 723,549" href="/cgi-bin/IMap.asp?map=5453B" alt="">
            <area shape="rect" coords="723,522 , 761,549" href="/cgi-bin/IMap.asp?map=5553A" alt="">
            <area shape="rect" coords="761,523 , 800,549" href="/cgi-bin/IMap.asp?map=5553B" alt="">
            <area shape="rect" coords="800,523 , 839,549" href="/cgi-bin/IMap.asp?map=5653A" alt="">
            <area shape="rect" coords="839,523 , 878,549" href="/cgi-bin/IMap.asp?map=5653B" alt="">
            <area shape="rect" coords="878,522 , 918,549" href="/cgi-bin/IMap.asp?map=5753A" alt="">
            <area shape="rect" coords="251,548 , 291,575" href="/cgi-bin/IMap.asp?map=4953C" alt="">
            <area shape="rect" coords="290,547 , 330,574" href="/cgi-bin/IMap.asp?map=4953D" alt="">
            <area shape="rect" coords="329,548 , 369,574" href="/cgi-bin/IMap.asp?map=5053C" alt="">
            <area shape="rect" coords="369,548 , 408,575" href="/cgi-bin/IMap.asp?map=5053D" alt="">
            <area shape="rect" coords="409,548 , 448,575" href="/cgi-bin/IMap.asp?map=5153C" alt="">
            <area shape="rect" coords="448,547 , 488,574" href="/cgi-bin/IMap.asp?map=5153D" alt="">
            <area shape="rect" coords="488,547 , 526,575" href="/cgi-bin/IMap.asp?map=5253C" alt="">
            <area shape="rect" coords="526,548 , 566,575" href="/cgi-bin/IMap.asp?map=5253D" alt="">
            <area shape="rect" coords="566,547 , 605,575" href="/cgi-bin/IMap.asp?map=5353C" alt="">
            <area shape="rect" coords="605,547 , 644,575" href="/cgi-bin/IMap.asp?map=5353D" alt="">
            <area shape="rect" coords="644,547 , 682,575" href="/cgi-bin/IMap.asp?map=5453C" alt="">
            <area shape="rect" coords="682,547 , 722,575" href="/cgi-bin/IMap.asp?map=5453D" alt="">
            <area shape="rect" coords="723,547 , 762,575" href="/cgi-bin/IMap.asp?map=5553C" alt="">
            <area shape="rect" coords="762,548 , 801,574" href="/cgi-bin/IMap.asp?map=5553D" alt="">
            <area shape="rect" coords="801,547 , 840,575" href="/cgi-bin/IMap.asp?map=5653C" alt="">
            <area shape="rect" coords="840,547 , 879,575" href="/cgi-bin/IMap.asp?map=5653D" alt="">
            <area shape="rect" coords="252,574 , 291,599" href="/cgi-bin/IMap.asp?map=4952A" alt="">
            <area shape="rect" coords="291,573 , 330,600" href="/cgi-bin/IMap.asp?map=4952B" alt="">
            <area shape="rect" coords="330,574 , 370,600" href="/cgi-bin/IMap.asp?map=5052A" alt="">
            <area shape="rect" coords="369,573 , 409,600" href="/cgi-bin/IMap.asp?map=5052B" alt="">
            <area shape="rect" coords="409,574 , 448,600" href="/cgi-bin/IMap.asp?map=5152A" alt="">
            <area shape="rect" coords="449,574 , 488,599" href="/cgi-bin/IMap.asp?map=5152B" alt="">
            <area shape="rect" coords="488,573 , 526,600" href="/cgi-bin/IMap.asp?map=5252A" alt="">
            <area shape="rect" coords="526,574 , 566,600" href="/cgi-bin/IMap.asp?map=5252B" alt="">
            <area shape="rect" coords="566,574 , 605,600" href="/cgi-bin/IMap.asp?map=5352A" alt="">
            <area shape="rect" coords="604,573 , 643,599" href="/cgi-bin/IMap.asp?map=5352B" alt="">
            <area shape="rect" coords="643,574 , 683,599" href="/cgi-bin/IMap.asp?map=5452A" alt="">
            <area shape="rect" coords="682,574 , 723,599" href="/cgi-bin/IMap.asp?map=5452B" alt="">
            <area shape="rect" coords="723,574 , 762,600" href="/cgi-bin/IMap.asp?map=5552A" alt="">
            <area shape="rect" coords="762,573 , 801,600" href="/cgi-bin/IMap.asp?map=5552B" alt="">
            <area shape="rect" coords="800,574 , 839,599" href="/cgi-bin/IMap.asp?map=5652A" alt="">
            <area shape="rect" coords="839,573 , 878,599" href="/cgi-bin/IMap.asp?map=5652B" alt="">
            <area shape="rect" coords="252,599 , 291,625" href="/cgi-bin/IMap.asp?map=4952C" alt="">
            <area shape="rect" coords="290,599 , 329,625" href="/cgi-bin/IMap.asp?map=4952D" alt="">
            <area shape="rect" coords="330,599 , 370,625" href="/cgi-bin/IMap.asp?map=5052C" alt="">
            <area shape="rect" coords="369,599 , 408,626" href="/cgi-bin/IMap.asp?map=5052D" alt="">
            <area shape="rect" coords="409,599 , 448,625" href="/cgi-bin/IMap.asp?map=5152C" alt="">
            <area shape="rect" coords="447,598 , 488,626" href="/cgi-bin/IMap.asp?map=5152D" alt="">
            <area shape="rect" coords="488,598 , 526,626" href="/cgi-bin/IMap.asp?map=5252C" alt="">
            <area shape="rect" coords="526,598 , 566,626" href="/cgi-bin/IMap.asp?map=5252D" alt="">
            <area shape="rect" coords="566,598 , 605,626" href="/cgi-bin/IMap.asp?map=5352C" alt="">
            <area shape="rect" coords="604,598 , 644,626" href="/cgi-bin/IMap.asp?map=5352D" alt="">
            <area shape="rect" coords="644,598 , 683,626" href="/cgi-bin/IMap.asp?map=5452C" alt="">
            <area shape="rect" coords="682,598 , 723,625" href="/cgi-bin/IMap.asp?map=5452D" alt="">
            <area shape="rect" coords="723,599 , 762,625" href="/cgi-bin/IMap.asp?map=5552C" alt="">
            <area shape="rect" coords="761,599 , 800,625" href="/cgi-bin/IMap.asp?map=5552D" alt="">
            <area shape="rect" coords="801,599 , 839,625" href="/cgi-bin/IMap.asp?map=5652C" alt="">
            <area shape="rect" coords="839,598 , 878,626" href="/cgi-bin/IMap.asp?map=5652D" alt="">
            <area shape="rect" coords="330,624 , 370,652" href="/cgi-bin/IMap.asp?map=5051A" alt="">
            <area shape="rect" coords="369,624 , 409,651" href="/cgi-bin/IMap.asp?map=5051B" alt="">
            <area shape="rect" coords="409,625 , 448,652" href="/cgi-bin/IMap.asp?map=5151A" alt="">
            <area shape="rect" coords="448,625 , 488,652" href="/cgi-bin/IMap.asp?map=5151B" alt="">
            <area shape="rect" coords="487,625 , 526,652" href="/cgi-bin/IMap.asp?map=5251A" alt="">
            <area shape="rect" coords="526,624 , 565,652" href="/cgi-bin/IMap.asp?map=5251B" alt="">
            <area shape="rect" coords="565,625 , 605,652" href="/cgi-bin/IMap.asp?map=5351A" alt="">
            <area shape="rect" coords="605,625 , 643,651" href="/cgi-bin/IMap.asp?map=5351B" alt="">
            <area shape="rect" coords="643,625 , 682,650" href="/cgi-bin/IMap.asp?map=5451A" alt="">
            <area shape="rect" coords="683,625 , 723,652" href="/cgi-bin/IMap.asp?map=5451B" alt="">
            <area shape="rect" coords="723,624 , 761,652" href="/cgi-bin/IMap.asp?map=5551A" alt="">
            <area shape="rect" coords="762,624 , 801,652" href="/cgi-bin/IMap.asp?map=5551B" alt="">
            <area shape="rect" coords="800,624 , 840,651" href="/cgi-bin/IMap.asp?map=5651A" alt="">
            <area shape="rect" coords="839,625 , 878,652" href="/cgi-bin/IMap.asp?map=5651B" alt="">
            <area shape="rect" coords="369,650 , 409,677" href="/cgi-bin/IMap.asp?map=5051D" alt="">
            <area shape="rect" coords="409,650 , 448,678" href="/cgi-bin/IMap.asp?map=5151C" alt="">
            <area shape="rect" coords="448,651 , 488,678" href="/cgi-bin/IMap.asp?map=5151D" alt="">
            <area shape="rect" coords="488,651 , 525,677" href="/cgi-bin/IMap.asp?map=5251C" alt="">
            <area shape="rect" coords="526,651 , 566,678" href="/cgi-bin/IMap.asp?map=5251D" alt="">
            <area shape="rect" coords="566,651 , 605,677" href="/cgi-bin/IMap.asp?map=5351C" alt="">
            <area shape="rect" coords="605,650 , 644,678" href="/cgi-bin/IMap.asp?map=5351D" alt="">
            <area shape="rect" coords="644,650 , 683,677" href="/cgi-bin/IMap.asp?map=5451C" alt="">
            <area shape="rect" coords="682,652 , 722,677" href="/cgi-bin/IMap.asp?map=5451D" alt="">
            <area shape="rect" coords="722,651 , 761,677" href="/cgi-bin/IMap.asp?map=5551C" alt="">
            <area shape="rect" coords="762,651 , 801,678" href="/cgi-bin/IMap.asp?map=5551D" alt="">
            <area shape="rect" coords="800,651 , 840,678" href="/cgi-bin/IMap.asp?map=5651C" alt="">
            <area shape="rect" coords="840,651 , 878,678" href="/cgi-bin/IMap.asp?map=5651D" alt="">
            <area shape="rect" coords="408,677 , 448,703" href="/cgi-bin/IMap.asp?map=5150A" alt="">
            <area shape="rect" coords="447,677 , 488,703" href="/cgi-bin/IMap.asp?map=5150B" alt="">
            <area shape="rect" coords="488,676 , 526,703" href="/cgi-bin/IMap.asp?map=5250A" alt="">
            <area shape="rect" coords="525,676 , 566,703" href="/cgi-bin/IMap.asp?map=5250B" alt="">
            <area shape="rect" coords="566,677 , 604,703" href="/cgi-bin/IMap.asp?map=5350A" alt="">
            <area shape="rect" coords="605,677 , 644,703" href="/cgi-bin/IMap.asp?map=5350D" alt="">
            <area shape="rect" coords="643,676 , 682,702" href="/cgi-bin/IMap.asp?map=5450A" alt="">
            <area shape="rect" coords="683,676 , 723,703" href="/cgi-bin/IMap.asp?map=5450B" alt="">
            <area shape="rect" coords="723,676 , 762,703" href="/cgi-bin/IMap.asp?map=5550A" alt="">
            <area shape="rect" coords="762,676 , 801,703" href="/cgi-bin/IMap.asp?map=5550B" alt="">
            <area shape="rect" coords="801,676 , 840,703" href="/cgi-bin/IMap.asp?map=5650A" alt="">
            <area shape="rect" coords="840,677 , 879,703" href="/cgi-bin/IMap.asp?map=5650B" alt="">
            <area shape="rect" coords="487,702 , 525,728" href="/cgi-bin/IMap.asp?map=5250C" alt="">
            <area shape="rect" coords="525,702 , 565,728" href="/cgi-bin/IMap.asp?map=5250D" alt="">
            <area shape="rect" coords="565,702 , 604,729" href="/cgi-bin/IMap.asp?map=5350C" alt="">
            <area shape="rect" coords="603,701 , 643,729" href="/cgi-bin/IMap.asp?map=5350D" alt="">
            <area shape="rect" coords="643,702 , 682,729" href="/cgi-bin/IMap.asp?map=5450C" alt="">
            <area shape="rect" coords="682,701 , 722,729" href="/cgi-bin/IMap.asp?map=5450D" alt="">
            <area shape="rect" coords="525,728 , 565,755" href="/cgi-bin/IMap.asp?map=5249B" alt="">
        </map> '''

We can use Python's BeautifulSoup library to extract the facet numbers and then construct the detailed map links, since they all follow the same pattern, as seen below.

import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(iMap, 'html.parser')
facet_maps = soup.find_all("area")

facet_number = [ f['href'].split('=')[-1] for f in facet_maps ]
detailed_map = [ 'https://public.hcad.org/iMaps/Tiles/Color/' + f['href'].split('=')[-1] + str(a) + '.pdf' for f in facet_maps for a in range(1, 13) ]


appr_dist = 'Houston'
appr_dist_link = 'https://public.hcad.org/maps/Houston.asp'


df = pd.DataFrame([])
# df['Facet Number'] = facet_number
df['Detailed map'] = detailed_map
df['Appraisal Districts'] = appr_dist
df['Link'] = appr_dist_link



The final output is similar to the screen above.
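
The 'Facet Number' column is commented out in the code above because there is one facet number per area tag but twelve PDF links per facet, so the two lists have different lengths. A minimal sketch of one way to keep them aligned is to build (facet, link) pairs in a single pass. It uses only the standard library; the two-tag `sample_map` string and the `build_rows` helper are illustrative assumptions, and the twelve tiles per facet follow the pattern described earlier.

```python
import re

# A tiny stand-in for the full iMap string above (two <area> tags only)
sample_map = '''<map name="ISDMap">
    <area shape="rect" coords="487, 59 , 525, 86" href="/cgi-bin/IMap.asp?map=5262A" alt="">
    <area shape="rect" coords="525, 59 , 564, 86" href="/cgi-bin/IMap.asp?map=5262B" alt="">
</map>'''

BASE = 'https://public.hcad.org/iMaps/Tiles/Color/'

def build_rows(html, tiles_per_facet=12):
    """Return (facet number, detailed map link) pairs, one per PDF tile."""
    # The facet number is whatever follows 'map=' inside each href
    facets = re.findall(r'map=(\w+)"', html)
    return [
        (facet, f'{BASE}{facet}{tile}.pdf')
        for facet in facets
        for tile in range(1, tiles_per_facet + 1)
    ]

rows = build_rows(sample_map)
print(len(rows))
print(rows[0])
```

Each pair can then go into one spreadsheet row, so the facet number and its detailed map link always stay side by side.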

Cheers!

Saturday, November 19, 2022

Scrape online academic materials using python

 Manually collecting academic material you find online can be a boring task. In this blog post, I will demonstrate how I use Python to collect academic theses, journals, and other materials for my profession.


Online Scientific Research Journals: 

Here, my professor wants all the journals published by "Scientific Research and Community Publishers" (onlinescientificresearch.com), together with their details, neatly arranged in a spreadsheet table.

The specific details required are the journal name/title, the page URL, the description, cover image and ISSN number.

All the details should be organized in a spreadsheet as seen below.


The code:

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup



# Section 1: Scrape journals page URLs and thumbnail images

url = 'https://www.onlinescientificresearch.com/journals.php'

# Get user-agent from: http://www.useragentstring.com/
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
journals = soup.find_all("div", {'class':'col-12 col-sm-6 col-lg-3'})

print(len(journals))
# ---------------------------------------------------------------

# Section 2: Extract paths to journals URL and thumbnail image...

url_list = []
image_list = []

for j in journals:
    url = j.find('a')['href']
    img = j.find('img')['src']
    
    url_list.append(url)
    image_list.append(img)
    
print('Done...')


# ---------------------------------------------------------------
# Section 3: Create dataframe and construct other details...

df = pd.DataFrame([url_list, image_list]).T
df.columns = ['Journal URL', 'Journal IMAGE URL']
# -------------------------------------
####### Construct Journal Name #######
df['Journal Name'] = df['Journal URL'].apply(lambda row: row.split('/')[-1].replace('.php', '').replace('-', ' ').title())


####### Construct Journal Description #######
def get_journal_descr(url):
    # Get user-agent from: http://www.useragentstring.com/
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

    response = requests.get(url, headers=headers)
    html = response.text
    
    soup = BeautifulSoup(html, 'html.parser')
    journal_descr = soup.find("div", {'class':'card-body'})
    
    return journal_descr.text
# -------------------------------------
# Scrape Journal description into a list 
j_descr_list = []
i = 1

for url in df['Journal URL']:
    print(i, 'Processing...', url)
    j_descr = get_journal_descr(url)
    
    j_descr_list.append((url, j_descr))
    i = i+1

desc_df = pd.DataFrame(j_descr_list)
# -------------------------------------

# We have to access each journal url page to get its description...
# df['Journal description'] = df['Journal URL'].apply(lambda url: get_journal_descr(url))
df['Journal description'] = desc_df[1]


####### Construct Journal ISSN #######
# We have to use OCR on the journal thumb nail to get its ISSN...
# Using OCR API at: https://ocr.space/ocrapi....

headers = {
    'apikey': 'helloworld', # 'helloworld'
    'content-type': 'application/x-www-form-urlencoded',
}

issn_list = []

for thumbnail in df['Journal IMAGE URL']:
    print('Processing....', thumbnail)
    
    data = f'isOverlayRequired=true&url={thumbnail}&language=eng'

    response = requests.post('https://api.ocr.space/Parse/Image', headers=headers, data=data, verify=False)

    result = json.loads(response.content.decode()) # Convert the result to dictionary using json.loads() function
    # type(result)

    # Check the dict keys, the ISSN is in: ParsedResults >> 0 >> ParsedText
    issn = result['ParsedResults'][0]['ParsedText'].strip().split('\r\n')[-1]

    issn_list.append(issn)

df['Journal ISSN'] = issn_list

df

Extracting the journal ISSN was definitely the trickiest part, as it requires working with an OCR API.
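
The code above takes the last line of the OCR text as the ISSN, which depends on the thumbnail layout. As a hedged alternative, the sketch below searches the whole OCR text for the usual ISSN shape (four digits, a hyphen, three digits, then a digit or X). The `find_issn` helper and the sample strings are made up for illustration; they are not output from the OCR API.

```python
import re

# Four digits, hyphen, three digits, then a check character (digit or X)
ISSN_PATTERN = re.compile(r'\b\d{4}-\d{3}[\dXx]\b')

def find_issn(ocr_text):
    """Return the first ISSN-shaped token in the text, or None if there is none."""
    match = ISSN_PATTERN.search(ocr_text)
    return match.group(0).upper() if match else None

# Invented OCR text, only to show the call
print(find_issn('Journal of Examples\r\nISSN: 1234-567x\r\n'))
print(find_issn('No number on this cover'))
```

Returning None when nothing matches makes it easy to spot thumbnails where the OCR failed, instead of storing an unrelated line of text as the ISSN.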



M.Sc. in GIST Theses

Master of Science (Geographic Information Science and Technology) Theses by University of Southern California. 


Here our professor wants the thesis details arranged in a table as seen above.

Let's start by inspecting the HTML tags on the web page.

Here I copied the parent div tag that contains the needed data into a local HTML file. With this, we don't need to send a request to the website.

import pandas as pd
from bs4 import BeautifulSoup

# Copy the parent div tag into a html/txt file...
html_file = r"C:\Users\Yusuf_08039508010\Documents\Jupyter_Notebook\2022\M.S. IN GIST THESES\M.S. IN GIST THESES.HTML"

# Use BeautifulSoup to read the html div tag....
with open(html_file, encoding='utf-8') as f:
    div_data = f.read()

soup = BeautifulSoup(div_data, 'html.parser')

thesis_years = soup.find_all("h3")

thesis_authors = soup.find_all("strong")
thesis_authors = [ a.text for a in thesis_authors ]

thesis_topics = soup.find_all("em")
thesis_topics = [ t.text for t in thesis_topics ]

thesis_advisor = soup.find_all("p")
thesis_advisor = [ a.text for a in thesis_advisor if 'Advisor:' in a.text ]

thesis_pdf = soup.find_all("a")
thesis_pdf = [ link.get('href') for link in thesis_pdf if 'Abstract Text' not in link.text ]

# --------------------------------------------
df = pd.DataFrame(thesis_authors, columns=['Author'])
df['Topic'] = thesis_topics
df['Advisor'] = thesis_advisor
df['PDF Link'] = thesis_pdf

df

The code below will download the PDF files to local disk using the requests library.

import requests

i = 1
for indx, row in df.iterrows():
    link = row['PDF Link']
    print('Processing...', link)

    pdf_name = str(i) +'_'+ link.split('/')[-1]
    pdf_file = requests.get(link, timeout=10).content

    with open( f'Thesis PDF\\{pdf_name}', 'wb' ) as f:
        f.write(pdf_file)
        
    i += 1
    # break


print('Finished...')
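
The file name above is taken from the last part of the link with `split('/')`. If a link ever carries a query string, that text would end up in the file name. A small sketch of a more defensive variant, using only the standard library, is shown below; the `pdf_filename` helper and the example.com links are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def pdf_filename(index, link):
    """Build '<index>_<file name>' from a link, ignoring any query string."""
    # urlparse separates the path from '?query' and '#fragment' parts
    name = PurePosixPath(urlparse(link).path).name
    return f'{index}_{name}'

# Hypothetical links, only to show the behaviour
print(pdf_filename(1, 'https://example.com/theses/smith-thesis.pdf'))
print(pdf_filename(2, 'https://example.com/theses/jones-thesis.pdf?download=1'))
```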





Journal - Nigerian Institution of Surveyors



This was a little bit tricky because the web page had inconsistent HTML tags.
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://nisngr.net/journal/'
response = requests.get(url, verify=False)
html = response.text
# ----------------------------


soup = BeautifulSoup(html, 'html.parser')
div_boxes = soup.find_all("div", {'class':'wpb_text_column wpb_content_element'})
# ----------------------------


papers_dict = {}
for div in div_boxes:
    papers = div.find_all('a')
    
    for link in papers:
        papers_dict[link.text] = link['href']
# ----------------------------

df = pd.DataFrame([papers_dict]).T
df




Thank you for reading.

Thursday, November 10, 2022

Automate boring tasks in QGIS with PyQGIS

 In this post, I will use PyQGIS to automate some boring tasks I often encounter in QGIS. I hope you will find something useful for your workflow. Let's get started...

If you don't know what PyQGIS is, then read this definition by hatarilabs.com: "PyQGIS is the Python environment inside QGIS with a set of QGIS libraries plus the Python tools with the potential of running other powerful libraries as Pandas, Numpy or Scikit-learn".

PyQGIS allows users to automate workflows and extend QGIS with Python libraries; the documentation can be accessed here.

This means knowledge of Python programming is required to understand some of the code below.



  Task 1~ Count number of opened/loaded layers in the layer panel

I often find myself trying to count the layers in my QGIS project's layer panel, so a simple PyQGIS script to automate the process is ideal, especially when there are many layers to count.

# This will return all the layers on the layer panel
all_layers = QgsProject.instance().mapLayers().values()
print('There are', len(all_layers), 'layers on the layer panel.')


  Task 2~ Count features in loaded vector layer

In this task, I want to get the number of features in each layer I am working on. This is similar to the 'Show Feature Count' function when you right-click on a vector layer.

# Get all layers into a list....
all_layers = list(QgsProject.instance().mapLayers().values())

# Get displayed names of vector layers and their number of features
# (raster layers are skipped because they have no featureCount)...
ftCounts = [ (l.name(), l.featureCount()) for l in all_layers if isinstance(l, QgsVectorLayer) ]
print(ftCounts)


  Task 3~ Switch on/off all layers

Turning all layers ON or OFF can be frustrating when you have many layers to click through. So why not automate it in just a click?

# Get list of layers from the layer's panel...
qgis_prjt_lyrs = QgsProject.instance().layerTreeRoot().findLayers()

# Use index to Set layer on or off....
qgis_prjt_lyrs[20].setItemVisibilityChecked(True) # True=On, False=Off

# Do for all...
for l in qgis_prjt_lyrs:
    l.setItemVisibilityChecked(False)


  Task 4~ Identify layers that are on/off

Let's extend Task 3 above, so we know which layers are on (visible) and which layers are off (hidden).

# Get list of layers from the layer's panel...
qgis_prjt_lyrs = QgsProject.instance().layerTreeRoot().findLayers()

# Check if a layer is visible or not...
layer_visibility_check = [ (l.name(), l.isVisible()) for l in qgis_prjt_lyrs ]
print(layer_visibility_check)

visibility_true = [ l.name() for l in qgis_prjt_lyrs if l.isVisible() ]
print('Number of visible layers:', len(visibility_true))

visibility_false = [ l.name() for l in qgis_prjt_lyrs if not l.isVisible() ]
print('Number of hidden layers:', len(visibility_false))


  Task 5~ Read file path of layers

This is useful when you have many layers and don't know where they are located on your machine. You will also see interesting paths to remote layers such as WMS, etc.

# Returns path to every layer...
layer_paths = [layer.source() for layer in QgsProject.instance().mapLayers().values()]
print(layer_paths)


  Task 6~ Read layer type of layers

We can check the 'type' of a layer.

# Get dict of layers from the layer's panel...
layersDict = QgsProject.instance().mapLayers()


# Avoid the names 'id' and 'map', which shadow Python built-ins
for layer_id, layer in layersDict.items():
    print(layer.name(), '>>', layer.type())


  Task 7~ Create multiple attribute fields/columns

Let's say we want to add multiple integer fields/columns to a vector layer. The code below will create attribute fields for the years 2000 to 2022, that is twenty three (23) attribute columns/fields on the selected vector layer.

# Get Layer by name...
layer = QgsProject.instance().mapLayersByName("NIG LGA")[0]

# Define dataProvider for layer
layer_provider = layer.dataProvider()

# Add an Integer attribute field and update fields...
layer_provider.addAttributes([QgsField("2000", QVariant.Int)])
layer.updateFields()

# Add bulk attribute fields...
for x in range(2001, 2023):
    layer_provider.addAttributes([QgsField(str(x), QVariant.Int)])
    layer.updateFields()

print('Done...')
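
Because `range()` excludes its end value, `range(2001, 2023)` stops at 2022. The short check below, runnable outside QGIS, lists the field names the code above would create; the variable names are only for illustration.

```python
# One field for 2000 plus the loop over range(2001, 2023) is the same as range(2000, 2023)
first_year, last_year = 2000, 2022
field_names = [str(year) for year in range(first_year, last_year + 1)]

print(len(field_names), field_names[0], field_names[-1])
```

To also include 2023, the end value of the range would need to be 2024.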


  Task 8~ Read/List all names of layers on layer panel

Here we just want to return the displayed names of layers.
# Get all layers into a list....
all_layers = list(QgsProject.instance().mapLayers().values())

# Get all displayed names of layer
all_layers_names = [ l.name() for l in all_layers ]
print(all_layers_names)


  Task 9~ Save attribute table to dataframe

# Save attribute table into Dataframe...

import pandas as pd

# Get Layer by name...
layer = QgsProject.instance().mapLayersByName("NIG LGA")[0]

# get attribute columns names
col_names = [ field.name() for field in layer.fields() ]

lga_list = []
state_list = []
apc_list = []
pdp_list = []
lp_list = []
nnpp_list = []
winner_list = []


for feature in layer.getFeatures():
    lga_list.append(feature['lga_name'])
    state_list.append(feature['state_name'])
    apc_list.append(feature['APC'])
    pdp_list.append(feature['PDP'])
    lp_list.append(feature['LP'])
    nnpp_list.append(feature['NNPP'])
    winner_list.append(feature['Winner'])

df = pd.DataFrame([state_list, lga_list, apc_list, pdp_list, lp_list, nnpp_list, winner_list]).T

df.to_csv(r'C:\Users\Yusuf_08039508010\Desktop\...\test.csv')

print('Done....')


  Task 10~ Select from multiple layers and attribute fields

Here we want to conduct multiple selection of given keywords from all listed layers and all attribute fields.
# Query to Select from all listed layers and all attribute fields
search_for = {'Bauchi', 'SSZ', 'Edo', 'Yobe'}

for lyr in QgsProject.instance().mapLayers().values():
    if isinstance(lyr, QgsVectorLayer):
        to_select = []
        # fieldlist = [f.name() for f in lyr.fields()]
        for f in lyr.getFeatures():
            # Check if any of the search keyword intersects to
            # feature's row attribute. If true, get the feature ID for selection...
            if len(search_for.intersection(f.attributes())) > 0:
                to_select.append(f.id())
        if len(to_select) > 0:
            lyr.select(to_select)
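
The matching test inside the loop is plain Python set logic, so it can be tried outside QGIS. In the sketch below, the `rows` dictionary is a hypothetical stand-in for feature IDs and the lists returned by `f.attributes()`.

```python
search_for = {'Bauchi', 'SSZ', 'Edo', 'Yobe'}

# Hypothetical feature id -> attribute values, standing in for f.id() and f.attributes()
rows = {
    1: ['Bauchi', 'NEZ', 20],
    2: ['Lagos', 'SWZ', 20],
    3: ['Delta', 'SSZ', 25],
}

# A row is selected when at least one attribute value equals a search keyword
to_select = [fid for fid, attrs in rows.items() if search_for.intersection(attrs)]
print(to_select)
```

Note that the comparison is exact and case-sensitive, so 'bauchi' or 'Bauchi State' would not match the keyword 'Bauchi'.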




  Task 11~ Convert multiple GeoJSON files to shapefiles

import glob

input_files = glob.glob(r'C:\Users\Yusuf_08039508010\Desktop\Working_Files\GIS Data\US Zip Codes\*.json')
for f in input_files:
    out_filename = f.split('\\')[-1].split('.')[0]
    input_file = QgsVectorLayer(f, "polygon", "ogr")
    
    if input_file.isValid() == True:
        QgsVectorFileWriter.writeAsVectorFormat(input_file, rf"C:\Users\Yusuf_08039508010\Desktop\Working_Files\Fiverr\2021\05-May\Division_Region_Area Map\SHP\US ZipCode\{out_filename}.shp", "UTF-8", input_file.crs(), "ESRI Shapefile")
    else:
        print(f, 'is not a valid input file')
        
print('Done Processing..., ', f)
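
The output name above is derived with `split('.')[0]`, which cuts the name at the first dot. A hedged alternative is `pathlib`, whose `stem` only drops the final extension. The sketch uses `PureWindowsPath` so that it behaves the same on any operating system; the `out_name` helper and the sample paths are hypothetical.

```python
from pathlib import PureWindowsPath

def out_name(path):
    """Return the file name without its final extension."""
    return PureWindowsPath(path).stem

# Hypothetical paths, only to show the behaviour
print(out_name(r'C:\GIS Data\US Zip Codes\tx_zip_codes.json'))
print(out_name(r'C:\GIS Data\US Zip Codes\tx.zip.codes.json'))
```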



Task 12~ Display attributes of selected features

From an active layer, print attributes of selected features.

# Display attributes of selected features...
layer = iface.activeLayer()
features = layer.selectedFeatures()
print(f'{len(features)} features selected.')

for f in features:
    print(f.attributeMap())  # dict of fieldnames:Values
    # print(f.attributes())
    # print(f['Field_Name'])



Thank you for reading.

Friday, November 4, 2022

Search nearby places - Comparing three API (Google Places API, Geoapify API and HERE API)

 In this post, I will compare APIs from three different providers for searching nearby places. The three APIs to compare are: Google Places API, Geoapify API and HERE API.

For each of the platforms, you need to register and get a developer API key. All the platforms offer a limited free API quota to start with.



Google Places API


import json
import requests
import pandas as pd
from datetime import datetime

df = pd.read_csv('datafile.csv')


YOUR_API_KEY = 'AIza......'

i = 1
for row, col in df.iterrows():
    lat = col['Latitude']
    long = col['Longitude']
    print(i, 'Processing...', lat, long)
    
    url = f'https://maps.googleapis.com/maps/api/place/nearbysearch/json?location={lat}%2C{long}&radius=4850&type=laundry&keyword=laundromats&key={YOUR_API_KEY}'

    payload={}
    headers = {}

    response = requests.request("GET", url, headers=headers, data=payload)

    # Get current time...
    now = datetime.now()
    current_time = now.strftime("%Y%m%d__%H%M%S")

    # Write to file.... (state_folder must be set beforehand; it is not defined in this snippet)
    with open(fr'JSON folder\\GoogleAPI\\{state_folder}\\{current_time}.json', 'w') as outfile:
        json.dump(response.json(), outfile)

    i = i+1
    
    
print('Done...')
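
The request URL above is assembled by hand, including the `%2C` that encodes the comma between latitude and longitude. As a sketch, the same URL can be built with `urllib.parse.urlencode`, which does the encoding itself. The `nearby_search_url` helper, the sample coordinates and the placeholder key are illustrative assumptions; the endpoint and parameter values are the ones used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'

def nearby_search_url(lat, lng, api_key, radius=4850, place_type='laundry', keyword='laundromats'):
    """Build the Nearby Search URL; urlencode escapes the comma in 'location'."""
    params = {
        'location': f'{lat},{lng}',
        'radius': radius,
        'type': place_type,
        'keyword': keyword,
        'key': api_key,
    }
    return f'{BASE_URL}?{urlencode(params)}'

# Placeholder coordinates and key, only to show the resulting URL
url = nearby_search_url(27.95, -82.45, 'YOUR_API_KEY')
print(url)
```

Alternatively, the same `params` dictionary can be passed to `requests.get(BASE_URL, params=params)`, as the Geoapify example does.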


Geoapify API


GeoApify_API_KEY = '378122b08....'

url = 'https://api.geoapify.com/v2/places'

params = dict(
    categories='commercial',
    filter='rect:7.735282,48.586797,7.756289,48.574457',
    limit=2000,
    apiKey=GeoApify_API_KEY
)

resp = requests.get(url=url, params=params)
data = resp.json()

print(data)




HERE API


HERE_API_KEY = 'WEYn....'
coord = '27.95034271398129,-82.45670935632066' # lat, long
url = f'https://places.ls.hereapi.com/places/v1/discover/here?apiKey={HERE_API_KEY}&at={coord}&laundry'

response = requests.get(url).json()  # parsed JSON (a dict)
# print(response)

# Get current time...
now = datetime.now()
current_time = now.strftime("%Y%m%d__%H%M%S")


# Write to file....
with open(fr'JSON folder\\{current_time}.json', 'w') as outfile:
    json.dump(response, outfile)
    
print('Done...')



Tuesday, November 1, 2022

Thursday, October 27, 2022

Convert Coordinates in United Nations Code for Trade and Transport Locations (UN/LOCODE) Code List to GIS friendly format

 The code list table at 'UN/LOCODE Code List by Country and Territory' has a column named coordinate. This column contains the geographical coordinates (latitude/longitude) in a format that is not suitable for use in GIS software. The reason is explained on this page by the UN.

So, basically it says that, in order to avoid unnecessary use of non-standard characters and spaces, the following standard presentation is used: 0000lat 00000long

(lat - Latitude: N or S ; long – Longitude: W or E, only one digit, capital letter)

Where the last two rightmost digits refer to minutes and the first two or three digits refer to the degrees for latitude and longitude respectively. In addition, you must specify N or S for latitude and W or E for longitude, as appropriate.


While this may be a good format for them, it is not a good format for most GIS platforms. Hence there is a need to convert it into what a GIS can easily utilize.

This means we will convert a coordinate that looks like this '0507N 00722E' to decimal degrees or to degrees and minutes.


unlocode_coord = '0507S 00722E'

unlat, unlong = unlocode_coord.split(' ')


# Handling Latitude...
# --------------------------
# The last character is always the hemisphere: N or S...
lat_hem = unlat[-1]
unlat = unlat[:-1]
lat_deg = unlat[:2]
lat_min = unlat[-2:]

# Result in DM... Degrees Minutes
print(f"The result in Degree Minute is: {lat_deg}°{lat_min}'{lat_hem}")


# Result in DD.... Decimal Degrees - Since 1° = 60'
# UN/LOCODE carries no seconds, so DD = degrees + minutes/60
lat_dd = round(float(lat_deg) + float(lat_min)/60, 4)
# Southern latitudes are negative in decimal degrees
if lat_hem == 'S':
    lat_dd = -lat_dd
print(f"The result in Decimal Degree is: {lat_dd}°")




# Handling Longitude...
# --------------------------
# The last character is always the hemisphere: E or W...
long_hem = unlong[-1]
unlong = unlong[:-1]
long_deg = unlong[:3]
long_min = unlong[-2:]

print()
# Result in DM... Degrees Minutes
print(f"The result in Degree Minute is: {long_deg}°{long_min}'{long_hem}")

# Result in DD.... Decimal Degrees - Since 1° = 60'
# UN/LOCODE carries no seconds, so DD = degrees + minutes/60
long_dd = round(float(long_deg) + float(long_min)/60, 4)
# Western longitudes are negative in decimal degrees
if long_hem == 'W':
    long_dd = -long_dd
print(f"The result in Decimal Degree is: {long_dd}°")
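
The conversion rule can also be wrapped into one reusable function: decimal degrees = degrees + minutes/60, negated for the S and W hemispheres. The sketch below is a minimal version of that idea; the `unlocode_to_dd` name and the rounding to four decimals are choices made here for illustration, and the input is assumed to be a well-formed value such as '0507S 00722E'.

```python
def unlocode_to_dd(coord):
    """Convert a UN/LOCODE coordinate like '0507S 00722E' to (lat, long) in decimal degrees."""
    lat_part, long_part = coord.split()

    def part_to_dd(part, deg_digits):
        hemisphere = part[-1]                 # N, S, E or W
        degrees = int(part[:deg_digits])      # 2 digits for latitude, 3 for longitude
        minutes = int(part[deg_digits:-1])    # always the 2 digits before the letter
        value = degrees + minutes / 60        # 1 degree = 60 minutes
        return -value if hemisphere in 'SW' else value

    return round(part_to_dd(lat_part, 2), 4), round(part_to_dd(long_part, 3), 4)

print(unlocode_to_dd('0507S 00722E'))
print(unlocode_to_dd('0507N 00722E'))
```

Since the source format only carries whole minutes, the result is accurate to about one minute of arc (roughly 1.85 km of latitude), however many decimals are printed.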

You may use the tool on this website to learn more as seen below.





The reverse - from decimal degrees to UN/LOCODE coordinates

# ------------------- For Latitude ---------------------------------

lat = 13.893937

# Hemisphere letter: N for zero/positive latitudes, S for negative
lat_hem = 'N' if lat >= 0 else 'S'
lat = abs(lat)

# Degree...
lat_degree = int(lat)
# Minute (whole minutes, truncated)
lat_minute = int((lat - lat_degree) * 60)

# Zero-pad to the fixed UN/LOCODE widths: 2 digits for degrees, 2 for minutes
lat_result = f'{lat_degree:02d}{lat_minute:02d}{lat_hem}'
print(f'Latitude in UN/LOCODE is: {lat_result}')
    


# ------------------- For Longitude ---------------------------------

long = -123.893937

# Hemisphere letter: E for zero/positive longitudes, W for negative
long_hem = 'E' if long >= 0 else 'W'
long = abs(long)

# Degree...
long_degree = int(long)
# Minute (whole minutes, truncated)
long_minute = int((long - long_degree) * 60)

# Zero-pad to the fixed UN/LOCODE widths: 3 digits for degrees, 2 for minutes
long_result = f'{long_degree:03d}{long_minute:02d}{long_hem}'
print(f'Longitude in UN/LOCODE is: {long_result}')
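
The latitude and longitude steps follow the same recipe, so they can be combined into one function. The sketch below assumes truncation to whole minutes, as in the code above, and zero-pads every field to the fixed widths of the UN/LOCODE format; the `dd_to_unlocode` name is made up for this example.

```python
def dd_to_unlocode(lat, long):
    """Convert decimal degrees to a UN/LOCODE coordinate string like '1353N 12353W'."""

    def part(value, deg_width, positive, negative):
        hemisphere = positive if value >= 0 else negative
        value = abs(value)
        degrees = int(value)
        minutes = int((value - degrees) * 60)   # whole minutes, truncated
        # Zero-pad degrees to the given width and minutes to 2 digits
        return f'{degrees:0{deg_width}d}{minutes:02d}{hemisphere}'

    return f"{part(lat, 2, 'N', 'S')} {part(long, 3, 'E', 'W')}"

print(dd_to_unlocode(13.893937, -123.893937))
print(dd_to_unlocode(-5.1167, 7.3667))
```

Without the zero padding, a value such as 5°07' would come out as '57' instead of '0507' and could no longer be split back into degrees and minutes.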



That is it!

Saturday, October 22, 2022

Mathematics of successful life in Python

There is a text that trends on social media in which the twenty six letters of the alphabet are assigned numbers from one to twenty six, and these numbers are used to calculate the "percentage" of some words, as quoted below:

I found this to be very interesting and meaningful message to share:-
IF:
A = 1 
B = 2 
C = 3  
D = 4
E = 5  
F = 6
G = 7  
H = 8
I = 9  
J = 10  
K = 11  
L = 12
M = 13  
N = 14 
O = 15  
P = 16
Q = 17
R = 18 
S = 19
T = 20
U = 21
V = 22 
W = 23  
X = 24
Y = 25 
Z = 26

THEN,
H+A+R+D+W+O+R+K
8+1+18+4+23+15+18+11 = 98%

K+N+O+W+L+E+D+G+E
11+14+15+23+12+5+4+7+5 = 96%

L+O+V+E
12+15+22+5 = 54%

L+U+C+K
12+21+3+11 = 47%

None of them makes 100%.
Then what makes 100%?
Is it Money? NO!

M+O+N+E+Y
13+15+14+5+25 = 72%

E+D+U+C+A+T+I+O+N
5+4+21+3+1+20+9+15+14 = 92%

Leadership? NO!

L+E+A+D+E+R+S+H+I+P
12+5+1+4+5+18+19+8+9+16 = 97%

Every problem has a solution, only if we perhaps change our ATTITUDE...
A+T+T+I+T+U+D+E = 1+20+20+9+20+21+4+5  = 100%
It is therefore OUR ATTITUDE towards Life and Work that makes OUR Life 100% Successful.

Amazing mathematics
Let's change our Attitude of doing things in life.
Because it's our attitude that is our problem
Not the Devil.
Tusaai Piadin Gideon copied


Let's see how we can transform this into a Python script.

alphabets = {'A' : 1, 'B' : 2, 'C' : 3, 'D' : 4, 'E' : 5, 'F' : 6, 'G' : 7, 'H' : 8, 'I' : 9, 'J' : 10, 'K' : 11, 'L' : 12, 'M' : 13, 'N' : 14, 'O' : 15, 'P' : 16, 'Q' : 17, 'R' : 18, 'S' : 19, 'T' : 20, 'U' : 21, 'V' : 22, 'W' : 23, 'X' : 24, 'Y' : 25, 'Z' : 26}

solve = 'M+O+N+E+Y'
solve1 = solve.split('+')
solve2 = [ alphabets[a] for a in solve1 ]
solve3 = str(sum(solve2)) + '%'

print(solve3)
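
The same calculation can be wrapped in a small function so that any word can be checked without typing the letters with plus signs. This is only a sketch: the `word_score` name is made up here, and the letter values are generated from `string.ascii_uppercase` instead of being typed out by hand.

```python
from string import ascii_uppercase

# A = 1, B = 2, ..., Z = 26
LETTER_VALUES = {letter: position for position, letter in enumerate(ascii_uppercase, start=1)}

def word_score(word):
    """Sum the letter values of a word, ignoring anything that is not A-Z."""
    return sum(LETTER_VALUES[ch] for ch in word.upper() if ch in LETTER_VALUES)

for word in ['HARDWORK', 'KNOWLEDGE', 'MONEY', 'ATTITUDE']:
    print(word, str(word_score(word)) + '%')
```

Because characters outside A to Z are skipped, the function also accepts the original 'M+O+N+E+Y' spelling.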

That is it!