HTML-TABLE Scraping

Question

I am trying to write an HTML table scraper function, but it is not working as expected. I tried it on a Wikipedia table (https://en.wikipedia.org/wiki/List_of_crop_plants_pollinated_by_bees), and the output is just blank lines.
Here is the code:
```
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrapetable(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    polltable = soup.find('table')
    # Column names are taken from the <th> cells inside <thead>
    headers = [header.text for listing in polltable.find_all('thead') for header in listing.find_all('th')]
    raw_data = {header: [] for header in headers}
    for rows in soup.find_all('tbody'):
        for row in rows.find_all('tr'):
            cells = row.find_all('td')
            # Skip rows whose cell count does not match the header count
            if len(cells) != len(headers): continue
            for idx, cell in enumerate(cells):
                raw_data[headers[idx]].append(cell.text)
    return pd.DataFrame(raw_data)
```

Chris Freeman · Accepted Answer

Hey @doug james (https://teamtreehouse.com/dougjames), a very interesting question. The short answer is there is no "thead" to find. 
After inspecting the page source (using "view source", not the page inspector) and dumping the values of soup and polltable to a file, I found no "thead" anywhere. The wiki markup source has no explicit "thead" mechanism either. I suspect (this is speculation) that the thead is added dynamically by JavaScript when the page loads, which would explain why "thead" is not in the scraped data.
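If you want to repeat that check yourself, here is a minimal sketch that counts the tags in a raw HTML string. It uses the standard library's html.parser rather than BeautifulSoup so it runs without extra installs. The names TagCounter, count_tags and SAMPLE_HTML are made up for illustration, and the sample markup is only a hypothetical stand-in for what the server sends; for the real page you would pass in requests.get(url).text instead.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every start tag seen in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical stand-in for the static HTML (assumption: no <thead>,
# header cells sit in the first row of <tbody>).
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Common name</th><th>Pollinator</th></tr>
<tr><td>Apple</td><td>Honey bees</td></tr>
</tbody>
</table>
"""

counts = count_tags(SAMPLE_HTML)
print("thead:", counts["thead"], "th:", counts["th"])
```

A count of zero for "thead" alongside a non-zero count for "th" would suggest the header cells are present but not wrapped in a thead element.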
The wiki source lists the table classes as "wikitable sortable", but the element inspector shows the table classes as "wikitable sortable jquery-tablesorter". So that might be a key to where the "thead" is being inserted.
In general, when scraping it's better to use the page source (Ctrl-U in Firefox) as the structure map rather than the element inspector, because the inspector shows the page after scripts have changed it.
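One possible way around the missing "thead" is to take the column names from the "th" cells wherever they appear, instead of looking inside a thead element. The sketch below shows the idea with the standard library's html.parser (a swapped-in technique, not the BeautifulSoup code from the question). TableParser, table_to_columns and SAMPLE_HTML are hypothetical names, and it assumes a single table whose "th" cells occur only in the header row.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Reads header names from <th> cells and data from <td> cells."""

    def __init__(self):
        super().__init__()
        self.headers = []   # text of every <th> cell, in order
        self.rows = []      # one list of <td> texts per data row
        self._row = None    # <td> texts of the row being read
        self._kind = None   # 'th' or 'td' while inside a cell
        self._text = []     # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._kind, self._text = tag, []

    def handle_data(self, data):
        if self._kind:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._kind == tag:
            text = "".join(self._text).strip()
            if tag == "th":
                self.headers.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._kind = None
        elif tag == "tr":
            if self._row:  # rows without <td> cells (the header row) are skipped
                self.rows.append(self._row)
            self._row = None


def table_to_columns(html):
    """Return {header: [values]} built from the table in `html`."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    columns = {name: [] for name in parser.headers}
    for row in parser.rows:
        if len(row) != len(parser.headers):
            continue  # skip rows that do not line up with the headers
        for name, value in zip(parser.headers, row):
            columns[name].append(value)
    return columns


# Hypothetical sample markup (assumption: headers live in the first <tr>).
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Common name</th><th>Pollinator</th></tr>
<tr><td>Apple</td><td>Honey bees</td></tr>
<tr><td>Almond</td><td>Honey bees</td></tr>
</tbody>
</table>
"""

columns = table_to_columns(SAMPLE_HTML)
print(columns)
```

The resulting dictionary can be handed to pd.DataFrame(columns), as in the original function. A real page with several tables, or with "th" cells used as row labels, would need extra handling that this sketch leaves out.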
Post back if you have more questions. Good luck!!
