Python

HTML-TABLE Scraping

I am trying to write an HTML table scraper function, but it is not working as expected. I tried it on a Wikipedia table and the output is just blank lines.

Here is the code:

def scrape_table(url):
    soup = BeautifulSoup(requests.get(url).text)
    poll_table = soup.find('table')
    headers = [header.text for listing in poll_table.find_all('thead') for header in listing.find_all('th')]
    raw_data = {header:[] for header in headers}

    for rows in soup.find_all('tbody'):
        for row in rows.find_all('tr'):
            if len(row) != len(headers): continue
            for idx, cell in enumerate(row.find_all('td')):
                if row.find_all('td'):
                    raw_data[headers[idx]].append(cell.text)
                else:
                    raw_data[headers[idx]].append('')

    return pd.DataFrame(raw_data) 

1 Answer

Chris Freeman
Treehouse Moderator 68,426 Points

Hey doug james, a very interesting question. The short answer is there is no "thead" to find.

After inspecting the page source (using "view source", not the page inspector) and dumping the values of soup and poll_table to a file, I found no "thead" anywhere. The wiki markup source has no explicit "thead" mechanism either. I suspect the thead is added dynamically by JavaScript when the page loads, which would explain why "thead" is not in the scraped data.

The wiki source lists the table classes as "wikitable sortable", but the element inspector shows them as "wikitable sortable jquery-tablesorter". That extra class might be a clue to where the "thead" is being inserted.
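
A quick way to confirm this kind of thing is to ask BeautifulSoup what the raw HTML actually contains. The snippet below is only a sketch: SAMPLE_HTML is a made-up stand-in for the table as the server sends it, and describe_table is a hypothetical helper name, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, shaped like a wiki table before any JavaScript runs.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Poll</th><th>Date</th></tr>
<tr><td>A</td><td>2016-01-01</td></tr>
</tbody>
</table>
"""


def describe_table(html):
    """Report what the first table in the raw HTML contains."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    return {
        # "class" is a multi-valued attribute, so bs4 returns a list.
        "classes": table.get("class", []),
        "has_thead": table.find("thead") is not None,
        "th_count": len(table.find_all("th")),
    }


report = describe_table(SAMPLE_HTML)
print(report)
```

For a live page, passing requests.get(url).text to the same helper should show whether "thead" and the "jquery-tablesorter" class are present in what the scraper really receives.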

In general, it's better to use the page source (Ctrl-U in Firefox) as the structure map when scraping, because it shows the HTML that requests actually receives, before any JavaScript changes it.
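
If the raw source has no "thead", one possible workaround is to take the headers from the first row that contains "th" cells. This is a rough sketch under that assumption, not a general-purpose scraper: scrape_table_html is a hypothetical name, the sample HTML is invented, and rows are matched by counting their "td" cells rather than with len(row).

```python
from bs4 import BeautifulSoup


def scrape_table_html(html):
    """Parse the first table in `html` into {header: [cell texts]}."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return {}

    rows = table.find_all("tr")
    # Assumption: the first row containing <th> cells is the header row.
    header_row = next((row for row in rows if row.find("th")), None)
    if header_row is None:
        return {}

    headers = [th.get_text(strip=True) for th in header_row.find_all("th")]
    data = {header: [] for header in headers}

    for row in rows:
        cells = row.find_all("td")
        # Skip the header row (no <td>) and rows whose cell count differs.
        if len(cells) != len(headers):
            continue
        for header, cell in zip(headers, cells):
            data[header].append(cell.get_text(strip=True))
    return data


# Hypothetical sample shaped like a wiki table without <thead>.
sample = """
<table class="wikitable sortable">
<tbody>
<tr><th>Poll</th><th>Date</th></tr>
<tr><td>A</td><td>2016-01-01</td></tr>
<tr><td>B</td><td>2016-02-01</td></tr>
</tbody>
</table>
"""

result = scrape_table_html(sample)
print(result)
```

To get back to the original function's return value, something like pd.DataFrame(scrape_table_html(requests.get(url).text)) should work. Tables with merged cells, row headers, or nested tables would need extra handling.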

Post back if you have more questions. Good luck!!