Iterating over multiple divs with Python BeautifulSoup, then writing the output to a DataFrame and a CSV

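I'm trying to build a scraper/parser for my school's course catalog. The first step is to scrape the Coursicle database into a CSV, but so far I can only get it to output the first row.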

Here is a snippet of the HTML I'm trying to parse:

<div class="card back" style="display: block;">
    <div class="addClass Back"> 
        <i class="fa clicky fa-star Back"></i> 
        <i class="fa clicky fa-star-o Back"></i>  
        <i class="clicky icon-info-sign"></i>
    <div class="courseNumberBack">
        <span class="subject">ANTH</span> <span class="number">54</span>-<span class="section">001</span>
        <div class="smallCourseInfo">
            <span class="abbrevTitle">First-Year Seminar: The Indians' New Worlds: Southeastern Histories from 1200 to 1800</span> 
    <hr class="faddedLine">
    <div class="courseNameBack"><div class="days">TuTh</div><br>
    <div class="smallCourseInfo"> <div class="instructor">Clara Scarry</div></div>
    <div class="time">3:30pm-4:45pm</div><br>
    <div class="smallCourseInfo"> <div class="building">Alumni 203 </div></div>
    <div class="genEds">HS US WB </div>

Here is my code:

import pandas as pd
import os
import csv
import itertools
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/Users/as9934/Desktop/schedule/wb.htm"), "lxml")
cardback = (soup.find('div', class_='card back'))
for courseNumberBack in cardback.find_all('div', class_='courseNumberBack'):
    for subject in courseNumberBack.find_all('span', class_='subject'):
        for subjects in subject: 
            print (subjects.string,",", end=' ')
    for number in courseNumberBack.find_all('span', class_='number'):
        for numbers in number:
            print (numbers.string,",", end=' ')
    for section in courseNumberBack.find_all('span', class_='section'):
        for sections in section:
            print(sections.string,",", end=' ')
    for abbrevTitle in courseNumberBack.find_all('span', class_='abbrevTitle'):
        for abbrevTitles in abbrevTitle:
            print(abbrevTitles.string,",", end=' ')
for courseNameBack in cardback.find_all('div', class_='courseNameBack'):
    for day in courseNameBack.find_all('div', class_='days'):
        for days in day: 
            print(days.string,",", end=' ')
    for instructor in courseNameBack.find_all('div', class_='instructor'):
        for instructors in instructor:
            print(instructors.string,",", end=' ')
    for time in courseNameBack.find_all('div', class_='time'):
        for times in time:
            print(times.string,",", end=' ')
    for building in courseNameBack.find_all('div', class_='building'):
        for buildings in building:
            print(buildings.string,",", end=' ')
    for genEd in courseNameBack.find_all('div', class_='genEds'):
        for genEds in genEd:
            print(genEds.string, end=' ')

I tried this:

cardback = (soup.find('div', class_='card back'))
result = dict(
    [cardback.text for cardback in soup.select('span.subject')] , 
    [cardback.text for cardback in soup.select('span.number')] ,
    [cardback.text for cardback in soup.select('span.section')] , 
    [cardback.text for cardback in soup.select('span.abbrevTitle')] , 
    [cardback.text for carback in soup.select('div.days')] , 
    [cardback.text for carback in soup.select('div.instructor')] , 
    [cardback.text for carback in soup.select('div.time')] , 
    [cardback.text for carback in soup.select('div.building')] , 
    [cardback.text for carback in soup.select('div.genEds')] 
print(result) 

However, this returns the following error:

ValueError: dictionary update sequence element #0 has length 9; 2 is required
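
For context on this error: `dict()` expects either a mapping or an iterable of two-item (key, value) pairs, so handing it lists of scraped values directly raises this kind of `ValueError`. A minimal sketch, assuming hypothetical column names `subject`, `number`, and `section` paired with values taken from the HTML snippet above:

```python
# Hypothetical column names and values, mirroring fields in the HTML snippet.
keys = ["subject", "number", "section"]
values = ["ANTH", "54", "001"]

# dict() accepts an iterable of (key, value) pairs; zip() builds those pairs.
row = dict(zip(keys, values))
print(row)

# A sequence whose elements are not length-2 pairs triggers the same error class.
error_message = None
try:
    dict([values])  # one element of length 3, not a (key, value) pair
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Pairing each value with a name first, for example with `zip`, is one way to get a dictionary per course.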

Does anyone have any ideas?
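
One possible direction, offered as a sketch rather than a definitive fix: `soup.find` returns only the first matching element, which would explain getting a single row, whereas `soup.select` (or `find_all`) returns every match. The example below assumes each course lives in its own `div` with the classes `card` and `back`. The inline `html` string, the second card's values, the `FIELDS` mapping, and the `parse_card` helper are all made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in document with two cards; the second card's values are placeholders.
html = """
<div class="card back">
  <span class="subject">ANTH</span> <span class="number">54</span>-<span class="section">001</span>
  <span class="abbrevTitle">First-Year Seminar</span>
  <div class="days">TuTh</div>
  <div class="instructor">Clara Scarry</div>
  <div class="time">3:30pm-4:45pm</div>
  <div class="building">Alumni 203 </div>
  <div class="genEds">HS US WB </div>
</div>
<div class="card back">
  <span class="subject">TEST</span> <span class="number">101</span>-<span class="section">002</span>
  <span class="abbrevTitle">Placeholder Course</span>
  <div class="days">MWF</div>
  <div class="instructor">Example Instructor</div>
  <div class="time">9:00am-9:50am</div>
  <div class="building">Example Hall 1</div>
  <div class="genEds">XX</div>
</div>
"""

# Hypothetical mapping from CSV column name to CSS selector inside one card.
FIELDS = {
    "subject": "span.subject",
    "number": "span.number",
    "section": "span.section",
    "title": "span.abbrevTitle",
    "days": "div.days",
    "instructor": "div.instructor",
    "time": "div.time",
    "building": "div.building",
    "genEds": "div.genEds",
}


def parse_card(card):
    """Return one row (a dict) for a single course card."""
    row = {}
    for name, selector in FIELDS.items():
        tag = card.select_one(selector)
        # Missing fields become None instead of raising an AttributeError.
        row[name] = tag.get_text(strip=True) if tag else None
    return row


soup = BeautifulSoup(html, "html.parser")

# select() returns every div carrying both classes, not just the first one.
rows = [parse_card(card) for card in soup.select("div.card.back")]

df = pd.DataFrame(rows)
csv_text = df.to_csv(index=False)  # with no path given, to_csv returns a string
print(csv_text)
```

To write a file instead of a string, something like `df.to_csv("courses.csv", index=False)` should work, where `courses.csv` is just an example filename. Building one dict per card keeps each course's fields together on a single row.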