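I'm trying to build a scraper/parser for my school's course catalog. The first step is to scrape the Coursicle database into a CSV, but right now I can only get it to output the first row.
Here is a snippet of the HTML I'm trying to parse: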
<div class="card back" style="display: block;">
<div class="addClass Back">
<i class="fa clicky fa-star Back"></i>
<i class="fa clicky fa-star-o Back"></i>
<i class="clicky icon-info-sign"></i>
<div class="courseNumberBack">
<span class="subject">ANTH</span> <span class="number">54</span>-<span class="section">001</span>
<div class="smallCourseInfo">
<span class="abbrevTitle">First-Year Seminar: The Indians' New Worlds: Southeastern Histories from 1200 to 1800</span>
<hr class="faddedLine">
<div class="courseNameBack"><div class="days">TuTh</div><br>
<div class="smallCourseInfo"> <div class="instructor">Clara Scarry</div></div>
<div class="time">3:30pm-4:45pm</div><br>
<div class="smallCourseInfo"> <div class="building">Alumni 203 </div></div>
<div class="genEds">HS US WB </div>
Here is my code:
import pandas as pd
import os
import csv
import itertools
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/Users/as9934/Desktop/schedule/wb.htm"), "lxml")
cardback = (soup.find('div', class_='card back'))

for courseNumberBack in cardback.find_all('div', class_='courseNumberBack'):
    for subject in courseNumberBack.find_all('span', class_='subject'):
        for subjects in subject:
            print(subjects.string, ",", end=' ')
    for number in courseNumberBack.find_all('span', class_='number'):
        for numbers in number:
            print(numbers.string, ",", end=' ')
    for section in courseNumberBack.find_all('span', class_='section'):
        for sections in section:
            print(sections.string, ",", end=' ')
    for abbrevTitle in courseNumberBack.find_all('span', class_='abbrevTitle'):
        for abbrevTitles in abbrevTitle:
            print(abbrevTitles.string, ",", end=' ')

for courseNameBack in cardback.find_all('div', class_='courseNameBack'):
    for day in courseNameBack.find_all('div', class_='days'):
        for days in day:
            print(days.string, ",", end=' ')
    for instructor in courseNameBack.find_all('div', class_='instructor'):
        for instructors in instructor:
            print(instructors.string, ",", end=' ')
    for time in courseNameBack.find_all('div', class_='time'):
        for times in time:
            print(times.string, ",", end=' ')
    for building in courseNameBack.find_all('div', class_='building'):
        for buildings in building:
            print(buildings.string, ",", end=' ')
    for genEd in courseNameBack.find_all('div', class_='genEds'):
        for genEds in genEd:
            print(genEds.string, end=' ')
I tried this:
cardback = (soup.find('div', class_='card back'))
result = dict(
    [cardback.text for cardback in soup.select('span.subject')] ,
    [cardback.text for cardback in soup.select('span.number')] ,
    [cardback.text for cardback in soup.select('span.section')] ,
    [cardback.text for cardback in soup.select('span.abbrevTitle')] ,
    [cardback.text for carback in soup.select('div.days')] ,
    [cardback.text for carback in soup.select('div.instructor')] ,
    [cardback.text for carback in soup.select('div.time')] ,
    [cardback.text for carback in soup.select('div.building')] ,
    [cardback.text for carback in soup.select('div.genEds')]
)
print(result)
However, this returns the following error:
ValueError: dictionary update sequence element #0 has length 9; 2 is required
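As far as I can tell, this message appears when dict() is handed a sequence whose elements are not 2-item (key, value) pairs. Below is a minimal sketch that reproduces the same message without any scraping; the `columns` and `names` lists are hypothetical stand-ins for the scraped values, not my real data.

```python
# dict() accepts an iterable of (key, value) pairs; an element of any
# other length raises ValueError.
columns = [str(i) for i in range(9)]  # hypothetical stand-in for nine scraped values

try:
    dict([columns])  # one element of length 9 instead of pairs of length 2
except ValueError as exc:
    message = str(exc)

print(message)

# Pairing field names with values via zip() yields the 2-item elements
# that dict() expects. The names mirror the CSS classes in the HTML snippet.
names = ["subject", "number", "section", "abbrevTitle", "days",
         "instructor", "time", "building", "genEds"]
paired = dict(zip(names, columns))
print(paired["subject"], paired["genEds"])
```

So a dict would presumably need each field name paired with its value, rather than nine bare lists.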
Does anyone have any ideas?
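One direction that might work, offered only as a sketch: soup.find() returns just the first matching element, which would explain getting a single row, whereas soup.select() or find_all() returns every match. The sketch below assumes each course sits in its own div with classes "card" and "back", uses the built-in "html.parser" instead of lxml, and runs on a made-up two-card sample (the second card is entirely hypothetical). The helper names `card_to_row` and `cards_to_csv` are my own.

```python
import csv
import io

from bs4 import BeautifulSoup

# Two-card sample modeled on the snippet above; the second card is invented.
SAMPLE_HTML = """
<div class="card back">
  <div class="courseNumberBack">
    <span class="subject">ANTH</span> <span class="number">54</span>-<span class="section">001</span>
    <span class="abbrevTitle">First-Year Seminar</span>
  </div>
  <div class="courseNameBack">
    <div class="days">TuTh</div>
    <div class="instructor">Clara Scarry</div>
    <div class="time">3:30pm-4:45pm</div>
    <div class="building">Alumni 203 </div>
    <div class="genEds">HS US WB </div>
  </div>
</div>
<div class="card back">
  <div class="courseNumberBack">
    <span class="subject">HYPO</span> <span class="number">101</span>-<span class="section">002</span>
    <span class="abbrevTitle">Placeholder Course</span>
  </div>
  <div class="courseNameBack">
    <div class="days">MWF</div>
    <div class="instructor">Jane Doe</div>
    <div class="time">9:00am-9:50am</div>
    <div class="building">Example 100</div>
    <div class="genEds">XX</div>
  </div>
</div>
"""

# Field names double as the CSS class names used in the HTML.
FIELDS = ["subject", "number", "section", "abbrevTitle", "days",
          "instructor", "time", "building", "genEds"]


def card_to_row(card):
    """Return one CSV row for a card; missing fields become empty strings."""
    row = []
    for name in FIELDS:
        tag = card.find(class_=name)
        row.append(tag.get_text(strip=True) if tag else "")
    return row


def cards_to_csv(html):
    """Build CSV text with a header row plus one row per card in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    # select() returns every matching card, unlike find(), which stops at the first.
    for card in soup.select("div.card.back"):
        writer.writerow(card_to_row(card))
    return out.getvalue()


csv_text = cards_to_csv(SAMPLE_HTML)
print(csv_text)
```

Building each row from a single card keeps the columns aligned even if one card is missing a field, which separate per-class lists would not guarantee.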