I'm scraping a website using Python and I'm having trouble extracting the dates into a new Date column of my dataframe with regex.

The code below uses BeautifulSoup to scrape the event data and the event links:

import pandas as pd
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://www.techmeme.com/events').read()
soup = bs.BeautifulSoup(source,'html.parser')
event = []
links = []
# ---Event Data---
for a in soup.find_all('a'):
    event.append(a.text)
df_event = pd.DataFrame(event)
df_event.columns = ['Event']
df_event = df_event.iloc[1:]
# ---Links---
for a in soup.find_all('a', href=True): 
    if a.text: 
        links.append(a['href'])
df_link = pd.DataFrame(links)
df_link.columns = ['Links']
# ---Combines dfs---
df = pd.concat([df_event.reset_index(drop=True),df_link.reset_index(drop=True)],sort=False, axis=1)

The date appears at the beginning of each event data row, for example: (May 26-29Augmented World ExpoSan...). The dates come in the following formats, and I have included my regex for each (which I believe is correct).

Different Date Formats:
May 27: [A-Z][a-z]*(\ )[0-9]{1,2}
May 26-29:  [A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}
May 28-Jun 2: [A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
Combined
[A-Z][a-z]*(\ )[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}

When I try to create a new column and extract the dates using Regex, I just receive an empty df['Date'] column.

df['Date'] = df['Event'].str.extract(r'[A-Z][a-z]*(\ )[0-9]{1,2}')
df.head()

Any help would be greatly appreciated! Thank you.

Is the information provided in this question enough for you? stackoverflow.com/a/62009216/12239523 I think it is similar, but I may be wrong. – Sebastian May 26, 2020 at 19:11
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
df['Date'] = df['Event'].str.extract(date_reg, expand=False)

See the regex demo. If you want to match as whole words and numbers, you may use (?<![A-Za-z])([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)(?!\d).

Details

  • [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
  • ' ' - a literal space (replace with \s to match any whitespace)
  • [0-9]{1,2} - one or two digits
  • (?:-(?:[A-Z][a-z]* )?[0-9]{1,2})? - an optional sequence of
      • - - a hyphen
      • (?:[A-Z][a-z]* )? - an optional sequence of
          • [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
          • ' ' - a literal space (replace with \s to match any whitespace)
      • [0-9]{1,2} - one or two digits
  • The (?<![A-Za-z]) construct is a negative lookbehind that fails the match if there is a letter immediately before the current location, and (?!\d) is a negative lookahead that fails the match if there is a digit immediately after it.
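To make the breakdown above concrete, here is a minimal sketch that runs the same pattern with Python's built-in re module instead of pandas. The sample strings are hypothetical rows shaped like the scraped text in the question, and extract_date is an illustrative helper, not part of any library. The sketch also shows that the question's pattern has the space as its only capture group. Since str.extract returns capture groups rather than the whole match, that would explain the seemingly empty column.

```python
import re

# Pattern from the answer: month, day, then an optional "-day" or "-Mon day" tail.
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'

# Hypothetical sample rows (assumption: shaped like the scraped event text).
samples = [
    "May 27Earnings: HPQ,BOX",
    "May 26-29Augmented World ExpoSanta Clara",
    "May 28-Jun 2Example ConferenceBerlin",
]


def extract_date(text):
    """Return the first captured date, or None when nothing matches."""
    match = re.search(date_reg, text)
    return match.group(1) if match else None


dates = [extract_date(s) for s in samples]
print(dates)

# The question's pattern: the only capture group is the escaped space.
question_reg = r'[A-Z][a-z]*(\ )[0-9]{1,2}'
space_only = re.search(question_reg, samples[0]).group(1)
print(repr(space_only))

# Stripping the date: replace the first match with an empty string.
stripped = [re.sub(date_reg, '', s, count=1) for s in samples]
print(stripped)
```

Here dates holds one string per sample covering all three formats, space_only is a single space, and stripped holds the rows with the leading date removed. In pandas, the same pattern goes to str.extract to pull the date out and to str.replace to remove it.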

    Thank you for explaining! This makes a lot of sense now. That site looks like a great learning tool. Thanks a lot. – Leslie Tate May 26, 2020 at 19:31

    I was able to create the new date column, however, the original date is still present in the event column. How would I go about deleting that? I was assuming extract did that. – Leslie Tate May 26, 2020 at 19:46

    @LeslieTate If you want to remove the dates from the Event column, you need to use str.replace on it: df['Event'] = df['Event'].str.replace(date_reg, '', regex=True). str.extract only finds a match, it does not remove anything. – Wiktor Stribiżew May 26, 2020 at 19:59

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'https://www.techmeme.com/events'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    data = []
    # Every <a> under an element with class "rhov" is one event row
    for row in soup.select('.rhov a'):
        # The link's direct <div> children are unpacked as date, event and place
        date, event, place = map(lambda x: x.get_text(strip=True), row.find_all('div', recursive=False))
        data.append({'Date': date, 'Event': event, 'Place': place, 'Link': 'https://www.techmeme.com' + row['href']})

    df = pd.DataFrame(data)
    print(df)

    will create this dataframe:

              Date                                           Event          Place                                               Link
    0    May 26-29                NOW VIRTUAL:Augmented World Expo    Santa Clara      https://www.techmeme.com/gotos/www.awexr.com/
    1       May 27                               Earnings: HPQ,BOX                 https://www.techmeme.com/gotos/finance.yahoo.c...
    2       May 28                              Earnings: CRM, VMW                 https://www.techmeme.com/gotos/finance.yahoo.c...
    3    May 28-29         CANCELED:WeAreDevelopers World Congress         Berlin  https://www.techmeme.com/gotos/www.wearedevelo...
    4        Jun 2                                    Earnings: ZM                 https://www.techmeme.com/gotos/finance.yahoo.c...
    ..         ...                                             ...            ...                                                ...
    140   Dec 7-10                         NEW DATE:GOTO Amsterdam      Amsterdam         https://www.techmeme.com/gotos/gotoams.nl/
    141   Dec 8-10                 Microsoft Azure + AI Conference      Las Vegas  https://www.techmeme.com/gotos/azureaiconf.com...
    142   Dec 9-10           NEW DATE:Paris Blockchain Week Summit          Paris  https://www.techmeme.com/gotos/www.pbwsummit.com/
    143  Dec 13-16                          NEW DATE:KNOW Identity      Las Vegas  https://www.techmeme.com/gotos/www.knowidentit...
    144  Dec 15-16  NEW DATE, NEW LOCATION:Fortune Brainstorm Tech  San Francisco  https://www.techmeme.com/gotos/fortuneconferen...
    [145 rows x 4 columns]
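    To see why unpacking into date, event and place works without any regex, here is a small offline sketch of the same loop. The HTML snippet is hand-written and only assumed to mirror the page's layout (an <a> inside a .rhov element with three direct <div> children). The example.com link is hypothetical, and the live markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (assumption): each event is an <a> inside a .rhov
# container, with three direct <div> children for date, event and place.
html = """
<div class="rhov">
  <a href="/gotos/www.awexr.com/">
    <div>May 26-29</div><div>Augmented World Expo</div><div>Santa Clara</div>
  </a>
</div>
<div class="rhov">
  <a href="/gotos/example.com/">
    <div>May 27</div><div>Earnings: HPQ,BOX</div><div></div>
  </a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for a in soup.select('.rhov a'):
    # recursive=False keeps only the direct <div> children of the link
    date, event, place = (d.get_text(strip=True)
                          for d in a.find_all('div', recursive=False))
    rows.append({'Date': date, 'Event': event, 'Place': place,
                 'Link': 'https://www.techmeme.com' + a['href']})

for row in rows:
    print(row)
```

    Because the date already sits in its own <div>, it never has to be separated from the event name afterwards. An empty place simply comes through as an empty string.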
    This is a much easier solution! I guess it shows how this can be done in far fewer lines of code. I don't see the links included in the df. – Leslie Tate May 26, 2020 at 19:52
