Lingjian Kong
6,330 Points
Don't understand word boundaries
Hello,
Could someone explain the concept of a word boundary?
print(re.findall(r'@[-\w\d.]*[^gov\t]', data))
print(re.findall(r'\b@[-\w\d.]*[^gov\t]\b', data))
These are two different results I got.
>> ['@teamtreehouse.com', '@kennethlove\n', '@teamtreehouse.com', '@camelot.co.uk', '@norrbotten.co.se', '@sverik\n', '@killerrabbit.com', '@teamtreehouse.com', '@ryancarson\n', '@tardis.co.uk', '@example.com', '@example\n', '@us.', '@potus44\n', '@teamtreehouse.com', '@chalkers\n', '@empire.', '@darthvader\n', '@spain.']
>> ['@teamtreehouse.com', '@teamtreehouse.com', '@camelot.co.uk', '@norrbotten.co.se', '@killerrabbit.com', '@teamtreehouse.com', '@tardis.co.uk', '@example.com', '@us.', '@teamtreehouse.com', '@empire.', '@spain.']
Could someone explain why we have to use \b in the front and the back?
3 Answers
Chris Freeman
Treehouse Moderator 68,457 Points
A word boundary \b says "in this place, a word character is expected" on one side, with a non-word character or the edge of the string on the other. In your second example, this means a word character is expected immediately before the "@", and the match must also end at a word boundary.
The matches in the first group that end in a newline character "\n" have no word character before the "@", so they are rejected by the second pattern.
The boundary character is an anchor that says a word character must be here, but it doesn't "consume" the character into the results. You may think of it like a word match character "\w" that doesn't hold on to the match results.
Bronson Avila
4,160 Points
For anyone else reading this question, I can understand how the code shown in this exercise appears confusing. When Kenneth defined a word boundary in the Escape Hatches video, he specifically said a word boundary is, quote, "It's the edges of a word, defined by white space or the edges of a screen."
This definition may be misleading because it suggests that a word boundary cannot exist between two non-whitespace characters in a string. However, a word boundary can in fact exist under such circumstances, as one source notes that a word boundary can occur "between two characters in the string, where one is a word character and the other is not a word character."
So in the case of an email address such as "sender@address.com", all of the characters up until the "@" symbol are word characters, while the @ symbol itself is not a word character. Thus, the "gap" between "sender" and "@" constitutes a word boundary.
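To make the "gap" concrete, here is a minimal sketch that lists every position in "sender@address.com" where \b matches. It uses only the standard re module; the variable names are just for illustration.

```python
import re

text = "sender@address.com"

# \b matches positions, not characters, so we collect where each match starts.
positions = [m.start() for m in re.finditer(r'\b', text)]
print(positions)  # [0, 6, 7, 14, 15, 18]

# Position 6 is the gap between "r" (a word character) and "@" (not one).
print(text[5], text[6])  # r @
```

Positions 0 and 18 are the edges of the string; the others sit between a word character and a non-word character such as "@" or ".".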
Chris Freeman
Treehouse Moderator 68,457 Points
Good points. In terms of the "gap", I would add that a word boundary \b is a "zero length" matching element that matches the condition of a word boundary, but doesn't consume any characters.
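As a rough illustration of "zero length", the sketch below compares \b with \w in front of the "@". Both require a word character before the "@", but only \w pulls that character into the result. The two patterns are made up for this comparison and are not from the course.

```python
import re

text = "sender@address.com"

# \b only checks the position before "@"; nothing is added to the match.
with_boundary = re.findall(r'\b@\w+', text)
print(with_boundary)  # ['@address']

# \w consumes the character before "@", so it shows up in the match.
with_word_char = re.findall(r'\w@\w+', text)
print(with_word_char)  # ['r@address']
```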
ds1
7,627 Points
Ohhhhh, ok- I was wrong... thanks for steering me to the correct answer, Chris!!
ds1
7,627 Points
Hi, Chris Freeman
Based on my understanding of the Python docs, I think that \b means that the expected character is not a "\w". My understanding is that \b is saying that we expect whitespace or a non-Unicode word character there. I guess people call it a word boundary because such whitespace/etc. is before or after a word.
My theory on the reason why Lingjian's second example appears to behave differently (i.e. pulling in matches where a \b is not present even though it's written in the regular expression) is because of the " * "... which says that anything before it in the raw string (and not just in the set) can be matched 0 or multiple times, making the first \b of no effect. Anyway, that's my theory... I'm still pretty new to this regex stuff : )
Thanks for asking this question, Lingjian... I was really wondering about this too!!!
Chris Freeman
Treehouse Moderator 68,457 Points
The key is the subtle difference between the \b word boundary, \s whitespace, and the \W non-word character.
The \b is a word boundary marker. If used before a non-word character such as @, then the only possible word boundary preceding the @ would be one caused by a word ending there. That is, there must be a word character immediately before the @. The Twitter handles have no word character right before their @, so the pattern does not match them.
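Here is a small sketch of that behavior using the two patterns from the question. The course data file is not shown in this thread, so the two-line, tab-separated sample string below is an assumption made up for illustration.

```python
import re

# Hypothetical sample: email, tab, Twitter handle, newline.
data = "kenneth@teamtreehouse.com\t@kennethlove\npresident.44@us.gov\t@potus44\n"

# Without \b, the handles match too (each ends with the newline).
without_b = re.findall(r'@[-\w\d.]*[^gov\t]', data)
print(without_b)  # ['@teamtreehouse.com', '@kennethlove\n', '@us.', '@potus44\n']

# With \b, an "@" preceded by a tab is rejected: no word character before it.
with_b = re.findall(r'\b@[-\w\d.]*[^gov\t]\b', data)
print(with_b)  # ['@teamtreehouse.com', '@us.']
```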
Granted, the problem of exactly what to put in the regex might have been tougher if the Twitter handle wasn't the last item in the line. But since it is the last item, it is sufficient to anchor the pattern with an end-of-line marker $.
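A minimal sketch of the $ approach, assuming the handle is the last item on each line. The sample string and the pattern @\w+$ are made up for illustration. In Python, $ only matches at the end of every line when re.MULTILINE is passed; without the flag it matches only at the end of the whole string (or just before a final newline).

```python
import re

# Hypothetical sample: name, email, handle, separated by tabs.
data = (
    "Love, Kenneth\tkenneth@teamtreehouse.com\t@kennethlove\n"
    "Carson, Ryan\tryan@teamtreehouse.com\t@ryancarson\n"
)

# "@teamtreehouse" is followed by ".com", not a line end, so it is skipped.
handles = re.findall(r'@\w+$', data, re.MULTILINE)
print(handles)  # ['@kennethlove', '@ryancarson']
```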
The issue with using \s or \W is that they would become part of the matched string unless regex group notation is added. You'll see regex groups later on in the course.
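To sketch the difference, the example below matches a handle with a leading \s, first without and then with a capturing group. The sample line is an assumption, not the course data.

```python
import re

# Hypothetical sample line: email, tab, Twitter handle.
data = "ryan@teamtreehouse.com\t@ryancarson\n"

# \s consumes the tab, so it ends up inside the match.
consumed = re.findall(r'\s@\w+', data)
print(consumed)  # ['\t@ryancarson']

# With a group, findall returns only the part inside the parentheses.
grouped = re.findall(r'\s(@\w+)', data)
print(grouped)  # ['@ryancarson']
```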