How to select specific urls in a webpage with regexp?

1 view (last 30 days)
Hi all,
I'm doing some webscraping from this website. I need to extract the tractor links which are recognized from many lines similar to the following one:
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>
so after the link there is the string '\d* hp'. Here the code I use to detected them:
url='http://www.tractordata.com/farm-tractors/tractor-brands/johndeere/johndeere-tractors.html';
html=urlread(url);
hyperlinks = regexp(html,'(?<=<tr><td.*>)<a.*?/a>(?=.*{8,50}\d* hp</td>)','match');
This code works rather fine, but I'm not able to get rid of the first wrong result that is:
<a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a>
As you can see it starts above the link that has to be selected. How can I do to solve it? Thanks
  1 Comment
Michael Dombrowski
Michael Dombrowski on 29 Jun 2017
When I run your code I get no results in hyperlinks. But, have you thought of adding "farm-tractors" into your regex? It would resolve your issue, and as long as all the links also go to the farm-tractors directory it would work fine.

Sign in to comment.

Accepted Answer

Guillaume
Guillaume on 29 Jun 2017
Edited: Guillaume on 29 Jun 2017
Note: avoid greedy .* particularly in complex expressions, it's bound to cause you problems. Negative classes often work better. For example, instead of <td.*>, use <td[^>]*>.
As per Michael comment, your posted regex does not work. But even with the simplified regex:
hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match')' %transposed for easy viewing in command window
you can see that there is a problem. Unfortunately for you, the problem is actually the webpage which is actually not valid html. Your whole problem comes from the fact that the spacer.gif <a hyperlink (on line 131 of the source html) is never closed. So of course, your regex captures everything up to the next a> which belongs to the next <tr><td>.
Unfortunately that makes your life rather difficult. Try:
hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)', 'match')' %transposed for easy viewing in command window
And if you can report to the website owner that their page is missing a closing tag.

More Answers (0)

Categories

Find more on Search Path in Help Center and File Exchange

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!