How do I extract links from a web page using lxml, XPath and Python?

I have this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all links that have a title attribute and returns their href in the Firefox XPath Checker add-on.

However, I cannot get it to work with lxml.

from lxml import etree

# `page` holds the HTML source of the page, fetched beforehand.
parsedPage = etree.HTML(page)  # Create parse tree from the page.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

This produces no result with lxml (an empty list).

How can I get the href text (the link) of a hyperlink that has a title attribute, using lxml in Python?

Does the document being parsed have a namespace (xmlns) set?
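
To illustrate why a namespace would matter, here is a minimal sketch. It assumes a hypothetical XHTML document parsed as XML with etree.fromstring (the sample markup and the prefix "x" are made up for illustration; the HTML parser behind etree.HTML does not behave this way). When a default xmlns is declared, unprefixed XPath steps match nothing, and the query needs a prefix mapped through the namespaces argument:

```python
from lxml import etree

# Hypothetical XHTML sample with a default namespace (not the asker's page).
xhtml = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table><tbody>
  <tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr>
</tbody></table>
</body></html>"""

root = etree.fromstring(xhtml)  # XML parser: namespaces are preserved

# Unprefixed names refer to "no namespace", so nothing matches.
plain = root.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Map an arbitrary prefix to the XHTML namespace and use it on every element step.
ns = {"x": "http://www.w3.org/1999/xhtml"}
namespaced = root.xpath(
    "/x:html/x:body//x:tbody/x:tr/x:td/x:a[@title]/@href", namespaces=ns
)

print(plain)
print(namespaced)
```

Attributes such as title and href stay unprefixed because a default namespace does not apply to attributes.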

Author: torger | 2010-01-18

I was able to get it working with the following code:

from io import StringIO  # on Python 2: from StringIO import StringIO

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

# The sample is well-formed, so the default XML parser can read it.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

# Output: ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']

Author: jkp


Firefox adds extra HTML tags to the document when it renders it, which makes the XPath returned by the Firebug tool inconsistent with the HTML actually returned by the server (and with what urllib/urllib2 will return).

Removing the <tbody> tag from the XPath generally does the trick.
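
As a rough sketch of that fix, assume the server sends a table without any <tbody> (the sample markup below is hypothetical, not the asker's page). The query copied from the browser returns an empty list, while the same query without the tbody step finds the link:

```python
from lxml import etree

# Hypothetical server response: the table has no <tbody> element.
page = """<html><body>
<table>
  <tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr>
  <tr><td><a href="http://stackoverflow.com/baz">Link without a title</a></td></tr>
</table>
</body></html>"""

parsed_page = etree.HTML(page)

# Query as copied from the browser: it expects a <tbody> that is not in the source.
with_tbody = parsed_page.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Same query with the tbody step dropped.
without_tbody = parsed_page.xpath("/html/body//table//tr/td/a[@title]/@href")

print(with_tbody)
print(without_tbody)
```

Writing //table//tr instead of //table/tr keeps the query working whether or not the source happens to contain a <tbody>.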

Author: mrmagooey
