Crawling a website: getting the links and parsing the linked pages with PHP and XPath

I want to parse an entire website. I have read several threads, but I can't manage to get data from the second level.

That is, I can return the links from the starting page, but I can't find a way to follow those links and get the content of each one...

The code I'm using is:

<?php

    // SELECT STARTING PAGE
      $url = 'http://mydomain.com/';
      $html= file_get_contents($url);

     //GET ALL THE LINKS OF EACH PAGE

         //create a dom object

            $dom = new DOMDocument();
            @$dom->loadHTML($html);

         //run xpath for the dom

            $xPath = new DOMXPath($dom);


         //get links from starting page

            $elements = $xPath->query("//a/@href");
            foreach ($elements as $e) {
            echo $e->nodeValue. "<br />";
            }

     //Parse each page using the extracted links?

 ?>

Could someone help me with the last part, with an example?

I would really appreciate it!

OK, thanks for your answers!
I tried a few things but I haven't managed to get any results yet. I'm new to programming.

Below you can find two of my attempts: in the first I try to parse the links, and in the second I try to replace file_get_contents with cURL:

 1) 

<?php 
  // GET STARTING PAGE
  $url = 'http://www.capoeira.com.gr/';
  $html= file_get_contents($url);

  //GET ALL THE LINKS FROM STARTING PAGE

  //create a dom object

    $dom = new DOMDocument();
    @$dom->loadHTML($html);


    //run xpath for the dom

    $xPath = new DOMXPath($dom);

        //get specific elements from the sites

        $elements = $xPath->query("//a/@href");
//PARSE EACH LINK

    foreach($elements as $e) {
          $URLS= file_get_contents($e);
          $dom = new DOMDocument();
          @$dom->loadHTML($html);
          $xPath = new DOMXPath($dom);
          $output = $xPath->query("//div[@class='content-entry clearfix']");
         echo $output ->nodeValue;
        }                           
         ?>

For the code above I get:
Warning: file_get_contents() expects parameter 1 to be string, object given in ../example.php on line 26
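
The warning appears because `$e` in the loop is a DOMAttr object rather than a string; the URL itself is in `$e->nodeValue`. The loop also reloads `$html` (the starting page) instead of the page it just fetched, and `query()` returns a DOMNodeList that has to be iterated. Here is a minimal sketch of the second-level step with those three points addressed. The helper names `xpath_values` and `second_level`, the `$fetch` callable (in real use something like `file_get_contents`) and the `example.test` pages are assumptions made so the example runs without network access.

```php
<?php
// Hypothetical helper: run an XPath query on an HTML string and
// return the trimmed text value of every matching node.
function xpath_values(string $html, string $query): array
{
    if ($html === '') {
        return [];                      // loadHTML() rejects empty input
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);             // @ hides warnings about sloppy markup
    $xPath = new DOMXPath($dom);
    $values = [];
    foreach ($xPath->query($query) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

// Hypothetical helper: read the links of the start page, fetch each linked
// page, and collect the text of its "content-entry clearfix" blocks.
function second_level(string $startUrl, callable $fetch): array
{
    $found = [];
    foreach (xpath_values((string) $fetch($startUrl), '//a/@href') as $link) {
        $page = $fetch($link);          // $link is a string, not a DOMAttr
        if (!is_string($page)) {
            continue;                   // fetch failed, skip this link
        }
        // Parse the page that was just fetched, not the start page.
        $found[$link] = xpath_values($page, "//div[@class='content-entry clearfix']");
    }
    return $found;
}

// In-memory stand-in for a site (assumed data, for illustration only).
$pages = [
    'http://example.test/'  => '<html><body><a href="http://example.test/a">A</a>'
                             . '<a href="http://example.test/b">B</a></body></html>',
    'http://example.test/a' => '<html><body><div class="content-entry clearfix">First entry</div></body></html>',
    'http://example.test/b' => '<html><body><p>No entry here</p></body></html>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};

$result = second_level('http://example.test/', $fetch);
print_r($result);
```

On a real site the links are often relative, so they would need to be turned into absolute URLs before being fetched.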

 2)

    <?php
          $curl = curl_init();
          curl_setopt($curl, CURLOPT_POST, 1);
          curl_setopt($curl, CURLOPT_URL, "http://capoeira.com.gr");
          curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
          $content= curl_exec($curl);
          curl_close($curl);    



          $dom = new DOMDocument();
          @$dom->loadHTML($content);

           $xPath = new DOMXPath($dom);
           $elements = $xPath->query("//a/@href");
            foreach ($elements as $e) {
            echo $e->nodeValue. "<br />";
            }

   ?>

I get no results. I tried to echo $content and I get:

You don't have permission to access / on this server.

Additionally, a 413 Request Entity Too Large error was encountered while trying to use an ErrorDocument to handle the request...

Any ideas please?? 🙂

You can wrap the whole thing in a function and make recursive calls for each link you find, but remember to record the pages you have visited to avoid running into infinite loops.
Show the content or layout of one of the links, for a start, and what you have tried.
Also, you may want to use cURL rather than file_get_contents, as it is about twice as fast; curl_multi is also an option for fetching several links at once.
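
The recursion with a record of visited pages suggested in the first comment could look roughly like the sketch below. The names `page_links` and `crawl`, the `$fetch` callable and the `example.test` pages are assumptions for illustration; the pages link back to each other on purpose, to show that the visited record stops the loop.

```php
<?php
// Hypothetical helper: collect the href values found in an HTML string.
function page_links(string $html): array
{
    if ($html === '') {
        return [];
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $links = [];
    foreach ((new DOMXPath($dom))->query('//a/@href') as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}

// Depth-limited recursive crawl. $visited is passed by reference so that
// every level of the recursion shares the same record of fetched pages.
function crawl(string $url, int $depth, callable $fetch, array &$visited): void
{
    if ($depth < 0 || isset($visited[$url])) {
        return;                         // too deep, or already seen
    }
    $visited[$url] = true;
    $html = $fetch($url);
    if (!is_string($html)) {
        return;                         // fetch failed
    }
    foreach (page_links($html) as $link) {
        crawl($link, $depth - 1, $fetch, $visited);
    }
}

// In-memory stand-in for a site that contains a cycle (assumed data).
$pages = [
    'http://example.test/'  => '<a href="http://example.test/a">a</a>',
    'http://example.test/a' => '<a href="http://example.test/">home</a><a href="http://example.test/b">b</a>',
    'http://example.test/b' => '<a href="http://example.test/c">c</a>',
    'http://example.test/c' => '<p>end</p>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};

$visited = [];
crawl('http://example.test/', 2, $fetch, $visited);
print_r(array_keys($visited));
```

With a depth of 2 the crawl stops before page `c`, and the link from `a` back to the home page is skipped because it is already recorded.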

Original author: taz | 2012-04-11

You can try the following. See this thread for more details.

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    // static: the list of visited URLs must survive across recursive calls
    static $seen = array();
    if(($depth == 0) or isset($seen[$url])){
        return;
    }
    $seen[$url] = true;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    $result = curl_exec ($ch);
    curl_close ($ch);

    if( $result ){
        // keep only the <a> tags, then pull out each href with a regex
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER );
        foreach($matches as $match){
            $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                // relative link: rebuild an absolute URL from the current page's URL
                $path = '/' . ltrim($href, '/');
                if (function_exists('http_build_url')) {
                    $href = http_build_url($url , array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}\n";
}
crawl_page("http://www.sitename.com/",3);
?>
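
One limitation of the code above: every relative href is prefixed with `/`, so a link such as `item.html` found on `/news/index.html` is treated as `/item.html`. Below is a simplified sketch of resolving an href against the URL of the page it was found on. The function name `resolve_url` and the `example.test` URLs are assumptions, and the sketch does not cover every case of RFC 3986 (query-only and fragment-only hrefs, credentials in the base URL).

```php
<?php
// Hypothetical helper: turn an href found on page $base into an absolute URL.
function resolve_url(string $base, string $href): string
{
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;                               // already has a scheme
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;      // protocol-relative
    }
    if (strpos($href, '/') === 0) {
        $path = $href;                              // root-relative
    } else {
        // relative to the directory of the base path
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }
    // collapse "." and ".." segments; never pop the leading empty segment
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return $root . implode('/', $out);
}

echo resolve_url('http://example.test/news/index.html', 'item.html'), "\n";
echo resolve_url('http://example.test/news/index.html', '../about.html'), "\n";
```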

Original author: Team Webgalli

$doc = new DOMDocument;
// loadHTMLFile() parses HTML; load() would expect well-formed XML
@$doc->loadHTMLFile('file.htm');

$items = $doc->getElementsByTagName('a');

foreach($items as $value) {
    echo $value->nodeValue . "\n";             // link text
    echo $value->getAttribute('href') . "\n";  // link target, '' if there is no href
}

Original author: DanFromGermany

Please check the code below; I hope it helps.

<?php
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.yourdomain.com');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='A-CLASS-Name']/h3/a/@href" );
foreach ($nodelist as $n){
    echo $n->nodeValue."\n<br>";
}
?>
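
Note that `@class='...'` compares the whole attribute string, so a div whose classes are listed in a different order, or with an extra class, is not matched. A common XPath 1.0 idiom matches a single class token instead. The sketch below contrasts the two; the markup and the class name `content-entry` (borrowed from the question) are assumptions for illustration.

```php
<?php
$html = '<html><body>
  <div class="content-entry clearfix"><h3><a href="/one">One</a></h3></div>
  <div class="clearfix content-entry"><h3><a href="/two">Two</a></h3></div>
  <div class="content-entry-footer"><h3><a href="/three">Three</a></h3></div>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison: matches only class="content-entry clearfix".
$exactQuery = "//div[@class='content-entry clearfix']/h3/a/@href";

// Token match: pad the class list with spaces and look for " content-entry ".
$tokenQuery = "//div[contains(concat(' ', normalize-space(@class), ' '), ' content-entry ')]/h3/a/@href";

$exact = [];
foreach ($xpath->query($exactQuery) as $n) {
    $exact[] = $n->nodeValue;
}
$token = [];
foreach ($xpath->query($tokenQuery) as $n) {
    $token[] = $n->nodeValue;
}

echo implode(', ', $exact), "\n";
echo implode(', ', $token), "\n";
```

The token form does not match `content-entry-footer`, because the surrounding spaces are part of the comparison.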

Thanks,
Roger

Original author: Roger

Find a website's links recursively, down to a given depth

<?php
$depth = 1;
print_r(getList($depth));  
function getList($depth)  
{
$lists = getDepth($depth);
return $lists; 
}
function getUrl($request_url)
{
$countValid = 0;
$brokenCount =0;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //We want to get the response
$result = curl_exec($ch);
$regex = '|<a.*?href="(.*?)"|';
preg_match_all($regex, $result, $parts);
$links = $parts[1];
$UrlLists = array("clean" => array(), "broken" => array()); //always return both keys, even when no link is found
foreach ($links as $link)
{
$url = html_entity_decode($link); //hrefs in HTML source are entity-encoded (&amp;), so decode before requesting
$result =getFlag($url);
if($result == true)
{
$UrlLists["clean"][$countValid] =$url;
$countValid++; 
} 
else
{
$UrlLists["broken"][$brokenCount]= "broken->".$url;
$brokenCount++;
}  
}
curl_close($ch);
return $UrlLists;
}
function ZeroDepth($list)
{
$request_url = $list;
$listss["0"]["0"] = getUrl($request_url);
$lists["0"]["0"]["clean"] = array_unique($listss["0"]["0"]["clean"]);
$lists["0"]["0"]["broken"] = array_unique($listss["0"]["0"]["broken"]);
return $lists; 
}
function getDepth($depth)
{        
//$list =OW_URL_HOME;
$list = "https://example.com";//enter the url of website 
$lists =ZeroDepth($list);
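for($i=1;$i<=$depth;$i++)
{
$lists[$i] = array();
$depthArray=1;
//follow the clean links of every page found at the previous level
foreach($lists[$i-1] as $page)
{
$cleanUrls = isset($page["clean"]) ? $page["clean"] : array();
foreach($cleanUrls as $depthUrl)
{
$lists[$i][$depthArray] = getUrl($depthUrl);
$lists[$i][$depthArray]["request_url"] = $depthUrl;
$depthArray++;
}
}
}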
return $lists;   
}
function getFlag($url) 
{
$url_response = array();
$curl = curl_init();
$curl_options = array();
$curl_options[CURLOPT_RETURNTRANSFER] = true;
$curl_options[CURLOPT_URL] = $url;
$curl_options[CURLOPT_NOBODY] = true;
$curl_options[CURLOPT_TIMEOUT] = 60;
curl_setopt_array($curl, $curl_options);
curl_exec($curl);
$status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl); //close the handle before returning
return $status == 200;
}
?>
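
The level-by-level idea in the code above can be sketched more compactly as a breadth-first walk that records each URL at the depth where it is first seen. The function name `crawl_levels`, the `$links` callable (which in practice would fetch a page with cURL and extract its hrefs) and the sample link graph are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns [depth => [URLs first seen at that depth]].
function crawl_levels(string $start, int $maxDepth, callable $links): array
{
    $levels = [0 => [$start]];
    $seen   = [$start => true];
    for ($depth = 1; $depth <= $maxDepth; $depth++) {
        $levels[$depth] = [];
        foreach ($levels[$depth - 1] as $url) {
            foreach ($links($url) as $link) {
                if (!isset($seen[$link])) {      // skip URLs already recorded
                    $seen[$link] = true;
                    $levels[$depth][] = $link;
                }
            }
        }
        if ($levels[$depth] === []) {            // nothing new: stop early
            unset($levels[$depth]);
            break;
        }
    }
    return $levels;
}

// Assumed link graph standing in for real pages.
$graph = [
    '/'  => ['/a', '/b'],
    '/a' => ['/', '/c'],
    '/b' => ['/c'],
    '/c' => ['/d'],
];
$links = function (string $url) use ($graph) {
    return $graph[$url] ?? [];
};

$levels = crawl_levels('/', 2, $links);
print_r($levels);
```

Because each URL is stored only once, a page reachable by several paths is fetched a single time, at its shallowest depth.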

Original author: Akshay bhatt

<?php
$path='http://www.hscripts.com/';
$html = file_get_contents($path);
$dom = new DOMDocument();
@$dom->loadHTML($html);
//grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++ ) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url.'<br />';
}
?>

You can use the code above to get all the links on the page.

Original author: Thamaraiselvam
