Project page: http://code.google.com/p/ganon/
Documentation: http://code.google.com/p/ganon/w/list
Ganon is a very capable library: it locates DOM elements with tag selectors similar to those used in JS.
The Ganon library gives access to HTML/XML documents in a very simple object-oriented way. It eases modifying the DOM and makes finding elements easy with CSS3-like queries.
Ganon usage examples:
- // Parse the google code website into a DOM
- $html = file_get_dom('http://code.google.com/');
Access
Accessing elements is made easy through the CSS3-like selectors and the object model.
- // Find all the paragraph tags with a class attribute and print the
- // value of the class attribute
- foreach ($html('p[class]') as $element) {
-     echo $element->class, "\n";
- }
- // Find the first div with ID "gc-header" and print the plain text of
- // the parent element (plain text means no HTML tags, just the text)
- echo $html('div#gc-header', 0)->parent->getPlainText();
- // Find out how many tags there are which are "ns:tag" or "div", but not
- // "a" and do not have a class attribute
- echo count($html('(ns|tag, div + !a)[!class]'));
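To make the first two queries above concrete without installing Ganon, they can be approximated with PHP's bundled DOM extension. This is only a sketch of roughly equivalent XPath queries, not Ganon's API; the sample markup and the variable names (`$markup`, `$classes`, `$parentText`) are invented for illustration.

```php
<?php
// Hypothetical sample markup (not taken from the original post).
$markup = '<div id="wrap"><div id="gc-header">Head</div>'
        . '<p class="intro">One</p><p>Two</p><p class="note">Three</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Rough equivalent of the CSS query p[class]:
// every paragraph that carries a class attribute.
$classes = [];
foreach ($xpath->query('//p[@class]') as $p) {
    $classes[] = $p->getAttribute('class');
}
echo implode("\n", $classes), "\n";

// Rough equivalent of $html('div#gc-header', 0)->parent->getPlainText():
// take the first match, step up to its parent, read the text without tags.
$header = $xpath->query('//div[@id="gc-header"]')->item(0);
$parentText = $header->parentNode->textContent;
echo $parentText, "\n";
```

Ganon's selector syntax is shorter for the same result, which is the main reason to prefer it when CSS-style queries are already familiar.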
Modification
Elements can be easily modified after you've found them.
- // Find all paragraph tags which are nested inside a div tag, change
- // their ID attribute and print the new HTML code
- foreach ($html('div p') as $index => $element) {
-     $element->id = "id$index";
- }
- echo $html;
- // Center all the links inside a document which start with "http://"
- // and print out the new HTML
- foreach ($html('a[href ^= "http://"]') as $element) {
-     $element->wrap('center');
- }
- echo $html;
- // Find all odd indexed "td" elements and change the HTML to make them links
- foreach ($html('table td:odd') as $element) {
-     $element->setInnerText('<a href="#">'.$element->getPlainText().'</a>');
- }
- echo $html;
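For comparison, the "change the ID of every paragraph inside a div" step can be sketched with PHP's bundled DOM extension. This is a stand-in written with DOMDocument and DOMXPath, not Ganon's API; the sample markup and the names `$markup` and `$ids` are assumptions made for the example.

```php
<?php
// Hypothetical sample markup (not taken from the original post).
$markup = '<div><p>First</p><p>Second</p></div><p>Outside</p>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Rough equivalent of the CSS descendant query "div p":
// paragraphs nested at any depth inside a div.
$ids = [];
$index = 0;
foreach ($xpath->query('//div//p') as $p) {
    $p->setAttribute('id', "id$index");
    $ids[] = $p->getAttribute('id');
    $index++;
}

// Serialize the modified document, as "echo $html" does in Ganon.
echo $doc->saveHTML();
```

The paragraph outside the div is left untouched, which mirrors how the descendant selector limits the match in the Ganon version.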
Beautify
Ganon can also help you beautify your code and format it properly.
- // Beautify the old HTML code and print out the new, formatted code
- dom_format($html, array('attributes_case' => CASE_LOWER));
- echo $html;
Ganon web page scraping example
Source: http://www.bubuko.com/infodetail-2031805.html