How to find a table with Web::Scraper based on cell values?
  • Time: 2012-05-22 13:41:19
  • Tags: perl

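I want to process several HTML pages that contain tables.

The pages:

  • contain several class-less tables, so the content is the only way to identify the correct one
  • the table I need has the value "Content" in its first cell

Question: how can I find the correct table based on its cell value with Web::Scraper, Scrappy, or another tool?

Example code: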

#!/usr/bin/env perl
use 5.014;
use warnings;
use Web::Scraper;
use YAML;

my $html = do { local $/; <DATA> };

my $table = scraper {

    # the easy way: a table with a class, an id, or any other attribute
    #process 'table.xxx > tr', 'rows[]' => scraper {
    # unfortunately, the table has no class="xxx", so :(

    process 'NEED_HELP_HERE > tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};
my $result = $table->scrape( $html );
say Dump($result);

__DATA__
<head><title>title</title></head>
<body>
<table><tr><th class="inverted">header</th><td>value</td></tr></table>
<!-- here are several other tables (their number varies) -->

<table> <!-- would be easy with some class="xxx" -->
   <tr>
     <th class="inverted">Content</th> <!-- Need this table - 1st cell == "Content" -->
     <td class="inverted">col-1</td>
     <td class="inverted">col-n</td>
   </tr>
   <tr>
     <th>Date</th>
     <td>2012</td>
     <td>2001</td>
   </tr>
   <tr>
     <th>Banana</th>
     <td>val-1</td>
     <td>val-n</td>
   </tr>
</table>
</body>
</html>
Best answer

You need to use an XPath expression that looks at the text content of the node.

This should do the trick:

my $table = scraper {
  process '//table[tr[1]/th[1][normalize-space(text())="Content"]]/tr', 'rows[]' => scraper {
    process 'th', 'header' => 'TEXT';
    process 'td', 'cols[]' => 'TEXT';
  };
};

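It may look complicated, but it is fine once you break it down.

It selects all <tr> elements that are children of any <table> element beneath the root, where the first <th> child of the table's first <tr> contains a text node equal to "Content" when it is normalized (leading and trailing whitespace stripped).

Output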

---
rows:
  - cols:
      - col-1
      - col-n
    header: Content
  - cols:
      - 2012
      - 2001
    header: Date
  - cols:
      - val-1
      - val-n
    header: Banana
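
The XPath expression from the answer can also be explored one step at a time outside of Web::Scraper. The following is a minimal sketch, assuming HTML::TreeBuilder::XPath is available (it is normally installed as a dependency of Web::Scraper); the inline markup and the variable names are hypothetical and only illustrate the predicate.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup: two class-less tables, only the second
# one has "Content" (padded with spaces) in its first header cell.
my $html = <<'HTML';
<html><body>
<table><tr><th>header</th><td>value</td></tr></table>
<table>
  <tr><th> Content </th><td>col-1</td><td>col-n</td></tr>
  <tr><th>Date</th><td>2012</td><td>2001</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# The predicate that picks the table by the text of its first cell.
my $wanted = '//table[tr[1]/th[1][normalize-space(text())="Content"]]';

# Step 1: every table in the document.
my @all_tables = $tree->findnodes('//table');

# Step 2: only the tables whose first row starts with a "Content" header.
my @matching = $tree->findnodes($wanted);

# Step 3: the rows of the matching table(s), as used in the answer.
my @rows = $tree->findnodes("$wanted/tr");

printf "tables: %d, matching: %d, rows: %d\n",
    scalar @all_tables, scalar @matching, scalar @rows;

# Print the cells of each selected row, trimmed of surrounding whitespace.
for my $row (@rows) {
    my @cells = $row->look_down(_tag => qr/^(?:th|td)$/);
    print join(' | ', map { $_->as_trimmed_text } @cells), "\n";
}

$tree->delete;
```

Building the expression up in this way makes it easier to see which part fails when a page does not match, for example when the header sits in a td instead of a th.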
Other answers

HTML::TableExtract looks like a good fit for this problem.

Try this:

#!/usr/bin/perl

use strict;
use warnings;
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;

# Select only tables that have a header cell matching "Content"
my $te = HTML::TableExtract->new( headers => [qw(Content)] );
my $content = get("http://www.example.com");
die "Could not fetch the page\n" unless defined $content;
$te->parse($content);

foreach my $ts ($te->tables) {
   print "Table (", join(',', $ts->coords), "):\n";
   foreach my $row ($ts->rows) {
      print join(',', @$row), "\n";
   }
}

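If you change this line

 my $te = HTML::TableExtract->new( headers => [qw(Content)] );

to

 my $te = HTML::TableExtract->new();

it will return all tables. So if the code block above does not give you exactly what you are looking for, you can experiment with that line.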
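
One point worth checking when trying the headers variant: according to the HTML::TableExtract documentation, specifying headers returns only the columns beneath the matching headers, and the header row itself is dropped by default. The following is a minimal sketch, assuming the documented slice_columns and keep_headers options, that keeps the complete rows of the matching table. The inline markup is a hypothetical stand-in for a fetched page.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical inline markup standing in for a downloaded page.
my $html = <<'HTML';
<table><tr><th>header</th><td>value</td></tr></table>
<table>
  <tr><th>Content</th><td>col-1</td><td>col-n</td></tr>
  <tr><th>Date</th><td>2012</td><td>2001</td></tr>
  <tr><th>Banana</th><td>val-1</td><td>val-n</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new(
    headers       => ['Content'],
    slice_columns => 0,   # keep every column, not only the one under "Content"
    keep_headers  => 1,   # keep the row that contains the matched header
);
$te->parse($html);
$te->eof;

my @tables = $te->tables;
for my $ts (@tables) {
    for my $row ($ts->rows) {
        # Cells can be undef for missing entries, so print them as empty strings.
        print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

Whether whole rows or only the matched column are wanted depends on the task; the two options are independent and can be toggled separately.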

As usual, Web::Query wins on compactness. Unlike Web::Scraper, it does not require you to name the results, but if you want to, it takes just one extra line.

use Web::Query qw();
Web::Query->new_from_html($html)
->find('th:contains("Content")')
->parent->parent->find('tr')->map(sub {
    my (undef, $tr) = @_;
    +{ $tr->find('th')->text => [$tr->find('td')->text] }
})

The expression returns

[
    {Content => ['col-1', 'col-n']},
    {Date    => [2012,    2001]},
    {Banana  => ['val-1', 'val-n']}
]



