Question

I want to process several HTML pages that contain tables.

Each page:

contains several classless tables, so the correct one can only be identified by its content;
the table I need has the value "Content" in its first cell.

The question: how can I find the correct table based on the value of one of its cells, using Web::Scraper, Scrappy, or some other tool?

Example code:

#!/usr/bin/env perl
use 5.014;
use warnings;
use Web::Scraper;
use YAML;

my $html = do { local $/; <DATA> };

my $table = scraper {

    # the easy way: a table with a class, an id, or any other attribute
    # process 'table.xxx > tr', 'rows[]' => scraper {
    # unfortunately, the table has no class="xxx", so :(

    process 'NEED_HELP_HERE > tr', 'rows[]' => scraper {
        process 'th', 'header' => 'TEXT';
        process 'td', 'cols[]' => 'TEXT';
    };
};
my $result = $table->scrape( $html );
say Dump($result);

__DATA__
<head><title>title</title></head>
<body>
<table><tr><th class="inverted">header</th><td>value</td></tr></table>
<!-- here are several another tables (different count) -->

<table> <!-- would be easy with some class="xxx" -->
   <tr>
     <th class="inverted">Content</th> <!-- Need this table - 1st cell == "Content" -->
     <td class="inverted">col-1</td>
     <td class="inverted">col-n</td>
   </tr>
   <tr>
     <th>Date</th>
     <td>2012</td>
     <td>2001</td>
   </tr>
   <tr>
     <th>Banana</th>
     <td>val-1</td>
     <td>val-n</td>
   </tr>
</table>
</body>
</html>

Answer 1

You need to use an XPath expression that looks at the text content of the nodes.

This should do the trick:

my $table = scraper {
  process '//table[tr[1]/th[1][normalize-space(text())="Content"]]/tr', 'rows[]' => scraper {
    process 'th', 'header' => 'TEXT';
    process 'td', 'cols[]' => 'TEXT';
  };
};

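It may look complicated, but it is fine once you break it down.

It selects all tr elements that are children of any table element anywhere beneath the root, where the first th element of the table's first tr child contains a text node equal to "Content" once it has been normalized (leading and trailing whitespace stripped).

Output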

---
rows:
  - cols:
      - col-1
      - col-n
    header: Content
  - cols:
      - 2012
      - 2001
    header: Date
  - cols:
      - val-1
      - val-n
    header: Banana
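
To make the breakdown above concrete, here is a minimal sketch that evaluates the expression one piece at a time. It is not part of the original answer: it assumes HTML::TreeBuilder::XPath is installed and calls it directly instead of going through Web::Scraper, and the inline HTML is a trimmed-down stand-in for the question's data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed-down sample (an assumption for illustration, not the real pages).
my $html = <<'HTML';
<table><tr><th>header</th><td>value</td></tr></table>
<table>
  <tr><th> Content </th><td>col-1</td><td>col-n</td></tr>
  <tr><th>Date</th><td>2012</td><td>2001</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Step 1: every table in the document.
my @all = $tree->findnodes('//table');

# Step 2: keep only tables whose first row starts with a th reading "Content".
my $wanted_table = '//table[tr[1]/th[1][normalize-space(text())="Content"]]';
my @wanted = $tree->findnodes($wanted_table);

# Step 3: the rows of the matching table(s).
my @rows = $tree->findnodes("$wanted_table/tr");

printf "tables: %d, matching: %d, rows: %d\n",
    scalar @all, scalar @wanted, scalar @rows;

# Print the cells of each selected row ('*' = all child elements of the tr).
for my $row (@rows) {
    my @cells = map { $_->as_text } $row->findnodes('*');
    print join(' | ', @cells), "\n";
}

$tree->delete;    # free the parse tree
```

Running the three steps separately is a handy way to debug a predicate that unexpectedly matches nothing: if step 1 finds the tables but step 2 does not, the problem is in the bracketed condition rather than in the path.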

Answer 2

HTML::TableExtract looks like it could help with this problem.

Try this:

#!/usr/bin/perl

use strict;
use warnings;
use lib qw( ..); 
use HTML::TableExtract; 
use LWP::Simple; 

my $te = HTML::TableExtract->new( headers => [qw(Content)] );
my $content = get("http://www.example.com");
$te->parse($content);

# Print the coordinates of each matching table, then its rows as CSV-like lines.
foreach my $ts ($te->tables) {
   print "Table (", join(',', $ts->coords), "):\n";
   foreach my $row ($ts->rows) {
      print join(',', @$row), "\n";
   }
}

If you change this line

 my $te = HTML::TableExtract->new( headers => [qw(Content)] );

to

 my $te = HTML::TableExtract->new();

it will return all tables. So if the code block above does not give you exactly what you are looking for, you can experiment with that line.

Answer 3

As usual, Web::Query (http://p3rl.org/Web%3A%3AQuery) wins on compactness. Unlike Web::Scraper, it does not require you to name the results, but if you want that, it is just one extra line.

use Web::Query qw();
Web::Query->new_from_html($html)
->find('th:contains("Content")')
->parent->parent->find('tr')->map(sub {
    my (undef, $tr) = @_;
    +{ $tr->find('th')->text => [$tr->find('td')->text] }
})

The expression returns

[
    {Content => ['col-1', 'col-n']},
    {Date    => [2012,    2001]},
    {Banana  => ['val-1', 'val-n']}
]
