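Perl Sample 22: Fetching Information from a Website (Part 2)
Posted: 2020-03-08
I put this together after seeing this tweet.
The script fetches the Nodojiman (のど自慢) song lists from the NHK site and saves them to a CSV file. The CSV file is encoded in UTF-8. To work with it in Excel, do not open it directly with "Open"; import it as text instead (Data | Get External Data | From Text).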
#!/usr/bin/env perl
use v5.26;
use utf8;
use warnings;
use strict;
use feature "say";
use open IO => ":utf8";
use DateTime::Format::Strptime;
use Encode::Locale;
use HTML::TagParser;
use Text::CSV qw/ csv /;
use WWW::Mechanize;
binmode STDOUT, ":encoding(console_out)";
binmode STDERR, ":encoding(console_out)";
$| = 1;
my $mc = WWW::Mechanize->new;
use constant URL_BASE => "http://www6.nhk.or.jp";
use constant CSV_OUT => "perlsample_022.csv";
use constant HEADERS => [qw / 放送日 地域 会場 曲目 歌手 /];
say "トップページを開く";
$mc->get(URL_BASE . "/nodojiman");
# The page should have loaded successfully
$mc->success or die;
say "URL:" . $mc->uri;
say "タイトル:" . $mc->title;
say "これまでの放送をクリック";
$mc->follow_link(text => "これまでの放送");
# The page should have loaded successfully
$mc->success or die;
say "URL:" . $mc->uri;
say "タイトル:" . $mc->title;
# List of past broadcasts
my $tag = HTML::TagParser->new($mc->content);
# Get every tag whose class attribute is "listEach clearfix"
my @report = $tag->getElementsByClassName("listEach clearfix");
say "放送:@{[scalar @report]}回";
my $strp = DateTime::Format::Strptime->new(
    pattern   => "%Y年%m月%d日",
    time_zone => "local",
);
my @program;
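for (@report) {
    # The first child holds the broadcast date
    my $tag_dt = $_->firstChild;
    my $dt = $strp->parse_datetime($tag_dt->innerText);
    # Its next sibling holds the region and hall on its first line, separated by whitespace
    my $tag_dd = $tag_dt->nextSibling;
    my ($firstline) = split /\n/, $tag_dd->innerText;
    my ($place, $hall) = split /\s/, $firstline;
    # The link to that broadcast's song list is nested three levels down
    my $href = $tag_dd->firstChild->firstChild->firstChild->getAttribute("href");
    push @program, {
        dt    => $dt,
        place => $place,
        hall  => $hall,
        href  => $href,
    };
}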
print "放送毎の曲目取得:";
for (@program) {
    print ".";
    # Fetch the song-list page for this broadcast
    $mc->get(URL_BASE . $_->{href});
    $mc->success or die;
    $tag = HTML::TagParser->new($mc->content);
    my @tag_tr = $tag->getElementsByTagName("tr");
    my @song_list;
    # For each table row, read the song title from the first cell and the singer from the next
    for (@tag_tr) {
        my $tag_td = $_->firstChild;
        my $title = $tag_td->innerText;
        $title =~ s/\r\n//g;
        $tag_td = $tag_td->nextSibling;
        my $singer = $tag_td->innerText;
        $singer =~ s/\r\n//g;
        push @song_list, {
            title  => $title,
            singer => $singer,
        };
    }
    # Outside the inner loop, $_ is the current broadcast again
    $_->{song_list} = \@song_list;
}
print "\n";
my $csv;
for my $program (sort { DateTime::compare($a->{dt}, $b->{dt}) } @program) {
    push @{$csv}, {
        放送日 => $program->{dt}->strftime('%F'),
        地域   => $program->{place},
        会場   => $program->{hall},
        曲目   => $_->{title},
        歌手   => $_->{singer},
    } for @{$program->{song_list}};
}
say "CSV書き出し";
$\ = "\012";
csv(
    encoding => "utf8",
    headers  => HEADERS,
    eol      => "\n",
    in       => $csv,
    out      => CSV_OUT,
);
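The last part of the script turns the per-broadcast song lists into one CSV row per song, ordered by broadcast date. Here is a minimal sketch of that step using only core Perl. The sample data, the `date` key (plain ISO strings instead of DateTime objects), and the `flatten_programs` helper are assumptions made for illustration and are not part of the script above.

```perl
#!/usr/bin/env perl
use v5.26;
use utf8;
use warnings;

# Hypothetical sample data shaped like @program in the script above,
# except that dates are ISO strings rather than DateTime objects.
my @program = (
    {
        date      => "2016-04-10",
        place     => "岩手県久慈市",
        hall      => "久慈市文化会館",
        song_list => [
            { title => "一番星", singer => "一番星プロジェクト" },
            { title => "桜木町", singer => "ゆず" },
        ],
    },
    {
        date      => "2016-04-03",
        place     => "鹿児島県日置市",
        hall      => "日置市伊集院文化会館",
        song_list => [
            { title => "サウスポー", singer => "ピンク・レディー" },
        ],
    },
);

# Flatten to one row per song, oldest broadcast first.
# ISO dates sort correctly as strings, so cmp is enough in this sketch.
sub flatten_programs {
    my @rows;
    for my $program (sort { $a->{date} cmp $b->{date} } @_) {
        for my $song (@{ $program->{song_list} }) {
            push @rows, [
                $program->{date}, $program->{place}, $program->{hall},
                $song->{title},   $song->{singer},
            ];
        }
    }
    return @rows;
}

my @rows = flatten_programs(@program);
binmode STDOUT, ":encoding(UTF-8)";
say join ",", @$_ for @rows;
```

The script itself builds hash references keyed by the header names instead of array references, which lets `csv()` pick the columns named in `headers`.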
トップページを開く URL:http://www6.nhk.or.jp/nodojiman/ タイトル:NHKのど自慢|NHK 総合テレビ・ラジオ第1 これまでの放送をクリック URL:http://www6.nhk.or.jp/nodojiman/list/index.html タイトル:これまでの放送|NHKのど自慢 放送:175回 放送毎の曲目取得:............................................................................................................................................................................... CSV書き出し
行数が多いので途中を省略しました。
"放送日","地域","会場","曲目","歌手" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","サウスポー","ピンク・レディー" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","それが大事","大事MANブラザーズバンド" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","Let It Go~ありのままで~(Heartfull Ver.)","May J." 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","愛しのテキーロ","氷川きよし" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","花は咲く","花は咲くプロジェクト" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","ギンギラギンにさりげなく","近藤真彦" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","365日の紙飛行機",AKB48 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","花","中孝介" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","いつでも夢を","橋幸夫吉永小百合" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","ふるさと","嵐" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","生きてこそ","May J." 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","Rising Sun",EXILE 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","白雲の城","氷川きよし" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","峠越え","福田こうへい" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","ヒカレ","ゆず" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","飛び方を忘れた小さな鳥",MISIA 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","人恋酒場","三山ひろし" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","夜明け","天童よしみ" 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館",Story,AI 2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","涙そうそう","夏川りみ" 2016-04-10,"岩手県久慈市","久慈市文化会館","一番星","一番星プロジェクト" 2016-04-10,"岩手県久慈市","久慈市文化会館","桜木町","ゆず" 2016-04-10,"岩手県久慈市","久慈市文化会館","春一番","キャンディーズ" (中略) 2020-02-16,"岐阜県可児市","可児市文化創造センター","for you...","髙橋真梨子" 2020-02-16,"岐阜県可児市","可児市文化創造センター","もしもピアノが弾けたなら","西田敏行" 2020-02-16,"岐阜県可児市","可児市文化創造センター","手をつなごう ※チャンピオン※","絢香" 2020-02-23,"長野県辰野町","辰野町民会館","君は薔薇より美しい","布施明" 2020-02-23,"長野県辰野町","辰野町民会館","喝采","ちあきなおみ" 2020-02-23,"長野県辰野町","辰野町民会館",MARIONETTE,BOФWY 2020-02-23,"長野県辰野町","辰野町民会館","手紙","由紀さおり" 2020-02-23,"長野県辰野町","辰野町民会館","古い日記 ※特別賞※","和田アキ子" 2020-02-23,"長野県辰野町","辰野町民会館","これから","平原綾香" 2020-02-23,"長野県辰野町","辰野町民会館","心のこり","細川たかし" 2020-02-23,"長野県辰野町","辰野町民会館","恋のバカンス","ザ・ピーナッツ" 2020-02-23,"長野県辰野町","辰野町民会館",糸,"中島みゆき" 2020-02-23,"長野県辰野町","辰野町民会館","月がとっても青いから","菅原都々子" 2020-02-23,"長野県辰野町","辰野町民会館","限界突破×サバイバー","氷川きよし" 2020-02-23,"長野県辰野町","辰野町民会館","ハナミズキ","一青 窈" 2020-02-23,"長野県辰野町","辰野町民会館","高校三年生","舟木一夫" 2020-02-23,"長野県辰野町","辰野町民会館",Story,AI 
2020-02-23,"長野県辰野町","辰野町民会館","だから僕は音楽を辞めた ※チャンピオン※","ヨルシカ" 2020-02-23,"長野県辰野町","辰野町民会館","勘太郎月夜唄","小畑実" 2020-02-23,"長野県辰野町","辰野町民会館","ひまわりの約束","秦基博" 2020-02-23,"長野県辰野町","辰野町民会館","悲しき口笛","美空ひばり" 2020-02-23,"長野県辰野町","辰野町民会館","酒と泪と男と女","河島英五" 2020-02-23,"長野県辰野町","辰野町民会館","いのちの歌","茉奈佳奈"
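If importing the file through Excel's text import is inconvenient, one commonly used alternative is to start the file with a UTF-8 byte order mark, which Excel generally uses to detect the encoding when a CSV file is opened directly. The sketch below is not part of the original script. The file name `perlsample_022_bom.csv` and the `write_csv_with_bom` helper are hypothetical, and the quoting is a simplified stand-in for what Text::CSV does.

```perl
#!/usr/bin/env perl
use v5.26;
use utf8;
use warnings;

# Hypothetical output file name, used only for this sketch.
my $file = "perlsample_022_bom.csv";

# Write rows as CSV, preceded by a UTF-8 byte order mark (U+FEFF).
sub write_csv_with_bom {
    my ($path, @rows) = @_;
    open my $fh, ">:encoding(UTF-8)", $path or die "$path: $!";
    print {$fh} "\x{FEFF}";
    for my $row (@rows) {
        # Simplified quoting: wrap every field in quotes and double any embedded quote.
        my @fields = map { my $f = $_; $f =~ s/"/""/g; qq("$f") } @$row;
        print {$fh} join(",", @fields), "\n";
    }
    close $fh or die "$path: $!";
    return;
}

write_csv_with_bom(
    $file,
    [qw(放送日 地域 会場 曲目 歌手)],
    ["2016-04-03", "鹿児島県日置市", "日置市伊集院文化会館", "サウスポー", "ピンク・レディー"],
);
say "wrote $file";
```

Whether a given Excel version honors the byte order mark is worth checking on the target machine before relying on it.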