Building a pitch-by-pitch database

Discussion in the Ask the Commish forum
Hi.

Since the start of the season I have been downloading the individual plate appearance pages (such as http://baseball.yahoo.co.jp/npb/live?id=2008032001&key=01_1_01). I have been doing this by creating a list of all possible urls for the date (within reason) by counting the ".." from the boxscores on this site and then using a download manager to download them.

I tried writing a script to automate the downloading of the files. However, I have no knowledge of any programming language, so while it has been a learning experience, it has not produced anything useful.

The URL for each plate appearance has the same format: "http://baseball.yahoo.co.jp/npb/live?id=2008".@month.@day.@game."&key=".@inning."_".@frame."_".@batter

So something such as the below works in printing the urls:


#!/usr/bin/perl

use LWP 5.64;

@batter = (1..9);

for (my $i = 0; $i < @batter; $i++) {
print "http://baseball.yahoo.co.jp/npb/live?id=2008032001&key=01_1_0$batter[$i]\n";
}


I have one of these for each array; however, I have no idea how to combine them. I also have no idea how to pad the arrays for numbers greater than 9. Finally, this only prints the urls but does not get them. This one gets the pages:


#!/usr/bin/perl

use warnings;
use strict;
use LWP::Simple;

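my @batter = ('1', '2', '3');
foreach my $i (@batter)
{
# Build the URL inside the loop; concatenating @batter itself would insert the array's length, not a batter number.
my $page = "http://baseball.yahoo.co.jp/npb/live?id=2008032001&key=01_1_0$i";
getprint($page);
}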


However, it only works if I only list pages that exist. Lastly, I have this one which asks for user input for an array value (game) and then sees if the url exists:


#!/usr/bin/perl

use LWP 5.64;

my $browser = LWP::UserAgent->new;
$browser->timeout(10);

print "game? "; # Ask for input
$a = <STDIN>; # Get input
chomp $a; # Remove the newline at end

my $url = 'http://baseball.yahoo.co.jp/npb/live?id=200803200' . $a . '&key=01_1_01';

my $response = $browser->get( $url );

if($response->is_success) {
print "Exists -- $url";
} else {
print "Does not exist -- $url";
}


It seems as though I have the pieces to make something work; I just cannot put them together. Any help would be appreciated.
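
For what it is worth, the combining and padding steps can be sketched as nested loops with two-digit formatting. This is only a minimal sketch in Python (the language the extractor posted later in this thread uses), not a working downloader: the function name pa_urls and the limits max_inning and max_batter are assumptions made for illustration, and a fetch step would still have to skip candidates that do not exist.

```python
from itertools import product

BASE = "http://baseball.yahoo.co.jp/npb/live"


def pa_urls(date, game, max_inning=9, max_batter=9):
    """Yield candidate plate-appearance URLs for one game.

    date is a 'YYYYMMDD' string; game, inning and batter are padded to
    two digits. max_inning and max_batter are assumed limits, not values
    taken from the site.
    """
    innings = range(1, max_inning + 1)
    frames = (1, 2)  # 1 = top of the inning, 2 = bottom
    batters = range(1, max_batter + 1)
    for inning, frame, batter in product(innings, frames, batters):
        yield f"{BASE}?id={date}{game:02d}&key={inning:02d}_{frame}_{batter:02d}"


urls = list(pa_urls("20080320", 1))
print(urls[0])
print(len(urls), "candidate URLs")
```

The `:02d` format is what pads 1 to 01 and leaves 10 as 10, so the same loop covers numbers greater than 9.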

Applications
Despite my troubles in downloading the pages, I have started to examine the data in some detail. An application of the data is the creation of "tendency" pages. For example here are "tendency" pages for batters (Norihiro Akahoshi), pitchers (Tetsuya Utsumi).

The top section (orange/red) shows basic player information and data totals. For instance Akahoshi's page shows that in his 259 plate appearances he has faced 1728 pitches for an average of 6.672 per plate appearance. Utsumi has thrown an average of 6.159 pitches per batter.
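
The per-plate-appearance average is a plain ratio. A small sketch using Akahoshi's totals quoted above (the function name is made up for illustration):

```python
def pitches_per_pa(pitches, plate_appearances):
    """Average number of pitches seen (or thrown) per plate appearance."""
    if plate_appearances == 0:
        return 0.0  # no plate appearances yet, so nothing to average
    return pitches / plate_appearances


akahoshi = round(pitches_per_pa(1728, 259), 3)
print(akahoshi)  # 6.672
```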

The second section shows the pitches faced or used in each count "class" by how much they favor each player. The numbers were from this comment based on Linear Weights. The table below shows the LWTS by count and the classification of each count:

B-S   LWTS     TYPE
3-0    0.207   BAT++
3-1    0.137   BAT++
2-0    0.097   BAT++
3-2    0.062   BAT+
2-1    0.035   BAT+
1-0    0.034   BAT+
0-0    0.000   NEU
1-1   -0.016   PIT+
2-2   -0.037   PIT+
0-1   -0.043   PIT+
1-2   -0.083   PIT++
0-2   -0.104   PIT++
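
As a rough illustration of how the table can be used, the counts can be stored as a lookup and the pitches tallied by class. This is a sketch only: it reads each count as balls-strikes (a 3-0 count can only mean three balls), the names COUNT_TABLE and tally_by_class are invented for the example, and the sample pitches are made up rather than taken from either player's data.

```python
from collections import Counter, defaultdict

# (balls, strikes) -> (LWTS, class), copied from the table above.
COUNT_TABLE = {
    (3, 0): (0.207, "BAT++"), (3, 1): (0.137, "BAT++"), (2, 0): (0.097, "BAT++"),
    (3, 2): (0.062, "BAT+"), (2, 1): (0.035, "BAT+"), (1, 0): (0.034, "BAT+"),
    (0, 0): (0.000, "NEU"),
    (1, 1): (-0.016, "PIT+"), (2, 2): (-0.037, "PIT+"), (0, 1): (-0.043, "PIT+"),
    (1, 2): (-0.083, "PIT++"), (0, 2): (-0.104, "PIT++"),
}


def tally_by_class(pitches):
    """Count pitch types per count class.

    pitches is an iterable of (balls, strikes, pitch_type), where balls and
    strikes describe the count before the pitch was thrown.
    """
    table = defaultdict(Counter)
    for balls, strikes, pitch_type in pitches:
        _, count_class = COUNT_TABLE[(balls, strikes)]
        table[count_class][pitch_type] += 1
    return table


# Made-up sample data, for illustration only.
sample = [(0, 0, "straight"), (3, 1, "straight"), (0, 2, "forkball"), (1, 2, "slider")]
result = tally_by_class(sample)
for count_class, counts in result.items():
    print(count_class, dict(counts))
```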


The pitch types are (in order top-bottom): Straight, Curve, Forkball, Slider, Shuuto, Sinker, Changeup, Cutter, Special (may be any), and Unknown. Akahoshi faces a lot of straights and sliders (combined ~75%), although on good hitters' counts the pitches tend to be grooved in with straights. Utsumi is primarily a three-pitch pitcher (straight, slider, changeup) who uses his off-speed pitches in the middle of at-bats.

The next four sections are split into "versus left" and "versus right".

The first section ("Stats") is blank, and I am not sure if it should be kept, as these stats are available everywhere. The next section, "Zones", shows zonal tendencies. For pitchers, the "PIT#" column shows where a batter would be standing. So we can see that Utsumi likes to throw low and away to left-handed batters as well as right-handed batters, but also throws low and inside to righties. PIT# is the number of pitches, and the top number under zone indicates the percentage of pitches inside the strike zone; the other percentage is outside. The last row shows the percentage of pitches that resulted in strikes or fouls (str.+foul), balls or passed balls (ball+PB), and everything else (other) such as hits, in-play outs, etc.

For batters, the section is the same except for the "location" of where the batter is standing. For instance, since Akahoshi is a left-hander, we can see that he tends to be pitched outside by both lefties and righties. Left-handers tend to throw more off the lower outside corner than right-handers, who throw low and low-inside a bit more, likely mostly because of the number of sliders that he sees. Right-handers also throw high and outside a bit more. I think I have batter handedness by pitch recorded, although for the most part it would seem batters would go for the platoon split.

The "Spray" section shows where batted balls ended up on a poor representation of a field. The four cell positions show (from top-left, clockwise) total batted balls, grounders, line drives, and fly balls. The positions are typical, although the shortstop is above the third baseman. Green represents infielders, pitchers, and catchers; blue is for outfielders; red is for home runs; and foul outs are in the orange section near the guide. The totals sections are shaded by frequency. Akahoshi is a push-hitter, especially against lefties, while Utsumi gets righties to pull the ball. The surrounding cells in the corners show the total number of batted balls of each type, and the last row shows batted balls by field-thirds: left (left field, 3rd base, shortstop), center (center field, pitcher, catcher), and right (right field, 2nd base, 1st base). [The "created by" section should be vertically aligned]

The last section, "Type", shows the results of plays ("H"its or "O"uts) by location and batted ball type. The sections are the same as the field thirds, although pitchers and catchers are not included. The batted ball types are grounders (G), line drives (L), and fly balls (F). The final row shows the player's Groundball to Flyball Ratio (GB/FB), the percentage of flyballs that are infield flies (IF/F), and the percentage of outfield flyballs that are home runs (HR/OF). The section shows that Akahoshi tends to push balls low for grounders and line drives, and is a ground ball hitter. Utsumi is a groundball pitcher, especially versus lefties.
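
The three ratios in the final row can be sketched as below. This is an illustration under stated assumptions, not the exact calculation behind the pages: it assumes the fly ball total includes infield flies, that outfield fly balls are simply fly balls minus infield flies, and that home runs are counted among the outfield fly balls. The function name and the sample counts are made up.

```python
def batted_ball_ratios(grounders, fly_balls, infield_flies, home_runs):
    """Return (GB/FB, IF/F, HR/OF) from batted-ball counts.

    Assumes fly_balls includes infield_flies and that every home run is
    one of the outfield fly balls. A ratio whose denominator is zero is
    returned as None.
    """
    outfield_flies = fly_balls - infield_flies
    gb_fb = grounders / fly_balls if fly_balls else None
    if_f = infield_flies / fly_balls if fly_balls else None
    hr_of = home_runs / outfield_flies if outfield_flies else None
    return gb_fb, if_f, hr_of


# Made-up counts, for illustration only.
gb_fb, if_f, hr_of = batted_ball_ratios(grounders=120, fly_balls=80, infield_flies=10, home_runs=7)
print(f"GB/FB {gb_fb:.2f}  IF/F {if_f:.1%}  HR/OF {hr_of:.1%}")
```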

The data can also be used to determine Win Probability Added as seen here.

Also does anyone have a yahooNPB to WestbayID table?

Thanks.
Michael Eng
Comments
Re: Building a pitch-by-pitch database
[ Author: westbaystars | Posted: Aug 7, 2008 11:18 PM | YBS Fan ]

- Also does anyone have a yahooNPB to WestbayID table?

I had created one last year. I could probably do it again (and/or expand last year's). If I don't have something for you by the end of next week, please remind me.
Re: Building a pitch-by-pitch database
[ Author: Deanna | Posted: Aug 8, 2008 1:51 AM | NIP Fan ]

I wrote a Perl script to download full Yahoo games play-by-play last year:
use npbtools;
use LWP::UserAgent;

my $gamenum = shift;
my $dirname = shift;
my $maxinnings = shift || 9;

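for (my $inning = 1; $inning <= $maxinnings; $inning++) {
for (my $team = 1; $team <= 2; $team++) {
# The upper bound on $pos was lost from the original post; 20 is an assumed placeholder.
for (my $pos = 1; $pos <= 20; $pos++) {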
my $browser = LWP::UserAgent->new;
my $realinn = sprintf "%02d", $inning;
my $realpos = sprintf "%02d", $pos;

my $boxurl = "http://baseball.yahoo.co.jp/npb/live\?id\=$gamenum\&key\=$realinn\_$team\_$realpos";
my $boxpage = $browser->get($boxurl);
next if $boxpage->code eq 404;
my $boxscore = $boxpage->content;
my $outfname = "$dirname/$gamenum\_$realinn\_$team\_$realpos.html";
my @outlines = ($boxscore);
print "processing $outfname\n";
filePut ($outfname, \@outlines);
}
}
}

It should be noted that my "npbtools" Perl library is just what I wrote with utilities like filePut and fileGet and doesn't actually bear on this script. This just grabs all the individual frames and dumps them to a directory.

I also have something that processes each individual page, but it doesn't do everything I'd like yet, and I simply don't have time to work on this stuff during the season, unless someone particularly wants to hire me to do it.

If you want to work with me on some Perl stuff, though, feel free to shoot me an e-mail. I've probably got the same ideas you have, and just absolutely no time to implement them. (I actually did some pitch speed and percentage stuff a while back for my own curiosity, basically.) And then you could work directly with the scoresheets rather than reprocessing the translated output.

I have a YahooNPB to full name converter -- that's how I display the names in the translated box scores on this site -- but I think something got messed up when Westbay tried to make me a Yahoo->Westbay list last time.
Re: Building a pitch-by-pitch database
[ Author: westbaystars | Posted: Aug 8, 2008 10:33 AM | YBS Fan ]

[...] but I think something got messed up when Westbay tried to make me a Yahoo->Westbay list last time.

Hey, I never got a bug report saying such. :P
Re: Building a pitch-by-pitch database
[ Author: Guest: Michael Eng | Posted: Apr 22, 2009 12:52 AM ]

Sorry to revive an old thread, but I wanted to provide an update. I have a working pitch-by-pitch/play-by-play extractor from Yahoo! NPB which I have published here on Google Docs. It uses Python and will create two CSV files, a pitch file and a play file.

The pitches file's format is:

  • pa_id (YYYYMMDDgame(2)_inning_half-inning_batter)

  • pitch to this batter

  • pitches by pitcher

  • pitch location

  • speed

  • pitch symbol

  • pitch type (straight, slider, etc.; also not translated)

  • result class (hit, out, etc.)

  • result of this pitch (not translated)

  • strikes after this pitch

  • balls after this pitch

  • outs after this pitch

  • away team score after this pitch

  • home team score after this pitch
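
A rough sketch of reading such a pitch file follows. The column names in PITCH_COLUMNS are my own labels for the fields listed above, not headers from the actual file; the sample row is made up; and parse_pa_id assumes the pa_id is the game id and key joined by underscores, as in 2008032001_01_1_01.

```python
import csv
import io
from typing import NamedTuple

# Assumed labels for the 14 fields listed above, in order.
PITCH_COLUMNS = [
    "pa_id", "pitch_of_pa", "pitch_of_pitcher", "location", "speed",
    "symbol", "pitch_type", "result_class", "result",
    "strikes", "balls", "outs", "away_score", "home_score",
]


class PaId(NamedTuple):
    date: str    # YYYYMMDD
    game: int    # two-digit game number
    inning: int
    half: int    # 1 = top, 2 = bottom
    batter: int


def parse_pa_id(pa_id):
    """Split a pa_id such as '2008032001_01_1_01' into its parts."""
    game_id, inning, half, batter = pa_id.split("_")
    return PaId(game_id[:8], int(game_id[8:]), int(inning), int(half), int(batter))


# Made-up row standing in for one line of the pitch file.
sample = "2008032001_01_1_01,1,1,5,142,S,straight,strike,called strike,1,0,0,0,0\n"
rows = list(csv.DictReader(io.StringIO(sample), fieldnames=PITCH_COLUMNS))
first = parse_pa_id(rows[0]["pa_id"])
print(first, rows[0]["pitch_type"], rows[0]["speed"])
```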



The play file's format is:

  • pa_id (YYYYMMDDgame(2)_inning_half-inning_batter)

  • base runners image (i.e. b000 means no base runners)

  • result image (e.g. o07f means "fly out to left fielder")

  • base runners on first (not translated)

  • base runners on second (not translated)

  • base runners on third (not translated)

  • pitcher (YahooID)

  • pitcher handedness

  • batter (YahooID)

  • batter handedness

  • pitches in this at-bat

  • [next 18] away team lineup in the format: player (YahooID), position (scorecard number)

  • blank

  • [next 18] home team lineup in the format: player(YahooID), position (scorecard number)

  • blank
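
Each 18-field lineup block can be regrouped into nine (player, position) pairs. A minimal sketch, assuming the 18 fields have already been sliced out of the row; the function name and the sample IDs are made up for illustration, and the position is kept as text in case a value is not a plain number.

```python
def lineup_pairs(fields):
    """Group 18 lineup fields into nine (player_id, position) pairs."""
    if len(fields) != 18:
        raise ValueError(f"expected 18 lineup fields, got {len(fields)}")
    return [(fields[i], fields[i + 1]) for i in range(0, 18, 2)]


# Made-up player IDs paired with scorecard position numbers.
sample_fields = []
for slot, position in enumerate(["8", "4", "6", "3", "7", "5", "9", "2", "1"], start=1):
    sample_fields += [f"id{slot:03d}", position]

lineup = lineup_pairs(sample_fields)
print(lineup[0], lineup[-1])
```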



Still, there are some things that could be worked out. Since base runners are not listed by their Yahoo! ID, but rather in kanji or katakana, and since I do not know how players with the same last name are differentiated by the site, I have left them as is. I did the same for pitch_result and pitch_type since result_names is incomplete and pitch_names may also be incomplete.

Also, the script will probably break down on postponed or suspended games.

Anyway, I posted this in the hope that people find it useful.

Michael Eng
Copyright (c) 1995-2024 JapaneseBaseball.com.
This work is licensed under a Creative Commons License.
Some rights reserved.