Bookmarks » Corpus & Dataset: 2017

2017-09-02

Download free history data on 16 currencies, gold and silver. Backtest trading strategies on historical data.

http://www.forextester.com/data/datasources

Forex Tester allows you to import an unlimited number of currency pairs and years of history data in almost any possible text format (ASCII *.csv, *.txt) and in MetaTrader4 history format (*.hst).

Exchange-rate data.
USD/JPY is available as 1-minute bars covering 2001 to 2017, in CSV format.
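
As a rough illustration of working with 1-minute CSV bars like these, the sketch below aggregates them into hourly OHLC bars. The column layout (`TICKER, DATE, TIME, OPEN, HIGH, LOW, CLOSE, VOL`) and the sample rows are assumptions made for the example and are not taken from the downloaded files, so check the real header before reusing it.

```python
import csv
import io

# Assumed layout (not confirmed by the source): TICKER,DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOL
# The rows below are made-up sample values.
SAMPLE = """\
<TICKER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>
USDJPY,20170901,090000,110.10,110.15,110.05,110.12,4
USDJPY,20170901,090100,110.12,110.20,110.11,110.18,4
USDJPY,20170901,100000,110.18,110.25,110.00,110.02,4
"""


def hourly_bars(lines):
    """Aggregate 1-minute OHLC rows into hourly bars keyed by (date, hour)."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    bars = {}
    for _ticker, date, time, o, h, l, c, _vol in reader:
        key = (date, time[:2])  # the first two digits of HHMMSS are the hour
        o, h, l, c = float(o), float(h), float(l), float(c)
        if key not in bars:
            bars[key] = [o, h, l, c]
        else:
            bar = bars[key]
            bar[1] = max(bar[1], h)  # highest high seen in this hour
            bar[2] = min(bar[2], l)  # lowest low seen in this hour
            bar[3] = c               # rows are assumed to be in chronological order
    return bars


bars = hourly_bars(io.StringIO(SAMPLE))
for (date, hour), (o, h, l, c) in bars.items():
    print(date, hour, o, h, l, c)
```

A plain dictionary is enough here because the input is assumed to be sorted by time; for multi-year files a streaming approach like this also avoids loading everything into memory.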

2017-08-15

CHISE / Kanji Structure Information Database - CHaracter Information Service Environment

http://www.chise.org/ids/

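We are developing a database of kanji structure information based on the IDS (Ideographic Description Sequence) format of ISO/IEC 10646-1:2000.

It lists the components that make up each kanji (character), such as the left-hand (hen) and right-hand (tsukuri) parts.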
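
To make the IDS idea concrete, here is a minimal sketch of a parser that turns an IDS string into a tree and lists its leaf components. It relies only on the Unicode Ideographic Description Characters U+2FF0 to U+2FFB; the function names are made up for this example, and it assumes every component is a single Unicode character, which may not hold for every entry in the actual CHISE files.

```python
# Ideographic Description Characters (U+2FF0..U+2FFB) and their operand counts.
# U+2FF2 and U+2FF3 take three operands; the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(ids):
    """Parse an IDS string into a nested tuple tree: (operator, child, ...)."""

    def parse(pos):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a leaf component
        children = []
        for _ in range(IDC_ARITY[ch]):
            child, pos = parse(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree


def components(tree):
    """Flatten a parsed tree into its leaf components, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in components(child)]


# 湖 written as water radical + 胡, with 胡 itself expanded into 古 + 月.
tree = parse_ids("⿰氵⿰古月")
print(tree)
print(components(tree))
```

Because the operators are prefix notation with fixed operand counts, a single recursive pass is sufficient and no lookahead is needed.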

2017-06-28

MegaFace and MF2: Million-Scale Face Recognition

http://megaface.cs.washington.edu/

Challenge 1: Train on any dataset, test your method with 1 million distractors
Challenge 2: Training on 672K identities (4.7 Million photos), test at Million scale

Image data for face recognition can be downloaded: 4.7 million photos of roughly 670,000 people.

2017-06-09

Audio-Synchronized EPUB - Aozora Roudoku (青空朗読)

http://aozoraroudoku.jp/kensaku/kensaku-epub.html

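You can listen to readings in sync with the text of Aozora Bunko works.

Each EPUB contains alignment information (SMIL) linking the audio (MP3) to the text (XHTML), so it looks usable as training data for speech recognition models.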
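
A possible way to pull the audio/text alignment out of such a file is sketched below. It assumes the SMIL follows the EPUB 3 Media Overlays layout (`par` elements holding one `text` and one `audio` child with `clipBegin`/`clipEnd`); the sample document, file names, and ids are invented for the example, and the real files from this site may differ.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"

# Hypothetical sample in the EPUB 3 Media Overlays style; names and ids are made up.
SAMPLE = """\
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="p1">
      <text src="text.xhtml#s1"/>
      <audio src="audio.mp3" clipBegin="0:00:00.000" clipEnd="0:00:03.250"/>
    </par>
    <par id="p2">
      <text src="text.xhtml#s2"/>
      <audio src="audio.mp3" clipBegin="0:00:03.250" clipEnd="0:00:07.500"/>
    </par>
  </body>
</smil>
"""


def clock_to_seconds(value):
    """Convert 'h:mm:ss.fff', 'mm:ss.fff', '12.5s', or '12.5' to seconds.

    Other SMIL clock forms (such as 'ms', 'min', or 'h' suffixes) are not handled.
    """
    if value is None:
        return None
    if value.endswith("s") and not value.endswith("ms"):
        value = value[:-1]
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def alignments(smil_text):
    """Return (text_src, audio_src, begin_sec, end_sec) for each <par>."""
    root = ET.fromstring(smil_text)
    pairs = []
    for par in root.iter(f"{{{SMIL_NS}}}par"):
        text = par.find(f"{{{SMIL_NS}}}text")
        audio = par.find(f"{{{SMIL_NS}}}audio")
        if text is None or audio is None:
            continue  # skip fragments without both halves of the alignment
        pairs.append((
            text.get("src"),
            audio.get("src"),
            clock_to_seconds(audio.get("clipBegin", "0")),
            clock_to_seconds(audio.get("clipEnd")),  # None if clipEnd is absent
        ))
    return pairs


for row in alignments(SAMPLE):
    print(row)
```

Each tuple gives a text fragment identifier plus an audio time span, which is the sentence-level pairing a speech recognition training pipeline would typically need; looking up the actual sentence text in the XHTML is left out.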

2017-05-19

GitHub - googlecreativelab/quickdraw-dataset: Documentation on how to access and use the Quick, Draw! Dataset.

https://github.com/googlecreativelab/quickdraw-dataset

The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.

Hand-drawn vector data for 345 categories, such as airplanes and dogs, is published as ndjson files and in other formats.
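
Reading the ndjson files could look roughly like the sketch below: one JSON object per line, with each stroke in `drawing` stored as parallel x and y arrays (the raw files add a third array of timestamps). The sample records are invented for the example, and the helper names are hypothetical.

```python
import json

# Made-up records in the shape of the ndjson files (one JSON object per line).
SAMPLE = "\n".join([
    json.dumps({
        "key_id": "1", "word": "airplane", "countrycode": "JP",
        "timestamp": "2017-03-01 00:00:00.00000 UTC", "recognized": True,
        "drawing": [[[0, 10, 20], [5, 5, 8]], [[20, 25], [8, 30]]],
    }),
    json.dumps({
        "key_id": "2", "word": "dog", "countrycode": "US",
        "timestamp": "2017-03-02 00:00:00.00000 UTC", "recognized": False,
        "drawing": [[[1, 2], [3, 4]]],
    }),
])


def load_drawings(lines):
    """Yield one dict per non-empty ndjson line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def stroke_points(drawing):
    """Turn [[xs, ys, ...], ...] strokes into lists of (x, y) points."""
    return [list(zip(stroke[0], stroke[1])) for stroke in drawing]


records = list(load_drawings(SAMPLE.splitlines()))
for rec in records:
    print(rec["word"], rec["recognized"], stroke_points(rec["drawing"]))
```

With a real download, passing an open file object to `load_drawings` instead of `SAMPLE.splitlines()` streams the records without reading the whole file into memory.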

2017-05-11

GitHub - mdeff/fma: FMA: A Dataset For Music Analysis

https://github.com/mdeff/fma

The dataset is a dump of the Free Music Archive (FMA), an interactive library of high-quality, legal audio downloads.

The MP3-encoded audio data comes in several sizes:

1. fma_small.zip: 8,000 tracks of 30s, 8 balanced genres (GTZAN-like) (7.2 GiB)
2. fma_medium.zip: 25,000 tracks of 30s, 16 unbalanced genres (22 GiB)
3. fma_large.zip: 106,574 tracks of 30s, 161 unbalanced genres (93 GiB)
4. fma_full.zip: 106,574 untrimmed tracks, 161 unbalanced genres (879 GiB)

2017-05-01

Kyoto University Web Document Leads Corpus - KWDLC - KUROHASHI-KAWAHARA LAB

http://nlp.ist.i.kyoto-u.ac.jp/index.php?KWDLC

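This is a text corpus in which various kinds of linguistic information have been manually annotated on the first three sentences (the lead) of a wide range of web documents. Because it collects the three lead sentences of web documents, it covers documents of diverse genres and styles, including news articles, encyclopedia articles, blogs, and commercial pages. The corpus contains about 5,000 documents.

The annotations cover morphemes, named entities, syntax and case relations, anaphora and ellipsis relations, coreference, and discourse relations. Everything except the discourse relations was first analyzed automatically with the morphological analyzer JUMAN and the syntactic, case, and anaphora analyzer KNP, and the output was then corrected by experts. The discourse relations were annotated through crowdsourcing.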

2017-04-26

Datasets · arXivTimes/arXivTimes Wiki · GitHub

https://github.com/arXivTimes/arXivTimes/wiki/Datasets

Links to language corpora and to image and audio datasets.

2017-04-13

GitHub - visipedia/inat_comp: iNaturalist competition details

https://github.com/visipedia/inat_comp

There are a total of 5,089 categories in the dataset, with 579,184 training images and 95,986 validation images.

Training and validation images [186GB]

Training and validation annotations [26MB]

An image dataset covering 5,089 categories can be downloaded.
The image archive alone is 186 GB.

2017-03-19

Chat Dialogue Corpus - Dialogue Breakdown Detection Challenge

https://sites.google.com/site/dialoguebreakdowndetection/chat-dialogue-corpus

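This corpus consists of dialogue data in which users exchanged 21 utterances with a chat dialogue system built on the chat dialogue API that NTT DOCOMO has released to the public. It contains 1,146 dialogues from 116 speakers.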