Bookmarks » Corpus & Dataset: May 2017

2017-05-19

GitHub - googlecreativelab/quickdraw-dataset: Documentation on how to access and use the Quick, Draw! Dataset.

https://github.com/googlecreativelab/quickdraw-dataset

The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.

Hand-drawn vector data for 345 categories, such as airplanes and dogs, is published as ndjson files and in other formats.
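Since ndjson stores one JSON object per line, the files can be read with the standard library alone. The following is a minimal sketch, not the repository's own loader: the field names `word`, `countrycode`, and `drawing`, and the per-stroke layout `[xs, ys, ts]`, are assumptions about the raw file format, and the two sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_drawings(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty ndjson line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def count_points(record: dict) -> int:
    """Count the points in a drawing.

    Assumes each stroke is a list [xs, ys, ts] of equal-length lists.
    """
    return sum(len(stroke[0]) for stroke in record["drawing"])


# Made-up sample lines standing in for a real .ndjson file.
sample = [
    '{"word": "airplane", "countrycode": "JP", '
    '"drawing": [[[0, 5, 9], [0, 3, 1], [0, 40, 80]]]}',
    "",
    '{"word": "dog", "countrycode": "US", '
    '"drawing": [[[1, 2], [1, 2], [0, 10]], [[3, 4, 5], [3, 4, 5], [20, 30, 40]]]}',
]

records = list(iter_drawings(sample))
for rec in records:
    print(rec["word"], rec["countrycode"], count_points(rec))
```

An open file handle can be passed in place of `sample`, because iterating over a text file also yields one line at a time.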

2017-05-11

GitHub - mdeff/fma: FMA: A Dataset For Music Analysis

https://github.com/mdeff/fma

The dataset is a dump of the Free Music Archive (FMA), an interactive library of high-quality, legal audio downloads.

The MP3-encoded audio data is available in several sizes:

1. fma_small.zip: 8,000 tracks of 30s, 8 balanced genres (GTZAN-like) (7.2 GiB)
2. fma_medium.zip: 25,000 tracks of 30s, 16 unbalanced genres (22 GiB)
3. fma_large.zip: 106,574 tracks of 30s, 161 unbalanced genres (93 GiB)
4. fma_full.zip: 106,574 untrimmed tracks, 161 unbalanced genres (879 GiB)
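To get a feel for these sizes, the total playing time and the implied average bitrate of the three 30-second subsets can be estimated from the numbers in the list. This is a rough sketch that assumes every clip is exactly 30 s long and that each archive holds nothing but audio (zip overhead and metadata are ignored). `fma_full.zip` is left out because its track lengths are not given.

```python
GIB = 2**30  # bytes per GiB
CLIP_SECONDS = 30  # assumed exact clip length

# (number of tracks, archive size in GiB), taken from the list above
subsets = {
    "fma_small": (8_000, 7.2),
    "fma_medium": (25_000, 22),
    "fma_large": (106_574, 93),
}


def summarize(tracks: int, size_gib: float) -> tuple[float, float]:
    """Return (hours of audio, average bitrate in kbit/s)."""
    total_seconds = tracks * CLIP_SECONDS
    hours = total_seconds / 3600
    kbps = size_gib * GIB * 8 / total_seconds / 1000
    return hours, kbps


stats = {name: summarize(*spec) for name, spec in subsets.items()}
for name, (hours, kbps) in stats.items():
    print(f"{name}: {hours:.1f} h of audio, ~{kbps:.0f} kbit/s on average")
```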

2017-05-01

Kyoto University Web Document Leads Corpus - KWDLC - KUROHASHI-KAWAHARA LAB

http://nlp.ist.i.kyoto-u.ac.jp/index.php?KWDLC

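This is a text corpus in which various kinds of linguistic information have been manually annotated on the first three sentences (the lead) of a wide range of web documents. Because it collects the three lead sentences of web documents, it covers diverse genres and writing styles, including news articles, encyclopedia articles, blogs, and commercial pages. The corpus contains about 5,000 documents.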

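The annotations cover morphemes, named entities, syntax, case relations, anaphora and ellipsis relations, coreference, and discourse relations. Everything except the discourse relations was first analyzed automatically with the morphological analyzer JUMAN and the syntactic, case, and anaphora analyzer KNP, and the results were then corrected by experts. The discourse relations were annotated through crowdsourcing.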