Bookmarks » Corpus & Dataset

2018-08-10

UAX #38: Unicode Han Database (Unihan)

http://www.unicode.org/reports/tr38/#Unihan.zip

2.2 Unihan.zip

Included with the Unicode Character Database is a file called Unihan.zip.

The Unihan.zip file can be downloaded from the following URL.
ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip

How to convert between Japanese kanji and Chinese Traditional and Simplified characters

Take the character "電" (electricity) from the table in [1] as an example.
The Unicode code point of the Japanese "電" is U+96FB.
Searching Unihan_Variants.txt in Unihan.zip for U+96FB finds the following line.

U+96FB kSimplifiedVariant U+7535

U+7535 is the Simplified Chinese character "电".
The reverse conversion can be done the same way.
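
The lookup above can be sketched in Python. This is a minimal sketch, not a full Unihan reader: it assumes each data line has three whitespace-separated fields (code point, field name, value), that lines starting with "#" are comments, and it uses only the single line quoted above as sample data. The function names parse_variants and invert are made up for this note.

```python
def parse_variants(lines, field):
    """Map each source character to the list of variants given by `field`."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()
        if len(parts) < 3 or parts[1] != field:
            continue  # not the field we are looking for
        source = chr(int(parts[0][2:], 16))  # "U+96FB" -> "電"
        # Assumption: a value may carry a "<source" suffix; drop it if present.
        targets = [chr(int(p.split("<")[0][2:], 16)) for p in parts[2:]]
        mapping[source] = targets
    return mapping


def invert(mapping):
    """Build the reverse table (variant -> list of source characters)."""
    inverse = {}
    for source, targets in mapping.items():
        for target in targets:
            inverse.setdefault(target, []).append(source)
    return inverse


# Sample data: the one line quoted in this note, with tabs assumed as separators.
SAMPLE_LINES = ["U+96FB\tkSimplifiedVariant\tU+7535"]

to_simplified = parse_variants(SAMPLE_LINES, "kSimplifiedVariant")
to_traditional = invert(to_simplified)
print(to_simplified["電"])
print(to_traditional["电"])
```

To run it on the real data, pass an open Unihan_Variants.txt file object (UTF-8) instead of SAMPLE_LINES. A value can list more than one variant, so each entry is kept as a list rather than a single character.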

See also [2] for related information.

[1] https://en.wikipedia.org/wiki/Chinese_characters#Comparisons_of_traditional_Chinese.2C_simplified_Chinese.2C_and_Japanese
[2] https://www.reddit.com/r/LearnJapanese/comments/57wa0c/where_can_i_find_a_mapping_of_which_kanji/

2018-07-24

GitHub - aozorabunko/aozorabunko

https://github.com/aozorabunko/aozorabunko

All of the Aozora Bunko data is available on GitHub.

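Source
知らなかった……“青空文庫”の全データは“GitHub”から一括ダウンロードできる！ - やじうまの杜 - 窓の杜 ("I didn't know... all of the Aozora Bunko data can be downloaded in bulk from GitHub!", Yajiuma no Mori, Mado no Mori)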
https://forest.watch.impress.co.jp/docs/serial/yajiuma/1134357.html

2017-09-02

Download free history data on 16 currencies, gold and silver. Backtest trading strategies on historical data.

http://www.forextester.com/data/datasources

Forex Tester allows you to import an unlimited number of currency pairs and years of history data in almost any possible text format (ASCII *.csv, *.txt) and in MetaTrader4 history format (*.hst).

Exchange rate data.
USD/JPY is provided as 1-minute bars covering 2001 to 2017, in CSV format.

2017-08-15

CHISE / Kanji Structure Information Database - CHaracter Information Service Environment

http://www.chise.org/ids/

We are developing a database of kanji structure information based on the IDS format of ISO/IEC 10646-1:2000.

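It lists the component information of kanji (characters), such as the left-hand radical (hen) and the right-hand component (tsukuri).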
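
An IDS (Ideographic Description Sequence) is written in prefix notation: a description character such as ⿰ (left-right) or ⿱ (top-bottom) is followed by its components, which may themselves be nested sequences. The sketch below parses one IDS string into a nested tuple. It assumes only the twelve description characters U+2FF0 to U+2FFB and single-character components. The sample strings are illustrative and not copied from the CHISE files, and parse_ids is a name made up for this note.

```python
# U+2FF2 and U+2FF3 take three components; the other ten take two.
TERNARY = {"\u2ff2", "\u2ff3"}
BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY


def parse_ids(ids):
    """Parse a prefix-notation IDS string into a nested tuple tree."""

    def parse(pos):
        ch = ids[pos]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            children = []
            pos += 1
            for _ in range(arity):
                child, pos = parse(pos)
                children.append(child)
            return (ch, *children), pos
        return ch, pos + 1  # a plain component character

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


print(parse_ids("⿰氵工"))      # left-right: water radical + 工
print(parse_ids("⿱艹⿰氵工"))  # top-bottom, with a nested left-right part
```

Components that are not single characters (for example, references to unencoded parts) are outside the scope of this sketch and would need extra handling.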

2017-06-28

MegaFace and MF2: Million-Scale Face Recognition

http://megaface.cs.washington.edu/

Challenge 1: Train on any dataset, test your method with 1 million distractors
Challenge 2: Training on 672K identities (4.7 Million photos), test at Million scale

Face recognition image data covering about 670,000 people (4.7 million photos) can be downloaded.

2017-06-09

Audio-synchronized EPUB - Aozora Roudoku

http://aozoraroudoku.jp/kensaku/kensaku-epub.html

You can listen to readings in sync with the Aozora Bunko text.

Each file contains the mapping (smil) between the audio (mp3) and the text (xhtml), so it looks usable as training data for speech recognition models.
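
A rough sketch of how the audio-to-text mapping could be read with the Python standard library. It assumes the smil files follow the usual EPUB 3 Media Overlays layout (par elements pairing a text reference with an audio clip, in the http://www.w3.org/ns/SMIL namespace). The sample document, its file names, and the function name read_alignment are invented for illustration and are not taken from an actual Aozora Roudoku file.

```python
import xml.etree.ElementTree as ET

SMIL_NS = {"smil": "http://www.w3.org/ns/SMIL"}

# Hypothetical sample in the EPUB 3 Media Overlays style.
SAMPLE_SMIL = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="p1">
      <text src="text.xhtml#s1"/>
      <audio src="audio.mp3" clipBegin="0:00:00.000" clipEnd="0:00:03.200"/>
    </par>
    <par id="p2">
      <text src="text.xhtml#s2"/>
      <audio src="audio.mp3" clipBegin="0:00:03.200" clipEnd="0:00:07.500"/>
    </par>
  </body>
</smil>"""


def read_alignment(smil_text):
    """Return (text_ref, audio_file, clip_begin, clip_end) for each par."""
    root = ET.fromstring(smil_text)
    pairs = []
    for par in root.iter("{http://www.w3.org/ns/SMIL}par"):
        text = par.find("smil:text", SMIL_NS)
        audio = par.find("smil:audio", SMIL_NS)
        if text is None or audio is None:
            continue  # skip pars that lack either half of the pair
        pairs.append(
            (text.get("src"), audio.get("src"),
             audio.get("clipBegin"), audio.get("clipEnd"))
        )
    return pairs


for row in read_alignment(SAMPLE_SMIL):
    print(row)
```

The clip times are left as strings because SMIL allows several clock-value formats. Converting them to seconds and looking up the referenced xhtml fragment would be the next steps toward building training pairs.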

2017-05-19

GitHub - googlecreativelab/quickdraw-dataset: Documentation on how to access and use the Quick, Draw! Dataset.

https://github.com/googlecreativelab/quickdraw-dataset

The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.

Hand-drawn vector data for 345 categories, such as airplanes and dogs, is published as ndjson files and in other formats.
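
A small sketch of reading the ndjson files: each line is one JSON object, and the drawing field is a list of strokes, where each stroke holds a list of x values and a list of y values (plus timestamps in the raw files). The two sample records below are made up for illustration, and load_drawings and count_points are names invented for this note.

```python
import json

# Hypothetical records in the ndjson layout (one JSON object per line).
SAMPLE_NDJSON = "\n".join([
    json.dumps({
        "word": "airplane",
        "countrycode": "JP",
        "recognized": True,
        "drawing": [[[0, 10, 20], [5, 5, 8]], [[10, 10], [0, 12]]],
    }),
    json.dumps({
        "word": "dog",
        "countrycode": "US",
        "recognized": False,
        "drawing": [[[3, 4], [7, 9]]],
    }),
])


def load_drawings(text):
    """Parse ndjson text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_points(drawing):
    """Total number of points over all strokes (length of each x list)."""
    return sum(len(stroke[0]) for stroke in drawing)


drawings = load_drawings(SAMPLE_NDJSON)
for d in drawings:
    print(d["word"], len(d["drawing"]), "strokes,", count_points(d["drawing"]), "points")
```

For the real files, iterating over an open file object line by line avoids loading a whole category into memory at once.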

2017-05-11

GitHub - mdeff/fma: FMA: A Dataset For Music Analysis

https://github.com/mdeff/fma

The dataset is a dump of the Free Music Archive (FMA), an interactive library of high-quality, legal audio downloads.

Then, you got various sizes of MP3-encoded audio data:

1. fma_small.zip: 8,000 tracks of 30s, 8 balanced genres (GTZAN-like) (7.2 GiB)
2. fma_medium.zip: 25,000 tracks of 30s, 16 unbalanced genres (22 GiB)
3. fma_large.zip: 106,574 tracks of 30s, 161 unbalanced genres (93 GiB)
4. fma_full.zip: 106,574 untrimmed tracks, 161 unbalanced genres (879 GiB)

2017-05-01

Kyoto University Web Document Leads Corpus - KWDLC - KUROHASHI-KAWAHARA LAB

http://nlp.ist.i.kyoto-u.ac.jp/index.php?KWDLC

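This is a text corpus in which various kinds of linguistic information have been manually annotated on the lead (opening) three sentences of a wide range of web documents. Because it collects the lead three sentences of web documents, it includes documents of diverse genres and styles, such as news articles, encyclopedia articles, blogs, and commercial pages. The corpus contains about 5,000 documents.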

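The linguistic annotations cover morphemes, named entities, syntax and case relations, anaphora and ellipsis relations, coreference, and discourse relations. Everything except the discourse relations was produced by automatic analysis with the morphological analyzer JUMAN and the syntactic, case, and anaphora analyzer KNP, and the results were then corrected by experts. The discourse relations were annotated through crowdsourcing.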

2017-04-26

Datasets · arXivTimes/arXivTimes Wiki · GitHub

https://github.com/arXivTimes/arXivTimes/wiki/Datasets

Links to language corpora and to image and audio datasets.