truelearn.datasets.load_peek_dataset_raw
- truelearn.datasets.load_peek_dataset_raw(*, dirname: str = '.', train_limit: Optional[int] = None, test_limit: Optional[int] = None, verbose: bool = True) -> Tuple[List[List[str]], List[List[str]], Dict[int, Tuple[str, str, str]]]
Download and load the raw PEEKDataset.
Examples
To load the data:
>>> from truelearn.datasets import load_peek_dataset_raw
>>> train, test, mapping = load_peek_dataset_raw(verbose=False)
>>> len(train)
203590
>>> train[0]
['4248', '1', '1', '172.0', ..., '1']
>>> len(test)
86945
>>> test[0]
['12730', '1', '1', '0.0', ..., '0']
>>> len(mapping)
30367
>>> mapping[0]
('https://en.wikipedia.org/wiki/"Hello,_World!"_program', '"Hello, World!" program', "Traditional beginners' computer program")
- Parameters:
* – Used to reject positional arguments; every parameter must be passed by keyword.
dirname – The directory name. Defaults to '.'.
train_limit – An optional non-negative integer specifying the maximum number of lines to read from the train file. If None, it means no limit.
test_limit – An optional non-negative integer specifying the maximum number of lines to read from the test file. If None, it means no limit.
verbose – If True and the downloaded file doesn’t exist, this function outputs some information about the downloaded file.
- Returns:
A tuple of (train, test, mapping), where train and test are lists of lines and mapping is a dict mapping topic_id to (url, title, description). Each line in train and test is a list of strings, where each string is the value of a cell in the original CSV.
The data looks like this:
(
    [
        [slug, vid_id, ...],  # follow the structure of header
        ...
    ],
    [
        ...
    ],
    {
        0: (url, title, description),  # 0 is wiki id
        ...
    }
)
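As a minimal sketch of how the returned mapping might be consumed, the helper below looks up titles by topic id. The function name `topic_titles` and the `"<unknown>"` placeholder are hypothetical and not part of truelearn; the toy mapping contains only the single entry shown in the Examples section.

```python
from typing import Dict, List, Tuple

# Same shape as the third element of the returned tuple:
# topic_id -> (url, title, description)
Mapping = Dict[int, Tuple[str, str, str]]


def topic_titles(mapping: Mapping, topic_ids: List[int]) -> List[str]:
    """Return the title for each topic id, or "<unknown>" if the id is absent.

    Hypothetical helper for illustration; not provided by truelearn.
    """
    return [mapping[t][1] if t in mapping else "<unknown>" for t in topic_ids]


# Toy mapping holding only the entry shown in the Examples section.
sample_mapping: Mapping = {
    0: (
        'https://en.wikipedia.org/wiki/"Hello,_World!"_program',
        '"Hello, World!" program',
        "Traditional beginners' computer program",
    ),
}

print(topic_titles(sample_mapping, [0, 99]))
```

With the real dataset, `mapping` from `load_peek_dataset_raw` could be passed in place of `sample_mapping`.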
- Raises:
TrueLearnValueError – If the train_limit or test_limit is less than 0.
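The limit semantics described above (None means no limit, a negative value is rejected) can be illustrated with a standard-library sketch. This is not truelearn's implementation: `read_rows` is a hypothetical helper, the in-memory CSV is made-up sample data, and the sketch raises the built-in `ValueError` where the library raises `TrueLearnValueError`. Whether the library counts a header line toward the limit is not addressed here.

```python
import csv
import io
from itertools import islice
from typing import Iterable, List, Optional


def read_rows(lines: Iterable[str], limit: Optional[int] = None) -> List[List[str]]:
    """Read at most `limit` CSV rows; read every row when `limit` is None.

    Hypothetical helper for illustration; not part of truelearn.
    """
    if limit is not None and limit < 0:
        # truelearn raises TrueLearnValueError here; ValueError stands in for it.
        raise ValueError(f"limit must be non-negative, got {limit}")
    # islice(iterable, None) yields every item, so None naturally means "no limit".
    return list(islice(csv.reader(lines), limit))


# Made-up sample data, only to exercise the helper.
sample_csv = "a,1,1\nb,2,0\nc,3,1\n"

print(read_rows(io.StringIO(sample_csv)))
print(read_rows(io.StringIO(sample_csv), limit=2))
```

Each row comes back as a list of strings, matching the shape of the lines in train and test.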