This collection contains the datasets created as part of a master's thesis. It consists of two datasets, each provided in two formats, together with the corresponding entity descriptions for each dataset.<div><br></div><div>The <i>experiment_doc_labels_clean.json</i> file contains the data used for the experiments. It consists of a list of JSON objects, each with the following fields:</div><div>id: Document ID.</div><div>ner_tags: List of IOB tags indicating mention boundaries, based on the majority label assigned through crowdsourcing.</div><div>el_tags: List of entity IDs, based on the majority label assigned through crowdsourcing.</div><div>all_ner_tags: List of lists of IOB tags, one list per user who annotated the document.</div><div>all_el_tags: List of lists of entity IDs, one list per user who annotated the document.</div><div>tokens: List of tokens from the text.</div><div><br></div><div>The <i>experiment_doc_labels_clean-U.tsv</i> file contains the dataset used for the experiments in a format similar to CoNLL-U. The first line of each document contains the document ID, and documents are separated by a blank line. Each word in a document is on its own line, consisting of the word, the IOB tag, and the entity ID, separated by tabs.</div><div><br></div><div>While the experiments were being completed, the annotation system was left open until all the documents had been annotated by three users. This resulted in the <i>all_docs_complete_labels_clean.json</i> and <i>all_docs_complete_labels_clean-U.tsv</i> datasets, which take the same form as <i>experiment_doc_labels_clean.json</i> and <i>experiment_doc_labels_clean-U.tsv</i>, respectively.</div><div><br></div><div>Each of the documents described above contains entity IDs. These IDs match the entities stored in the <i>entity_descriptions</i> CSV files. 
Each row in these files corresponds to a mention of an entity and takes the form:</div><div>{ID}${Mention}${Context}[N]</div><div>Three sets of entity descriptions are available:</div><div>1. <i>entity_descriptions_experiments.csv</i>: This file contains all the mentions from the subset of the data used for the experiments, as described above. However, the data has not been cleaned, so there are multiple entity IDs that refer to the same entity.</div><div>2. <i>entity_descriptions_experiments_clean.csv</i>: These entities also cover the data used for the experiments, but duplicate entities have been merged. These entities correspond to the labels for the documents in the <i>experiment_doc_labels_clean</i> files.</div><div>3. <i>entity_descriptions_all.csv</i>: The entities in this file correspond to the data in the <i>all_docs_complete_labels_clean</i> files. Please note that the entities have not been cleaned, so there may be duplicate or incorrect entities.</div>
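<div><br></div><div>As a rough illustration of how the files described above might be read, the following Python sketch parses the CoNLL-U-like layout, groups IOB tags into mention spans, and splits an entity description row. It is not the code used in the thesis. The function names are hypothetical, and several details are assumptions: that the columns in the TSV files are separated by tabs, that IOB tags begin with the letters B, I, or O, and that <code>$</code> is the field delimiter in the entity description rows. The meaning of <code>[N]</code> is not specified here, so everything after the second <code>$</code> is kept as raw text.</div>

```python
from typing import Dict, List, Tuple


def parse_conll_like(text: str) -> List[Dict]:
    """Parse the -U.tsv layout: an ID line, then one token per line,
    with documents separated by blank lines (assumed tab-separated)."""
    docs: List[Dict] = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current document.
            current = None
            continue
        if current is None:
            # The first non-blank line of a document is its ID.
            current = {"id": line.strip(), "tokens": [], "ner_tags": [], "el_tags": []}
            docs.append(current)
            continue
        # Pad short rows so a missing column becomes an empty string.
        token, ner_tag, el_tag = (line.split("\t") + ["", ""])[:3]
        current["tokens"].append(token)
        current["ner_tags"].append(ner_tag)
        current["el_tags"].append(el_tag)
    return docs


def mention_spans(ner_tags: List[str], el_tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB tags into (start, end_exclusive, entity_id) spans.
    The entity ID is taken from the first token of each mention."""
    spans: List[Tuple[int, int, str]] = []
    start = None
    for i, tag in enumerate(ner_tags):
        if tag.startswith("B"):
            if start is not None:
                spans.append((start, i, el_tags[start]))
            start = i
        elif tag.startswith("I"):
            if start is None:
                start = i  # tolerate an I tag that has no preceding B tag
        else:
            if start is not None:
                spans.append((start, i, el_tags[start]))
                start = None
    if start is not None:
        spans.append((start, len(ner_tags), el_tags[start]))
    return spans


def parse_entity_row(row: str) -> Dict[str, str]:
    """Split an entity description row on the first two '$' characters
    (assumed delimiter); the remainder is kept unparsed as the context."""
    entity_id, mention, context = row.rstrip("\r\n").split("$", 2)
    return {"id": entity_id, "mention": mention, "context": context}


if __name__ == "__main__":
    # Invented sample for illustration only; not taken from the dataset.
    sample = "doc_1\nBarack\tB\t12\nObama\tI\t12\nvisited\tO\t-\nParis\tB\t7\n"
    for doc in parse_conll_like(sample):
        print(doc["id"], mention_spans(doc["ner_tags"], doc["el_tags"]))
```

<div>The JSON files can be loaded with the standard <code>json</code> module, after which the <code>ner_tags</code> and <code>el_tags</code> lists of each object can be passed to <code>mention_spans</code> in the same way.</div>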