Sled Data

What Action Causes This? Towards Naive Physical Action-Effect Prediction

  • This dataset contains action-effect information for 140 verb-noun pairs. It has two parts: effects described in natural language, and effects depicted in images.
  • The language data contains verb-noun pairs and their effects described in natural language. For each verb-noun pair, its possible effects were described by 10 different annotators. The format of each line is “verb noun, effect_sentence[, effect_phrase_1, effect_phrase_2, effect_phrase_3 …]”. The effect_phrases were automatically extracted from their corresponding effect_sentences.
  • The image data contains images depicting action effects. For each verb-noun pair, an average of 15 positive images and 15 negative images were collected. Positive images are those deemed to capture the resulting world state of the action. Negative images are those deemed to capture some state of the related object (i.e., the noun in the verb-noun pair) that is not the resulting state of the corresponding action.
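
The line format of the language data described above can be read with a few lines of Python. This is a minimal sketch, not the dataset's official loader: it assumes that fields are separated by commas, that the effect sentence itself contains no commas, and that the verb is the first whitespace-separated token of the first field. The names EffectRecord and parse_effect_line, as well as the example line, are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EffectRecord:
    """One annotated effect for a verb-noun pair (hypothetical container)."""
    verb: str
    noun: str
    effect_sentence: str
    effect_phrases: List[str] = field(default_factory=list)


def parse_effect_line(line: str) -> EffectRecord:
    """Parse 'verb noun, effect_sentence[, effect_phrase_1, ...]'.

    Assumption: commas only separate fields, so a sentence that itself
    contains a comma would be split incorrectly by this sketch.
    """
    fields = [part.strip() for part in line.strip().split(",")]
    if len(fields) < 2:
        raise ValueError(f"expected at least 2 comma-separated fields: {line!r}")

    # The first field holds the verb-noun pair; split on the first space.
    verb, _, noun = fields[0].partition(" ")
    if not verb or not noun.strip():
        raise ValueError(f"expected 'verb noun' in first field: {fields[0]!r}")

    # Everything after the sentence is an optional effect phrase.
    phrases = [p for p in fields[2:] if p]
    return EffectRecord(verb, noun.strip(), fields[1], phrases)


# Invented example line, not taken from the dataset.
record = parse_effect_line("cut apple, the apple is in pieces, apple in pieces")
```

If the real files quote sentences that contain commas, the standard csv module would be a safer choice than a plain split.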

Physical Causality of Action Verbs in Grounded Language Understanding

  • This dataset contains verb causality information for 4391 sentences. Each sentence was annotated by three different annotators through crowdsourcing.
  • In the data file, each line contains the verb-object pair from one of the 4391 sentences, followed by an 18-dimensional causality vector. An element of the vector is 1 if at least two annotators labeled the corresponding causality attribute as true, and 0 otherwise. The 18 causality attributes are:
    1) AttachmentOfPart 
    2) Color 
    3) Containment 
    4) FlavorSmell 
    5) Location 
    6) OcclusionBySecondObject 
    7) Orientation 
    8) PresenceOfObject 
    9) Quantity 
    10) Shape 
    11) Size 
    12) Solidity 
    13) SurfaceIntegrity 
    14) Temperature 
    15) Texture 
    16) Visibility 
    17) Weight 
    18) Wetness
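
The aggregation rule described above (an element is 1 when at least two of the three annotators marked the attribute as true) can be sketched as follows. This is an illustration under assumptions rather than the authors' script: it assumes each annotator's labels are available as a list of 18 zeros and ones in the attribute order listed above, and the function names aggregate_votes and active_attributes are hypothetical.

```python
from typing import List, Sequence

# Attribute order as listed above (index 0 corresponds to attribute 1).
CAUSALITY_ATTRIBUTES = [
    "AttachmentOfPart", "Color", "Containment", "FlavorSmell", "Location",
    "OcclusionBySecondObject", "Orientation", "PresenceOfObject", "Quantity",
    "Shape", "Size", "Solidity", "SurfaceIntegrity", "Temperature",
    "Texture", "Visibility", "Weight", "Wetness",
]


def aggregate_votes(annotations: Sequence[Sequence[int]],
                    min_agree: int = 2) -> List[int]:
    """Combine per-annotator 0/1 labels into one causality vector.

    An element is 1 if at least `min_agree` annotators labeled it 1.
    """
    for labels in annotations:
        if len(labels) != len(CAUSALITY_ATTRIBUTES):
            raise ValueError("each annotation must have 18 elements")
    # zip(*annotations) yields one tuple of votes per attribute.
    return [1 if sum(votes) >= min_agree else 0 for votes in zip(*annotations)]


def active_attributes(vector: Sequence[int]) -> List[str]:
    """Return the names of the attributes set to 1 in a causality vector."""
    return [name for name, bit in zip(CAUSALITY_ATTRIBUTES, vector) if bit == 1]
```

The same active_attributes helper can be applied directly to the vectors stored in the data file, since they are already aggregated.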

Grounded Semantic Role Labeling

The dataset contains video recognition/tracking annotations and semantic role annotations of the accompanying text.

Interior Decoration Domain (Gaze)

In this study, a static 3D bedroom scene is shown to the user. The system verbally asks the user a series of questions about the bedroom, one at a time, and the user answers by speaking to the system.

  • The collected data includes users’ speech and accompanying gaze fixations (along with the objects the system identified as possibly selected).
  • Users’ speech was transcribed.
  • Data used for Language Processing and Acquisition:
    • data w/o audio (496K)
    • data w/ audio (87.5M)
    • data description available in README
  • Data used for Reference Resolution:
    • data w/o audio (357K)

Treasure Hunting Domain

In this domain, the user’s task is to find treasures that are hidden in a 3D castle. The user can walk around inside the castle and move objects. To find the treasures, the user needs to consult a remote “expert” (i.e., an artificial agent). The expert has some knowledge about the treasures but cannot see the castle, so the user has to talk to the expert for advice.

  • The collected data includes users’ speech and accompanying gaze fixations (along with the objects the system identified as possibly selected).
  • Users’ speech was transcribed.
  • Each speech-gaze instance (speech and its accompanying gaze fixations) was annotated for whether any word in the speech refers to an object fixated in the accompanying gaze fixations.
  • Data used for Language Processing and Acquisition:
    • data w/o audio (28.1M)
    • data w/ audio (564M)
    • data description is available in README
  • Data used for Reference Resolution:
    • data w/o audio (11.8M)

Interior Decoration Domain (Gesture)

In this domain, users were asked to accomplish tasks in two scenarios. Each scenario put the user into a specific role (e.g., college student, professor, etc.), and the task had to be completed under a set of constraints (e.g., budget for furnishings, bed size, number of domestic products, etc.). Users interacted with the system through a touchscreen using speech and deictic gestures.

  • The collected data includes users’ speech and accompanying gestures (along with the objects the system identified as possibly selected).
  • Users’ speech was transcribed; the intention (action to perform on an object) of each utterance and the true gesture selection were annotated.
  • Data:
    • data w/o audio (598K)
    • data w/ audio (43.6M)
  • Description of the data is available in README.

Conversation Entailment

The data used in the EMNLP 2010 paper can be downloaded here.

Nominal Semantic Role Labeling

Our annotated data is available here.