
**************
create_dataset
**************



.. py:method:: GlueDataBrew.Client.create_dataset(**kwargs)

  

  Creates a new DataBrew dataset.

  

  See also: `AWS API Documentation <https://docs.aws.amazon.com/goto/WebAPI/databrew-2017-07-25/CreateDataset>`_  


  **Request Syntax**
  ::

    response = client.create_dataset(
        Name='string',
        Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
        FormatOptions={
            'Json': {
                'MultiLine': True|False
            },
            'Excel': {
                'SheetNames': [
                    'string',
                ],
                'SheetIndexes': [
                    123,
                ],
                'HeaderRow': True|False
            },
            'Csv': {
                'Delimiter': 'string',
                'HeaderRow': True|False
            }
        },
        Input={
            'S3InputDefinition': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'DataCatalogInputDefinition': {
                'CatalogId': 'string',
                'DatabaseName': 'string',
                'TableName': 'string',
                'TempDirectory': {
                    'Bucket': 'string',
                    'Key': 'string',
                    'BucketOwner': 'string'
                }
            },
            'DatabaseInputDefinition': {
                'GlueConnectionName': 'string',
                'DatabaseTableName': 'string',
                'TempDirectory': {
                    'Bucket': 'string',
                    'Key': 'string',
                    'BucketOwner': 'string'
                },
                'QueryString': 'string'
            },
            'Metadata': {
                'SourceArn': 'string'
            }
        },
        PathOptions={
            'LastModifiedDateCondition': {
                'Expression': 'string',
                'ValuesMap': {
                    'string': 'string'
                }
            },
            'FilesLimit': {
                'MaxFiles': 123,
                'OrderedBy': 'LAST_MODIFIED_DATE',
                'Order': 'DESCENDING'|'ASCENDING'
            },
            'Parameters': {
                'string': {
                    'Name': 'string',
                    'Type': 'Datetime'|'Number'|'String',
                    'DatetimeOptions': {
                        'Format': 'string',
                        'TimezoneOffset': 'string',
                        'LocaleCode': 'string'
                    },
                    'CreateColumn': True|False,
                    'Filter': {
                        'Expression': 'string',
                        'ValuesMap': {
                            'string': 'string'
                        }
                    }
                }
            }
        },
        Tags={
            'string': 'string'
        }
    )
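
  As a rough illustration of the syntax above, the sketch below assembles the smallest request that carries both required parameters, ``Name`` and ``Input``, for a CSV object in Amazon S3. The helper ``minimal_s3_request`` and the dataset, bucket, and key names are hypothetical; the commented-out call assumes a client created with ``boto3.client('databrew')``.

  ::

    def minimal_s3_request(name, bucket, key):
        """Return keyword arguments for create_dataset (hypothetical helper)."""
        return {
            'Name': name,
            'Format': 'CSV',  # optional; one of the documented Format values
            'Input': {
                'S3InputDefinition': {
                    'Bucket': bucket,  # required inside S3InputDefinition
                    'Key': key,
                },
            },
        }

    request = minimal_s3_request('sales-2024', 'example-bucket', 'raw/sales.csv')
    # With a real client, the request would be sent as:
    #   response = client.create_dataset(**request)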
    
  :type Name: string
  :param Name: **[REQUIRED]** 

    The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.
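
    The character rule above can be checked locally before calling the API. This is a minimal sketch: ``is_valid_dataset_name`` is a hypothetical helper, and it checks only the allowed characters stated here, not any length limit the service may enforce.

    ::

      import re

      # Allowed characters as documented: A-Z, a-z, 0-9, hyphen, period, space.
      _NAME_PATTERN = re.compile(r'[A-Za-z0-9. -]+')

      def is_valid_dataset_name(name):
          """Return True if name uses only the documented characters."""
          return bool(_NAME_PATTERN.fullmatch(name))

      print(is_valid_dataset_name('sales-2024.v1'))  # True
      print(is_valid_dataset_name('sales_2024'))     # False (underscore)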

    

  
  :type Format: string
  :param Format: 

    The file format of a dataset that is created from an Amazon S3 file or folder.

    

  
  :type FormatOptions: dict
  :param FormatOptions: 

    Represents a set of options that define the structure of comma-separated value (CSV), Excel, or JSON input.

    

  
    - **Json** *(dict) --* 

      Options that define how JSON input is to be interpreted by DataBrew.

      

    
      - **MultiLine** *(boolean) --* 

        A value that specifies whether JSON input contains embedded newline characters.

        

      
    
    - **Excel** *(dict) --* 

      Options that define how Excel input is to be interpreted by DataBrew.

      

    
      - **SheetNames** *(list) --* 

        One or more named sheets in the Excel file that will be included in the dataset.

        

      
        - *(string) --* 

        
    
      - **SheetIndexes** *(list) --* 

        One or more sheet numbers in the Excel file that will be included in the dataset.

        

      
        - *(integer) --* 

        
    
      - **HeaderRow** *(boolean) --* 

        A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

        

      
    
    - **Csv** *(dict) --* 

      Options that define how CSV input is to be interpreted by DataBrew.

      

    
      - **Delimiter** *(string) --* 

        A single character that specifies the delimiter being used in the CSV file.

        

      
      - **HeaderRow** *(boolean) --* 

        A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
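
    The three option groups above correspond to the input formats that accept options. A small sketch, assuming that only the group matching ``Format`` is relevant to a given request; ``format_options_for`` is a hypothetical helper and its defaults are illustrative.

    ::

      def format_options_for(file_format, delimiter=',', header_row=True,
                             sheet_names=None, multi_line=False):
          """Build a FormatOptions dict for CSV, EXCEL, or JSON input."""
          if file_format == 'CSV':
              return {'Csv': {'Delimiter': delimiter, 'HeaderRow': header_row}}
          if file_format == 'EXCEL':
              excel = {'HeaderRow': header_row}
              if sheet_names:
                  excel['SheetNames'] = list(sheet_names)
              return {'Excel': excel}
          if file_format == 'JSON':
              return {'Json': {'MultiLine': multi_line}}
          # PARQUET and ORC have no entry in FormatOptions; the caller can
          # simply leave the parameter out when this returns an empty dict.
          return {}

      print(format_options_for('CSV', delimiter='|'))
      # {'Csv': {'Delimiter': '|', 'HeaderRow': True}}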

        

      
    
  
  :type Input: dict
  :param Input: **[REQUIRED]** 

    Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.

    

  
    - **S3InputDefinition** *(dict) --* 

      The Amazon S3 location where the data is stored.

      

    
      - **Bucket** *(string) --* **[REQUIRED]** 

        The Amazon S3 bucket name.

        

      
      - **Key** *(string) --* 

        The unique name of the object in the bucket.

        

      
      - **BucketOwner** *(string) --* 

        The Amazon Web Services account ID of the bucket owner.

        

      
    
    - **DataCatalogInputDefinition** *(dict) --* 

      The Glue Data Catalog parameters for the data.

      

    
      - **CatalogId** *(string) --* 

        The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.

        

      
      - **DatabaseName** *(string) --* **[REQUIRED]** 

        The name of a database in the Data Catalog.

        

      
      - **TableName** *(string) --* **[REQUIRED]** 

        The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

        

      
      - **TempDirectory** *(dict) --* 

        Represents an Amazon S3 location where DataBrew can store intermediate results.

        

      
        - **Bucket** *(string) --* **[REQUIRED]** 

          The Amazon S3 bucket name.

          

        
        - **Key** *(string) --* 

          The unique name of the object in the bucket.

          

        
        - **BucketOwner** *(string) --* 

          The Amazon Web Services account ID of the bucket owner.

          

        
      
    
    - **DatabaseInputDefinition** *(dict) --* 

      Connection information for dataset input files stored in a database.

      

    
      - **GlueConnectionName** *(string) --* **[REQUIRED]** 

        The Glue Connection that stores the connection information for the target database.

        

      
      - **DatabaseTableName** *(string) --* 

        The table within the target database.

        

      
      - **TempDirectory** *(dict) --* 

        Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.

        

      
        - **Bucket** *(string) --* **[REQUIRED]** 

          The Amazon S3 bucket name.

          

        
        - **Key** *(string) --* 

          The unique name of the object in the bucket.

          

        
        - **BucketOwner** *(string) --* 

          The Amazon Web Services account ID of the bucket owner.

          

        
      
      - **QueryString** *(string) --* 

        Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.

        

      
    
    - **Metadata** *(dict) --* 

      Contains additional resource information needed for specific datasets.

      

    
      - **SourceArn** *(string) --* 

        The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.
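
    The sketch below builds two alternative ``Input`` values from the fields described above: one for a Glue Data Catalog table and one for a custom SQL query run through a Glue connection. All names (helpers, database, table, connection, bucket) are hypothetical, and the assumption that a request carries one input definition at a time is not stated on this page.

    ::

      def catalog_input(database, table, catalog_id=None):
          """Input that points at a Glue Data Catalog table."""
          definition = {'DatabaseName': database, 'TableName': table}
          if catalog_id is not None:
              definition['CatalogId'] = catalog_id
          return {'DataCatalogInputDefinition': definition}

      def sql_input(connection, query, temp_bucket):
          """Input that runs custom SQL through a Glue connection."""
          return {
              'DatabaseInputDefinition': {
                  'GlueConnectionName': connection,  # required
                  'QueryString': query,
                  'TempDirectory': {'Bucket': temp_bucket},
              }
          }

      print(catalog_input('analytics', 'orders'))
      # {'DataCatalogInputDefinition': {'DatabaseName': 'analytics', 'TableName': 'orders'}}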

        

      
    
  
  :type PathOptions: dict
  :param PathOptions: 

    A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.

    

  
    - **LastModifiedDateCondition** *(dict) --* 

      If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.

      

    
      - **Expression** *(string) --* **[REQUIRED]** 

        An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

        

      
      - **ValuesMap** *(dict) --* **[REQUIRED]** 

        The map of substitution variable names to their values used in this filter expression.

        

      
        - *(string) --* 

        
          - *(string) --* 

          
    
  
    
    - **FilesLimit** *(dict) --* 

      If provided, this structure imposes a limit on the number of files that should be selected.

      

    
      - **MaxFiles** *(integer) --* **[REQUIRED]** 

        The number of Amazon S3 files to select.

        

      
      - **OrderedBy** *(string) --* 

        The criterion used to sort Amazon S3 files before they are selected. The default is LAST_MODIFIED_DATE, which is currently the only allowed value.

        

      
      - **Order** *(string) --* 

        The order in which Amazon S3 files are sorted before they are selected. The default is DESCENDING, so the most recent files are selected first. The other possible value is ASCENDING.

        

      
    
    - **Parameters** *(dict) --* 

      A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.

      

    
      - *(string) --* 

      
        - *(dict) --* 

          Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.

          

        
          - **Name** *(string) --* **[REQUIRED]** 

            The name of the parameter that is used in the dataset's Amazon S3 path.

            

          
          - **Type** *(string) --* **[REQUIRED]** 

            The type of the dataset parameter. Can be one of 'String', 'Number', or 'Datetime'.

            

          
          - **DatetimeOptions** *(dict) --* 

            Additional parameter options such as a format and a timezone. Required for datetime parameters.

            

          
            - **Format** *(string) --* **[REQUIRED]** 

              Required option that defines the datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters. All literal a-z or A-Z characters should be escaped with single quotes, for example "MM.dd.yyyy-'at'-HH:mm".

              

            
            - **TimezoneOffset** *(string) --* 

              Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.

              

            
            - **LocaleCode** *(string) --* 

              Optional value for a non-US locale code, needed for correct interpretation of some date formats.

              

            
          
          - **CreateColumn** *(boolean) --* 

            Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.

            

          
          - **Filter** *(dict) --* 

            The optional filter expression structure to apply additional matching criteria to the parameter.

            

          
            - **Expression** *(string) --* **[REQUIRED]** 

              An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

              

            
            - **ValuesMap** *(dict) --* **[REQUIRED]** 

              The map of substitution variable names to their values used in this filter expression.

              

            
              - *(string) --* 

              
                - *(string) --* 
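
    A sketch of a ``PathOptions`` value that combines a file limit with one path parameter. It reuses the ``starts_with`` condition from the expression example above. The helper name, the parameter name ``region``, and the assumption that ``ValuesMap`` keys repeat the leading ``:`` of the substitution variable are illustrative rather than taken from this page.

    ::

      def path_options(max_files, param_name, prefix):
          """Limit selection to the newest files and filter one String parameter."""
          return {
              'FilesLimit': {
                  'MaxFiles': max_files,
                  'OrderedBy': 'LAST_MODIFIED_DATE',
                  'Order': 'DESCENDING',  # most recent files first
              },
              'Parameters': {
                  param_name: {
                      'Name': param_name,
                      'Type': 'String',
                      'CreateColumn': True,
                      'Filter': {
                          'Expression': 'starts_with :prefix',
                          # Assumption: the key includes the leading colon.
                          'ValuesMap': {':prefix': prefix},
                      },
                  },
              },
          }

      options = path_options(10, 'region', 'eu-')
      print(options['Parameters']['region']['Filter']['ValuesMap'])
      # {':prefix': 'eu-'}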

                
          
        
          
        
  

  
  :type Tags: dict
  :param Tags: 

    Metadata tags to apply to this dataset.

    

  
    - *(string) --* 

    
      - *(string) --* 

      


  
  :rtype: dict
  :returns: 
    
    **Response Syntax**

    
    ::

      {
          'Name': 'string'
      }
      
    **Response Structure**

    

    - *(dict) --* 
      

      - **Name** *(string) --* 

        The name of the dataset that you created.

        
  
  **Exceptions**
  
  *   :py:class:`GlueDataBrew.Client.exceptions.AccessDeniedException`

  
  *   :py:class:`GlueDataBrew.Client.exceptions.ConflictException`

  
  *   :py:class:`GlueDataBrew.Client.exceptions.ServiceQuotaExceededException`

  
  *   :py:class:`GlueDataBrew.Client.exceptions.ValidationException`
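
  The exceptions above are exposed on the client as ``client.exceptions``. The sketch below shows one way to catch them around the call and return the created dataset's ``Name``. ``create_or_report`` is a hypothetical wrapper, and ``_StubClient`` is a stand-in that lets the example run without AWS credentials; it raises ``ConflictException`` on a repeated name purely for demonstration. A real client would come from ``boto3.client('databrew')``.

  ::

    def create_or_report(client, **request):
        """Return (name, None) on success or (None, exception class name)."""
        handled = (
            client.exceptions.AccessDeniedException,
            client.exceptions.ConflictException,
            client.exceptions.ServiceQuotaExceededException,
            client.exceptions.ValidationException,
        )
        try:
            response = client.create_dataset(**request)
        except handled as error:
            return None, type(error).__name__
        return response['Name'], None

    class _StubClient:
        """Offline stand-in that mimics only what the wrapper touches."""

        class exceptions:
            class AccessDeniedException(Exception):
                pass

            class ConflictException(Exception):
                pass

            class ServiceQuotaExceededException(Exception):
                pass

            class ValidationException(Exception):
                pass

        def __init__(self):
            self._names = set()

        def create_dataset(self, **request):
            # Demonstration-only behavior, not a claim about the service.
            if request['Name'] in self._names:
                raise self.exceptions.ConflictException(request['Name'])
            self._names.add(request['Name'])
            return {'Name': request['Name']}

    client = _StubClient()
    request = {'Name': 'orders',
               'Input': {'S3InputDefinition': {'Bucket': 'example-bucket'}}}
    print(create_or_report(client, **request))  # ('orders', None)
    print(create_or_report(client, **request))  # (None, 'ConflictException')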

  