text_to_word_sequence

keras.preprocessing.text.text_to_word_sequence(text,
                                               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                               lower=True,
                                               split=" ")

Splits a sentence into a list of words.

  • Return: List of words (str).

  • Arguments:

    • text: str.
    • filters: string of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', which includes basic punctuation, tabs, and newlines.
    • lower: boolean. Whether to set the text to lowercase.
    • split: str. Separator for word splitting.
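
A minimal usage sketch (output shown as a comment):

from keras.preprocessing.text import text_to_word_sequence

sentence = 'The quick brown fox jumps over the lazy dog.'
words = text_to_word_sequence(sentence)   # lowercases and strips punctuation by default
print(words)
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']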

one_hot

keras.preprocessing.text.one_hot(text,
                                 n,
                                 filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                 lower=True,
                                 split=" ")

One-hot encodes a text into a list of word indexes in a vocabulary of size n.

This is a wrapper around the hashing_trick function, using Python's built-in hash as the hashing function.

  • Return: List of integers in [1, n]. Each integer encodes a word (uniqueness is not guaranteed, since distinct words may hash to the same index).

  • Arguments:

    • text: str.
    • n: int. Size of vocabulary.
    • filters: string of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', which includes basic punctuation, tabs, and newlines.
    • lower: boolean. Whether to set the text to lowercase.
    • split: str. Separator for word splitting.
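
A minimal sketch; because the default hashing function is Python's hash, the exact index values vary between runs (the output comment is illustrative):

from keras.preprocessing.text import one_hot

text = 'the cat sat on the mat'
encoded = one_hot(text, n=50)   # indexes fall in [1, 50]
print(encoded)
# e.g. [29, 7, 43, 12, 29, 31] -- repeated words ('the') always map to
# the same index within a run, but distinct words may collide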

hashing_trick

keras.preprocessing.text.hashing_trick(text, 
                                       n,
                                       hash_function=None,
                                       filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                       lower=True,
                                       split=' ')

Converts a text to a sequence of word indexes in a fixed-size hashing space.

  • Return: A list of integer word indexes (uniqueness is not guaranteed due to possible hash collisions).
  • Arguments:
    • text: str.
    • n: Dimension of the hashing space.
    • hash_function: defaults to the Python hash function; can be 'md5' or any function that takes a string as input and returns an int. Note that hash is not a stable hashing function, so it is not consistent across different runs, while 'md5' is stable.
    • filters: string of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', which includes basic punctuation, tabs, and newlines.
    • lower: boolean. Whether to set the text to lowercase.
    • split: str. Separator for word splitting.
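
A minimal sketch using the stable 'md5' hash function so the mapping is reproducible across runs (the index values in the comment are illustrative):

from keras.preprocessing.text import hashing_trick

text = 'the cat sat on the mat'
encoded = hashing_trick(text, n=50, hash_function='md5')
print(encoded)
# e.g. [35, 10, 22, 5, 35, 18] -- stable across runs, though distinct
# words can still collide within the n-dimensional hashing space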

Tokenizer

keras.preprocessing.text.Tokenizer(num_words=None,
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                   lower=True,
                                   split=" ",
                                   char_level=False)

Class for vectorizing texts and/or turning texts into sequences (lists of word indexes, in which the word of rank i in the dataset, starting at 1, has index i). A usage sketch follows the method and attribute lists below.

  • Arguments: same as text_to_word_sequence above, plus:

    • num_words: None or int. Maximum number of words to work with (if set, tokenization will be restricted to the top num_words most common words in the dataset).
    • char_level: if True, every character will be treated as a token.
  • Methods:

    • fit_on_texts(texts):

      • Arguments:
        • texts: list of texts to train on.
    • texts_to_sequences(texts)

      • Arguments:
        • texts: list of texts to turn to sequences.
      • Return: list of sequences (one per text input).
    • texts_to_sequences_generator(texts): generator version of the above.

      • Return: yields one sequence per input text.
    • texts_to_matrix(texts):

      • Return: numpy array of shape (len(texts), num_words).
      • Arguments:
        • texts: list of texts to vectorize.
        • mode: one of "binary", "count", "tfidf", "freq" (default: "binary").
    • fit_on_sequences(sequences):

      • Arguments:
        • sequences: list of sequences to train on.
    • sequences_to_matrix(sequences):

      • Return: numpy array of shape (len(sequences), num_words).
      • Arguments:
        • sequences: list of sequences to vectorize.
        • mode: one of "binary", "count", "tfidf", "freq" (default: "binary").
  • Attributes:

    • word_counts: dictionary mapping words (str) to the number of times they appeared in the corpus during fit. Only set after fit_on_texts was called.
    • word_docs: dictionary mapping words (str) to the number of documents/texts they appeared in during fit. Only set after fit_on_texts was called.
    • word_index: dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.
    • document_count: int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
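
A short end-to-end sketch of the typical Tokenizer workflow (printed values are illustrative):

from keras.preprocessing.text import Tokenizer

texts = ['the cat sat on the mat',
         'the dog ate my homework']

tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(texts)            # build the vocabulary

print(tokenizer.word_index)              # e.g. {'the': 1, 'cat': 2, ...}
print(tokenizer.document_count)          # 2

sequences = tokenizer.texts_to_sequences(texts)
print(sequences)                         # one list of word indexes per text

matrix = tokenizer.texts_to_matrix(texts, mode='binary')
print(matrix.shape)                      # (2, 10), i.e. (len(texts), num_words)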