Generating n-grams (unigrams, bigrams, etc.) and their frequencies from a large corpus of .txt files

I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written the code to load my files into the program.

The input is 300 .txt files written in English, and I want the output as n-grams together with, above all, their frequency counts.

I know that NLTK has bigram and trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html

but I am not advanced enough to work them into my program.

Input: txt files, NOT single sentences

Example output:

Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

My code so far is:

from nltk.corpus import PlaintextCorpusReader

corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4]+"_"+str(ngrams)+"_grams.txt").replace("/","_"))


for file in files.fileids():
    generate(file, ngrams)

Any help on what should be done next?

Source: author Arash

nltk python

Just use nltk.ngrams.

import nltk
from nltk.util import ngrams
from collections import Counter

# word_tokenize requires NLTK's tokenizer data to be installed (via nltk.download).
text = ("I need to write a program in NLTK that breaks a corpus (a large collection of "
        "txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. "
        "I need to write a program in NLTK that breaks a corpus")
token = nltk.word_tokenize(text)
# ngrams() returns a lazy iterator, so each result can be consumed only once.
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)
print(Counter(bigrams))
# Output (the order of entries may differ):
Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
(',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
('collection', 'of'): 1, ('files', ')'): 1})
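
Since the question asks for every order from unigrams to fivegrams, the same idea can be wrapped in a loop. The following is a minimal sketch rather than part of the original answer: `count_ngrams` is a hypothetical helper name, and `str.split` stands in for `nltk.word_tokenize` only so that the example runs without NLTK and its tokenizer data.

```python
from collections import Counter

def count_ngrams(tokens, max_n=5):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n (hypothetical helper)."""
    counts = {}
    for n in range(1, max_n + 1):
        # Zipping n shifted slices of the token list yields the n-grams.
        counts[n] = Counter(zip(*(tokens[i:] for i in range(n))))
    return counts

# str.split is a stand-in tokenizer (assumption); nltk.word_tokenize could be used instead.
tokens = "to be or not to be".split()
counts = count_ngrams(tokens)
print(counts[2].most_common(1))  # [(('to', 'be'), 2)]
```

Each `counts[n]` is an ordinary `Counter`, so `most_common()` gives the frequency ranking for that order.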

UPDATE (with pure Python):

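import os
from collections import Counter

import nltk
from nltk.util import ngrams

corpus = []
path = '.'
# next(os.walk(path))[2] is the list of file names directly inside path.
for name in next(os.walk(path))[2]:
    if name.endswith('.txt'):
        with open(os.path.join(path, name)) as f:
            corpus.append(f.read())

# Accumulate bigram counts over every file in the corpus.
frequencies = Counter()
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)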
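
To get from this loop to per-order frequency files like the ones the question hints at, one option is to keep a Counter per n and write each to its own file. This is a sketch under assumptions: the helper names (`corpus_ngram_counts`, `write_counts`), the tab-separated layout and the `<n>_grams.txt` file names are illustrative choices, and `str.split` replaces `nltk.word_tokenize` so that the example has no external dependencies.

```python
import tempfile
from collections import Counter
from pathlib import Path

def corpus_ngram_counts(folder, max_n=5):
    """Count n-grams (n = 1..max_n) over every .txt file directly inside folder."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for path in sorted(Path(folder).glob("*.txt")):
        # Stand-in tokenizer (assumption); nltk.word_tokenize could be used instead.
        tokens = path.read_text(encoding="utf-8").split()
        for n, counter in counts.items():
            # Each file is tokenized separately, so n-grams never span two files.
            counter.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def write_counts(counts, out_dir):
    """Write one tab-separated '<n>_grams.txt' file per order, most frequent first."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for n, counter in counts.items():
        lines = [f"{' '.join(gram)}\t{freq}" for gram, freq in counter.most_common()]
        (out_dir / f"{n}_grams.txt").write_text("\n".join(lines), encoding="utf-8")

# Small demonstration on two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "corpus"
    corpus_dir.mkdir()
    (corpus_dir / "a.txt").write_text("hi how are you", encoding="utf-8")
    (corpus_dir / "b.txt").write_text("how are you today", encoding="utf-8")
    counts = corpus_ngram_counts(corpus_dir, max_n=3)
    write_counts(counts, Path(tmp) / "out")
    bigram_report = (Path(tmp) / "out" / "2_grams.txt").read_text(encoding="utf-8")

print(bigram_report)
```

Writing the reports to a folder other than the corpus folder keeps them from being counted as input on the next run.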


OK, since you asked for an NLTK solution this may not be exactly what you were looking for, but have you considered TextBlob? It has an NLTK backend but a simpler syntax. It would look something like this:
```
from textblob import TextBlob
text = "Paste your text or text-containing variable here" 
blob = TextBlob(text)
ngram_var = blob.ngrams(n=3)
print(ngram_var)
# Output:
# [WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]
```
Of course, you would still need to use Counter or some other method to add a count per n-gram.
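
One detail matters when counting: TextBlob's `ngrams` returns list-like `WordList` objects, and lists are not hashable, so each n-gram has to be converted to a tuple before `Counter` can count it. The sketch below uses plain lists as a stand-in for the `WordList` output so that it runs without TextBlob installed; `count_list_ngrams` is a hypothetical helper name.

```python
from collections import Counter

def count_list_ngrams(ngram_lists):
    """Count n-grams given as lists (or list-like objects) by converting them to tuples."""
    return Counter(tuple(gram) for gram in ngram_lists)

# Plain lists standing in for the output of blob.ngrams(n=2) (assumption for this example).
ngram_lists = [["text", "or"], ["or", "text"], ["text", "or"]]
frequencies = count_list_ngrams(ngram_lists)
print(frequencies[("text", "or")])  # 2
```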

However, the fastest approach (by far) that I have been able to find for both creating any n-gram you like and counting it in a single function comes from this post from 2012 and uses itertools. It's great.

If efficiency is an issue and you have to build several kinds of n-grams, but you want to use pure Python, I would do:

from itertools import chain

def n_grams(tokens, n=1):
    """Returns an iterator over the n-grams given a list of tokens"""
    # shiftToken(i) lazily yields the tokens from position i onwards.
    shiftToken = lambda i: (el for j,el in enumerate(tokens) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)

def range_ngrams(tokens, ngramRange=(1,2)):
    """Returns an iterator over all n-grams for n in range(*ngramRange) given a list of tokens."""
    # The upper bound is exclusive, as in range(): (1,4) gives n = 1, 2, 3.
    return chain(*(n_grams(tokens, i) for i in range(*ngramRange)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,4)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

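About the same speed as NLTK (both calls return lazy iterators, so these timings mostly reflect building input_list rather than generating the n-grams):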

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
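
Because generators are lazy, a benchmark only measures n-gram generation if the iterator is actually consumed. Here is a hedged sketch using the standard `timeit` module on a real token list. `n_grams` follows the definition given earlier, `consume` is a hypothetical helper, and the input size and repeat count are arbitrary choices. No timings are shown because they depend on the machine.

```python
import timeit
from collections import deque

def n_grams(tokens, n=1):
    """Same approach as the n_grams function above: lazily zip n shifted views of tokens."""
    shift_token = lambda i: (el for j, el in enumerate(tokens) if j >= i)
    return zip(*(shift_token(i) for i in range(n)))

def consume(iterator):
    """Exhaust an iterator without storing its items (hypothetical helper)."""
    deque(iterator, maxlen=0)

tokens = "test the ngrams iterator vs nltk".split() * 10**4  # 60,000 tokens

# Creating the iterator does almost no work; consuming it does all of it.
lazy = timeit.timeit(lambda: n_grams(tokens, n=5), number=10)
full = timeit.timeit(lambda: consume(n_grams(tokens, n=5)), number=10)
print(f"create only: {lazy:.6f} s, create and consume: {full:.6f} s")
```

The same `consume` wrapper can be put around `nltk.ngrams(tokens, 5)` to compare the two implementations on equal terms.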

Reposted from my previous answer.

Here is a simple example of using pure Python to generate n-grams:

>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'
>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]
>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]
>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest', 'languages']]
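
The slicing version above needs the whole token sequence in memory and an input that supports slicing. For a large corpus, a sliding window over any iterable (for example tokens read lazily from a file) is an alternative. This is a sketch using `collections.deque`; `sliding_ngrams` is a hypothetical name, not something from the answer.

```python
from collections import deque

def sliding_ngrams(tokens, n=2):
    """Yield n-gram tuples from any iterable, keeping only n tokens in memory."""
    window = deque(maxlen=n)
    for token in tokens:
        window.append(token)  # once the window is full, the oldest token drops out
        if len(window) == n:
            yield tuple(window)

# A generator as input: slicing would fail here, the sliding window does not.
stream = (word for word in "Python is one of the awesomest languages".split())
print(list(sliding_ngrams(stream, n=3))[:2])
# [('Python', 'is', 'one'), ('is', 'one', 'of')]
```

Because it yields tuples rather than lists, the output can be passed straight to `collections.Counter`.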

Maybe this helps; see the link.

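import spacy

nlp_en = spacy.load("en_core_web_sm")
doc = nlp_en("Hi, how are you?")  # placeholder sentence; substitute your own text
tokens = [x.text for x in doc]  # token strings, ready to pass to an n-gram function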
