Skip to main content
Warning: You are using the test version of PyPI. This is a pre-production deployment of Warehouse. Changes made here affect the production instance of TestPyPI (testpypi.python.org).
Help us improve Python packaging - Donate today!

A wrapper for jieba segmentation

Project Description
Segmentation wrapper of Jieba Chinese segmentation.
  • Lazy initialization.
  • Initialization with user defined dict.
  • Build-in stop-words dict, punctuations dict.
  • Output control of stopwords, punctuations, minimum word length, output delimiters etc..
  • Support ngram.

API

init(stopwords_file, puncs_file, user_dict, silent, main_dict)
– Initialize the segmentation utility instance.
  • return: void.
  • stopwords_file: stopword dictionary. Use “” to disable. [SegJb.DEFAULT_STPW]
  • puncs_file: punctuation dictionary. Use “” to disable. [SegJb.DEFAULT_PUNC]
  • user_dict: load user customized dictionary. Use “” to disable. [SegJb.DEFAULT_DICT]
  • silent: whether print initializing log. [True]

set_param(delim, min_word_len, ngram, keep_stopwords, keep_puncs)
– Set one or more parameters of the segmentation utility instance. Refer to parameter description.
  • return: void

cut2list(corp)
– Cut a sentence to list due to configuration.
  • return: list
  • corp: unicode or utf8 sentence.

cut2str(corp)
– Cut a sentence to a delimeter(can be set by set_param) joined string.
  • return: unicode string.
  • corp: unicode or utf8 sentence.

Parameters

  • delim [‘ ‘]
    the delimeter used to constuct the segmentation result in string.
  • min_word_len [1]
    word with length less than min_word_len will not in segmentation result.
  • ngram [1]
    result can be ngram.
  • keep_stopwords [True]
    whether to keep stopwords in result.
  • keep_puncs [True]
    whether to keep stopwords in result.

Example:

from segjb import SegJb
hdl_seg = Segjb()
hdl_seg.init(stopwords_file=SegJb.DEFAULT_STPW,
             puncs_file=SegJb.DEFAULT_PUNC,
             user_dict=SegJb.DEFAULT_DICT)
hdl_seg.set_param(delim=' ', ngram=2, keep_stopwords=True, keep_puncs=False)
print hdl_seg.cut2str('这是一场精彩的比赛')
Release History

Release History

This version
History Node

0.1.3

History Node

0.1

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
segjb-0.1.3.tar.gz (5.9 MB) Copy SHA256 Checksum SHA256 Source Dec 8, 2016

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting