spaCy - v3.4.0


✨ New features and improvements

  • Support for mypy 0.950+ and pydantic v1.9 (#10786).
  • Prebuilt Linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
  • Min/max {n,m} operator for Matcher patterns (#10981).
  • Language updates:
    • Improve tokenization for Cyrillic combining diacritics (#10837).
    • Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
  • Improved speed of vector lookups (#10992).
  • For the parser, use C saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
  • Improved speed of Example.get_aligned_parse and Example.get_aligned (#10952).
  • Improved speed of StringStore lookups (#10938).
  • Updated spacy project clone to try both main and master branches by default (#10843).
  • Added confidence threshold for named entity linker (#11016).
  • Improved handling of Typer optional default values for init_config_cli (#10788).
  • Added cycle detection in parser projectivization methods (#10877).
  • Added counts for NER labels in debug data (#10960).
  • Support for adding NVTX ranges to TrainablePipe components (#10965).
  • Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).
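
As an illustration of the new min/max operator for Matcher patterns, here is a minimal sketch, assuming spaCy v3.4 or later is installed. The pattern name `INTENSIFIED_GOOD` and the example sentence are hypothetical and chosen only for demonstration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "{2,3}" asks for at least 2 and at most 3 repetitions of this token pattern.
# "INTENSIFIED_GOOD" is a hypothetical pattern name used for illustration.
pattern = [{"LOWER": "very", "OP": "{2,3}"}, {"LOWER": "good"}]
matcher.add("INTENSIFIED_GOOD", [pattern])

doc = nlp("This is very very very very good")

# Without a greedy setting the matcher reports every span that satisfies
# the pattern, so overlapping candidates are all returned.
matched_texts = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(matched_texts)
```

If only the longest candidate is wanted, passing `greedy="LONGEST"` to `matcher.add` should filter out the overlapping shorter match.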

📦 Trained pipelines updates

We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.

| Package | UPOS | Parser LAS | NER F |
| ----------------------------------------------- | ---: | ---------: | ----: |
| hr_core_news_sm | 96.6 | 77.5 | 76.1 |
| hr_core_news_md | 97.3 | 80.1 | 81.8 |
| hr_core_news_lg | 97.5 | 80.4 | 83.0 |

🙏 Special thanks to @gtoffoli for help with the new pipelines!

The English pipelines have new word vectors:

| Package | Model Version | TAG | Parser LAS | NER F |
| ----------------------------------------------- | ------------- | ---: | ---------: | ----: |
| en_core_web_md | v3.3.0 | 97.3 | 90.1 | 84.6 |
| en_core_web_md | v3.4.0 | 97.2 | 90.3 | 85.5 |
| en_core_web_lg | v3.3.0 | 97.4 | 90.1 | 85.3 |
| en_core_web_lg | v3.4.0 | 97.3 | 90.2 | 85.6 |

All CNN pipelines have been extended to add whitespace augmentation.

🔴 Bug fixes

  • Fix issue #10960: Support hyphens in NER labels.
  • Fix issue #10994: Fix horizontal spacing for spans in displaCy.
  • Fix issue #11013: Check for any token with a vector in Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
  • Fix issue #11056: Don't use get_array_module in textcat.
  • Fix issue #11092: Fix vertical alignment for spans in displaCy.

🚀 Notes about upgrading from v3.3

  • Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
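
The change in Doc.has_vector can be sketched as follows, assuming spaCy v3.4 or later and a blank pipeline with a single hand-made vector. The 4-dimensional vector for "apple" is a made-up value for illustration, not taken from any trained pipeline.

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Register one toy vector so that the vocab contains vectors at all.
# The values and the width of 4 are arbitrary, for illustration only.
nlp.vocab.set_vector("apple", numpy.ones((4,), dtype="float32"))

doc_with = nlp("apple pie")      # contains a token that has a vector
doc_without = nlp("pie crust")   # no token has a vector

# In v3.4+ the result depends on the tokens in the doc,
# not only on whether the vocab holds any vectors.
print(doc_with.has_vector, doc_without.has_vector)
```

Under v3.3 both docs would be expected to report a vector, because the vocab is non-empty. Code that used Doc.has_vector as a "does this pipeline ship vectors" check may need to be revisited after upgrading.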

📖 Documentation and examples

  • spaCy universe additions:
    • Aim-spacy: An Aim-based spaCy experiment tracker.
    • Asent: Fast, flexible and transparent sentiment analysis.
    • spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
    • spacy-report: Generates interactive reports for spaCy models.

👥 Contributors

@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere


Details

Date: July 12, 2022, 6:16 a.m.
Name: v3.4.0: Updated types, speed improvements and pipelines for Croatian
Type: Minor