spaCy - v3.3.0


✨ New features and improvements

📦 Trained pipelines

v3.3 introduces trained pipelines for Finnish, Korean and Swedish, which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

| Package | Language | UPOS | Parser LAS | NER F |
| --------------------------------------------------------------- | -------- | ---: | ---------: | ----: |
| fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
| fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
| fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
| ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
| ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
| ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
| sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
| sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
| sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
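
As a rough illustration of why hashed subword vectors have no out-of-vocabulary words, the sketch below maps every character n-gram of a word to rows in a fixed-size table. It is not floret's implementation: the function names, the default values for minn, maxn, bucket and hash_count, and the use of hashlib in place of floret's own hash function are all assumptions made for the example.

```python
import hashlib


def char_ngrams(word, minn=3, maxn=5):
    """Return the boundary-marked word plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def rows_for(word, bucket=50000, hash_count=2, minn=3, maxn=5):
    """Map each subword to `hash_count` rows of a table with `bucket` rows.

    Hypothetical stand-in: blake2b with a per-seed salt replaces the
    hash function a real implementation would use.
    """
    rows = []
    for gram in char_ngrams(word, minn, maxn):
        for seed in range(hash_count):
            digest = hashlib.blake2b(
                gram.encode("utf8"), digest_size=8, salt=bytes([seed])
            ).digest()
            rows.append(int.from_bytes(digest, "little") % bucket)
    return rows


# Any string, seen in training or not, lands on valid table rows.
print(rows_for("kissa")[:4])
print(rows_for("zzzxqv")[:4])
```

Because every string yields row indices inside the table, a vector can always be assembled from those rows, and the table size stays fixed regardless of vocabulary size.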

🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
| ----------------------------------------------- | -------------: | -------------: |
| da_core_news_md | 84.9 | 94.8 |
| de_core_news_md | 73.4 | 97.7 |
| el_core_news_md | 56.5 | 88.9 |
| fi_core_news_md | - | 86.2 |
| it_core_news_md | 86.6 | 97.2 |
| ko_core_news_md | - | 90.0 |
| lt_core_news_md | 71.1 | 84.8 |
| nb_core_news_md | 76.7 | 97.1 |
| nl_core_news_md | 81.5 | 94.0 |
| pl_core_news_md | 87.1 | 93.7 |
| pt_core_news_md | 76.7 | 96.9 |
| ro_core_news_md | 81.8 | 95.5 |
| sv_core_news_md | - | 95.5 |
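
To read the table as gains rather than absolute scores, the following short sketch computes the change in lemma accuracy (in percentage points) for the pipelines that have a v3.2 figure. The numbers are copied from the table above; the variable names are only illustrative.

```python
# (v3.2, v3.3) lemma accuracy from the table; pipelines new in v3.3 are omitted.
lemma_acc = {
    "da_core_news_md": (84.9, 94.8),
    "de_core_news_md": (73.4, 97.7),
    "el_core_news_md": (56.5, 88.9),
    "it_core_news_md": (86.6, 97.2),
    "lt_core_news_md": (71.1, 84.8),
    "nb_core_news_md": (76.7, 97.1),
    "nl_core_news_md": (81.5, 94.0),
    "pl_core_news_md": (87.1, 93.7),
    "pt_core_news_md": (76.7, 96.9),
    "ro_core_news_md": (81.8, 95.5),
}

# Gain in percentage points, rounded to one decimal to avoid float noise.
gains = {name: round(new - old, 1) for name, (old, new) in lemma_acc.items()}

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{gain:.1f}")
```

The largest gain is for Greek (+32.4 points) and the smallest for Polish (+6.6 points).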

🔴 Bug fixes

  • Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
  • Fix issue #9443: Fix Scorer.score_cats for missing labels.
  • Fix issue #9669: Fix entity linker batching.
  • Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
  • Fix issue #9904: Fix textcat loss scaling.
  • Fix issue #9956: Compare all Span attributes consistently.
  • Fix issue #10073: Add "spans" to the output of doc.to_json.
  • Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
  • Fix issue #10189: Allow Example to align whitespace annotation.
  • Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
  • Fix issue #10324: Fix Tok2Vec for empty batches.
  • Fix issue #10347: Update basic functionality for rehearse.
  • Fix issue #10394: Fix Vectors.n_keys for floret vectors.
  • Fix issue #10400: Use meta in util.load_model_from_config.
  • Fix issue #10451: Fix Example.get_matching_ents.
  • Fix issue #10460: Fix initial special cases for Tokenizer.explain.
  • Fix issue #10521: Stream large assets on download in spaCy projects.
  • Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
  • Fix issue #10551: Add automatic vector deduplication for init vectors.

🚀 Notes about upgrading from v3.2

  • To see the speed improvements for the Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
  • Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
  • Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
  • Doc.from_docs now includes Doc.tensor by default and accepts an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
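
The effect of the Span ordering change (#9956) can be pictured with a pure-Python stand-in. This is a sketch, not spaCy code: SpanKey is a hypothetical class, and the field order simply follows the order in which the note lists the attributes. Labels and KB IDs are shown as plain integers on the assumption that spaCy compares their stored hash values, so the order among labels need not be alphabetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class SpanKey:
    """Hypothetical stand-in: ordering compares fields as a tuple, in this order."""
    start: int
    end: int
    label: int  # assumed to be a hash value, not the label string
    kb_id: int


spans = [
    SpanKey(0, 2, label=7, kb_id=0),
    SpanKey(0, 2, label=3, kb_id=0),
    SpanKey(0, 1, label=9, kb_id=0),
]

# Spans with identical offsets are now separated by label, then KB ID.
ordered = sorted(spans)
for span in ordered:
    print(span)
```

If code relies on a particular order for spans that share the same offsets, sorting with an explicit key (for example on start and end only) keeps the result independent of this change.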

📖 Documentation and examples

👥 Contributors

@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996


Details

date: April 29, 2022, 7:49 a.m.
name: v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish
type: Minor